Application Knows Just the Right Word

Unstructured data management is a very broad term. Most applications involve written texts, but the field also includes sound, images, maps, video and instrumentation streams. Major text management functions include classification (assigning texts to categories), search (finding documents related to specified topics), extraction (identifying facts within documents) and profiling (identifying people's interests).

No one product performs all the possible functions, though a few major vendors try. Understanding this context is important when assessing unstructured data management software. Most products perform just one major function and often are limited to an even narrower sub-specialty. This means any assessment must consider both how well a product performs its intended function and how easily it fits into a complete solution.

WordStat (Provalis Research, 514/899-1672) is impressive software, but has its limits. Mostly, it does something that seems simple: identify words within a document. This is a fundamental prerequisite for more advanced activities such as classification and search. And it turns out to be not so simple after all.

The challenge is twofold. First, one word can take many forms. In English, verbs change based on subject and tense; nouns have singular and plural forms; adjectives and adverbs apply suffixes to common roots. Other languages can be even more complicated.

The practical issue is that text analysis works poorly unless these variations are removed. For example, searching simply for the word "Canada" would not return documents with the word "Canadian," though these probably would be relevant to a request. Linguists have developed standard techniques to handle these issues, giving them cool names like stemming and lemmatization (conversion to a root form, or lemma). WordStat performs these transformations automatically.
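To make the "Canada" versus "Canadian" problem concrete, here is a minimal Python sketch of suffix stripping. The suffix list and the function names are assumptions chosen for this illustration; they are not WordStat's actual rules, and a toy like this will over-stem many real words.

```python
# Toy suffix-stripping stemmer. The suffix list is an illustrative
# assumption, not WordStat's actual rule set.
SUFFIXES = ("ians", "ian", "ing", "ed", "es", "s", "a")


def stem(word: str) -> str:
    """Lowercase a word and strip the first matching suffix."""
    w = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words survive intact.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def search(query: str, documents: list[str]) -> list[str]:
    """Return documents containing any word that shares the query's stem."""
    target = stem(query)
    hits = []
    for doc in documents:
        words = [w.strip(".,;:!?\"'") for w in doc.split()]
        if any(stem(w) == target for w in words):
            hits.append(doc)
    return hits
```

With this sketch, a search for "Canada" also returns a document that mentions only "Canadian," because both words reduce to the same stem.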

The second challenge is that different words have related meanings. "Angry," "mad" and "furious" share a root meaning of "annoyance." "Dog," "cat" and "hamster" are types of mammals as well as pets. Meaningful text analysis needs awareness of these relationships. This can be done only through dictionaries that link words to concepts. WordStat includes several public domain dictionaries and thesauri plus tools to customize these with a user's own vocabulary.
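The idea of a dictionary that links words to concepts can be sketched in a few lines of Python. The two categories below reuse the examples from the text; the data layout and function name are assumptions for illustration, not WordStat's dictionary format.

```python
from collections import Counter

# Hypothetical categorization dictionary: each concept lists its words.
CONCEPTS = {
    "annoyance": {"angry", "mad", "furious"},
    "pet": {"dog", "cat", "hamster"},
}

# Invert to word -> concept for fast lookup.
WORD_TO_CONCEPT = {
    word: concept for concept, words in CONCEPTS.items() for word in words
}


def concept_counts(text: str) -> Counter:
    """Count how often each concept appears in a text."""
    counts = Counter()
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'")
        concept = WORD_TO_CONCEPT.get(word)
        if concept is not None:
            counts[concept] += 1
    return counts
```

Counting concepts rather than raw words lets "mad" and "furious" add to the same total, which is the relationship the analysis needs to see.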

Though WordStat includes impressively advanced analytical functions, its dictionary-building features are arguably the most important. Good dictionaries are the foundation of most text analysis, and building and maintaining dictionaries can be the largest part of a text analysis project.

WordStat makes this about as efficient as possible, with a graphical user interface that lets users assign words to categories, build hierarchical category structures, distinguish among different senses for the same word, use fragments and wildcards to remove variations, define phrases to treat as a single word, specify words to include or exclude from an analysis and identify frequently or infrequently used words as candidates for special attention.
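Three of those dictionary features (wildcards, phrases and exclusion lists) can be approximated with Python's standard library. The patterns, phrases and category names below are hypothetical, and the matching rules are an assumption; WordStat's own behavior may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical dictionary entries: wildcard pattern -> category.
PATTERNS = {"canad*": "CANADA", "tax*": "TAXATION"}
# Hypothetical phrases that should be treated as a single word.
PHRASES = ["prime minister"]
# Hypothetical exclusion list of words to ignore.
EXCLUDE = {"the", "a", "of"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text, join known phrases, and split into tokens."""
    text = text.lower()
    for phrase in PHRASES:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return [t.strip(".,;:!?\"'") for t in text.split()]


def categorize(text: str) -> list[str]:
    """Replace each token with its category when a wildcard matches."""
    result = []
    for token in tokenize(text):
        if not token or token in EXCLUDE:
            continue
        for pattern, category in PATTERNS.items():
            if fnmatchcase(token, pattern):
                result.append(category)
                break
        else:
            # No pattern matched, so keep the token as it is.
            result.append(token)
    return result
```

Note that a pattern such as "tax*" would also catch "taxi," which is one reason wildcard entries need to be checked against real text.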

A particularly helpful feature called "keyword in context" can display the text surrounding each occurrence of a specified word so users can see how the word is being used.
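A keyword-in-context listing is simple to sketch. The function below is an illustrative assumption about how such a display could be produced, with a hypothetical `width` parameter for the number of surrounding words; it is not WordStat's implementation.

```python
def kwic(text: str, keyword: str, width: int = 3) -> list[str]:
    """List each occurrence of keyword with `width` words on either side."""
    tokens = text.split()
    lines = []
    for i, token in enumerate(tokens):
        # Compare without punctuation or case so "Mad," still matches "mad".
        if token.strip(".,;:!?\"'").lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width): i])
            right = " ".join(tokens[i + 1: i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines
```

Run on a text that uses "mad" both for anger and for haste, the listing puts the two senses side by side so a user can decide whether they belong in the same category.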

Dictionaries are built and applied to either single documents or sets of cases. These can be imported from spreadsheets, text files or several word processing formats. A case can have multiple data elements including text, numeric and categorical variables. Users can view, edit and code text from within the system.

WordStat's analytical capabilities build on its core function of word identification. The simplest analysis is a frequency report, which shows how often each word or concept occurs. Inclusion and exclusion dictionaries can limit the analysis to words of interest. At the next level of complexity are matrix reports, which can count the number of cases containing each word, the number of cases containing different pairs of words or the frequency of each word in cases with other characteristics.
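The frequency report and the first two matrix reports reduce to counting, as this Python sketch shows. The function names and the plain-string case format are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def tokenize(text: str) -> list[str]:
    """Lowercase a text and split it into punctuation-free tokens."""
    return [t.strip(".,;:!?\"'") for t in text.lower().split()]


def frequency_report(cases: list[str], include: set[str]) -> Counter:
    """Count every occurrence of each word in the inclusion dictionary."""
    counts = Counter()
    for case in cases:
        counts.update(w for w in tokenize(case) if w in include)
    return counts


def case_counts(cases: list[str], include: set[str]) -> Counter:
    """Count the number of cases that contain each word at least once."""
    counts = Counter()
    for case in cases:
        counts.update(set(tokenize(case)) & include)
    return counts


def pair_counts(cases: list[str], include: set[str]) -> Counter:
    """Count the number of cases in which each pair of words co-occurs."""
    counts = Counter()
    for case in cases:
        present = sorted(set(tokenize(case)) & include)
        counts.update(combinations(present, 2))
    return counts
```

The difference between the first two functions matters in practice: a word repeated ten times in one case scores ten in the frequency report but only one in the case count.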

This last type of report, unusual for a text analysis system, can show how word frequencies vary among authors or for the same author over time. Results can be displayed as counts or statistical measures and can be viewed in tables or several types of graphs.
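A report of word frequency by author amounts to a cross-tabulation of words against a categorical variable. In the sketch below, the `author` and `text` field names are hypothetical, as is the dictionary-of-dictionaries output.

```python
from collections import Counter, defaultdict


def crosstab(cases: list[dict], words: set[str]) -> dict[str, dict[str, int]]:
    """Tabulate how often each word of interest appears, per author."""
    table = defaultdict(Counter)
    for case in cases:
        for token in case["text"].lower().split():
            word = token.strip(".,;:!?\"'")
            if word in words:
                table[case["author"]][word] += 1
    # Convert to plain dicts for easy display or export.
    return {author: dict(counts) for author, counts in table.items()}
```

Replacing the author field with a date field would give the same table over time rather than across writers.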

WordStat provides even more sophisticated analyses, including clustering to identify similar words or cases; proximity maps to show distance between one word or concept and others; concept maps to show relative positions of many words; and heat maps to illustrate relationships between words and independent variables.
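Clustering and proximity maps both start from a measure of how similar two words are. One common choice is the Jaccard coefficient, computed here over the cases in which each word appears. Treating this as the measure is an assumption for illustration; the article does not say which measures WordStat uses.

```python
def jaccard(word_a: str, word_b: str, cases: list[set[str]]) -> float:
    """Share of cases containing either word that contain both words."""
    has_a = {i for i, case in enumerate(cases) if word_a in case}
    has_b = {i for i, case in enumerate(cases) if word_b in case}
    union = has_a | has_b
    if not union:
        return 0.0  # Neither word occurs, so there is nothing to compare.
    return len(has_a & has_b) / len(union)


def proximity(target: str, others: list[str], cases: list[set[str]]):
    """Rank other words by similarity to the target, closest first."""
    scores = [(word, jaccard(target, word, cases)) for word in others]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A proximity map plots a ranking like this one as distances from a single word, while a clustering routine would merge the closest pairs first.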

Clustering could help extend the dictionary by identifying likely categories for new words, but is not reliable enough to be a fully automated solution. Clustering also could provide a limited form of document categorization. More precise categorization, based on training with previously categorized cases, is planned for the next release, by year's end.

What's the catch? Certainly not price: WordStat is an astonishing bargain at $595. This includes a copy of SimStat, a robust, general-purpose statistical package that WordStat uses as a base. Nor is it scalability: The software runs on Windows workstations and has been tested with up to several hundred thousand cases. And the system is relatively mature, having sold 300 copies since its introduction in 1998. SimStat is even better established, with sales of nearly 4,000 copies since 1989.

The problem is integration. WordStat dictionaries are stored in a specially formatted text file that is viewable, but not designed for external access. Nor can the system automatically load a document or set of cases, identify the words and concepts, and output the results. At best, the user could import the data, run a word frequency report and save the output into a database. This is adequate for periodic research projects, but not for production tasks such as automatic routing of e-mail inquiries.

Happily, Provalis is addressing both these issues. The next release will be able to export dictionaries in an XML format, making them easily readable by other systems. It also will allow automated processing of individual documents or case files. Once these capabilities are added, WordStat may transform from an impressive but isolated text analysis tool into a valuable part of a true production system.
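To show why an XML export matters for integration, here is a sketch of a dictionary written to XML and read back by another program. The element and attribute names (`dictionary`, `category`, `word`, `name`) are invented for this example; the actual format of the planned WordStat export is not described here.

```python
import xml.etree.ElementTree as ET


def dictionary_to_xml(concepts: dict[str, list[str]]) -> str:
    """Serialize a concept dictionary using a hypothetical XML layout."""
    root = ET.Element("dictionary")
    for name in sorted(concepts):
        category = ET.SubElement(root, "category", name=name)
        for word in sorted(concepts[name]):
            ET.SubElement(category, "word").text = word
    return ET.tostring(root, encoding="unicode")


def xml_to_dictionary(xml_text: str) -> dict[str, list[str]]:
    """Read the same hypothetical layout back into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        category.get("name"): [w.text for w in category.findall("word")]
        for category in root.findall("category")
    }
```

Any system with an XML parser could consume such a file, which is the point of moving away from a specially formatted text file.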
