Data warehouse experts used to joke about “write-only” databases – systems that were useless because it was impossible to access their contents. (OK, it isn’t much of a joke – there’s a reason these people are technologists, not comedians.) Happily, the data warehouse industry has evolved tools and techniques to overcome most of its data access problems.
But data warehouses work with highly structured data, stored in the records and fields of conventional files or the rows and columns of relational databases. The world also contains huge amounts of unstructured text in word processing files, e-mail messages, Web sites, spreadsheets and presentations. Accessing this type of data poses a different set of challenges. (Non-text data, such as sound and image files, is yet another issue.)
The central challenge in managing text data is applying structure. Structure can be applied to documents themselves, by assigning them to categories: for example, news articles mentioning Bill Gates. It can also be applied to information within each document, by extracting specific facts: for example, that Bill Gates is married to Melinda Gates.
Document classification generally uses statistical techniques to identify the characteristics of documents in each category; new documents are then classified by measuring how closely they match these characteristics. Extraction systems typically apply semantic analysis, which uses sentence structures and word definitions, to identify specific information.
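To make the extraction idea concrete, here is a toy sketch in Python. A real semantic system parses sentence structure and word meanings; a single regular expression is a drastic simplification, but it shows the essential move of pulling a specific fact out of free text. The pattern and function names are illustrative, not from any vendor's product.

```python
import re

# Toy fact extractor: matches "<Person> is married to <Person>" where each
# name is two capitalized words. Real extraction systems use full semantic
# analysis rather than surface patterns like this.
MARRIED_TO = re.compile(
    r"([A-Z][a-z]+ [A-Z][a-z]+) is married to ([A-Z][a-z]+ [A-Z][a-z]+)"
)

def extract_marriages(text):
    """Return (person, spouse) pairs found in the text."""
    return MARRIED_TO.findall(text)

print(extract_marriages("Bill Gates is married to Melinda Gates."))
# [('Bill Gates', 'Melinda Gates')]
```

The output is structured data (a pair of names) that could be loaded into a conventional database, which is the point of extraction.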
Each method has advantages. Statistical techniques are fast, language-independent and can automatically identify new categories or concepts when documents do not fit an existing pattern. Semantic methods require preliminary effort to identify the rules and vocabulary of each language, but give more precise results.
Vendors tend to focus on one method or the other, and feel strongly that their choice is superior. From a user's perspective, it is more important to evaluate individual systems against specific requirements than to judge by the general technique alone.
Certainly both types of systems can perform document classification, which is the core capability of any text analysis system. Classification enables many applications: searching for and retrieving documents on a topic; picking the most suitable reply to an inquiry; generating or extracting summary information about a document; alerting users when new information appears on a topic of interest.
Both types of systems also can generate and manage taxonomies, which are structures that define relationships among the categories themselves. These make it easier for users to navigate a body of data by providing a map of its contents. Often more than one taxonomy applies: for example, a collection of business news articles might be classified independently by geography, industry, company, date and topic.
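A taxonomy can be sketched as a simple parent map, with each document carrying one label per independent taxonomy. The categories and structure below are invented for illustration; production taxonomies are far richer.

```python
# Two independent taxonomies over the same news collection, each a mapping
# from a category to its parent category. (Illustrative names only.)
geography = {"Seattle": "United States", "United States": "Americas"}
industry = {"Operating Systems": "Software", "Software": "Technology"}

def ancestors(taxonomy, category):
    """Walk up a taxonomy from a category toward its root."""
    path = []
    while category in taxonomy:
        category = taxonomy[category]
        path.append(category)
    return path

# One article, classified independently along each facet.
article_labels = {"geography": "Seattle", "industry": "Operating Systems"}
print(ancestors(geography, article_labels["geography"]))  # ['United States', 'Americas']
print(ancestors(industry, article_labels["industry"]))    # ['Software', 'Technology']
```

Because the taxonomies are independent, users can navigate to the same article along either path, which is what makes multiple classifications useful as a map of the collection.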
Most text analysis systems can generate taxonomies automatically, though results typically would be reviewed and refined by human experts. In practice, automated taxonomy generation is probably less common than starting with a prebuilt taxonomy that reflects established ways of viewing a topic.
Once a taxonomy is established, the classification mechanism is typically trained by providing examples of documents known to belong to each category. The system then identifies characteristics these documents have in common – that is, it develops models that predict whether a given document will belong to a particular category. These models are applied to new documents as they are submitted for categorization.
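The training workflow above can be sketched with a minimal naive-Bayes classifier: labeled example documents per category, word statistics as the "model," and scoring of a new document against each category. This is one common statistical approach, shown under simplifying assumptions; commercial systems are far more elaborate, and the example documents here are invented.

```python
import math
from collections import Counter

def train(examples):
    """examples: {category: [document strings]} -> per-category word statistics."""
    counts, totals = {}, {}
    for cat, docs in examples.items():
        words = Counter(w for d in docs for w in d.lower().split())
        counts[cat] = words
        totals[cat] = sum(words.values())
    return counts, totals

def classify(model, document):
    """Score a new document against each category; return the best match."""
    counts, totals = model
    vocab = {w for c in counts.values() for w in c}
    scores = {}
    for cat in counts:
        score = 0.0
        for w in document.lower().split():
            # Laplace smoothing keeps unseen words from zeroing out a category.
            score += math.log((counts[cat][w] + 1) / (totals[cat] + len(vocab)))
        scores[cat] = score
    return max(scores, key=scores.get)

model = train({
    "technology": ["software release update", "operating system software"],
    "finance": ["quarterly revenue earnings", "stock market revenue"],
})
print(classify(model, "new software update"))  # technology
```

The `train` step corresponds to identifying what the example documents have in common; `classify` corresponds to applying those models to newly submitted documents.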
Text analysis displays the typical characteristics of an emerging industry. There are dozens of small firms, none with a dominant market position and each arguing for the technical superiority of its approach. Product configurations vary widely, from suites with a broad range of capabilities to point solutions performing a single function.
Few standards are shared by different systems, though XML is commonly used for category tags. Implementations remain mostly limited to specialized tasks; for text analysis, these include Web search, personalized news reporting and, lately, anti-terrorism surveillance. Other than early adopters, few potential users understand the basic nature or value of the products.
Autonomy (Autonomy Corp., 415/243-9955, www.autonomy.com) is one of the largest text analysis vendors – though with annual revenue around $50 million, it is far from huge. It uses Bayesian statistics to identify word patterns that signify concepts within documents.
But, following the classic strategy of an early leader in an emerging industry, Autonomy positions itself more broadly, as providing an “infrastructure” that makes unstructured data available throughout the enterprise. It supports this claim with connectors and applets that display Autonomy outputs within third-party systems.
Perhaps the most interesting example is ActiveKnowledge, which automatically displays a list of documents related to whatever the user is viewing in a third-party application. Autonomy also can provide external applications with profiles of user interests; personalized news feeds or alerts; lists of people with similar interests or expertise in a field; and keyword- as well as concept-based searches.
Other enterprise-level features include connections to more than 200 data types, sophisticated integration with external security systems to control document access and classification speeds measured in thousands of documents per second.
Because Autonomy uses statistical rather than semantic techniques, it is language-independent and can automatically identify new concepts as these start appearing in new documents. These capabilities make it particularly suited for surveillance applications like monitoring telephone conversations.
Pricing of Autonomy depends on the number of users and system functions; a typical large installation costs $400,000. First released in 1996, the product has since been sold to more than 600 clients and embedded in software from more than 50 other vendors.