Track the Integrity of 'Probabilistics'

Are you passionate about toothpaste?

Probably not, even though you recognize it is important and suspect there are significant differences among products. But imagine the scientists who work in toothpaste labs: They are masters of dentifrice minutiae and no doubt spend long hours in fierce debate over the merits of alternative approaches.

Data matching software is the same way. Most of us do not give much thought to the details of matching techniques, but the people who build such systems have worked out detailed taxonomies and dismissive critiques of opposing methods. The poor layman who tries to use these analyses to make an intelligent product selection is soon bewildered.

Integrity (Vality Technology Inc., 617/338-0300, www.vality.com) uses what the vendor calls “probabilistic” matching: a technique that considers the rarity of data values as well as their similarity when judging the significance of a match. In other words, a match on a common name like “O'Neal” counts for less than a match on an unusual name like “Mutombo.” This is a pretty sexy concept, at least by the standards of matching software, and it does distinguish Integrity from competing systems. But the focus on probabilistic techniques also obscures the ways in which Integrity resembles other matching products, as well as some other significant differences.
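To make the rarity idea concrete, here is a minimal sketch of frequency-based weighting. It assumes a weight of log2(1 / relative frequency), a common choice in the record linkage literature; the article does not say how Integrity actually computes its weights, and the function name and sample counts are invented for illustration.

```python
import math
from collections import Counter

def agreement_weight(value, counts, total):
    """Weight for an exact match on `value`; rarer values score higher.

    Assumption: weight = log2(1 / relative frequency). This is an
    illustrative formula, not Integrity's documented method.
    """
    frequency = counts[value] / total
    return math.log2(1.0 / frequency)

# Invented sample file: "O'Neal" is common, "Mutombo" is rare.
surnames = ["O'Neal"] * 50 + ["Smith"] * 48 + ["Mutombo"] * 2
counts = Counter(surnames)

common = agreement_weight("O'Neal", counts, len(surnames))
rare = agreement_weight("Mutombo", counts, len(surnames))
print(rare > common)  # True: the unusual name counts for more
```

In this toy file a match on “Mutombo” earns several times the weight of a match on “O'Neal,” simply because far fewer records share it by chance.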

First, the similarities. Like all matching systems, Integrity accepts input records, parses them into specific elements such as last name and street number, standardizes entries to reduce variations such as nicknames and abbreviations, and then identifies groups of records that are similar enough to be worth comparing in detail. Integrity provides a broad range of capabilities in all these areas. It can read relational database tables as well as the usual flat files; it can perform a range of data transformations including string manipulation and table-based value substitutions; and it lets users apply multiple grouping rules without physically re-sorting the records.
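The standardize-then-group sequence described above can be sketched in a few lines. Everything here is hypothetical: the nickname table, the field names and the grouping rule (first three letters of the last name plus ZIP code) are stand-ins chosen for the example, not Integrity's own tables or rules.

```python
from collections import defaultdict

# Hypothetical substitution table; real products ship far larger ones.
NICKNAMES = {"bob": "robert", "bill": "william"}

def standardize(record):
    """Lower-case and trim fields, and replace nicknames with formal names."""
    first = record["first"].strip().lower()
    return {
        "first": NICKNAMES.get(first, first),
        "last": record["last"].strip().lower(),
        "zip": record["zip"].strip(),
    }

def group_key(record):
    """Assumed grouping rule: first three letters of last name plus ZIP."""
    return (record["last"][:3], record["zip"])

def candidate_groups(records):
    """Return only the groups holding two or more records worth comparing."""
    groups = defaultdict(list)
    for record in map(standardize, records):
        groups[group_key(record)].append(record)
    return {key: members for key, members in groups.items() if len(members) > 1}

records = [
    {"first": "Bob", "last": "Smith ", "zip": "02110"},
    {"first": "Robert", "last": "SMITH", "zip": "02110"},
    {"first": "Ann", "last": "Jones", "zip": "10001"},
]
groups = candidate_groups(records)
```

Because the groups are built in a dictionary, a second grouping rule could be applied by swapping in another key function, with no need to re-sort the input.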

Integrity also applies particularly sophisticated methods to parsing and standardization. Like other systems, it has large tables that list the roles typically played by common words: for example, “street,” “avenue,” “lane” and “road” are usually a “street type.” The problem is that many words can play several roles or appear in different orders: “Lane” could be a first or last name; “Avenue B” and “Fifth Avenue” are both real addresses.

Most systems resolve these ambiguities by assigning a role code to a word in a line and then looking for the resulting sequence of codes in a reference table; if they do not find a sequence using the most common role for each word, they may assign alternative codes and look for the sequence that results from that. This method is effective but requires very large tables to accommodate all the different potential sequences. Integrity instead works with smaller sequences that represent fragments of a name or address line, and looks for these within each line. This lets the system interpret a wide variety of sequences using relatively few standard patterns.
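A rough sketch of the fragment idea follows. The role codes, the word table and the three fragment patterns are all invented for illustration, and the left-to-right scan is an assumption; the article does not describe Integrity's actual pattern language.

```python
# Hypothetical word table: words that usually act as a street type ("T").
ROLE_TABLE = {"street": "T", "avenue": "T", "lane": "T", "road": "T"}

def token_code(token):
    """Assign a one-letter role code to a token."""
    word = token.lower()
    if word.isdigit():
        return "N"  # number
    if word in ROLE_TABLE:
        return ROLE_TABLE[word]
    if len(word) == 1 and word.isalpha():
        return "L"  # single letter
    return "W"  # any other word

# Short fragments, tried in order at each position in the line.
FRAGMENTS = [
    (("T", "L"), ("street_type", "street_name")),  # "Avenue B"
    (("W", "T"), ("street_name", "street_type")),  # "Fifth Avenue"
    (("N",), ("house_number",)),
]

def parse_line(line):
    """Label tokens by scanning for fragment patterns from left to right."""
    tokens = line.split()
    codes = [token_code(t) for t in tokens]
    result, i = [], 0
    while i < len(tokens):
        for pattern, roles in FRAGMENTS:
            n = len(pattern)
            if tuple(codes[i:i + n]) == pattern:
                result.extend(zip(roles, tokens[i:i + n]))
                i += n
                break
        else:
            # No fragment fits here; leave the token unlabeled.
            result.append(("unknown", tokens[i]))
            i += 1
    return result

avenue_b = parse_line("12 Avenue B")
fifth_avenue = parse_line("350 Fifth Avenue")
```

Three short fragments are enough to handle both “12 Avenue B” and “350 Fifth Avenue,” whereas a table of whole-line sequences would need a separate entry for each arrangement.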

Once the elements are identified and standardized, Integrity compares each pair of records by assigning similarity scores to the corresponding elements, weighting these by the importance of each element, and adding the weighted scores to get a score for the pair as a whole. Record pairs with a total score above a specified threshold are considered a match; pairs with a score between this level and another, lower threshold are considered questionable and may be flagged for manual review. Like other systems, Integrity lets users specify which data elements to include and has different comparison methods for different data types.
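The scoring scheme can be illustrated with a short sketch. The weights, the two thresholds and the use of Python's difflib as the similarity measure are all assumptions made for the example; Integrity's own comparison methods and settings are not described in that detail here.

```python
from difflib import SequenceMatcher

# Hypothetical element weights and thresholds, chosen only for illustration.
WEIGHTS = {"last": 5.0, "first": 3.0, "street": 2.0}
MATCH_THRESHOLD = 8.5
REVIEW_THRESHOLD = 6.0

def similarity(a, b):
    """Similarity between two strings, from 0.0 (nothing shared) to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_score(r1, r2):
    """Sum of weighted element similarities for one pair of records."""
    return sum(w * similarity(r1[f], r2[f]) for f, w in WEIGHTS.items())

def classify(r1, r2):
    """Apply the upper and lower thresholds to the pair score."""
    score = pair_score(r1, r2)
    if score >= MATCH_THRESHOLD:
        return "match"
    if score >= REVIEW_THRESHOLD:
        return "review"
    return "no match"

a = {"last": "Mutombo", "first": "Dikembe", "street": "abc"}
b = {"last": "Mutombo", "first": "Dikembe", "street": "xyz"}
c = {"last": "qqq", "first": "www", "street": "xyz"}
```

Here `classify(a, a)` scores the full 10 points, while `classify(a, b)` loses the 2 street points and lands in the review band between the two thresholds.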

The approach of adding similarity scores is used by some merge/purge products, but the majority of today's sophisticated matching systems actually use a different technique: They assign codes rather than scores for each element match, and then explicitly define the treatment (match, no match or review) for each combination of codes. This method gives the user precise control over how each situation is handled and makes it easier to understand results. While assigning treatments to many thousands of code combinations is burdensome, vendors provide pre-built tables to spare users much of the effort.
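For contrast, a code-combination system might look something like the sketch below. The four codes, the crude first-letter test for a partial match and the decision table itself are invented for illustration; commercial tables are far larger and use much finer-grained codes.

```python
def compare_code(a, b):
    """Return a code, not a score, describing how two values compare."""
    if not a or not b:
        return "M"  # one value is missing
    if a.lower() == b.lower():
        return "E"  # exact match
    if a[0].lower() == b[0].lower():
        return "P"  # crude partial match: same first letter (assumption)
    return "D"      # different

# Hypothetical table: (last-name code, first-name code) -> treatment.
DECISIONS = {
    ("E", "E"): "match",
    ("E", "P"): "review",
    ("E", "M"): "review",
    ("P", "E"): "review",
    ("E", "D"): "no match",
    ("D", "E"): "no match",
    ("D", "D"): "no match",
}

def classify(r1, r2, fields=("last", "first")):
    """Look up the treatment for the pair's combination of codes."""
    codes = tuple(compare_code(r1[f], r2[f]) for f in fields)
    # A real table would list every combination; unlisted ones fall back
    # to "review" here only to keep the sketch short.
    return DECISIONS.get(codes, "review")

x = {"last": "Mutombo", "first": "Dikembe"}
y = {"last": "Mutombo", "first": "Dike"}
z = {"last": "O'Neal", "first": "Shaquille"}
```

The appeal is transparency: anyone who wants to know why `classify(x, y)` came back as a review can read the answer straight out of the table.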

What Integrity gains from the score-based method is a greater ability to add nonstandard elements to a matching process. Adding elements is problematic for code-combination systems because each new element at least doubles the combinations that must be classified. Nor is it obvious without extensive testing exactly how each new combination should be treated. Integrity's score-based approach lends itself to automated data analysis techniques that can determine by themselves how different elements should be weighted and where the score thresholds should be set.
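The doubling is easy to verify by enumeration. The sketch below assumes the smallest possible code set, two codes per element; the element names are hypothetical.

```python
from itertools import product

# Smallest possible code set: each element either agrees or disagrees.
CODES = ("agree", "disagree")

def combinations_to_classify(elements, codes=CODES):
    """Every combination of codes that would need an assigned treatment."""
    return list(product(codes, repeat=len(elements)))

three = combinations_to_classify(["last", "first", "street"])
four = combinations_to_classify(["last", "first", "street", "birth_date"])
print(len(three), len(four))  # 8 16
```

With four codes per element instead of two, the same step from three elements to four raises the count from 64 to 256, which is why “at least doubles” is the conservative case.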

While users can and do override the system's recommendations, those recommendations give a reasonable starting point. Integrity further enhances its flexibility by providing comparison algorithms for data types that are not used in conventional name and address matching situations, such as geographic distance and dates.
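Comparisons of this kind fit naturally into a score-based design, because any measure that can be reduced to a number between 0 and 1 can be weighted and added like the rest. The sketch below is an assumption about how such comparators might look, using the standard haversine formula for distance; the cutoff values and function names are invented, and nothing here reflects Integrity's actual algorithms.

```python
import math
from datetime import date

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers (mean Earth radius 6371 km)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def distance_similarity(km, max_km=50.0):
    """1.0 at zero distance, falling linearly to 0.0 at max_km (assumed cutoff)."""
    return max(0.0, 1.0 - km / max_km)

def date_similarity(d1, d2, max_days=365):
    """1.0 for identical dates, falling linearly to 0.0 at max_days apart."""
    return max(0.0, 1.0 - abs((d1 - d2).days) / max_days)

one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
same_day = date_similarity(date(1999, 5, 1), date(1999, 5, 1))
```

Either result could then be multiplied by an element weight and added to the pair score in the same way as a name or address comparison.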

In other words, the increased accuracy provided by “probabilistic” scoring is just one aspect of Integrity, and not necessarily the most important. While many portions of the system use conventional approaches, its reliance on statistical scores to classify matches gives it much greater flexibility than systems using code combinations. Vality has exploited this by developing products that use the same underlying technology for other applications including address verification, geographic location, catalog searches and product classification.

Integrity comes with a Windows interface that lets users set up and execute data import, transformation, parsing, standardization, matching and output generation processes. The vendor provides default settings, word tables and rules for common tasks such as individual matching and householding.

The system also includes extensive reports to analyze input data, identify patterns and anomalies and report on match run results. Integrity runs on Windows, Unix, IBM mainframe and AS/400 servers and supports both batch and online processing. Pricing is based largely on the server and ranges from $50,000 to $200,000.