DataMentors' DataFuse Offers Precision, Flexibility
The common attitude is that all matching systems give roughly the same results, so detailed evaluation is not worth the effort.
It is an understandable mistake. Today's first-rank matching systems, from vendors including Trillium Digital Systems, Group 1, i.d.Centric and Innovative Systems, all take the same general approach: They use keyword and pattern tables to split each name-address record into elements, reassemble those elements in a standard format and link records that match on specified combinations of elements.
But many other systems use match-code or statistical techniques that are much less effective because they lack the knowledge built into massive word and pattern tables. And even among table-based systems, subtle differences yield slightly different results. Small differences matter when millions of names are involved.
DataFuse (DataMentors, 813/960-7800, www.datamentors.com) is a new table-based matching system. While similar to its peers, DataFuse offers unusual precision. For example, it can treat the same word differently depending on whether it appears at the beginning, middle or end of a line. The system also provides great flexibility, allowing users to include any number of elements in a matching rule and to control the sensitivity of the rules themselves.
The result, the vendor says, is significantly greater accuracy than other products. In this context, a significant difference is rather small: DataMentors points proudly to a test in which DataFuse changed 6 percent of the households identified by another product. The net improvement may have been less because not every change was necessarily correct.
DataFuse works in a five-step process. The first step is to identify the type of data on each line of a name-address record. The system uses word tables to decide whether a line should be suppressed, contains a street name or contains a city or state.
The city-state table holds common misspellings, abbreviations and variants of geographic names and replaces these with a standard version to improve later matching. This step also applies existing linkages, such as a customer ID; eliminates common false values, such as a date of 11/11/11; and applies no-mail indicators based on missing address data or keywords such as "deceased."
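The cleanup in this first step can be pictured with a minimal Python sketch. It is not DataFuse code: the table contents, the field names and the functions standardize_city and clean_record are all hypothetical, chosen only to illustrate table-driven standardization, false-value removal and no-mail flagging.

```python
# Hypothetical lookup tables; the real DataFuse tables are not published.
CITY_VARIANTS = {
    "nyc": "NEW YORK",
    "n.y.": "NEW YORK",
    "new yrok": "NEW YORK",
}
FALSE_DATES = {"11/11/11"}      # placeholder values keyed in place of real data
NO_MAIL_WORDS = {"deceased"}


def standardize_city(raw):
    """Replace a known misspelling or abbreviation with the standard name."""
    key = " ".join(raw.lower().split())
    return CITY_VARIANTS.get(key, raw.strip().upper())


def clean_record(record):
    """Return a cleaned copy of one name-address record (a dict of strings)."""
    cleaned = dict(record)
    cleaned["city"] = standardize_city(record.get("city", ""))
    # Drop common false values rather than let them drive later matching.
    if record.get("date") in FALSE_DATES:
        cleaned["date"] = None
    # Flag records with no address or with a no-mail keyword.
    text = (record.get("name", "") + " " + record.get("address", "")).lower()
    words = set(text.replace(",", " ").split())
    missing_address = not record.get("address", "").strip()
    cleaned["no_mail"] = missing_address or bool(words & NO_MAIL_WORDS)
    return cleaned


record = {
    "name": "John Smith deceased",
    "address": "12 Main St",
    "city": " nyc ",
    "date": "11/11/11",
}
result = clean_record(record)
```

In this sketch the city becomes "NEW YORK", the false date is cleared and the record is flagged as no-mail because of the keyword.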
The second step splits the name line into elements such as first name and last name. Using at least six separate tables, the system first standardizes or deletes common words and phrases. Then it codes each word as commercial (corporation, marketing), a specific type such as title (Mr., Mrs.) or a generic type such as mixed alphanumeric. It also can assign actions, such as ignoring the word and the one that follows it. The resulting sequence of codes is then looked up in a table that defines how each word is treated.
For example, "John and Jane Smith" might be coded "FRFA." The FRFA table entry might treat the first and third words as first names, treat the fourth word as a last name and create two separate name lines: "John Smith" and "Jane Smith." Once name parts are identified, the system applies a gender table to the first names to assign gender codes.
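A rough Python sketch of this pattern-table lookup follows. The meaning of the individual code letters is an assumption (F for a known first name, R for a connector such as "and," A for any other alphabetic word, T for a title), and the tables, the function names and the layout of a pattern entry are hypothetical rather than taken from DataFuse.

```python
# Hypothetical word-code and gender tables (assumed letter meanings:
# F = first name, R = connector, T = title, A = other alphabetic word).
WORD_CODES = {"john": "F", "jane": "F", "and": "R", "mr.": "T", "mrs.": "T"}
GENDER = {"john": "M", "jane": "F"}

# Pattern table: code sequence -> one (first-name index, last-name index)
# pair for each name line to create.
PATTERNS = {
    "FRFA": [(0, 3), (2, 3)],
    "FA": [(0, 1)],
    "TFA": [(1, 2)],
}


def code_words(words):
    """Build the code sequence for a name line, one letter per word."""
    return "".join(WORD_CODES.get(w.lower(), "A") for w in words)


def parse_name_line(line):
    """Split a name line into one or more parsed names using the pattern table."""
    words = line.split()
    entry = PATTERNS.get(code_words(words))
    if entry is None:
        return []  # unknown pattern; a production system would need a fallback
    names = []
    for first_idx, last_idx in entry:
        first, last = words[first_idx], words[last_idx]
        names.append(
            {"first": first, "last": last, "gender": GENDER.get(first.lower())}
        )
    return names


parsed = parse_name_line("John and Jane Smith")
```

Here the line is coded "FRFA" and yields two names, John Smith (gender M) and Jane Smith (gender F). Handling a new name layout means adding a pattern entry, not changing the code.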
The third step applies a similar process to address lines: It standardizes and codes each word on the line, looks up the code sequence in a table and assigns the element types as the table specifies. DataFuse can then call third-party postal software to apply U.S. ZIP+4 codes and CASS standardization. Tables can be modified to support international names and addresses, but all records for a given file use the same tables.
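The address step can be sketched the same way. Everything below is illustrative: the standardization table, the code letters (N for numeric, D for direction, S for street suffix, A for other words), the pattern entries and the element names are assumptions, and the sketch stops before the ZIP+4 and CASS processing that the article says is left to third-party postal software.

```python
# Hypothetical standardization table and word classes.
STANDARD_WORDS = {"street": "ST", "st.": "ST", "avenue": "AVE", "north": "N"}
SUFFIXES = {"ST", "AVE"}
DIRECTIONS = {"N", "S", "E", "W"}

# Pattern table: code sequence -> element type for each word, in order.
ADDRESS_PATTERNS = {
    "NAS": ["house_number", "street_name", "street_suffix"],
    "NDAS": ["house_number", "pre_direction", "street_name", "street_suffix"],
}


def code_word(word):
    """Assign a one-letter code to an already standardized word."""
    if word.isdigit():
        return "N"
    if word in DIRECTIONS:
        return "D"
    if word in SUFFIXES:
        return "S"
    return "A"


def parse_address_line(line):
    """Standardize each word, look up the code sequence, label the elements."""
    words = [STANDARD_WORDS.get(w.lower(), w.upper()) for w in line.split()]
    pattern = "".join(code_word(w) for w in words)
    element_types = ADDRESS_PATTERNS.get(pattern)
    if element_types is None:
        return None  # pattern not in the table
    return dict(zip(element_types, words))


address = parse_address_line("123 North Main Street")
```

In this sketch the line is standardized to "123 N MAIN ST", coded "NDAS" and split into house number, pre-direction, street name and street suffix.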
The fourth step is data matching. Here DataFuse offers almost total flexibility. There are more than 20 matching methods such as phonetic comparisons and string comparisons. Each returns a score indicating the similarity between two elements. Users can combine these methods into rules that specify which elements are compared, which method is applied to each element and what score qualifies as an element match. Multiple rules can be applied to the same file to let different combinations of element matches qualify as a record match.
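To make the rule structure concrete, here is a minimal Python sketch. It is not DataFuse's scripting language: the two comparison methods, the rule names, the element names and the 0.85 threshold are all assumptions, and Python's standard difflib similarity stands in for whatever string comparisons the product actually uses.

```python
from difflib import SequenceMatcher


def exact(a, b):
    """Score 1.0 when both values are non-empty and equal, else 0.0."""
    a, b = a.strip().lower(), b.strip().lower()
    return 1.0 if a and a == b else 0.0


def string_similarity(a, b):
    """Score between 0.0 and 1.0 from difflib; empty values never match."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


# Each rule lists (element, method, minimum score). Every comparison in a
# rule must reach its minimum score for the rule to declare a record match.
RULES = {
    "street_rule": [
        ("house_number", exact, 1.0),
        ("street", string_similarity, 0.85),
    ],
    "po_box_rule": [("po_box", exact, 1.0), ("zip", exact, 1.0)],
}


def first_matching_rule(rec_a, rec_b, rules=RULES):
    """Return the name of the first rule that links two records, or None."""
    for name, comparisons in rules.items():
        if all(
            method(rec_a.get(element, ""), rec_b.get(element, "")) >= minimum
            for element, method, minimum in comparisons
        ):
            return name
    return None


a = {"house_number": "12", "street": "Main Street", "zip": "33601"}
b = {"house_number": "12", "street": "Main Stret", "zip": "33601"}
matched_by = first_matching_rule(a, b)
```

Returning the rule name rather than a plain yes or no makes it possible to report afterward which rule produced each match.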
For example, a user might want two records to match if they have either the same house number and street, or the same P.O. box and ZIP code. Like most matching systems, DataFuse does not compare each record to all others. Instead, the user chooses a sort sequence and how many adjacent records will be compared. Users can sort one file several ways and apply different matching rules to each sort.
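This sort-and-compare-neighbors approach can be sketched in a few lines of Python. The function name, the record layout and the window parameter (how many following records each record is compared with) are assumptions made for illustration.

```python
def candidate_pairs(records, sort_keys, window):
    """Collect pairs of record IDs that fall within `window` positions of
    each other under at least one of the given sort orders."""
    pairs = set()
    for key in sort_keys:
        ordered = sorted(records, key=key)
        for i, rec in enumerate(ordered):
            # Compare each record only with the next `window` records.
            for other in ordered[i + 1 : i + 1 + window]:
                pairs.add(tuple(sorted((rec["id"], other["id"]))))
    return pairs


records = [
    {"id": 1, "last": "Smith", "zip": "33601"},
    {"id": 2, "last": "Jones", "zip": "33601"},
    {"id": 3, "last": "Smith", "zip": "90210"},
    {"id": 4, "last": "Adams", "zip": "10001"},
]

by_zip = candidate_pairs(records, [lambda r: r["zip"]], window=1)
both = candidate_pairs(
    records, [lambda r: r["zip"], lambda r: r["last"]], window=1
)
```

With only the ZIP sort, the two Smith records (IDs 1 and 3) are never adjacent, so they are never compared. Adding a second pass sorted by last name brings them together, which is why several sorts over the same file can find matches that a single sort misses.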
The final step identifies the primary record in each match group, performs calculations such as profitability coding or decile assignments and optionally applies geo-demographic information using third-party software. DataFuse has a powerful scripting language for such calculations. This scripting language is also used for tasks such as saving previous household ID codes to trace additions and deletions to a household over time.
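Two of these final calculations can be sketched in Python. The article does not say how DataFuse picks the primary record, so the "most complete record wins" criterion below is purely an assumption, as are the function names and the convention that decile 1 holds the highest values.

```python
def pick_primary(group):
    """Assumed criterion: the record with the most non-empty fields.
    Ties go to the record that appears first in the group."""
    return max(
        group, key=lambda rec: sum(1 for v in rec.values() if v not in (None, ""))
    )


def assign_deciles(values):
    """Return a decile (1 = highest values, 10 = lowest) for each value."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    deciles = [0] * len(values)
    for rank, idx in enumerate(order):
        deciles[idx] = rank * 10 // len(values) + 1
    return deciles


group = [
    {"id": 7, "name": "J Smith", "phone": ""},
    {"id": 9, "name": "John Smith", "phone": "813-555-0100"},
]
primary = pick_primary(group)

# Twenty hypothetical profitability scores, 0 through 19.
deciles = assign_deciles(list(range(20)))
```

In this example record 9 becomes the primary because it has a phone number, and the twenty scores fall two to a decile, with the highest score in decile 1.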
DataFuse produces a flat file of the coded records as its output. The system also stores which rule caused each match and provides summary reports to show the effect of each rule. Rules and other processes are defined by writing script files, though DataMentors plans to release a graphical user interface later this year.
DataFuse was introduced in early 2000. It has four installations, including two service vendors that use the system for multiple clients. The system runs on Windows NT and can process about 200,000 records per hour, depending on the hardware and the complexity of the rules. This is roughly comparable to competing products. Pricing is based on the number of records processed and begins at $50,000.