ChoiceMaker Shows Matching Trends
For example, the original Soundex algorithm, designed to overcome spelling variations by building phonetic name indexes, was patented in 1918 and used extensively in setting up the original Social Security system files in the 1930s.
Government agencies continued to develop matching systems independently of direct marketers. Apart from the prosaic reason that the two groups had little contact, subtle but significant differences exist in their requirements.
Merge/purge systems were developed mainly to pool names for mailing lists. Since the price of an error was the cost of a duplicate mail piece, accuracy could be compromised to gain speed and efficiency. Government systems typically were used to search for individuals in a single existing file, whether of criminal suspects, immigration documents or tax records. Accuracy and real-time response were much more important; handling big batch jobs with disparate sources was not.
But applications such as enterprise-wide customer relationship management require functions more similar to governmental search systems than to traditional merge/purge. So it's not surprising to see an increasing number of commercial products with origins in government applications, as well as a rising number of products with both types of users. Nor, given the higher priority placed on accuracy, is it surprising to see technical innovations that promise more reliable results.
ChoiceMaker 2.1 (ChoiceMaker Technologies, 646/336-4441, www.choicemaker.com) illustrates these trends. The system originally was developed to help the New York City Department of Health find duplicates in its registry of children's immunization records. It since has been sold to commercial as well as other government clients. And it employs technology that, according to the vendor, has proven more accurate than competing products in several head-to-head tests.
Actually, ChoiceMaker combines several innovative technologies. At the lowest level, the system is written in Java, which lets it run on nearly any hardware and connect to nearly any data source. Inputs are defined with a schema that not only identifies the available fields, but also can specify relationships across data tables, incorporate validity checks, parse entries into separate elements and create derived values such as Soundex codes.
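To illustrate the kind of derived value such a schema might produce, here is a minimal Python sketch of the commonly used American Soundex variant. It is not ChoiceMaker code, and nothing here indicates which Soundex variant the product uses; the function name `soundex` is an assumption made for illustration.

```python
def soundex(name: str) -> str:
    """Return a four-character American Soundex code for a name (illustrative)."""
    # Map each consonant group to its Soundex digit.
    codes = {}
    groups = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
              ("L", "4"), ("MN", "5"), ("R", "6"))
    for group, digit in groups:
        for ch in group:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]                  # the first letter is kept as-is
    prev = codes.get(letters[0], "")     # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                     # H and W do not separate equal codes
        code = codes.get(ch, "")         # vowels and Y have no code
        if code and code != prev:
            result += code
        prev = code                      # a vowel resets the "previous" code
    return (result + "000")[:4]          # pad or truncate to four characters
```

With this sketch, `soundex("Robert")` and `soundex("Rupert")` both return `"R163"`, which is the kind of spelling variation a phonetic index is meant to absorb.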
Processing rules are written in Java or in ChoiceMaker's own ClueMaker language. ClueMaker extends Java with specialized matching functions such as field swaps (e.g., comparing first name in one record against last name in another record) and data stacking (allowing multiple values in a field, such as old and new address). ClueMaker statements are automatically converted into Java for execution.
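The two matching functions named above can be sketched in a few lines. The example below is plain Python, not ClueMaker syntax (which is not shown here); the record layout and the names `normalize`, `names_swapped` and `any_stacked_match` are hypothetical.

```python
def normalize(value: str) -> str:
    """Trim and lowercase a value so comparisons ignore case and padding."""
    return value.strip().lower()


def names_swapped(a: dict, b: dict) -> bool:
    """Field swap: first name in one record equals last name in the other."""
    return (normalize(a["first"]) == normalize(b["last"])
            and normalize(a["last"]) == normalize(b["first"]))


def any_stacked_match(values_a: list, values_b: list) -> bool:
    """Data stacking: each field holds several values; match if any coincide."""
    set_a = {normalize(v) for v in values_a}
    set_b = {normalize(v) for v in values_b}
    return not set_a.isdisjoint(set_b)
```

For instance, a record holding both an old and a new address would match another record that carries only the old one, because the stacked comparison checks every pairing.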
ChoiceMaker uses this technology to read, parse and standardize input in fairly conventional fashion. The processed data is then stored in a reference table. When a new record is presented for matching, the system selects records from this table for comparison. Like other systems, ChoiceMaker limits this selection to records similar enough to be potential matches.
ChoiceMaker adjusts the selection based on the distinctiveness of the input: For an unusual name like Guardado, all records with the same name may be returned; for a common name like Nelson, the selection might be restricted to matches on name plus ZIP code. The fields to use in these selections and the maximum number of names to return for each search are specified during system setup. The system automatically determines how many selection criteria are needed, using precalculated statistics on the frequency of different values within the reference table. A few other matching systems use similar techniques, but most matching software is much less advanced.
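One way such frequency-driven selection could work is sketched below in Python. This is an assumption-laden illustration, not the vendor's algorithm: the field order in `FIELDS`, the function names and the record layout are all hypothetical.

```python
from collections import Counter

# Hypothetical setup choice: selection fields, in the order they are added.
FIELDS = ("last", "zip")


def build_stats(reference: list) -> list:
    """Precalculate value frequencies for each prefix of FIELDS."""
    stats = []
    for n in range(1, len(FIELDS) + 1):
        stats.append(Counter(tuple(r[f] for f in FIELDS[:n]) for r in reference))
    return stats


def choose_criteria(record: dict, stats: list, max_candidates: int) -> tuple:
    """Use as few fields as possible while staying under the candidate limit."""
    for n, counts in enumerate(stats, start=1):
        key = tuple(record[f] for f in FIELDS[:n])
        if counts[key] <= max_candidates:
            return FIELDS[:n]
    return FIELDS  # even the most specific key is too common; use all fields


def select_candidates(record: dict, reference: list, stats: list,
                      max_candidates: int) -> list:
    """Return reference records that agree with the input on the chosen fields."""
    used = choose_criteria(record, stats, max_candidates)
    return [r for r in reference if all(r[f] == record[f] for f in used)]
```

With a limit of three candidates, a rare surname would be selected on name alone, while a surname that appears five times in the reference table would be narrowed by ZIP code as well.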
Once the candidate records are returned, ChoiceMaker matches these against the input. This is the most unusual, and sophisticated, aspect of ChoiceMaker. The system first evaluates "clues" that indicate whether records match or differ: same first name, phonetically similar last name, different birth years and so on. These clues are written in ClueMaker and can be complex: for example, checking whether a pair of records contains one address in the Midwest or Northeast and another in Florida or Arizona, to find people who head south for the winter.
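The seasonal-migration clue described above might look something like the following Python sketch. It is not ClueMaker code; the state lists are a small illustrative subset, and the names `NORTHERN`, `SUNBELT` and `snowbird_clue` are hypothetical.

```python
# Illustrative subsets only; a real clue would need complete region lists.
NORTHERN = {"IL", "OH", "MI", "WI", "MN", "NY", "MA", "CT", "NJ", "PA"}
SUNBELT = {"FL", "AZ"}


def snowbird_clue(a: dict, b: dict) -> bool:
    """Fire when one record's state is northern and the other's is FL or AZ."""
    state_a, state_b = a.get("state"), b.get("state")
    return ((state_a in NORTHERN and state_b in SUNBELT)
            or (state_a in SUNBELT and state_b in NORTHERN))
```

A clue like this does not prove a match by itself; it only records that a pair of differing addresses fits a known pattern, so that the difference counts less heavily against the pair.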
Each clue yields "match," "differ" or, if the appropriate data is unavailable, no result. Where gradation is appropriate, such as degree of near match or match on common name vs. match on unusual name, separate clues are created for each level. This is part of the reason a typical installation uses 200 clues against far fewer data elements.
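A short Python sketch can make the three possible outcomes and the one-clue-per-level idea concrete. Everything named here (`Outcome`, the clue functions, the `rare_below` cutoff of 10) is an assumption for illustration rather than ChoiceMaker's actual design.

```python
from enum import Enum


class Outcome(Enum):
    MATCH = "match"
    DIFFER = "differ"
    NO_RESULT = "no result"


def birth_year_clue(a: dict, b: dict) -> Outcome:
    """Three-valued clue: no result when either birth year is missing."""
    year_a, year_b = a.get("birth_year"), b.get("birth_year")
    if year_a is None or year_b is None:
        return Outcome.NO_RESULT
    return Outcome.MATCH if year_a == year_b else Outcome.DIFFER


def same_rare_last_name(a: dict, b: dict, name_counts: dict,
                        rare_below: int = 10) -> Outcome:
    """Graded clue, level 1: fires only for matching UNUSUAL surnames."""
    last_a, last_b = a.get("last"), b.get("last")
    if last_a and last_a == last_b and name_counts.get(last_a, 0) < rare_below:
        return Outcome.MATCH
    return Outcome.NO_RESULT


def same_common_last_name(a: dict, b: dict, name_counts: dict,
                          rare_below: int = 10) -> Outcome:
    """Graded clue, level 2: fires only for matching COMMON surnames."""
    last_a, last_b = a.get("last"), b.get("last")
    if last_a and last_a == last_b and name_counts.get(last_a, 0) >= rare_below:
        return Outcome.MATCH
    return Outcome.NO_RESULT
```

Splitting the surname comparison into two clues lets each level carry its own weight, which is how a modest number of fields can give rise to a much larger number of clues.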
The system must combine the individual clue results to reach a decision. ChoiceMaker does this by assigning statistical weights to the clues and comparing the combined weights of the "match" clues vs. "differ" clues. Record pairs with a clear result are classified automatically; others can be flagged for manual review.
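As a rough illustration of this weighing step, the Python sketch below sums the weights of the clues that fired on each side and classifies the pair by the difference. The clue names, the weights in `CLUES` and the `review_band` threshold are invented for the example; the exact scoring formula ChoiceMaker uses is not given here.

```python
# Hypothetical clue table: name -> (decision the clue supports, weight).
CLUES = {
    "same_first_name": ("match", 1.5),
    "similar_last_name": ("match", 2.0),
    "different_birth_year": ("differ", 3.0),
}


def classify(fired: list, clues: dict, review_band: float = 1.0) -> str:
    """Compare combined 'match' weight with combined 'differ' weight."""
    match_total = sum(clues[name][1] for name in fired
                      if clues[name][0] == "match")
    differ_total = sum(clues[name][1] for name in fired
                       if clues[name][0] == "differ")
    score = match_total - differ_total
    if score > review_band:
        return "match"       # clear result: classified automatically
    if score < -review_band:
        return "differ"      # clear result: classified automatically
    return "review"          # too close to call: flag for manual review
```

Under these made-up weights, a pair on which all three clues fire scores 0.5 and falls inside the review band, while a pair with only the two name clues scores 3.5 and is classified as a match.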
The weights are determined using a machine learning technique called "maximum entropy modeling." This involves submitting several thousand records with matches already marked; the system then automatically derives the set of weights that most closely predicts the marked matches. Such automated training is unusual in the world of matching software: Even the most advanced systems typically rely on users to manually refine match rules by looking at missed or false matches and making adjustments.
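For a two-way match/differ decision, a maximum entropy model is mathematically equivalent to logistic regression, so the training idea can be sketched with a few lines of gradient ascent in Python. This is a toy under stated assumptions, not the vendor's trainer: the six hand-labeled pairs, the feature order, the learning rate and the function names are all invented, and a real installation would train on thousands of marked records.

```python
import math


def predict(weights: list, bias: float, features: list) -> float:
    """Probability that a record pair is a match, given which clues fired."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples: list, n_features: int,
          epochs: int = 2000, rate: float = 0.5) -> tuple:
    """Fit clue weights by gradient ascent on the log-likelihood."""
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad = [0.0] * n_features
        grad_bias = 0.0
        for features, label in examples:
            error = label - predict(weights, bias, features)
            for i, f in enumerate(features):
                grad[i] += error * f
            grad_bias += error
        # Average the gradient over the examples before stepping.
        weights = [w + rate * g / len(examples) for w, g in zip(weights, grad)]
        bias += rate * grad_bias / len(examples)
    return weights, bias


# Features: [same_first_name, similar_last_name, different_birth_year]
# Label: 1 = marked as a match, 0 = marked as different (toy data).
EXAMPLES = [
    ([1, 1, 0], 1), ([1, 0, 0], 1), ([0, 1, 0], 1),
    ([1, 0, 1], 0), ([0, 0, 1], 0), ([0, 1, 1], 0),
]
```

Calling `train(EXAMPLES, 3)` on this toy set yields a negative weight for the birth-year clue, because every pair on which it fired was marked as different; no one had to set that weight by hand.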
Of course, ChoiceMaker still requires human effort: to define input data, specify parsing and standardization rules, build new clues, create test cases and review results. The vendor says it takes about two weeks of labor to set up a sophisticated matching process. Whether this is more or less than other systems would depend on the circumstances. For unusual matching problems, ChoiceMaker probably would have an advantage. The system includes several tools to help with development, but considerable expertise is still required.
ChoiceMaker originally was developed in 1998 and has several current installations. Pricing depends on the application, ranging from $7,500 for a development license to hundreds of thousands of dollars for a large implementation.