
2 Advanced Name-Matching Systems

Name-matching software has graduated from avoiding duplicate catalogs to helping the government watch for terrorists. In this environment, any error can be devastating: A missed match can allow a fatal attack, while a false alarm can disrupt the life of an innocent person.

Unfortunately, both errors will occur. No software is infallible. The good news is that the software used for surveillance is considerably more sophisticated than the merge/purge systems familiar to most direct marketers.

Many of these products have their roots in work previously done for law enforcement or intelligence agencies. Others were developed for commercial applications, such as consolidating customer records. But whatever their origins, all matching systems must perform two key tasks: selecting records to compare, and determining which records match.

Here are two products that take different approaches:

NameSearch (Intelligent Search Technology, 800/287-0987, www.intelligentsearch.com) focuses on the selection problem. The core of the system is an ability to generate sort keys that bring together the records most likely to match.

The first step is to clean the input records by removing extraneous words and characters, standardizing multi-word phrases and replacing nicknames and diminutives with standard forms. This sort of processing is performed by nearly every matching system. Like the others, NameSearch relies on tables and rules that specify how to handle particular words and phrases.
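As a rough illustration of this kind of table-driven cleanup, here is a minimal Python sketch. The `NOISE_WORDS` and `NICKNAMES` tables and the `clean_name` function are hypothetical stand-ins, not NameSearch's actual tables or API, and the sketch skips multi-word phrase handling.

```python
import re

# Hypothetical lookup tables; a commercial product ships far larger ones.
NOISE_WORDS = {"MR", "MRS", "MS", "DR", "JR", "SR"}
NICKNAMES = {"BILL": "WILLIAM", "BOB": "ROBERT", "LIZ": "ELIZABETH"}


def clean_name(raw: str) -> list[str]:
    """Return standardized name tokens for one input record."""
    # Uppercase, then turn every character that is not a letter into a space.
    text = re.sub(r"[^A-Z]", " ", raw.upper())
    # Drop extraneous words such as titles and suffixes.
    tokens = [t for t in text.split() if t not in NOISE_WORDS]
    # Replace nicknames and diminutives with their standard forms.
    return [NICKNAMES.get(t, t) for t in tokens]
```

Under these assumed tables, `clean_name("Dr. Bob Smith, Jr.")` yields the tokens `ROBERT` and `SMITH`.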

Recognizing that different rules apply in different cultures, NameSearch has separate sets for Anglo, European and Middle Eastern names. A graphical interface lets users modify the rules as desired.

The second step in the key building process is to replace the name with a phonetic equivalent. This is a strength of NameSearch, which uses phoneticization techniques designed to be superior to the common Soundex and NYSIIS algorithms.

Phoneticization is applied extensively to the least common names while common names are lightly phoneticized. This lets the system find as many variations of uncommon names as possible while limiting the number of candidates returned for matches against common names. Common names are listed in frequency tables provided by the vendor.
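NameSearch's own phonetic routines are proprietary, so the sketch below swaps in plain American Soundex, the baseline the product is meant to improve on, purely to show the frequency-dependent idea. The `COMMON_NAMES` table and the `phonetic_key` function are assumptions made for illustration, and leaving common names unchanged is only a crude stand-in for "light" phoneticization.

```python
# Standard American Soundex letter-to-digit table.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

# Hypothetical frequency table: names treated as "common".
COMMON_NAMES = {"SMITH", "JOHNSON", "JONES"}


def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    result = name[0]
    prev = CODES.get(name[0], "")
    for ch in name[1:]:
        code = CODES.get(ch, "")
        # Emit a digit only when it differs from the previous coded letter.
        if code and code != prev:
            result += code
        # H and W do not separate letters that share a code; vowels do.
        if ch not in "HW":
            prev = code
    return (result + "000")[:4]


def phonetic_key(name: str) -> str:
    """Light treatment for common names, full Soundex for the rest."""
    upper = name.upper()
    if upper in COMMON_NAMES:
        return upper  # common name: keep the spelling, so fewer candidates
    return soundex(upper)  # uncommon name: cast a wider phonetic net
```

With this table, a search on the common name Smith returns only exact spellings, while an uncommon name such as Tymczak is reduced to a short code that many spelling variants share.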

Users can modify these tables or run a utility that calculates frequencies within a particular set of input. Users also can choose among three phoneticization routines, which provide varying degrees of precision.

The keys themselves are created by stringing together the standardized, phoneticized name elements. NameSearch typically generates multiple keys by arranging the elements in different sequences. This lets it automatically find matches when elements appear in different orders on different records, such as Smith, John vs. John Smith.
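A minimal sketch of the idea in Python follows. It assumes the simplest possible scheme, one key per ordering of the name elements; the article does not say which sequences NameSearch actually generates, and `build_keys` is a hypothetical name.

```python
from itertools import permutations


def build_keys(elements: list[str]) -> list[str]:
    """Return one key per distinct ordering of the name elements."""
    # A set removes duplicate keys when two elements are identical.
    return sorted({"".join(order) for order in permutations(elements)})
```

Because "Smith, John" and "John Smith" produce the same set of keys, a lookup on any one key finds both records. Generating every ordering grows quickly with the number of elements, so a production system would presumably limit the sequences it builds.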

Once NameSearch keys are generated for a set of records, match candidates can be identified by specifying a range of key values. NameSearch can generate multiple ranges for a single input record, representing increasingly broad searches. But it’s up to the user to write the programs that actually find and extract the specified records.
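The extraction program a user would have to write can be quite small if the keys sit in a sorted list or an indexed database column. The Python sketch below is one assumed approach: `prefix_range` and `candidates_in_range` are hypothetical helpers, and using shorter key prefixes to broaden the search is an illustration, not NameSearch's documented method.

```python
from bisect import bisect_left, bisect_right


def prefix_range(key: str, length: int) -> tuple[str, str]:
    """Return (low, high) bounds covering keys that share a prefix."""
    prefix = key[:length]
    # "\uffff" sorts after any ordinary character, closing the range.
    return prefix, prefix + "\uffff"


def candidates_in_range(sorted_keys: list[str], low: str, high: str) -> list[str]:
    """Return every key k in the sorted list with low <= k <= high."""
    start = bisect_left(sorted_keys, low)
    end = bisect_right(sorted_keys, high)
    return sorted_keys[start:end]
```

For example, against the sorted keys JOHNSMITH, JOHNSMYTHE, JOHNSON and JONES, a six-character prefix of JOHNSMITH selects the first two keys, and a four-character prefix broadens the search to include JOHNSON.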

NameSearch does provide a half-dozen comparison routines that return a score to indicate the likelihood that two records match. These also rely on rules and phonetic comparisons and give some control to the user. Again, the user must build a supporting system to make use of the match scores once they are generated.

DataLever (DataLever Corp., 303/546-7943, www.datalever.com) differs greatly. It provides a complete data manipulation environment, with tools to extract data, analyze file contents, make changes, parse, standardize, index, geocode and generate reports, in addition to finding matches.

Users combine these tasks into projects using a graphical flow chart. A new server module will let users schedule projects for automated execution and provide a central repository to share project components.

Within the matching process itself, DataLever focuses mainly on sophisticated comparisons. Selection of names is simple: the user specifies a sort sequence and the number of adjacent records to test.
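This style of selection is often called a sorted-neighborhood or sliding-window approach. A minimal Python sketch, assuming records are plain strings and using a hypothetical `window_pairs` function rather than anything from DataLever itself:

```python
from typing import Callable, Iterator


def window_pairs(records: list[str],
                 sort_key: Callable[[str], str],
                 window: int) -> Iterator[tuple[str, str]]:
    """Yield each record paired with the next `window` records in sort order."""
    ordered = sorted(records, key=sort_key)
    for i, left in enumerate(ordered):
        # Only the records immediately following `left` are compared with it.
        for right in ordered[i + 1 : i + 1 + window]:
            yield left, right
```

The trade-off is visible in the parameters: a larger window tests more pairs and catches more matches, while a poorly chosen sort sequence can place true duplicates too far apart to be compared at all.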

In contrast, the comparison process involves detailed evaluation of individual fields, which in turn requires accurately standardized and parsed data. DataLever includes sophisticated tools for the standardization, parsing and comparison functions.

Let’s start at the beginning. Matching in DataLever is treated like any other project, by building a process flow using standard system tools. Some of these tools, including the standardizer, parser and matcher, are themselves constructed with standard DataLever functions – meaning that users can examine and modify them if desired.

The standardizer handles both name and address data, including postal standardization for the United States and Canada. The parser converts text to a sequence of word types, then uses a pattern table to identify specific data elements.

For example, it might read “J and M James” as the word-type sequence “single letter, conjunction, single letter, name.” It then would find this sequence in the pattern table, which might interpret it as “first initial, conjunction, spousal first initial, family name.”
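The two-stage lookup can be sketched in a few lines of Python. The word-type rules, the `PATTERNS` table and the `parse` function below are simplified assumptions built around the example above, not DataLever's actual tables.

```python
CONJUNCTIONS = {"AND", "&", "OR"}


def word_type(token: str) -> str:
    """Classify one token into a coarse word type."""
    if token in CONJUNCTIONS:
        return "conjunction"
    if len(token) == 1 and token.isalpha():
        return "single letter"
    return "name"


# Hypothetical pattern table: word-type sequence -> data element labels.
PATTERNS = {
    ("single letter", "conjunction", "single letter", "name"):
        ("first initial", "conjunction", "spousal first initial", "family name"),
    ("name", "name"): ("first name", "family name"),
}


def parse(text: str):
    """Return (label, token) pairs, or None if no pattern matches."""
    tokens = text.upper().split()
    types = tuple(word_type(t) for t in tokens)
    labels = PATTERNS.get(types)
    if labels is None:
        return None  # unknown sequence: a candidate for a new pattern
    return list(zip(labels, tokens))
```

Here `parse("J and M James")` labels J as the first initial, M as the spousal first initial and JAMES as the family name, while an input whose word-type sequence is missing from the table returns `None` so it can be reviewed.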

The system comes with standard pattern tables and with tables of name, company and address words. It can identify records that do not match an existing pattern, so users can create a new pattern if appropriate.

Once the data is standardized and parsed, it is sorted and fed to the matching process. This process relies heavily on comparisons between individual data elements, which is why accurate parsing is so important.

Users specify the elements to include, and for each element specify one of four comparison methods, a threshold score to qualify as a match, a weight assigned to the element score and an error penalty if the threshold is not met.

Element scores are combined into a total score, which determines whether the record pair is considered a match. Users can specify multiple match rules and different sort sequences. DataLever provides templates for consumer, business and business-contact matches.
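The article does not spell out how DataLever combines element scores, so the Python sketch below makes an assumption: an element that meets its threshold adds its weighted score to the total, and one that falls short subtracts its penalty. The `ElementRule` class, the two comparison functions and the sample weights are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable


def exact(a: str, b: str) -> float:
    """Score 1.0 for identical values, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def similar(a: str, b: str) -> float:
    """Score between 0.0 and 1.0 based on shared character runs."""
    return SequenceMatcher(None, a, b).ratio()


@dataclass
class ElementRule:
    field: str                             # parsed element to compare
    compare: Callable[[str, str], float]   # comparison method
    threshold: float                       # minimum score to count as a match
    weight: float                          # multiplier for a qualifying score
    penalty: float                         # subtracted when below threshold


def total_score(rec_a: dict, rec_b: dict, rules: list[ElementRule]) -> float:
    """Combine element scores into one record-pair score (assumed formula)."""
    total = 0.0
    for rule in rules:
        score = rule.compare(rec_a[rule.field], rec_b[rule.field])
        if score >= rule.threshold:
            total += rule.weight * score
        else:
            total -= rule.penalty
    return total
```

A match rule would then compare the total against a cutoff. Keeping thresholds, weights and penalties per element lets a strong disagreement on one field, such as family name, outweigh close agreement on a less reliable one.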

When matching is complete, DataLever again can use its standard capabilities to combine overlapping match sets, consolidate data from matching records, output all pairs or only survivors, list marginal matches for manual review, generate reports and perform other types of processing. In addition to running as independent processes, DataLever functions can be embedded in other software.
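Combining overlapping match sets means that if record 1 matches record 2 and record 2 matches record 3, all three end up in one group. One common way to do this is a union-find structure, sketched below in Python. This is a generic technique with a hypothetical function name, not a description of DataLever's internals.

```python
def merge_match_sets(pairs):
    """Group record IDs so that any chain of matched pairs shares a group."""
    parent = {}

    def find(x):
        # Follow parent links to the group's root, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # join the two groups

    groups = {}
    for record in list(parent):
        groups.setdefault(find(record), set()).add(record)
    return sorted(sorted(group) for group in groups.values())
```

Given the matched pairs (1, 2), (2, 3) and (4, 5), the function returns two groups: records 1, 2 and 3 together, and records 4 and 5 together.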
