A Rose by Many Other Names
Anthony Rose, Tony Rose, Tony Rolls, Abdul Rose, Au-Yeung Mei Ro Rose: any of these could be the actual name behind a database entry stored as "A. Rose." Because every database contains faulty data, many other referents are also likely, and most of them may never be found by a search for "A. Rose."
This is a big problem, and not only for security-related applications that guard against terrorism. It also affects direct marketing operations that depend on vetting potential customers, retaining the best customers and accounting properly for financial transactions. Moreover, direct marketers want to use their good customers' names properly, so as not to appear insensitive to the differences among names from different cultures.
Why do we have these intractable problems with personal names? The chief reason is that there is no dictionary or authority to check: names must be accepted as presented. Yet entering personal name data properly requires an extraordinary amount of knowledge about names.
It is at the critical moment of data capture that we have the last chance to enter such names properly, yet it is also at this vital juncture that we paradoxically opt to minimize our attention.
The first attempt to mitigate the disparity across name renderings was patented in 1918. Called Soundex, it was originally designed to help with the analysis of the 1890 census data. Despite many studies illustrating the failings of this simple key-based approach to retrieving similar-looking names, it remains one of the most common techniques in use today. (Soundex comes standard with most database products.)
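To make the key-based idea concrete, here is a minimal sketch of one common formulation of American Soundex. It is an illustration only: the function name is hypothetical, non-ASCII letters are simply dropped, and the variant built into any particular database product may differ in its details.

```python
# Letter groups that share a Soundex digit. Vowels and Y get no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a four-character Soundex key, or "" if there are no letters."""
    # Simplifying assumption: keep only the letters A-Z.
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (key + "000")[:4]  # pad with zeros, then cut to four characters


for name in ("Anderson", "Andersen", "Amaturk"):
    print(name, soundex(name))
```

All three names reduce to the same key, A536. That one key captures both the appeal of the approach (Anderson finds Andersen) and its weakness (an unrelated name such as Amaturk comes back too).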
The continuing interest in Soundex and its many derivative forms of key-based technology results from two brutal facts: Names are complex, and computer programmers are under great pressure to do something to address the following issues:
· Spelling variations. Soundex focused on trying to neutralize certain spelling variations across separate name elements. For example, by using Soundex keys, Anderson and Andersen suddenly matched. But many other names that are clearly unrelated also were retrieved (Amaturk has the same Soundex code as Anderson).
· White space. The blank space is a huge problem for computer systems: Degomez and De Gomez do not match in most computer search systems. Yet Abd El Rahman and Abdurrahman are renderings of the same name and should be treated as exact matches, as should Guanlu and Guan Lu. Knowing when and how to make these matches is hard without recognizing the cultural origin of a name.
· Syntax and name models. Even if neutralizing spelling variations within data fields can be effectively accomplished in your database and the applications that use it, an even more pernicious problem looms. Our standard model for names (first, middle, last) is not universal, and causes us tremendous problems with data entry, retrieval and data sharing. The most effective and only truly universal solution is to enter names into databases that have only two fields for names: given name and surname.
· Cultural issues with personal name data. How a culture changes names according to social customs - such as marriage and religious ceremonies - is another complicating factor that attends the entry of personal name data. These complex issues never occur in isolation. They compound each other, especially when names are transported across systems with different definitions. Yet, even within isolated systems, rampant problems exist that often have stayed hidden for decades.
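The white-space problem described above can be illustrated with a small sketch. It compares names on a key that ignores blanks and letter case. The function names are hypothetical, and the approach is deliberately naive: it knows nothing about the cultural origin of a name.

```python
def space_insensitive_key(name: str) -> str:
    """Remove all white space and fold case, so "De Gomez" becomes "degomez"."""
    return "".join(name.split()).casefold()


def same_ignoring_spaces(a: str, b: str) -> bool:
    """True when two names differ only in white space or letter case."""
    return space_insensitive_key(a) == space_insensitive_key(b)


pairs = [
    ("Degomez", "De Gomez"),
    ("Guanlu", "Guan Lu"),
    ("Abd El Rahman", "Abdurrahman"),
]
for left, right in pairs:
    print(left, "|", right, "|", same_ignoring_spaces(left, right))
```

The first two pairs match, but Abd El Rahman and Abdurrahman do not, because the difference between them is more than a blank. Closing that gap takes knowledge of how names from a given culture are written, which a simple key cannot supply.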
A new class of technology, called name-recognition software, is emerging. It is knowledge-based and encapsulates the way names work around the world.
For example, there are tools to identify the cultural classification and gender of personal names. Other tools take the form of character-oriented, name-searching engines that provide ranked search results based on linguistic and cultural variation patterns. There also are tools to rank search results based on similarities of pronunciations, not just similarities of spellings.
Name-hygiene tools enhance the consistency of name-data retrieval by applying culture-specific rules to names and identifying basic structural elements in a name. Variation tools generate a set of possible alternative Romanized spellings for names. Equivalence tools generate a table of how names appear across multiple cultures. There also are tools to help database users unlock the meanings of names and their spelling variations when transcribed to the Roman alphabet.
All these tools help direct marketers better understand their prospects and customers as well as better meet the needs of females, males or specific cultural groups within their customer base.
Though technology can help a lot, practicing data ecology is a first step for everyone. In its simplest form, data ecology is nothing more than an awareness of the environment for data and a commitment to preserve their original value for future generations. Because accurate data are the lifeblood of every direct marketer, here are three fundamental tips for practicing data ecology:
· Data stewardship requires that every direct marketer who handles personal name data be aware of the fragility of this precious information. That awareness matters most at data entry, which is often the last chance to validate a name with its actual owner. Every subsequent stage of processing and access is "downstream" of this vital operation.
· Every effort must be made to understand, calibrate and validate the exchange or distribution of personal name data. It is incumbent on those who deliver, as well as those who receive, such data to ensure the integrity of the process.
· Direct marketers should never change original data. Instead, administrators should provide a method for adding "See Also" entries - associated records for personal name data that appear to be faulty. Doing so will keep the data ecosystem in balance.
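The "See Also" idea in the last tip can be sketched as a small data structure. This is only one possible shape, under stated assumptions: the class and field names (NameRecord, NameStore, name_as_entered) are hypothetical, and a real system would keep these records in its database rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NameRecord:
    """A name exactly as captured; frozen so it cannot be edited later."""
    record_id: int
    name_as_entered: str


@dataclass
class NameStore:
    records: dict = field(default_factory=dict)   # record_id -> NameRecord
    see_also: dict = field(default_factory=dict)  # record_id -> alternative names

    def add(self, record_id: int, name: str) -> NameRecord:
        record = NameRecord(record_id, name)
        self.records[record_id] = record
        return record

    def add_see_also(self, record_id: int, alternative: str) -> None:
        """Attach an alternative rendering without touching the original."""
        if record_id not in self.records:
            raise KeyError(record_id)
        self.see_also.setdefault(record_id, []).append(alternative)

    def search(self, query: str) -> list:
        """Find records whose original or "See Also" name equals the query."""
        wanted = query.casefold()
        hits = []
        for record_id, record in self.records.items():
            names = [record.name_as_entered, *self.see_also.get(record_id, [])]
            if any(wanted == n.casefold() for n in names):
                hits.append(record)
        return hits


store = NameStore()
store.add(1, "A. Rose")
store.add_see_also(1, "Anthony Rose")
print(store.search("anthony rose"))
```

A search for "Anthony Rose" now finds the record, while the entry itself still reads "A. Rose" as it was captured. If the alternative later proves wrong, it can be withdrawn without any loss of the original data.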
When each direct marketer in an organization does his or her part, the data environment can improve dramatically. The byproducts of such an environment are marked improvements in customer interactions and security, as well as stronger long-term customer relationships.