Handling Missing Data Problems

Despite the proliferation of data mining algorithms, analysts still face a dilemma when dealing with missing data in the customer database. This critical issue, if not managed correctly, can undermine the best modeling efforts and undercut a well-conceived strategic program.

For example, consider a situation in which five of eight account records contain the average age of the account holders, but three do not: Account No. 1 has an average age of 34; No. 2 is 41; No. 3 is missing; No. 4 is 40; No. 5 is 37; No. 6 is missing; No. 7 is 43; and No. 8 is missing.

Disregarding the missing data, you calculate an average of 39 for the five complete records. With this simple profile of the database, a marketer might develop creative and an offer to fit this age group. Yet if the real values of the missing age data are 69, 68 and 68 in the three incomplete records, the average across all eight records is 50. Would you still use the same creative? Would you make the same offer to your audience?
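
The arithmetic in this example can be checked with a short sketch. It uses only the Python standard library; the variable names are illustrative, and the three "true" ages are the hypothetical values assumed in the paragraph above, not real data.

```python
from statistics import mean

# Average ages from the example; None marks a missing value.
observed = {1: 34, 2: 41, 3: None, 4: 40, 5: 37, 6: None, 7: 43, 8: None}

# Hypothetical real values for the three incomplete records (an assumption
# taken from the example, used only to show how far the average can move).
assumed_true = {3: 69, 6: 68, 8: 68}

# Average over the five complete records only.
complete_case_mean = mean(v for v in observed.values() if v is not None)

# Average over all eight records once the assumed values are filled in.
full = {**observed, **assumed_true}
full_mean = mean(full.values())

print(complete_case_mean, full_mean)
```

The gap between the two averages is the cost of silently dropping the incomplete records.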

In most scenarios, the analyst confronts a situation in which some data is missing for some individuals, and a different set of data is unavailable for yet another group of records. Several approaches have been suggested for handling missing data. None is perfect. These procedures include:

• Convert all missing data to a zero value. Simply place a zero wherever you encounter a missing data item. Avoid this approach. It is simple to do, but most analysts would agree that it creates a significant analytic problem: How does a professional distinguish between a real value of zero and a substituted value?

• Compute the average of the data field and insert the average wherever the data is missing. This is probably the most often used solution. If, however, many of your records have a missing value for the same data field, this approach forces all of the missing data items to have the same value. That understates the variation in the field and could complicate profiling. If only a few records have missing values, this is a viable technique.

• Compute averages by some logical segment. This is similar to the previous method, but the averages are computed within segments rather than across the whole file. You may use gender as a segment, for example. One average is calculated for males and another for females. These averages are then substituted for missing data depending on the gender of the record. The assumption here is that an average closer to the real missing data value can be calculated when splitting averages by segments.

• Calculate the median and insert the median value for the missing data items. A frequently used approach, this practice moderates the problem that often occurs when averages are used on skewed populations. For example, the average of the following numbers – 18, 27, 29, 36 and 1,500 – is 322. The median is 29. If these were monthly insurance premiums, you could make an argument that the average is not consistent with most of the values. The median, however, appears more appropriate.

• Delete the records that have missing data items. If only a few of the records are deleted, this works fine. However, if the deleted records represent a distinct profile or mind-set, the marketer may miss the chance to leverage this unique population. For example, it is typically more difficult to append wealth indicators to higher-income individuals, so the data arrives missing. Many marketers do not want to delete this segment from consideration.

• Ignore the data elements that appear to be missing for many individuals. If most of your records are missing a particular data item, and you cannot obtain it easily or readily, the best thing to do is exclude this data element from further analysis.

• Develop a separate model for the missing data items, and use its predictions to fill in the missing values. I have seen this work well. Be careful, however. The predictions from this model should be sufficiently robust to provide meaningful results.

• Develop separate models for complete and incomplete records. This is a viable, but infrequently used method. Records with nonmissing data are subjected to one model, and the other records are input into another model. Perhaps the incremental resources needed to develop a second model explain why this approach is seldom used. It can, however, be an excellent solution.

• Develop a flag indicating when a piece of data is missing. Missing data itself may be meaningful. You can create an indicator signifying the presence of missing data for a particular record. The indicator is turned off when the data is present. This flag becomes a potential predictor. This is particularly useful when dealing with wealth-related marketing segmentation. It is often more difficult to secure data for wealthier individuals.

• Use software that handles missing data. Software exists that will fill in, behind the scenes, the blanks for missing data. These programs are available either as part of a data-mining suite, or as a stand-alone program. Make sure that you know what is going on and how the missing data is being managed.
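
As a rough illustration, several of the substitution approaches above (overall average, average by segment, median and a missing-data flag) can be sketched in a few lines of standard-library Python. The `impute` helper, its parameters and the sample records are hypothetical and exist only for this example; the five known premiums are the ones from the median bullet, while the gender values and the two incomplete records are made up.

```python
from statistics import mean, median

# Five known monthly premiums from the median example, plus two hypothetical
# records with a missing premium (None). Gender values are invented.
records = [
    {"gender": "F", "premium": 18},
    {"gender": "M", "premium": 27},
    {"gender": "F", "premium": 29},
    {"gender": "M", "premium": 36},
    {"gender": "M", "premium": 1500},
    {"gender": "F", "premium": None},
    {"gender": "M", "premium": None},
]

def impute(records, field, strategy="mean", segment=None):
    """Return copies of the records with missing `field` values filled in.

    strategy: "mean" or "median". segment: optional field name; when given,
    the fill value is computed separately within each segment. A
    `<field>_missing` flag (1 or 0) records which values were substituted.
    Assumes every segment has at least one known value.
    """
    summarize = mean if strategy == "mean" else median

    # Collect the known values, grouped by segment (or one group if none).
    groups = {}
    for r in records:
        if r[field] is not None:
            key = r[segment] if segment else None
            groups.setdefault(key, []).append(r[field])
    fills = {key: summarize(vals) for key, vals in groups.items()}

    filled = []
    for r in records:
        missing = r[field] is None
        key = r[segment] if segment else None
        value = fills[key] if missing else r[field]
        filled.append({**r, field: value, field + "_missing": int(missing)})
    return filled

by_mean = impute(records, "premium", "mean")
by_median = impute(records, "premium", "median")
by_segment = impute(records, "premium", "mean", segment="gender")

print(by_mean[5]["premium"], by_median[5]["premium"])
print(by_segment[5]["premium"], by_segment[6]["premium"])
```

Keeping the flag alongside the filled value means a later model can still use the fact that the value was missing, whichever fill strategy is chosen.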

Successful data mining requires the completion of many tasks. Some problems cannot be solved easily, if at all. Missing data, however, is not one of these. There are viable approaches to handling missing data, depending on your firm’s resources, needs and circumstances. Ignoring the problem and expecting your results to be unaffected is a mistake.

• Sam Koslowsky is vice president of strategic analytics at Harte-Hanks Inc., San Antonio.
