Selecting the Right Predictive Model
Prospect name predictive models come in many flavors. "Stimulus/response" models, for example, are built directly off promotional files and corresponding responder information. "Look-alike" or "clone" models are not based on promotional information. Instead, they evaluate how similar in appearance each prospect is to a direct marketer's customers, in an indirect attempt to capture the dynamics of response behavior.
All else being equal, stimulus/response models are superior to look-alike models. So if robust promotional and responder information is available, it should be used. But look-alike models have their place, especially in start-up situations such as when a new channel is being incorporated into the sales process.
Differences in granularity. Stimulus/response and look-alike models both vary in the level of demographic overlay data used to create their independent, or "predictor," variables.
Generally, the more granular the data, the more predictive it is. Therefore, all else being equal, the most granular available data should be used to build a model. But nothing is ever equal in prospect modeling.
The complication of missing data. Gaps in coverage are an important issue with individual/household data. Except for age, length of residence and estimated income, coverage rarely exceeds about 75 percent. For self-reported income, coverage generally is about 35 percent. With estimated income, the high coverage is possible only because this element is, in its own right, the result of a predictive model. Information such as car ownership and neighborhood census characteristics is used to create these estimates.
Incremental power vs. incremental cost. Generally, the more granular the data, the higher its price. A financial evaluation should be performed for each level of granularity to determine whether its incremental predictive power more than offsets its incremental cost. The evaluation compares multiple versions of the model. Assume, for example, that the available levels of overlay data are block group and individual/household:
· First, a model would be built off block group-level demographic variables; that is, inexpensive data with relatively modest incremental power.
· Then, a second model would be constructed off the above plus individual/household-level demographics; that is, relatively expensive overlay data would be added, with potentially significant incremental power.
This comparison determines whether any individual/household data elements are cost-effective. If it makes sense to incorporate such elements in the final model, their number should be kept to a minimum. This produces the most economical possible model.
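The comparison can be sketched in a few lines of Python. This is only an illustration of the logic, not a prescribed method: the function name, the quantities, the response rates, the margin and the data prices are all hypothetical, and the sketch assumes that overlay data is paid for on every name scored while mailing costs apply only to the names mailed.

```python
def campaign_profit(names_scored, names_mailed, response_rate,
                    margin_per_order, mail_cost_per_m, overlay_cost_per_m):
    """Profit in dollars for one mailing (hypothetical cost structure)."""
    revenue = names_mailed * response_rate * margin_per_order
    mail_cost = names_mailed / 1000 * mail_cost_per_m
    # Assumption: overlay data is purchased for every name that is scored.
    overlay_cost = names_scored / 1000 * overlay_cost_per_m
    return revenue - mail_cost - overlay_cost


# Illustrative inputs only; none of these figures come from the article.
common = dict(names_scored=500_000, names_mailed=100_000,
              margin_per_order=40.0, mail_cost_per_m=500.0)

# Model 1: block group-level data (cheaper, more modest lift).
block_group = dict(common, response_rate=0.020, overlay_cost_per_m=5.0)
# Model 2: block group plus individual/household data (pricier, more lift).
household = dict(common, response_rate=0.023, overlay_cost_per_m=25.0)

base_profit = campaign_profit(**block_group)
incremental_profit = campaign_profit(**household) - base_profit

# The more granular data earns its place only if this is positive.
household_pays_off = incremental_profit > 0
print(f"Block group profit: ${base_profit:,.0f}")
print(f"Incremental profit from household data: ${incremental_profit:,.0f}")
```

With these made-up inputs, the household-level data adds profit after its extra cost. A smaller lift in response rate or a higher data price would reverse the answer, which is why the evaluation has to be run for each level of granularity.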
The hidden cost of volatility. When performing an incremental financial evaluation, one must be mindful of the potential volatility of highly granular sources such as individual/household and ZIP+4-level data. Volatility often results in premature model degradation. This difficult-to-quantify hidden cost can be counteracted only by more frequent model builds. Individual/household data is a particular challenge.
Volatility occurs because of changes in the underlying data sources. Sometimes, a data compiler replaces one or more original sources, in whole or in part. Other times, a source is pulled off the market because of legislation, privacy concerns or other reasons. This has been a more frequent occurrence in the past few years. Periodically, new sources come on the market.
From a long-term perspective, it is to everyone's advantage when coverage increases. In the short term, however, increased coverage jeopardizes the effectiveness of established models. This is because models work best when the distribution of the values associated with the predictor variables remains constant over time.
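One way to watch for this kind of shift is to compare the distribution of a predictor variable at model-build time with its distribution in a current file. The sketch below uses the population stability index, a common drift measure that is swapped in here for illustration and is not named in the discussion above. The function name and the coverage counts are hypothetical.

```python
import math


def population_stability_index(baseline_counts, current_counts, floor=1e-6):
    """Compare two distributions given as counts over the same categories.

    Returns 0.0 when the proportions are identical and grows as they diverge.
    """
    base_total = sum(baseline_counts)
    cur_total = sum(current_counts)
    psi = 0.0
    for base, cur in zip(baseline_counts, current_counts):
        # Floor the proportions so an empty category cannot cause log(0).
        p = max(base / base_total, floor)
        q = max(cur / cur_total, floor)
        psi += (q - p) * math.log(q / p)
    return psi


# Hypothetical example: a data element was populated for 75 percent of
# names when the model was built, and for 90 percent after new sources
# came on the market. Categories are [populated, missing].
at_build = [75_000, 25_000]
today = [90_000, 10_000]

drift = population_stability_index(at_build, today)
print(f"Stability index: {drift:.3f}")
```

A value near zero suggests the variable still looks the way it did when the model was built. A rising value is a signal to consider a rebuild, though the threshold for action is a judgment call.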
Net/net rental arrangements. In the presence of net/net rental arrangements, direct marketers pay only for the names they mail. Under such circumstances, the gross names received from a list manager can be run through a statistics-based predictive model. Those names with relatively low scores can be returned without being mailed. The direct marketer's only financial obligation is a run charge of about $6 per thousand.
In the absence of net/net arrangements, the list owner will insist on compensation for a significant portion of the names that are processed, regardless of whether they are mailed. Traditionally, the industry standard has been that at least 85 percent of the rented names must be paid for, though more favorable percentages are common.
Without the advantage of net/net rental arrangements, it is very difficult for a model to overcome the cost of paying for discarded names. Consider a list arrangement where 50,000 names are obtained at $100/M on an 85 percent net basis. Assume these names are then screened with a model that results in only 10,000 being mailed. The list bill is 50 thousand times 85 percent times $100, or $4,250. Spread over the 10 thousand names actually mailed, that is an effective cost per thousand of $425. And that does not include overlay data and processing costs.
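The same arithmetic can be written as a small Python function. It is a sketch under stated assumptions: the function and parameter names are hypothetical, the marketer is assumed to pay for whichever is larger, the names mailed or the guaranteed net percentage, and run charges, overlay data and processing costs are left out.

```python
def effective_cost_per_thousand(names_rented, price_per_m,
                                net_percent, names_mailed):
    """List cost per thousand names actually mailed, in dollars."""
    # Pay for the names mailed or the guaranteed minimum, whichever is larger.
    names_paid_for = max(names_mailed, names_rented * net_percent / 100)
    total_list_cost = names_paid_for / 1000 * price_per_m
    return total_list_cost / (names_mailed / 1000)


# The example from the text: 50,000 names at $100/M, 85 percent net,
# 10,000 mailed after model screening.
print(effective_cost_per_thousand(50_000, 100.0, 85, 10_000))  # 425.0

# Under a net/net arrangement there is no guaranteed minimum.
print(effective_cost_per_thousand(50_000, 100.0, 0, 10_000))   # 100.0
```

The gap between the two results, $425 versus $100 per thousand, is the hurdle that a model must clear when net/net terms are unavailable.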
The role of ZIP code models. For all but the largest mailers, it is difficult to obtain net/net arrangements. That explains the widespread use of ZIP code-level prospect models, whose output is simply a list of ZIP codes to be used for selection or omission. It is easy for list managers to process such a list, and for a very nominal "run charge."
The downside of ZIP code models is that they generally display very modest predictive power. The sheer math explains why. Census data is available for 30,000 ZIP codes, spread across 100 million mailable U.S. households. Therefore, the average ZIP code consists of about 3,300 households. Imagine how difficult it is to predict behavior based on the makeup of the 3,300 nearest residences!
Another problem with ZIP code models is that they often perform erratically across lists. It is common for ZIP models to be effective overall, but fall apart within some portion of individual lists or list types. The reason is what is known as "self-selection bias."
Consider a cataloger that sells business attire to upscale professional women. A ZIP model indicated - not surprisingly - that working-class ZIP codes are not attractive targets. Yet for a sizable minority of lists, the response rate within such ZIP codes was significantly higher than what the model predicted. An analysis revealed that the women on these lists who live in working-class ZIP codes are almost always upscale. The following example provides insight as to why:
One of the lists in question was Neiman Marcus catalog buyers, a list that includes very few working-class women. Therefore, any given Neiman Marcus customer is likely to be upscale, even if the average household within her ZIP code is not. Such mismatches are common in urban and semi-urban areas, especially those labeled "transition neighborhoods."