I wrote an article for the Oct. 1, 2001, DM News titled “How Big Should My Test Be?” It contained a formula to help answer that question. Based on the number of e-mails I received, the article struck a chord in the direct marketing community. It is also clear that direct marketers have many questions about testing. I’ll answer four of the most common here.
1. Take a confidence level of 90 percent. Does that mean I can be 90 percent confident that the test panel response rate is what I will see on the rollout?
Unfortunately, no. Instead, it means there is a 90 percent chance that the result relates in some specified way to a reference point. Consider a scenario in which we want to compare a test and control treatment. Here, there is a 90 percent chance that the test treatment will be superior to the control if it is promoted to the universe, and a 10 percent chance that it will be inferior to the control.
When statisticians speak of confidence, they often are referencing two related but distinct concepts. The first is confidence level, and the second is confidence interval. This works as follows:
Assume we test an outside rental list and observe a 1.0 percent response. Based on the test and rollout quantities, and using the appropriate statistical formula, we determine there is a 90 percent chance that the rollout will perform between 0.9 percent and 1.1 percent. The “90 percent” is the confidence level, and the “between 0.9 percent and 1.1 percent” is the confidence interval.
Without the confidence interval as a reference point, the confidence level is meaningless. We can always be 90 percent confident of something having to do with a test result. The question is what that “something” is. For example, with a sufficiently low test quantity, we can be 90 percent confident that the rollout will perform between 0.2 percent and 1.8 percent. For a direct marketer, such a result is of little use.
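The link between test quantity and interval width can be sketched in a few lines of Python. This is only an illustration, not the formula from the Oct. 1 article. It assumes a normal approximation with a finite population correction and a two-sided 90 percent confidence level (z of about 1.645). The test and rollout quantities are hypothetical, chosen for illustration.

```python
import math

def interval_half_width(rate, test_qty, rollout_qty, z=1.645):
    """Approximate half-width of the confidence interval around a test response rate.

    Assumptions (not from the article): normal approximation with a finite
    population correction; z=1.645 is the two-sided 90 percent value.
    """
    variance = rate * (1 - rate) * (1 / test_qty - 1 / rollout_qty)
    return z * math.sqrt(variance)

# Hypothetical quantities: 1.0 percent observed response, rollout universe of 1,000,000.
wide = interval_half_width(0.01, 400, 1_000_000)
narrow = interval_half_width(0.01, 27_000, 1_000_000)
print(f"test of 400:    1.0% +/- {wide:.2%}")
print(f"test of 27,000: 1.0% +/- {narrow:.2%}")
```

Under these assumptions, the test of 27,000 gives an interval of roughly 0.9 percent to 1.1 percent, while the test of 400 gives roughly 0.2 percent to 1.8 percent. The confidence level is the same in both cases; only the interval changes.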
2. When running tests of significance between a mailed group and control, what level of confidence do you recommend? Many in the industry say 90 percent, though I also have heard 95 percent. Or do you recommend something different?
In many ways, 90 percent is our industry’s gold standard. Generally, however, I use lower confidence levels, but in a carefully controlled way. Often, I find anything over 75 percent is enough for further investigation.
The problem with 90 percent is that it makes it very difficult to beat the control. By definition, it is saying that there is a 90 percent chance the test treatment will be superior to the control if it is promoted to the universe, and a 10 percent chance that it will not be. This is a ratio of 9-to-1.
Consider a test treatment for which there is an 80 percent chance of being superior to the control if the universe is promoted. This is a ratio of 4-to-1, which in my book is darn good. However, by using the 90 percent threshold, such a treatment will be rejected. The same is true for 75 percent, which translates into a ratio of 3-to-1; that is, a 75 percent chance of being better than the control and a 25 percent chance of being inferior.
The beauty of direct marketing is that we can be prudent and incremental in our decision-making. Generally, additional testing is appropriate, especially when confidence levels are relatively low.
It can even be argued that a treatment with a 66 percent confidence level is worth a retest in certain circumstances. At a ratio of 2-to-1, it is twice as likely as not to be superior to the control. Direct marketers often use a new control piece across many promotions, comprising millions of total contacts. With these sorts of volumes, a successful test treatment can translate into substantial incremental revenue and profit.
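The ratios quoted above follow directly from the confidence level: the chance of beating the control divided by the chance of not beating it. A minimal Python sketch (the function name is illustrative):

```python
def odds_in_favor(confidence_level):
    """Convert a confidence level (e.g. 0.90) into odds of beating the control."""
    return confidence_level / (1 - confidence_level)

for level in (0.90, 0.80, 0.75, 0.66):
    print(f"{level:.0%} confidence -> about {odds_in_favor(level):.1f}-to-1")
```

Note that 66 percent works out to slightly under 2-to-1. The ratio is exactly 2-to-1 at two-thirds, or about 66.7 percent.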
3. The formula you provided in the Oct. 1, 2001, article for determining how big a test should be includes the Rollout Universe Quantity. A formula that appears in several direct marketing books and articles does not include the Rollout Universe Quantity. Therefore, it provides results that differ from your formula. Would you explain the differences between the two formulas?
The formula most often used by direct marketers to determine how big a test should be does not contain what is called a “Finite Population Correction Factor.” Implicit is the assumption that all rollout universes are infinite in quantity. If that were true, direct marketers would have to identify only one such successful universe to generate infinite revenue and profit. In the real world, all rollout universes are finite. For niche direct marketers, most rollout universes are small, often in the 5,000 to 200,000 range.
A “reduction to absurdity” will provide intuitive clarity as to why the Rollout Universe Quantity can dramatically affect the appropriate test panel quantity. Assume a test quantity of 10,000 and a corresponding rollout universe of 10,100. Common sense suggests, and statistical theory supports, that we can be much more confident of the test results than if the universe size were, say, 10 million. This is because with a test quantity of 10,000, essentially the entire rollout universe of 10,100 has been sampled.
The Finite Population Correction Factor plays a larger role as the Rollout Universe Quantity gets smaller. For example, assume an expected (“observed”) test panel response rate of 1.0 percent, and that we want to be 80 percent confident the rollout (“actual”) response rate will be at least 0.9 percent:
· With an infinite rollout universe quantity, the required test quantity is 6,985.
· With a rollout of 200,000, the test quantity is 6,750.
· With a rollout of 20,000, the quantity is 5,177.
· With a rollout of 5,000, the quantity is 2,914.
4. All of the testing formulas that I see involve response rates. But dollars are what is most important to my business. Do any formulas deal with dollars? Can the response rate formulas be translated into dollars?
This example illustrates the problem:
Consider two test panels, one with a quantity of 6,010 and the other with 5,270. The first panel has a response rate of 1.29 percent and the second 1.03 percent. Using a statistical formula that we will not discuss in this article, we can be 90 percent confident that, upon rollout, the first response rate will be greater than the second.
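For readers who want to check a comparison like this one, here is a hedged Python sketch. It is not necessarily the formula referred to above. It assumes a standard pooled two-proportion z-test, treats the confidence level as the one-sided normal probability for the observed difference, and ignores the rollout quantity (no finite population correction). The function name is illustrative.

```python
import math

def confidence_first_beats_second(rate1, qty1, rate2, qty2):
    """One-sided confidence that panel 1's rate exceeds panel 2's.

    Assumption: pooled two-proportion z-test under a normal approximation.
    """
    pooled = (rate1 * qty1 + rate2 * qty2) / (qty1 + qty2)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / qty1 + 1 / qty2))
    z = (rate1 - rate2) / std_error
    # Standard normal cumulative probability, via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

confidence = confidence_first_beats_second(0.0129, 6010, 0.0103, 5270)
print(f"confidence that panel 1 beats panel 2: {confidence:.0%}")
```

Under these assumptions the result is about 90 percent, consistent with the figure quoted above.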
In one scenario, the two panels have average order sizes of $110 and $90, respectively, which translate to revenue of $1.42 and $0.93 per piece mailed. Here, average order size has accentuated the difference in response rate. But in another scenario, the respective average order sizes are $80 and $99, which translate to $1.03 and $1.02 per piece mailed. Here, average order size has essentially neutralized the difference in response rate.
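The arithmetic behind the two scenarios is simply response rate multiplied by average order size. A minimal Python sketch, using the figures from the example (the function and variable names are illustrative):

```python
def dollars_per_piece(response_rate, average_order):
    """Dollars per piece mailed: response rate times average order size."""
    return response_rate * average_order

# (response rate, average order size) for the first and second panel.
scenarios = {
    "scenario 1": [(0.0129, 110), (0.0103, 90)],
    "scenario 2": [(0.0129, 80), (0.0103, 99)],
}

for name, panels in scenarios.items():
    values = [round(dollars_per_piece(rate, order), 2) for rate, order in panels]
    print(f"{name}: ${values[0]:.2f} vs. ${values[1]:.2f} per piece mailed")
```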
Unfortunately, dollar-driven formulas are much more involved to calculate than response-driven formulas. Working with them requires data manipulation and statistical capabilities. It is also problematic to derive dollar-driven formulas from response-driven formulas. However, an important consolation is that response-driven formulas are much better than nothing.