Challenge: Scoring Big Databases
One of the challenges posed by these very large systems is applying model scores. Building a statistical model takes about the same effort on systems of any size, since the work is usually done on a sample of 100,000 records or fewer. But with a small database, it is practical to score the main file by extracting all the records, running them through the tool that built the model and posting the results. This might run at 1 million records per hour, which is fine when you have 500,000 records. But try it with 100 million records and you've added four days to the update cycle.
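The arithmetic behind that four-day figure is easy to check. A minimal sketch, using the 1-million-records-per-hour rate cited above (the function name is illustrative, not from any product):

```python
RATE = 1_000_000  # records scored per hour, the rate assumed in the text

def scoring_hours(records: int) -> float:
    """Hours needed to score a file at the assumed throughput."""
    return records / RATE

# A 500,000-record file finishes in half an hour; a 100-million-record
# file ties up the system for 100 hours, just over four days.
small = scoring_hours(500_000)
large = scoring_hours(100_000_000)
print(small, large, large / 24)
```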
The traditional solution has been to re-create the scoring equation in a mainframe language like COBOL or Assembler. Scoring is then done during the normal update processing. This method gives the necessary speed, but the programming is time-consuming and error-prone. While some modeling systems now generate blocks of C code automatically, there is usually still a need for manual intervention to adjust for differences in compilers and data access methods.
Decisionhouse Suite (Quadstone Ltd., 617/753-7393, www.quadstone.com) uses parallel processing and efficient software design to score large files without generating a separate update program. The system has scored as many as 100 million records in an hour, including time to extract data from a flat file, calculate scores and store the results. Of course, performance depends on the situation: That particular benchmark used an 8-processor Unix server and a file with just 10 data elements per record. According to the vendor, it might have taken twice as long if the data had resided in a relational database rather than a flat file. Still, even allowing for these qualifications, this is a very fast system. And it doesn't require any conversion between the original models and the actual production system.
Naturally, there is more to Decisionhouse than the ability to generate scores quickly. The system is intended to provide nonstatisticians with a full set of tools to explore data, build models and implement the results. It includes data extraction tools that let users link up to flat files or standard relational databases, specify the characteristics of records to select or exclude, define derived variables by transforming the input data and group inputs into data ranges. The extracted data is stored in a proprietary format that allows very fast access and is linked to the vendor's own tools for counting, profiling, cross tabs, 3-D graphing and mapping. Users can drill down on graphs or maps to view the underlying data or make selections.
The modeling is largely automatic. Decisionhouse offers three types of decision tree models, while another product in the suite, called ScoreHouse, offers five additional methods: discriminant analysis, logistic regression, probit regression, Gini and Kolmogorov-Smirnov analysis. Models can run with default parameters or with alternate parameters set by advanced users. Outputs can include model scores, propensity measures, odds and decision variables as well as customer-defined calculations such as an expected profit value. Although the system provides several measures for the quality of the models it produces, it doesn't automatically exclude a validation sample to compare against predictions.
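Since the system doesn't hold out a validation sample automatically, a user would split the data before modeling. A minimal sketch of such a holdout split, assuming plain Python lists; the function name and 50/50 default are illustrative, not part of Decisionhouse:

```python
import random

def holdout_split(records, validation_fraction=0.5, seed=42):
    """Shuffle a sample and split it into modeling and validation sets.

    Illustrative only: Decisionhouse users would perform this step
    themselves before building a model, then compare predictions
    against the held-out half.
    """
    rng = random.Random(seed)       # fixed seed so the split is repeatable
    shuffled = records[:]           # copy; leave the caller's list intact
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]

# e.g. split a 100,000-record sample into two 50,000-record halves
train, valid = holdout_split(list(range(100_000)))
```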
Decisionhouse runs on Unix servers and workstations, with a Windows NT version due in December. Prices depend on the file size and modules purchased, ranging from $50,000 for Decisionhouse on a file up to 1 million records, to $300,000 for Decisionhouse, ScoreHouse and ActionHouse (a batch scoring tool) on files of more than 10 million. The system was released in 1996 and currently has about 30 installations, including a mix of production and evaluation copies. Next spring, the vendor plans to introduce Transactionhouse, a product to load marketing databases with high-volume transaction data such as daily telephone call records.
Targeting Optimizer (MarketSwitch Corp., 703/471-4981, www.marketswitch.com) creates response models and identifies optimal promotion quantities. The system does scoring on a dedicated calculation server that accesses the underlying marketing database directly. Using a 300-MHz PC as the calculation server, accessing an Oracle database on a single-processor Unix database server, Targeting Optimizer has extracted, scored and posted results for 100 million records in eight hours.
Targeting Optimizer is designed to accelerate all parts of the model development cycle. It includes a graphical interface that simplifies complex extracts from a relational database and lets users define derived variables. Models can use neural net, radial basis or logistic regression, with or without proprietary extensions MarketSwitch has developed. The firm says its methods consistently beat the best conventional models by 30 percent to 75 percent. The system takes three to five minutes to build a model on 50,000 records and test it against a 50,000-record validation sample.
Users can enter promotion costs, order values and conversion rates by file segment. The system will combine these with model results to present gains charts and optimal cut-off points for different objectives such as sales, gross margin, net margin, acquisition cost and net present value. To help understand the model itself, Targeting Optimizer can rank the importance of different attributes and compare the attribute values for top-ranked vs. bottom-ranked records. An application server lets the user schedule an accepted model for execution against a production database.
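The cut-off logic can be illustrated with a small sketch: rank records by model score, accumulate expected profit per piece mailed, and stop where cumulative profit peaks. This is a generic illustration under assumed inputs, not Targeting Optimizer's actual algorithm:

```python
def best_mail_depth(scored, cost, margin):
    """Find the mailing depth that maximizes expected net profit.

    `scored` is a list of (score, expected_response_probability) pairs,
    `cost` is the promotion cost per piece, and `margin` is the profit
    per response. All names and inputs here are illustrative.
    """
    ranked = sorted(scored, key=lambda r: r[0], reverse=True)
    best_depth, best_profit, running = 0, 0.0, 0.0
    for depth, (_, prob) in enumerate(ranked, start=1):
        running += prob * margin - cost   # expected profit of this piece
        if running > best_profit:
            best_depth, best_profit = depth, running
    return best_depth, best_profit

# With a $1 piece cost and $10 margin per response, the third record's
# expected value is negative, so the optimal depth is two pieces.
depth, profit = best_mail_depth(
    [(0.9, 0.9), (0.5, 0.5), (0.1, 0.05)], cost=1.0, margin=10.0)
```

The same loop, run with different cost and margin figures per segment, yields the different cut-off points for objectives like gross margin versus net margin that the article describes.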
Targeting Optimizer was released in June and has one current installation. Servers can run Windows NT or Unix, while workstations can be any Windows system or Unix terminal. Prices start at $150,000 for a five-user license.
MarketSwitch plans additional products to optimize other aspects of marketing strategy. These will simulate the impact of long-term marketing programs against different customer segments, find optimal offer sequences, allocate promotions across sales channels and identify the value of different data elements in a large database. They are scheduled for release later this year and into 1999.
David M. Raab is principal of Raab Associates, a database marketing consultancy based near Philadelphia.