Genalytics for Predictive Modeling

Genetic algorithms mimic nature’s evolutionary processes by making random changes in a system and keeping those that yield improved results. They are applied when no mathematical solution exists for finding the best answer to a problem and there are too many possibilities to test them all by brute force.

They are also sometimes used in applications such as package delivery routing or airline flight crew scheduling, where constantly changing conditions mean that getting a pretty good answer quickly is worth more than waiting a long time for the absolute best answer.

Genetic methods also have the advantage of being highly automated, so they free up skilled analysts for more creative projects or simply to solve more problems.

Genalytics Predictive Suite (Genalytics, 978/465-6373, www.genalytics.com) uses genetic methods for predictive modeling. The basic process is the same as any genetic modeling system: Predictive variables are randomly chosen, transformed and combined to form models; the models are tested against historical data and ranked by the quality of their results; the best models are selected, modified via more random changes (mutations) and by exchanging components with each other (breeding); and the cycle repeats.
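The cycle described above can be sketched in a few lines of Python. This is a generic illustration, not Genalytics code: each "model" is reduced to a list of 0/1 flags saying whether a variable is used, and the function name `evolve`, its parameters and the toy scoring function are all assumptions made for the example. A real system would score each model against historical data.

```python
import random

def evolve(score, gene_count, pop_size=25, generations=50,
           mutation_rate=0.1, seed=0):
    """Evolve bit-string 'models'; each bit says whether a variable is used."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(gene_count)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Test and rank the models, best first.
        ranked = sorted(population, key=score, reverse=True)
        # Selection: the better half survives unchanged.
        parents = ranked[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            # Breeding: the child takes components from both parents.
            cut = rng.randrange(1, gene_count)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each flag with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=score)

# Toy fitness (an assumption): more variables switched on scores higher.
best = evolve(score=sum, gene_count=12)
```

Because the surviving parents are carried forward unchanged, the best score in this sketch never gets worse from one generation to the next.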

Genalytics takes a relatively structured approach, creating a “gene pool” with all possible treatments for each input variable, such as whether it is used, which mathematical transformations are applied and how outliers and missing values are treated. The precise options depend on the nature of the variable. For later efficiency, a separate record is created for each treatment of the variable and stored in a binary format.
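One way to picture such a gene pool is as a lookup table with one precomputed column per combination of variable and treatment. The sketch below is an assumption-laden illustration: the transformation names, the two missing-value rules and the function `build_gene_pool` are invented for the example, outlier handling is omitted, and the columns are kept in memory rather than written to a binary store.

```python
import math

# Hypothetical treatment options; the product's actual list is not given.
TRANSFORMS = {
    "identity": lambda x: x,
    "log1p": lambda x: math.log1p(max(x, 0.0)),
    "square": lambda x: x * x,
}
# Missing-value rules: a fixed fill value, or None meaning "use the mean".
MISSING = {"zero": 0.0, "mean": None}

def build_gene_pool(columns):
    """Precompute one treated column per (variable, transform, rule)."""
    pool = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        mean = sum(present) / len(present) if present else 0.0
        for rule, fill in MISSING.items():
            fill_value = mean if fill is None else fill
            filled = [fill_value if v is None else v for v in values]
            for transform, fn in TRANSFORMS.items():
                pool[(name, transform, rule)] = [fn(v) for v in filled]
    return pool

# Two made-up input variables with missing values.
pool = build_gene_pool({
    "age": [30, None, 50],
    "spend": [10.0, 20.0, None],
})
```

Doing this work once up front means later model evaluations only read finished columns, which is consistent with the article's point about efficiency.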

During model building, the system randomly selects items from the gene pool and combines them to form linear equations similar to a traditional regression model. The linear format places some limits on the complexity of the model but has the advantage of being easily interpreted by analysts familiar with conventional techniques.
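As a rough illustration, a model of this kind can be represented as an intercept plus a list of weighted gene-pool entries. The article does not say how Genalytics sets the coefficients, so the random weights in `random_model` are purely an assumption, as are the function names and the tiny hand-built pool.

```python
import random

def random_model(pool_keys, n_terms, rng):
    """Pick n_terms distinct genes; random weights are an assumption."""
    keys = rng.sample(sorted(pool_keys), n_terms)
    return {"intercept": rng.uniform(-1, 1),
            "terms": [(key, rng.uniform(-1, 1)) for key in keys]}

def predict(model, pool):
    """Evaluate intercept + sum(weight * treated column) per record."""
    n_records = len(next(iter(pool.values())))
    scores = [model["intercept"]] * n_records
    for key, weight in model["terms"]:
        scores = [s + weight * x for s, x in zip(scores, pool[key])]
    return scores

def describe(model):
    """Render the model as a readable equation."""
    parts = [f"{model['intercept']:.3f}"]
    parts += [f"{w:+.3f}*{transform}({variable})"
              for (variable, transform), w in model["terms"]]
    return " ".join(parts)

# A tiny hypothetical pool keyed by (variable, transform).
pool = {("age", "identity"): [1.0, 2.0],
        ("spend", "log"): [0.0, 1.0]}
model = {"intercept": 1.0,
         "terms": [(("age", "identity"), 2.0), (("spend", "log"), -1.0)]}
print(describe(model))       # 1.000 +2.000*identity(age) -1.000*log(spend)
print(predict(model, pool))  # [3.0, 4.0]
```

The `describe` output shows why the linear format is easy to interpret: every term is a named variable, a named transformation and a single weight.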

Genalytics captures interactions among variables by randomly selecting pairs of variables and functions such as multiplication or division. Values based on such interactions, such as average purchase size (total purchase value divided by number of purchases), often are crucial for generating accurate models. Creating them manually is a particularly time-consuming part of conventional modeling methods. Since the number of potential combinations is enormous, Genalytics generates random combinations as it runs rather than creating them all in advance.
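A minimal sketch of generating interactions on demand might look like the following. The operator table, the zero-division guard and the function name `random_interaction` are assumptions for illustration, and the column names are made up.

```python
import random

OPS = {
    "mul": lambda a, b: a * b,
    # Returning 0.0 when the divisor is zero is an arbitrary choice.
    "div": lambda a, b: a / b if b != 0 else 0.0,
}

def random_interaction(columns, rng):
    """Pick two distinct variables and an operator; compute the column now."""
    left, right = rng.sample(sorted(columns), 2)
    op = rng.choice(sorted(OPS))
    values = [OPS[op](a, b)
              for a, b in zip(columns[left], columns[right])]
    return (op, left, right), values

columns = {"total_value": [100.0, 0.0, 90.0],
           "num_purchases": [4, 0, 3]}

# The article's example, average purchase size, computed directly.
avg_purchase = [OPS["div"](v, n) for v, n in
                zip(columns["total_value"], columns["num_purchases"])]

# A randomly chosen interaction, generated only when it is needed.
key, values = random_interaction(columns, random.Random(0))
```

With V variables and K operators there are on the order of V × V × K ordered pairs before transformations are even considered, which is why generating them lazily is attractive.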

One danger with genetic methods is that a strong model may appear early and dominate the breeding of later generations, thereby preventing the subsequent evolution of more powerful alternatives. Genalytics avoids this by running five or six separate modeling streams, which it calls tribes, each with about 25 models per generation.

It also can have an elite tribe that imports the best-performing members of the other tribes to further ensure genetic diversity. The system typically runs 5,000 to 10,000 generations when building a model, testing 500,000 to 1 million preliminary models in total.
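The tribe arrangement resembles what the genetic-algorithm literature calls an island model. The sketch below shows the idea under heavy assumptions: models are again simple lists of 0/1 flags, the elite tribe replaces its weakest members with copies of each tribe's current best, and every name and default (apart from five tribes of 25 models, taken from the article) is invented for the example.

```python
import random

def next_generation(tribe, score, rng, mutation_rate=0.05):
    """Keep the better half of a tribe; refill with mutated crossovers."""
    ranked = sorted(tribe, key=score, reverse=True)
    parents = ranked[:len(tribe) // 2]
    children = []
    while len(parents) + len(children) < len(tribe):
        a, b = rng.sample(parents, 2)
        cut = rng.randrange(1, len(a))
        child = a[:cut] + b[cut:]
        children.append([1 - g if rng.random() < mutation_rate else g
                         for g in child])
    return parents + children

def run_tribes(score, genes=16, tribes=5, size=25, generations=40, seed=1):
    rng = random.Random(seed)

    def new_model():
        return [rng.randint(0, 1) for _ in range(genes)]

    islands = [[new_model() for _ in range(size)] for _ in range(tribes)]
    elite = [new_model() for _ in range(size)]
    for _ in range(generations):
        # Ordinary tribes evolve in isolation, so an early strong model
        # can dominate only its own tribe.
        islands = [next_generation(t, score, rng) for t in islands]
        # The elite tribe imports a copy of each tribe's current best,
        # replacing its own weakest members, then evolves as usual.
        imports = [list(max(t, key=score)) for t in islands]
        keep = sorted(elite, key=score, reverse=True)[:size - len(imports)]
        elite = next_generation(keep + imports, score, rng)
    everyone = elite + [m for t in islands for m in t]
    return max(everyone, key=score)

# Toy fitness (an assumption): count of flags switched on.
best = run_tribes(score=sum)
```

Note that migration here flows one way, into the elite tribe, so the ordinary tribes stay independent of one another.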

Users can specify the number of generations, models per generation, mutation and cross-breeding rates, variables per model and other parameters, or simply accept system defaults. Genalytics uses its own genetic methods to automatically adjust some parameters as it evaluates results during the modeling process.

Execution time depends on the amount of processing required and the computing power available. Because the gene pool is created at the start of the process, run time is determined more by available processing power than by data access speed. A project involving 1 million test models might run for two days on a cluster of three to five dual-processor PCs. Of course, the computer works unattended most of this time. Genalytics estimates hands-on analyst labor at 30 to 90 minutes.

To further reduce analyst effort, Genalytics provides data preparation tools to extract data from source systems, convert multi-level relational tables to flat files and specify treatments such as grouping of continuous variables into bins. Users also can view correlation coefficients between input variables and the target variable, and can force or prohibit use of particular variables. Genalytics plans to market its data preparation tools as a separate product later this year.

The system automatically tests and validates its models as part of the development process. It also provides reports showing model performance as evolution is under way, plus lift charts, gains charts and the actual equation of the final model. The equation can be shown in XML, SAS or SQL syntax and exported for use in other systems. Genalytics also can import new records and score them itself.
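Because the final model is a linear equation, exporting it for scoring elsewhere is largely a matter of string formatting. The following sketch shows what a SQL rendering could look like. It is not the product's actual output format: the function `to_sql`, the table and column names and the `LN` function (whose name varies by SQL dialect) are all hypothetical.

```python
def to_sql(intercept, terms, table="customers", alias="score"):
    """Render a linear model as a SQL SELECT statement.

    terms is a list of (sql_expression, weight) pairs. Table and
    column names are hypothetical.
    """
    pieces = [repr(float(intercept))]
    pieces += [f"({weight!r} * {expr})" for expr, weight in terms]
    return f"SELECT {' + '.join(pieces)} AS {alias} FROM {table}"

statement = to_sql(0.5, [("LN(1 + total_spend)", 1.2),
                         ("num_purchases", -0.3)])
print(statement)
# SELECT 0.5 + (1.2 * LN(1 + total_spend)) + (-0.3 * num_purchases) AS score FROM customers
```

Scoring inside the database this way avoids moving records out to a separate modeling tool.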

None of this would matter if Genalytics failed to produce superior models. The vendor says tests against existing models at sophisticated marketers have averaged a 5 percent to 10 percent improvement in the number of correctly classified records. The main reason cited is that the system tests many more variables, transformations and interactions than an analyst has time to consider.

A typical project might test hundreds of candidate variables before settling on the best dozen or so. Genalytics reports that some clients accept the system’s models as generated, while others review the results and further refine them manually before implementation. In some cases, clients developed multiple Genalytics models with slightly different objectives and combined the results into a final version. The key to such approaches is the system’s ability to generate multiple models in a short time.

Genalytics is written in Java, which lets it run on nearly any computer. Most installations use Windows servers, though some clients run Linux or Unix. The system automatically takes advantage of clustered servers, directing new calculations to each processor as it becomes available. Input can be ASCII files or SAS data sets. Pricing is based on the size of the installation and the value generated for the client. Annual licenses start at more than $100,000.

Genalytics was founded in 1998 and released the first version of the software in 2000. The system has been sold to about a half-dozen clients, mostly major direct marketers or financial services institutions. Another half-dozen evaluation systems are in place.