Data Mining's Expensive Hidden Costs
Unless, that is, your company bought into the concept of in-database data mining. This much-ballyhooed approach mines the data where they live instead of on a carefully selected data set. At first glance, this sounds ideal, especially to people in information technology departments looking for a cheap, easy way to accomplish such a vital task. Why should business analysts and marketing professionals go to the trouble of selecting, manipulating and further replicating data to understand customers when all the data can be mined within the database itself?
Skipping this step has costly repercussions, not the least of which is a loss of autonomy for the marketing professionals who manage the customer relationship management process. In-database data mining usually puts control of sifting through customer data into the hands of database administrators. And if marketers want to maintain their previous level of involvement, their focus could be diverted from the customer as they learn structured query language (SQL), the language used to work with data inside a database.
Though in-database data mining might save a little disk space and some input/output cycles, those savings evaporate when you consider the hidden costs.
Gaining strategic value from the data requires digging deeply into it - this means learning some of the irregularities that persist even in the cleanest of warehouses and making the right assumptions for dealing with these irregularities for the analysis at hand. It is important to understand that this is an iterative process and is best done in a lean decision-support environment.
Before your company jumps on this latest get-to-know-your-customers-fast bandwagon, take a moment to consider the downsides of in-database data mining.
One pitfall is that databases were not designed with data mining in mind, and database administrators typically do not have the skills needed for data mining. Data pre-processing for data warehousing is one thing; data pre-processing for mining is quite another. With in-database data mining, this task would be performed by a company's IT professionals, people well trained in database administration but usually lacking the skill set for mining that data for strategic advantage.
In the same vein, structured query language (SQL), the language used to work with data that reside in a database, is not well suited to the data preparation that mining requires. Data mining often requires creating new variables and applying statistical transformations, such as taking the log of an individual's debt or equity, for which standard SQL is not adapted. Those tasks are more easily handled within a true mining environment.
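To make the kind of transformation described above concrete, here is a minimal sketch of deriving a log-debt variable outside the database. The record layout, the field names and the function name are assumptions made for illustration. The sketch also uses log(1 + x) rather than a plain log so that a zero balance does not cause an error, which is a choice made here and not something the approach requires.

```python
import math

# Hypothetical customer records; the field names are assumptions for illustration.
customers = [
    {"id": 1, "debt": 0.0},
    {"id": 2, "debt": 1500.0},
    {"id": 3, "debt": 250000.0},
]

def add_log_debt(records):
    """Return copies of the records with a derived log_debt variable."""
    derived_records = []
    for record in records:
        derived = dict(record)  # copy, so the source records stay untouched
        # log1p(x) computes log(1 + x), so a zero balance maps to 0.0
        # instead of raising a math domain error.
        derived["log_debt"] = math.log1p(record["debt"])
        derived_records.append(derived)
    return derived_records

transformed = add_log_debt(customers)
```

Because the function returns copies, the analyst can try a different transformation on the next pass without disturbing the extracted data.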
Most firms make multiple copies of their data anyway, using different sets for different applications, such as online analytical processing, standard reports, special project marts, data mining and extra data mining marts for special projects. Similarly, traditional data mining methods require that the data in a database be sampled and stored on disk before analysts work with them. Though this might seem like a more expensive approach, the money you spend on disk space is more than made up for by the speed of the process: you get the answer faster.
Time is money. With disk space becoming less expensive, traditional methods are not likely to hit companies' pocketbooks too hard. And by sampling the data, you end up with a data set that, in contrast to a large, cumbersome database, is lean and fast, designed for decision support, not for transaction processing.
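As a rough illustration of the sampling step, the sketch below draws a reproducible simple random sample of customer IDs. The function name, the 5 percent fraction and the fixed seed are all assumptions chosen for the example, and a real project might well prefer a stratified or otherwise targeted sample.

```python
import random

def sample_customers(customer_ids, fraction, seed=0):
    """Draw a simple random sample of customer IDs without replacement."""
    if not customer_ids:
        return []
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    target = round(len(customer_ids) * fraction)
    # Keep at least one record, and never ask for more than exist.
    sample_size = min(len(customer_ids), max(1, target))
    return rng.sample(customer_ids, sample_size)

# Hypothetical population of 100,000 customer IDs.
all_ids = list(range(1, 100001))
mining_set = sample_customers(all_ids, fraction=0.05)
```

Fixing the seed means the same lean data set can be rebuilt later, which helps when an analysis has to be repeated or checked.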
In addition, intelligently subsetting the data is commonly done to address strategic questions, such as which customers are most likely to leave. If you don't subset the data intelligently, you could unknowingly introduce bias into the results.
Even in the cleanest warehouse, messy data persist. For example, take a database containing data from an online form where 11/11/11 was the easiest legitimate date to enter in a required birth-date field. Most database extract, transform, load (ETL) tools will not see that there is a spike of 89-year-olds in the age field, resulting from the fact that users who are in a hurry or who do not want to reveal personal information simply enter the easiest birth date. As a result, the messy data persist in the warehouse. The best tools will at least allow a note in the metadata so that users can see these anomalies and deal with them in a way that makes sense for the purpose at hand. With this kind of data proliferating in the database, the need to replicate data, at least in part, becomes obvious.
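A spike like this is easy to surface once the data sit in an analysis environment. The sketch below flags any birth date that accounts for an unusually large share of records. The function name, the 5 percent threshold and the sample data are assumptions made for illustration, as is reading 11/11/11 as November 11, 1911.

```python
from collections import Counter
from datetime import date

def find_date_spikes(birth_dates, threshold=0.05):
    """Return birth dates whose share of all records exceeds the threshold."""
    total = len(birth_dates)
    if total == 0:
        return {}
    counts = Counter(birth_dates)
    return {d: n for d, n in counts.items() if n / total > threshold}

# Hypothetical data: 12 records with the "easy" date plus 88 distinct dates.
easy_date = date(1911, 11, 11)
birth_dates = [easy_date] * 12 + [
    date(1950 + i % 40, 1 + i % 12, 1 + i % 28) for i in range(88)
]

spikes = find_date_spikes(birth_dates)
```

What to do with the flagged records (drop them, treat the age as missing, or keep them) depends on the question being asked, which is exactly the judgment call the analyst needs to make.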
More importantly, because databases were designed for transaction processing, not business intelligence applications like data mining, they are loaded with features not necessary for data mining that hog computing resources and encumber the process of strategic decision-making. As a result, databases are inefficient at handling necessary tasks such as replicating and overwriting data.
In-database data mining is far from ideal. It is likely to be more computationally intense than necessary, since so many other processes run constantly in traditional relational database management software. That overhead needlessly lengthens the time it takes to get results.
And no company can afford to diminish the role of its marketing professionals, who have critical domain knowledge of their business and their customers. Data mining results can be put to better use when the people who ultimately use them understand where they came from.
Technical issues aside, before you decide to invest in an in-database data mining tool, consider whether you are willing to give database administrators a strategic role in marketing by having them do a job for which marketing professionals are better suited.