Data Integration Requires the Right Tools
Companies have recently turned to three technologies to build customer data integration (CDI) solutions: data movement tools such as extract-transform-load (ETL); data query and aggregation tools such as enterprise information integration (EII); and data quality (DQ) tools. But what the tool vendors aren't telling you is that these tools are inadequate for developing a reliable CDI platform.
Customer hubs emerging. Industry market research firm Gartner Inc. defines CDI as "the combination of the technology, processes and services needed to create and maintain an accurate, timely and complete view of the customer across multiple channels, business lines and enterprises where there are multiple sources of customer data in multiple application systems and databases."
Several implementation styles of CDI solutions exist, but the most effective is one in which an enterprise commits to building and managing a customer hub that serves as a central repository of customer data reconciled from multiple data sources. This hub may contain some or all of the critical customer data needed to provide multiple customer views to downstream applications.
Data tools ill-suited. In the past decade, many companies tried to build an in-house CDI hub using ETL, EII and DQ tools, and they are now struggling with the aftermath of a custom solution. There are several reasons for the failure of CDI solutions built with these tools.
First, all three technologies originated for narrow purposes ill-suited to CDI: ETL to move large volumes of data in batch mode; EII to run distributed queries across disparate sources in real time; and DQ tools to scrub incorrect names and addresses in a single source at a time.
Each technology effectively supports only a single data modality, batch or real time. Since customer data is inextricably tied to both the operational and the strategic business processes of a company, such as the order-to-cash process or profitability segmentation analyses, it needs to be delivered in time for each business process.
Therefore, any CDI solution needs to support a range of modalities of data movement: from a large-volume batch process that loads a new source into a customer hub; to scheduled intra-day batches; to a publish-subscribe model for immediate updates of critical data. Tools designed for a single modality can hamper the reliability and scalability of a CDI solution.
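The range of modalities described above can be sketched as one hub interface that accepts both bulk loads and immediately published updates. This is only an illustration under simplifying assumptions: the `CustomerHub` class, its method names and the record fields are hypothetical and are not taken from any product.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


class CustomerHub:
    """Hypothetical hub that accepts batch loads and publishes single updates."""

    def __init__(self) -> None:
        self.records: Dict[str, Record] = {}
        self.subscribers: List[Callable[[Record], None]] = []

    def load_batch(self, batch: Iterable[Record]) -> int:
        """Large-volume or scheduled intra-day load; subscribers are not notified."""
        count = 0
        for record in batch:
            self.records[record["customer_id"]] = dict(record)
            count += 1
        return count

    def subscribe(self, callback: Callable[[Record], None]) -> None:
        """Register a downstream application for immediate updates."""
        self.subscribers.append(callback)

    def publish_update(self, record: Record) -> None:
        """Apply one critical change and push it to every subscriber."""
        self.records[record["customer_id"]] = dict(record)
        for callback in self.subscribers:
            callback(dict(record))


hub = CustomerHub()
loaded = hub.load_batch([
    {"customer_id": "C1", "email": "old@example.com"},
    {"customer_id": "C2", "email": "b@example.com"},
])

received: List[Record] = []
hub.subscribe(received.append)
hub.publish_update({"customer_id": "C1", "email": "new@example.com"})
```

The point of the sketch is that both paths write to the same store, so a downstream application sees one consistent record regardless of how the data arrived.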
Treat different data types separately. To build a reliable CDI solution, one must treat the different types of data separately: master reference data, relationship data and transaction data.
Master reference data is the foundational entity data (such as name and address) that is critical for uniquely identifying a customer across multiple systems and channels. Without a trustworthy hub of customer reference or profile data that serves as the "system of record," other types of data cannot be aggregated reliably. Such a master store should create and maintain the best-of-breed record for each customer culled from all relevant internal and external data sources - at the cell or attribute level - along with the associated cross-reference keys. This store then becomes the best source of truth for customer profile information for all downstream operational and analytical applications.
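A minimal sketch of attribute-level consolidation follows. It assumes a simple rule, that each attribute has its own ranking of trusted sources, which is only one of many possible survivorship policies. The source names, the `SOURCE_PRIORITY` table and the `build_master_record` function are hypothetical.

```python
from typing import Dict, List, Optional

# Assumed trust ranking per attribute: earlier sources win. Purely illustrative.
SOURCE_PRIORITY: Dict[str, List[str]] = {
    "name": ["crm", "billing", "web"],
    "address": ["billing", "crm", "web"],
}


def build_master_record(
    source_records: List[Dict[str, Optional[str]]],
) -> Dict[str, object]:
    """Pick the most trusted non-empty value for each attribute and keep
    the cross-reference keys that point back to every contributing source."""
    by_source = {rec["source"]: rec for rec in source_records}
    master: Dict[str, object] = {}
    for attribute, ranking in SOURCE_PRIORITY.items():
        master[attribute] = None
        for source in ranking:
            value = by_source.get(source, {}).get(attribute)
            if value:
                master[attribute] = value
                break
    # Cross-reference keys: source system -> that system's own customer key.
    master["xref"] = {rec["source"]: rec["source_key"] for rec in source_records}
    return master


records = [
    {"source": "crm", "source_key": "CRM-17", "name": "Ann Lee", "address": None},
    {"source": "billing", "source_key": "B-903", "name": "A. Lee", "address": "12 Elm St"},
]
master = build_master_record(records)
```

Here the name survives from the CRM record and the address from the billing record, while the cross-reference keys let downstream applications trace the master record back to both systems.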
The next type is relationship or hierarchy data. It defines the relationships among various entities such as individual to organization, organization to organization or individuals within households. Relationship data can be managed reliably across different sources only after the underlying conflicts of master (entity) data are resolved. Most custom solutions have fixed relationships among entities embedded in the system's data model, which makes it hard for IT to manage changes in customer relationships and affiliations.
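One way to avoid embedding fixed relationships in the data model is to store them as rows keyed by master IDs, so that a new affiliation is new data rather than a schema change. The sketch below assumes master (entity) conflicts are already resolved; the IDs, relation names and the `related` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Relationships stored as rows of (from_id, relation, to_id) keyed by master IDs.
relationships: List[Tuple[str, str, str]] = [
    ("M1", "member_of_household", "H1"),
    ("M2", "member_of_household", "H1"),
    ("M1", "employee_of", "ORG1"),
    ("ORG1", "subsidiary_of", "ORG2"),
]


def related(relation: str) -> Dict[str, Set[str]]:
    """Group source entities by target entity for one relation type."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for from_id, rel, to_id in relationships:
        if rel == relation:
            groups[to_id].add(from_id)
    return dict(groups)


households = related("member_of_household")

# A change in affiliation is just another row; no model change is needed.
relationships.append(("M3", "member_of_household", "H1"))
households_after = related("member_of_household")
```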
The third type is transaction or activity data, such as the amount withdrawn from an account. Despite the challenges of managing large volumes of such data, there is usually little conflict in reconciling it, since there is an unambiguous system of record for each type of transaction. The issue lies in attaching these transactions correctly to the same customer across multiple CRM touch points and then aggregating them accurately (into an average account balance, for example) for other applications to consume. Note that transactions can be aggregated for the right customer or household only after the ambiguities of the associated master and relationship data are removed.
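The dependency of transaction aggregation on resolved master data can be illustrated with a short sketch. It assumes a cross-reference table from each source system's local key to a master customer ID already exists; the `XREF` table, the sample transactions and the `total_withdrawn` function are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed cross-reference from (source system, local key) to a master customer ID.
XREF: Dict[Tuple[str, str], str] = {
    ("branch", "ACC-1"): "M1",
    ("online", "u-778"): "M1",
    ("branch", "ACC-2"): "M2",
}

transactions: List[dict] = [
    {"source": "branch", "key": "ACC-1", "withdrawn": 100.0},
    {"source": "online", "key": "u-778", "withdrawn": 50.0},
    {"source": "branch", "key": "ACC-2", "withdrawn": 30.0},
    {"source": "online", "key": "u-999", "withdrawn": 20.0},  # no master match yet
]


def total_withdrawn(txns: List[dict]) -> Tuple[Dict[str, float], List[dict]]:
    """Sum withdrawals per master customer; set aside transactions that
    cannot yet be attached to a resolved customer."""
    totals: Dict[str, float] = defaultdict(float)
    unresolved: List[dict] = []
    for txn in txns:
        master_id = XREF.get((txn["source"], txn["key"]))
        if master_id is None:
            unresolved.append(txn)
        else:
            totals[master_id] += txn["withdrawn"]
    return dict(totals), unresolved


totals, unresolved = total_withdrawn(transactions)
```

The branch and online withdrawals roll up to the same customer only because the cross-reference says they belong together; the unmatched transaction is held back rather than guessed at.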
Essentially, without treating different data types separately and establishing a reliable foundation of master data at the start, a trustworthy CDI platform cannot be built. Yet none of the data tools maintains separation of data types.
ETL tools neither recognize nor treat master data apart from other types. EII tools assume that all federated data results are clean and unambiguous. They rely on an external source to provide correct cross-reference keys and global IDs to accurately join the results of a federated query. DQ tools provide ad-hoc cleansing of a source, but they neither recognize data types nor offer ongoing management of data changes.
The challenge of data models. One key reason custom solutions are hard to extend is that they instantiate a fixed data model in a physical database repository or data warehouse.
This fate is shared by "packaged" CDI solutions offered by application vendors. In a large enterprise, rarely does one vendor have access to all sources of customer data, external and internal. Therefore, standardizing on the application vendor's data model means more work, not less, since every data source outside the vendor application must be transformed to feed into the vendor's customer data hub.
The best approach is to create a template-driven, logical data model specifically for each enterprise, reflecting all of its customer data sources that need to be integrated. The solution provider must deliver a data model and solution framework cognizant of the needs of each major industry vertical. None of these data tools attempts to address the challenge of data models for the diverse set of data sources encountered in various verticals.
Meta-data-driven framework needed. The most fundamental shortcoming of ETL/EII/DQ tools is that they do not offer a meta-data framework to manage the complete set of data management tasks required of a CDI solution. Each tool, along with the numerous enterprise application integration (EAI) technologies, solves only a narrow integration issue within the IT "stack" - integrating application to application, moving data to a single warehouse, cleansing a single source, etc.
A CDI framework needs tools for all processes associated with managing the data types. For example, the framework should address the full life cycle of master reference data: model, cleanse, match, merge, share, extend and manage. The solution should allow customer and organization hierarchies across data sources to be leveraged instead of being tied to the fixed hierarchical view of a single implementation. The solution should readily access all relevant customer activity data and unify it with other data types for a complete view (through caching or aggregation).
For the solution to manage data changes without software programming efforts, it must be driven by meta-data that captures the data syntax, semantics and business rules relevant to integrating customer data into unified views.
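What "driven by meta-data" can mean in practice is sketched below: field mappings and cleansing rules live in a data structure, and a generic engine interprets them, so onboarding a source changes the meta-data and not the code. This is a toy illustration; the rule names, the `SOURCE_METADATA` layout and the `integrate` function are hypothetical.

```python
from typing import Callable, Dict

# Named cleansing rules that the meta-data can refer to.
RULES: Dict[str, Callable[[str], str]] = {
    "trim": str.strip,
    "upper": str.upper,
    "title": str.title,
}

# Meta-data: for each source, map its fields to hub attributes plus rules to apply.
SOURCE_METADATA: Dict[str, Dict[str, dict]] = {
    "crm": {
        "cust_nm": {"target": "name", "rules": ["trim", "title"]},
        "st": {"target": "state", "rules": ["trim", "upper"]},
    },
}


def integrate(source: str, row: Dict[str, str]) -> Dict[str, str]:
    """Generic engine: no source-specific logic lives in this function."""
    result: Dict[str, str] = {}
    for field_name, spec in SOURCE_METADATA[source].items():
        value = row.get(field_name, "")
        for rule_name in spec["rules"]:
            value = RULES[rule_name](value)
        result[spec["target"]] = value
    return result


crm_row = integrate("crm", {"cust_nm": "  ann lee ", "st": " ca"})

# Onboarding an acquired company's system is a meta-data change only.
SOURCE_METADATA["acquired_erp"] = {
    "FULLNAME": {"target": "name", "rules": ["trim", "title"]},
    "REGION": {"target": "state", "rules": ["trim", "upper"]},
}
erp_row = integrate("acquired_erp", {"FULLNAME": "BOB RAY", "REGION": "ny "})
```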
It is important to distinguish between managing meta-data through a generalized meta-data tool and having a meta-data-driven framework designed for a specific purpose (such as CDI). A meta-data-driven framework captures, stores and uses highly contextual meta-data tied to a business purpose (such as when a customer address was changed and by whom). By separating meta-data from its business context, a generalized meta-data tool often limits its business value.
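Contextual meta-data of the kind mentioned above (when an address was changed and by whom) can be captured alongside the data itself. The following is a minimal sketch, not a description of any particular framework; the `CustomerRecord` and `AttributeChange` classes and the user name are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class AttributeChange:
    attribute: str
    old_value: str
    new_value: str
    changed_by: str
    changed_at: datetime


@dataclass
class CustomerRecord:
    customer_id: str
    attributes: Dict[str, str] = field(default_factory=dict)
    history: List[AttributeChange] = field(default_factory=list)

    def update(self, attribute: str, value: str, changed_by: str) -> None:
        """Change one attribute and record who changed it and when."""
        self.history.append(AttributeChange(
            attribute=attribute,
            old_value=self.attributes.get(attribute, ""),
            new_value=value,
            changed_by=changed_by,
            changed_at=datetime.now(timezone.utc),
        ))
        self.attributes[attribute] = value


customer = CustomerRecord("M1", {"address": "12 Elm St"})
customer.update("address", "40 Oak Ave", changed_by="steward.jane")
last_change = customer.history[-1]
```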
The key advantage of a meta-data-driven CDI framework is that it renders the solution entirely configurable so that business and IT changes can be implemented rapidly without writing code. Since the framework is manageable by business analysts and data stewards as well as by IT, such a solution becomes the foundation for all unified customer views in an enterprise. Additional data sources are easy to incorporate, without more programming, as businesses evolve through mergers and acquisitions.
Because the custom CDI solutions built with ETL/EII/DQ tools are not meta-data driven, they are not manageable by data stewards, are hard to configure and generally are not extensible beyond a few sources.
Service-oriented architecture is critical. If a customer data hub is to be the central repository of critical customer information, it must be able to synchronize reliable data back to source systems. Such a CDI solution also needs to support a standards-based service-oriented architecture (SOA) so that its underlying data services may be used by future service-oriented applications. Hubs built with data tools typically don't offer these capabilities.
Though necessary components of the data integration architecture, ETL/EII/DQ tools are neither designed nor able to build a trustworthy foundation for CDI. For the same reason you wouldn't hire a plumber to build your house, organizations should not rely mainly on these technologies when developing a customer data foundation.