Beyond the Data Warehouse:


Architectural Alternatives for Data Integration

by Evan Levy, Baseline Consulting

When a prospective client recently asked my opinion about his company building an enterprise data warehouse, I thought about Yogi Berra’s line: “No one goes there these days. It’s too crowded.” After all, doesn’t everyone have a data warehouse?

These days, the presumption that integrating data requires a centralized platform is increasingly debatable. At Martin Marietta in the 1980s, my colleagues and I were awestruck at what a practical and downright elegant solution our data warehouse represented. In addition to the benefits of mainframe offload and direct end-user access, the data warehouse enabled companies to focus on the information integration and analysis needs of the business users. Furthermore, our enterprise data warehouse imposed the rigor to migrate data in a structured and systematic way.

The Distributed Data Workaround

The success of centralized databases enabled many organizations to provide a “single version of the truth” by migrating and copying data from operational systems to the data warehouse. The challenge was the rigor required to integrate new data sources. While the time investment was worthwhile for new (and ongoing) business needs, it seemed questionable for intermittent or ad hoc data requirements.

At Martin Marietta, we often pulled data from multiple systems to support one-time-only reports because the data to support these reports didn’t exist on a single platform. While the whole purpose of database technology is to simplify data access and support analysis, we were spending more time loading data than analyzing it. For this particular situation, distributed database architecture, while immature, was a reasonable solution. Performance wasn’t stellar, but we could bring data together in a fashion that supported one-off needs without migrating tables and reinventing applications. In those days, the network bandwidth, storage, and processing power necessary to migrate large volumes of data were more expensive, and finding a spare hundred gigabytes for a week was a significant undertaking.

Since then, data volumes have increased a thousandfold or more. Unfortunately for the distributed database architecture, bandwidth didn’t evolve as quickly, and most IT departments abandoned distributed databases for large-scale analysis. Clustered SMP, massively parallel processing, middleware, and other buzzwords have inspired hope for new ways to integrate data, but they’ve also clouded perceptions of what’s practical.

One way of looking at data integration is to identify the types of business questions that require integrated data, but not in massive quantities. I’ve classified these questions into two categories:

  1. Detailed Inquiry. These are questions focused on details about a single business or subject area. For instance, an end user in Finance needs specific financial metrics from multiple business units.
  2. Content-Bound Analytics. These are questions bounded to the data of a single entity, such as one customer or account. For instance, to review a customer’s complaint, a support center representative doesn’t need access to the entire customer base and all of the associated purchase histories, just the individual’s account information and purchase details.

In each of these circumstances, the business question is predefined via an application. The application or software package generates a discrete, usually repeatable query and needs a very specific set of data in return. These types of questions don’t require the so-called “full table scan,” or any scan at all. They simply require knowledge of where the data resides and how to retrieve it. Usually, the answer set is no more than a single record.
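
To make the pattern concrete, here is a minimal sketch, in Python against a hypothetical SQLite accounts table, of the kind of keyed, single-record lookup a detailed inquiry or content-bound question generates. The table and column names are illustrative assumptions, not drawn from any particular system.

    # A content-bound lookup: the application already knows which account it
    # needs, so it issues a keyed, parameterized query instead of scanning
    # (or pre-loading) the full customer base.
    # Note: the "accounts" schema below is hypothetical, for illustration only.
    import sqlite3
    from typing import Optional

    def fetch_account_summary(conn: sqlite3.Connection, account_id: str) -> Optional[dict]:
        """Return the single account record the inquiry needs, or None if absent."""
        row = conn.execute(
            "SELECT account_id, customer_name, balance, last_purchase_date "
            "FROM accounts WHERE account_id = ?",  # indexed key lookup, no scan
            (account_id,),
        ).fetchone()
        if row is None:
            return None
        cols = ("account_id", "customer_name", "balance", "last_purchase_date")
        return dict(zip(cols, row))

The query touches one row through its key; nothing about it requires the account data to have been copied into a warehouse first.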

New Choices for Integrated Data

Companies can now be more clever and more deliberate about integrating data based on their business and usage requirements. Today’s database optimizer and metadata technologies can identify which data to move and find that data at its operational source. Integration can be handled using simple (but well-defined), source-oriented metadata. The challenge has been simplified: rather than worrying about finding the source data and moving it within a constrained overnight window, this new class of integration methods requires staff merely to identify the data source. The tool can then determine the navigation and selection criteria needed to return the right records.
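
As a rough illustration of what that source-oriented metadata might look like, the Python sketch below keeps a small catalog recording, for each logical data element, which operational system holds it and how to select the specific records needed. The system, table, and column names are hypothetical; in practice an EII tool maintains and navigates this metadata for you.

    # Source-oriented metadata: a catalog mapping logical data elements to the
    # operational source that holds them, plus the key used to select records.
    # All names here are assumptions made for the sake of the example.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class SourceMapping:
        system: str      # operational source system
        table: str       # table (or API entity) within that system
        key_column: str  # column used to select the specific records needed
        columns: tuple   # data elements this source can supply

    SOURCE_CATALOG = {
        "customer_profile": SourceMapping(
            "crm_prod", "customers", "customer_id", ("name", "segment")),
        "recent_transactions": SourceMapping(
            "core_banking", "transactions", "customer_id", ("txn_id", "amount", "posted_at")),
    }

    def build_selection(element: str, key_value: str):
        """Resolve a logical request to (source system, parameterized SQL, bind values)."""
        m = SOURCE_CATALOG[element]
        sql = f"SELECT {', '.join(m.columns)} FROM {m.table} WHERE {m.key_column} = ?"
        return m.system, sql, (key_value,)

The staff’s job reduces to registering the source; the navigation and selection criteria follow from the metadata.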

ERP systems have become a de facto technology component at many companies, providing a single point of access for all financial and resource information. Many companies are so captivated by these packages that they are willing to modify their operational methods to match the way the products actually function. Good or bad, this shift has introduced a new era in information integration. Previously heterogeneous systems like HR, general ledger, and inventory now coexist in ERP environments, alleviating the need for the tremendous amount of custom development that used to be necessary to integrate operational data. Now you can just buy an ERP suite.

EAI (Enterprise Application Integration) tools have come into existence as a means of simplifying data navigation and extraction from packaged applications. After all, while ERP systems integrate data for their own use, sharing their data with other applications is a different issue altogether. From a data warehousing perspective, EAI technology can dramatically simplify data extraction and movement between the data warehouse and different systems. Leading ETL vendors now leverage EAI technology in their toolsets.

Similarly, EII (Enterprise Information Integration) has emerged to support data integration between multiple, disparate systems. The idea is to package pre-built data navigation and extraction technology (such as EAI) to enable metadata-driven data integration. The data is actually integrated with little or no code development.

The question EII addresses is: Why load a billion customer records when I’ll only need to access a single customer? And why integrate all my data sources if I only need a specific view of information from two or three individual sources? As the two categories I identified—detailed inquiry and content-bound analytics—suggest: Why find, gather, extract, cleanse, and load huge quantities of detailed data when my business requirements tell me that all I need are several discrete data elements at a given point in time?
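
A small sketch of that idea follows, with the two source connectors stubbed out in Python: the customer key is pushed down to each relevant source, and the handful of records that come back are merged outside the operational systems. The function names and returned fields are assumptions for illustration, not part of any particular EII product.

    # EII in miniature: answer a single-customer question by querying only the
    # sources (and only the records) that matter, then merging the results in
    # the integration layer rather than loading the full customer base.
    from typing import Any, Dict, List

    def fetch_profile(customer_id: str) -> Dict[str, Any]:
        # Stand-in for a keyed read against the CRM system (one record).
        return {"customer_id": customer_id, "name": "A. Customer", "segment": "retail"}

    def fetch_open_orders(customer_id: str) -> List[Dict[str, Any]]:
        # Stand-in for a keyed read against the order-management system.
        return [{"order_id": "o-1001", "status": "open"}]

    def customer_view(customer_id: str) -> Dict[str, Any]:
        """Integrate two sources on demand, outside the operational systems."""
        view = fetch_profile(customer_id)
        view["open_orders"] = fetch_open_orders(customer_id)
        return view

    print(customer_view("c-42"))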

Data Integration in the Real World

The advent of strategic programs like business performance management and customer relationship management has further defined the need for fast analysis of discrete operational data. A single strategic project won’t justify an enterprise data warehouse, yet companies seeking to get closer to business operations or retain more customers can’t live without analytics.

Consider a banking example: A customer finds out she has overdrawn her checking account and contacts the bank’s call center. While an enterprise view of the customer (her mortgage balance, line-of-credit details, or even her value score) is helpful, it won’t help the CSR address this particular problem. Access to the customer’s account number and most recent transaction is what’s needed.

There’s also the issue of data latency. Most data warehouse content is delayed by the nightly or weekly processing window. Enterprise integration only works when the required data is ready and available; in this instance, only operational systems will do. EII can provide SQL database access while conforming to the navigational restrictions of the operational system. This is feasible because the application requests only the records it needs and integrates them outside the operational system.
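
Under the assumption that the overdraft question must be answered from current operational data rather than last night’s warehouse load, the call-center lookup might reduce to something like the Python sketch below; the transactions table and its columns are invented for the example.

    # The CSR lookup: one keyed read against the live operational store, returning
    # only the latest posted transaction for the account in question.
    # The schema is hypothetical.
    import sqlite3
    from typing import Optional

    def most_recent_transaction(conn: sqlite3.Connection, account_number: str) -> Optional[tuple]:
        """Fetch the latest transaction for one account, current as of right now."""
        return conn.execute(
            "SELECT txn_id, amount, posted_at FROM transactions "
            "WHERE account_number = ? "
            "ORDER BY posted_at DESC LIMIT 1",
            (account_number,),
        ).fetchone()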

So, ERP, data warehouse, ODS, EAI, EII… How do we know when to use what technology? It has everything to do with your business requirements. Not every data integration problem requires massive quantities of data. Your toolbox should be able to address any integration problem (regardless of size) without always requiring big servers or complex software. Your requirements will drive answers to the following three questions:

  • Do business users need discrete data elements?
  • Do they need an integrated view of the data from multiple sources?
  • Do they need an aggregated answer set?

So where does this leave the data warehouse? It will remain the workhorse for ad hoc queries against cross-functional data and for large-scale analysis of historical detail, and it is still mandatory for providing the coveted single version of the truth across the enterprise.

For now, anyway.


About the Author

Evan Levy is an industry-recognized information technology expert, speaker, and business consultant. As partner and co-founder of Baseline Consulting (www.baseline-consulting.com), he has developed enterprise information systems at Boeing, Verizon, and Charles Schwab.

He has been published in a wide array of industry magazines, is a faculty member of The Data Warehousing Institute, and is a featured speaker at Marcus Evans, DCI, the CRM Association, DAMA International conferences, and other industry events.

Evan is a key advisor to technology vendors, corporate executives, and venture capital firms. Major technology vendors — including Oracle, IBM, and BEA Systems — seek his advice in product planning. Corporate executives request his counsel in applying advanced technologies to key business initiatives. The investment community relies on Evan for insight into new product viability.


Citation

Levy, Evan, "Beyond the Data Warehouse: Architectural Alternatives for Data Integration", DSSResources.COM, 04/08/2005.


John Earle, Baseline Consulting, provided permission to archive and feature this article at DSSResources.COM on February 28, 2005. This article was posted at DSSResources.COM on April 8, 2005.