Ten mistakes to avoid if your data warehouse is to deliver quality information
By Larry P. English
As the era of "second generation" data warehousing arrives, organizations should be exploiting their information to provide effective business intelligence and to increase customer satisfaction through effective customer relationship management (CRM). But, alas, for many organizations "second generation" means scrapping their first data warehouse and starting over.
The tragedy is that such failures are not the exception-they are the norm! According to some data warehouse specialists, only eight percent (8%) of all data warehouses are successful after three years of operation, as measured by growth in the number of customers of the data warehouse and in the number of new information types being propagated to support additional decision-support requirements.
The definition of a wise person is “someone who learns from the mistakes of others.” How are organizations learning from their own data warehousing mistakes, and what shall they learn from the mistakes of others?
This white paper is dedicated to those wise persons who can learn from others how to avoid the ten mistakes that, if made, virtually guarantee a failed-or, at best, seriously sub-optimized-data warehouse. Here is how to avoid making the same costly mistakes.
1. Failing to understand the purpose of data warehousing and the types of business problems to be solved with data warehouse information
The first mistake is failure to understand the problems that the organization must solve with the data warehouse. Here we use an encompassing definition of data warehouse to include operational data stores (ODS), aggregated data warehouses (DW) and data marts. (See the Glossary at the end of the article.)
Many organizations have sub-optimized their data marts by taking "departmental" or "functional" collections of data and aggregating them into "departmental" or "functional" data marts. This creates two problems: 1) it merely replicates any design problems that existed in the source databases or files; and 2) it usually cannot be extended, without considerable cost, to "integrate" other departmental or functional information into an "enterprise" view of operational health.
A variation of this mistake is to design the data warehouse to solve the problem of producing predominantly operational reports or queries that should be produced from operational data, but cannot be for a variety of reasons. Those reasons include poor operational database design, lack of easy access to the operational database, and poor-quality data in the source databases, among others. The objective of redesigning a database to meet these operational needs is worthy. The use of a data warehouse to solve these problems (as a permanent solution) is highly problematic. All this does is: 1) make the nonquality operational databases permanent; 2) create the information float of extracting, transforming and loading the data into the data warehouse or ODS (operational data store); and 3) fail to address and solve the problem at its root cause in the operational processes and databases. The correct strategy is to develop an enterprise-oriented, architected "Operational Database" (ODB)-not just an ODS-to which operational systems can be migrated on an as-needed basis.
For a data warehouse to support strategic processes, you must design the data structure to:
2. Failing to understand who the real "customers" of the data warehouse are
Once you know the purpose of a data warehouse, you must understand its customers. One early mistake in data warehousing was the "build it and they will come" syndrome, without considering who "they" were. Without knowing the knowledge workers, or information customers, who need information to make their decisions, you cannot know the customer expectations. Without knowing customer expectations, it is difficult-if not impossible-to design a data warehouse to meet them. Another mistake here was to build a data warehouse or data mart to support only one narrow set of customers, usually from one functional area. Designing the data warehouse around one small set of functionally oriented information views virtually dooms the data warehouse to an inflexible design. That design will be even more costly to change, because all of the new ECTL (Extract, Correct, Transform and Load) applications must then be modified as well.
The data warehouse must be implemented incrementally, but not myopically. In order to build a successful data warehouse you must:
3. Assuming the source data is "OK" because the operational systems seem to work just fine
An insurance company was unpleasantly surprised by its data warehouse analysis of newly loaded claims data needed to analyze risk. The data showed that a whopping 80% of its claims were paid for a diagnosis of "broken leg." One quick call revealed an all-too-common occurrence: the claims processors routinely-80% of the time-allowed the system to default to a diagnosis code of "broken leg." Why? It was faster than entering a code. Why did they need speed rather than a correct code? They were paid by how many claims a day they processed. This is the root cause of many information quality problems.
The valid-but inaccurate-data value of "broken leg" was of sufficient quality to pay a claim. But was the same data "quality" data for the process "analyze risk"? Absolutely not.
Message: operational data quality will be of sufficient, or "minimum," quality to support the operational processes within the performance measures of the information producers, but not necessarily of acceptable quality for other processes. This mistake dramatically illustrates the absolute denial most organizations exhibit as to the extent of their information quality problems-the assumption that because the operational processes seem to perform properly, the same data must be good enough to satisfy decision-support processes.
Because most of today's applications and databases were designed to support only departmental, functional needs, those databases may contain "quality" data for those operational processes only. There are many root causes, including, but not limited to:
Because of this, you must analyze the state of quality of the source databases or files. Your information quality assessment should include:
While the last of these assessments-physical accuracy-is the most difficult, it should be performed for all A-priority and most B-priority data, i.e., the most important data, even if only conducted on a small sample of 300 records. Why? One company found no invalid values for Marital-Status-Code, but discovered that 23.3% of those "valid" values were incorrect! Without an accuracy assessment, you may have a false sense of security as to the true state of quality of the data.
By conducting a quality assessment, you now know the state of quality. By conducting prototype data-correction activities, you can estimate the level of effort for corrective maintenance.
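To make the assessment concrete, here is a minimal sketch in Python of the kind of profiling a quality assessment can begin with. The table name, column names, and value domain are hypothetical illustrations, not prescriptions. The point is that a validity check and a value-frequency skew check together catch problems like the "broken leg" default, while a random sample supports the manual accuracy assessment against real-world facts.

    import random
    import sqlite3
    from collections import Counter

    # Hypothetical domain of valid values; yours comes from the data definition.
    VALID_MARITAL_CODES = {"S", "M", "D", "W"}

    def assess_column(values, valid_domain):
        """Profile one column: validity rate and value-frequency skew."""
        freq = Counter(values)
        invalid_rate = sum(1 for v in values if v not in valid_domain) / len(values)
        top_value, top_count = freq.most_common(1)[0]
        print(f"invalid values: {invalid_rate:.1%}")
        # A single value dominating the column (like the 80% "broken leg"
        # default) is a red flag even when every value is technically valid.
        if top_count / len(values) > 0.5:
            print(f"warning: {top_value!r} accounts for "
                  f"{top_count / len(values):.1%} of rows")

    def accuracy_sample(rows, n=300):
        """Random sample for manual verification against real-world facts."""
        return random.sample(rows, min(n, len(rows)))

    conn = sqlite3.connect("source_extract.db")  # hypothetical extract file
    rows = conn.execute(
        "SELECT customer_id, marital_status_code FROM customer").fetchall()
    assess_column([r[1] for r in rows], VALID_MARITAL_CODES)
    for record in accuracy_sample(rows):
        print(record)  # each sampled record is checked by a person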
4. Not developing an enterprise-focused information architecture-even if only developing a departmental data mart
Developing an enterprise-focused information architecture does not mean you must develop a full enterprise-wide data model. But the model of the data subject must contain enterprise-wide data definitions, so that it can be extended to support all information stakeholders-not just the departmental workers-in data that is of enterprise-wide interest. The database must also be designed so that it can be extended by the "addition" of new tables and data elements, as opposed to "destructive" change in which existing table designs must be modified to support the extensions.
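As a sketch of what extension by addition looks like in practice (the table and column names below are hypothetical), a new department's data can arrive as a new table keyed to the shared customer key, leaving existing tables, loads, and queries untouched:

    import sqlite3

    conn = sqlite3.connect("warehouse.db")  # hypothetical warehouse database

    # Initial, enterprise-defined Customer subject area (hypothetical design).
    conn.execute("""
        CREATE TABLE IF NOT EXISTS customer_dim (
            customer_key INTEGER PRIMARY KEY,
            customer_name TEXT NOT NULL
        )""")

    # Extension by ADDITION: a new department's data arrives as a new table
    # keyed to the shared customer_key. Existing tables, load programs, and
    # queries keep working unchanged.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS customer_credit_profile (
            customer_key INTEGER REFERENCES customer_dim(customer_key),
            credit_rating TEXT,
            review_date TEXT
        )""")

    # Contrast with DESTRUCTIVE change: rebuilding customer_dim itself to
    # embed departmental attributes would force every dependent ECTL
    # application and query to be modified at the same time.
    conn.commit()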
Data warehouses and even data marts are not functional databases. They must be designed to support decisions, cross-functional data analysis and cross-functional knowledge processes-those that need to see an integrated view of the enterprise data.
Data warehouses provide the most value when they solve business problems that cross organizational boundaries.
Failure to design the data warehouse as an "enterprise" resource will sub-optimize its benefits. If the operational data structures are not well defined with common, consensus definitions, simply aggregating current departmental data into a data warehouse will exacerbate the legacy data problems of the source. Such a "departmental" data warehouse will be valuable only to the one business area and will be difficult and expensive to extend to a broader, enterprise-wide perspective.
Modeling the data warehouse need not be an excessively time-consuming exercise. Bringing the right subject matter experts together, with a charter of empowerment and a skilled facilitator, can produce a consensus, enterprise-focused data model in no more time than is required to "gather" data requirements using traditional interview-and-consolidate approaches.
It is impossible to achieve an "accepted" enterprise data model without bringing together the key stakeholders who represent different functional views of the information to hammer out a consensus definition of the most common and visible entity types and attributes. There are two approaches organizations take to this dilemma. The first is to say it is too hard and not attempt it. This condemns the organization to live with a sub-optimal result. The second is to acknowledge that it is hard-but required-and to invest in the future of the enterprise and do it right.
5. Focusing on performance over information quality in data warehousing
As the name implies, the product of a data warehouse is "data." Actually, the product is information derived from data: information that supports important decisions, identifies trends that lead to significant new opportunities, and identifies potential problems before they become enterprise-threatening. Information quality ("consistently meeting all knowledge workers' and end-customers' expectations") will always be the most important objective of the data warehouse product.
Are performance and response time important? Yes, but not at the cost of sacrificing quality so that the resulting decisions cause the demise of the enterprise. What value is it if you get an answer to a query in five seconds instead of five hours if the result is wrong? If five-second response time is a real requirement, chances are good the data warehouse is solving the wrong problem-one that must be solved in the operational systems. The worst-case result of this mistake is that the enterprise will make bad decisions faster than ever. The best case is that the very customers you seek to satisfy will not be satisfied, will not use-or will minimize their use of-the data warehouse, leaving you with credibility problems for the next project you "try" to propose.
While response-time performance is a requirement, it must always be secondary to the reliability of the information results. There is an important difference between transaction performance when you are serving a customer and performance when you are analyzing data to support a decision. Timeliness of information delivery, or "response-time" performance, must be balanced with the other information quality characteristics: completeness, accuracy, definition completeness and correctness, and presentation clarity and intuitiveness. The wrong answer to a query in superior response time is still a wrong answer. Never mind that it can cause a lack of credibility in the data warehouse-the danger is that it can threaten the success of the enterprise. The product of the data warehouse is, after all, "information": information organized to support the strategic and tactical processes of the enterprise, such as trend analysis, Key Performance Indicators or Core Business Indicators, and decisions that impact the future of the enterprise.
6. Not solving the information quality problems at the source
Because the data warehouse tends to be an expensive project, the temptation is to cut the costs of development by cutting the wrong costs.
Many high-priced consultants advise you not to attempt to correct information quality problems at the source. They may be unaware of, or uncaring about, problems in the source or in other value-chain processes that are failing; or they may tell you it is "out of scope" for your data warehouse and will "lengthen" the project. They may also tell you that the politics of the organization will not let you get at the source data. If someone tells you this, you know they do not have the best interests of your enterprise at heart. Implementing this advice causes several problems:
7. Inappropriate "Ownership" of data correction/cleanup processes
Often the data correction process is simply delegated to the data warehouse or IT staff to "take care of cleaning" the data. Managers of the source information areas have neither the time nor the resources to correct it, nor do they have motivation to correct it, since their processes "work just fine."
But software alone cannot "cleanse" the data unattended. Nor can IT personnel correct and "cleanse" data to an acceptable level for the downstream knowledge workers, the actual "customers" of the data. The data correction process must be a partnership between the managers of the source information business areas, the data warehouse and information quality staff, and the manager(s) of the downstream information customer or beneficiary business area(s). The managers of the source business areas are the actual process owners of the data correction process, but the requirements of the data correction activities come from the downstream information customer business areas.
Funding for the data correction may come from the beneficiary customer areas or from a one-time enterprise data correction fund. Where you find significant IQ problems, you should:
After the initial data correction activities and the above information quality principles are put in place, the source business area managers are held accountable for the quality of the information required by their downstream information customers. At this point, you may implement mechanisms to charge the data correction costs back to the respective process owners. The quality principle from Kaizen is "make no defects, pass no defects on, and accept no defects."
8. Not developing effective audit and control processes for the data Extract, Correct, Transform and Load (ECTL) processes
The data acquisition, standardization, reengineering, correction, consolidation, enhancement, transformation, aggregation, and loading processes must be controlled just like any other business processes. In most environments, where multiple source files contain data about common entity types-such as customers, products, or business transactions-from different business lines or geographies, the complexity makes the ECTL processes even more important. The cost of failing to assure the quality of the ECTL processes is correction of the populated data warehouse and expensive scrap and rework to back out and reprocess the data or to restore the data warehouse, including the loss of opportunity caused by the downtime and lack of access.
The principles of effective ECTL include:
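One widely used control-sketched below with hypothetical claim records-is to reconcile batch control totals (record counts, amount sums, and a content hash) between the extract and the load, rejecting the batch when they disagree:

    import hashlib

    def batch_control_totals(records, amount_index):
        """Compute audit totals for a batch: count, amount sum, hash total."""
        count = len(records)
        amount_total = sum(r[amount_index] for r in records)
        digest = hashlib.sha256()
        for r in sorted(records):  # order-independent content check
            digest.update(repr(r).encode())
        return count, amount_total, digest.hexdigest()

    # Hypothetical claim records: (claim_id, diagnosis_code, paid_amount).
    extracted = [(101, "823.0", 1200.0), (102, "824.1", 300.0)]
    loaded    = [(101, "823.0", 1200.0), (102, "824.1", 300.0)]

    if batch_control_totals(extracted, 2) != batch_control_totals(loaded, 2):
        raise RuntimeError("ECTL control failure: extract and load totals "
                           "disagree; back out the batch before it reaches "
                           "the warehouse")
    print("Batch reconciled: safe to commit the load")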
9. Misuse of information quality software in the data warehousing processes
There are four fundamental misuses of information quality software tools in data warehousing and data migration. They are:
1. Not using IQ software when you should. Your goal is to minimize the costs of information scrap and rework. Automate as much of the corrective maintenance as possible.
Human intervention to correct data is time-consuming and expensive. To be sure, it is required for many types of error correction. But seek ways to automate the routine correction processes where possible, and use your people processes to tackle and correct the data where it is not possible for software to do so. Use software to:
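As a minimal sketch of the kind of routine correction software can perform safely (the standardization map and record layout below are hypothetical), anything that cannot be fixed mechanically is routed to people:

    import re

    # Hypothetical standardization map; real rules come from your data definitions.
    STATE_SYNONYMS = {"TENN": "TN", "TENNESSEE": "TN", "TEX": "TX", "TEXAS": "TX"}

    def auto_correct(record):
        """Apply only corrections software can make safely; flag the rest."""
        name, state, phone = record
        name = " ".join(name.split()).title()  # trim and case standardization
        state = STATE_SYNONYMS.get(state.strip().upper(), state.strip().upper())
        digits = re.sub(r"\D", "", phone)
        if len(digits) == 10:
            phone = f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
            needs_review = False
        else:
            needs_review = True  # not mechanically correctable: route to people
        return (name, state, phone), needs_review

    fixed, review = auto_correct(("  larry   english ", "tennessee", "615.555.0142"))
    print(fixed, "-> manual review" if review else "-> auto-corrected")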
2. Failing to keep "improving" the ECTL processes as your business grows.
Exploit the tools you have to their maximum capability. A major bank recently "improved" the IQ software cleansing processes that had been implemented in 1992 and had remained largely unchanged since then. They discovered that many of the original business rules were no longer effective for matching and consolidating their customer data: business rules and types of customers introduced since then were being overlooked in the data consolidation process. They enhanced the business rules for matching and consolidation, and increased the effectiveness of the matching processes.
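The sketch below shows a deliberately naive matching rule (fuzzy similarity over normalized names, with an assumed threshold) of the kind that silently decays: it matches trivially different spellings but misses reordered names, which is exactly the sort of gap a periodic review of matching rules uncovers:

    from difflib import SequenceMatcher

    def normalize(name):
        """Lower-case and strip punctuation so trivial variants compare equal."""
        return " ".join("".join(ch for ch in name.lower()
                                if ch.isalnum() or ch.isspace()).split())

    def likely_same_customer(a, b, threshold=0.85):
        """Toy matching rule: fuzzy similarity of normalized names."""
        return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

    print(likely_same_customer("Larry P. English", "larry  p english"))   # True
    print(likely_same_customer("English, Larry P.", "Larry P. English"))  # False:
    # word order defeats this rule, so such pairs are never consolidated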
3. Assuming that IQ software will "solve" all your IQ problems.
One organization "cleansed" its customer address data with one of the cleansing software tools. It subsequently conducted a physical accuracy assessment and discovered a twelve percent (12%) error rate in that "cleansed" data. The errors included customers who had moved and street numbers that fell within the correct range for the street but were nonetheless inaccurate, among others. The fact of the matter is that no software tool by itself can correct all data to a zero-defect level.
Please note that the message here is not to avoid the use of IQ software tools. The message is to know the limitations of the tools and to implement effective IQ processes that incorporate IQ (accuracy and customer satisfaction) assessment, process improvement, and transformation of the enterprise culture to a culture of quality of information products. The goal is to minimize the costs of low quality data and increase the opportunity gain as a result of high quality information.
Errors can be left uncorrected by the software-or even introduced by it. Such errors include:
To derive maximum value from your IQ software tools, you must implement them in the context of a robust Information Quality Management framework with well-defined, effective IQ processes that focus on defect prevention and customer satisfaction. You must understand the strengths and weaknesses of your IQ software tools. Exploit the strengths, and put in place processes that mitigate the inherent limitations of the software.
4. Not using your IQ software for defect-prevention once you have your batch cleansing/corrective maintenance processes in control.
The goal of information quality management is to design quality into the processes that produce or capture data at the source, maintain it, and move it when necessary in a controlled way so that the information "consistently meets [all] knowledge worker and end-customer expectations."
Remember that the money spent on data cleansing or correction is a cost of "information scrap and rework." If information is created correctly, and there are processes in place to keep it maintained and current, there is no need for extensive data cleansing. It is cost-adding to fix something that was not captured correctly, was not maintained correctly as it changed, or was corrupted in the process of moving it. Therefore, the most effective IQ software tool capability is defect prevention: the capability to invoke routines and apply tests that assure data is correct and non-duplicate within the applications that create and maintain data and that control required data-movement processes.
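A minimal sketch of defect prevention at the point of capture (the code domain and the defaulted-value flag below are hypothetical) looks like this: the creating application refuses both invalid values and silently defaulted ones, so nothing reaches the database that would later need cleansing:

    VALID_DIAGNOSIS_CODES = {"823.0", "824.1", "845.0"}  # assumed domain subset
    # e.g. the system's "broken leg" default from the earlier example

    class CaptureError(ValueError):
        pass

    def capture_claim(claim_id, diagnosis_code, code_was_defaulted):
        """Validate at creation time instead of cleansing downstream."""
        if diagnosis_code not in VALID_DIAGNOSIS_CODES:
            raise CaptureError(f"{diagnosis_code!r} is not a valid diagnosis code")
        if code_was_defaulted:
            # Defect prevention: do not let the application silently accept
            # a default in place of a deliberately entered value.
            raise CaptureError("diagnosis code must be entered, not defaulted")
        return {"claim_id": claim_id, "diagnosis_code": diagnosis_code}

    print(capture_claim(1001, "824.1", code_was_defaulted=False))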
Use information quality software effectively and efficiently to analyze, assess, and correct or cleanse data.
Information Quality Management requires implementing processes of quality assessment, process improvement, education, and transforming the performance measures to include customer satisfaction of downstream information customers.
10. Failing to exploit this opportunity to "correct" some of the wrongs created by the previous 40 years of bad habits
The data warehouse represents an opportunity for organizations to learn from their past mistakes. When the data warehouse is developed properly-avoiding the mistakes described above that can cripple it, squander valuable resources, and damage your credibility-you have a golden opportunity to reengineer the business processes and legacy applications, significantly reducing waste and increasing the value of information systems and services.
The data warehouse provides a way out of the legacy problems of the past. By building an Operational Database (ODB)-not just a retrieve-only Operational Data Store (ODS)-designed to reengineer the operational data structures, you can begin to migrate source applications to a newly architected and truly integrated operational data environment. This enables a phased approach to eliminating legacy applications as they come up for reengineering. For organizations with ERP systems, which invariably do not contain all the data the organization needs to know about its important information types, the ODB is a way forward. The ODB can serve either as an origination source of master data for an ERP system, or as an extracted, accessible record-of-reference database in which additional attributes can be added to ERP data to enhance its value to other enterprise processes and applications.
Avoiding these mistakes not only reduces the risk of data warehouse failure, it increases the value of the information products provided. With that comes an increase in the effectiveness of the business decisions supported by your data warehouse environment.
The vision for using an ODB as a basis for an integrated operational environment is described on pages 278-283 in Improving Data Warehouse and Business Information Quality, by Larry English, New York: John Wiley & Sons, 1999. The book further describes the six processes of TIQM that can significantly reduce the cost of information scrap and rework and increase the opportunity gain enabled from your data warehouse and operational information environments.
English, Larry P., Improving Data Warehouse and Business Information Quality, New York: John Wiley & Sons, 1999, p. 24.
Glossary
Data Warehouse (DW): A system of information technologies and a collection of data designed to support the strategic, management, and decision-making processes of the enterprise. In other words, data designed to support the business in ways that operational databases cannot.
ECTL: Acronym for Extract, Correct, Transform and Load-the processes of propagating data from a source to a target database or data warehouse, including retrieving data from the source, correcting it (in the source), converting it to the proper format, aggregating it, and storing or updating it in the target.
Knowledge Worker: 1. The role of individuals in which they use information in any form as part of their job function or in the course of performing a process, whether operational or strategic. Also referred to as an information consumer or customer. Accountable for work results created as a result of the use of information and for adhering to any policies governing the security, privacy, and confidentiality of the information used. 2. Coined by Peter Drucker no later than 1973, the term refers to business personnel or "workers" in the Information Age who use information in their work, as opposed to the term "user."
Operational Data Store (ODS): A database of detailed data, designed to be standard and enterprise-sharable, used to support the operational activities of an enterprise. It is used as a means for reengineering operational applications to eliminate obsolete legacy databases or files in a planned, controlled fashion.
About the Author
Larry P. English, President and Principal of INFORMATION IMPACT International, Inc. (www.infoimpact.com), is an internationally recognized speaker, teacher, consultant, and author in information and knowledge management and information quality improvement. He has provided consulting and education in 28 countries on five continents. Mr. English was featured as one of the "21 Voices for the 21st Century" in the January 2000 issue of Quality Progress. DAMA awarded him the 1998 "Individual Achievement Award" for his contributions to the field of information resource management. He has chaired 10 Information Quality Conferences in the US and Europe.
Mr. English's methodology for information quality improvement-Total Information Quality Management (TIQM™)-has been implemented in several organizations worldwide. He writes the "Plain English on Data Quality" column in the DM Review. Mr. English's widely acclaimed book, Improving Data Warehouse and Business Information Quality, has been translated into Japanese by the first information services organization to win the Deming Prize for Quality. English provides consulting and education to organizations to help them implement effective IQ processes that reduce the costs of nonquality data and exploit the value of quality information.
Mr. English's mission is to provide guidance, consulting and education to organizations to help them increase their competitiveness and effectiveness by implementing information management and information quality processes that reduce the costs of nonquality data and exploit the value of their information assets. He can be reached at Larry.English@infoimpact.com.
Larry P. English provided permission to publish and store the article "Mistakes to Avoid if Your Data Warehouse is to Deliver Quality Information" at DSSResources.COM on Friday, July 26, 2002. The final article was received from Diane English on Friday, August 2, 2002. This formatted version of the article was posted at DSSResources.COM on August 11, 2002. Copyright © 2002 INFORMATION IMPACT International.
English, L. P., "Ten mistakes to avoid if your data warehouse is to deliver quality information", DSSResources.COM, 08/11/2002.