Better decisions through better data quality management

by Tony Fisher, President and General Manager, DataFlux, and
George Marinos, National Partner, Data Quality Practice, PricewaterhouseCoopers

In today’s market, everyone is seeking a way to gain competitive advantage. Most companies are focused on building new systems, implementing new strategies, or identifying new markets. What is often ignored, or simply assumed to be adequate, is the quality of the data that supports those decisions, and how managing that quality can help gain a competitive advantage.

Organizations depend on data. Regardless of industry, revenue size or the market it serves, every company relies on its data to produce information for business decision-making. The quality of the results from any analysis is only as good as the quality of the inputs (the data) that feed that analysis. Given today’s accessibility and quantity of data, people often try to perform new types of analysis to gain a more competitive edge. Many times they are trying to answer the same question; however, because of poor data management, they seek alternate analysis methods instead of trying to improve the data behind their existing analysis. But with so much riding on it, does data quality get the attention it deserves in your organization?

No one wants to admit that their business decisions may be based on inaccurate or incomplete data. However, recent surveys by PricewaterhouseCoopers and a 2002 survey conducted by The Data Warehousing Institute (“Data Quality and the Bottom Line”) suggest that a majority of firms have insufficient data quality management processes in place.

This implies that many enterprises have never assessed the potential severity of data quality problems and their impact on the bottom line. The most sophisticated databases and warehouses will not ensure business success if data is inconsistent, redundant or full of errors. More often than not, data quality problems are not given the attention they deserve until an underlying application or system fails, such as CRM or data warehousing. Not until an initiative is deemed a failure, or the return on investment (ROI) is not achieved, does data quality come to the forefront.

Effective data quality management relies on a combination of people, process and technology. Here we explore the major issues in data quality management and how to use these three elements to achieve better data quality. Our methodology is illustrated by two case studies of successful data quality management implementations.

WHAT IS DATA QUALITY?

Simply put: “Good data quality means that an organization’s data is accurate, complete, consistent, timely, unique and valid.” Data quality is often defined as a process of arranging data so that individual records are accurate, up to date and consistently represented. Accurate, actionable information relies on “clean” and consistent data, whether name-and-address data such as names, postal addresses, e-mail addresses and phone numbers, or other data such as part numbers, product names and sales figures.

Data is a key strategic asset, and ensuring its quality has become a business imperative. However, today an organization’s data comes from a wide variety of sources, including legacy systems, databases, external providers and the Web. The result is tremendous differences in formats and origins. Quality is often compromised and integration becomes impossible. Spending the money, time and resources to collect massive volumes of data without ensuring its proper management is ineffective and often leads to failed business initiatives.

PROBLEMS AND CHALLENGES IN CREATING TRUSTWORTHY DATA

Bad data can develop in a number of ways - from errors in data entry, incorrect data received from internal or Internet forms, discrepancies between different systems or faulty data acquired from an outside source. It is also common to find good data combined with outdated information so that it is impossible to distinguish between the two.

To create good data, organizations must examine all available information and create a data management plan for how it will be used. You will also need to determine which data is good (accurate, complete, timely, unique and valid) and which is bad, and to decide how bad it is. Most organizations have little or no idea of the quality of the data residing in their systems and applications. In a 2001 PricewaterhouseCoopers survey, only one-third of responding companies were very confident in the quality of their data.

There is a good reason for this: many data anomalies plague most corporate systems. These include:

  • No standardization:
    • Differences in the spelling of names, corporations or other common words and abbreviations, such as Street, st and St.
    • Salutations may vary from one application to another, such as Mr., Mister, no salutation, etc.
    • Formats are often inconsistent across applications. For example, different systems use a variety of date formats: dd/mm/yy, yyddmm, etc.
    • Many systems overload name fields, allowing entries such as “Unknown,” “Joint Account” or “Corporation.”
  • Incorrect data. Often, special codes are utilized to signify unknown or default data. For example, entering a social security number of 111-11-1111 or a birth date of 01/01/01 may be permitted by systems in order to complete a transaction when data is unknown.
  • Data does not adhere to corporate business rules. For example, a salary may be out of the range of a particular salary grade.
  • Data is stale. Data that has decayed or changed over time is now of little value.
  • Data may be ambiguous across systems, with no consistent key to permit the joining of the information across systems.

For example, names and addresses may be depicted in various ways, depending upon the system in which the data was originally entered. Two separate systems might use different numbering schemes to encode customer information, with one system using the customer’s last name and a number, and the other using a random number. The result is multiple, inconsistent representations of the same customer.
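
As a rough illustration, the short Python sketch below flags a few of these anomalies, such as placeholder identifiers, filler dates and non-standard salutations. The field names, date formats and accepted values are hypothetical, not drawn from any particular system.

import re
from datetime import datetime

# Hypothetical record from a customer system; the field names are illustrative only.
record = {"ssn": "111-11-1111", "birth_date": "01/01/01", "salutation": "Mister"}

# Placeholder values often entered just to force a transaction through.
PLACEHOLDER_SSNS = {"111-11-1111", "000-00-0000", "999-99-9999"}

def looks_like_placeholder_ssn(ssn: str) -> bool:
    digits = re.sub(r"\D", "", ssn)
    return ssn in PLACEHOLDER_SSNS or len(set(digits)) <= 1

def parse_date_any(value: str):
    """Try a few of the date formats mentioned above; return None if none of them fit."""
    for fmt in ("%d/%m/%y", "%m/%d/%y", "%y%m%d", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None

def suspicious_birth_date(value: str) -> bool:
    parsed = parse_date_any(value)
    # Unparseable dates and known "filler" dates such as 01/01/01 are both suspect.
    return parsed is None or parsed == datetime(2001, 1, 1)

issues = []
if looks_like_placeholder_ssn(record["ssn"]):
    issues.append("SSN appears to be a default or placeholder value")
if suspicious_birth_date(record["birth_date"]):
    issues.append("birth date is missing, malformed or a filler value")
if record["salutation"] not in {"Mr.", "Mrs.", "Ms.", "Dr.", ""}:
    issues.append("salutation is not standardized")

print(issues)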

ACHIEVING EFFECTIVE DATA MANAGEMENT

In response to these problems, today’s enterprises are increasingly seeking solutions to improve their data quality. Let’s look more closely at the people, process and technology required for truly effective data quality management.

People

Data quality has traditionally been considered a standardization activity best left to the database management team. The tendency has been to allow IT departments to not only physically change and correct the data, but to also create the rules and routines by which the data is transformed. Unfortunately, what is considered good, clean, usable data from a technical standpoint may not be the complete, accurate and timely information the business user (the ultimate “consumer” of the data) needs to perform a given set of tasks or make informed business decisions.

Effective data quality requires that business users define what constitutes “good” data for a particular task, process or project. Data quality then moves from being about the technical “correctness” of the data to a concept of data integration, where a company’s accumulation of digital information is transformed into a strategic asset by creating a consistent, timely, reliable view of enterprise data.

Process and Technology

Now, let’s examine the basic data quality management process and technologies that help deliver high-quality, consolidated data to the right place at the right time. The needs and goals of organizations vary greatly. At the same time, the level of data quality depends heavily on how the data is collected, stored and used. Therefore, a general process is required to unite the key components of data quality management, supported by a technology that can effectively implement them. The four basic steps of this general process include:

  • Data Profiling
  • Data Quality
  • Data Integration
  • Data Augmentation

Data Profiling

Data profiling is the discovery phase: when you learn to understand, locate and document problems with your data. Listing every database and table, and even identifying the departments that generate problem data, will help establish where to drill down for signs of a data quality problem. The drill-down phase may include profiling suspect data for ranges, performing frequency counts or checking for valid primary key relationships. You may decide to search for patterns or duplicate records, or to perform valid-value analyses.

The discovery phase is critical for generating a new set of basic operating rules and procedures. Remember, this phase is only about looking at your data and the collection or transformation processes behind it, not about generating or applying techniques to fix problems. Resist the temptation to immediately begin transforming the data without first exploring potential sources of trouble. Data that seems to be in error may in fact turn out to be the correct representation once you examine it or the process from which it was generated.
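
As a minimal sketch of this discovery step, the following Python snippet computes the kinds of profiling measures mentioned above (null counts, distinct values, frequency counts and character patterns) over an invented column of state values:

from collections import Counter
import re

# Hypothetical extract of a "state" column from a customer table; the values are invented.
values = ["NC", "nc", "N. Carolina", "NC", "", None, "NY", "New York", "NY"]

non_null = [v for v in values if v not in (None, "")]

def char_pattern(value: str) -> str:
    """Reduce a value to a pattern: letters become 'A', digits become '9'."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

profile = {
    "row_count": len(values),
    "null_or_blank": len(values) - len(non_null),
    "distinct_values": len(set(non_null)),
    "frequency_counts": Counter(non_null).most_common(),
    "patterns": Counter(char_pattern(v) for v in non_null).most_common(),
}

for name, result in profile.items():
    print(f"{name}: {result}")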

Data Quality

Data quality techniques are designed into software applications to improve the accuracy of data. Two such processes are data standardization and data validation.

Data Standardization - Unfortunately, data can be ambiguously represented; this is the very root of most data quality problems. If multiple permutations of a piece of data exist within a dataset, every query or summation report generated from that dataset must account for each and every one of those permutations. Otherwise, important data points can be missed, severely impacting the output. For example, a product name can be represented a multitude of ways:

Blue Turtle Neck, Turtle Neck: Blue, B Turtle Neck, Shirt: TurNeck B

Or an address:

100 E Main Str, 100 East Main Street, 100 East Main, 100 Main St, East.

Each item has the same meaning but is represented very differently, leading to a wide variety of potential analytical problems. Suppose that a life insurance company wants to determine the top ten companies that employ its policyholders in a given geographic region, in order to tailor policies to those specific companies. If its data is in the above condition, inaccurate aggregation results are likely, due to all the possible permutations of data for a given company name.

Such simple data inconsistency examples are endemic to databases worldwide. Fortunately, data quality technologies now exist that identify the various permutations of data and can rectify them according to rules established by the user. These include physically standardizing the data within the dataset, creating synonym tables or filters, or correcting undesired permutations before they enter the dataset in the first place.

More importantly, these rules for standardization can be maintained external to the application or dataset and be applied to standardize data across the enterprise.
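
As a minimal sketch of externally maintained standardization rules, the Python snippet below applies invented abbreviation rules and product-name synonyms to the examples above; commercial data quality tools manage far richer rule sets:

import re

# Illustrative standardization rules, kept separate from any one application so they
# can be shared across the enterprise. The rules and synonyms below are hypothetical.
ADDRESS_RULES = [
    (r"\bStr?\b\.?", "Street"),
    (r"\bE\b\.?", "East"),
]

PRODUCT_SYNONYMS = {
    "turtle neck: blue": "Blue Turtle Neck",
    "b turtle neck": "Blue Turtle Neck",
    "shirt: turneck b": "Blue Turtle Neck",
}

def standardize_address(address: str) -> str:
    for pattern, replacement in ADDRESS_RULES:
        address = re.sub(pattern, replacement, address, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", address).strip()

def standardize_product(name: str) -> str:
    return PRODUCT_SYNONYMS.get(name.strip().lower(), name)

print(standardize_address("100 E Main Str"))     # -> 100 East Main Street
print(standardize_product("Turtle Neck: Blue"))  # -> Blue Turtle Neck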

Data Validation - Every company has basic business rules. Loosely defined, business rules consist of the metrics that identify whether your organization is operating according to preset standards. These business rules cover everything from basic lookup rules such as:

Salary Grade    Salary Range Low    Salary Range High
20              $25,000             $52,000
21              $32,000             $60,000
22              $35,000             $75,000

to complex, very specific formulas such as:

Reorder_Quantity = (QuantPerUnit * EstUnit)[Unit_type] - Inventory_onHand

Many basic business rules can be checked during data entry and rechecked on an ad-hoc basis. The problems that arise from a lack of validation can be extensive: from overpaying expenses and running out of inventory to undercounting revenue. Again, today’s applications can store, access and implement these basic business rules to ensure data validation. Rules should be stored external to any single application so they can be shared by all applications, avoiding conflicts across data stores.
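
A minimal validation sketch, using the salary-grade table above and a simplified reading of the reorder formula (the [Unit_type] qualifier is ignored here), might look like this in Python; the example values are invented:

# The rule data below mirrors the salary-grade table above; in practice such rules
# would be kept external to any single application so they can be shared.
SALARY_GRADES = {
    20: (25_000, 52_000),
    21: (32_000, 60_000),
    22: (35_000, 75_000),
}

def salary_in_range(grade: int, salary: float) -> bool:
    """Check the basic lookup rule: the salary must fall within the range for its grade."""
    low, high = SALARY_GRADES[grade]
    return low <= salary <= high

def reorder_quantity(quant_per_unit: float, est_units: float, inventory_on_hand: float) -> float:
    # Simplified form of Reorder_Quantity = (QuantPerUnit * EstUnit) - Inventory_onHand.
    return quant_per_unit * est_units - inventory_on_hand

print(salary_in_range(21, 45_000))    # True: within $32,000-$60,000
print(salary_in_range(20, 60_000))    # False: above the grade 20 maximum
print(reorder_quantity(12, 40, 150))  # 330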

Data Integration

Data integration is a combination of:

  • Data Linking: finding common fields between databases or within a single database
  • Data Consolidation: joining or merging these common fields

Data Linking - Data linking becomes an issue when the columns that constitute the join fields between multiple datasets contain data that is inconsistently represented. Data linking in an application environment may be explicit, such as when multiple data sources are physically joined and a new data store is created. The data can also be linked implicitly, using match codes (described below) to perform a runtime join without creating a consolidated data store. For example, trying to combine a customer table with an outside demographic data source will have undesirable results if the join column contains ambiguous representations of data.

The following example shows multiple representations of the same company name:

Data Source A (Customer Dataset)

Columns: Customer Name, Contact

Data: First Bank of Denver, Joe Snow

Data Source B (Demographics)

Columns: Company Key, Num Employees, Business Type, Annual Sales

Data: The 1st Bank of Denver, 850, Financial, $62 million

Obviously, a standard join statement would not recognize that these two records refer to the same bank, and the demographic data would not be joined to the customer data. To achieve a join in this scenario, a match code can be used that distinctly represents the company name. The code is generated by data quality algorithms and may look something like RX19E4. The same code will be generated when any variation of “First Bank of Denver” is passed through the match code generator. This code then becomes the basis of the match between data sources. These codes can be stored as an extra column in each data source, stored in a temporary table or file, or generated solely at runtime.

While data integration may not always be considered a data quality problem, algorithms and procedures can achieve much higher match rates when combining data from multiple sources. Such integrated datasets form the basis for many business intelligence applications.
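
Commercial tools generate match codes with proprietary algorithms; purely as an illustration of the idea, the Python sketch below normalizes a company name by dropping noise words and mapping spelled-out ordinals, then hashes the result so that the two “First Bank of Denver” variations above collapse to the same key. The normalization rules are invented for this example.

import hashlib
import re

# Hypothetical normalization rules; real match-code algorithms are far more sophisticated.
NOISE_WORDS = {"the", "of", "inc", "corp", "co"}
NUMBER_WORDS = {"first": "1st", "second": "2nd", "third": "3rd"}

def match_code(name: str) -> str:
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    tokens = [NUMBER_WORDS.get(t, t) for t in tokens if t not in NOISE_WORDS]
    normalized = " ".join(tokens)
    # Hash the normalized form into a short key usable as a join column.
    return hashlib.md5(normalized.encode()).hexdigest()[:6].upper()

print(match_code("First Bank of Denver"))
print(match_code("The 1st Bank of Denver"))
print(match_code("First Bank of Denver") == match_code("The 1st Bank of Denver"))  # True

Stored as an extra column in each source or generated at runtime, such a key can support either the explicit or the implicit joins described above.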

Data Consolidation - Once you have determined that multiple records represent the same data element, you must determine what process to follow to consolidate the duplicate/redundant data. Some of the problems that can arise from redundant data within datasets include inaccurate analysis, increased marketing/mailing costs, customer annoyance and breakdowns across relational systems (such as CRM systems). Since data serves as the foundation of your business intelligence infrastructure, it is imperative that these situations be identified and eliminated. The following are examples of duplicate records that cannot be caught without some form of data quality technology (or else long hours of human inspection and effort, which are unlikely to catch as high a percentage of duplicates and are extremely difficult with anything more than small data volumes):

Robert Smith, 100 E Johnson Street

Bob Smythe, 100 East Johnson

Dr. Robert J. Smith, 100 E. Johnston St.

Because data can be ambiguously represented, the same customer, prospect, part, item, transaction or other important data can occur multiple times. In cases like these, the redundancy can only be determined by looking across multiple fields, which requires a data quality technology tool. Your consolidation process may consist of deduplication (removing certain records), merging (choosing the best information across multiple records) or keeping the information from all data sources.
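
As a minimal sketch of consolidation, the Python snippet below groups records by a match key (assumed to come from a step like the one above) and merges each group by keeping the most complete value per field; the records and the survivorship rule are illustrative only:

from itertools import groupby

# Illustrative duplicate records sharing a hypothetical match key.
records = [
    {"match_key": "RX19E4", "name": "Robert Smith",        "address": "100 E Johnson Street"},
    {"match_key": "RX19E4", "name": "Bob Smythe",          "address": "100 East Johnson"},
    {"match_key": "RX19E4", "name": "Dr. Robert J. Smith", "address": "100 E. Johnston St."},
]

def consolidate(group):
    """Merge duplicates by keeping the longest (most complete) value for each field."""
    return {field: max((rec[field] for rec in group), key=len) for field in group[0]}

records.sort(key=lambda r: r["match_key"])
survivors = [
    consolidate(list(group))
    for _, group in groupby(records, key=lambda r: r["match_key"])
]
print(survivors)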

Data Augmentation

You can make applications still more effective by resolving missing and/or inaccurate data, using an external data reference. This includes not only filling in missing values and replacing inaccurate ones, but also adding data values to a record or data observation to provide a more complete picture of the entity stored in the dataset.

A common example is using the United States Postal Service’s master address database to verify and/or correct existing addresses within the corporate database. This can greatly increase address integrity, as well as provide a basis for additional applications such as geocoding, mapping and other visualization technologies that require a valid address as a starting point. Technology such as this can make an effective addition to business intelligence applications.
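
As a minimal sketch of augmentation, the Python snippet below fills a record’s missing fields from a reference keyed on a standardized address; every value, including the reference data standing in for an external source, is invented:

# Hypothetical customer records with gaps, and an invented external reference keyed
# on a standardized address; in practice the reference might be a postal database.
customers = [
    {"id": 1, "address": "100 E Main Str", "zip": None,    "latitude": None},
    {"id": 2, "address": "42 Oak Ave",     "zip": "27513", "latitude": None},
]

reference = {
    "100 East Main Street": {"zip": "27511", "latitude": 35.73},
    "42 Oak Avenue":        {"zip": "27513", "latitude": 35.79},
}

def augment(record: dict, standardized_address: str) -> dict:
    """Fill only the fields that are missing; never blindly overwrite existing data."""
    for field, value in reference.get(standardized_address, {}).items():
        if record.get(field) in (None, ""):
            record[field] = value
    return record

print(augment(customers[0], "100 East Main Street"))
print(augment(customers[1], "42 Oak Avenue"))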

DATA QUALITY METHODOLOGY IN ACTION

Following are two case studies which demonstrate how the above methodology of people, process and technology works in a real-world setting.

CASE STUDY: PROCUREMENT ANALYSIS

In order to achieve cost savings goals defined by senior management, the procurement officer of a global insurance organization initiated a project to analyze how and where the organization spent its money. The officer wanted to learn: 1) the total amount of money spent; 2) which products or services were being purchased; and 3) from which vendors the organization was making its purchases.

Having this information would help the organization negotiate more favorable contracts; consolidate purchasing of specific products or services; and determine the level of compliance with existing vendor contracts.

While rudimentary, these questions had gone unanswered up to this point because the data required was stored in multiple systems, no link existed to combine the data from these systems and some of the key data were not captured by the systems or were captured inconsistently. The project’s success depended upon integrating, standardizing and enhancing the quality of this data. The officer therefore implemented the process depicted in Exhibit 1 below:

This solution allowed the procurement team to obtain a full picture of their suppliers. For example, they found that more than 20 accounts in the various source systems represented one normalized account. Without the above process, the analysis would have shown these as separate organizations, and leverage in negotiation would have been lost. Furthermore, without standardizing spending categories, the analysis would not have revealed the true amount being spent on various products and services. The information would have been either too broad (e.g., information technology) or too detailed (e.g., a specific type of computer).

Such a solution clearly includes the three key elements of people, process and technology. The people consisted of the analysis team and the end users who fully understood the data. The process consisted of piecing together the various steps, while the technology consisted of dfPower Studio (DataFlux) and the procurement database. This case study vividly illustrates the importance of data quality and its impact on business decisions. The questions being asked were not complex, but getting answers based on high quality, integrated data was critical to the success of the project.

CASE STUDY: FIGHTING MONEY LAUNDERING

The U.S. Patriot Act has caused compliance departments in a number of industries to take a hard look at the quality of their data. For example, the compliance officer of a global banking institution was asked to ensure that his organization was in compliance with the Act. Specifically, he needed to gather additional information on the bank’s customers and to clean the existing information. He also needed to put the appropriate policies in place to ensure that data would be maintained at the quality needed to appropriately rate clients for risk.

The bank therefore needed to implement a solution that would take existing customer data and properly link it to additional information provided by third parties. After the initial linking, existing systems would be updated to capture this additional information. The analysis hinged upon the quality of the source data and the ability to link the customer database to a standardized listing of the required fields (e.g., country of jurisdiction, address, name) contained in the data obtained from third parties. The process in Exhibit 2 was implemented:

This process was chosen due to the large volume of data involved and the short timeframe required to implement the solution. In order to ensure the highest quality matching, a separate manual validation process was implemented, giving end users the ability to either accept the automated link or override it. In this case, a separate project provided an additional data element that had not been captured but was critical to the bank’s compliance with the U.S. Patriot Act.

Again, this solution includes people, process and technology. The people consisted of the project team and the end users who properly understood the data. The process consisted of piecing together the various steps, while the technology consisted of dfPower Studio (DataFlux), the database of current customer information and another database storing the newly gathered information.

CONCLUSION

The quality of any analysis is only as good as the data upon which it is based. All too often, companies are finding that a variety of analyses still yield faulty results, because the data is not of high quality. Their success now depends on finding a data quality solution and methodology that integrates every aspect of operational and analytical applications. The environment depends on addressing four building block issues - data profiling, data quality, data integration and data augmentation.

Today’s easy-to-use data quality management technologies, paired with appropriate process guidance, make this once overwhelming endeavor far less costly, difficult and time-consuming. Powerful new data management solutions allow firms to support vastly improved data-driven initiatives. By involving the right people within the organization, creating a sound process and implementing the latest technologies, your data quality management initiative will enable better business decisions and help your enterprise achieve competitive advantage.

About the Authors

Tony Fisher has held the President and General Manager position at DataFlux since September 2000. Tony was the Director of Data Warehouse Technology at SAS prior to the SAS acquisition of DataFlux in June 2000. He has been a key technology leader at SAS, providing the engineering research and development direction that has made SAS the market leader in data warehouse software sales. Tony's prior role at SAS gave him critical experience in the marketplace in which DataFlux operates. After the acquisition, SAS appointed Tony to lead DataFlux. He is a North Carolina native and earned a degree in computer science and mathematics from Duke University.

George Marinos is the PricewaterhouseCoopers Partner for the New York Metro Region’s Data Management Group (“DMG”). He has over 17 years of experience in the systems and computing field. As a member of the DMG he oversees a staff of 40 professionals specializing in the areas of business intelligence, transaction analysis and data transformation. George recently managed a document management and imaging project to assist a major insurance company with its demutualization efforts. Prior to joining PricewaterhouseCoopers, George was Director of Information Services for a major Human Resource/Benefits Consulting firm. George started his career in computing with a major international commercial bank, where he designed and implemented trading, profit and loss, and money transfer systems.

If you have any comments or observations, you can contact the authors at: Tony Fisher, President and General Manager DataFlux, 4001 Weston Parkway, Ste. 300, Cary, NC 27513, Tel: 919-674-2153, Fax: 919-678-8330, tony.fisher@dataflux.com, www.dataflux.com. George Marinos, National Partner, Data Quality Practice PricewaterhouseCoopers, 1177 Avenue of the Americas, New York, NY 10036, Tel: 646-471-8861, Fax: 646-471-8949, george.marinos@us.pwcglobal.com, www.pwcglobal.com.


Citation

Fisher, T. and G. Marinos, "Better decisions through better data quality management", DSSResources.COM, 10/03/2003.


Tony Fisher and Daniel Teachey, Director of Public Relations, DataFlux Corporation (A SAS Company), provided permission to archive this article and feature it at DSSResources.COM. This article was posted at DSSResources.COM on October 3, 2003.