
Philip Teplitzky (phil.teplitzky@hpsquaredllc.com)
Marc Hurst (marc.hurst@hpsquaredllc.com)
March, 2010
Abstract: Data Quality is a critical factor in an Enterprise’s ability to direct its operations based on business patterns and transaction history. Although it is often perceived as the responsibility of information technology, in reality it spans operations and back-end data analysis functions as well. This paper discusses the breadth and scope of data quality metrics and characteristics. It suggests a pragmatic treatment of metrics and presents guidelines for establishing effective measurements for data quality. It offers three dimensions for discussing and analyzing data quality, and it pairs a discussion of the difficulties in implementing a successful data quality program with suggestions for addressing them.
Key Words: methodology, data quality, data governance, analytics, enterprise data, information management
Introduction
Data Quality is one of the foundations of Information. Information is composed of Data plus structure and context. If the Data is wrong, however, you no longer have Information but noise, and without Information it is not possible to run a modern, competitive company. Often heard in the hallowed halls of American business are cries such as:
- The data is wrong!
- Why don’t the reports agree?
- Why does Marketing have different numbers than Sales?
- How many customers do we really have?
- What are the inventory numbers?
And on and on the differences and disputes go! It is not possible to make well-considered, analytic judgments in the absence of reliable, consistent and dependable Information. The result is that many important business decisions are made on gut feel and the best available information; the results can be, and often are, less than optimal and in some cases fatal. The companies with the best Analytics and the best Information win; the ones without lose and perish. Knowing what you have done and with whom, what is hot and what is not, and who your best customers are can mean success or failure in today’s highly competitive world market. Perhaps the best example of this rule was the Official Airline Guide. At the peak of its popularity it was worth more than the airlines it reported on; the information was more valuable than the planes. But it became obsolete and irrelevant when the internet made airline schedules and flight information available instantaneously and for free. The Information with the best speed of delivery and context became the winner. The key lesson: without Quality, Context and Delivery, today’s information is only yesterday’s success.
The lack of quality, and by inference the decline in the usefulness of the Information, is an unacceptable fact of life. Why are the numbers wrong? Why do the operational systems and the analytic applications have different values? Why are production and marketing out of sync with their projections? The answer is simple and yet profound in its implications and solution: the problem is THE DATA, and specifically the Quality of the Data!
If the data is wrong then everything is wrong. It does not matter how elegant your analytic models are or how insightful your predictive models are; if the data is wrong, the Information is wrong. Without an acceptable level of data quality there is no information. The problem, dear Brutus, is not in our stars but in our DATA! If the problem is Data Quality, how do we make it better? That is the raison d’être of this White Paper.
Lord Kelvin (Sir William Thomson, Baron Kelvin of Largs), the great 19th-century scientist, said it best:
To measure is to know. If you cannot measure it, you cannot improve it.
How do we in fact measure the Quality of Data? What are the metrics? And what are the issues with measuring it?
Let us examine three key dimensions of the problem:
- What are the criteria that make up the measures of data quality?
- What problems or challenges exist in using the metrics?
- How do you ensure that the metrics are being consistently applied, with a level of accuracy and precision that is appropriate for the environment they are operating in?
The Three Questions
What are the criteria that make up the measures of Information Quality?
Leo L. Pipino, Yang W. Lee, and Richard Y. Wang, in their article “Data Quality Assessment” (Communications of the ACM, April 2002, Vol. 45, No. 4), have identified a set of criteria that is as comprehensive and appropriate as any I have seen. They are:
| Dimension | Definition | Discussion |
| --- | --- | --- |
| Accessibility | The extent to which data is available or easily and quickly retrievable | How does one define quickly: a day, an hour, real time? Where is it retrievable from? How did it get there? |
| Appropriate Amount of Data | The extent to which the volume of data is appropriate to the task at hand | Who sets the volume? What is the cost-benefit ratio of the storage vs. the volume? Is there a difference by domain in the volume? If so, which do you select? Based on ROI or need? |
| Believability | The extent to which the data is regarded as true and credible | Do back office business analysts view the data with a degree of confidence comparable to business operations? |
| Completeness | The extent to which data is not missing and is of sufficient breadth and depth for the task at hand | Information is frequently provided only for the minimal number of fields needed to complete a transaction. Yet this may not be sufficient for repurposing the data for other business functions. |
| Concise | The extent to which data is compactly represented | The use of standard reference data and enterprise-wide accepted values (such as code values) will enhance data consistency and understanding. |
| Consistent Representation | The extent to which data is presented in the same format | Do parallel or silo applications represent key data identically? This is the driver behind MDM: to reconcile those variations and remediate. |
| Ease of Manipulation | The extent to which data is easy to manipulate and apply to different tasks | Data structures, timings and business rules may act to restrict or inhibit broad distribution across the enterprise. |
| Free of Errors | The extent to which data is correct and reliable | Often a data quality problem does not impact every usage of the data in an enterprise. Errors are specific to a business need and thus may be acceptable or unacceptable depending on context. |
| Interpretability | The extent to which data is in appropriate languages, symbols and units, and the definitions are clear | Metadata definitions as well as formats need to be published and, if possible, consistent across databases. |
| Objectivity | The extent to which data is unbiased, unprejudiced and impartial | Strict controls over data manipulation frequently are not in place. We have seen desktop analytics where analysts revise data extracted from operational systems, thus impacting the integrity and objectivity of the specific data. |
| Relevancy | The extent to which the data is applicable and helpful for the task at hand | Data may not use the same business rules, taxonomies, etc. when repurposed. This may be unavoidable due to implementation factors such as package solutions. |
| Reputation | The extent to which data is highly regarded in terms of its source or content | Data quality across an enterprise is rarely viewed comparably by multiple users. The impact of efforts to remediate data problems should be reflected in its reputation over time. |
| Security | The extent to which access to data is restricted appropriately to maintain its security | Are specific business rules around data access established? Are they maintained? Are they realistic, driven by compliance and regulatory needs, or artificial? |
| Timeliness | The extent to which the data is sufficiently up to date for the task at hand | Often data does not meet timing requirements because of architecture constraints or business timings, particularly as analysis functions and methodologies become more sophisticated and move closer to current-day or even real-time. |
| Understandability | The extent to which the data is easily comprehended | The effort to analyze and comprehend source system data is affected by the degree to which that data needs to be interpreted, for example because metadata and business rules are not available. |
| Value Added | The extent to which data is beneficial and provides advantages from its use | Again, the data may or may not add value to other applications outside of its originating source system. |
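Several of these dimensions can be scored numerically. The following is a minimal sketch, assuming Completeness and Timeliness are each expressed as a simple ratio between 0.0 and 1.0. The record layout, the field names and the 90-day freshness window are hypothetical, chosen only for illustration.

```python
from datetime import date

# Hypothetical customer records; None marks a missing value.
records = [
    {"name": "Acme", "phone": "555-0100", "updated": date(2010, 3, 1)},
    {"name": "Birch", "phone": None, "updated": date(2010, 1, 15)},
    {"name": "Cole", "phone": "555-0102", "updated": date(2009, 6, 30)},
    {"name": None, "phone": None, "updated": date(2010, 2, 20)},
]

def completeness(rows, fields):
    """Share of required field values that are populated (0.0 to 1.0)."""
    total = len(rows) * len(fields)
    filled = sum(1 for row in rows for f in fields if row.get(f) is not None)
    return filled / total if total else 1.0

def timeliness(rows, as_of, max_age_days):
    """Share of rows updated within max_age_days of the as_of date."""
    fresh = sum(1 for row in rows if (as_of - row["updated"]).days <= max_age_days)
    return fresh / len(rows) if rows else 1.0

# The same data scores differently depending on what the task requires.
billing_score = completeness(records, ["name"])             # name only
marketing_score = completeness(records, ["name", "phone"])  # name and phone
recent_score = timeliness(records, as_of=date(2010, 3, 31), max_age_days=90)
```

Note that the score depends on which fields a given task treats as required, which is the point made in the Completeness row: data that is complete enough for the originating transaction may not be complete enough when repurposed.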
To the list above I would add the following:
- Context – as we discussed earlier, data becomes information when structure and context are added. Information is unfortunately subject to the subjective and frequently personalized interpretations of the person evaluating it. Individual experience and knowledge tend to introduce a filter that can change meaning. Therefore it is important to know who is using the data. For example, sales volume can be measured in different units, so to say that sales for August were 100 we must also say whether that means units, gross, cases, etc. The context adds meaning.
- Accuracy and Precision – this will be apparent to the engineers reading this. Allow me the classical definition. If I have a foot ruler that is marked as 12 inches long, but because of an error in manufacturing each inch is 1/12 of an inch too long, then I have a foot ruler that is actually 13 inches long; my measurements will be precise but inaccurate. By the same token, if my ruler is divided into tenths of an inch, I cannot make a measurement that is more accurate or precise than a tenth of an inch. Therefore my measurement can only be as precise as the ruler is. The same is true for corporate data. It can only be as accurate and precise as the most inaccurate and imprecise source. Introducing additional precision in representation is of no use and introduces false confidence in the numbers.
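The ruler example can be worked through in a few lines. This is an illustrative sketch only: the function names and the sample lengths are assumptions, and exact fractions are used so that the arithmetic is not blurred by floating-point rounding.

```python
from fractions import Fraction

# Hypothetical miscalibrated ruler: each marked "inch" is 1/12 inch too long,
# so 12 marked inches span 13 true inches.
TRUE_INCHES_PER_MARK = Fraction(13, 12)

def reading(true_length):
    """Reading the faulty ruler reports for a given true length in inches."""
    return true_length / TRUE_INCHES_PER_MARK

# Repeated measurements of the same 13-inch part agree with one another
# (precise) yet all differ from the true value (inaccurate).
readings = [reading(Fraction(13)) for _ in range(3)]
spread = max(readings) - min(readings)   # no variation between readings
bias = readings[0] - Fraction(13)        # constant offset from the truth

def coarsest_resolution(resolutions):
    """Combined data is only as fine as its coarsest source."""
    return max(resolutions)

def report(value, resolution):
    """Round a value to the given resolution instead of implying more."""
    return round(value / resolution) * resolution

# Sources recorded to 0.001 in and 0.1 in: the total is good to 0.1 in only.
step = coarsest_resolution([Fraction(1, 1000), Fraction(1, 10)])
total = report(Fraction(2375, 1000) + Fraction(41, 10), step)
```

Every reading of the 13-inch part comes back as 12, so the spread is zero while the bias is a full inch. The combined total of 6.475 is reported as 6.5, because quoting the extra digits would claim a precision the coarser source never had.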
What problems or challenges exist in using the metrics?
Regardless of the elegance of the software engineering and the precision of the processes that calculate data quality metrics, any measurement system within a complex system is subject to interpretation. As noted earlier in this paper, data quality that is inadequate for one business process is not necessarily inadequate for every process that consumes the same information. Thus generalizations about data quality and correctness are suspect. For example, an engineering group requires resolution to the thousandth of an inch, while procurement, inventory and distribution use the same data where this level of precision is not material. Yet the part must carry consistent attribute values across all of these uses.
In defining metrics we offer the following guidelines:
- Relevance to the Business Process: The metric must be defined within a business context that clearly explains how the metric score correlates to enhanced execution of the associated business process.
- Resolution of Measurement: An enabling measurement process must quantify a quality metric within a discrete range and allow minimum quality standards to be applied (these may be specific to a business process or set at enterprise scope).
- Controllability: The metric must reflect a controllable or manageable aspect of a business process, thus supporting a well-defined method for remediation or for communication and tracking.
- Communications: Workflow mechanisms need to gather and transmit the appropriate information to a data steward (assuming that role exists) when the measured value falls below acceptable quality levels.
- Historical: Tracking the historical patterns and values of the metric must make it possible to determine whether the remediation process is effective, and must support statistical process controls.
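To make the guidelines concrete, here is a minimal sketch of how a metric definition might carry them. The class name, its fields and the 0.0 to 1.0 score range are assumptions made for illustration, not a prescribed design. Controllability is a property of the metric chosen rather than of the code, so it is not modeled here.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class QualityMetric:
    name: str
    business_process: str                   # relevance: the process the score supports
    minimum_score: float                    # minimum quality standard for that process
    notify_steward: Callable[[str], None]   # communications hook (hypothetical)
    history: List[float] = field(default_factory=list)  # historical tracking

    def record(self, score: float) -> bool:
        """Store a score; alert the steward if it is below the minimum."""
        if not 0.0 <= score <= 1.0:
            raise ValueError("score must be within 0.0 and 1.0")
        self.history.append(score)
        acceptable = score >= self.minimum_score
        if not acceptable:
            self.notify_steward(
                f"{self.name} for {self.business_process} scored {score:.2f}, "
                f"below the minimum of {self.minimum_score:.2f}"
            )
        return acceptable

    def improving(self) -> bool:
        """True if the latest score beats the first one recorded."""
        return len(self.history) >= 2 and self.history[-1] > self.history[0]

# Usage: alerts are collected in a list here; a real workflow would route them.
alerts: List[str] = []
metric = QualityMetric("address completeness", "order fulfillment", 0.95, alerts.append)
results = [metric.record(s) for s in (0.90, 0.93, 0.97)]
```

In this usage the first two scores fall below the minimum and each produces an alert, while the third passes, and the recorded history shows the remediation trend.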
Even with well defined and designed measurement processes there are further considerations. How does the data quality process establish what is the acceptable tolerance in accuracy and precision? What exactly is accuracy? Are quality tolerance levels global or specific to a business need? The message is that one solution does not fit multiple enterprises; in fact it may not even be able to span multiple business units or departments within a business unit. Data Quality is inherently subjective and must be specific to business requirements and prioritization.
Complexity is further compounded when source application data is determined not to meet quality standards by a downstream process. For example, data from an order entry application is rejected by a transformation process for an enterprise data warehouse (EDW). What is the remediation approach? Is it viable to have an automatic or manual workflow feedback loop, a suspense process that filters data problems back to specific business departments or applications? Or can remediation be handled by a mechanism such as a master data management process or a data quality process within the EDW environment, and thus not be communicated back to the originating source system?
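One of the options raised above, a suspense process that routes rejected records back toward their originating application, might be sketched as follows. The validation rules, field names and source system names are hypothetical. Whether rejects are actually returned to the source or corrected within the EDW remains a business decision that the code does not settle.

```python
from collections import defaultdict

# Hypothetical validation rules for a warehouse load; each returns an error
# message, or None when the order passes.
def check_customer(order):
    return None if order.get("customer_id") else "missing customer_id"

def check_quantity(order):
    return None if order.get("quantity", 0) > 0 else "quantity must be positive"

RULES = [check_customer, check_quantity]

def load_orders(orders):
    """Split orders into rows accepted for the warehouse and a suspense
    queue keyed by originating system, so rejects can be routed back."""
    accepted, suspense = [], defaultdict(list)
    for order in orders:
        errors = [msg for msg in (rule(order) for rule in RULES) if msg]
        if errors:
            suspense[order["source"]].append({"order": order, "errors": errors})
        else:
            accepted.append(order)
    return accepted, dict(suspense)

orders = [
    {"id": 1, "source": "order_entry", "customer_id": "C17", "quantity": 3},
    {"id": 2, "source": "order_entry", "customer_id": "", "quantity": 0},
    {"id": 3, "source": "web_store", "customer_id": "C22", "quantity": -1},
]
accepted, suspense = load_orders(orders)
```

Keying the suspense queue by source system keeps each reject tied to the department or application that would have to fix it.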
David Loshin writing about “Data Quality and Cost Reduction” (a Dataflux white paper) states: “There are many potential areas that may be impacted by poor data quality, but computing a precise cost is often a challenge. Classifying those impacts helps to identify discrete issues and relate business value to high quality data. Evaluating the impact of any type of recurring data issue is easier when there is a process for classification that depends on a taxonomy of impact categories and subcategories. Impacts can be assessed within each subcategory, and the quantification and reporting of the value of high quality data can be rolled up as a combination of separate measures associated with how specific data flaws prevent the achievement of business goals.”
How to ensure that the metrics are being consistently applied, with a level of accuracy and precision appropriate for the environment they are operating in?
Without formal data governance in place, consistent and appropriate use of metrics will not be achievable. Even with governance in place, there must be an explicit understanding that not all data quality needs span the enterprise. Yet I would contend that there does need to be a minimum acceptable quality threshold at the aggregate level. It is becoming clear that data standards are inherently decentralized in a modern enterprise. This condition implies that data stewards, architects and other data quality stakeholders are accountable for embracing heterogeneous business requirements based on the specific needs of the business processes that consume the data.
Policies and workflow to handle data quality remediation cases (the governance process) need to be in place and continuous. These processes would themselves benefit from metrics that validate that they are being performed and, over time, that the need for them is diminishing as the enterprise reaps the results of its data remediation programs.
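One way to check whether the need for remediation is genuinely diminishing is a control chart over the remediation workload. The sketch below assumes the common three-standard-deviation convention for control limits. The weekly counts are invented for illustration.

```python
from statistics import mean, pstdev

def control_limits(baseline):
    """Lower limit, center line, upper limit at three standard deviations."""
    center = mean(baseline)
    sigma = pstdev(baseline)
    return center - 3 * sigma, center, center + 3 * sigma

def out_of_control(values, lower, upper):
    """Values falling outside the control limits."""
    return [v for v in values if v < lower or v > upper]

# Hypothetical weekly counts of records routed to remediation.
baseline_weeks = [40, 42, 38, 41, 39]
lower, center, upper = control_limits(baseline_weeks)

# Counts after a remediation program starts; points below the lower limit
# suggest a real drop in the need for remediation, not ordinary variation.
later_weeks = [37, 33, 30]
signals = out_of_control(later_weeks, lower, upper)
```

With these figures the lower limit falls between 35 and 36, so a week of 37 is ordinary variation while weeks of 33 and 30 are flagged as a real change.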
Data consumers have different needs, e.g. manufacturing/engineering versus marketing. Precision requirements may also be based on a variety of factors such as format or timing, e.g. inventory requires real-time data for product reservations while purchasing looks at aggregations across transactions. A business consists of varying needs, and consistency is relative to the consuming application. An upfront data collection or order entry function may not care whether a customer is new or existing, and therefore does not embed a function to reconcile master data. Meanwhile that data moves to sales, marketing and customer service, where the uniqueness of that customer becomes critical.
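The idea that one dataset can be fit for one consumer and unfit for another can be expressed as a per-consumer requirements check. This is a sketch under stated assumptions: the consumer names echo the examples above, but the resolution and freshness figures are hypothetical.

```python
# Hypothetical requirements per consuming function: the finest measurement
# step each needs (inches) and the oldest data each can tolerate (minutes).
REQUIREMENTS = {
    "engineering": {"resolution_in": 0.001, "max_age_min": 1440},
    "inventory":   {"resolution_in": 0.1,   "max_age_min": 1},
    "purchasing":  {"resolution_in": 0.1,   "max_age_min": 1440},
}

def fit_for_use(dataset, consumer):
    """Return the list of requirements the dataset fails for a consumer."""
    need = REQUIREMENTS[consumer]
    failures = []
    if dataset["resolution_in"] > need["resolution_in"]:
        failures.append("resolution too coarse")
    if dataset["age_min"] > need["max_age_min"]:
        failures.append("data too old")
    return failures

# One extract, recorded to a hundredth of an inch and six hours old.
extract = {"resolution_in": 0.01, "age_min": 360}
verdicts = {name: fit_for_use(extract, name) for name in REQUIREMENTS}
```

The same extract fails engineering on resolution, fails inventory on age, and is acceptable to purchasing, which is why a single enterprise-wide verdict on its quality would mislead.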
Summary
The quality of an enterprise’s data impacts its ability to execute, as deficiencies compromise its understanding of its performance. Measuring data quality is a non-trivial undertaking. This paper identifies a comprehensive set of criteria on which metrics can be established, measured and managed. These establish a breadth of impact far beyond traditional data quality viewpoints, thus reflecting the dependency on accurate and precise data across the enterprise.
How data is used differs across departments or business units, which implies a variety of quality requirements. Processes need to be established that serve the business by facilitating the definition, usage and management of data and that address these conflicts and others. Processes and enabling workflows to remedy data errors must take into account what is appropriate for the business and be requirements driven. For example, should source operational system data be overwritten or deleted, or should the corrected data reside in a different architecture layer such as an enterprise data warehouse?
These decisions and other significant conflicts and issues in executing a data quality program are not necessarily technical. They are closely aligned with the establishment of formal and proactive data ecosystem infrastructure, specifically data governance, stewardship and continuous data quality processing.
The assessment and institution of an effective data quality program go well beyond the accuracy and precision of data. Characteristics such as context, extensibility and completeness come into play. Enabling solutions go well beyond the implementation of automated processes which cleanse data and isolate potential data errors. Metrics should address a specific set of requirements, be tracked historically and be monitored continuously.