When businesses are data-driven, poor data quality management directly affects business outcomes. Many business analysts still fail to make the link between data processes and business processes in a way that makes a real operational difference. There are a number of ways to start understanding and acting on business data quality issues:
- Recognise the real costs of data negligence
- Learn some heuristics about database systems
- Acknowledge that the data pipeline can be organised much like any other product manufacturing process
- What are the Business Costs of Bad Data Quality?
- What are Common Data Quality Issues?
- How Data Systems Lead to Bad Data Quality
- How IP-MAP Assists Data Quality Management
- An Information Product Approach to Data Quality Management
What are the Business Costs of Bad Data Quality?
According to the Data Warehousing Institute, a single specific type of bad data in business (issues with name and address fields) cost US businesses an estimated $611 billion in wasted costs. Their research also identified that:
“Most organisations overestimate the quality of their data and underestimate the impact that errors and inconsistencies can have on their bottom line.”
Expanding the scope of potential data quality issues, it is estimated that “the business costs of nonquality data, including irrecoverable costs, rework of products and services, workarounds, and lost and missed revenue” can amount to between 10% and 25% of an organisation’s total revenue, and around 40% of companies report suffering a problem or a loss due to bad data.
This stems from an issue that I’ve written about many times before on this blog: the failure to take data seriously as a critical business resource. In increasingly data-driven organisations, if you have a problem with bad data, you have a problem with the financial health of your organisation. Some scenarios that make the direct link from data quality issues to lost revenue include:
- A bank incorrectly calculating profitability because of missing cost data.
- Duplicate customer records leading to wasteful redundant mail being sent out for hundreds of thousands of customers.
- A financial institution calculating loans incorrectly, misreporting the principal and alienating their customers.
What are Common Data Quality Issues?
When thinking more specifically about real monetary losses caused by bad data, there are several issues that commonly occur and are inherent to poor data quality management of large customer databases. These are a major impediment to creating a single customer view, which is a powerful tool for customer relationship management in any business.
One of these is the problem of customer matching, which arises from overlapping, poorly integrated systems siloed across different departments. Depending on how bad the problem is, it can cause a mass multiplication of customer entries in a database, and the longer it goes unaddressed, the larger and harder to cleanse the affected database becomes.
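To make the matching problem concrete, here is a minimal Python sketch of duplicate detection: normalise names, then compare them with a fuzzy similarity score. The function names, record shapes and threshold are illustrative assumptions, not taken from any particular matching product; real systems also compare addresses, dates of birth and other fields.

```python
from difflib import SequenceMatcher

def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so trivial differences don't block a match."""
    return " ".join(name.lower().split())

def likely_duplicates(records, threshold=0.85):
    """Return pairs of record ids whose normalised names are similar.

    `records` is a list of (id, name) tuples; `threshold` is an
    illustrative cut-off that would be tuned per dataset in practice.
    """
    pairs = []
    for i, (id_a, name_a) in enumerate(records):
        for id_b, name_b in records[i + 1:]:
            score = SequenceMatcher(None, normalise(name_a), normalise(name_b)).ratio()
            if score >= threshold:
                pairs.append((id_a, id_b))
    return pairs

customers = [(1, "Jane  Smith"), (2, "jane smith"), (3, "John Doe")]
print(likely_duplicates(customers))  # [(1, 2)]
```

The pairwise comparison is quadratic in the number of records, which is why production matching tools add a blocking step (e.g. grouping by postcode) before scoring.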
There is also the issue of corporate house-holding, in which there is an indistinct relationship between an individual and the household they represent. In many cases, it is important to know whether a new customer entry exists within the same household as a previously existing customer. Not being able to reconcile this leads to inefficient deployment of resources and hampered marketing campaigns.
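A crude householding pass can be sketched by grouping customers on a normalised address key. This is a simplified illustration under assumed record shapes; real householding relies on address standardisation services and reference data, not string cleanup alone.

```python
from collections import defaultdict

def address_key(street: str, postcode: str) -> str:
    """A crude household key: normalised street line plus postcode."""
    return f"{' '.join(street.lower().split())}|{postcode.strip().upper()}"

def group_households(customers):
    """Group customer dicts (with 'name', 'street', 'postcode') by household key."""
    households = defaultdict(list)
    for c in customers:
        households[address_key(c["street"], c["postcode"])].append(c["name"])
    return dict(households)

people = [
    {"name": "A. Jones", "street": "12 High St", "postcode": "3000"},
    {"name": "B. Jones", "street": "12  high st", "postcode": "3000 "},
    {"name": "C. Lee",   "street": "7 Low Rd",   "postcode": "3001"},
]
print(group_households(people))  # A. Jones and B. Jones share a household key
```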
Additionally, a less common but still significant issue is organisational fusion occurring during a major restructure or merger which calls for the combination of two or more customer databases. Without proper data cleansing and matching, this can create a consolidated database which is full of duplicated entries.
There are certain frameworks and concepts that can help you understand how a database system affects the quality of its data, and help you think of its contents less as an inert substance and more as a business asset. This is also a prerequisite to setting up an effective data quality management system.
How Data Systems Lead to Bad Data Quality
A common way to think about architecture and its relation to data quality comes from the basic principles of Distributed Database Systems. It relates to the fundamental ways in which data is “collected, stored, elaborated, retrieved and exchanged” within a system. To better understand what is meant by high-quality and effective data, an information system can be placed along three axes representing its levels of distribution, autonomy and heterogeneity.
Distribution relates to how data is stored and accessed between machines in a physical network and the roles attributed to each. This ranges from a distinct client/server dichotomy to a fully distributed or peer-to-peer arrangement, in which there is no meaningful distinction between client and server. One of the main implications of a P2P system is that it is much more decentralised, offering no hierarchy, no global view of the system and no real method of ensuring adherence to universal data quality standards.
Autonomy is a measure of how independent each individual database is and whether it relies on some level of interoperability in order to present valid and useful data. The spectrum begins with tight integration, in which numerous database management systems must operate in concert to present data, with a complete map of all their relationships as a prerequisite. At the other end is a totally isolated system, in which multiple databases are not mapped together and do not rely on each other to provide services. Not understanding the bigger picture in a highly integrated system can cause a loss of important context or incomplete data, leading to quality issues.
Heterogeneity is a looser classification relating to the variation in hardware, semantics, protocols, models, and query languages that are used within a system of databases. Although there has been a significant degree of standardisation throughout the history of database technology, there are still many different tools, approaches and methodologies that can be used to achieve similar ends. For example, slight semantic differences in vendor-specific SQL queries can create different results if not accounted for.
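One well-known case of such semantic drift is the treatment of empty strings: Oracle notably folds the empty string to NULL, while most other engines treat it as an ordinary value. The Python sketch below simulates (with illustrative function names, not any vendor's API) how the same equality test can then return different results across systems:

```python
def equals_plain(a, b):
    """Simulates an engine where the empty string is an ordinary value."""
    return a == b

def equals_treating_empty_as_null(a, b):
    """Simulates an engine that folds '' to NULL; NULL comparisons are never true
    under SQL three-valued logic, so we return False for them here."""
    a = None if a == "" else a
    b = None if b == "" else b
    if a is None or b is None:
        return False
    return a == b

row = ("", "")
print(equals_plain(*row))                   # True
print(equals_treating_empty_as_null(*row))  # False
```

A join or filter on such a column silently drops or keeps rows depending on which engine runs it, which is exactly the kind of discrepancy heterogeneity introduces.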
A Data Warehouse, for instance, is characterised by very high heterogeneity but ranks low on the other axes, since its main function is to bring together and integrate data from a large variety of sources into a centrally hosted and well-mapped architecture. A peer-to-peer network ranks highly on all three axes, because it can be entirely decentralised and unmappable.
Understanding where a database system falls within these axes is a way to understand where data quality issues might be occurring and how to realistically go about trying to address them. As a business analyst, this may be the level at which it helps to understand the systems you are working with before you start considering how to approach the data quality problem as a business problem as outlined in the next section.
How IP-MAP Assists Data Quality Management
As already stated, understanding data as a business asset is an important step in ensuring that you are getting the most value out of it. There have been several attempts to formalise and develop a consistent way of thinking about data as a tangible business process.
One of these is the Information Product Map (IP-MAP), which aims to represent business data as a product, similar to something produced through manufacturing. With this approach, an equivalence is made between the quality of a manufactured product and the quality of data, with models and methodologies developed for the former also being applicable to the latter.
The IP-MAP is a way of graphically representing, as a business process, how something like a report or a visualisation comes together. It frees up data from being mapped purely on a technical level and opens up engagement with the process to non-technical staff. It gets more people involved in systematic thinking around data. IP-MAP consists of 8 “construct blocks” which are used to flowchart the creation of an “information product”. These blocks are Data Source, Output to Customer, Data Quality Check, Calculation/Processing Point, Data Storage, Decision Point, Business Boundary and Information System Boundary.
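The eight construct blocks can be sketched as a simple data structure. The enum member names and the sample flow below are our own illustration, not part of the IP-MAP specification; only the human-readable block names come from the text above.

```python
from enum import Enum

class IPMapBlock(Enum):
    """The eight IP-MAP construct blocks, as named in the text."""
    DATA_SOURCE = "Data Source"
    OUTPUT_TO_CUSTOMER = "Output to Customer"
    DATA_QUALITY_CHECK = "Data Quality Check"
    PROCESSING = "Calculation/Processing Point"
    DATA_STORAGE = "Data Storage"
    DECISION_POINT = "Decision Point"
    BUSINESS_BOUNDARY = "Business Boundary"
    SYSTEM_BOUNDARY = "Information System Boundary"

# A hypothetical map for a monthly sales report, as an ordered flow of blocks.
monthly_report_flow = [
    IPMapBlock.DATA_SOURCE,
    IPMapBlock.DATA_QUALITY_CHECK,
    IPMapBlock.PROCESSING,
    IPMapBlock.DATA_STORAGE,
    IPMapBlock.OUTPUT_TO_CUSTOMER,
]
print([block.value for block in monthly_report_flow])
```

Even this toy representation makes the process inspectable: non-technical staff can ask where a quality check sits relative to a processing step without reading any ETL code.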
An Information Product Approach to Data Quality Management
According to the creators of the IP-MAP system, the benefits that the approach offers are:
- It helps identify the most important phases of a process that have a potential to affect data quality.
- Thinking of an “information product” not only at a technical level can help identify bottlenecks and inefficiencies.
- It helps identify ownership of each stage of a process, helping to implement “quality at source” and continuous improvement principles.
- Managers can benefit from understanding where business unit boundaries and information system boundaries are and how this distinction affects the final product.
- The “information product” quality can be gauged at different stages of this mapped process according to different dimensions.
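As one illustration of gauging quality at a stage, a completeness check (the fraction of records with a given field filled in) might look like the minimal sketch below. The record shapes and function name are assumptions; accuracy, timeliness and the other dimensions would each need their own checks.

```python
def completeness(records, field):
    """Fraction of records with a non-missing value for `field`.

    Treats None, empty string and an absent key as missing; an empty
    batch is vacuously complete.
    """
    if not records:
        return 1.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

stage_output = [
    {"customer": "A", "cost": 120.0},
    {"customer": "B", "cost": None},
    {"customer": "C", "cost": 95.5},
    {"customer": "D"},
]
print(completeness(stage_output, "cost"))  # 0.5
```

Running the same check after each construct block of an IP-MAP shows exactly where a dimension degrades, which is where "quality at source" effort pays off.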
There are many ways to approach data quality as a business problem, and this blog has presented one potential set of aspects to consider: understand that a lack of action on data quality has real revenue costs; be aware of the general characteristics of your database systems; and formalise the creation of “information products” by treating it as a conventional manufacturing process.
If you are looking for data quality management tools that are specifically built to help you have better governance over your information ecosystem and resolve data quality issues, sign up for a free trial of Loome.