Blogging all things data

The Costs of Data Redundancy and Data Inconsistency

The Costs of Data Redundancy and Data Inconsistency

A structured database is generally a solid and dependable storage option for client information, however, when you are dealing with numerous customer channels the quality of your database can become compromised with data redundancy and data inconsistency through no fault of the system. The more channels and points of input there are, the worse the situation potentially gets. 

Even modestly sized organisations can suffer from customer data fragmentation. Users are providing their details through web forms, the sales department is entering leads, tech support is creating tickets, accounts and billing are generating masses of transaction information. Information about the same client is entering the database through different tools which use their own proprietary data formats. It is not a straightforward task to ensure that when sales and billing refer to the same customer this is reflected in the database. There are many ways to structure a database and work towards data standardisation, however, some are more effective than others. 

"Human creativity always has the power to be surprising, especially when it comes to the ways in which it can defy well defined input fields"

Additionally, human creativity always has the power to be surprising, especially when it comes to the ways in which it can defy well defined input fields. Different types of data verification and data validation can help smooth out some of the worst offenders. Nevertheless, an individual at the point of interface will continue to stymie efforts at data standardisation. What eventually emerges from flawed user input and imperfect database design are problems of data inconsistency and redundancy. This post will cover what redundancy and inconsistency mean for a database, what the relationship is between one and the other as well as what can be done in order to alleviate their effects and achieve customer data integration.  

data redundancy in a spreadsheet displayed on a laptop

What is Data Redundancy?

Data redundancy occurs when the same data point is multiplied across the database and can be found repeated in an unnecessary form.

It usually stems from either poor initial relational database design, wherein information is inefficiently structured and needlessly replicated within the same table. Additionally, it can happen over time as the database is expanded to factor in new elements with insufficient planning in how to implement them as efficiently as possible.  

A common example of data redundancy is when a name and address are both present in different columns within a table. If the link between these data points is defined in every single new database entry it would lead to unnecessary duplication across the entire table. It would be much more efficient to define the name/address relationship in a separate related table. 

Types of data anomaly

Depending on the size of the database, over time this would potentially increase data storage and processing requirements. This can also lead to an overly complicated web of table relations, which would likely lead to confusion during the querying process. Data updates might also not take effect across all relevant points, leading to data inconsistency. Another form of this redundancy would be if identical information was stored across several different unrelated tables. This is known as an update anomaly. It would likely occur due to the database being built in multiple stages, possibly with disjointed handovers and a poor overall architectural plan. 

There are two other cases stemming from poor database design that lead to severe issues in ongoing database operations but cannot really be classified as a pure problem of redundancy. The first is an insertion anomaly, in which a relational model is so locked down and rigid that a new addition which falls outside the expected scope cannot be included either without substantial re-engineering or, as is more common, an ad-hoc solution which will typically introduce data redundancy errors down the line. 

The final case is known as a deletion anomaly, wherein a database is built in such a sub-optimal way, that the deletion of a final data point in a table leads to unintended data loss of an element not located anywhere else. An example of this is if important address details are found within a user table. When a user is deleted, the address details could be completely lost.

How to minimise data redundancy

A way to nip redundancy issues in the bud is to devote more time to planning out more efficient database structures before they are implemented. If that is no longer possible, then a process of database normalisation would have to take place. The aim of the database normalisation process is to reengineer tables so that the purpose of each is well defined and that the relations between them are purposeful and logical. This process also aims to set up the database in such a way that it is scalable, potentially expanded or retracted in the future without creating insertion anomalies or deletion anomalies. 

data inconsistency in the database shown on a laptop screen

What is Data Inconsistency?  

Data inconsistency is a situation where there are multiple tables within a database that deal with the same data but may receive it from different inputs.

Inconsistency is generally compounded by data redundancy. However, it is different from data redundancy and its adjacent anomalies in that it typically refers to problems with the content of a database rather than its design and structure. This is where the existence of multiple channels and touchpoints as well as the human propensity for putting creative spin on inputs begins to compound database problems.  

An organisation is broken up into different departments, each using their own tools and systems, each following their own processes and with their own interpretation of the data points they are creating and using. When connecting these into the central database, a process of data integration needs to happen which tries to homogenise the differences in proprietary data models by cleaning and transforming the data as much possible with validation and semantic rules. 

"An organisation is broken up into different departments, each using their own tools and systems, each following their own processes and with their own interpretation of the data points they are creating and using."

However, data quality is not purely a technical problem that can be fixed during data integration. When you are dealing with semantic and definitional questions across departments, it is almost impossible to achieve a consensus on certain business terms let alone implement a standardisation policy. The differences in data requirements across tools and departments mean that even the most comprehensive data integration focus will never consolidate into a single master, unduplicated database.  Additionally, even if an organisation implements a Master Data Management program, dependable old human error will still be the cause of a certain level of data inconsistency.  

Ways to minimise data inconsistency

There are two approaches that are increasingly being implemented by organisations in order to tackle the problem of data inconsistency across applications.  A central semantic store approach is one that involves focusing on meticulously logging and storing all the rules used by the database integration process in a single centralised repository. This is aimed at making sure that as data sources become updated or new ones are added they do not fall outside data integration rules. 

A master reference store approach tries to solve the data consistency problem through greater centralisation. Specifically, it aims to create a main source-of-truth for the most important reference data and creates strict rules about syncing all secondary tables when a change is triggered in the main one. Instead of standardising data on an ad-hoc, partial and situational basis, this approach locks it down into a single central process in an attempt to keep greater control over the most important data points, even if it does come at a greater use of processing resources. 

 

The potential of AI 

No matter how an organisation decides to work towards customer data integration, AI has the potential to greatly expedite the process in certain ways. Regardless of whether reference data updates will be centralised or all integration rules will be carefully monitored, there will come a point at which it will be important to do an overall audit of a database to clean and consolidate the data, providing a solid base upon which to build. 

Ironing out database inconsistencies manually is an extremely long and tedious task. Even with the use of some fuzzy matching models, results that are more than a little bit dissimilar risk not being picked up. As such, this task is ripe for the application of machine learning. By building a ML model which is fed a substantial number of typical data deviations, differing interpretations and misspellings, an underlying database normalisation system can be established to iron out a significant amount of the inconsistencies. As with any machine learning application, investment in model creation, training data and human oversight would be an important of the implementation. However, the potential gains for an organisation would be significant. 

If you want to find out more about creating a unified customer database, tune into the BizData webinar on 360 Degree Customer Analytics. Also, check out the second part of this blog, which goes into more detail about what it means to have a single customer view and the specific benefits it has for businesses.