Lost or incomplete data can be a nightmare for any data-driven organisation. It is even more harrowing when it is discovered that data was lost or had been distorted over a period of time without anyone being aware of it. Decisions made based on the flawed data are thrown into doubt and the organisation can suffer a period of strategic whiplash as it readjusts to new benchmarks, performance metrics and projections. These blows are only compounded when the data involved is health data, highly sensitive by nature and directly involved with the physical wellbeing of real people.
Public Health England, an agency of the UK Department of Health and Social Care, experienced such a data loss scenario in a very public way, providing an object lesson in the importance of having well-tested data pipelines and loading data in a transparent and well-managed way. At a time when the eyes of the world were particularly focused on COVID-19 data (daily figures, trend lines, clusters, demographic breakdowns), the agency was forced to admit that the numbers people around the world were scrutinising every day were just plain wrong.
Between 25th September and 2nd October 2020, Public Health England under-reported national COVID-19 cases by 15,841. This was at a time when cases were rising in the UK, when a clear understanding of where infections were occurring and where trends were most pronounced was more important than ever for analysing and curbing the spread.
Having incomplete data on cases meant that contact tracing and notification could not be performed for everyone who had tested positive. Without this, infected individuals were more likely to carry on as normal and spread the infection further through their communities. It was estimated that up to 50,000 contacts were not traced or acted on in time as a direct result of the data issue. Furthermore, the data feeding the government’s complex scenario modelling was out of touch with reality and had to be comprehensively revised when the corrected numbers were received.
Organisations can suffer real and significant costs due to data quality issues, but rarely are the stakes so high and involve such a direct impact on the health and wellbeing of thousands of people.
How did the Data Loss Happen?
How Spreadsheets Impact Data Quality
The culprit in this case was the humble spreadsheet. We have blogged before about how spreadsheet software has been called the “hidden productivity killer” and even “potentially the most dangerous software on the planet” because of the many issues that arise from financial planning in spreadsheets. Their appeal lies in their accessibility: they empower even a novice user to perform rudimentary data tasks with relative ease.
Unfortunately, this can lead to overconfidence and overreliance, wherein a system designed to handle low-volume, basic data with a little computation and a high tolerance for error is increasingly relied upon to handle large, complex datasets. At some point, when the data task becomes sophisticated enough and the stakes high enough (such as when you are dealing with financial health or physical health), a suitably capable and reliable data pipeline should be deployed.
How Legacy Systems Impact Data Quality
In the case of the COVID data, the use of spreadsheets collided with another frequent cause of poor data quality in large organisations: dated legacy systems. Data was being received from the various testing centres in XLSX format, which is perfectly capable of handling over a million rows of data. However, due to requirements around supporting old health database applications, the process for loading the disparate data into a master database relied on converting files to an obsolete format: XLS.
The problem was that the older format can only hold 65,536 rows of data per worksheet, meaning that many thousands of rows were being cut off and lost every time a large enough spreadsheet was converted. Since each documented COVID case consisted of multiple rows of data, files routinely exceeded the XLS threshold and were truncated. This occurred without any alerts, logs or errors being generated to notify the operators. As a result, the error was allowed to keep occurring for a week, compounding the flawed data situation every day.
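The truncation hazard can be made concrete with a short sketch. The function and constant names here are hypothetical, but the 65,536-row ceiling is the real hard limit of a legacy .xls (BIFF8) worksheet; a guard like this would fail loudly at conversion time instead of silently dropping rows.

```python
# Hypothetical pre-conversion guard for the legacy .xls row ceiling.
XLS_MAX_ROWS = 65_536  # hard per-worksheet limit of the BIFF8 (.xls) format


def guard_xls_rows(rows):
    """Raise instead of silently truncating when a dataset will not fit in .xls."""
    if len(rows) > XLS_MAX_ROWS:
        raise ValueError(
            f"{len(rows)} rows exceed the .xls limit of {XLS_MAX_ROWS}; "
            f"{len(rows) - XLS_MAX_ROWS} rows would be silently dropped"
        )
    return rows
```

The key design point is that the check happens before the lossy conversion, turning an invisible data loss into an immediate, visible failure.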
Even when the problem was uncovered, the solution was to introduce a manual file breakup and loading process, checking every file to ensure it stayed under the row limit. While introducing the manual processing step may have served to manage the issue in the short term, such band-aid fixes, workarounds and manual interventions can only go so far before they create bottlenecks and problems of their own.
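As a sketch of what that manual workaround automates, a splitting step can divide a dataset into chunks that each fit under the limit without losing anything. The function name and defaults are illustrative, not part of PHE's actual process:

```python
def split_for_xls(rows, max_rows=65_536):
    """Split a dataset into consecutive chunks that each fit the .xls row limit."""
    return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]
```

A splitter like this preserves every row across the output chunks, but each new chunk file is itself a new manual touchpoint, which is exactly where such workarounds start generating problems of their own.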
How Could Data Loss Have Been Avoided?
Data Readiness Assessment
A Cambridge professor of data science was quoted as saying in relation to the story:
Excel was always meant for people mucking around with a bunch of data for their small company to see what it looked like. And then when you need to do something more serious, you build something bespoke that works - there's dozens of other things you could do.
Having the right tool for the job is important and applies to any data-oriented project as much as it does anywhere else. A comprehensive assessment, involving multiple relevant stakeholders, of each stage of the data pipeline as well as the inputs, file types, touchpoints, transformations and visualisations can typically identify points at which significant problems may occur and rectify them in advance. You can read more about Data Readiness Assessments in this eBook on effective Data Governance.
Testing Data Pipelines With Real Data
This case also highlights the importance of testing data pipelines with real-world data scenarios. If you use small test datasets that don’t accurately represent what the pipeline might realistically encounter in fringe, high-volume cases, you will not get an accurate picture of its performance and potential shortcomings.
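A minimal sketch of such a test, assuming a hypothetical `load_cases` function standing in for the real pipeline's load step, would deliberately push the volume past the legacy format's ceiling and assert that nothing goes missing:

```python
def load_cases(rows):
    """Stand-in for the real load step; the real one wrote to a master database."""
    return list(rows)


def test_no_rows_lost_at_realistic_volume():
    # Deliberately exceed the 65,536-row .xls ceiling rather than using a toy sample.
    source = [{"case_id": i} for i in range(100_000)]
    loaded = load_cases(source)
    assert len(loaded) == len(source), "pipeline silently dropped rows"


test_no_rows_lost_at_realistic_volume()
```

A ten-row sample would have passed this pipeline every time; only a test sized like production data exposes a hard row limit.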
Implementing Data Quality Alerts
Finally, it is a lesson in the value of data quality alerts for quickly identifying and acting on data problems before they compound over time. Having a predefined idea of expected outputs, and testing inputs against them in real time, gives you a measure of transparency over how the data pipeline is functioning and notifies you of exceptions, mismatches or other unexpected behaviour in your data stream. Business alerts can be set up using a similar mechanism, aimed instead at ensuring process compliance among users and operators. This ensures that no data quality problems are introduced at points of manual handling and manual data input.
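A row-count reconciliation check is one simple form such an alert can take. Everything here (names, the logging destination) is illustrative; the point is that rows received and rows loaded are compared automatically rather than trusted:

```python
import logging


def reconcile_row_counts(received, loaded, tolerance=0):
    """Compare rows received against rows loaded and alert on any discrepancy."""
    diff = received - loaded
    if abs(diff) > tolerance:
        logging.error("Data quality alert: %+d row discrepancy after load", diff)
        return False
    return True
```

In the PHE incident, a check like this on each ingested file would have flagged the shortfall on day one instead of after a week of compounding losses.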
Read more about how you can easily set up data quality software with Loome: