
Data Warehouses, Data Lakes and the Importance of Data Lake Governance

Good Data Lake governance is a vital part of maintaining a healthy and well-functioning data ecosystem. This blog post covers what you need to keep in mind when moving from an on-premises database to the cloud, as well as how to implement a Data Lake without letting it turn into a Data Swamp.

Establishing a Data Strategy

When your organisation deals with high-volume big data analytics and has decided to move from on-premises to the cloud, there are important decisions to be made. You know that your requirements are broadly:

  1. A dependable single source-of-truth for your data. 

  2. Fast turnaround times for querying and processing.

Even with these seemingly straightforward requirements, you have several options available and no single one-size-fits-all solution.

To understand what the options mean and why the distinctions are important, it is wise to first assess the unique data situation within your organisation.

Specifically, what matters is whether you are dealing with structured or unstructured data, the purposes for which you will be querying the data, and the ongoing resources you are prepared to commit to data stewardship and data lake governance. Weighing these considerations will determine whether a Data Warehouse or a Data Lake is a better fit for your organisation.

It may not look like it, but data swamps dwell within

Types of Data

A big part of deciding which solution is best for your organisation ultimately comes down to the form your data takes, structured or unstructured, as well as the volume and velocity with which it will enter your storage.

Structured data takes a defined, predictable form, broken down into elements that follow a predefined model. This type of data can be used to set up a relational database, which can be predictably and consistently queried with a tool such as SQL.

Unstructured data does not have this organisation: it lacks common fields and is irregular, making it much more difficult to process and query. This type of data can include blocks of different sizes and formats, making it inscrutable to SQL.

Semi-structured data is a mixture of both, wherein there are certain defined classes and categories, but not to the same extent as fully structured data.
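
To make the distinction concrete, here is a minimal Python sketch. The table, fields and events are invented for illustration: structured records fit a relational table and can be queried with SQL, while semi-structured JSON events vary in shape from one record to the next.

```python
import json
import sqlite3

# Structured data: every record fits a predefined schema,
# so it can live in a relational table and be queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 'Acme Pty Ltd', 129.50)")
print(conn.execute("SELECT customer, total FROM orders").fetchall())

# Semi-structured data: some named fields, but the shape varies per
# record, so a fixed relational schema cannot hold it as-is.
events = [
    '{"user": "a1", "action": "click", "page": "/home"}',
    '{"user": "b2", "action": "purchase", "items": [3, 7], "total": 42.0}',
]
for raw in events:
    event = json.loads(raw)
    print(event.get("action"), event.get("total", "n/a"))
```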

More Considerations

Another important aspect to consider is how your data in storage will be accessed and the purposes for which it will be used. Regularly accessing a large dataset, or pulling insights from it to feed your business intelligence or analytics reporting tool, will influence which data solution is optimal for you.

A final consideration is your organisation’s data governance plan and resourcing capacity. It may sometimes be possible to build a “fire and forget” data strategy which is deployed with confidence and accounts for all future contingencies, but in most cases there is a need for a dedicated data steward role or team responsible for ongoing data governance and re-evaluation of requirements. This aspect is often not given sufficient scrutiny when determining a data solution.

Making Data Useful for Business

It is tempting to see your organisation’s data as abstract and impersonal, existing within its own sphere and affected only by technocratic management. However, data ultimately exists to inform business decisions and is only useful when digested in the right form by the right people. Effective, ongoing data governance is what ensures that your data is a nourishing wellspring rather than a stagnant swamp.

When dealing with data which is highly structured, consists of well-defined fields and is expected to be consistent, a Data Warehouse will typically be the go-to choice. Data Warehouses commonly consolidate numerous data sources and perform queries using some variant of SQL. This enables a Data Warehouse to be a reliable source-of-truth for a business, since the data follows a single schema and structure. Because of this, a Data Warehouse can also underpin a dependable, unified analytics reporting platform. This highly structured and organised approach is the reason for the benefits, but it is also the cause of several drawbacks.

The Data Ingestion Process

Before organisational data can be fed into a Data Warehouse, it has to be made to fit the schema through an initial ETL (Extract, Transform and Load) process, which is a form of data homogenisation.

Each step of this process typically represents a set business objective and is aimed at ensuring that the incoming data will be not only usable, but also useful. This approach is called schema on write, and it requires a preliminary investment of time and resources to create the structure and format which will set the data up for the fastest possible querying and analysis.
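
As a rough illustration, here is a minimal schema-on-write sketch in Python. The field names, target schema and rejection rule are all assumptions made up for this example, not a prescribed standard: the point is that records are forced to fit the schema, or quarantined, before they are loaded.

```python
from datetime import datetime

# Hypothetical target schema, enforced at write time: records that
# cannot be transformed to fit it never reach the warehouse.
def transform(raw: dict) -> dict:
    return {
        "order_id": int(raw["id"]),                  # must be present and numeric
        "customer": raw["customer"].strip().upper(), # normalised casing
        "order_date": datetime.fromisoformat(raw["date"]).date().isoformat(),
        "total_aud": round(float(raw["total"]), 2),  # consistent precision
    }

def etl(raw_records: list[dict]) -> list[dict]:
    clean, rejected = [], []
    for raw in raw_records:
        try:
            clean.append(transform(raw))  # Transform: make it fit the schema
        except (KeyError, ValueError, TypeError):
            rejected.append(raw)          # fails the schema: quarantined, not loaded
    # The Load step would write `clean` to the warehouse table here.
    return clean

print(etl([{"id": "1", "customer": " acme ", "date": "2021-03-02", "total": "129.5"}]))
```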

Establishing a schema and scrutinising everything coming through the ETL process is a way to make sure that the data pipelines are purposeful, well understood and in line with organisational needs. A benefit of this rigidity is that working out the data schema in advance incentivises good data governance. However, because it is such an involved process, once it is in place it cannot be quickly changed based on evolving requirements.

Once the schema is established, an organisation is locked into a certain Data Warehouse structure. Any future changes require a further investment in retooling costs.

Furthermore, as the overall data landscape has evolved, it has trended towards larger volumes of less structured data in formats that defy structuring. Streaming media, IoT sensor data, geolocation info, page navigation metrics and many other new data streams present a challenge in building and maintaining an ETL process which will produce reliable structured data.

Sensor data from the Internet of Things (IoT) represents a challenge for structured databases

The Data Lake Option

An alternative is to defer the upfront costs of integration by implementing a Data Lake, which represents a specific approach to organisational data. A Data Lake generally has a flat architecture, ingesting data in its original format, completely raw and without the need for making it fit a schema. Instead of a full initial ETL process, only the Extract and Load steps take place. The Transform stage is pushed back to the point when the data is accessed.
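
A minimal sketch of that Extract-and-Load-only pattern might look like the following Python snippet; the directory layout, source name and file naming scheme are assumptions for illustration only.

```python
import pathlib
from datetime import datetime, timezone

LAKE_ROOT = pathlib.Path("lake/raw/clickstream")  # hypothetical landing zone

def land_raw(payload: bytes, source: str) -> pathlib.Path:
    """Extract + Load only: persist the payload byte-for-byte.

    No schema is imposed on the way in; the Transform step is
    deferred until the data is actually read.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    path = LAKE_ROOT / source / f"{stamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    return path

land_raw(b'{"user": "a1", "page": "/home"}', source="web")
```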

One of the benefits of a Data Lake is that it requires much less time and fewer resources to set up; you can have one up and running in a comparatively short amount of time. Additionally, there is no need for data siloing or the creation of Data Marts. A Data Mart is a smaller-scale Data Warehouse which stores data for only a single department or segment of the organisation, a siloing which typically hinders big-picture analysis.

With a Data Lake, all organisational data is stored together, regardless of type or format, providing analysts and data scientists with a free-fire zone to access, refine and explore a potentially broader and deeper range of data.

Especially when dealing with logs and sensor outputs, a Data Lake is a quick, flexible and cost-effective way of mocking up models and performing preliminary data analysis. The schema on read approach of a Data Lake structures the data as it is accessed, the opposite of how it would be handled in a Data Warehouse. This ease of input means that a Data Lake can be used to store data whose value and utility are yet to be established. It can also act as a middle-point and staging area between a stream of unstructured data and, ultimately, structured data housed in a Data Warehouse.
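
Continuing the hypothetical landing-zone example above, a schema-on-read sketch might look like this. The structure lives entirely in the reading code rather than the storage layer, so different analyses can impose different schemas on the same raw files.

```python
import json
import pathlib

def read_with_schema(lake_dir: str) -> list[tuple]:
    """Schema on read: the raw files stay untouched; structure is
    applied only at access time, and can differ per analysis."""
    rows = []
    for path in pathlib.Path(lake_dir).glob("**/*.json"):
        record = json.loads(path.read_text())
        # The "schema" lives here, in the reader, not in the storage.
        rows.append((record.get("user"), record.get("page", "unknown")))
    return rows

print(read_with_schema("lake/raw/clickstream"))
```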

Data Swamps

This flexibility, however, comes with its pitfalls. Since there is so little friction in adding to a Data Lake, the temptation to shoot first and ask questions later will always be there.

A Data Lake that has been used as a dumping ground, sucking in an organisation’s data without any strategy for how it will be used, is colloquially referred to as a Data Swamp. On top of the opaqueness of the data situation within a Swamp, there are additional drawbacks which can include compromised data security and difficulties in implementing effective user access controls.

The main way to prevent a Lake from turning into a Swamp is to incorporate good data governance practices within your organisation and resist the temptation to defer thinking about the data until later. Having a designated data steward, responsible for monitoring the entire data lifecycle and implementing a data lake governance strategy, will go a long way towards ensuring that your organisation has visibility and control over the Data Lake. Given the heterogeneity of the data, ad hoc solutions will be called for in data discovery, data cleaning, data integration and more. Acknowledging this and dedicating resources to account for it gives you the best chance of reaping all the benefits of a Data Lake while minimising the possibility of encountering any of the downsides.

Swamps can be beautiful; Data Swamps, not so much

Metadata and Good Data Lake Governance

Since most of what is stored in a Data Lake lacks the context provided by structured, relational databases, reliable metadata is vital to maintaining good Data Lake governance. There are many variations of metadata; however, according to NISO it is divided into three main types: descriptive, structural and administrative.

Descriptive metadata aids in discovery and includes human readable tags and labels which can help in search and categorisation.

Structural metadata refers to where a particular element exists in relation to others, helping to establish a bigger-picture view of the data landscape. This is important for keeping track of data lineage as well as creating rational navigational paths between elements.

Administrative metadata is a broader category which encompasses everything from the technical details needed to decode the files to information about the licence the data is covered by.
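
To illustrate, here is a sketch of what a single catalog entry capturing all three types might look like in Python. Every field name here is an assumption for the sake of the example, not part of the NISO standard or of any particular catalog product.

```python
# A hypothetical catalog entry for one Data Lake object, grouping the
# three NISO metadata types. All field names are illustrative only.
catalog_entry = {
    "path": "lake/raw/clickstream/web/20210302T101500.json",
    "descriptive": {                 # aids discovery and search
        "title": "Web clickstream events",
        "tags": ["clickstream", "web", "behavioural"],
    },
    "structural": {                  # relationships and lineage
        "source_system": "web-frontend",
        "derived_from": None,        # raw landing: no upstream object
        "feeds": ["lake/curated/sessions"],
    },
    "administrative": {              # technical and rights details
        "format": "application/json",
        "owner": "data-steward@example.com",
        "licence": "internal-use-only",
        "retention_days": 365,
    },
}
```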

Good Data Lake governance, as performed by the data steward, would primarily ensure that as much of this metadata as possible is captured in a usable form. This metadata can then be used to verify ownership, lineage and categorisation, as well as to enable better-regulated user access and easier compliance with any relevant requirements. Although a Data Lake lacks the full transparency that structured data offers, a well-curated metadata layer can come close to delivering the command and control over data that organisations typically seek.

If you’ve decided that a Data Lake is the best data analytics solution for your organisation, having a better understanding of some of these details will help ensure that you are getting the most out of it.

Watch the BizData webinar Accelerating the Benefits of a Data Lake for some relevant insights from our Director of Big Data Engineering, James Bashforth.