Good Data Lake governance is a vital part of maintaining a healthy and well-functioning data ecosystem in your organisation. To properly implement a Data Lake you have to prevent it from turning into a Data Swamp - a database where huge amounts of data have been dumped, without any real plan or system to ensure relevant and timely access. This is one of the main ways in which
To successfully achieve this you have to develop a strategy which takes into account your organisation's unique data situation. This involves thinking critically about your data to ensure that it is treated as a useful business asset. Based on this you can determine whether a Data Warehouse or a Data Lake is the best fit for you. After that, you should start developing policies for good implementation of metadata and incentivising active, data-oriented thinking within your business. Appointing Data Stewards with well-defined responsibilities is a part of keeping this process going.
What Types of Data are you Working With?
A big part of deciding what solution is best for your organisation ultimately comes down to the form your data takes: structured or unstructured, as well as the volume and velocity with which it will enter your storage.
Structured data takes a determined, predictable form, broken down into elements that follow a predefined model. This type of data can be used to set up a relational database, which can be predictably and consistently queried with a language such as SQL.
Unstructured data does not have this organisation, lacks common fields and is irregular, making it much more difficult to process and query. This type of data can include blocks of different sizes and formats, making them inscrutable to SQL.
Semi-structured data is a mixture of both, wherein there are certain defined classes and categories, but not to the same extent as fully structured data.
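As a rough illustration of the three categories above, the sketch below uses only Python's standard library. The table name, field names and sample values are invented for the example and do not come from any particular system.

```python
import json
import sqlite3

# Structured: every row follows a predefined model, so SQL can query it predictably.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)
north_total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE store = ?", ("north",)
).fetchone()[0]

# Semi-structured: JSON records share some keys, but not all of them.
events = [
    json.loads('{"type": "click", "page": "/home"}'),
    json.loads('{"type": "purchase", "amount": 42.5, "items": ["a", "b"]}'),
]
shared_keys = set(events[0]) & set(events[1])

# Unstructured: free text has no fields to select or filter on.
note = "Customer called about a late delivery and asked for a refund."

print("Total for north:", north_total)
print("Keys common to all events:", shared_keys)
print("Characters of free text:", len(note))
```

The structured rows can be aggregated with a single query, the JSON events only agree on the `type` key, and the free-text note offers nothing to query until some processing gives it structure.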
A Data Lake is particularly suited for storing data which is unstructured or has a less predictable structure. One of the main appeals of this storage solution is that it favours using cheaper storage resources rather than the more expensive compute resources. There is no need to transform the data into a structured relational format before putting it into storage. While this is a great strength, improper implementation can also make it a weakness.
Data Lake Governance Considerations
An important aspect to consider when thinking about storage is how your data will be accessed and the purposes for which it will be used. If you are going to be regularly accessing a large dataset or pulling insights from it to feed into your business intelligence / analytics reporting tool, it is going to influence your optimal data solution. High velocity and high volume analytics of a predictable kind lend themselves more to an OLAP data warehouse.
A final consideration is your organisation’s data governance plan and resourcing capacity. A “fire and forget” data strategy does not exist. Good data governance involves ongoing investment, upkeep and education. There is a need for a dedicated data steward role and an executive steering committee, which will be responsible for ongoing data governance and re-evaluation of requirements. This aspect is often not given sufficient scrutiny when determining a data solution.
Not understanding the nature and purpose of the data being used, and underestimating the capacity an organisation needs to implement change and shift its overall attitude to data, are common factors in the failure of Data Lakes.
How to Make Data Useful for Business
It is tempting to see your organisation’s data as abstract and impersonal, existing within its own sphere and affected only by technocratic management. However, data ultimately exists to inform business decisions and is only useful when digested in the right form by the right people. Effective, ongoing data governance is what ensures that your data is a nourishing wellspring rather than a stagnant swamp.
When dealing with data which is highly structured, consists of well-defined fields and is expected to be consistent, a structured and relational Data Warehouse will typically be the go-to choice. Data Warehouses commonly consolidate numerous data sources and perform queries using some variant of SQL. This makes a Data Warehouse a reliable source of truth for a business, since the data follows a single schema and structure. For the same reason, it can underpin a single, dependable analytics reporting platform. This highly structured and organised approach is the source of these benefits but is also the cause of several drawbacks.
The Data Lake Ingestion Process
Before organisational data can be fed into a Data Warehouse, it has to be made to fit the schema through an initial ETL (Extract, Transform and Load) process, which is a form of data homogenisation.
Each step of this process typically represents a set business objective and is aimed at ensuring that the incoming data will be not only usable, but also useful. This process is called schema on write. It requires a preliminary investment of time and resources to create the structure and format which will set the data up for the fastest possible querying and analysis.
Establishing a schema and scrutinising everything coming through the ETL process is a way to make sure that the data pipelines are purposeful, well understood and in line with organisational needs. A benefit of this rigidity is that working out the data schema in advance incentivises good data governance. However, because it is such an involved process, once it is in place it cannot be quickly changed based on evolving requirements.
Once established, an organisation is locked into a certain Data Warehouse structure. Any future changes would require a further investment in retooling costs.
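The schema on write approach described above can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module as a stand-in for a Data Warehouse. The `sales` table, the field names and the sample records are all hypothetical.

```python
import sqlite3

# Hypothetical raw records from a source system; names and values are illustrative.
raw_records = [
    {"store": " North ", "amount": "120.50"},
    {"store": "south", "amount": "not-a-number"},  # cannot be made to fit the schema
    {"store": "East", "amount": "75"},
]


def transform(record):
    """Return a (store, amount) row that fits the schema, or None if it cannot."""
    try:
        return record["store"].strip().lower(), float(record["amount"])
    except (KeyError, ValueError):
        return None


def load(conn, records):
    """Transform each record before loading it; return the records that were rejected."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (store TEXT NOT NULL, amount REAL NOT NULL)"
    )
    rejected = []
    for record in records:
        row = transform(record)
        if row is None:
            rejected.append(record)
        else:
            conn.execute("INSERT INTO sales VALUES (?, ?)", row)
    conn.commit()
    return rejected


conn = sqlite3.connect(":memory:")
rejected = load(conn, raw_records)
rows = conn.execute("SELECT store, amount FROM sales ORDER BY store").fetchall()
print("Loaded rows:", rows)
print("Rejected records:", rejected)
```

Every record is checked against the schema before it is stored, so anything in the table can be queried with confidence. The cost is that a change to the schema means revisiting both the table definition and the transform step.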
Furthermore, as the overall data landscape has evolved, it has trended towards larger volumes of less structured data in formats that defy structuring. Streaming media, IoT sensor data, geolocation info, page navigation metrics and many other new data streams present a challenge in building and maintaining an ETL process which will produce reliable structured data. The next generation of retail analytics relies on processing very large volumes of data generated by numerous in-store sensors.
Sensor data from the Internet of Things (IoT) represents a challenge for structured databases
The Data Lake Option
An alternative is to defer the upfront costs of integration by implementing a Data Lake, which represents a specific approach to organisational data. A Data Lake generally has a flat architecture, ingesting data in its original format, completely raw and without the need for making it fit a schema. Instead of a full initial ETL process, only the Extract and Load steps take place. The Transform stage is pushed back to the point when the data is accessed.
One of the benefits of a Data Lake is that it requires much less time and resources to set up. You can have one up and running in a comparatively short amount of time. Additionally, there is no need for data siloing or the creation of Data Marts. A Data Mart is a smaller-scale Data Warehouse which stores data only for a single department or segment of the organisation and typically prevents big-picture analysis.
With a Data Lake, all organisational data is stored together, regardless of type or format, providing analysts and data scientists with a free-fire zone to access, refine and explore a potentially broader and deeper range of data.
Especially when dealing with logs and sensor outputs, a Data Lake is a quick, flexible and cost-effective way of mocking up models and performing preliminary data analysis. With schema on read, structure is applied to the data at the point when it is accessed, the opposite of how it would be handled in a Data Warehouse. This ease of input means that a Data Lake can be used to store data whose value and utility are yet to be established. It can also act as a middle-point and staging area between a stream of unstructured data and, ultimately, structured data housed in a Data Warehouse.
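To make the schema on read idea concrete, here is a minimal sketch in which raw lines are kept exactly as they arrived and a schema is applied only when they are read. The sensor names, field names and mixed formats are assumptions made up for the example.

```python
import json

# Raw lines are stored exactly as they arrived (Extract and Load only).
# The mixed formats and field names are invented for illustration.
raw_lake = [
    '{"sensor": "door-1", "temp_c": "21.5"}',
    '{"sensor": "door-2", "temperature": 19}',
    "door-3,22.1",  # a CSV-style line from a different device
    "corrupted line",
]


def read_temperature(line):
    """Apply a schema at read time: return (sensor, temp_c), or None if it does not fit."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        # Not JSON, so fall back to treating the line as "sensor,value".
        sensor, _, value = line.partition(",")
        record = {"sensor": sensor, "temp_c": value}
    try:
        value = record.get("temp_c", record.get("temperature"))
        return record["sensor"], float(value)
    except (AttributeError, KeyError, TypeError, ValueError):
        return None


parsed = [read_temperature(line) for line in raw_lake]
readings = [item for item in parsed if item is not None]
unreadable = parsed.count(None)
print("Readings:", readings)
print("Lines that did not fit the schema:", unreadable)
```

Nothing was rejected at ingestion, so the cost of interpreting the data is paid by whoever reads it, and a different reader is free to apply a different schema to the same raw lines.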
What is a Data Swamp?
Flexibility in ingesting, however, comes with its pitfalls. Since there is so little friction in adding to a Data Lake, the temptation to shoot first and ask questions later will always be there.
A Data Lake that has been used as a dumping ground, sucking in an organisation’s data without any Data Lake strategy for how it will be used, is colloquially referred to as a Data Swamp. On top of the murkiness of the data situation within a Swamp, there are additional drawbacks. These can include compromised data security and difficulties in implementing effective user access controls.
The main way to prevent a Lake turning into a Swamp is to resist the temptation to defer thinking about the data until the future. Having a designated data steward, responsible for monitoring the entire data lifecycle and implementing a data lake governance strategy, will go a long way towards ensuring that your organisation has visibility and control over the Data Lake. Given the heterogeneity of the data, new ongoing solutions will be called for in data discovery, data cleaning, data integration and more. Acknowledging this and dedicating resources to account for it gives you the best chance of reaping all the benefits of a Data Lake while minimising the possibility of encountering any of the downsides.
Swamps can be beautiful; Data Swamps, not so much
Why is Metadata Important to Data Lake Governance?
Since most of what is stored in a Data Lake lacks the context provided by structured, relational databases, reliable metadata is very important in maintaining good Data Lake governance. There can be many variations of metadata; however, according to NISO, it is divided into three main types: descriptive, structural and administrative.
Descriptive metadata aids in discovery and includes human readable tags and labels which can help in search and categorisation.
Structural metadata refers to where a particular element exists in relation to others, helping to establish a bigger-picture view of the data landscape. This is important for keeping track of data lineage as well as creating rational navigational paths between elements.
Administrative metadata is a broader category which encompasses everything from the technical details needed to decode the files to information about a particular license the data is covered by.
Good Data Lake governance, as performed by the data steward, would primarily ensure that as much of this metadata as possible is captured in a usable form. Subsequently, this metadata can be used to verify ownership, lineage and categorisation, as well as to enable better-regulated user access and easier compliance with any relevant requirements. Although a Data Lake lacks the full transparency that structured data offers, a well-curated metadata layer can come close to delivering the benefits in data command and control that organisations typically seek.
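One way to picture such a metadata layer is as a small catalogue that records the three types of metadata for each object in the lake. The sketch below is an assumption-laden illustration rather than a reference to any particular catalogue product: the `CatalogEntry` class, its field names and the sample paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalogue record for one object stored in the lake."""
    path: str
    # Descriptive metadata: helps people find and categorise the data.
    title: str = ""
    tags: list = field(default_factory=list)
    # Structural metadata: how this object relates to others (lineage).
    derived_from: list = field(default_factory=list)
    # Administrative metadata: technical and rights information.
    file_format: str = ""
    owner: str = ""
    license_name: str = ""

    def missing_fields(self):
        """Return the names of the required fields that are still empty."""
        required = {"title": self.title, "file_format": self.file_format, "owner": self.owner}
        return sorted(name for name, value in required.items() if not value)


def upstream(catalog, path):
    """Return every path the given object was derived from, directly or indirectly."""
    by_path = {entry.path: entry for entry in catalog}
    seen = []
    stack = list(by_path[path].derived_from) if path in by_path else []
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        if current in by_path:
            stack.extend(by_path[current].derived_from)
    return seen


catalog = [
    CatalogEntry("raw/sensors.json", title="Raw sensor feed", tags=["iot"],
                 file_format="json", owner="data-steward"),
    CatalogEntry("clean/sensors.parquet", title="Cleaned sensor readings",
                 derived_from=["raw/sensors.json"], file_format="parquet"),
    CatalogEntry("report/weekly.csv", title="Weekly summary",
                 derived_from=["clean/sensors.parquet"], file_format="csv",
                 owner="analytics-team"),
]

gaps = {entry.path: entry.missing_fields() for entry in catalog if entry.missing_fields()}
lineage = upstream(catalog, "report/weekly.csv")
print("Entries with missing metadata:", gaps)
print("Lineage of the weekly report:", lineage)
```

A data steward could run a check like `missing_fields` routinely to spot objects with no recorded owner, while the lineage walk shows which raw sources a given report ultimately depends on.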
If you’ve decided that a Data Lake is the best data analytics solution for your organisation, having a better understanding of some of these details will help ensure that you are getting the most out of it.
Watch the BizData webinar Accelerating the Benefits of a Data Lake for some relevant insights from our Director of Big Data Engineering, James Bashforth.