Having precise data is valuable for any organisation. However, its utility is limited unless it is also timely, representing the most up-to-date snapshot possible. Depending on the database and how it is designed, data refresh may happen at different intervals, with reports and visualisations being updated only at certain predefined times. While periodically updated reports and dashboards are entirely suitable in many situations, sometimes there is a need for fresher data, presented as close to real-time as possible. What does it mean for data to be real-time? How does that differ from streaming data? This blog post explores these questions.
The Challenges of Real-Time Analytics
Commonly, data from an organisation’s various sources is consolidated into a form that can be visualised by extracting it and transforming it into a single standardised schema. It is then written to durable (as opposed to “in-memory”) storage in a relational data warehouse, where it can be batch queried. These refresh processes are usually scheduled to run periodically, often overnight.
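The batch pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source formats, the field names, the `sales` table and the function names are all hypothetical, and SQLite stands in for a real data warehouse.

```python
import sqlite3

# Hypothetical records from two source systems that use different field names.
crm_rows = [
    {"customer": "A", "amount_aud": "120.50"},
    {"customer": "B", "amount_aud": "80.00"},
]
shop_rows = [{"cust_id": "A", "total": 35.25}]


def to_standard(row):
    """Transform either source format into one (customer, amount) schema."""
    if "customer" in row:
        return row["customer"], float(row["amount_aud"])
    return row["cust_id"], float(row["total"])


def run_batch_refresh(conn, sources):
    """Extract, transform and load every row, as a scheduled job might do."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.execute("DELETE FROM sales")  # full refresh: discard the previous load
    for rows in sources:
        conn.executemany(
            "INSERT INTO sales VALUES (?, ?)",
            [to_standard(row) for row in rows],
        )
    conn.commit()


def sales_by_customer(conn):
    """Batch query over the whole stored data set."""
    cursor = conn.execute(
        "SELECT customer, SUM(amount) FROM sales "
        "GROUP BY customer ORDER BY customer"
    )
    return cursor.fetchall()


# ":memory:" keeps the sketch free of side effects; a file path would give
# the durable storage that a real warehouse provides.
conn = sqlite3.connect(":memory:")
run_batch_refresh(conn, [crm_rows, shop_rows])
report = sales_by_customer(conn)
```

The report only changes when `run_batch_refresh` runs again, which is why the freshness of a batch system is bounded by its refresh schedule.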
This intermittent refresh may be sufficient in many analytics scenarios, specifically where the insights are not time-sensitive and data is aggregated from whole, stored datasets. Increasingly, however, data needs to be processed and consumed in a timelier manner. As the underlying technology and the expectations of end users have evolved, data that was previously delivered with some latency is now often expected almost instantly. The idea of real-time data analytics has evolved accordingly, and tracing that evolution gives a clearer picture of what streaming data is.
Types of Real-Time Data Systems
Data systems that operate in real time are not new. They have conventionally been divided into three categories.
A hard real-time system has a latency period measured in milliseconds and has no tolerance at all for delay. This means that any disruption can lead to total system failure and, depending on the system, potential loss of life. An example of this might be the system which controls braking and steering in a car.
A soft real-time system has a slightly higher latency tolerance, measured in seconds. In this case a delay in transmission does not mean catastrophic failure but may lead to non-fatal glitches and inefficiencies. An example of this might be the system which controls the buying and selling of stocks.
Finally, a near real-time system has a latency tolerance that extends into the minutes with delays not representing a significant degradation of overall system function. Analytics reporting tools and many other data delivery systems typically fall into this category.
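As a rough illustration, the three conventional categories can be expressed as a classification by latency tolerance. The exact cut-offs below (one second and one minute) are assumptions chosen to match "milliseconds", "seconds" and "minutes"; the text above does not fix precise boundaries, and the names are hypothetical.

```python
from enum import Enum


class RealTimeClass(Enum):
    HARD = "hard real-time"
    SOFT = "soft real-time"
    NEAR = "near real-time"


# Assumed boundaries, for illustration only.
HARD_MAX_SECONDS = 1.0   # below this, tolerance is measured in milliseconds
SOFT_MAX_SECONDS = 60.0  # below this, tolerance is measured in seconds


def classify_by_tolerance(tolerance_seconds: float) -> RealTimeClass:
    """Map a latency tolerance (in seconds) onto the conventional categories."""
    if tolerance_seconds <= 0:
        raise ValueError("latency tolerance must be positive")
    if tolerance_seconds < HARD_MAX_SECONDS:
        return RealTimeClass.HARD
    if tolerance_seconds < SOFT_MAX_SECONDS:
        return RealTimeClass.SOFT
    return RealTimeClass.NEAR


braking = classify_by_tolerance(0.005)    # tolerance of a few milliseconds
trading = classify_by_tolerance(2.0)      # tolerance of a few seconds
dashboard = classify_by_tolerance(300.0)  # tolerance of several minutes
```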
The Evolution of Real-Time Analytics Systems
As real-time systems have increasingly become available to consumers, the line between soft real-time and near real-time has blurred, and it breaks down at the point of data consumption, which makes the distinction less useful. This is partly due to the increasing use of technologies such as wi-fi, which can complicate latency measurements. Furthermore, using response time as the determining factor of a real-time analytics system is unhelpful because it does not take into account the architecture behind the system and how the system itself is structured.
Because of this, a more modern way of looking at the soft and near real-time categories is to conceptualise them as streaming data (as opposed to batch data), in which analysis happens in-flight: the data is not committed to durable storage before it is analysed and is always subject to continuous queries. Processing is therefore based on never having access to the complete data set.
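To make in-flight analysis concrete, here is a minimal sketch of a continuous query: a sliding-window average that emits a fresh result for every incoming value while holding only a bounded window in memory. The function name, the window size and the sample readings are assumptions for illustration.

```python
from collections import deque
from typing import Iterable, Iterator


def sliding_average(stream: Iterable[float], window_size: int) -> Iterator[float]:
    """Yield the average of the most recent `window_size` values, one result per event."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    window = deque(maxlen=window_size)
    total = 0.0
    for value in stream:
        if len(window) == window_size:
            total -= window[0]  # the oldest value is about to be evicted
        window.append(value)
        total += value
        yield total / len(window)


# An iterator stands in for a live feed: each value is seen once, then gone.
readings = iter([10.0, 20.0, 30.0, 40.0])
averages = list(sliding_average(readings, window_size=2))
# averages == [10.0, 15.0, 25.0, 35.0]
```

Nothing outside the current window is retained, so the query never sees the data set as a whole, only the portion that is in flight.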
What is Streaming Analytics?
Streaming data can be defined by several characteristics:
Always on
An important characteristic of streaming data is that it is always on: it is constantly updated and continuously available. Because of this, the throughput and dependability of the collection and analysis system need to be adequate. Since the data is not being channelled into long-term storage, any downtime in these systems will usually result in data loss.
Being always on is also a double-edged sword when it comes to applying conventional statistics, which operate on discrete, whole data sets and are not necessarily applicable to a continuous, ongoing data stream.
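Some conventional statistics do have incremental counterparts that suit a continuous stream. One well-known example is Welford's online algorithm, which updates a mean and variance one value at a time without storing the values. The class name and sample values below are hypothetical, and this is a sketch rather than a recommendation for any particular system.

```python
class RunningStats:
    """Incrementally maintained mean and variance (Welford's online algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance of the values seen so far (defined as 0.0 for fewer than two)."""
        if self.count < 2:
            return 0.0
        return self._m2 / (self.count - 1)


stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)  # each value is seen once and then discarded
```

After the loop, `stats.mean` and `stats.variance` describe everything seen so far, and they remain available at every point while the stream continues.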
Non-fixed data structure
Streaming analytics systems are commonly set up to handle loosely structured data, or data in which certain dimensions are missing at any given time (using the JSON format is a common solution). One reason is that the data dimensions are likely to change over time. Another is that, given the immediate nature of streaming data, an upstream dependency may be temporarily down and unable to send its data.
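A minimal sketch of this tolerance, assuming each event arrives as a JSON object: missing dimensions are filled with `None`, and unexpected ones are kept aside rather than rejected. The field names and the `parse_event` helper are hypothetical.

```python
import json

# Hypothetical dimensions that the downstream analysis knows about.
EXPECTED_FIELDS = ("device_id", "temperature", "humidity")


def parse_event(raw: str) -> dict:
    """Parse one JSON object, tolerating missing and unexpected dimensions."""
    payload = json.loads(raw)
    # Missing dimensions become None instead of raising an error.
    event = {field: payload.get(field) for field in EXPECTED_FIELDS}
    # Dimensions that appeared later are preserved for future use.
    event["extra"] = {
        key: value for key, value in payload.items() if key not in EXPECTED_FIELDS
    }
    return event


complete = parse_event('{"device_id": "s1", "temperature": 21.5, "humidity": 40}')
partial = parse_event('{"device_id": "s2", "temperature": 19.0, "battery": 0.8}')
```

Here `partial` has no humidity reading and carries a new `battery` dimension, yet it is processed in the same way as `complete`.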
Large numbers of data values
On top of being continuous, a data stream will generally contain many unique values for a given field, a property also referred to as high cardinality. This is particularly the case with time-series data, where there may be a few commonly occurring states and a “long tail” of potentially many others that are rarely seen but still need to be accounted for by the processing system. This is especially challenging for a streaming system because, unlike a batch data system, it gets only one pass at the data.
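One way to cope with a long tail in a single pass is to track only the frequent values in bounded memory. The sketch below uses the Misra-Gries heavy-hitters algorithm as an illustrative technique; it is not something the text above prescribes, and the sample states are made up.

```python
def misra_gries(stream, k):
    """Return at most k - 1 candidate frequent items after one pass.

    Any item occurring more than n / k times in a stream of n items is
    guaranteed to be among the candidates. The stored counts are lower
    bounds on the true frequencies, not exact counts.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free slot: decrement every counter and drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# A few common states followed by a long tail of rare ones (hypothetical data).
states = ["ok"] * 6 + ["warn"] * 3 + ["e1", "e2", "e3"]
candidates = misra_gries(states, k=3)
```

Memory use depends on `k` rather than on the number of distinct values, so the rare states in the tail cannot exhaust it. The trade-off is that the counts are approximate.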
Streaming Data Analytics Implementation
There are many possible applications for streaming data. The original widespread application was the operational monitoring of physical systems. This can involve, among many other things, processing streams of financial data, manufacturing statistics, medical biometrics or transportation tracking. Such monitoring provides greater transparency and control over massively complex and fast-moving projects.
Furthermore, cheaper sensors, higher internet bandwidth and more mobile processing power have meant an increase in data streamed directly from any number of wearable or environmental sensors that are not part of enterprise projects. Individuals, along with their homes and their transportation, are increasingly generating large volumes of data which are best handled by streaming data systems. On top of that, the data generated through web browsing, e-commerce and social media use is another stream which is ripe for use but too voluminous for batch processing. As such, streaming analytics is important for applications such as real-time recommendations, advertising and A/B usability testing.
For a more in-depth look into streaming data analytics as well as its practical real-world applications, tune in to the free BizData webinar on Real-time Analytics: