Why is real-time data processing so challenging?

Josh Miramant

July 29, 2020·10 min read

Real-time data analysis is all about closing the gap between data collection, analysis, and action. With ever-increasing volumes of data being generated every second, analytics solutions need to keep pace. Businesses across industries share the same goal: to minimize the time it takes for real-world events to be captured, interpreted, and turned into valuable insights. To get there, the entire range of business intelligence tools must adapt to real-time data: machine learning algorithms, predictive analytics, report-oriented dashboards, and visualization tools alike. By investing in real-time data capabilities, businesses gain the opportunity to act on operational trends as they happen, instead of reacting after the fact. Real-time data also brings its own challenges, however. Let us discuss a few examples.

1. Automated data collection

Let us start with the first step in any analytics pipeline: data collection. The growing number of data sources and the variety of data formats can pose a real challenge to real-time data processing pipelines. In the online world, web activity is recorded by campaign monitoring tools, e-commerce activity trackers, and customer behavior models. Similarly, in a manufacturing plant, a multitude of IoT devices collect performance data from various pieces of equipment, each with its own specification and data format. API specification changes or sensor firmware updates can alter the data schema and disrupt the data collection pipeline. With traditional batch processing, data collection issues can be identified before any further steps in the pipeline are performed. With real-time data pipelines, however, any fault in the data collection step propagates immediately to the steps that follow. A real-time data processing pipeline therefore needs to account for missing or malformed readings, in order to prevent inaccurate analytics and potential malfunctions down the line.
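To make the idea of guarding against missing readings and schema drift concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the field names in EXPECTED_SCHEMA, the validate_reading and split_stream helpers, and the quarantine approach are all hypothetical and not taken from any particular product. Each incoming record is checked against the expected schema, and faulty records are set aside instead of flowing into downstream analytics.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical schema for a sensor reading: field name -> expected Python type.
EXPECTED_SCHEMA = {"sensor_id": str, "timestamp": float, "temperature": float}


@dataclass
class ValidationResult:
    record: Optional[dict]  # cleaned record, or None if validation failed
    errors: list            # human-readable descriptions of each problem


def validate_reading(raw: dict) -> ValidationResult:
    """Check one raw reading against EXPECTED_SCHEMA."""
    errors = []
    clean = {}
    for field, expected_type in EXPECTED_SCHEMA.items():
        value = raw.get(field)
        if value is None:
            errors.append(f"missing field: {field}")
        elif not isinstance(value, expected_type):
            errors.append(f"bad type for {field}: {type(value).__name__}")
        else:
            clean[field] = value
    return ValidationResult(record=clean if not errors else None, errors=errors)


def split_stream(readings):
    """Separate valid readings from quarantined ones (kept with their errors)."""
    valid, quarantined = [], []
    for raw in readings:
        result = validate_reading(raw)
        if result.record is not None:
            valid.append(result.record)
        else:
            quarantined.append((raw, result.errors))
    return valid, quarantined
```

Quarantining rather than silently dropping faulty records is one possible design choice: the rejected readings and their error messages remain available for diagnosing a firmware or API change after the fact.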

2. Data quality concerns

The second challenge specific to real-time analytics concerns data quality. Just as faulty data collection degrades the whole pipeline, poor data quality propagates through the entire analytics workflow. And nothing is worse than business insights based on inaccurate data. In fact, Gartner identifies poor-quality data as a direct cause of revenue losses for businesses. Real-time data is no exception to the old rule: garbage in, garbage out. Responsible organizations take extreme care over data accuracy, completeness, and integrity by sharing responsibility and democratizing data access. Quality strategies ensure that all roles understand the importance of accurate data and engage them to take responsibility for maintaining it. For real-time data, a similar quality strategy must be enforced through automated mechanisms, so that only reliable data sources are used. This helps ensure usable business insights and minimizes wasted analytics effort.
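One way such an automated mechanism could look is sketched below in Python. This is an assumption-laden illustration, not a description of any specific tool: the SourceQualityMonitor class, its sliding window size, and the 20% error threshold are all hypothetical choices. The monitor tracks the most recent validation outcomes per data source and reports whether a source is currently reliable enough to feed analytics.

```python
from collections import defaultdict, deque


class SourceQualityMonitor:
    """Tracks recent validation outcomes per source (hypothetical helper)."""

    def __init__(self, window=100, max_error_rate=0.2):
        # window and max_error_rate are illustrative defaults, not recommendations.
        self.max_error_rate = max_error_rate
        self._outcomes = defaultdict(lambda: deque(maxlen=window))

    def record(self, source, is_valid):
        """Store whether the latest record from `source` passed validation."""
        self._outcomes[source].append(is_valid)

    def error_rate(self, source):
        """Share of invalid records within the sliding window (0.0 if unseen)."""
        outcomes = self._outcomes[source]
        if not outcomes:
            return 0.0
        return outcomes.count(False) / len(outcomes)

    def is_reliable(self, source):
        """A source is reliable while its error rate stays at or below the limit."""
        return self.error_rate(source) <= self.max_error_rate
```

Because only the last `window` outcomes are kept, a source that recovers after a temporary fault becomes reliable again without manual intervention. Note that in this sketch a source with no history counts as reliable; a stricter policy could require a minimum number of observations first.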

3. Understanding real-time use cases

Existing predictive analytics and machine learning solutions can be adapted to tackle real-time data streams. However, real-time data processing applications come in different shapes and serve different needs. Real-time monitoring and reporting tools, interactive visualization dashboards, and real-time predictive analytics are different solutions to different problems. Businesses therefore need to correctly identify their use case and map it to the right kind of real-time application. As O'Reilly Media puts it: “Although not every application requires real-time data, virtually every industry requires real-time solutions.” For example, an online recommendation system has only a few milliseconds to choose and display the right ad before a webpage loads. A financial trading platform, on the other hand, needs to track and display price changes in real time and suggest further trading decisions. IoT-based monitoring tools need to detect anomalies and malfunctions from real-time sensor readings. Each of these solutions requires a different mix of software and hardware tools, from the early steps of data access all the way to advanced analytics. Optimized networks, machine learning libraries, in-memory databases, and pre-trained models all need to be configured to process real-time data seamlessly. Achieving a cost-effective solution would be impossible without a proper understanding of the business use case.
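As an illustration of the IoT monitoring use case, the following Python sketch flags anomalous sensor readings as they arrive. It uses a rolling z-score, which is just one simple technique chosen for this example; the RollingAnomalyDetector class and its window, threshold, and min_samples parameters are hypothetical and would need tuning for any real sensor.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags readings that deviate strongly from the recent window (sketch)."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.threshold = threshold      # z-score above which a reading is flagged
        self.min_samples = min_samples  # readings needed before flagging starts
        self._values = deque(maxlen=window)

    def observe(self, value):
        """Return True if `value` is anomalous relative to recent readings."""
        is_anomaly = False
        if len(self._values) >= self.min_samples:
            mu = mean(self._values)
            sigma = pstdev(self._values)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        # Anomalous readings are not added, so they do not distort the baseline.
        if not is_anomaly:
            self._values.append(value)
        return is_anomaly


detector = RollingAnomalyDetector()
readings = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 25.0, 10.1]
flags = [detector.observe(r) for r in readings]
```

With these sample readings, only the spike to 25.0 is flagged. Each reading is processed in constant memory, which is what makes this kind of check suitable for a stream rather than a batch job.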

4. Rethinking data processing architectures

Traditional data processing workloads rely on two different systems: one for collecting and storing data transactions (OLTP, Online Transaction Processing) and another optimized for analysis and exploration (OLAP, Online Analytical Processing). Such workloads process data in batches and rely on ETL pipelines to transfer data between the two systems. Modern HTAP (Hybrid Transactional/Analytical Processing) architectures handle both transactions and analytics in one system and thus enable analytics on live data streams. The hybrid approach has been made possible by the increased availability of in-memory database solutions and the decreasing cost of in-memory computing (Building Real-Time Data Pipelines). For the designer of a data processing architecture, this abundance is also a challenge: the appropriate technology has to be chosen from a wide variety of modern tools. In addition, the entire data architecture needs to be reconsidered in order to limit infrastructure complexity and build efficient data pipelines.

An illustrative how-to

An example of a data processing architecture adapted to fit real-time processing needs comes from the energy sector: IoT devices enable efficient consumption analytics by measuring electric consumption across households in a city. For the energy provider, processing of such data needs to happen in real time in order to ensure operational efficiency.

Figure: A traditional architecture (left) requires an additional data persistence system and additional data transfer needs (like ETL pipelines processing batches of old data). A real-time architecture (right) allows analytics to be run on the same systems where data is persisted. Removing connection points, data transfer pipelines, and different data formats and APIs is a way to ensure the simplicity of the architecture. Source: Building Real-Time Data Pipelines.

Real-time data does pose a few challenges, but effective solutions are possible, thanks to technology advances such as in-memory computing and distributed systems.
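The core idea of running analytics on the same system where data is persisted can be sketched in a few lines of Python. This is a toy stand-in under clear assumptions: SQLite's in-memory mode is used only because it ships with Python, it is not an HTAP database, and the readings table and the ingest and consumption_since helpers are hypothetical names invented for this example. What the sketch shows is the shape of the architecture: household meter readings are written and aggregated in one store, with no ETL step in between.

```python
import sqlite3

# In-memory store standing in for a combined transactional/analytical system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (household_id TEXT, ts INTEGER, kwh REAL)")


def ingest(household_id, ts, kwh):
    """Transactional side: persist one meter reading as it arrives."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO readings VALUES (?, ?, ?)", (household_id, ts, kwh)
        )


def consumption_since(ts):
    """Analytical side: total kWh per household from `ts` onward, on live data."""
    rows = conn.execute(
        "SELECT household_id, SUM(kwh) FROM readings "
        "WHERE ts >= ? GROUP BY household_id ORDER BY household_id",
        (ts,),
    )
    return dict(rows.fetchall())


ingest("h1", 100, 1.5)
ingest("h2", 130, 2.0)
ingest("h1", 160, 0.5)
totals = consumption_since(100)
```

Every reading is visible to the aggregate query as soon as it is committed, which is the property the real-time architecture in the figure relies on. A production system would replace the in-memory SQLite store with a database designed for concurrent transactional and analytical workloads.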

Use cases:

Streamlined loan application processes

One real estate process that poses an interesting challenge for real-time processing is the loan application, a challenge not only for confused homebuyers but for machine learning models as well. Credit approval models need access to all kinds of data, from personal information to credit history, historical transactions, and employment history. Manually identifying and integrating all these data sources can quickly turn into a tedious, time-consuming task. Moreover, manual processing comes with a high risk of erroneous entries throughout the application. These aspects have turned the manual loan application process into a bottleneck for real estate transactions.

Beeline is a company focused on streamlining the loan application process. Its mobile interface guides buyers through a loan application in about 15 minutes, and the company claims this saves home buyers a lot of headaches. The approach is simple: the service connects to a variety of personal data sources (such as bank, pay, and tax information), uses natural language processing (NLP) to read and collect the information, and integrates and analyzes all the data in real time. In this way, tedious and time-consuming steps are bypassed and home buyers can enjoy a streamlined loan application process.

The service depends on combining a mobile-first experience, intelligent processing capabilities, and state-of-the-art user design. The loan guide is delivered via a chat interface, which gives users an easy way to find answers to their questions. NLP algorithms back these interactions and help create a personalized experience. At the same time, automated evaluation algorithms run in the background while the buyer is filling in forms. This shows how automation is key to the success of the service, and the seamless interplay of tech tools is what makes this automation possible in the first place.

Commercial real estate valuations

Another crucial step in commercial real estate is property valuation. Automated Valuation Models, which evaluate properties and establish pricing schemes, are as old as the industry itself. Traditionally, these models were based mostly on historical sales data. However, models that rely only on past behavior miss out on many other data sources. Predictive analytics and modern data collection infrastructures are built to integrate external data sources and train algorithms on heterogeneous data types. Instead of using a single data type that offers a limited perspective on a property, unified data architectures offer a 360-degree view and integrate external data sources: market demand, macroeconomic data, rental values, capital markets, jobs, traffic, and so on. Since there are no hard limits on the data a property valuation model can use, predictive analytics is a powerful tool for real estate agencies.

Smart Capital offers such a modern solution to property valuation. They use predictive analytics to value real estate properties and promise to deliver a full report within one business day. Their CEO, Laura Krashakova, offers some insights into how they achieve this: “The technology enables data processing and property valuation in real-time and gives individuals access to data previously available only to local brokers. Local insights such as the popularity of the location, amenities in the area, quality of public transport, proximity to major highways, and foot traffic are now readily available and are scored for ease of comparison.”

Two aspects make such a service possible in the first place: ease of access and the ability to deliver real-time insights. Mobile and web platforms make it easy for customers to access, upload, and visualize their data, regardless of their location. All that is needed is an internet connection. At the same time, predictive analytics frameworks crunch data in real time, within milliseconds. As new data events occur, they are collected and included in the latest analysis report. There is no need to wait for time-consuming, intensive computations, since that computation can now happen almost instantly in the cloud. Once again, the interplay of modern technologies makes it possible to offer a seamless experience based on real-time insights, while the variety of external data sources helps increase valuation accuracy. This saves time, money, and headaches for all parties involved.

What’s next?

There are two pathways to bring data science and real-time analytics capabilities to your firm. First, you can adopt the “build” approach and hire a whole department of data specialists. This approach can work! However, it is slow and expensive to build such a department, especially if it is outside of your firm’s core competency. The second choice is to partner with a data science firm like Blue Orange. With this approach, you get the data science expertise your company needs, without the upfront investment. Take a closer look at how Blue Orange’s experienced data engineers can implement cost-effective, real-time data processing systems. They can assist you with any of the challenges listed above and help you turn real-time data into a tangible asset.

Josh Miramant

Josh Miramant is the CEO and founder of Blue Orange Digital, a data science and machine learning agency with offices in New York City and Washington DC. Miramant is a popular speaker, futurist, and a strategic business & technology advisor to enterprise companies and startups. He is a serial entrepreneur and software engineer who has built and scaled 3 startups. He helps organizations optimize and automate their businesses, implement data-driven analytic techniques, and understand the implications of new technologies such as artificial intelligence, big data, and the Internet of Things. He has been featured on IBM ThinkLeaders, Dell Technologies, and NYC’s Top 10 AI Development and Custom Software Development Agencies as reviewed on Clutch and YahooFinance for his contributions to NLP, AI, and Machine Learning. He specializes in predictive maintenance, unified data lakes, supply chain/grid/marketing/sales optimization, anomaly detection, and recommendation systems, among other ML solutions for a multitude of industries. Follow me on Twitter or LinkedIn. Check out my website. Visit BlueOrange.digital for more information and to view Case Studies.
