Low-latency data ingestion is popping up at nearly every company (not that every company needs real-time data integrations, but that's a topic for another blog post). Organizations are eager to tap into the power of instant insights, and technologies like Kafka have become essential in this brave new world. However, in the rush for real-time data, one vital aspect is often forgotten: data reconciliation.
The real-time data challenge
Imagine a scenario where an e-commerce platform processes thousands of transactions per minute. Orders are placed, payments are made, and inventory levels are updated in real time. But what happens if there's a discrepancy between the order data in the streaming platform and the records in the database? Such inconsistencies can lead to incorrect inventory counts, missed orders, and ultimately, dissatisfied customers, which can result in significant losses for the business.
When it comes to data replication, full loads are slow and costly in large-scale systems. Change data capture mechanisms are available in most OLTP systems, and they minimize the need for full loads. Some are better than others, but for the most part they are reliable. The problems occur in all the moving pieces between the change data, e.g., database transactions, and its eventual destination. These issues are rare, harder to find than a needle in a haystack yet more common than a Powerball win, but rare is still often enough to matter. Their frequency can also spike due to a variety of factors, from configuration issues to timing and other software problems to network connectivity.

This is where data reconciliation comes into play. Data reconciliation involves comparing data from a source system to data at some point downstream to ensure that the data in the latter is both complete and accurate. In streaming integrations, the reconciliation is performed not on the merged data but on the change data. For many businesses that rely on real-time data, reconciliation services are non-negotiable auxiliary components of their integrations, crucial for maintaining accuracy, trust, and reliability, not just for analytics but for mission-critical business processes.
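At its core, that comparison can be thought of as a set difference over change-data keys. Here is a minimal sketch of the idea, assuming change events on both sides can be keyed by primary key and commit timestamp; all of the names below are illustrative, not a real API:

```scala
// Minimal sketch: compare change data captured at the source with change data
// observed downstream over the same window, keyed by primary key and commit time.
// ChangeKey and both parameter names are hypothetical.
final case class ChangeKey(primaryKey: String, commitTs: Long)

def reconcile(
    sourceChanges: Set[ChangeKey],    // change records read from the source's CDC log
    downstreamChanges: Set[ChangeKey] // change records observed downstream (e.g., in Kafka)
): (Set[ChangeKey], Set[ChangeKey]) = {
  val missingDownstream    = sourceChanges diff downstreamChanges    // captured at source, never landed
  val unexpectedDownstream = downstreamChanges diff sourceChanges    // landed downstream, no source record
  (missingDownstream, unexpectedDownstream)
}
```

Everything else in a production service, such as windowing, retries, and tolerance for in-flight records, exists to make that comparison safe to run continuously.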
So why is no one talking about data reconciliation?
There are a few main reasons:
- Lack of awareness; it can be difficult to know data is missing unless it’s being reconciled.
- It sounds rather boring (it’s not), and it doesn’t have anything to do with generative AI (praise be).
- Until recently, there weren’t a lot of solutions available.
- It is a poorly documented phenomenon, with most of the engineers who encounter it resolving it through vendor support tickets, not on Stack Overflow.
Why do we need data reconciliation?
Let’s dive a bit deeper into the causes of data discrepancies in source-to-target integrations.
Connector Issues: Data connectors, which link various systems and data sources, can sometimes fail or produce inconsistent results due to configuration errors or network issues. These problems can cause data mismatches that need to be detected and resolved promptly. Reconciliation between the source and Kafka stream can be highly useful in identifying and correcting these mismatches.
DMS (Data Migration Service) Issues: Data migration services, while powerful, are not foolproof. During migrations, data can be lost or corrupted, leading to discrepancies between the source and target systems. Reconciliation between the source and the target landing zone is imperative for identifying and addressing these issues quickly, ensuring data integrity is maintained throughout the migration process.
Delayed Sink in Target: In real-time data streaming, delays in sinking data into the target system can occur due to processing bottlenecks or network latency. These delays can result in temporary inconsistencies; reconciliation between Kafka and the sink system can detect and highlight these exceptions, allowing for timely corrective actions.
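A delayed sink looks the same as a dropped record at first glance, so a reconciliation check typically waits out a grace period before flagging anything. A minimal sketch of that idea in Scala, where the record shape and the window length are assumptions for illustration rather than actual defaults:

```scala
import java.time.{Duration, Instant}

// A source-side change record that has not yet been matched downstream.
// The field names here are illustrative.
final case class PendingChange(primaryKey: String, committedAt: Instant)

// Only treat a record as a genuine exception once it has been unmatched for
// longer than the grace window; anything younger counts as "still in flight".
def classify(
    unmatched: Seq[PendingChange],
    grace: Duration,
    now: Instant
): (Seq[PendingChange], Seq[PendingChange]) =
  unmatched.partition(p => Duration.between(p.committedAt, now).compareTo(grace) > 0)

// Example usage: true exceptions vs. records that are probably just sink lag.
// val (exceptions, stillInFlight) = classify(unmatched, Duration.ofMinutes(10), Instant.now())
```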
The role of reconciliation services in streaming data pipelines
Streaming data pipelines, such as those powered by Confluent Kafka, handle massive volumes of data, and a large percentage of this is change data. They enable real-time data processing, allowing businesses to react instantly to changes and insights. However, issues like those described above can introduce discrepancies between data in motion and data in upstream and downstream storage layers.
A robust reconciliation service for streaming data pipelines ensures that any inconsistencies are detected so that they can be addressed promptly. Here's why this matters:
- Regulatory Compliance: For industries like finance and healthcare, accurate data is paramount. Reconciliation services help ensure compliance with regulations by maintaining data integrity.
- Operational Efficiency: Automated reconciliation reduces the need for manual data checks, freeing up valuable resources and allowing teams to focus on more strategic tasks.
- Customer Trust: Consistent and accurate data translates to better customer experiences. Whether it's processing a bank transaction or tracking a shipment, reliability builds trust.
How Rowlock’s source to stream reconciliation service works
While the technical details of reconciliation services can be complex, the core concept is pretty simple - and it should be. The reconciliation service shown at a high level below continuously monitors the change data flowing through your streaming pipelines and compares it with the change data in the database, using a Scala application packaged in a Docker container. It does this without introducing meaningful incremental load on the source system. When a mismatch is detected, the service flags it and streams the exception records to a designated topic, enabling self-healing as well as plug-and-play integration with your observability platform. To get this up and running, you ensure connectivity and database-specific permissions are in place, and then you simply feed a config to the containerized application.
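What might that config cover? Purely as an illustration, and not Rowlock's actual schema, the containerized application could load something like the following at startup (sketched here with the common Typesafe Config library; every key name below is hypothetical):

```scala
import com.typesafe.config.{Config, ConfigFactory}

// Hypothetical configuration a containerized reconciliation app could read at startup.
// Key names and structure are illustrative only.
final case class ReconciliationConfig(
    sourceJdbcUrl: String,   // the CDC-enabled source database
    changeTable: String,     // CDC / change-stream table to reconcile against
    kafkaBootstrap: String,  // Confluent Kafka bootstrap servers
    dataTopic: String,       // topic carrying the change data (e.g., "transactions")
    exceptionsTopic: String  // topic where mismatches are published
)

object ReconciliationConfig {
  def fromConfig(c: Config = ConfigFactory.load()): ReconciliationConfig =
    ReconciliationConfig(
      sourceJdbcUrl   = c.getString("source.jdbc-url"),
      changeTable     = c.getString("source.change-table"),
      kafkaBootstrap  = c.getString("kafka.bootstrap-servers"),
      dataTopic       = c.getString("kafka.data-topic"),
      exceptionsTopic = c.getString("kafka.exceptions-topic")
    )
}
```

The point is that the integration surface stays small: connection details for the source, connection details for Kafka, and the topics to reconcile and report on.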
Key Components
- Source System: A source system with change data capture or change streams enabled
- Docker Container: Encapsulates the configuration-based Scala application and its dependencies
- Kafka Consumer(s): The destination system on the right, downstream from Confluent Kafka
- Source Connector: Streams source system data into Kafka
- Reconciliation Engine: The lightweight Scala application that compares messages in the Kafka topic (e.g., transactions) with data in the source system; see the sketch below. Any discrepancies are published to a designated Kafka topic, allowing for self-healing. This application ensures change data is captured before it is cleared from the source system.
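Putting those components together, a stripped-down, one-shot version of the comparison loop might look like the sketch below. It reuses the hypothetical ReconciliationConfig from the configuration sketch above, and the SQL, topic handling, and key extraction are simplifying assumptions rather than Rowlock's actual implementation:

```scala
import java.time.Duration
import java.util.Properties
import java.sql.DriverManager
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ReconciliationEngineSketch {
  def main(args: Array[String]): Unit = {
    val cfg = ReconciliationConfig.fromConfig() // hypothetical config from the sketch above

    // 1. Collect the change keys currently visible in the Kafka data topic.
    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", cfg.kafkaBootstrap)
    consumerProps.put("group.id", "reconciliation-engine") // illustrative group id
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    consumerProps.put("auto.offset.reset", "earliest")

    val consumer = new KafkaConsumer[String, String](consumerProps)
    consumer.subscribe(java.util.List.of(cfg.dataTopic))
    val streamedKeys: Set[String] =
      consumer.poll(Duration.ofSeconds(5)).asScala.map(_.key()).toSet // single poll: a simplification
    consumer.close()

    // 2. Collect the change keys recorded at the source (CDC table and column names are illustrative).
    val conn = DriverManager.getConnection(cfg.sourceJdbcUrl)
    val rs = conn.createStatement().executeQuery(s"SELECT primary_key FROM ${cfg.changeTable}")
    val sourceKeys: Set[String] =
      Iterator.continually(rs).takeWhile(_.next()).map(_.getString("primary_key")).toSet
    conn.close()

    // 3. Anything captured at the source but absent from the stream is an exception;
    //    publish it to the exceptions topic so downstream tooling can self-heal.
    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", cfg.kafkaBootstrap)
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](producerProps)
    (sourceKeys diff streamedKeys).foreach { key =>
      producer.send(new ProducerRecord(cfg.exceptionsTopic, key, "missing-in-stream"))
    }
    producer.close()
  }
}
```

A production service would reconcile bounded windows incrementally and apply a grace period like the one sketched earlier, rather than comparing whole key sets in a single pass.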
By adding a reconciliation service to your streaming data pipelines, you can:
- Identify discrepancies in real time: Immediately detect data mismatches before they escalate into larger issues
- Streamline error handling: Automatically route exception records to designated topics or storage layers, enabling full self-healing or, at a minimum, making it easier for your team to manage and resolve issues
- Ensure consistency across systems: Maintain harmony between your streaming data and backend databases, ensuring end-to-end completeness and accuracy across systems
The growing prominence of data reconciliation
If data completeness and accuracy are requirements for your business, so is data reconciliation. Reconciliation services provide a critical layer of assurance for businesses that need to serve real-time use cases or derive value from streaming integrations.
Investing in a reconciliation service for your streaming data pipelines means more than just preventing errors; it means safeguarding your operations, enhancing customer satisfaction, and ensuring regulatory compliance. In a world where data is king, reconciliation is the key to reigning supreme.
So, if your business relies on streaming data, consider the value that a robust reconciliation service from Rowlock can bring. It's not just about catching mistakes and fixing them—it's about building a foundation of trust and reliability that can propel your business forward.