How do you know your data is correct?
TL;DR
I built a separate validation pipeline that hits the same API as my main ingestion pipeline, but instead of inserting data, it only computes metrics like count(transaction_id), sum(payout), and sum(revenue) per hour (based on the event datetime). It compares these to what’s in the data warehouse and stores the deltas in a validation table.
If the deltas are all zero — great. If not, I know exactly when and where things went wrong.
This approach helps catch silent bugs, data loss, or logic errors — and gives me real confidence that my data is correct.
Introduction
We often trust our pipelines a bit too much.
You build a solid ingestion system — it fetches data from an API, processes it, and loads it into your data warehouse. You wrap it in try-catch blocks, maybe add retries. Everything “works.” But the truth is, just because it ran doesn’t mean it did the right thing.
This post is about how I built a separate validation pipeline to check my main ingestion process — and why it was worth it.
The Problem
Let’s say you have a pipeline that:
- Makes requests to an API
- Transforms and processes the data
- Upserts it into a target table
This sounds simple, but…
- What if a silent bug drops some rows?
- What if the upsert logic breaks on a specific condition?
- What if the transformation maps a field incorrectly?
It’s easy to miss these — especially when nothing crashes.
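For concreteness, here is a minimal sketch of the kind of ingestion loop I mean. The endpoint, table, and column names are hypothetical, and the upsert assumes a Postgres-style warehouse reached through a psycopg2 cursor; the point is simply that none of the marked failure modes raises an exception.

```python
# Minimal sketch of the ingestion path described above. The endpoint, table,
# and column names are placeholders; the upsert assumes Postgres via psycopg2.
import requests

API_URL = "https://api.example.com/transactions"  # placeholder endpoint


def ingest_window(cur, start, end):
    resp = requests.get(API_URL, params={"from": start, "to": end}, timeout=60)
    resp.raise_for_status()

    for row in resp.json()["data"]:
        # Silent failure modes live here: a too-strict filter, a mis-mapped
        # field, or a bad upsert key quietly drops or overwrites rows.
        cur.execute(
            """
            INSERT INTO transactions (transaction_id, event_time, payout, revenue)
            VALUES (%s, %s, %s, %s)
            ON CONFLICT (transaction_id) DO UPDATE
               SET payout  = EXCLUDED.payout,
                   revenue = EXCLUDED.revenue
            """,
            (row["id"], row["datetime"], row["payout"], row["revenue"]),
        )
```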
The Solution: A Second Pipeline for Validation
I decided to build another pipeline: not to ingest the data again, but to verify that it was loaded correctly.
Both pipelines hit the same API endpoint, but the validation pipeline doesn’t touch the warehouse. Instead, it:
- Pulls API data for a specific time range (e.g., 22:00–22:59)
- Buckets the data by its datetime field, not ingestion time
- Computes count(transaction_id) (the data has unique IDs), sum(payout), sum(revenue), and other key metrics
- Compares these metrics against the corresponding metrics from the target warehouse table
- Stores the differences (deltas) in a separate validation table, partitioned by 1-hour windows
If the delta = 0 → great
If there’s a mismatch → something’s off
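To make the flow concrete, here is a sketch of what the validation pass can look like in Python. It is not the exact implementation: the endpoint, the transactions and validation_deltas table names, and the psycopg2-style cursor are assumptions, and error handling is left out.

```python
# A sketch of the validation pass, assuming a Postgres-style warehouse reached
# through a psycopg2 cursor. The endpoint, table, and column names are
# placeholders, and error handling is omitted for brevity.
from collections import defaultdict
from datetime import datetime

import requests

API_URL = "https://api.example.com/transactions"  # placeholder endpoint


def fetch_api_metrics(start, end):
    """Pull raw rows for [start, end) and aggregate them per event hour."""
    resp = requests.get(API_URL, params={"from": start, "to": end}, timeout=60)
    resp.raise_for_status()

    buckets = defaultdict(lambda: {"tx_count": 0, "payout": 0.0, "revenue": 0.0})
    for row in resp.json()["data"]:
        # Bucket by the event's own datetime (assumed ISO-8601), never by
        # ingestion time, so every metric lands in the hour it belongs to.
        hour = datetime.fromisoformat(row["datetime"]).replace(
            minute=0, second=0, microsecond=0
        )
        buckets[hour]["tx_count"] += 1  # transaction_id is unique upstream
        buckets[hour]["payout"] += float(row["payout"])
        buckets[hour]["revenue"] += float(row["revenue"])
    return buckets


def fetch_warehouse_metrics(cur, start, end):
    """Aggregate the same metrics from the target warehouse table."""
    cur.execute(
        """
        SELECT date_trunc('hour', event_time) AS hour,
               count(transaction_id)          AS tx_count,
               sum(payout)                    AS payout,
               sum(revenue)                   AS revenue
        FROM transactions
        WHERE event_time >= %s AND event_time < %s
        GROUP BY 1
        """,
        (start, end),
    )
    return {
        row[0]: {"tx_count": row[1], "payout": float(row[2]), "revenue": float(row[3])}
        for row in cur.fetchall()
    }


def store_deltas(cur, api_metrics, wh_metrics):
    """Write one delta row per hour into the validation table; 0 means all good."""
    empty = {"tx_count": 0, "payout": 0.0, "revenue": 0.0}
    for hour in sorted(set(api_metrics) | set(wh_metrics)):
        api, wh = api_metrics.get(hour, empty), wh_metrics.get(hour, empty)
        cur.execute(
            """
            INSERT INTO validation_deltas
                (hour, tx_count_delta, payout_delta, revenue_delta, checked_at)
            VALUES (%s, %s, %s, %s, now())
            """,
            (
                hour,
                api["tx_count"] - wh["tx_count"],
                api["payout"] - wh["payout"],
                api["revenue"] - wh["revenue"],
            ),
        )
```

Because only the per-hour deltas are stored, the validation table stays tiny even over long time ranges, and every row maps directly to a 1-hour window you can re-check.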
Why This Works
This strategy gives me a lot of confidence because:
- Time attribution is accurate: I group by the actual event timestamp, not when the job ran
- Deduplication is unnecessary: the transaction_id is unique and enforced
- It’s efficient: I store only the differences per hour, not full data dumps
- It’s traceable: I can go back and recheck specific hours if something breaks downstream
I can quickly answer questions like:
- Did we ingest all the data from 3pm – 4pm yesterday?
- Why does today’s revenue look low?
- Was there a schema bug at midnight?
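With deltas stored per hour, most of these questions reduce to a single lookup. A small sketch, again assuming the hypothetical validation_deltas table from above:

```python
# Answering "did we ingest everything between 3pm and 4pm yesterday?" becomes
# a lookup against the (hypothetical) validation_deltas table.
def hour_is_clean(cur, hour):
    cur.execute(
        """
        SELECT tx_count_delta, payout_delta, revenue_delta
        FROM validation_deltas
        WHERE hour = %s
        ORDER BY checked_at DESC
        LIMIT 1
        """,
        (hour,),
    )
    row = cur.fetchone()
    # No row means the hour was never validated, which is itself a red flag.
    return row is not None and all(abs(delta) < 1e-6 for delta in row)
```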
What It Doesn’t Catch (And How to Go Further)
While this works really well, I also thought about what I could be missing. Here are a few optional ideas I’ve considered or plan to add:
Schema Drift Checks
Even if counts match, field types or names might change silently.
Fix: Log and diff field lists + data types across runs
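One lightweight way to do this, sketched below, is to snapshot the field names and inferred types from a sample API row on each run and diff the snapshot against the previous one; how the snapshots are persisted is up to you.

```python
# Sketch of a schema-drift check: snapshot field names and inferred types from
# one sample API row per run, then diff against the previous run's snapshot.
def snapshot_schema(sample_row: dict) -> dict:
    """Map field name -> Python type name for one representative row."""
    return {field: type(value).__name__ for field, value in sample_row.items()}


def diff_schemas(previous: dict, current: dict) -> list[str]:
    """Return human-readable differences between two schema snapshots."""
    changes = []
    for field in previous.keys() - current.keys():
        changes.append(f"field removed: {field}")
    for field in current.keys() - previous.keys():
        changes.append(f"field added: {field}")
    for field in previous.keys() & current.keys():
        if previous[field] != current[field]:
            changes.append(f"type changed: {field} {previous[field]} -> {current[field]}")
    return changes
```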
Outlier Detection
What if the API sent bad values, but the row count is right?
Fix: Add Z-score anomaly checks on revenue, payout, etc.
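A basic version of this is a Z-score over the hourly series; the 3-sigma threshold below is just a starting point, not a tuned value.

```python
# Flag hours whose revenue (or payout) deviates too far from the series mean.
from statistics import mean, stdev


def zscore_outliers(hourly_values: list[float], threshold: float = 3.0) -> list[int]:
    """Return indices of hours deviating more than `threshold` sigmas from the mean."""
    if len(hourly_values) < 2:
        return []
    mu = mean(hourly_values)
    sigma = stdev(hourly_values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(hourly_values) if abs(v - mu) / sigma > threshold]
```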
Simple Dashboard
Raw deltas are powerful — but a quick UI helps debug faster.
Fix: A simple dashboard showing 1-hour blocks and validation status
Final Thoughts
This setup might sound like overkill, but it gives me peace of mind. The data I work with drives reporting, billing, and business decisions. A 10% undercount isn’t just a bug — it’s a breach of trust.
By decoupling ingestion from validation and verifying everything at the source, I’m not just hoping for correctness — I’m measuring it.
You don’t need a 100% perfect system to trust your data —
but you do need a system that tells you when something’s wrong.
And that’s what this validation pipeline gave me.