How do you make your pipeline faster for handling big data?
TL;DR
A little fine-tuning of your pipeline can go a long way toward boosting its performance.
Senior Data Engineer
TL;DR
I built a separate validation pipeline that hits the same API as my main ingestion pipeline, but instead of inserting data, it only computes metrics like count(transaction_id), sum(payout), and sum(revenue) per hour (based on event datetime). It compares these to what’s in the data warehouse and stores the deltas in a validation table.
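The validation approach above can be sketched as follows. This is a minimal, self-contained illustration, not the author's actual pipeline: the in-memory row lists stand in for the API responses and the warehouse query results, and the metric names mirror the ones mentioned (count of transaction_id, sum of payout, sum of revenue per event hour).

```python
from collections import defaultdict
from datetime import datetime

# Stand-ins for rows fetched from the source API and the data warehouse.
# The warehouse copy is missing t2, so hour 10:00 should show a delta.
api_rows = [
    {"event_time": "2024-05-01T10:15:00", "transaction_id": "t1", "payout": 10.0, "revenue": 12.0},
    {"event_time": "2024-05-01T10:45:00", "transaction_id": "t2", "payout": 5.0, "revenue": 6.0},
    {"event_time": "2024-05-01T11:05:00", "transaction_id": "t3", "payout": 7.0, "revenue": 9.0},
]
warehouse_rows = [
    {"event_time": "2024-05-01T10:15:00", "transaction_id": "t1", "payout": 10.0, "revenue": 12.0},
    {"event_time": "2024-05-01T11:05:00", "transaction_id": "t3", "payout": 7.0, "revenue": 9.0},
]

def hourly_metrics(rows):
    """Aggregate count(transaction_id), sum(payout), sum(revenue) per event hour."""
    agg = defaultdict(lambda: {"count": 0, "payout": 0.0, "revenue": 0.0})
    for r in rows:
        hour = datetime.fromisoformat(r["event_time"]).strftime("%Y-%m-%d %H:00")
        agg[hour]["count"] += 1
        agg[hour]["payout"] += r["payout"]
        agg[hour]["revenue"] += r["revenue"]
    return dict(agg)

def metric_deltas(api, wh):
    """Compare the two sides per hour; a nonzero delta flags missing or duplicated data."""
    zero = {"count": 0, "payout": 0.0, "revenue": 0.0}
    deltas = {}
    for hour in set(api) | set(wh):
        a, w = api.get(hour, zero), wh.get(hour, zero)
        deltas[hour] = {k: a[k] - w[k] for k in zero}
    return deltas

deltas = metric_deltas(hourly_metrics(api_rows), hourly_metrics(warehouse_rows))
```

In a real setup the deltas dict would be written to the validation table, so any hour with a nonzero count or sum delta can trigger a re-ingest for just that window.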
TL;DR
Range joins like ip between start_ip and end_ip are brutally slow on large datasets because they result in O(n × m) comparisons.
To solve this, I implemented a bucket-wise join strategy.
1) Convert each ip into a bucket_id.
2) Map start_ip - end_ip ranges into overlapping bucket_ids.
3) Join on bucket_id first and filter with between later.
This reduced the query from 140 trillion comparisons to a scalable, fast join, all without losing accuracy.
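The three steps above can be sketched in plain Python. The bucket size here (2^16 addresses, i.e. a /16 block) and the sample ranges are assumptions for illustration; the same idea maps directly onto a SQL or Spark join on bucket_id followed by the exact between filter.

```python
import ipaddress

BUCKET_SIZE = 2 ** 16  # assumed bucket width: one /16 block per bucket

def ip_to_int(ip: str) -> int:
    return int(ipaddress.IPv4Address(ip))

def bucket_id(ip_int: int) -> int:
    # Step 1: convert an IP into a coarse bucket_id.
    return ip_int // BUCKET_SIZE

# Hypothetical geolocation ranges: (start_ip, end_ip, location).
ranges = [
    ("1.0.0.0", "1.0.255.255", "AU"),
    ("2.16.0.0", "2.19.255.255", "EU"),
]

# Step 2: expand each range into every bucket_id it overlaps.
bucket_index: dict[int, list[tuple[int, int, str]]] = {}
for start, end, loc in ranges:
    s, e = ip_to_int(start), ip_to_int(end)
    for b in range(bucket_id(s), bucket_id(e) + 1):
        bucket_index.setdefault(b, []).append((s, e, loc))

def lookup(ip: str):
    # Step 3: join on bucket_id first, then filter with the exact between check.
    ip_int = ip_to_int(ip)
    for s, e, loc in bucket_index.get(bucket_id(ip_int), []):
        if s <= ip_int <= e:
            return loc
    return None
```

Because each IP only gets compared against the handful of ranges sharing its bucket, the quadratic scan collapses to an equi-join plus a cheap local filter, with no loss of accuracy since the between check still runs on every candidate.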
TL;DR
How to efficiently map IP addresses to geolocation data without a paid API.
In an ideal world, all incoming data files would have the same schema, be well-organised, and follow a consistent structure. However, in reality, data files vary in format due to:
When working with datasets from multiple sources, it’s common to encounter varying date formats. Each source may provide dates in different formats, leading to challenges in data consistency and processing. Here are some examples of different date formats you might encounter:
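One common way to normalise such dates is to try a list of known formats in order and emit a single canonical form. The format list below is an assumption for illustration; beware that purely numeric formats like 01/05/2024 are ambiguous between day-first and month-first, so the order you try them in encodes a policy decision.

```python
from datetime import datetime

# Assumed candidate formats, one per source; extend as new sources appear.
# Day-first is tried before month-first here, which is itself a policy choice.
FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y", "%b %d, %Y"]

def parse_date(raw: str) -> str:
    """Try each known format in turn and normalise to ISO 8601 (YYYY-MM-DD)."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")
```

Rejecting unknown formats loudly, rather than guessing, keeps silent date corruption out of the warehouse.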