Optimising Range Joins from 140 Trillion Comparisons to Practical Performance

Written on April 16, 2025

TL;DR
Range joins like ip between start_ip and end_ip are brutally slow on large datasets because they result in O(n x m) comparisons.
To solve this, I implemented a bucket-wise join strategy.
1) Convert each ip into a bucket_id.
2) Map start_ip - end_ip ranges into overlappting bucket_ids. 3) Join on bucket_id first and filter with between later.
This reduced the query from 140 trillion comparisons to a scalable, fast join, all without losing accuracy.

Read More

Efficiently Mapping IP Addresses to Geolocation

Written on April 9, 2025

TL;DR
How to efficiently map IP addresses to geolocation data without a paid API.

  • Avoid expanding CIDR blocks into individual IPs - it’s memory intensive and inefficient.
  • Convert IPs to integers (for IPv4) and compare them against CIDR start and end ranges.
  • Use libraries like ipaddresses (in Python) to convert CIDR to start/end and IP to integer.
  • IPv6 integers are too large for Bigint, so store them as strings and use string comparison.
  • Plist IPv4 and IPv6 into separate tables to maintain performance, especially if IPv4 traffic dominates. This method helped me build a scalable, cost-efficient, and fast IP geolocation lookup without relying on paid APIs.
Read More

Handling date formats in big data

Written on July 31, 2024

When working with datasets from multiple sources, it’s common to encounter varying date formats. Each source may provide dates in different formats, leading to challenges in data consistency and processing. Here are some examples of different date formats you might encounter:

Read More