DEV Community

Cover image for greatCircleDistance in ClickHouse: Avoiding Full Table Scans
Mohamed Hussain S
Mohamed Hussain S

Posted on

greatCircleDistance in ClickHouse: Avoiding Full Table Scans

When working with location data, one problem shows up almost immediately:

“How do I calculate the distance between two coordinates stored in my database?”

At first, it seems like something you’d have to handle outside the database.

But if you're using ClickHouse, there’s a built-in function for this.


The Right Tool: greatCircleDistance

greatCircleDistance(lat1, lon1, lat2, lon2)
Enter fullscreen mode Exit fullscreen mode

It calculates the shortest distance between two points on Earth.

Example

SELECT greatCircleDistance(13.0827, 80.2707, 12.9716, 77.5946) AS distance_meters;
Enter fullscreen mode Exit fullscreen mode

This gives you the distance between Chennai and Bangalore - in meters.


Looks Simple… But There’s a Catch

Now let’s say you write a query like this:

SELECT city
FROM locations
WHERE greatCircleDistance(lat, lon, 13.0827, 80.2707) < 5000;
Enter fullscreen mode Exit fullscreen mode

At first glance, this looks perfectly fine.

But this can quietly turn into a full table scan - especially on large datasets.


Why This Happens

In ClickHouse, indexes don’t work like traditional B-tree indexes.

They are:

  • Sparse
  • Designed for range pruning

They work well for queries like:

WHERE lat BETWEEN x AND y
Enter fullscreen mode Exit fullscreen mode

But not for:

WHERE greatCircleDistance(lat, lon, x, y) < 5000
Enter fullscreen mode Exit fullscreen mode

Because:

The function is applied on the columns, so ClickHouse cannot use the index to skip data efficiently.


The Better Approach (What You Should Actually Do)

Instead of directly applying the function, reduce the dataset first.

Bounding Box Filter

SELECT city
FROM locations
WHERE lat BETWEEN (13.0827 - 0.05) AND (13.0827 + 0.05)
  AND lon BETWEEN (80.2707 - 0.05) AND (80.2707 + 0.05)
  AND greatCircleDistance(lat, lon, 13.0827, 80.2707) < 5000;
Enter fullscreen mode Exit fullscreen mode

(The bounding box is an approximation to reduce the search space before exact filtering.)


Why This Works

  • lat BETWEEN → uses index
  • lon BETWEEN → reduces rows further
  • greatCircleDistance → applied only on filtered data

So instead of scanning the entire table:
=> You narrow it down first, then compute accurately


Real-World Use Cases

This pattern is useful in:

  • Delivery radius filtering
  • Finding nearby users
  • Geo-based analytics
  • Ride-sharing systems

One Important Gotcha

Make sure:

  • Coordinates are in degrees (not radians)
  • Order is always (lat, lon)

Swapping them will give incorrect results.


Final Thoughts

greatCircleDistance is powerful - but if used blindly, it can hurt performance.

In ClickHouse, performance often depends more on how you query than what you query.

Sometimes, the right approach isn’t just using a function - but knowing when and how to use it efficiently.


Top comments (0)