Doogal Simpson

Posted on • Originally published at doogal.dev

Beyond the Haversine Formula: Why I Use Geohashing for Spatial Search

TL;DR: Geohashing encodes 2D coordinates into hierarchical string prefixes, transforming expensive O(n) geometric calculations into efficient indexed lookups. By mapping geographic areas to unique string identifiers, databases can execute high-speed prefix scans rather than performing floating-point evaluations on every record. This allows applications to scale proximity searches to millions of concurrent users without hitting CPU bottlenecks.

I have seen too many backends grind to a halt because they are trying to solve geometry problems at the query layer. The naive approach to finding a nearby taxi or a local restaurant is to store latitude and longitude as floats and then run a radius query. While this works in a development environment with a hundred rows, it fails at production scale because it forces the database to evaluate an unindexed geometric predicate against every single record.

When I am architecting a system that needs to handle thousands of concurrent spatial queries, I stop asking the database to do trigonometry. Instead, I treat location as a string matching problem. By moving the complexity from the CPU to the index, I can keep response times low even as the dataset grows into the millions.

Why is calculating distance in a database expensive?

Proximity is not a property that standard B-Tree indexes can efficiently filter without specialized spatial extensions. This usually forces the database engine to perform an expensive floating-point evaluation on every record in the table to determine if a point falls within the search radius.

In a standard relational database, an index is great for finding an exact match or a range of numbers. But a radius is a circle, and latitude/longitude are two independent variables. To find objects in that circle, the database has to calculate the distance between your center point and every potential candidate. If I have 100,000 drivers and 1,000 users searching at once, the CPU is effectively pinned just trying to solve high-school geometry millions of times per second. It simply does not scale.
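To make that cost concrete, here is a minimal sketch of the naive approach (the `drivers` rows and the `drivers_within_radius` helper are illustrative names, not from a real codebase): every search re-runs the Haversine formula against every row.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def drivers_within_radius(drivers, lat, lon, radius_km):
    # O(n): the trigonometry above runs for every row, on every search.
    return [d for d in drivers
            if haversine_km(lat, lon, d["lat"], d["lon"]) <= radius_km]
```

With 100,000 rows and 1,000 concurrent searches, that inner function runs a hundred million times, and no B-Tree index can help it.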

How does Geohashing solve the spatial indexing problem?

Geohashing encodes latitude and longitude into a single string by recursively partitioning the map into a grid. This maps 2D spatial data onto a 1D string, where points that are physically close share the same character prefix, allowing the database to bucket locations together.

I think of it as a recursive subdivision of the world. If I split the map into a grid and label a large square 'B', any point inside that square starts with the letter 'B'. If I divide 'B' into smaller squares and label one 'C', any point in that smaller square now has the prefix 'BC'. As I continue this process, I build a hierarchical prefix tree. A six-character geohash represents a specific neighborhood, while an eight-character hash points to a specific street corner.

| Geohash Length | Approximate Area Coverage | Engineering Use Case |
| --- | --- | --- |
| 1 | 5,000 km × 5,000 km | Global data sharding |
| 4 | 39 km × 19 km | Regional search / Weather |
| 6 | 1.2 km × 0.6 km | Local delivery / Dispatching |
| 8 | 38 m × 19 m | Precise asset tracking |
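The recursive subdivision described above fits in a few lines of Python. This is a minimal sketch of the standard base32 geohash encoding (in production I would reach for a library rather than hand-roll this): it alternates longitude and latitude bisections and emits one character per five bits.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet

def encode(lat, lon, precision=12):
    """Encode a lat/lon pair into a geohash string of the given length."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, even = [], True  # geohash interleaving starts with longitude
    while len(bits) < precision * 5:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        even = not even
    # Pack each group of 5 bits into one base32 character.
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, len(bits), 5)
    )
```

Note the hierarchy falling out for free: a shorter hash of the same point is always a prefix of the longer one, which is exactly what makes the prefix queries below work.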

Why is prefix matching better than radius math?

Prefix matching turns a complex spatial calculation into a simple index range scan. By querying for a specific string prefix, I am leveraging the database’s primary or secondary index to find nearby points in logarithmic time rather than linear time.

When I use geohashes, I am no longer asking the database to calculate distances. I am asking it to find every record where the location_hash starts with a specific string, like bcde. This is an operation that every modern database, from PostgreSQL to DynamoDB, is built to do at high speed. It essentially turns a spatial query into a standard B-Tree lookup, which is significantly cheaper on the CPU than executing the Haversine formula across the entire dataset.
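As an illustration, here is the same idea against an in-memory SQLite table (the table and column names are invented for the example). Writing the prefix match as an explicit half-open range makes it obvious that the predicate is answerable straight from the B-Tree:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drivers (id INTEGER PRIMARY KEY, location_hash TEXT)")
conn.execute("CREATE INDEX idx_location ON drivers (location_hash)")
conn.executemany(
    "INSERT INTO drivers (location_hash) VALUES (?)",
    [("bcdefgh",), ("bcdezzz",), ("bcxaaaa",), ("u4pruyd",)],
)

# Prefix match as a bounded index range scan: [prefix, prefix-with-last-char+1)
prefix = "bcde"
upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)  # "bcdf"
rows = conn.execute(
    "SELECT location_hash FROM drivers"
    " WHERE location_hash >= ? AND location_hash < ?"
    " ORDER BY location_hash",
    (prefix, upper),
).fetchall()
```

A `LIKE 'bcde%'` predicate expresses the same intent; the explicit range form just spells out what the optimizer does when it can use the index.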

How do you handle points on the edge of a grid?

I handle the "edge case"—where two people are physically close but separated by an arbitrary grid line—by querying the user's current square plus its eight immediate neighbors. This ensures complete coverage without sacrificing the performance gains of the indexed lookup.

While querying nine squares sounds like more work than querying one, it is still orders of magnitude faster than a full table scan. Most geohashing libraries provide a simple function to calculate these "neighboring" hashes. By fetching these nine prefixes in a single batch, I can guarantee that I never miss a nearby taxi just because it happens to be across the street in a different grid cell.
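For intuition, here is a self-contained toy version of that neighbor calculation (a real library handles the antimeridian and the poles; this sketch does not): decode the cell's bounding box, shift its centre by one cell width/height in each direction, and re-encode at the same precision.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def encode(lat, lon, precision):
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, even = [], True
    while len(bits) < precision * 5:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1); rng[0] = mid
        else:
            bits.append(0); rng[1] = mid
        even = not even
    return "".join(BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                   for i in range(0, len(bits), 5))

def decode_box(gh):
    """Return the ((south, north), (west, east)) bounding box of a cell."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    even = True
    for ch in gh:
        idx = BASE32.index(ch)
        for bit in (16, 8, 4, 2, 1):
            rng = lon_rng if even else lat_rng
            mid = (rng[0] + rng[1]) / 2
            if idx & bit:
                rng[0] = mid
            else:
                rng[1] = mid
            even = not even
    return lat_rng, lon_rng

def neighbors(gh):
    """The cell itself plus its eight neighbors, as a set of hashes."""
    (south, north), (west, east) = decode_box(gh)
    lat_c, lon_c = (south + north) / 2, (west + east) / 2
    dlat, dlon = north - south, east - west
    return {encode(lat_c + i * dlat, lon_c + j * dlon, len(gh))
            for i in (-1, 0, 1) for j in (-1, 0, 1)}
```

Fetching those nine prefixes in one batch gives the complete coverage described above.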

FAQ

Can I use Geohashing with NoSQL databases like DynamoDB?
Yes, this is a primary use case for Geohashing. Since DynamoDB doesn't have native spatial types, you can store the geohash as a Sort Key to perform efficient prefix scans, which is one of the few practical ways to do performant "nearby" searches in most NoSQL environments.

How do I decide the length of the geohash to store?
I usually store at a high precision (10-12 characters) but query at a lower precision. For a ride-sharing app, I might query at length 6 to get a 1km search area, then do a quick client-side sort to find the absolute closest driver from that filtered subset.
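That two-phase pattern — coarse prefix filter in the database, exact ranking on the client — might look like this (the driver rows and the `closest_driver` helper are illustrative names):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def closest_driver(candidates, user_lat, user_lon):
    # 'candidates' is the small set returned by the length-6 prefix scan,
    # so running exact distance math here touches tens of rows, not millions.
    return min(candidates,
               key=lambda d: haversine_km(user_lat, user_lon,
                                          d["lat"], d["lon"]))
```

The expensive trigonometry still happens, but only over the handful of rows the index already narrowed down.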

Is Geohashing better than PostGIS?
PostGIS is excellent for complex polygons and precise geographic analysis. However, if your only goal is to find "points near me" at massive scale, Geohashing is often easier to implement, cheaper to run, and more portable across different database technologies.
