Scalable Proximity Search: Why Geohashing Beats Radius Queries

#geohashing #algorithms #systemdesign

TL;DR: Geohashing maps 2D coordinates to a 1D string using recursive binary partitioning and bit interleaving. By encoding these bits into a Base-32 string, we leverage B-Tree prefix matching for efficient spatial lookups, bypassing the high CPU costs of Haversine distance calculations at scale.

Calculating Haversine distances for 10,000 moving objects per tick is a disaster for database performance. Standard SQL radius queries aren't built for high-concurrency spatial updates; they are computationally heavy and fail to scale because they require calculating the distance between the query point and every potential candidate in the dataset. To keep latency low, you need to stop thinking in floating-point coordinates and start thinking in B-Tree friendly strings. This is where geohashing comes in, moving the heavy lifting from the CPU to the database index.

How does binary partitioning resolve geographic coordinates?

Binary partitioning recursively divides the map into smaller quadrants, assigning a 0 or 1 based on whether a point falls in the lower/left or upper/right half of the current bounding box. This creates a deterministic bitstream representing a specific geographic area rather than a precise point.

I recently posted a video about geohashing because it is the most efficient way to handle real-time location data. We start by looking at the globe vertically. If a coordinate is in the Northern hemisphere, we assign a 1; if it is in the Southern hemisphere, a 0. We then take that specific hemisphere and split it again. Is the point in the upper or lower half of that new section? This recursive halving continues until we reach the desired resolution. We perform the exact same operation horizontally for longitude (Left/Right splits), resulting in two separate binary streams that describe the point's location with increasing precision.

Why do we interleave bitstreams for 1D spatial indexing?

Interleaving, also known as bit-zipping, alternates bits from the latitude and longitude streams to create a single sequence that preserves 2D proximity within a 1D format. This process follows a Morton order curve (or Z-order curve), which maps multi-dimensional data to one dimension while maintaining the locality of the data points.

In a standard implementation, the first bit of the latitude stream is followed by the first bit of the longitude stream, then the second bit of each, and so on. This ensures that the resulting combined bitstream—and the subsequent geohash string—represents a specific "tile" on the map. Because the bits are interleaved, points that are close together in a 2D space are highly likely to share the same prefix in their 1D bitstream.

Bit Index (i)	Latitude Bit (V_i)	Longitude Bit (H_i)	Interleaved Result (V_0 H_0 ... V_i H_i)
0	1	0	10
1	0	1	1001
2	1	1	100111
3	0	0	10011100
4	1	1	1001110011

How does Base-32 encoding optimize database lookups?

Base-32 encoding converts the final interleaved bitstream into a compact, human-readable string using a specific 32-character alphabet (0-9, b-z) that deliberately excludes ambiguous characters like a, i, l, and o. This representation is highly efficient for database indexing because strings are stored lexicographically in B-Trees, allowing for lightning-fast range scans.

When a taxi app needs to find drivers near you, it does not calculate the distance to every driver in the city. It identifies your geohash—for example, gcpvj0—and queries the database for any driver whose geohash starts with the same prefix. This turns a complex spatial intersection into a simple string comparison. Since the database index is sorted, the system can find all records within the same geographic tile by performing a single index seek followed by a range scan, which is significantly faster than any geometric calculation.

FAQ

What is a Z-order curve and how does it relate to geohashing?
A Z-order curve is a space-filling curve that visits every point in a multi-dimensional grid while preserving the locality of points. Geohashing uses this curve by interleaving bits; the path the bits follow as you increase precision looks like a repeating 'Z' shape. This is what allows us to represent 2D data in a 1D string index without losing spatial context.

Why does the geohash alphabet exclude certain letters?
The standard geohash alphabet (the Crockford-inspired Base-32) excludes the letters 'a', 'i', 'l', and 'o' to prevent human transcription errors and character confusion. This makes the hashes more robust when being passed through URLs, logged in debugging consoles, or manually entered by engineers.

How do you handle the 'boundary problem' where nearby points have different hashes?
Because the Z-order curve occasionally 'jumps' (for example, at the equator or the prime meridian), two points can be centimeters apart but have entirely different geohash prefixes. To mitigate this, production proximity services do not just search the user's current geohash tile; they calculate and query the eight immediate neighboring tiles as well.

Cheers.

Top comments (1)

Andre Cytryn • Mar 28

the boundary problem section is underrated. querying all 8 neighbors feels obvious in retrospect but a lot of implementations I've seen skip it and then wonder why nearby results go missing near cell edges. one thing worth mentioning: at coarser precision levels the neighbor cells get large enough that you end up with false positives and need a post-filter pass with actual distance math anyway. have you found a sweet spot for precision level that avoids both problems?