Recently I wanted to create a small mobile app that would search for mailboxes that are around the user's geolocation. Other apps exist, but their performance wasn't really up to scratch, so I wanted to try and find some solutions!
To start with, I managed to retrieve a dataset from the data.gouv.fr website: https://www.data.gouv.fr/fr/datasets/liste-des-boites-aux-lettres-de-rue-france-metropolitaine-et-dom-avec-heure-limite-de-depot-1/ . This dataset contains 140 000 entries, of course the user who wants to display the mailboxes close to him, will not need to display all the mailboxes available in the database, but only those close to him, for example 10km.
For my project I use a SpringBoot API, but that's not what you'll be interested in today. The real problem is how to retrieve all the data, stored in a PostgreSQL database, that are geolocated near a point in a convenient time!
The goal
My goal is to find a set of points that are within a 10km radius, in record time!
PostGIS to the rescue
PostGIS is an extension of PostgreSQL, it offers new functions and data types that facilitate geolocation searches.
To install PostGIS on your server :
~$ sudo apt install postgis
It must then be activated in PosgresSQL, by connecting to the database and adding the postgis extension. It will then be active only in the selected database.
psql (13.6 (Ubuntu 13.6-0ubuntu0.21.10.1))
Type "help" for help.
postgres=# \c test;
postgres=# CREATE EXTENSION postgis;
We can now create a table that contains an id, latitude, longitude and a GPS point. The SRID 4326 value is used to declare a point that uses the geographic reference system.
CREATE TABLE public.test (
id bigserial NOT NULL,
lat real NOT NULL,
long real NOT NULL,
location geography(Point, 4326) NOT NULL
);
We can then easily add a point using the following query:
INSERT INTO test (lat, long, location)
VALUES (49.548462,1.0779799,ST_SetSRID(ST_MakePoint(49.548462,1.0779799), 4326));
Optimization
To validate my optimisation, I added 140,000 rows of geolocated data to my table.
Stupidly, I wanted to search for all the points that were within a radius of less than 10km with the st_distancesphere function.
Here is the query I run to retrieve all GPS points within a 10km radius:
EXPLAIN(ANALYSE , BUFFERS )
SELECT st_distancesphere(t.location::geometry,ST_SetSRID(ST_MakePoint(49.548462,1.0779799), 4326))
FROM test t
WHERE
st_distancesphere(t.location::geometry,ST_SetSRID(ST_MakePoint(49.548462,1.0779799), 4326)) < 10000;
The PostgeSQL query analysis shows that no index is used and that the query takes 463ms to return 21 points that match the conditions.
Gather (cost=1000.00..2776968.00 rows=47062 width=8) (actual time=160.851..461.391 rows=115 loops=1)
Workers Planned: 1
Workers Launched: 1
Buffers: shared hit=1759
-> Parallel Seq Scan on test t (cost=0.00..2771261.80 rows=27684 width=8) (actual time=218.887..415.046 rows=58 loops=2)
" Filter: (st_distance(geography((location)::geometry), '0101000020E6100000A48CB80034C64840F5EC03DA673FF13F'::geography, false) < '10000'::double precision)"
Rows Removed by Filter: 70535
Buffers: shared hit=1759
Planning Time: 0.376 ms
JIT:
Functions: 8
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 2.485 ms, Inlining 178.841 ms, Optimization 117.797 ms, Emission 56.983 ms, Total 356.106 ms"
Execution Time: 463.431 ms
We are therefore going to create an index that will allow us to speed up the query (see http://postgis.net/workshops/postgis-intro/indexing.html):
CREATE INDEX test_position_geography_index ON test USING GIST(geography(location));
But we are in for a nasty surprise, as the index still doesn't work. After a quick search, I found out that only certain PostGIS functions can use indexing (see the doc here: http://postgis.net/workshops/postgis-intro/indexing.html#spatially-indexed-functions).
So I decided to use the ST_DWithin function which returns true if two points are at the distance given in the 3rd parameter.
Here is the query followed by the analysis:
EXPLAIN(ANALYSE , BUFFERS )
SELECT st_distancesphere(t.location::geometry,ST_SetSRID(ST_MakePoint(49.548462,1.0779799), 4326))
FROM test t
WHERE
ST_DWithin(t.location, ST_SetSRID(ST_MakePoint(49.548462,1.0779799), 4326)::geography, 10000);
Bitmap Heap Scan on test t (cost=4.95..2390.34 rows=14 width=8) (actual time=1.212..2.205 rows=115 loops=1)
" Filter: st_dwithin(location, '0101000020E6100000A48CB80034C64840F5EC03DA673FF13F'::geography, '10000'::double precision, true)"
Rows Removed by Filter: 200
Heap Blocks: exact=150
Buffers: shared hit=188
-> Bitmap Index Scan on test_position_geography_index (cost=0.00..4.95 rows=72 width=0) (actual time=0.288..0.288 rows=315 loops=1)
" Index Cond: (location && _st_expand('0101000020E6100000A48CB80034C64840F5EC03DA673FF13F'::geography, '10000'::double precision))"
Buffers: shared hit=26
Planning Time: 0.446 ms
Execution Time: 2.288 ms
We can see that the test_position_geography_index is used and that it allows to recover the 21 points in 2.288ms 🤯. The gain is huge, it allows to reduce considerably the execution time and the processor usage time.
In my mailbox geolocation API, this allowed me to cut the request time by a factor of 10 and to relieve the server when simulating a server load with the Apache ab performance testing tool (https://httpd.apache.org/docs/2.4/fr/programs/ab.html).
Have fun with PostGIS, it's a great tool ;)
Top comments (0)