tl;dr
AWS Athena is great :)
The task
Back up tens of millions of rows of car location data stored in PostgreSQL/PostGIS, but keep them easily queryable.
Side condition: use as little code as possible.
The input
A PostgreSQL/PostGIS table with more or less the following structure:
id: integer
timestamp: timestamp
location: postgis.geometry(Point, 4326)
speed: double
carid: integer
At the time the task started, the table held tens of millions of rows, bloating the indexes and making queries very slow.
The data is essential but rarely accessed, especially data older than a month.
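For reference, a minimal sketch of what the source table might look like (column names and types as listed above; the schema name postgis for the PostGIS extension and the absence of further indexes are assumptions):

CREATE TABLE locations (
    id          integer PRIMARY KEY,
    "timestamp" timestamp NOT NULL,
    location    postgis.geometry(Point, 4326),
    speed       double precision,
    carid       integer
);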
The postgres query
After some struggling with Athena data types, I finally got the following query to back up the data and created a view:
CREATE VIEW v_locations_athena AS
SELECT
    to_char(timestamp, 'YYYY-MM-DD HH24:MI:ss') AS "timestamp" /* timestamp in a format Athena understands */,
    postgis.st_x(location) AS longitude /* longitude */,
    postgis.st_y(location) AS latitude /* latitude */,
    speed,
    carid
FROM locations;
The export
A simple bash script does the query above, exports the result to CSV, gzips it and transfers it to an AWS S3 bucket:
psql -c "\copy (SELECT * FROM v_locations_athena WHERE timestamp < '2022-08-01') TO '/tmp/locations_2022-08-01.csv' DELIMITER ',' CSV"
gzip /tmp/locations_2022-08-01.csv
aws s3 cp /tmp/locations_2022-08-01.csv.gz s3://bucketname/locations/
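If you want to skip the temporary file, the same export can be streamed straight from psql through gzip into S3. A minimal sketch, assuming a configured AWS CLI and the bucket and file names from above:

psql -c "\copy (SELECT * FROM v_locations_athena WHERE timestamp < '2022-08-01') TO STDOUT DELIMITER ',' CSV" \
  | gzip \
  | aws s3 cp - s3://bucketname/locations/locations_2022-08-01.csv.gz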
The Athena table
In Athena, I created an external table with the same structure as the exported CSV:
CREATE EXTERNAL TABLE IF NOT EXISTS locations_live.locations (
`timestamp` timestamp,
`longitude` double,
`latitude` double,
`speed` float,
`carid` int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://bucketname/locations/'
TBLPROPERTIES ('has_encrypted_data'='false');
Note: use the same delimiter in both the Athena table definition and the CSV export!
The Athena query
Now that everything is in place, we can start querying the data - and since we use Athena, without setting up any kind of server!
Example: query all locations within 100 meters of a specific point within a given time period:
SELECT
  *
FROM locations_live.locations
WHERE timestamp >= cast('2022-01-01' AS timestamp)
  AND timestamp < cast('2022-02-01' AS timestamp)
  AND st_distance(
        to_spherical_geography(st_point(longitude, latitude)),
        to_spherical_geography(st_point(12.3456, 49.12345))
      ) < 100
The Athena result
Tens of millions of rows were scanned in under 30 seconds without any server setup (ignore the speed column, it had the wrong unit :-) ) - and all that for about $0.006.
The cronjob
Now all I need to do is upload a CSV (e.g. every month), delete the old data from the database, and I'm able to query the archived data at a very low price (~$0.0245 per GB per month in the EU). A sketch of such a job is shown below.
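A minimal sketch of such a monthly job - the script path, file names and cut-off logic are placeholders, and a real job should of course only delete rows after verifying the upload:

#!/bin/bash
# Export everything older than the first day of the current month,
# upload it to S3 and then delete it from PostgreSQL.
set -euo pipefail

CUTOFF=$(date +%Y-%m-01)
FILE=/tmp/locations_${CUTOFF}.csv

psql -c "\copy (SELECT * FROM v_locations_athena WHERE timestamp < '${CUTOFF}') TO '${FILE}' DELIMITER ',' CSV"
gzip "${FILE}"
aws s3 cp "${FILE}.gz" s3://bucketname/locations/
psql -c "DELETE FROM locations WHERE timestamp < '${CUTOFF}'"

# crontab entry, e.g. run on the 1st of every month at 03:00:
# 0 3 1 * * /usr/local/bin/backup_locations.sh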
The Good
- Low storage pricing (~$0.0245 per GB per month in the EU)
- No effort or cost to set up a server / query infrastructure
- Low query pricing (~$5 per TB of data scanned) - so the example query above costs only about $0.006
The Bad
- If this were personal customer data (it's not), it would cause GDPR trouble: even if an EU region is selected for S3 and Athena, Amazon is a US company, so it would unfortunately be a no-go in terms of GDPR.
- Query times vary a bit
- Query results are automatically written back to S3 - so pay attention to that (see the lifecycle sketch after this list)!
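One way to keep those result files in check is an S3 lifecycle rule that expires them automatically. A sketch, assuming a dedicated results bucket named athena-query-results-bucket and a 30-day retention period (both placeholders):

aws s3api put-bucket-lifecycle-configuration \
  --bucket athena-query-results-bucket \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "expire-athena-results",
        "Filter": { "Prefix": "" },
        "Status": "Enabled",
        "Expiration": { "Days": 30 }
      }
    ]
  }'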
The learnings
- gzip/compress your data - scanning is faster and pricing is lower (you always pay per amount of data scanned, so less data -> lower price). Additionally, the S3 upload is faster. So it's a win/win/win situation :)
- pay attention to timestamp formats - they bugged me a lot!
- Athena has different SerDe (serialization/deserialization) options - I chose a simple CSV setup, but for future use I will definitely play with one of these: https://docs.aws.amazon.com/athena/latest/ug/serde-about.html (a quick sketch follows below)
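For example, the OpenCSVSerDe handles quoted CSV fields. A sketch of the same table using it - untested, and since OpenCSVSerDe handles column types differently than LazySimpleSerDe (timestamps in particular), the timestamp column is declared as string here:

CREATE EXTERNAL TABLE IF NOT EXISTS locations_live.locations_opencsv (
  `timestamp` string,
  `longitude` double,
  `latitude` double,
  `speed` float,
  `carid` int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
LOCATION 's3://bucketname/locations/'
TBLPROPERTIES ('has_encrypted_data'='false');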
The outlook
- Leverage partitioning (https://docs.aws.amazon.com/athena/latest/ug/partitions.html) by splitting the CSV by year/month/day(?) to speed up queries and reduce costs even more - see the sketch after this list
- I started doing the same for a table that makes heavy use of jsonb fields - interesting new problems there :)
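A sketch of what a partitioned version of the table could look like - the partition columns and the S3 key layout (e.g. s3://bucketname/locations/year=2022/month=08/) are assumptions, not what I run today:

CREATE EXTERNAL TABLE IF NOT EXISTS locations_live.locations_partitioned (
  `timestamp` timestamp,
  `longitude` double,
  `latitude` double,
  `speed` float,
  `carid` int
)
PARTITIONED BY (`year` int, `month` int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://bucketname/locations/'
TBLPROPERTIES ('has_encrypted_data'='false');

-- register each new partition after its upload, e.g.:
ALTER TABLE locations_live.locations_partitioned
  ADD IF NOT EXISTS PARTITION (year = 2022, month = 8)
  LOCATION 's3://bucketname/locations/year=2022/month=08/';

Queries that filter on year and month then only scan the matching objects.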