tl;dr
AWS Athena is great :)
The task
Back up tens of millions of rows of car location data stored in PostgreSQL/PostGIS, but keep them easily queryable.
Side condition: use as little code as possible.
The input
A PostgreSQL/PostGIS table with more or less the following structure:
id: integer
timestamp: timestamp
location: postgis.geometry(Point, 4326)
speed: double
carid: integer
At the time the task started, the table held tens of millions of rows, bloating the indexes and making queries very slow.
The data is essential but rarely accessed, especially data older than a month.
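For reference, a minimal sketch of what the source table might look like (column names and types as listed above; the schema name postgis for the PostGIS extension and the absence of further indexes are assumptions):

CREATE TABLE locations (
    id          integer PRIMARY KEY,
    "timestamp" timestamp NOT NULL,
    location    postgis.geometry(Point, 4326),
    speed       double precision,
    carid       integer
);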
The postgres query
After some struggling with Athena data types, I finally got the following query to back up the data and created a view:
CREATE VIEW v_locations_athena AS
SELECT
    to_char(timestamp, 'YYYY-MM-DD HH24:MI:ss') AS "timestamp" /* timestamp in a format Athena understands */,
    postgis.st_x(location) AS longitude /* longitude */,
    postgis.st_y(location) AS latitude /* latitude */,
    speed,
    carid
FROM locations;
The export
A simple bash script does the query above, exports the result to CSV, gzips it and transfers it to an AWS S3 bucket:
psql -c "\copy (SELECT * FROM v_locations_athena WHERE timestamp < '2022-08-01') TO '/tmp/locations_2022-08-01.csv' DELIMITER ',' CSV"
gzip /tmp/locations_2022-08-01.csv
aws s3 cp /tmp/locations_2022-08-01.csv.gz s3://bucketname/locations/
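If you want to skip the temporary file, the same export can be streamed straight from psql through gzip into S3. A minimal sketch, assuming a configured AWS CLI and the bucket and file names from above:

psql -c "\copy (SELECT * FROM v_locations_athena WHERE timestamp < '2022-08-01') TO STDOUT DELIMITER ',' CSV" \
  | gzip \
  | aws s3 cp - s3://bucketname/locations/locations_2022-08-01.csv.gz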
The Athena table
In Athena, I created an external table with the same structure as the exported CSV:
CREATE EXTERNAL TABLE IF NOT EXISTS locations_live.locations (
`timestamp` timestamp,
`longitude` double,
`latitude` double,
`speed` float,
`carid` int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://bucketname/locations/'
TBLPROPERTIES ('has_encrypted_data'='false');
Note: use the same delimiter in both the Athena table definition and the CSV export!
The Athena query
Now that everything is in place, we can start querying the data - and since we use Athena, without setting up any kind of server!
Example: query all locations within 100 meters of a specific point within a given time period:
SELECT
  *
FROM locations_live.locations
WHERE timestamp >= cast('2022-01-01' AS timestamp)
  AND timestamp < cast('2022-02-01' AS timestamp)
  AND st_distance(
        to_spherical_geography(st_point(longitude, latitude)),
        to_spherical_geography(st_point(12.3456, 49.12345))
      ) < 100
The Athena result
Tens of millions of rows were scanned in under 30 seconds without any server setup (ignore the speed column, it had the wrong unit :-) ) - and all that for about $0.006.
The cronjob
Now all I need to do is upload a CSV (e.g. every month), delete the old data from the database, and I'm able to query the archived data at a very low price (~$0.0245 per GB per month in the EU). A sketch of such a job is shown below.
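A minimal sketch of such a monthly job - the script path, file names and cut-off logic are placeholders, and a real job should of course only delete rows after verifying the upload:

#!/bin/bash
# Export everything older than the first day of the current month,
# upload it to S3 and then delete it from PostgreSQL.
set -euo pipefail

CUTOFF=$(date +%Y-%m-01)
FILE=/tmp/locations_${CUTOFF}.csv

psql -c "\copy (SELECT * FROM v_locations_athena WHERE timestamp < '${CUTOFF}') TO '${FILE}' DELIMITER ',' CSV"
gzip "${FILE}"
aws s3 cp "${FILE}.gz" s3://bucketname/locations/
psql -c "DELETE FROM locations WHERE timestamp < '${CUTOFF}'"

# crontab entry, e.g. run on the 1st of every month at 03:00:
# 0 3 1 * * /usr/local/bin/backup_locations.sh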
The Good
- Low storage pricing (~$0.0245 per GB per month in the EU)
- No effort or cost to set up a server / query infrastructure
- Low query pricing (~$5 per TB of data scanned) - so the example query above costs only about $0.006
The Bad
- If this were personal customer data (it's not), it would cause GDPR trouble: even if an EU region is selected for S3 and Athena, Amazon is a US company, so it would unfortunately be a no-go in terms of GDPR.
- Query times vary a bit
- Query results are automatically written back to S3 - so pay attention to that (see the lifecycle sketch after this list)!
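One way to keep those result files in check is an S3 lifecycle rule that expires them automatically. A sketch, assuming a dedicated results bucket named athena-query-results-bucket and a 30-day retention period (both placeholders):

aws s3api put-bucket-lifecycle-configuration \
  --bucket athena-query-results-bucket \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "expire-athena-results",
        "Filter": { "Prefix": "" },
        "Status": "Enabled",
        "Expiration": { "Days": 30 }
      }
    ]
  }'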
The learnings
- gzip/compress your data - scanning is faster and pricing is lower (you always pay per amount of data scanned, so less data -> lower price). Additionally, the S3 upload is faster. So it's a win/win/win situation :)
- pay attention to timestamp formats - they bugged me a lot!
- Athena has different SerDe (serialization/deserialization) options - I chose a simple CSV setup, but for future use I will definitely play with one of these: https://docs.aws.amazon.com/athena/latest/ug/serde-about.html (a quick sketch follows below)
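For example, the OpenCSVSerDe handles quoted CSV fields. A sketch of the same table using it - untested, and since OpenCSVSerDe handles column types differently than LazySimpleSerDe (timestamps in particular), the timestamp column is declared as string here:

CREATE EXTERNAL TABLE IF NOT EXISTS locations_live.locations_opencsv (
  `timestamp` string,
  `longitude` double,
  `latitude` double,
  `speed` float,
  `carid` int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
LOCATION 's3://bucketname/locations/'
TBLPROPERTIES ('has_encrypted_data'='false');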
The outlook
- Leverage partitioning (https://docs.aws.amazon.com/athena/latest/ug/partitions.html) by splitting the CSV by year/month/day(?) to speed up queries and reduce costs even more - see the sketch after this list
- I started doing the same for a table that makes heavy use of jsonb fields - interesting new problems there :)
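A sketch of what a partitioned version of the table could look like - the partition columns and the S3 key layout (e.g. s3://bucketname/locations/year=2022/month=08/) are assumptions, not what I run today:

CREATE EXTERNAL TABLE IF NOT EXISTS locations_live.locations_partitioned (
  `timestamp` timestamp,
  `longitude` double,
  `latitude` double,
  `speed` float,
  `carid` int
)
PARTITIONED BY (`year` int, `month` int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://bucketname/locations/'
TBLPROPERTIES ('has_encrypted_data'='false');

-- register each new partition after its upload, e.g.:
ALTER TABLE locations_live.locations_partitioned
  ADD IF NOT EXISTS PARTITION (year = 2022, month = 8)
  LOCATION 's3://bucketname/locations/year=2022/month=08/';

Queries that filter on year and month then only scan the matching objects.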