AWS Timestream - an Intro

#serverless #aws #database

Working on serverless IoT platforms with event sourcing (yeah, buzzwords...), you quickly have to solve the issue of data storage. You probably wanna back up your events somewhere, you probably want a bus, but you most definitely want a database for them as well, especially if you're going to do any kind of analytics.

AWS (finally) released a new database to do just that this month: Timestream.

This article is an intro to Timestream. In the next articles, I'll write some more about the details of the system.

Timestream is a time-series database, similar to InfluxDB and Graphite. And, let's face it, Elasticsearch (it's not really a DB, but...).
As such, you can add events/rows easily, but you normally don't edit your data.

Core concepts

In Timestream, data is stored in a table, that is part of a database. Standard so far.
Each record you store has three parts:

Dimensions: the metadata of your event. Which device triggered it, that kind of stuff.
Measure: the actual datapoint you measure. It has a name (measure_name), a value and a pre-defined type.
Time: The time your event occurred at. The key sorting point. Every event has a timestamp. You can have more timestamps as Dimensions of course, but time is similar to a key for your record.
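To make the three parts a bit more tangible, here is a minimal sketch of a single record as a plain Python dict. The key names follow the record shape that boto3's `timestream-write` client takes in `write_records`. The helper `build_record`, the `device_id` dimension and the choice of `DOUBLE` as the type are my own assumptions for illustration, not something Timestream prescribes.

```python
import time


def build_record(device_id, measure_name, value, timestamp_ms=None):
    """Build one Timestream-style record (hypothetical helper).

    Dimensions describe the event, the measure is the datapoint,
    and Time says when the event occurred.
    """
    if timestamp_ms is None:
        # Default to "now" in milliseconds since the epoch.
        timestamp_ms = int(time.time() * 1000)
    return {
        # Dimensions: metadata, here just the (assumed) device id.
        "Dimensions": [{"Name": "device_id", "Value": str(device_id)}],
        # Measure: name, value (sent as a string) and its type.
        "MeasureName": measure_name,
        "MeasureValue": str(value),
        "MeasureValueType": "DOUBLE",
        # Time: the timestamp of the event, also sent as a string.
        "Time": str(timestamp_ms),
        "TimeUnit": "MILLISECONDS",
    }


record = build_record("motor-1", "rpm", 3000, timestamp_ms=1600000000000)
```

Note that values and timestamps travel as strings, and the type is declared separately in `MeasureValueType`.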

Quirks

In typical AWS fashion, Timestream has a couple of quirks.

Data Retention Mgmt

For larger systems, this is pretty awesome: when creating a table, you establish a memory and a magnetic storage window. If the time of an event is older than your memory storage window, it's automatically offloaded to magnetic storage (still accessible, albeit slower). Once it hits the end of the additional magnetic storage window, the data is deleted. Timestream only accepts new events that are in the memory storage window. Common values would be 6-12 months for the memory storage and 2-3 years for the magnetic storage.

NOTE: at least right now, it seems that Timestream has a bug where you have to configure the memory storage to be slightly more than 12 months (as in, 6 hours more) to accept any event older than 30 days.
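As a rough sketch of the retention setup described above: the dict below uses the `RetentionProperties` keys that boto3's `create_table` accepts, while `is_accepted` is a hypothetical helper of mine that only models the "new events must fall inside the memory storage window" rule. The concrete numbers, database name and table name are assumptions, and the helper only looks at how old an event is.

```python
from datetime import datetime, timedelta, timezone

# Retention settings in the shape create_table takes as RetentionProperties.
# The values are example assumptions: roughly 6 months of memory storage
# and 2 years of magnetic storage.
RETENTION = {
    "MemoryStoreRetentionPeriodInHours": 24 * 30 * 6,
    "MagneticStoreRetentionPeriodInDays": 365 * 2,
}


def is_accepted(event_time, now, retention=RETENTION):
    """Return True if the event is still inside the memory storage window."""
    window = timedelta(hours=retention["MemoryStoreRetentionPeriodInHours"])
    return now - event_time <= window


# Creating the table would then look roughly like this (needs AWS credentials;
# "iot" and "events" are made-up names):
# import boto3
# boto3.client("timestream-write").create_table(
#     DatabaseName="iot", TableName="events", RetentionProperties=RETENTION)

now = datetime.now(timezone.utc)
print(is_accepted(now - timedelta(days=30), now))   # inside the window
print(is_accepted(now - timedelta(days=365), now))  # too old for this window
```

Doing this check before writing can save you a round trip for events that would be rejected anyway.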

Measure Cardinality

To be honest, this is still giving me headaches: Timestream only allows A SINGLE MEASURE per record. You can have a high number of "describing" dimensions, but only a single measure. If you have an IoT device that produces an event that has multiple measurements (for example, the fuel consumption of a motor at a certain RPM) this will result in multiple records.
So far I have not seen any negative consequences of that (turns out that you actually very seldom need both values at the same time), but it's a weird way to design your data if you come from either document stores or classical SQL databases.
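Here is a small sketch of what that splitting looks like in practice, assuming an event arrives as a dict of measurement names to values. `split_event`, the `device_id` dimension and the measure names are hypothetical. Only the record keys follow the shape boto3's `write_records` takes.

```python
def split_event(device_id, timestamp_ms, measurements):
    """Turn one multi-measurement event into one record per measure.

    All records share the same dimensions and timestamp, so they can be
    matched up again later when querying.
    """
    dimensions = [{"Name": "device_id", "Value": str(device_id)}]
    return [
        {
            "Dimensions": dimensions,
            "MeasureName": name,
            "MeasureValue": str(value),
            "MeasureValueType": "DOUBLE",  # assumed type for all measures
            "Time": str(timestamp_ms),
            "TimeUnit": "MILLISECONDS",
        }
        for name, value in measurements.items()
    ]


# One motor event with two measurements becomes two records.
records = split_event(
    "motor-1",
    1600000000000,
    {"rpm": 3000, "fuel_consumption": 7.5},
)
for r in records:
    print(r["MeasureName"], r["MeasureValue"])
```

Keeping the dimensions and the timestamp identical across the split records is what lets you line the values up again if you ever do need both at once.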

Why even bother then?

I can't really compare it to the "Platzhirsch" (top dog) InfluxDB, since I haven't used InfluxDB in great detail.
BUT Timestream has a great benefit: it doesn't need a container, an EC2 instance or anything similar. It's a fully managed AWS service, where you only pay for usage.
If you're working on serverless environments, it's worth a look. And that's what I'm doing at the moment :)

In the next article, I'll look into setting it up and getting data into it with Python.

(cover image from https://www.flickr.com/photos/58314390@N08/15937475583)