In the previous part, I talked about indexing mechanisms in DynamoDB. I mentioned that there are two types of indexes: Local Secondary Index (LSI) and Global Secondary Index (GSI). I lied to you a little, because there are actually three types. But wait, how is that possible? Is the AWS documentation wrong?
How does a GSI work?
Before I answer that question, let's first go back to the basics. Let's recall how a GSI works. I covered its mechanism in the first part of this series. In short, this index operates under the hood like a separate table. Data is copied into it only when needed. And it is precisely this mechanism that forms the foundation of what can be called a Sparse Index.
Sparse Index
A sparse index is a type of database index where we don't create an entry for every single record in the table - only for the ones we care about. In DynamoDB, this isn't something that's directly supported. But, using the mechanisms we've already talked about, we can simulate it pretty easily.
Standard GSI Index example
Let's take an example. Imagine we're building an app that processes events from temperature sensors. Say each sensor emits an event every second. Each event has an ID
, the sensor ID
, a createdAt
, a value
, and status
. The status is just a simple enum: OK
if the value is within the expected range, or ALARM
if it's outside.
Now, suppose we want to add a section to our dashboard that shows only the alarms for a given sensor. It's easy, we create a new GSI called SENSOR_STATUS_GSI
, where the partition key is {sensorId}#{status}
and the sort key is {createdAt}
.
This is how it looks in the code. We create separate SENSOR_STATUS_GSI_PK
property.
With this setup, querying events by status becomes super straightforward. When we scan index we can see all the events. In this example I created 10 events for 2 sensors (5 events per sensor)
When we want to fetch all the alarms for the sensor, we only need to provide expected GSI PK
But this isn't really the best solution. After all, what we actually need is to query only the events that are alarms. Now we're storing everything. In a system that might generate millions of events, this would quickly increase the cost of our DynamoDB table.
Instead, we can tweak our GSI a little bit.
Sparse Index example
Let's create new ALARM_GSI
index. This one uses {sensorId}
as PK and {createdAt}
as SK.
I also updated the code with the new GSI PK.
When we query our index by sensor ID, we'll only get corresponding alarms.
The magic really starts to happen when we scan our index.
The difference is immediate. It turns out the index now stores only the alarms, no events with an OK
status in sight. The index already contains only the data you want.
Summary
A Sparse Index in DynamoDB is basically a clever hack. Even though DynamoDB doesn't give us native support for this feature, with a bit of creativity we can simulate it using GSIs.
This solution is very useful for large data sets that need to be searched binarily. For example:
- A job queue where we need to search for jobs with a status of
PENDING
to process - Searching only for available products
The downside of this solution is that future changes must be anticipated. If the business also wants to display recent events with a status of OK
, our sparse index won't be able to handle it. This will require either adding another index or modifying the current index and migrating the data.
Sparse indexes are a neat DynamoDB trick. They're not always the right fit, but in scenarios where you only care about a subset of your data, they can save you money and speed up your queries.
Links
Top comments (0)