Amazon DynamoDB is a fully managed NoSQL database designed for high performance and scalability. With the ability to handle massive numbers of requests per second and automatic scaling, DynamoDB is widely used in systems that require reliability and low latency - from mobile applications to IoT systems and microservices.
However, despite its many advantages, DynamoDB has one key limitation - it doesn't forgive mistakes made during the design of your solutions. Unlike relational databases or document databases like MongoDB, here the structure of data and queries must be carefully planned. DynamoDB requires not only knowledge of its indexing mechanisms, but more importantly, a deliberate approach to data modeling and an understanding of your application's query patterns. This is crucial, as some decisions cannot be easily reversed without creating a new table and migrating data.
In this article, I will explain how DynamoDB's indexing mechanism works and show you how to plan your data structure to fully harness the potential of this powerful yet demanding technology.
A bit of theory
Before we dive into the internal mechanisms of DynamoDB, we need to start with the basics. Understanding general database techniques is what will help us achieve maximum performance. One of the fundamental topics in any distributed database is data consistency. In the case of DynamoDB, there are two available read modes: Eventually Consistent Read and Strongly Consistent Read.
What's the difference between them? With Strong Consistency, you are always guaranteed to receive the most up-to-date data. On the other hand, if you choose Eventual Consistency, there's a chance that the returned data may be outdated. Its value will eventually converge to the expected state, but with some delay.
The read mode is selected at the query level. Let's see how this works with an example:
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";

const usersTable = process.env.USERS_TABLE;

// The document client lets us work with plain JavaScript objects.
const initDDBClient = () => DynamoDBDocumentClient.from(new DynamoDBClient({}));

export const getUserById = async (id, consistentRead = true) => {
  const ddbClient = initDDBClient();
  const command = new GetCommand({
    TableName: usersTable,
    Key: {
      id,
    },
    ConsistentRead: consistentRead, // true = strongly consistent, false = eventually consistent
  });
  try {
    const response = await ddbClient.send(command);
    return response.Item;
  } catch (error) {
    console.error("Error getting item:", error);
    throw error;
  }
};
Now we can use this function to fetch our data.
import crypto from "node:crypto";

// saveUser is assumed to be a simple PutCommand wrapper defined elsewhere.
const user = {
  id: crypto.randomUUID(),
  name: "John Doe",
  email: "jdoe@examplemail.com",
};
await saveUser(user);

const [eventualUser, strongUser] = await Promise.all([
  // Read with eventual consistency
  getUserById(user.id, false),
  // Read with strong consistency
  getUserById(user.id, true),
]);

console.log("Read with eventual consistency:", eventualUser);
console.log("Read with strong consistency:", strongUser);
At this point, the read with strong consistency will always return the freshly written data, while the read with eventual consistency may return an empty result.
Now you might be thinking, 'So the solution to all problems is to always choose Strong Consistency by default?'. Not quite. Unfortunately, such reads consume twice as many Read Capacity Units (RCUs): a strongly consistent read of up to 4 KB consumes one full RCU, while an eventually consistent read consumes only half an RCU. What does that mean in practice? If you're using the On-Demand mode, you'll pay twice as much for your reads. If you're using the Provisioned mode, you'll hit your table's RCU limit much sooner.
So, if your application can tolerate slightly outdated data, it's worth considering Eventual Consistency to optimize costs.
Another important point is that not all queries can support Strong Consistency. But more on that in a moment :)
What can a record consist of?
First and foremost: the primary key. This is what uniquely identifies each entity in our table. It can consist of two components:
- Partition Key (always required) - Its value determines which partition a given record will be placed in. This works through a hashing function that distributes entities across physical nodes. As a result, the load is evenly balanced and data access is faster. Besides its size limit (10 GB), each partition also has a maximum throughput of 3,000 Read Capacity Units and 1,000 Write Capacity Units. If your table grows beyond that, DynamoDB automatically adds more partitions. However, if too much data is stored under a single partition key, it cannot be split. That's why it's crucial for the partition key (PK) to be as granular as possible. Otherwise, we might end up with all records landing in a single partition - a situation known as a hot partition, which significantly slows down queries. Let's visualize it: each user is identified by an ID, the ID is hashed using an internal algorithm, and the hash determines which partition the item will be stored in (see the conceptual sketch after this list).
- Sort Key (optional) - Adds powerful capabilities to our table:
- Enables ordering of data within a single partition, so we can retrieve records in a specific sequence - either from the beginning or the end of the table.
- Allows us to associate multiple records with a single partition key. A good example would be documents with multiple versions. Caution⚠️: If we associate too many items with a single partition key, it can still result in a hot partition.
- It also enables more advanced querying capabilities. With only a partition key, we can retrieve data by exact match. But once we introduce a sort key, we can perform range queries, search by prefix, and more.
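To make the hashing idea concrete, here is a purely conceptual sketch - DynamoDB's real hash function is internal and not exposed - of how a deterministic hash can map a partition key to one of N partitions:

import { createHash } from "node:crypto";

// Conceptual illustration only - this is NOT DynamoDB's actual algorithm.
// It just shows that the same key always lands in the same partition.
const partitionFor = (partitionKey, partitionCount) => {
  const digest = createHash("md5").update(partitionKey).digest();
  // Interpret the first 4 bytes of the digest as an unsigned integer.
  return digest.readUInt32BE(0) % partitionCount;
};

console.log(partitionFor("user-123", 4)); // always the same partition for "user-123"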
Let's see how it looks in practice. A common use case for sort keys is storing versioned data. For example, imagine a system that manages files. Each file has a unique identifier which we use as the partition key. Files can have multiple versions, so we use the version number as the sort key.
This way, all versions of a given file are stored under the same partition, and the sort key allows us to query them efficiently - whether we want to fetch all versions or only a specific one.
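As a minimal sketch, such a table could be created like this with the AWS SDK v3 (the table name documents is an assumption made for this example):

import { DynamoDBClient, CreateTableCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

await client.send(new CreateTableCommand({
  TableName: "documents",
  AttributeDefinitions: [
    { AttributeName: "id", AttributeType: "S" },      // file ID
    { AttributeName: "version", AttributeType: "N" }, // version number
  ],
  KeySchema: [
    { AttributeName: "id", KeyType: "HASH" },       // partition key
    { AttributeName: "version", KeyType: "RANGE" }, // sort key
  ],
  BillingMode: "PAY_PER_REQUEST", // On-Demand mode
}));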
If you use only the partition key in your query, DynamoDB will return all items that share that partition key (e.g. all versions of a specific file).
Query document by PK = "abc123"
[
  { id: "abc123", version: 1, ... },
  { id: "abc123", version: 2, ... },
  { id: "abc123", version: 3, ... },
]
If you provide both the partition key and the sort key (PK + SK), DynamoDB will return exactly one item - the one that matches both values. This is ideal when you want to fetch a specific version of a file.
Query document by PK = "abc123" and SK = 1
[
  { id: "abc123", version: 1, ... },
]
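Both access patterns map directly to SDK calls. A sketch, assuming the documents table defined above:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand, GetCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// All versions of a file (partition key only).
export const getFileVersions = async (id) => {
  const response = await ddb.send(new QueryCommand({
    TableName: "documents",
    KeyConditionExpression: "id = :id",
    ExpressionAttributeValues: { ":id": id },
  }));
  return response.Items;
};

// One specific version (partition key + sort key).
export const getFileVersion = async (id, version) => {
  const response = await ddb.send(new GetCommand({
    TableName: "documents",
    Key: { id, version },
  }));
  return response.Item;
};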
Types of Operations
Now that we understand what our data might look like, let’s take a look at the different ways we can retrieve it:
- GetItem
- BatchGetItem
- Query
- Scan
- ParallelScan
- TransactGetItems
Whoa - that's quite a few options just for reading data! So, which one is the best?
Well, it depends on what you're trying to retrieve and how your table is structured. Let's briefly go over each one:
- GetItem - The most basic read operation. It retrieves a single item based on the partition key. Note: If your table has a sort key defined, you must provide it as well.
- BatchGetItem - Retrieves multiple items in a single batch (up to 100 items or 16MB per request). This is much more efficient than fetching them one at a time.
- Query - Offers more advanced search capabilities (e.g. sorting, filtering, range queries). It can also be used to query secondary indexes, which I'll cover shortly.
- Scan - Reads every item in the table. Because of its brute-force nature, this operation is slow and expensive. You can apply filters to reduce the number of returned results - but keep in mind that RCU (Read Capacity Units) usage remains the same. Filtering only reduces the size of the returned payload.
- ParallelScan - Works like Scan, but splits the work across multiple parallel workers. It's faster, but can quickly consume your table's available RCU.
- TransactGetItems - Reads up to 100 items across one or more tables, with all-or-nothing guarantees. If one item can't be retrieved, none will be returned.
As you can see, the most efficient and preferred options for retrieving data are GetItem, BatchGetItem, and Query.
Scan operations should be your last resort - only use them when none of the other methods fit your use case.
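As an example of the batch variant, here is a sketch of fetching several users at once with the document client (usersTable is assumed to be defined as before):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, BatchGetCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const usersTable = process.env.USERS_TABLE;

export const getUsersByIds = async (ids) => {
  const response = await ddb.send(new BatchGetCommand({
    RequestItems: {
      [usersTable]: {
        Keys: ids.map((id) => ({ id })), // up to 100 keys per request
      },
    },
  }));
  // Keys DynamoDB couldn't process this time come back in
  // response.UnprocessedKeys and should be retried in production code.
  return response.Responses[usersTable];
};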
Indexes
A primary key alone often isn't enough - we frequently need more flexibility when querying our data than just using an ID.
That's where indexes come in. Just like in other database engines, indexes in DynamoDB are used to optimize and speed up queries on your tables.
DynamoDB supports two types of indexes: the Local Secondary Index (LSI) and the Global Secondary Index (GSI). Let's look at each in turn.
Local Secondary Index (LSI)
An LSI provides an alternative sort key while still using the same partition key. Since it's stored within the same partition as the base table, it supports both eventual and strongly consistent reads. However, LSIs come with several limitations that are important to consider during the design phase:
- You can define up to 5 LSIs per table, which can limit your query flexibility.
- LSIs share the table's provisioned throughput and storage, so heavy usage of the index may impact performance of other operations.
- They must be created at the time the table is created - you can't add LSIs to an existing table.
- You'll incur additional costs for storing data in the index.
These constraints mean that you need a clear understanding of your query patterns before designing your schema.
Earlier, we discussed a table that stores versioned file metadata, where the partition key is the file ID and the sort key is the version number. While this model is effective for accessing different versions of a file, you might encounter situations where you want to query versions of a file by their name - for example, to find all versions with a specific name pattern or to trace the history of a renamed file.
This is where a Local Secondary Index (LSI) becomes useful. By defining an LSI on the name attribute, you can query the same set of items (same partition key) but sort and filter them based on their name instead of the version number.
Query LSI "name" index by PK = "abc123" sorted in reverse order
[
  { id: "abc123", version: 2, name: "myfile.txt", ... },
  { id: "abc123", version: 3, name: "final.txt", ... },
  { id: "abc123", version: 1, name: "draft.txt", ... },
]
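Since an LSI must be declared when the table is created (via the LocalSecondaryIndexes parameter of CreateTableCommand), querying it later only requires passing an IndexName. A sketch, assuming the index was named name-index:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

const response = await ddb.send(new QueryCommand({
  TableName: "documents",
  IndexName: "name-index", // hypothetical LSI on the `name` attribute
  KeyConditionExpression: "id = :id",
  ExpressionAttributeValues: { ":id": "abc123" },
  ScanIndexForward: false, // sort by the index sort key (name) in reverse order
  // ConsistentRead: true is allowed here - LSIs support strong consistency.
}));
console.log(response.Items);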
Global Secondary Index (GSI)
A GSI allows you to define a completely different partition key and sort key, giving you far greater flexibility in querying your data. DynamoDB supports up to 20 GSIs per table, making them a powerful tool for advanced query requirements.
Since GSIs are stored in a separate partition space, they:
- Have independent read/write capacity limits
- Can be added after the table is created
- Support only eventual consistency for reads
As with LSIs, storing data in a GSI incurs additional storage and throughput costs. To control these costs, you can configure attribute projections, which determine what data is copied to the index:
- KEYS_ONLY - Only the primary key and index keys are projected
- INCLUDE - Keys and selected non-key attributes
- ALL - All attributes from the base table
Let's say we have a table that stores user data. Each user has a unique id, which we use as the primary partition key. However, we also want to be able to query all users who belong to a specific team - something that isn't possible with the base table's key structure alone.
To support this access pattern, we can define a GSI with teamId as the partition key. What's important to understand is that a GSI behaves like a separate table under the hood. It has its own partition key and (optionally) sort key, independent from the base table. In the case of our users table, this means that teamId becomes the partition key of the index, allowing us to efficiently query all users assigned to a specific team.
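As a sketch, here is how such an index could be added to an existing table and then queried. The index name TEAM_ID is reused in the testing section below; the creation parameters themselves are illustrative and assume On-Demand billing (a provisioned table would also need ProvisionedThroughput for the index):

import { DynamoDBClient, UpdateTableCommand } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const client = new DynamoDBClient({});
const ddb = DynamoDBDocumentClient.from(client);
const usersTable = process.env.USERS_TABLE;

// 1. Add the GSI to the existing table.
await client.send(new UpdateTableCommand({
  TableName: usersTable,
  AttributeDefinitions: [{ AttributeName: "teamId", AttributeType: "S" }],
  GlobalSecondaryIndexUpdates: [{
    Create: {
      IndexName: "TEAM_ID",
      KeySchema: [{ AttributeName: "teamId", KeyType: "HASH" }],
      Projection: { ProjectionType: "ALL" }, // KEYS_ONLY or INCLUDE would reduce storage costs
    },
  }],
}));

// 2. Query the index. Note that ConsistentRead is not allowed on a GSI -
//    reads are always eventually consistent.
export const getUsersByTeam = async (teamId) => {
  const response = await ddb.send(new QueryCommand({
    TableName: usersTable,
    IndexName: "TEAM_ID",
    KeyConditionExpression: "teamId = :teamId",
    ExpressionAttributeValues: { ":teamId": teamId },
  }));
  return response.Items;
};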
Examples
Testing Eventually Consistent Reads
A common challenge is properly testing code that uses Eventually Consistent Reads. What does a typical integration test verifying database reads look like?
- Insert test data
- Read the data
- Validate its correctness
This ensures each test is independent and can be run in isolation. Unfortunately, as mentioned earlier, Eventually Consistent data can have a delay, which may result in inconsistency during read operations. This, in turn, can lead to failed assertions.
Let's return to the TEAM_ID GSI on our users table.
The first idea might be to wait and poll periodically until the data is updated.
import { setTimeout } from "node:timers/promises"; // promise-based, awaitable setTimeout

// ...other tests

it('should read users by team', async () => {
  // Arrange
  await saveUser(user);

  // Act
  const attempts = 15;
  let usersByTeam = [];
  for (let i = 0; i < attempts; i++) {
    usersByTeam = await getUsersByTeam(teamId);
    if (usersByTeam.length > 0) {
      break;
    }
    await setTimeout(300); // Wait for eventual consistency
  }

  // Assert
  expect(usersByTeam.length).toBe(1);
});

// ...other tests
However, this results in significantly longer test execution times.
A better solution might be to insert the data before all tests and run the eventual consistency tests at the very end. Of course, we still need to wait to be 100% sure that the data has eventually become consistent. However, part of that waiting time will be offset by other tests executing in the meantime.
beforeAll(async () => {
  // Init GSI data before all tests
  await saveUser(user);
});

// ...other tests

// GSI tests below
it('should read users by team', async () => {
  // Act
  const attempts = 15;
  let usersByTeam = [];
  for (let i = 0; i < attempts; i++) {
    usersByTeam = await getUsersByTeam(teamId);
    if (usersByTeam.length > 0) {
      break;
    }
    await setTimeout(300); // Wait for eventual consistency
  }

  // Assert
  expect(usersByTeam.length).toBe(1);
});
Prefixing
Prefixing is a common technique used in DynamoDB to structure sort keys in a way that enables more flexible and efficient query patterns.
Instead of using a raw value as a sort key (e.g. just a timestamp), you prefix the sort key with a constant label or category to create a composite value. This allows you to distinguish between different types of data stored under the same partition key and query them accordingly.
Let's consider the following structure:
- A company has multiple teams
- Each team has multiple users
- Each user can upload multiple documents
We can store all of this in a single DynamoDB table using the following pattern:
- PK - company ID
- SK - TEAM#{teamId}#USER#{userId}#DOCUMENT#{documentId}
Why does this work?
- You can query all teams of a company: begins_with(SK, 'TEAM#')
- You can query all users in a team: begins_with(SK, 'TEAM#teamA#USER#')
- You can query all documents for a user: begins_with(SK, 'TEAM#teamA#USER#user1#DOCUMENT#')
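For illustration, the last of those queries could look like this (the table name companyData is an assumption; PK and SK are the attribute names from the pattern above):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// All documents uploaded by user1 in teamA.
const response = await ddb.send(new QueryCommand({
  TableName: "companyData",
  KeyConditionExpression: "PK = :pk AND begins_with(SK, :prefix)",
  ExpressionAttributeValues: {
    ":pk": "company1",
    ":prefix": "TEAM#teamA#USER#user1#DOCUMENT#",
  },
}));
console.log(response.Items);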
However, it's important to note that parsing the composite sort key is our responsibility. Since DynamoDB stores the sort key as a simple string, it's up to the application logic to split and interpret its components (e.g. extracting teamId, userId, or documentId from a key like TEAM#teamA#USER#user1#DOCUMENT#doc1).
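One possible helper for that parsing step (a sketch with minimal validation):

// Splits a key like "TEAM#teamA#USER#user1#DOCUMENT#doc1" into its components.
const parseSortKey = (sk) => {
  const parts = sk.split("#");
  // Even positions hold labels (TEAM, USER, DOCUMENT), odd positions hold IDs.
  return {
    teamId: parts[1],
    userId: parts[3],
    documentId: parts[5],
  };
};

console.log(parseSortKey("TEAM#teamA#USER#user1#DOCUMENT#doc1"));
// { teamId: 'teamA', userId: 'user1', documentId: 'doc1' }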
This design also introduces a potential risk of hot partitions, since all nested data (teams, users, documents) lives under a single partition key (the company ID). Let's see this in the next example.
Hot Partition
We're building software for factories that collects sensor events. Since the system is multi-tenant, we decided to model multi-tenancy directly at the DynamoDB level. Specifically, the partition key is the company ID, and the sort key is a composite value in the format {sensorId}#{eventId}. This design allows us to efficiently query for all events from a specific sensor or retrieve a specific event by ID.
However, we've encountered a scaling issue: one of our clients has thousands of active sensors, each sending an event every second. Since all this data shares the same partition key (their company ID), DynamoDB directs all writes to a single partition, which quickly becomes a hot partition. This causes problems such as throttled writes due to exceeding partition throughput limits and poor horizontal scalability.
This example highlights how seemingly clean and logical data models can run into physical limitations in high-throughput environments, especially when many high-frequency entities (like IoT sensors) share the same partition key.
There are several techniques to mitigate this problem, but they are more advanced strategies that deserve a dedicated article :)
Summary
As we've seen through the examples, DynamoDB offers immense power, but only when used with a clear understanding of how it works under the hood.
One of the most critical aspects of working with DynamoDB is data modeling, and at the heart of that lies indexing. Choosing the right partition key, designing efficient sort keys, and leveraging Global or Local Secondary Indexes is the key to an efficient database.
DynamoDB can be incredibly fast, scalable, and reliable - but only if you know exactly what you're doing. That's why investing time upfront in understanding your access patterns and indexing strategy is not just helpful - it's mandatory.
Getting indexing right is a foundational step - but truly mastering DynamoDB requires diving much deeper. We'll explore those advanced techniques in the next parts of this guide. Stay tuned 🎉.