DEV Community

Praveen Muthukumarana
Praveen Muthukumarana

Posted on

How Dynamodb partition keys work under the hood

Primary Key

In a relational database, records are uniquely identified by the Primary Key. No two records can have the same primary key.

This concept is the same in Dynamodb as well. The Primary Key of a table is either the Partition key or a combination of the Partition key and the Sort key.

In the table given below, the student ID is the partition key. It is unique for each of the items.

student_id (Partition Key) name email date_of_birth
cb15ffa1-5e86-4c51-b341 John Doe john.doe@example.com 2002-01-15
cbfe7f85-6b85-4d75-8030 Jane Smith jane.smith@example.com 2000-05-22
2accafed-6a3e-4f10-b16e Michael Johnson michael.j@example.com 2001-08-30
93b98d8b-6de3-452e-s Emily Davis emily.davis@example.com 1999-11-10
93b98d8b-6de3-452e-8 William Brown william.brown@example.com 2003-03-25

In the below example, we store the content of articles in this table. Since the article content is long, it is broken into multiple parts and stored in multiple table items to prevent exceeding the 400kb limit Dynamodb imposes on item size.

ArticleId (Partition Key) Part (Sort Key) Content
101 1

Introduction to DynamoDB with examples and illustrations...

101 2

Setting up your DynamoDB environment with detailed steps...

101 3

Basic operations in DynamoDB such as CRUD...

102 1

Understanding different types of databases with pros and cons...

102 2

Benefits of NoSQL databases over SQL databases...

103 1

Advanced DynamoDB features like global tables...

103 2

Performance optimization tips and tricks in DynamoDB...

A single article includes multiple items, so, the partition key cannot be unique. Therefore, a combination of the partition key and the sort key is used to uniquely identify each item. This is called a composite primary key.

Partitions

Dynamodb is a distributed system. This means that unlike in traditional Relational Database Management System (RDBMS), all the data is not stored in a single server. Instead, they are spread across multiple servers.

In a Dynamodb table, items are grouped into partitions. Think of partitions as boxes within the table into which each item is placed. The partitions in a table are stored physically in different nodes (servers) within the region in which the table was created.

To simplify, you put each item of the table into different partitions (think of them as boxes) and then keep those boxes in different locations.

How Dynamodb stores items

This is where the partition key comes into the picture. When you add a new item to the table, the partition key goes through a hash function. A hash function is an algorithm that takes in a value as an input and then outputs a hash code. In this case, the hash code is a fixed-size integer. When the same value is input into the hash function multiple times, it will always produce the same hash code.

Each partition in a table is associated with a hash code range.

To simplify, let’s say that a table has the below partitions associated with the given ranges.

  • Partition 1 — 100 - 200
  • Partition 2 — 201 - 301
  • Partition 3 — 302 - 402
  • Partition 4 — 403 - 503

We are adding this item to the table

{
  "student_id": "cb15ffa1-5e86-4c51-b341-48d1aed0f84f",
  "name": "John Doe",
  "email": "john.doe@example.com",
  "date_of_birth": "2002-01-15"
}
Enter fullscreen mode Exit fullscreen mode

When the student_id is input to the hash function, the output is 352.

Since this is within 302 - 402, the item will go into Partition 3.

How Dynamodb retrieves items

Unlike relational databases, Dynamodb does not offer too much flexibility on how data is queried.

You can only query data using a Partition Key. For example, if you want to query John Doe from the Students table, you can only do so if you know his student_id. To query by any other attribute, you need to create a Global Secondary Index (discussed later). Or perform a scan operation which is computationally very expensive.

This limitation is because items in the table are stored in different partitions that are located at different nodes. Dynamodb needs to know the partition where the item is stored so that it only has to search within that partition. In contrast, performing a scan operation is going through all the items in the table to find the one you are looking for.

Suppose we are querying for John Doe using his student_id which is the partition key. First, Dynamodb will input the student_id into the hash function. The hash function will output the same hash code that was generated when the item was added to the table. By looking at the hash code, Dynamodb can know the partition in which the item is located in.

Inside the partition, items are stored using a data structure called a B-Tree where the items are sorted by the partition key if it is unique. If we are using a composite primary key and the partition key is not unique, items will be sorted by the sort key. This allows Dynamodb to search the item within the partition very efficiently with a time complexity of O(logn).

Pros and cons of this behaviour

As mentioned earlier, the limitation of Dynamodb is that queries are not as flexible as in relational databases. When you are designing a Dynamodb table, you have to consider all querying patterns that will be used by your application and decide on the Primary Key pattern and indexes to enable them.

But on the upside, for these query patterns that are supported by your table keys, data can be fetched extremely efficiently compared to an RDBMS. This is because, unlike in an RDBMS, when the Dynamodb table grows in size, the underlying infrastructure can be scaled horizontally by spreading the data across more nodes. The query knows the precise location of each item using the partition key. Therefore, there will be no change in performance or latency of the table as it grows in size.

How partition key works in a Local Secondary Index (LSI)

Consider this table.

Department (Partition Key) empId (Sort Key) Name Age Location DOJ
HR 1001 John Doe 30 New York 2020-05-15 21:35:51
IT 1002 Jane Smith 28 San Francisco 2019-08-10 01:20:55
Finance 1003 Mike Brown 35 Chicago 2018-03-22 08:49:11
Marketing 1004 Linda White 32 Seattle 2021-06-30 21:22:48
IT 1005 James Green 29 Austin 2017-11-12 07:14:55
HR 1006 Emily Davis 27 New York 2022-02-05 21:07:59

There is a requirement to query the employees in a specific department who joined after a specific date.

This can be achieved by creating an LSI with DOJ as the sort key.

Department (Partition Key) DOJ (Sort Key) empId Name (Projected Attribute)
IT 2017-11-12 07:14:55 1005 James Green
Finance 2018-03-22 08:49:11 1003 Mike Brown
IT 2019-08-10 01:20:55 1002 Jane Smith
HR 2020-05-15 21:35:51 1001 John Doe
Marketing 2021-06-30 21:22:48 1004 Linda White
HR 2022-02-05 21:07:59 1006 Emily Davis

Note that the LSI has the same partition key as the associated item in the base table. Therefore, LSI items are stored in the same partition as their base table counterpart. LSI is the same items as the base table sorted differently. In this case, they are sorted by DOJ. Now you can use this LSI to query for employees who joined within a specific date range. Note that similar to the base table, the primary key of an LSI should be unique. There cannot be two employees who are from the same department and joined the exact same date and time!

How partition key works in a Global Secondary Index (GSI)

Consider the same table as above.

Department (Partition Key) empId (Sort Key) Name Age Location DOJ
HR 1001 John Doe 30 New York 2020-05-15 21:35:51
IT 1002 Jane Smith 28 San Francisco 2019-08-10 01:20:55
Finance 1003 Mike Brown 35 Chicago 2018-03-22 08:49:11
Marketing 1004 Linda White 32 Seattle 2021-06-30 21:22:48
IT 1005 James Green 29 Austin 2017-11-12 07:14:55
HR 1006 Emily Davis 27 New York 2022-02-05 21:07:59

We have another requirement to query employees by their ID without knowing their department. This query cannot be performed using an LSI since we don’t know the partition key.

So, we create a GSI where the partition key is the empId.

empId (Partition Key) Department Name Age Location DOJ
1001 HR John Doe 30 New York 2020-05-15 21:35:51
1002 IT Jane Smith 28 San Francisco 2019-08-10 01:20:55
1003 Finance Mike Brown 35 Chicago 2018-03-22 08:49:11
1004 Marketing Linda White 32 Seattle 2021-06-30 21:22:48
1005 IT James Green 29 Austin 2017-11-12 07:14:55
1006 HR Emily Davis 27 New York 2022-02-05 21:07:59

Since the partition key of the GSI is different from the base table, the items in the GSI will be stored in a partition different from its corresponding item in the base table.

Think of the GSI as a separate copy of the table (with limited attributes that you select from the base table). Same with the LSI, the primary key should be unique.

You can use the GSI to query employees by their ID.

Recap

  • Primary Key uniquely identifies items in a Dynamodb table. It can be just the partition key if that is unique or a combination of the partition key and sort key if the partition key is not unique.
  • Based on the partition key of a record, Dynamodb puts them into different partitions. Partitions in a table are stored in different nodes (servers) within the region where the table was created.
  • In queries, Dynamodb uses the partition key to know the partition in which the item is stored and hence the exact server where the item resides.
  • Within a partition items are stored in a sorted order of partition key if it is unique or else the sort key. This allows Dynamodb to efficiently search for an item within the partition.
  • LSI is like a copy of the table sorted using a different attribute but items still have the same partition key and are stored in the same partition as the corresponding base table item.
  • GSI is also like a copy of the table but the items have a different partition key and they are stored in partitions different to the corresponding items in the base table.

Top comments (0)