Scaling App Data in DynamoDB With Vertical Partitioning

#aws #database #programming #systemdesign

Partitioning means to divide something up in to smaller, organized parts. When discussing application data, partitioning is used to increase the scalability of data storage solutions.

Horizontal and Vertical Partitioning

Typically, partitioning data is discussed in terms of vertical partitioning and horizontal partitioning. Consider a data model for a blog site that has the following structure

{
  "threads": {
    "0000000009": {
      "id": "0000000009",
      "title": "How to Build Software",
      "text": "A transistor is like a light switch...",
      "category": "technology",
      "comments": [
        {
          "timestamp": 1234567890,
          "userId": "rytheturtle",
          "comment": "Good read...."
        }
      ]
    }
  }
}

If our site has many threads, we'll need to figure out how to partition this data so we can scale out our database.

To Horizontally partition our threads table, we would take entire thread records and use some algorithm to divide up the records in the table across some number of partitions. If we have 3 different servers that store threads, each whole thread record is stored on one of the three servers using some algorithm to determine what server a particular thread is stored on.

Vertical Partitioning still has the goal of splitting up data in to smaller parts, but does so by dividing up data by the columns rather than the rows. Take our above threads example. Instead of dividing up whole users across our 3 different servers, the data is divided up amongst the different servers at the column/attribute level. So a single thread might have it's title attribute stored on one server, and the same thread's text attribute might live on a different server.

Vertical Partitioning Data and DynamoDB

Vertical partitioning is a particularly useful concept when modeling application data in DynamoDB. DynamoDB is a fully managed, serverless, key/value, NoSQL database offering from AWS. Behind the scenes, DynamoDB(DDB) automatically distributes your application data using the Partition Key attribute specified when creating a DDB table.

While DDB is nearly infinitely scalable to reliably handle any amount of traffic at high scale and with consistent performance, there are some limitations that application developers have to keep in mind:

DDB is billed on Write Capacity Units (WCU) and Read Capacity Units (RCUs). WCUs and RCUs are billed in 1KB and 4KB size increments respectively, per second. DDB rounds up to the nearest 1KB/4KB respectively.
Attribute sizes are effectively unlimited, Table sizes are effectively unlimited, but the size of a single item in a table is limited to no bigger than 400KB.

The implications of these limitations boil down to "big items are expensive to read and write, and your items have a size limit".

Vertically partitioning your items in DDB unlocks several important benefits:

Size of Data vertically partitioning data means you can store virtually infinite sized data in DDB despite the 400KB limit.
Read and Write costs The application and load only relevant parts of an item for a specific feature, reducing the total read and write costs by not reading and writing unnecessary data.

For example, consider the discussion forum data model from earlier. If we just store the threads in a table as a single item with a partition key of "title" and the value includes everything in the JSON structure, we'll very quickly run in to problems. The Item will grow in size as more comments are added, making reading and writing each post to the database more expensive. Worse, our application will be forced to put hard limits on the number of comments that can be added to a post and the maximum size of a thread's text to ensure it can all fit in a single record.

Both of these issues can be avoided by vertically partitioning the data.

Vertical Partitioning for DDB using Partition and Sort keys

DynamoDB uniquely defines items in a table using a Partition Key and an optional Sort Key. The Partition Key (pk) is the main input to determine where an item will be stored. The Sort Key(sk), if defined, is used to sort the items inside of DDB's storage. Together, the PK and SK make up the unique identifier for a record.

To vertically partition data in DynamoDB, it's best to use a combination of partition key and sort key. Define a partition key that keeps related data colocated, and use the sort key to separate out specific attributes from a logical entity in to separate DynamoDB items.

Taking our blog site data from earlier, we can split the data for a single post in to several DDB items in the same table in the following way

{
  // :: separates partition key from sort key,
  // for display purposes only
  "0000000009::META": {
    "pk": "0000000009",
    "sk": "META",
    "thread_id": "0000000009",
    "title": "How to Build Software",
    "category": "technology",
    "createdAt": 98765432
  },
  "0000000009::TEXT": {
    "pk": "0000000009",
    "sk": "TEXT",
    "value": "A transistor is like a light switch..."
  },
  "0000000009::COMMENT#1234567890": {
    "pk": "0000000009",
    "sk": "COMMENT#1234567890",
    "userId": "rytheturtle",
    "comment": "Good read...."
  }
}

Here, we've used a special naming scheme to partition our thread information in to different records for metadata, text, and comments. With this vertical partitioning,

Despite the 400KB/item limit of DDB, we can add an infinite amount of comments to a single thread and simply load them by querying for keys with the prefix <thread id>#COMMENT.
Adding a new comment involves only the WCUs to insert a new record with the content of that one comment, rather than writing to the entire list of comments for the thread.
The size limits of a comment, metadata, and post are now independent of each other. The size of what is logically considered a single thread can effectively grow infinitely as long as the individual pieces that we divided out with vertical partitioning adhere to the record size limits in DDB.
Using thread ID as a partition key, we can still keep all the related data for a thread co-located, making it more efficient to load multiple aspects of a specific thread in a single batch read call if necessary.