Amazon's DynamoDB is promoted as a zero-maintenance NoSQL database with virtually unlimited throughput and scale. Because of its very low administrative overhead and serverless billing model, it has long been my data store of choice when developing cloud applications and services.
When working with any tool, it is important to understand its limits so you can build a mental model for planning and implementing your usage of that tool effectively. Because DynamoDB bills itself as offering "unlimited throughput and scale," it is tempting to assume you never need to think about read and write throughput limits. However, as we will explore in this post, there are well-defined limits that we need to take into account when implementing solutions with DynamoDB.
Partition limits
Amazon documents these partition-level limits: 3,000 RCU/s and 1,000 WCU/s. Let's verify these limits and put them to the test. We want to understand how these limits apply to an UpdateItem command, and whether they are any different for direct PutItem commands.
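As a quick refresher on how write capacity is counted: one standard write request unit covers an item of up to 1 KB, larger items consume one WCU per additional 1 KB (rounded up), and transactional writes cost double. The small helper below is just an illustrative sketch of that arithmetic (the function name is mine, not from the repo); we will lean on it later when reasoning about theoretical ceilings.

// WCU cost of a single write, per the DynamoDB capacity documentation:
// one WCU per 1 KB (rounded up), doubled for transactional writes.
function wcusForWrite(itemSizeBytes: number, transactional = false): number {
  const base = Math.ceil(itemSizeBytes / 1024);
  return transactional ? base * 2 : base;
}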
Test setup
CDK code can be found here: https://github.com/rogerchi/ddb-writes-model
We create Items of a predefined size by filling them with random binary data. We then set up Lambda functions that spam our DynamoDB Item with either 10,000 atomic updates or 10,000 full puts. We also set up a DynamoDB Streams handler (sketched below) to capture the times of the first and last updates, which gives us a good idea of how long the 10,000 operations take to complete.
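The repository has its own stream handler; the following is only a minimal sketch of the idea, assuming a standard DynamoDB Streams event source mapping and using the ApproximateCreationDateTime on each record to track the first and last write seen for an item.

import { DynamoDBStreamHandler } from 'aws-lambda';

// Track the earliest and latest write times seen per item key (in memory,
// per Lambda container - good enough for a rough timing measurement).
const firstSeen: Record<string, number> = {};
const lastSeen: Record<string, number> = {};

export const handler: DynamoDBStreamHandler = async (event) => {
  for (const record of event.Records) {
    const pk = record.dynamodb?.Keys?.pk?.S;
    const ts = record.dynamodb?.ApproximateCreationDateTime; // epoch seconds
    if (!pk || ts === undefined) continue;

    if (firstSeen[pk] === undefined || ts < firstSeen[pk]) firstSeen[pk] = ts;
    if (lastSeen[pk] === undefined || ts > lastSeen[pk]) lastSeen[pk] = ts;

    console.log(`${pk}: elapsed ~${lastSeen[pk] - firstSeen[pk]}s so far`);
  }
};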
UpdateItem code
We first initialize an Item with a count of 0 and then run the Lambda function below against it.
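The repository's setup code handles that initialization; as a rough sketch (assuming the same pk/sk key schema and a client configured like the one in the update code), it could be as simple as:

import { DynamoDBClient, PutItemCommand } from '@aws-sdk/client-dynamodb';

const client = new DynamoDBClient({});
const TABLE_NAME = process.env.TABLE_NAME!; // environment variable name assumed

// Seed the counter item so the arithmetic in the UpdateExpression has a value to add to.
async function initItem(id: string) {
  await client.send(
    new PutItemCommand({
      TableName: TABLE_NAME,
      Item: {
        pk: { S: id },
        sk: { S: 'ITEM' },
        count: { N: '0' },
      },
    }),
  );
}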
import { DynamoDBClient, UpdateItemCommand } from '@aws-sdk/client-dynamodb';

const client = new DynamoDBClient({});
const TABLE_NAME = process.env.TABLE_NAME!; // table name provided via the Lambda environment

async function updateItem(id: string) {
  const params = {
    TableName: TABLE_NAME,
    Key: {
      pk: { S: id },
      sk: { S: 'ITEM' },
    },
    // Atomically increment the counter attribute in place.
    UpdateExpression: 'set #count = #count + :one',
    ExpressionAttributeNames: {
      '#count': 'count',
    },
    ExpressionAttributeValues: {
      ':one': { N: '1' },
    },
  };

  let attempt = 0;
  // Retry forever: when the partition throttles us we simply try again,
  // so every one of the 10,000 updates eventually lands.
  while (true) {
    try {
      const command = new UpdateItemCommand(params);
      await client.send(command);
      return;
    } catch (error) {
      attempt++;
      console.error(`Update failed for item ${id}. Attempt ${attempt}.`);
    }
  }
}

interface Event {
  id: string;
}

exports.handler = async (event: Event) => {
  const id = event.id;
  if (!id) {
    throw new Error('Event does not contain an ID.');
  }

  const totalUpdates = 10000;
  const commands = [];
  // Fire all 10,000 updates concurrently and wait for them to settle.
  for (let i = 0; i < totalUpdates; i++) {
    commands.push(updateItem(id));
  }
  await Promise.all(commands);
  console.log('All updates complete.');
};
UpdateItem results (10,000 updates)
Item Size (bytes) | Total time (seconds) | Op/s | WCU/s |
---|---|---|---|
1000 | 10 | 1000 | 1000 |
2000 | 20 | 500 | 1000 |
4000 | 40 | 250 | 1000 |
As we can see, the results match the write throughput described in the documentation. Does the throughput change if we use the PutItem command instead of the UpdateItem command, given that the underlying service doesn't need to retrieve the existing item in order to overwrite it?
PutItem code
import { DynamoDBClient, PutItemCommand } from '@aws-sdk/client-dynamodb';
import { marshall } from '@aws-sdk/util-dynamodb';
// generateData is a helper from the repo (import path assumed) that returns `size` bytes of random binary data.
import { generateData } from './generateData';

const client = new DynamoDBClient({});
const TABLE_NAME = process.env.TABLE_NAME!; // table name provided via the Lambda environment

async function putItem(id: string, itemBody: any, count: number) {
  // Overwrite the whole item each time, bumping the count attribute.
  const updatedItem = { ...itemBody, count: { N: `${count}` } };
  const params = {
    TableName: TABLE_NAME,
    Item: updatedItem,
  };

  let attempt = 0;
  // Retry forever so that throttled puts are eventually written.
  while (true) {
    try {
      const command = new PutItemCommand(params);
      await client.send(command);
      return;
    } catch (error) {
      attempt++;
      console.error(`Put failed for item ${id}. Attempt ${attempt}.`, error);
    }
  }
}

interface Event {
  id: string;
  size: number;
}

exports.handler = async (event: Event) => {
  const { id, size } = event;
  const data = generateData(id, size);
  const item = marshall({
    pk: id,
    sk: 'PUTS',
    size,
    data: Uint8Array.from(data),
  });

  const totalUpdates = 10000;
  const commands = [];
  // Fire all 10,000 puts concurrently and wait for them to settle.
  for (let i = 0; i < totalUpdates; i++) {
    commands.push(putItem(id, item, i));
  }
  await Promise.all(commands);
  console.log('All updates complete.');
};
PutItem results
Item Size (bytes) | Total time (seconds) | Op/s | WCU/s |
---|---|---|---|
1000 | 10 | 1000 | 1000 |
2000 | 20 | 500 | 1000 |
4000 | 40 | 250 | 1000 |
Our PutItem results are identical to our UpdateItem results, supporting the idea that the maximum write throughput for a single Item is 1,000 WCU/s.
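Putting that together with the WCU helper sketched earlier, the theoretical per-item ceiling for standard writes works out as follows. This is back-of-the-envelope arithmetic based on the observed 1,000 WCU/s per-item limit, not an official formula.

// Theoretical per-item ceiling for standard (non-transactional) writes,
// assuming the 1,000 WCU/s per-item limit observed above.
function maxWritesPerSecond(itemSizeBytes: number): number {
  return Math.floor(1000 / wcusForWrite(itemSizeBytes));
}

maxWritesPerSecond(1000); // 1000 writes/s
maxWritesPerSecond(2000); // 500 writes/s
maxWritesPerSecond(4000); // 250 writes/s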
TransactWriteItems
TransactWriteItems costs two WCUs per 1 KB to write an Item to the table: one WCU to prepare the transaction and one to commit it. Given what we learned above, you might assume that for a 1 KB Item we should be able to perform 500 transactional writes per second. Does that prove out?
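Expressed with the same helper as before (again just an illustrative sketch), that naive expectation looks like this:

// Naive expectation: the 1,000 WCU/s per-item limit divided by the
// doubled WCU cost of a transactional write.
function expectedTransactionalWritesPerSecond(itemSizeBytes: number): number {
  return Math.floor(1000 / wcusForWrite(itemSizeBytes, true));
}

expectedTransactionalWritesPerSecond(1000); // 500 tx/s, if the WCU limit were the only factor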
Code
import { DynamoDBClient, TransactWriteItemsCommand } from '@aws-sdk/client-dynamodb';

const client = new DynamoDBClient({});
const TABLE_NAME = process.env.TABLE_NAME!; // table name provided via the Lambda environment

async function transactUpdateItem(id: string) {
  const params = {
    TableName: TABLE_NAME,
    Key: {
      pk: { S: id },
      sk: { S: 'TRXS' },
    },
    UpdateExpression: 'set #count = #count + :one',
    ExpressionAttributeNames: {
      '#count': 'count',
    },
    ExpressionAttributeValues: {
      ':one': { N: '1' },
    },
  };

  let attempt = 0;
  // Retry forever so that cancelled or throttled transactions are eventually committed.
  while (true) {
    try {
      // A single-item transaction: the same atomic increment, wrapped in TransactWriteItems.
      const command = new TransactWriteItemsCommand({
        TransactItems: [{ Update: { ...params } }],
      });
      await client.send(command);
      return;
    } catch (error) {
      attempt++;
      console.error(`Update failed for item ${id}. Attempt ${attempt}.`);
    }
  }
}

interface Event {
  id: string;
}

exports.handler = async (event: Event) => {
  const id = event.id;
  if (!id) {
    throw new Error('Event does not contain an ID.');
  }

  const totalUpdates = 10000;
  const commands = [];
  // Fire all 10,000 transactional updates concurrently and wait for them to settle.
  for (let i = 0; i < totalUpdates; i++) {
    commands.push(transactUpdateItem(id));
  }
  await Promise.all(commands);
  console.log('All updates complete.');
};
TransactWriteItems results
Item Size (bytes) | Total time (seconds) | Op/s | WCU/s |
---|---|---|---|
1000 | 60 | 167 | 333 |
2000 | 90 | 111 | 444 |
4000 | 130 | 77 | 615 |
8000 | 188 | 53 | 851 |
16000 | 370 | 27 | 864 |
We see from the TransactWriteItems results that we cannot achieve the full 1,000 WCU/s through transactions. There appears to be some fixed per-transaction overhead: smaller items fall far short of the 1,000 WCU/s throughput, and it is not until item sizes grow beyond 8 KB that we get closer to (but still never quite reach) the 1,000 WCU/s partition throughput.
Our mental model for DynamoDB Writes
So what's our foundational mental model for DynamoDB write limits? Every write targets a specific Item, and that Item can absorb up to 1,000 WCU/s of Put or Update actions. This means an item of up to 1 KB can be written or updated 1,000 times per second, an item of 2 KB can be written or updated 500 times per second, and an item of 200 KB can only be written or updated 5 times per second. To exceed these limits, you must shard your writes (a sketch of this pattern follows below). And because of instant adaptive capacity, you can think of each Item as effectively having its own partition, a whole set of infrastructure dedicated to that single Item, which is pretty darn amazing. It's why no other service shards quite like DynamoDB, and why it is the ultimate multi-tenant service, a real example of the power of scale in the cloud.
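As an illustration of what write sharding can look like (a generic pattern, not code from the repo; the shard count and key layout are assumptions), you can spread a hot counter across several Items and sum them on read:

import { DynamoDBClient, UpdateItemCommand, GetItemCommand } from '@aws-sdk/client-dynamodb';

const client = new DynamoDBClient({});
const TABLE_NAME = process.env.TABLE_NAME!; // environment variable name assumed
const SHARD_COUNT = 10; // 10 shards -> roughly 10x the single-item write ceiling

// Write path: pick a random shard so each underlying Item stays under its own 1,000 WCU/s limit.
async function incrementShardedCounter(id: string) {
  const shard = Math.floor(Math.random() * SHARD_COUNT);
  await client.send(
    new UpdateItemCommand({
      TableName: TABLE_NAME,
      Key: { pk: { S: `${id}#SHARD${shard}` }, sk: { S: 'COUNTER' } },
      // ADD creates the attribute on first use, so no separate initialization is needed.
      UpdateExpression: 'ADD #count :one',
      ExpressionAttributeNames: { '#count': 'count' },
      ExpressionAttributeValues: { ':one': { N: '1' } },
    }),
  );
}

// Read path: fetch every shard and sum the counts.
async function readShardedCounter(id: string): Promise<number> {
  const reads = Array.from({ length: SHARD_COUNT }, (_, shard) =>
    client.send(
      new GetItemCommand({
        TableName: TABLE_NAME,
        Key: { pk: { S: `${id}#SHARD${shard}` }, sk: { S: 'COUNTER' } },
      }),
    ),
  );
  const results = await Promise.all(reads);
  return results.reduce((sum, r) => sum + Number(r.Item?.count?.N ?? '0'), 0);
}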
The caveat is that write throughput drops significantly once transactions are introduced, ranging from around 167 operations per second for a 1 KB item down to 27 operations per second for a 16 KB item.
About me
I am a Staff Engineer @ Veho with a passion for serverless.