DynamoDB design patterns considered harmful

#programming #aws #productivity #database

Originally published at Volisoft

Overview
Introduction
Deeper Dive: Quantification is Key
Case Study: Online Team Game - Different Data, Different Designs
The Twist: Evolving Data Changes Everything
Revised Assumptions: Stats are Scarce
Conclusion: Data-Driven Design is Key

Overview

Key Takeaway: DynamoDB design patterns are helpful illustrations, not rigid rules.

Effective DynamoDB design requires quantitative analysis of your data and access patterns.
Applying patterns without this analysis risks increased costs and degraded performance.
This article presents an automated DynamoDB design approach to address these critical challenges.

Introduction

Official AWS documentation offers DynamoDB design patterns to guide users migrating from relational databases to DynamoDB.
However, these patterns are technique demonstrations, not prescriptive solutions.
Truly efficient and cost-effective DynamoDB design depends on a deep, quantifiable understanding of your data and anticipated access patterns.

Deeper Dive: Quantification is Key

“Understanding” here means quantification.
For data, this involves knowing the volume of each entity type and the distribution of key values.
For access patterns, it means determining data retrieval volumes and the frequency of each query.
Ignoring these quantitative factors can lead to higher operational costs and reduced application performance.

Case Study: Online Team Game - Different Data, Different Designs

To illustrate the importance of data characteristics, let’s consider an online team game example.
We’ll model two entity types: Game and Stats.
Let’s define their attributes and expected volumes:

Table 1: Game and Stats Entities: Data Volumes and Attributes

Entity	Count	time+team/name[id]	time	team/name	archived?	game/data	stats/data
Game	1000	1	15	30	500	2	0
Stats	1000	1	15	30	0	0	2

In this online team game scenario, ’time’ represents the game timestamp.
On average, each team generates 30 Game records and 30 Stats records (the ’team/name’ column in Table 1).
The fields ’time’ and ’team/name’ (represented as ’time+team/name[id]’) uniquely identify both Game and Stats entities.
’archived?’, ’game/data’, and ’stats/data’ represent additional attributes associated with each entity type.

The application needs to support the following queries.
Understanding the frequency and expected return size of each query is crucial for optimal schema design:

Table 2: Application Query Profile: Frequency and Expected Return Sizes

Query Name	Entity	Partition Key	Sort key	Frequency	Return Count
time->games	Game	time		1	5
team+time->games	Game	team/name	time	20	1
time+archived?->game	Game	time	archived?	1	5
time->stats	Stats	time		1	5
team+time->stats	Stats	team/name	time	1	1

Based on these data characteristics and query patterns, a read-optimized indexing schema could be structured as follows:

Table 3: Read-Optimized Schema (Initial Data Assumptions)

:table-cnt	:table	:pk	:sk	:entity
2000	MAIN	time	team/name	Game
2000	MAIN	time	team/name	Stats
2000	GSI1	team/name		Game
2000	GSI1	team/name		Stats
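
For readers who want to see what Table 3 might look like as an actual table definition, here is a minimal sketch that builds the parameter dictionary accepted by the DynamoDB `CreateTable` operation. Several things are assumptions rather than statements from the article: both key attributes are stored as strings, billing is on-demand, the index projects all attributes, and the function name is hypothetical. The article also does not say how a Game and a Stats item with the same ’time’ and ’team/name’ are kept apart in MAIN, so this sketch leaves that question open.

```python
def main_table_definition() -> dict:
    """Build CreateTable parameters for the Table 3 layout (a sketch)."""
    return {
        "TableName": "MAIN",
        "BillingMode": "PAY_PER_REQUEST",  # assumption: on-demand capacity
        "AttributeDefinitions": [
            # Assumption: both key attributes are stored as strings ("S").
            {"AttributeName": "time", "AttributeType": "S"},
            {"AttributeName": "team/name", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "time", "KeyType": "HASH"},        # partition key
            {"AttributeName": "team/name", "KeyType": "RANGE"},  # sort key
        ],
        "GlobalSecondaryIndexes": [
            {
                "IndexName": "GSI1",
                # Table 3 lists no sort key for GSI1.
                "KeySchema": [
                    {"AttributeName": "team/name", "KeyType": "HASH"},
                ],
                "Projection": {"ProjectionType": "ALL"},  # assumption
            }
        ],
    }


if __name__ == "__main__":
    print(main_table_definition())
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("dynamodb").create_table(**main_table_definition())`.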

Table 4: Query Costs (Initial Data Assumptions)

:query	:query-tbl	:query-cost
time->games	MAIN	15
time+archived?->game	MAIN	5
time->stats	MAIN	15
team+time->games	GSI1	20
team+time->stats	GSI1	1
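
The table assignments in Table 4 can be reproduced with a very small routing rule: send each query to the table or index whose partition key for that entity matches the query's partition key. The sketch below is an assumption about how such a mapping could be derived, not the article's actual tool. The names `SCHEMA`, `QUERIES`, and `route` are hypothetical, and sort keys are ignored (a sort key such as ’archived?’ that is not part of the chosen index would have to be applied as a filter).

```python
# Table 3 as data: (table or index, entity) -> partition key.
SCHEMA = {
    ("MAIN", "Game"): "time",
    ("MAIN", "Stats"): "time",
    ("GSI1", "Game"): "team/name",
    ("GSI1", "Stats"): "team/name",
}

# Queries from Table 2: name -> (entity, partition key).
QUERIES = {
    "time->games": ("Game", "time"),
    "time+archived?->game": ("Game", "time"),
    "time->stats": ("Stats", "time"),
    "team+time->games": ("Game", "team/name"),
    "team+time->stats": ("Stats", "team/name"),
}


def route(entity: str, partition_key: str, schema: dict = SCHEMA) -> str:
    """Return the first table or index keyed on partition_key for entity."""
    for (table, indexed_entity), pk in schema.items():
        if indexed_entity == entity and pk == partition_key:
            return table
    raise LookupError(f"no index serves {entity} by {partition_key}")


if __name__ == "__main__":
    for name, (entity, pk) in QUERIES.items():
        print(name, "->", route(entity, pk))
```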

The Twist: Evolving Data Changes Everything

Software applications evolve, and so does their data.
Initial assumptions about data distribution may become outdated as requirements change.
Optimization priorities can also shift, perhaps focusing on outlier cases rather than typical scenarios.
Let’s examine how revised data assumptions impact database design.

Revised Assumptions: Stats are Scarce

Previously, we assumed an average of 30 Stats records per team.
Now, let’s keep 30 Game records per team but reduce the Stats records to just 3 per team.
Changing this single number has significant design implications.

Table 5: Read-Optimized Schema (Revised Data Assumptions)

:table-cnt	:table	:pk	:sk	:entity
2000	MAIN	time	team/name	Game
2000	MAIN	team/name	time	Stats
2000	GSI1	team/name		Game
2000	GSI1	time		Stats

Table 6: Query Costs (Revised Data Assumptions)

:query	:query-tbl	:query-cost
time->games	MAIN	15
time+archived?->game	MAIN	5
team+time->stats	MAIN	1
team+time->games	GSI1	20
time->stats	GSI1	15

This shift in Stats record volume suggests a revised indexing strategy.
Indexing Stats records with ’team/name’ as the partition key becomes more efficient because the key is now more selective: each team matches about 3 Stats records instead of 30.
A more selective partition key (fewer items per value, and therefore higher cardinality for the same number of records) helps DynamoDB distribute data evenly across partitions.
Consequently, the query mapping changes: retrieving an individual Stats record by ’team/name’ [PK] and ’time’ [SK] is now served efficiently by the MAIN table.
Conversely, retrieving Stats records by ’time’ is now better served by querying the GSI1 index.
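
The choice of partition key for Stats in both scenarios can be reproduced with a one-line heuristic: pick the attribute with the fewest records per value. This is a deliberately simplified sketch, not the automated analysis the article refers to. The function name is hypothetical, and the sketch assumes that the 15 records per ’time’ value from Table 1 is unchanged in the revised scenario.

```python
def most_selective_key(records_per_value: dict) -> str:
    """Pick the attribute whose values each match the fewest records."""
    return min(records_per_value, key=records_per_value.get)


# Average Stats records sharing one value of each candidate key.
INITIAL_STATS = {"time": 15, "team/name": 30}
# Assumption: only the per-team figure changes; 'time' stays at 15.
REVISED_STATS = {"time": 15, "team/name": 3}

if __name__ == "__main__":
    print("initial:", most_selective_key(INITIAL_STATS))
    print("revised:", most_selective_key(REVISED_STATS))
```

Under the initial numbers the heuristic selects ’time’, and under the revised numbers it selects ’team/name’, which matches the MAIN partition keys for Stats in the two schema tables.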

Conclusion: Data-Driven Design is Key

Key Takeaway: Data-driven design is critical.

Different data characteristics suggest different design choices.
Blindly applying patterns can be costly and inefficient.
Embracing a data-centric approach, especially with automated analysis, leads to efficient, cost-effective, and performant DynamoDB database designs.