<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Volisoft</title>
    <description>The latest articles on DEV Community by Volisoft (@volisoft).</description>
    <link>https://dev.to/volisoft</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F9227%2F8dd3ad6a-4bb6-48ad-a7a0-6f300618cff8.jpg</url>
      <title>DEV Community: Volisoft</title>
      <link>https://dev.to/volisoft</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/volisoft"/>
    <language>en</language>
    <item>
      <title>DynamoDB design patterns considered harmful</title>
      <dc:creator>V</dc:creator>
      <pubDate>Tue, 18 Feb 2025 04:10:06 +0000</pubDate>
      <link>https://dev.to/volisoft/dynamodb-design-patterns-considered-harmful-bpe</link>
      <guid>https://dev.to/volisoft/dynamodb-design-patterns-considered-harmful-bpe</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://volisoft.org/blog-dynamodb-patterns-harmful.html" rel="noopener noreferrer"&gt;Volisoft&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Table of Contents&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Overview&lt;/li&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Deeper Dive: Quantification is Key&lt;/li&gt;
&lt;li&gt;Case Study: Online Team Game - Different Data, Different Designs&lt;/li&gt;
&lt;li&gt;The Twist: Evolving Data Changes Everything&lt;/li&gt;
&lt;li&gt;Revised Assumptions: Stats are Scarce&lt;/li&gt;
&lt;li&gt;Conclusion: Data-Driven Design is Key&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a id="orgc7576f8"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Overview&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Key Takeaway: DynamoDB design patterns are helpful illustrations, &lt;b&gt;not&lt;/b&gt; rigid rules.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Effective DynamoDB design requires quantitative analysis of your data and access patterns.&lt;br&gt;
Applying patterns without this analysis risks increased costs and degraded performance.&lt;br&gt;
This article presents an automated DynamoDB design approach to address these critical challenges.&lt;/p&gt;

&lt;p&gt;&lt;a id="org4c31352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Official AWS documentation offers DynamoDB design patterns to guide users migrating from relational to NoSQL databases, specifically DynamoDB.&lt;br&gt;
However, these patterns are technique demonstrations, &lt;strong&gt;not&lt;/strong&gt; prescriptive solutions.&lt;br&gt;
Truly efficient and cost-effective DynamoDB design depends on a deep, quantifiable understanding of your data and anticipated access patterns.&lt;/p&gt;

&lt;p&gt;&lt;a id="orgdddba59"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Deeper Dive: Quantification is Key&lt;/h2&gt;

&lt;p&gt;“Understanding” here means quantification.&lt;br&gt;
For &lt;strong&gt;data&lt;/strong&gt;, this involves knowing the volume of each entity type and the distribution of key values.&lt;br&gt;
For &lt;strong&gt;access patterns&lt;/strong&gt;, it means determining data retrieval volumes and the frequency of each query.&lt;br&gt;
Ignoring these quantitative factors can lead to higher operational costs and reduced application performance.&lt;/p&gt;
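&lt;p&gt;To make "quantification" concrete, the distribution of a candidate key can be measured directly from a sample of records. Below is a minimal Python sketch; the sample records and their values are invented purely for illustration:&lt;/p&gt;

```python
from collections import Counter

# Hypothetical sample of Game records as (time, team) pairs
games = [
    ("2025-01-01", "red"), ("2025-01-01", "blue"),
    ("2025-01-01", "green"), ("2025-01-02", "red"),
]

# Items matched per candidate partition-key value:
# a key value that matches fewer items yields cheaper point queries.
per_time = Counter(t for t, _ in games)
per_team = Counter(team for _, team in games)

print(per_time["2025-01-01"], per_time["2025-01-02"])  # 3 1
print(per_team["red"], per_team["blue"])               # 2 1
```

&lt;p&gt;Counts like these, gathered over a realistic sample, are exactly the inputs a schema decision needs.&lt;/p&gt;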

&lt;p&gt;&lt;a id="orgfe00bd4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Case Study: Online Team Game - Different Data, Different Designs&lt;/h2&gt;

&lt;p&gt;To illustrate the importance of data characteristics, let’s consider an online team game example.&lt;br&gt;
We’ll model two entity types: &lt;code&gt;Game&lt;/code&gt; and &lt;code&gt;Stats&lt;/code&gt;.&lt;br&gt;
Let’s define their attributes and expected volumes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;span&gt;Table 1:&lt;/span&gt; Game and Stats Entities: Data Volumes and Attributes

&lt;colgroup&gt;
&lt;col&gt;

&lt;col&gt;

&lt;col&gt;

&lt;col&gt;

&lt;col&gt;

&lt;col&gt;

&lt;col&gt;

&lt;col&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Entity&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;time+team/name[id]&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;team/name&lt;/th&gt;
&lt;th&gt;archived?&lt;/th&gt;
&lt;th&gt;game/data&lt;/th&gt;
&lt;th&gt;stats/data&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Game&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;Stats&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In this online team game scenario, ’time’ represents the game timestamp.&lt;br&gt;
On average, each team generates 30 &lt;code&gt;Game&lt;/code&gt; records and 30 &lt;code&gt;Stats&lt;/code&gt; records.&lt;br&gt;
The fields ’time’ and ’team/name’ (represented as ’time+team/name[id]’) uniquely identify both &lt;code&gt;Game&lt;/code&gt; and &lt;code&gt;Stats&lt;/code&gt; entities.&lt;br&gt;
’archived?’, ’game/data’, and ’stats/data’ represent additional attributes associated with each entity type.&lt;/p&gt;

&lt;p&gt;The application needs to support the following queries.&lt;br&gt;
Understanding the frequency and expected return size of each query is crucial for optimal schema design:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;span&gt;Table 2:&lt;/span&gt; Application Query Profile: Frequency and Expected Return Sizes

&lt;colgroup&gt;
&lt;col&gt;

&lt;col&gt;

&lt;col&gt;

&lt;col&gt;

&lt;col&gt;

&lt;col&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query Name&lt;/th&gt;
&lt;th&gt;Entity&lt;/th&gt;
&lt;th&gt;Partition Key&lt;/th&gt;
&lt;th&gt;Sort Key&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;th&gt;Return Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;time-&amp;gt;games&lt;/td&gt;
&lt;td&gt;Game&lt;/td&gt;
&lt;td&gt;time&lt;/td&gt;
&lt;td&gt; &lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;team+time-&amp;gt;games&lt;/td&gt;
&lt;td&gt;Game&lt;/td&gt;
&lt;td&gt;team/name&lt;/td&gt;
&lt;td&gt;time&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;time+archived?-&amp;gt;game&lt;/td&gt;
&lt;td&gt;Game&lt;/td&gt;
&lt;td&gt;time&lt;/td&gt;
&lt;td&gt;archived?&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;time-&amp;gt;stats&lt;/td&gt;
&lt;td&gt;Stats&lt;/td&gt;
&lt;td&gt;time&lt;/td&gt;
&lt;td&gt; &lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;team+time-&amp;gt;stats&lt;/td&gt;
&lt;td&gt;Stats&lt;/td&gt;
&lt;td&gt;team/name&lt;/td&gt;
&lt;td&gt;time&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Based on these data characteristics and query patterns, a read-optimized indexing schema &lt;strong&gt;could be&lt;/strong&gt; structured as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;span&gt;Table 3:&lt;/span&gt; Read-Optimized Schema (Initial Data Assumptions)

&lt;colgroup&gt;
&lt;col&gt;

&lt;col&gt;

&lt;col&gt;

&lt;col&gt;

&lt;col&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;:table-cnt&lt;/th&gt;
&lt;th&gt;:table&lt;/th&gt;
&lt;th&gt;:pk&lt;/th&gt;
&lt;th&gt;:sk&lt;/th&gt;
&lt;th&gt;:entity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2000&lt;/td&gt;
&lt;td&gt;MAIN&lt;/td&gt;
&lt;td&gt;time&lt;/td&gt;
&lt;td&gt;team/name&lt;/td&gt;
&lt;td&gt;Game&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;2000&lt;/td&gt;
&lt;td&gt;MAIN&lt;/td&gt;
&lt;td&gt;time&lt;/td&gt;
&lt;td&gt;team/name&lt;/td&gt;
&lt;td&gt;Stats&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;2000&lt;/td&gt;
&lt;td&gt;GSI1&lt;/td&gt;
&lt;td&gt;team/name&lt;/td&gt;
&lt;td&gt; &lt;/td&gt;
&lt;td&gt;Game&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;2000&lt;/td&gt;
&lt;td&gt;GSI1&lt;/td&gt;
&lt;td&gt;team/name&lt;/td&gt;
&lt;td&gt; &lt;/td&gt;
&lt;td&gt;Stats&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;span&gt;Table 4:&lt;/span&gt; Query Costs (Initial Data Assumptions)

&lt;colgroup&gt;
&lt;col&gt;

&lt;col&gt;

&lt;col&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;:query&lt;/th&gt;
&lt;th&gt;:query-tbl&lt;/th&gt;
&lt;th&gt;:query-cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;time-&amp;gt;games&lt;/td&gt;
&lt;td&gt;MAIN&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;time+archived?-&amp;gt;game&lt;/td&gt;
&lt;td&gt;MAIN&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;time-&amp;gt;stats&lt;/td&gt;
&lt;td&gt;MAIN&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;team+time-&amp;gt;games&lt;/td&gt;
&lt;td&gt;GSI1&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;team+time-&amp;gt;stats&lt;/td&gt;
&lt;td&gt;GSI1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
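&lt;p&gt;One way to sanity-check such a schema is to weight each query's per-execution cost (items read, Table 4) by its frequency (Table 2). A minimal, illustrative Python sketch follows; the helper function is our own shorthand, not part of any DynamoDB API:&lt;/p&gt;

```python
# (query, frequency, items read per execution) -- values from Tables 2 and 4
profile = [
    ("time->games",          1, 15),
    ("time+archived?->game", 1,  5),
    ("time->stats",          1, 15),
    ("team+time->games",    20, 20),
    ("team+time->stats",     1,  1),
]

def total_read_load(profile):
    """Frequency-weighted item reads across the whole query mix."""
    return sum(freq * cost for _, freq, cost in profile)

print(total_read_load(profile))  # 436 -- dominated by team+time->games
```

&lt;p&gt;The dominant term immediately shows which access pattern the schema should be optimized for.&lt;/p&gt;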

&lt;p&gt;&lt;a id="orga3943ec"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The Twist: Evolving Data Changes Everything&lt;/h2&gt;

&lt;p&gt;Software applications evolve, and so does their data.&lt;br&gt;
Initial assumptions about data distribution may become outdated as requirements change.&lt;br&gt;
Optimization priorities can also shift, perhaps focusing on outlier cases rather than typical scenarios.&lt;br&gt;
Let’s examine how revised data assumptions impact database design.&lt;/p&gt;

&lt;p&gt;&lt;a id="org9fba138"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Revised Assumptions: Stats are Scarce&lt;/h2&gt;

&lt;p&gt;Previously, we assumed &lt;strong&gt;an average&lt;/strong&gt; of 30 &lt;code&gt;Stats&lt;/code&gt; records &lt;strong&gt;per team&lt;/strong&gt;.&lt;br&gt;
Now, let’s assume we still have 30 &lt;code&gt;Game&lt;/code&gt; records per team, but dramatically &lt;strong&gt;reduce&lt;/strong&gt; the &lt;code&gt;Stats&lt;/code&gt; records to just 3 per team.&lt;br&gt;
This seemingly small change has significant design implications.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;span&gt;Table 5:&lt;/span&gt; Read-Optimized Schema (Revised Data Assumptions)

&lt;colgroup&gt;
&lt;col&gt;

&lt;col&gt;

&lt;col&gt;

&lt;col&gt;

&lt;col&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;:table-cnt&lt;/th&gt;
&lt;th&gt;:table&lt;/th&gt;
&lt;th&gt;:pk&lt;/th&gt;
&lt;th&gt;:sk&lt;/th&gt;
&lt;th&gt;:entity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2000&lt;/td&gt;
&lt;td&gt;MAIN&lt;/td&gt;
&lt;td&gt;time&lt;/td&gt;
&lt;td&gt;team/name&lt;/td&gt;
&lt;td&gt;Game&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;2000&lt;/td&gt;
&lt;td&gt;MAIN&lt;/td&gt;
&lt;td&gt;&lt;b&gt;team/name&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;b&gt;time&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Stats&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;2000&lt;/td&gt;
&lt;td&gt;GSI1&lt;/td&gt;
&lt;td&gt;team/name&lt;/td&gt;
&lt;td&gt; &lt;/td&gt;
&lt;td&gt;Game&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;2000&lt;/td&gt;
&lt;td&gt;GSI1&lt;/td&gt;
&lt;td&gt;time&lt;/td&gt;
&lt;td&gt; &lt;/td&gt;
&lt;td&gt;Stats&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;span&gt;Table 6:&lt;/span&gt; Queries (Revised Data Assumptions)

&lt;colgroup&gt;
&lt;col&gt;

&lt;col&gt;

&lt;col&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;:query&lt;/th&gt;
&lt;th&gt;:query-tbl&lt;/th&gt;
&lt;th&gt;:query-cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;time-&amp;gt;games&lt;/td&gt;
&lt;td&gt;MAIN&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;time+archived?-&amp;gt;game&lt;/td&gt;
&lt;td&gt;MAIN&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;&lt;b&gt;team+time-&amp;gt;stats&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;MAIN&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;team+time-&amp;gt;games&lt;/td&gt;
&lt;td&gt;GSI1&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td&gt;&lt;b&gt;time-&amp;gt;stats&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;GSI1&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This shift in &lt;code&gt;Stats&lt;/code&gt; record volume suggests a revised indexing strategy.&lt;br&gt;
With only a few &lt;code&gt;Stats&lt;/code&gt; records per team, indexing them with ’team/name’ as the partition key becomes more efficient: each key value now matches far fewer items.&lt;br&gt;
A partition key that matches fewer items per value keeps individual queries cheap and helps DynamoDB distribute items evenly across partitions.&lt;br&gt;
Consequently, the query mapping adapts: retrieving &lt;code&gt;Stats&lt;/code&gt; records by ’team/name’ [PK] and ’time’ [SK] for individual items can now be efficiently executed on the MAIN table.&lt;br&gt;
Conversely, retrieving &lt;code&gt;Stats&lt;/code&gt; records by ’time’ is now better served by querying the GSI1 index.&lt;/p&gt;

&lt;p&gt;&lt;a id="org8a25363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Conclusion: Data-Driven Design is Key&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Key Takeaway: Data-driven design is critical.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Different data characteristics suggest different design choices.&lt;br&gt;
Blindly applying patterns can be costly and inefficient.&lt;br&gt;
Embracing a data-centric approach, especially with &lt;a href="https://volisoft.org/ddb.html" rel="noopener noreferrer"&gt;automated analysis&lt;/a&gt;, leads to efficient, cost-effective, and performant DynamoDB database designs.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>aws</category>
      <category>productivity</category>
      <category>database</category>
    </item>
    <item>
      <title>Start from the Middle: Making Programming Easier (Part 1)</title>
      <dc:creator>V</dc:creator>
      <pubDate>Fri, 26 Jul 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/volisoft/start-from-the-middle-making-programming-easier-part-1-25kj</link>
      <guid>https://dev.to/volisoft/start-from-the-middle-making-programming-easier-part-1-25kj</guid>
      <description>&lt;h2&gt;Start from the Middle: Making Programming Easier (Part 1)&lt;/h2&gt;


&lt;p&gt;This post is a continuation of “Start from the Middle: How to Solve a Problem.”&lt;/p&gt;
&lt;p&gt;Programming is challenging for many reasons, including the multitude of decisions a programmer must make during development. Often, when faced with a problem, it’s not even clear where to start. My advice is to start from the middle.&lt;/p&gt;



&lt;h3&gt;Problem-solving process&lt;/h3&gt;

&lt;p&gt;The previous post described a ‘philosophical’ view on problem-solving. The idea is that starting from the middle of a problem is advantageous because it provides a retrospective view — looking backward as if the problem, or part of it, has already been solved. This allows us to make a better decision about where to start. It also makes it easier to think about the next steps, as we can abstract from the details of how we got to the middle and focus on the rest of the solution.&lt;/p&gt;
&lt;p&gt;While this approach is universal and can be applied to numerous problems, how can we apply it to programming? Here’s the outline of the proposed method.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Look at the state in the middle of computation&lt;/li&gt;
&lt;li&gt;Determine variables and their initial state&lt;/li&gt;
&lt;li&gt;Determine the final condition&lt;/li&gt;
&lt;li&gt;Code the rest&lt;/li&gt;
&lt;/ul&gt;





&lt;h3&gt;The Middle Part&lt;/h3&gt;

&lt;p&gt;Starting from the middle means looking at the problem as if some part of it has already been solved, visualizing &lt;b&gt;what has been done&lt;/b&gt; so far and &lt;b&gt;what remains to be done&lt;/b&gt;. What constitutes the middle of a computation depends on the program. To make things more concrete, let’s assume our program has a loop, and the loop is the central part of the algorithm. In this case, it is helpful to look at the state of the program at the beginning of the loop iteration.&lt;/p&gt;
&lt;p&gt;What has been computed so far? What variables are needed to represent that work? What does the value of each variable represent? Is this information sufficient to complete the computation and obtain the result?&lt;/p&gt;
&lt;p&gt;These are some questions we may need to ask ourselves.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Note:&lt;/b&gt; not all programs have loops, but the most useful and interesting ones use iteration/recursion.&lt;/p&gt;





&lt;h3&gt;Trivial Example&lt;/h3&gt;

&lt;p&gt;For the rest of this article, we’ll assume the reader has basic programming knowledge and some familiarity with the Python programming language. Python has a simple syntax, and hopefully, readers can easily translate the examples here to their preferred language. &lt;/p&gt; 
&lt;p&gt;As the first example, we will do a list summation problem: writing a program that computes the sum of all the numbers in a list &lt;code&gt;L&lt;/code&gt;. The result should be stored in the variable &lt;code&gt;s&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;s := ∑L&lt;/pre&gt;
&lt;p&gt;It’s clear that we need to look at each number in the list to find the sum, so we need a loop. Let’s imagine the state of the program in the middle of its execution. &lt;/p&gt;
&lt;p&gt;&lt;b&gt;What has been done so far? &lt;/b&gt;It is reasonable to think that we have a partial sum up to some index &lt;code&gt;i&lt;/code&gt; of the list &lt;code&gt;L&lt;/code&gt;. Therefore, we need the variable &lt;code&gt;i&lt;/code&gt; to represent this.&lt;/p&gt;
&lt;p&gt;Now, what exactly does the sum &lt;code&gt;s&lt;/code&gt; represent at this point? Should the sum include the value at &lt;code&gt;L[i]&lt;/code&gt; or not? Since it doesn’t seem to be important, we arbitrarily choose to exclude &lt;code&gt;L[i]&lt;/code&gt; from the sum at the beginning of the iteration. With this interpretation, &lt;code&gt;i&lt;/code&gt; represents the &lt;b&gt;number of items&lt;/b&gt; processed so far. If it’s not clear why &lt;code&gt;i&lt;/code&gt; represents the number of list items processed, consider the case when &lt;code&gt;i=0&lt;/code&gt;. This interpretation of &lt;code&gt;s&lt;/code&gt; affects the value of &lt;code&gt;s&lt;/code&gt; and &lt;code&gt;i&lt;/code&gt; at the beginning of the program.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Initial state. &lt;/b&gt;We initialize &lt;code&gt;i&lt;/code&gt; with 0 because we haven’t processed any list items yet. Since there are no numbers before &lt;code&gt;i=0&lt;/code&gt;, and the sum of an empty segment is 0, we initialize &lt;code&gt;s&lt;/code&gt; with 0.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;When should the computation stop? &lt;/b&gt;We assumed that we’re going to iterate through the list in a loop. We should stop the iteration when all numbers are included in the sum &lt;code&gt;s&lt;/code&gt;. We remind ourselves again about the meaning of &lt;code&gt;i&lt;/code&gt; — the number of list items processed so far. That means the program should stop when the whole list has been processed. In other words, the loop should continue while &lt;code&gt;i&lt;/code&gt; remains less than the length of the list: &lt;code&gt;i &amp;lt; len(L)&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here’s the program so far:&lt;/p&gt;

&lt;pre&gt;
def sum_of(L):
   i = 0 # number of processed list items
   s = 0 # sum of the first i items of L

   while i &amp;lt; len(L):
     # invariant: 0 &amp;lt;= i &amp;lt; len(L)
     # s = ∑ 0&amp;lt;=n&amp;lt;i, L[n] -- sum of L items up to index i exclusive
     ...

   # i = len(L) -- after loop exits
   return s
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Filling in the details. &lt;/b&gt;Let’s take a look at the body of the loop. To make progress, we need to increase &lt;code&gt;i&lt;/code&gt;. The minimal step is to increase &lt;code&gt;i&lt;/code&gt; by 1, so we add the &lt;code&gt;i&lt;/code&gt; increment to the loop. We also need to update &lt;code&gt;s&lt;/code&gt; &lt;b&gt;before&lt;/b&gt; the &lt;code&gt;i&lt;/code&gt; increment, not after. (Do you see why? Hint: remember what &lt;code&gt;i&lt;/code&gt; means). We update the program.&lt;/p&gt;
&lt;pre&gt;
def sum_of(L):
   i = 0 # number of processed list items
   s = 0 # sum of the first i items of L

   while i &amp;lt; len(L):
     # invariant: 0 &amp;lt;= i &amp;lt; len(L)
     # s = ∑ 0&amp;lt;=n&amp;lt;i, L[n] -- sum of L items up to index i exclusive

     s = s+L[i]
     i = i+1

   # i = len(L) -- after loop exits
   return s
&lt;/pre&gt;
&lt;p&gt;This completes the development.&lt;/p&gt;
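&lt;p&gt;For a quick check, here is a self-contained copy of the finished function with a few example calls. The loop condition is written as i != len(L), which is equivalent to the comparison used above because i starts at 0 and increases by exactly 1 per iteration:&lt;/p&gt;

```python
def sum_of(L):
    i = 0  # number of processed list items
    s = 0  # sum of the first i items of L
    while i != len(L):  # same stop point: i grows by 1 from 0
        s = s + L[i]
        i = i + 1
    return s

print(sum_of([]))         # 0
print(sum_of([1, 2, 3]))  # 6
print(sum_of([5]))        # 5
```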
&lt;p&gt;As an exercise, try to solve the list summation problem, but now interpret &lt;code&gt;s&lt;/code&gt; as the sum of the list items in the range &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;i&lt;/code&gt; inclusive. This will yield a different program.&lt;/p&gt;





&lt;h3&gt;Summary&lt;/h3&gt;

&lt;p&gt;Starting from the middle means looking at the problem as if some part of it has already been solved, visualizing &lt;b&gt;what has been done&lt;/b&gt; so far and &lt;b&gt;what remains to be done&lt;/b&gt;. Using a simple list summation problem, we highlighted the steps to iteratively build solutions using this method. Each step is justified, minimizing arbitrary decisions and promoting a systematic approach.&lt;/p&gt;
&lt;p&gt;A common approach to programming involves guessing and debugging until the program works as expected. However, programming becomes much easier when approached methodically.&lt;/p&gt;
&lt;p&gt;The method discussed here transforms arbitrary decision-making into a structured activity. This method is not limited to programming but can also be applied to any problem-solving task.&lt;/p&gt;





</description>
      <category>problemsolving</category>
      <category>leetcode</category>
      <category>programming</category>
    </item>
    <item>
      <title>Start from the Middle: How to Solve a Problem (Part 0)</title>
      <dc:creator>V</dc:creator>
      <pubDate>Fri, 26 Jul 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/volisoft/look-in-the-middle-how-to-solve-a-problem-part-0-9jn</link>
      <guid>https://dev.to/volisoft/look-in-the-middle-how-to-solve-a-problem-part-0-9jn</guid>
      <description>&lt;p&gt;You are faced with a task, a problem you need to solve. There are many ways to start, making it hard to choose. Once a choice is made, further down the road it turns out to be unfit, and you reset back to the start. Idea after idea, option after option, there’s a slow progress. After each unsuccessful attempt, you are back to square one.&lt;/p&gt;

&lt;p&gt;If only there were a way to know which of the many options would work. A way to fast-forward and retrospectively discern the decisions that led to success. Rather than starting at the beginning, this crossroad of decisions, why not project ourselves a bit further along the trajectory?&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Why not start from the middle?&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;Overcoming hard challenges yields invaluable experience. In retrospect, it becomes evident which decisions were good and which were not. This retrospective clarity is the perspective we need when confronting new problems.&lt;/p&gt;

&lt;p&gt;Let's term this approach “starting from the middle”. Envisioning from the midpoint means to visualize &lt;b&gt;what has been done&lt;/b&gt; so far and &lt;b&gt;what remains to be done&lt;/b&gt;. From this vantage point, it becomes feasible to identify the probable steps taken and actions that are likely to lead forward.&lt;/p&gt;

&lt;p&gt;To illustrate, imagine setting a goal to write a novel. The prospect is thrilling yet overwhelming. You are inundated with advice on plot structure, character development, narrative style, and countless other aspects of novel-writing. Where do you begin? Each attempt to outline or draft the first chapter seems inadequate, leading you back to the beginning, frustrated and uncertain.&lt;/p&gt;

&lt;p&gt;Now let’s take a midpoint view. Project yourself some time into the future. Imagine you’ve written half of your novel. What does your manuscript look like? What themes have emerged? How have your characters evolved? What plot twists have captivated your readers? Do you feel the narrative flow is engaging and coherent?&lt;/p&gt;

&lt;p&gt;If envisioning the mid-term result proves elusive, perhaps there’s too much information missing. It’s important to start not too far from the beginning, ensuring it's possible to connect the dots between the middle and the start. Equally vital is recognizing that not all problems have a solution. Identifying an insurmountable problem early conserves time, energy, and mental well-being.&lt;/p&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;The beginning of problem-solving overwhelms our minds with a myriad of options. Working backwards from the goal can occasionally be effective, but often the disconnect is too vast to bridge the end result with the current state. Starting from the middle seems like an optimal strategy.&lt;/p&gt;



</description>
      <category>problemsolving</category>
    </item>
    <item>
      <title>Streamlining NoSQL Database Design with AI: A Case Study using Amazon DynamoDB</title>
      <dc:creator>V</dc:creator>
      <pubDate>Wed, 20 Mar 2024 00:00:00 +0000</pubDate>
      <link>https://dev.to/volisoft/dynamodb-design-automated-e-commerce-example-22pa</link>
      <guid>https://dev.to/volisoft/dynamodb-design-automated-e-commerce-example-22pa</guid>
      <description>&lt;p&gt;In this article, we explore a basic yet practical use case, and demonstrate how it can be modeled in Amazon DynamoDB with a single-table design approach. Leveraging &lt;a href="https://volisoft.org/ddb.html" rel="noopener noreferrer"&gt;NoSQL Architect&lt;/a&gt;, an AI-powered tool, we showcase the potential of automated database design.&lt;/p&gt;

&lt;h2&gt;E-commerce Application Example&lt;/h2&gt;

&lt;p&gt;Let's consider an e-commerce application for processing customer orders. Here's a breakdown of the entities and their attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Product: &lt;/strong&gt;(product_id, name, category, price, quantity)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order: &lt;/strong&gt;(order_id, customer_id, order_date, status)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customer: &lt;/strong&gt;(customer_id, name, email, shipping_address)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt; The application needs to support these queries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query products by category&lt;/li&gt;
&lt;li&gt;Query orders based on customer ID and order status&lt;/li&gt;
&lt;li&gt;Query customer details along with all their associated orders&lt;/li&gt;
&lt;li&gt;Query the latest orders for a specific customer&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Automating Design with NoSQL Architect&lt;/h2&gt;

&lt;p&gt;With information about entities and queries, we can generate a database design using &lt;a href="https://volisoft.org/ddb.html" rel="noopener noreferrer"&gt;NoSQL Architect.&lt;/a&gt;

&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3aknyjtfouknza75570.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3aknyjtfouknza75570.png" width="800" height="206"&gt;&lt;/a&gt;

&lt;/p&gt;

&lt;p&gt;Here, the &lt;code&gt;Entities/Cardinalities&lt;/code&gt; table specifies entities, data fields and their &lt;em&gt;cardinalities&lt;/em&gt;. &lt;em&gt;Cardinality&lt;/em&gt; is the estimated number of unique entity items associated with a field. For example, in the &lt;code&gt;Product&lt;/code&gt; entity, the &lt;code&gt;p/id&lt;/code&gt; field has a cardinality of 1, because &lt;code&gt;p/id&lt;/code&gt; is a unique field: it is associated with exactly &lt;em&gt;one&lt;/em&gt; &lt;code&gt;Product&lt;/code&gt; record. The &lt;code&gt;Customer&lt;/code&gt; entity and the &lt;code&gt;p/id&lt;/code&gt; field have a cardinality of 0 because there is no association between them. Similarly, &lt;code&gt;Product&lt;/code&gt; and &lt;code&gt;p/name&lt;/code&gt; have a cardinality of 2 (max), because we estimate that at most 2 products can share the same name, based on our sample data. It's important to note that cardinality can be modeled as a maximum, average, minimum or any other relevant statistic. In our example we use &lt;em&gt;max&lt;/em&gt;.&lt;/p&gt;
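&lt;p&gt;Cardinality in this sense can be estimated straight from sample data: count how many records share each field value, then take the statistic of interest. A minimal Python sketch; the sample Product records below are hypothetical:&lt;/p&gt;

```python
from collections import Counter

# Hypothetical sample of Product records as (p_id, p_name) pairs
products = [("p1", "mug"), ("p2", "mug"), ("p3", "tee"), ("p4", "cap")]

def max_cardinality(values):
    """Largest number of records sharing one field value (the 'max' statistic)."""
    return max(Counter(values).values())

print(max_cardinality(pid for pid, _ in products))    # 1 -- p/id is unique
print(max_cardinality(name for _, name in products))  # 2 -- at most 2 share a name
```

&lt;p&gt;Swapping max for a mean or percentile over the same counts gives the other statistics mentioned above.&lt;/p&gt;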

&lt;p&gt;Every case is different, and these assumptions about data will likely not hold true for a different e-commerce system. Moreover, these assumptions may change with time, in which case the design decisions should be revised. This is where automation tools like &lt;a href="https://volisoft.org/ddb.html" rel="noopener noreferrer"&gt;NoSQL Architect&lt;/a&gt; are most useful. By simply updating the inputs, NoSQL Architect can automatically output an optimal schema for the database.&lt;/p&gt;

&lt;h2&gt;Optimized Design in Seconds&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://volisoft.org/ddb.html" rel="noopener noreferrer"&gt;NoSQL Architect&lt;/a&gt; delivers a cost-effective database design optimized for both read and storage efficiency, typically within seconds. Here's a sample solution:&lt;/p&gt;

&lt;pre&gt;
Read-optimized schema
| :table-cnt | :table |        :pk |       :sk |  :entity |
|------------+--------+------------+-----------+----------|
|    1101000 |   MAIN |       p/id |           |  Product |
|    1101000 |   MAIN |       o/id |           |    Order |
|    1101000 |   MAIN |       c/id |           | Customer |
|    1001000 |   GSI1 |       c/id |  o/status |    Order |
|    1001000 |   GSI1 |       c/id | c/address | Customer |
|     100000 |   GSI2 | p/category |           |  Product |


Queries
|              :query | :query-tbl | :query-cost |
|---------------------+------------+-------------|
|     Order by status |       GSI1 |           5 |
| All customer orders |       GSI1 |         100 |
|       Latest orders |       GSI1 |         100 |
|   Prod. by category |       GSI2 |       10000 |
&lt;/pre&gt;

&lt;h2&gt;Summary&lt;/h2&gt;


&lt;p&gt;&lt;a href="https://volisoft.org/ddb.html" rel="noopener noreferrer"&gt;NoSQL Architect&lt;/a&gt; offers a unique, free solution for generating optimized database schemas, setting a new standard in database design automation. This AI-powered tool lets you create efficient single-table designs that scale effortlessly with your application's growth. As software requirements evolve, the database schema must accommodate these changes. Redesigning is an extremely costly endeavor, and reducing development costs is the primary motivation behind &lt;a href="https://volisoft.org/ddb.html" rel="noopener noreferrer"&gt;NoSQL Architect&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>singletabledesign</category>
      <category>dynamodb</category>
      <category>datamodelling</category>
    </item>
    <item>
      <title>NoSQL Architect vs AWS expert</title>
      <dc:creator>V</dc:creator>
      <pubDate>Thu, 30 Nov 2023 00:00:00 +0000</pubDate>
      <link>https://dev.to/volisoft/nosql-architect-vs-aws-expert-28d6</link>
      <guid>https://dev.to/volisoft/nosql-architect-vs-aws-expert-28d6</guid>
      <description>&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Let's compare two AWS DynamoDB database schemas: one designed by an AWS expert and one generated by the automated &lt;a href="https://volisoft.org" rel="noopener noreferrer"&gt;NoSQL Architect&lt;/a&gt; tool.
The human expert’s schema incorporates best practices and years of experience in the field. Our tool, on the other hand, uses mathematical modeling to optimize the schema for the specific characteristics of the data.&lt;/p&gt;

&lt;p&gt;The primary focus of this comparison is to analyze cost savings, particularly in terms of read queries and storage costs.&lt;/p&gt;



&lt;h2&gt;Case study&lt;/h2&gt;

&lt;p&gt;For the experiment we take an example from the &lt;a href="https://aws.amazon.com/blogs/compute/creating-a-single-table-design-with-amazon-dynamodb/" rel="noopener noreferrer"&gt;AWS blog&lt;/a&gt;. The author discusses the concept of a single-table design: the idea is to store all application data in a single table. This may seem counterintuitive to those familiar with relational databases, yet Amazon uses this approach for its internal designs. It is also the approach our &lt;b&gt;NoSQL Architect&lt;/b&gt; tool uses to generate database schemas.&lt;/p&gt;

&lt;p&gt;To summarize, the blog post walks through converting a relational model into a single AWS DynamoDB table.&lt;/p&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--M__Mp5QA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://d2908q01vomqb2.cloudfront.net/1b6453892473a467d07372d45eb05abc2031647a/2021/07/21/single-table-1.png" class="article-body-image-wrapper"&gt;&lt;img alt="single-table-1.png" src="https://res.cloudinary.com/practicaldev/image/fetch/s--M__Mp5QA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://d2908q01vomqb2.cloudfront.net/1b6453892473a467d07372d45eb05abc2031647a/2021/07/21/single-table-1.png" width="800" height="202"&gt;&lt;/a&gt;Relational model of the Alleycat application from the AWS Blog.&lt;/p&gt;



&lt;h2&gt;Setup&lt;/h2&gt;

&lt;p&gt;We start by listing the access patterns from the article:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Get the results for each race by racer ID.&lt;/li&gt;
&lt;li&gt;Get a list of races by class ID.&lt;/li&gt;
&lt;li&gt;Get the best performance by racer for a class ID.&lt;/li&gt;
&lt;li&gt;Get the list of top scores by race ID.&lt;/li&gt;
&lt;li&gt;Get the second-by-second performance by racer for all races.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The database entities with their attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;classes: class-id, class-name&lt;/li&gt;
&lt;li&gt;races: race-id, class-id&lt;/li&gt;
&lt;li&gt;race-results: race-id, racer-id, second&lt;/li&gt;
&lt;li&gt;racers: racer-id, racer-name&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last attribute appears later in the article but is not listed in the relational model; we include it here as part of the data description.&lt;/p&gt;

&lt;p&gt;The effectiveness of any database design depends on how the data is distributed and accessed. Key factors include query frequency, the average number of records returned per query, and the size of the data set. &lt;/p&gt;

&lt;p&gt;The AWS blog post does not provide details about data distribution or query behavior, so we made reasonable assumptions about query frequencies and record counts, summarized below. We assume a dataset of 1 million records.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;colgroup&gt;
&lt;col&gt;
&lt;col&gt;
&lt;col&gt;
&lt;col&gt;
&lt;col&gt;
&lt;col&gt;
&lt;/colgroup&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Query #&lt;/th&gt;
&lt;th&gt;# of records returned&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;th&gt;PK&lt;/th&gt;
&lt;th&gt;SK&lt;/th&gt;
&lt;th&gt;Return&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;td&gt;“racer-id”&lt;/td&gt;
&lt;td&gt; &lt;/td&gt;
&lt;td&gt;“race-id”,“second”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;td&gt;“class-id”&lt;/td&gt;
&lt;td&gt;“second”&lt;/td&gt;
&lt;td&gt;“race-id”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;20000&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;“class-id”&lt;/td&gt;
&lt;td&gt; &lt;/td&gt;
&lt;td&gt;“racer-id”,“race-id”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;td&gt;“race-id”&lt;/td&gt;
&lt;td&gt;“second”&lt;/td&gt;
&lt;td&gt;“racer-id”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;td&gt;“racer-id”&lt;/td&gt;
&lt;td&gt;“race-id”&lt;/td&gt;
&lt;td&gt;“second”&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;h2&gt;Cost criteria&lt;/h2&gt;

&lt;p&gt;To compare performance, we focused on the cost of reads and data storage. We used the following scoring method: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unique Record Queries: &lt;/strong&gt;For queries that return a single record (e.g., query #5), the schema must uniquely identify the record. If the schema allows for this, the query scores 1. Otherwise, the score reflects the actual number of records returned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Projected Attributes: &lt;/strong&gt;For queries using indexes, we account for any missing attributes that require additional requests to the main table. Each extra request adds to the cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage: &lt;/strong&gt;Storage costs include both the main table and index storage. We assign one unit of cost for each record stored in the main table or index.&lt;/li&gt;
&lt;/ul&gt;
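
&lt;p&gt;The scoring method can be sketched in a few lines of Python (the function names are ours, for illustration only, not part of any tool):&lt;/p&gt;

```python
def query_cost(records_returned, unique_lookup=False, extra_fetches=0):
    """Cost of one read query under the scoring rules above.

    A query whose keys uniquely identify its record costs 1; otherwise
    it costs the number of records returned. Attributes missing from an
    index cost one extra request each against the main table.
    """
    base = 1 if unique_lookup else records_returned
    return base + extra_fetches

def storage_cost(record_counts):
    """One unit per record stored in the main table or any index."""
    return sum(record_counts)
```

For example, a unique lookup whose index is missing one projected attribute costs 2: one read plus one follow-up fetch from the main table.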

&lt;h2&gt;The AWS blogpost schema&lt;/h2&gt;

&lt;p&gt;Below are the cost estimates for the schema suggested in the AWS blogpost.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;colgroup&gt;
&lt;col&gt;
&lt;col&gt;
&lt;/colgroup&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Data attribute&lt;/th&gt;
&lt;th&gt;Schema&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;:race-id&lt;/td&gt;
&lt;td&gt;(“gsi1_:pk” “gsi1_:f” “main1_:f” “lsi1_:sk”)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;:class-id&lt;/td&gt;
&lt;td&gt;(“gsi1_:pk”)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;:racer-id&lt;/td&gt;
&lt;td&gt;(“gsi1_:f” “main1_:pk” “main1_:sk”)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;:racer-name&lt;/td&gt;
&lt;td&gt;(“main1_:f”)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;:second&lt;/td&gt;
&lt;td&gt;(“main1_:f” “gsi1_:sk”)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;:class-name&lt;/td&gt;
&lt;td&gt;(“gsi1_:f”)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Column prefixes indicate the table: &lt;code&gt;main1&lt;/code&gt; refers to the main table, &lt;code&gt;gsi1&lt;/code&gt; to the GSI1 index, and so on. The suffix denotes the column role: &lt;code&gt;pk&lt;/code&gt; (partition key), &lt;code&gt;sk&lt;/code&gt; (sort key) or &lt;code&gt;f&lt;/code&gt; (unindexed field).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;colgroup&gt;
&lt;col&gt;
&lt;col&gt;
&lt;/colgroup&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;query::race-id,second-&amp;gt;(“racer-id”)&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;query::racer-id,race-id-&amp;gt;(“second”)&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;query::class-id,second-&amp;gt;(“race-id”)&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;query::racer-id-&amp;gt;(“race-id” “second”)&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;query::class-id-&amp;gt;(“racer-id” “race-id”)&lt;/td&gt;
&lt;td&gt;20000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;storage&lt;/td&gt;
&lt;td&gt;4000000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;tbody&gt;&lt;tr&gt;
&lt;td&gt;total&lt;/td&gt;
&lt;td&gt;6044000&lt;/td&gt;
&lt;/tr&gt;&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Costs of individual queries are listed in the table above. The total accounts for the execution frequency of each query plus storage costs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;colgroup&gt;
&lt;col&gt;
&lt;col&gt;
&lt;/colgroup&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Table&lt;/th&gt;
&lt;th&gt;Records #&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GSI1&lt;/td&gt;
&lt;td&gt;2000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LSI1&lt;/td&gt;
&lt;td&gt;1000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MAIN&lt;/td&gt;
&lt;td&gt;1000000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
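
&lt;p&gt;As a sanity check, the total can be reproduced from the tables above: multiply each query's cost by its assumed frequency from the Setup section and add one storage unit per stored record:&lt;/p&gt;

```python
# Per-query cost (Costs table) and execution frequency (Setup section).
query_cost_and_freq = [
    (7,     1000),   # race-id, second: return racer-id
    (2,     1000),   # racer-id, race-id: return second
    (30,    1000),   # class-id, second: return race-id
    (5,     1000),   # racer-id: return race-id, second
    (20000,  100),   # class-id: return racer-id, race-id
]
# One storage unit per record in the main table, GSI1 and LSI1.
storage = 1_000_000 + 2_000_000 + 1_000_000
total = sum(cost * freq for cost, freq in query_cost_and_freq) + storage
print(total)  # 6044000
```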

&lt;h2&gt;Optimized schema&lt;/h2&gt;

&lt;p&gt;From the cost breakdown, it is clear that the most expensive query is &lt;code&gt;class-id-&amp;gt;(“racer-id” “race-id”)&lt;/code&gt;. This is due to the small number of unique class-id values (50) and the large number of records returned. Based on these insights, &lt;b&gt;NoSQL Architect&lt;/b&gt; restructured the indexes and reduced costs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;colgroup&gt;
&lt;col&gt;
&lt;col&gt;
&lt;/colgroup&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Data attribute&lt;/th&gt;
&lt;th&gt;Schema&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;:race-id&lt;/td&gt;
&lt;td&gt;(“gsi1_:f” “main1_:sk” “gsi2_:sk” “gsi2_:pk”)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;:class-id&lt;/td&gt;
&lt;td&gt;(“gsi1_:pk”)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;:racer-id&lt;/td&gt;
&lt;td&gt;(“gsi1_:f” “main1_:pk” “gsi2_:f”)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;:second&lt;/td&gt;
&lt;td&gt;(“gsi1_:sk” “gsi2_:f” “gsi2_:sk”)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;colgroup&gt;
&lt;col&gt;
&lt;col&gt;
&lt;/colgroup&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;query::race-id,second-&amp;gt;(“racer-id”)&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;query::racer-id,race-id-&amp;gt;(“second”)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;query::class-id,second-&amp;gt;(“race-id”)&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;query::racer-id-&amp;gt;(“race-id” “second”)&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;query::class-id-&amp;gt;(“racer-id” “race-id”)&lt;/td&gt;
&lt;td&gt;20000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;storage&lt;/td&gt;
&lt;td&gt;3000000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;tbody&gt;&lt;tr&gt;
&lt;td&gt;total&lt;/td&gt;
&lt;td&gt;5043000&lt;/td&gt;
&lt;/tr&gt;&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note that the optimized index structure also reduced storage from 4 million to 3 million records in total, a &lt;b&gt;25% reduction&lt;/b&gt; in storage costs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;colgroup&gt;
&lt;col&gt;
&lt;col&gt;
&lt;/colgroup&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Table&lt;/th&gt;
&lt;th&gt;Records #&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GSI1&lt;/td&gt;
&lt;td&gt;1000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GSI2&lt;/td&gt;
&lt;td&gt;1000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MAIN&lt;/td&gt;
&lt;td&gt;1000000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This optimized schema results in a &lt;b&gt;25% reduction in storage costs&lt;/b&gt; and an overall &lt;b&gt;16.5% reduction in total costs&lt;/b&gt; compared to the original schema!&lt;/p&gt;
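
&lt;p&gt;Both headline figures follow directly from the two totals, as a quick arithmetic check shows:&lt;/p&gt;

```python
expert_total = 6_044_000      # total cost of the AWS blogpost schema
optimized_total = 5_043_000   # total cost of the NoSQL Architect schema

storage_reduction = 1 - 3_000_000 / 4_000_000         # 0.25
total_reduction = 1 - optimized_total / expert_total  # roughly 0.166
```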

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;The optimized schema generated by &lt;b&gt;NoSQL Architect&lt;/b&gt; reduced costs by 16.5% compared to the schema created by an AWS expert. These savings were achieved by taking into account the unique characteristics of the data, such as query frequencies and result sizes.&lt;/p&gt;

&lt;p&gt;Beyond cost, &lt;b&gt;NoSQL Architect&lt;/b&gt; also offers significant time savings, generating the optimized schema in under a minute, whereas manual optimization and testing could take weeks or even months to achieve similar results.&lt;/p&gt;



</description>
      <category>dynamodb</category>
      <category>productivity</category>
      <category>aws</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
