In this article we'll cover:
- 3️⃣ 3 steps to grasp DynamoDB working as a Graph database
- 📺 Work with examples and illustrative data
- 💪 Increasing visibility of the serverless stack
DynamoDB is the NoSQL, managed database by AWS. It is designed more as a key-value store. But it can be hacked to work as a graph database.
Which is pretty cool 😉
How so?
There is a data access pattern called Adjacency List. It's a way to describe nodes and edges. We can use it in DynamoDB to leverage graph relationships.
Building the concept in three steps:
Step 1: Basic Keys
Each entry in DynamoDB has two basic keys: primary
and sort
.
The first (primary-key
) allows efficient access to the entries in the DB. Think of it as the entries' IDs
.
The second (sort-key
) allows... well, for sorting results, obviously.
As an example, we could model e-commerce purchases as:
-
primary-key
: order-ID -
sort-key
: timestamp
Primary-Key | Sort-Key | Products |
---|---|---|
order-1 | 1570147200 | ['Tesla Roadster', 'Sunglasses'] |
order-2 | 1569976212 | ['Chocolate', 'Gummies'] |
. . . | . . . | . . . |
This allows us to query by Order ID, as well as sort orders by their timestamp.
Step 2: Many-to-Many Relationships
One could say a graph (node-edge model) is a form of a many-to-many relationship. As such, it can be represented by an adjacency list. The primary-key
represents the top-level item, while the sort-key
is used for associations. Consider:
- A blog has two posts: "Post-1" and "Post-2"
- There are three possible tags: "cool", "awesome", "neat"
Primary Key | Sort Key | Title | Text |
---|---|---|---|
Post-1 | Post-1 | Hello World! | I'm a Post inside Dynamo! |
Post-2 | Post-2 | Foo Bar | Can't wait for the graph |
Tag-1 | Tag-1 | cool | null |
Tag-1 | Tag-1 | awesome | null |
Tag-1 | Tag-1 | neat | null |
Access Patterns:
Retrieve Post-1:
primary-key == 'Post-1' && sort-key == 'Post-1'
Retrieve Tag-2:
primary-key == 'Tag-2' && sort-key == 'Tag-2'
Now, say Post-1 has one tag: cool.
Primary Key | Sort Key | Primary Title | Sort Title |
---|---|---|---|
Post-1 | Tag-1 | Hello World! | cool |
Tag-1 | Post-1 | cool | Hello World! |
The connection is represented twice, which is important from an access pattern standpoint, as you can see below:
Access patterns:
Get tags assigned to Post-1:
primary-key == 'Post-1' & sort-key BEGINS_WITH 'Tag'
Get posts tagged with Tag-1:
primary-key == 'Tag-1' & sort-key BEGINS_WITH 'Post'
The BEGINS_WITH
operator allows matching multiple Tags associated with a Post, or vice versa.
By having two entries in the DB for each connection, we can reach the other node regardless of our starting point. If we have a tag
, we can find all posts
associated. If we have a post
, we can find all tags
assigned.
Step 3: The Graph
We'll take the tripestore model for this example. It consists of subject-predicate-object, such as:
Subject | Predicate | Object |
---|---|---|
John | Likes | Music |
Ross | Friends With | Chandler |
Now, how do we model this data structure in DynamoDB? We leverage the adjacency pattern!
Say we have a Friendly social media site. Here's how we could model it in DynamoDB.
3.1 Nodes
We start by describing the nodes
. In our example, we only have users
. These will be the subjects
and objects
of the triplestore model. A name
property is also added to each user:
Primary Key | Sort Key | Name |
---|---|---|
User-1 | User-1 | Ross |
User-2 | User-2 | Rachel |
User-3 | User-3 | Monica |
3.2 Predicates
Primary Key | Sort Key |
---|---|
Pred-Sibling | Pred-Sibling |
Pred-Friend | Pred-Friend |
Pred-Married | Pred-Married |
In the case of predicates, there are no properties associated.
3.3 Connections
Finally, we model the relationships between nodes
using predicates
to connect them. We also add a property indicating when the relationship started:
Primary Key | Sort Key | Relationship Start |
---|---|---|
User-1 | Pred-Sibling-User-3 | 1994-09-24 |
User-3 | Pred-Sibling-User-1 | 1994-09-24 |
User-2 | Pred-Friend-User-3 | 1994-09-24 |
User-3 | Pred-Friend-User-2 | 1994-09-24 |
User-1 | Pred-Married-User-2 | 1999-05-20 |
User-2 | Pred-Married-User-1 | 1999-05-20 |
Again, we have two entries for each connection, also for access patterns reasons, as explained above in the post <> tag
example.
Here's how we read the node
connections:
- Ross and Monica are siblings
- Monica and Rachel are friends
- Rachel and Ross are married
Access patterns to retrieve our data:
Who is married to Ross?
primary-key == 'User-1' & sort-key BEGINS_WITH 'Pred-Married-User'
Who are Monica's siblings?
primary-key == 'User-3' & sort-key BEGINS_WITH 'Pred-Sibling-User'
Who are Rachels's friends?
primary-key == 'User-2' & sort-key BEGINS_WITH 'Pred-Friend-User'
Going deeper
The example above is very simplified, but it illustrates the idea.
It's possible to add other types of nodes to the graph. For instance:
Primary Key | Sort Key | Title |
---|---|---|
Location-1 | Location-1 | Cafeteria |
One advantage is that links can have their own properties as well. For instance: in the Ross-Married-Rachel
link, we could have properties such as where the couple met:
Primary Key | Sort Key | Meeting Place |
---|---|---|
User-1 | Pred-Married-User-2 | Location-1 |
Or we could model the "Meeting Place" as a node in the graph:
Primary Key | Sort Key |
---|---|
Pred-MeetingPlace | Pred-MeetingPlace |
User-1 | Pred-MeetingPlace-Location-1-User-2 |
User-1 | Pred-MeetingPlace-User-2-Location-1 |
Access patterns:
Get everyone Ross met at the Cafeteria:
primary-key == 'User-1' & sort-key BEGINS_WITH 'Pred-MeetingPlace-Location-1'
Where did Ross meet Rachel?
primary-key == 'User-1' & sort-key BEGINS_WITH 'Pred-MeetingPlace-User-2'
This is actually kind of a hack of the triplestore model. The predicate Pred-MeetingPlace-Location-1
is actually a combination of a real predicate Pred-MeetingPlace
and a node Location-1
. This allows for flexible queries. Think of it like the node attached modifying, or characterizing the predicate.
The full table combined is available down at the end of the article.
Heads up
It is important to think about your access patterns before modeling your data. It may be difficult to add support for different access patterns down the road.
Check the DynamoDB documentation to learn more.
Why to Graph On DynamoDB?
DynamoDB is easy to get started with and keeps infrastructure hurdles to a minimum. Especially for startups and small teams that can't afford a DevOps team to maintain healthy DB servers/instances, it can be a great ally.
In summary, Dynamo will give us:
- Fully managed & serverless: minimal infrastructure overhead
- Hyper scalability: in both IO and storage, Dynamo can scale to tens of thousands of concurrent requests and exabytes of data
- Integration with Lambda
- Microsecond latency
- Built-in global replication
Will I lose control of my DB by using serverless?
That's a common concern about serverless.
We really need to ask ourselves: what and why do I want to control?
Usually, it boils down to monitoring, making sure everything is running smoothly. Serverless indeed requires a different approach to observability.
There are professional services, that can take care of that. Dashbird, especially, was created from the ground up with the goal to provide full visibility over entire serverless stacks, so you might want to check them out.
Full Dynamo Tables
Post + Tag Example
Primary Key | Sort Key | Primary Title | Sort Title | Text |
---|---|---|---|---|
Post-1 | Post-1 | Hello World | null |
I'm a Post in Dynamo! |
Post-2 | Post-2 | Foo Bar | null |
Graph is cool! |
Tag-1 | Tag-1 | cool | null |
null |
Tag-1 | Tag-1 | awesome | null |
null |
Tag-1 | Tag-1 | neat | null |
null |
Post-1 | Tag-1 | Hello World | cool | null |
Tag-1 | Post-1 | cool | Hello World | null |
Friends Social Media
Primary Key | Sort Key | Title | Rel. Start |
---|---|---|---|
User-1 | User-1 | Ross | null |
User-2 | User-2 | Rachel | null |
User-3 | User-3 | Monica | null |
Location-1 | Location-1 | Cafeteria | null |
Pred-Sibling | Pred-Sibling | null |
null |
Pred-Friend | Pred-Friend | null |
null |
Pred-Married | Pred-Married | null |
null |
Pred-MeetingPlace | Pred-MeetingPlace | null |
null |
User-1 | Pred-Sibling-User-3 | null |
1994-09-24 |
User-3 | Pred-Sibling-User-1 | null |
1994-09-24 |
User-2 | Pred-Friend-User-3 | null |
1994-09-24 |
User-3 | Pred-Friend-User-2 | null |
1994-09-24 |
User-1 | Pred-Married-User-2 | null |
1999-05-20 |
User-2 | Pred-Married-User-1 | null |
1999-05-20 |
User-1 | Pred-MeetingPlace-Location-1-User-2 | null |
null |
User-1 | Pred-MeetingPlace-User-2-Location-1 | null |
null |
User-2 | Pred-MeetingPlace-Location-1-User-1 | null |
null |
User-2 | Pred-MeetingPlace-User-1-Location-1 | null |
null |
Photo Credits:
- Cover image: by Victoria Heath on Unsplash
Full Disclosure
I work as a Developer Advocate at Dashbird.
Additional resources:
- There is a great talk from re:Invent 2018 covering design patterns in DynamoDB, I definitely recommend watching.
- AWS has documentation with info on the adjacency list implementation.
Top comments (11)
Is there really a good reason for all the duplication? It just looks like a manual index to me.
Why not just a GSI with the sort key as primary key reversed, and, depending on your needs, a few projected attributes, like the names? This also looks like the approach Amazon are using in their documentation for sort-key design for adjacency lists.
So, for getting all tags for Post-1, query the main table for "PK = Post-1 AND begins_with(SK,Tag-)".
For getting all posts for Tag-1, query the GSI for "SK = Tag-1 AND begins_with(PK, Post-)". If you want the name in the same query, and you're not tagging anything but posts (or mostly want tags for everything), you can even avoid duplicating the title column by not using begins_with (okay, probably not, you'll use them in the Post->Tag queries).
Wouldn't that both be faster and less error-prone?
Hi Dennis, that's a good question!
GSI is a viable solution, yes. It would simplify the implementation, but also comes with its shortcomings.
GSIs can't offer consistency. There will always be a delay between changes to the table and reflections in the secondary index. For some use cases, this is unbearable. With the approach I suggested, it's possible to wrap many item operations into one transaction, what gives you full control and consistency.
Another issue with GSI is that the developer now has two sources for the same dataset and needs to discover on which to read data from.
Consider the Friendly example. If you want to find all Ross' siblings, do you need to query the table, the GSI or both? If your data structure is really really simple, this might not be a problem. Otherwise, it will complicate your life when implementing new features that need to read data. And can easily end up in a mess. You could start missing data that exists by querying the wrong source, and showing incomplete information to your users. And you may never find out...
In order to minimize errors when writing data using the approach I outlined, you could have a single internal service for entering nodes and edges in the DB. You take care of this single entry point and make sure it's handing all relationships correctly. Then all other parts of your application will only invoke this single service to write in the DB.
Does that make sense?
Hmmm, not sure I'd use DDB at all if eventual consistency was a problem, as I he transaction operations throws a lot of performance away (and are somewhat limited), but fair enough.
As we're getting to the connections between equals, like the siblings example, you'd need somewhat arbitrary rules (e.g. eldest has the relation), or do the copying anyway, and then we'd both have the extra redundancy AND the GSI to worry about, so the triplestore model makes a lot of sense there.
I'm not sure that I'm convinced for data with a clearer parent-child relationship, I'll have to let it simmer for a while. :-)
By the way, you're storing the predicates in the DB, but doesn't really use them for anything. I guess that's just for discovery (e.g. when using it as storage for clients that doesn't necessarily know what relation types that exist), but are there other uses for those in a less generalised database? They seem a little bit useless to me unless you add an attribute to make a sparse index from (otherwise you'd have to do a full table scan to find them), but I've probably missed something.
Thanks for the clarification!
Thank you for the great article :)
I think there may be typos in the Post/Tags example table, which may confuse readers. The repetition of PK:
Tag-1
, SK:Tag-1
for each of the tag valuescool
,awesome
,neat
is invalid. Only one of them can be represented in the composite partition key.The rest of the text referring to the example references
Tag-2
.This is a great concept but the implementation within the article is frustratingly non-intuitive. It’s amazing we’re in the year 2020 (13 years since the dynamo paper was written) and there’s only 1 Serverless Graph Database, Microsoft Azure CosmosDB Autopilot ... which sucks because we’re using AWS. There are so many graph databases out there yet only one serverless option
Hi, are you familiar with Cloud Directory? It's not precisely a graph, but follows essential concepts of a graph database. And fully serverless! 😉
Great examples, thanks!
Glad it was helpful! I struggled with examples in the AWS official docs, wanted to share something easier to grasp.
Yes, you have to read them a few times before it actually sets in. I especially like the graph example, don't think I have seen that before.
in the many to many example, these all have the same key..likely typo but
Tag-1 Tag-1 cool null
Tag-1 Tag-1 awesome null
Tag-1 Tag-1 neat null
is a lot different from this
Tag-1 Tag-1 cool null
Tag-2 Tag-2 awesome null
Tag-3 Tag-3 neat null
Great example, you save my day. Thanks a lot