How do I choose a database on AWS? How do I choose between NoSQL (Amazon DynamoDB) vs SQL (Amazon RDS)? When building a serverless microservice on AWS, I will walk you through how I narrow down what the best database on AWS is for my microservice.
Database options on AWS
When it comes to storing data in AWS, at the time of writing, there are over 15 different options. The options for our microservice’s persistent store are the following:
- Amazon Aurora
- Amazon RDS
- Aurora
- PostgreSQL
- MySQL
- MariaDB
- Oracle
- SQLServer
- Amazon Redshift
- Amazon DynamoDB
- DocumentDB
- Amazon Keyspaces
Not included here are the services for graph data and memory caching as it is not applicable for our use case.
Why choosing the right database is important
When choosing a database, it is important to make an informed decision to avoid issues down the road caused by cost, performance, or even worse, having to migrating to another database to support our business model. There are a few questions we need to ask ourselves when choosing a database.
Do we need a NoSQL or SQL option?
Here is the JSON object we need to be able to support. If you want to understand how we arrived at this JSON object, read How to design a RESTful API on AWS.
{
"status": "ACTIVE",
"public": true,
"id": 1234556,
"title": "Weird Animal Facts",
"description": "Tune in to learn about how horses make the neighing sound",
"dateTime": "2021-06-30T08:00:00+01:00",
"duration": {
"length":"1",
"unit":"HOUR",
},
"locations": [
{
"type": "ONLINE",
"url": "https://ivonneroberts.com/meetingURL"
},
{
"type": "VENUE",
"address": {
"streetAddress1": "1 5th Avenue",
"streetAddress2": null,
"city": "New York",
"state": "NY",
"zip": 10001
}
}
]
}
A normalized form of this data could be represented by an Event
table, a Schedule
table and a Location
table. Let’s analyze these entities to be able to determine which type of database on AWS we could use, NoSQL or SQL.
The Event
table could have a 1:many relationship with Schedule
. You can imagine a scenario where an event could have a repeating schedule. That Schedule
table could have multiple entries for a given event. To decide NoSQL versus SQL, let’s look at the update query. How many event records would need to be updated, if an instance needs to be changed? We would need to update one record. This makes it a candidate for NoSQL database.
The Event
table could also have a 1:many relationship with Location
. An event could be both online, as well as, at a venue. Asking the same NoSQL versus SQL, let’s look at the update query. If a location needs to be updated how many event records would need to be updated? The answer is n events that use this same venue. You would have to query the table to find how many events have that venue and then update each one. This is not a candidate for NoSQL. However, the attributes of a location, have a lot of details or nuances that have nothing to do with an event. In fact, I would argue Location
should be in its own microservice. The Event REST API response should instead only include a location id, and that information stored as part of another microservice.
With those questions answered, the remaining data in our microservice now could be stored in a NoSQL database.
How the data will be queried
For these first two urls, a NoSQL implementation (specifically Amazon DynamoDB) would be a scan on the full table. As the data grows, scans are not cost efficient, nor performant. We can mitigate this problem by introducing a range/sort key in addition to our primary/partition key, for example category. As a client, I am subscribed to the YouTube category. These APIs could use your authentication to determine your eligible categories, query the DB for them and then further filter either by date or other attributes of the event.
https://api.mydomain.com/events/v1/events/
https://api.mydomain.com/events/v1/events/search?date[gt]=2021-07-01&status=CANCELLED
This next API is straight forward. We can query our NoSQL database by the event id (partition key in DyanmoDB).
https://api.mydomain.com/events/v1/events/{eventId}/
Costs associated with databases on AWS
The next question we need to ask is what are the cost consideration. When choosing a database there is both the cost of the actual database/license, as well as, the people cost to maintain and fine tune that database. For some of the databases on AWS you pay for what you use/store, most of the heavy lifting is abstracted from you. Other databases require EC2 instances, additional licenses in addition to the service. Also, depending on the database, you might need a database administrator to update, manage and fine tune that database. This can be cost prohibitive for a startup.
Depending on where you are in your journey it is acceptable to start with the minimum requirements, and then at a later date take the downtime and migrate to another database that better meets your needs.
Finalized JSON structure
Now that we have reviewed the considerations for choosing our database, and we have finalized on a NoSQL option. Let’s update our JSON payload to reflect the changes we called out. First we have added a category attribute, this will be our range/sort key. Second, we have added a schedule attribute that is a list of all our instances of this event. Lastly we have updated the location object to now simply refer to an ID that references a resource in a separate location microservice.
{
"status": "ACTIVE",
"id": 1234556,
"category": "DEVWIDGETS_YOUTUBE",
"title": "Weird Animal Facts",
"description": "Tune in to learn about how horses make the neighing sound",
"schedules": [{
"dateTime": "2021-06-30T08:00:00+01:00",
"duration": {
"length":"1",
"unit":"HOUR",
}
}],
"locations": [
{
"id": 456
},
{
"id": 789
}
]
}
Conclusion
So after much deliberation we have finalized our JSON payload and have decided to use NoSQL. On AWS our options are Amazon DynamoDB, Amazon DocumentDB, and Amazon Keyspaces. For this microservice I am going to use Amazon DynamoDB as it is a fully managed service. It is a fairly robust database that abstracts a lot of the heavy lifting associated with managing a database. Most importantly it is serverless which means I don’t need to concern myself with EC2 pricing, patching and troubleshooting. And, to name a few more key features:
- Fully managed
- multi-region
- multi-active
- built-in security
- backup and restore
- in-memory caching
- configurable autoscaling
We now have walked through how to choose a database on AWS Cloud. Next week, I will continue to build on this design and implement some of these APIs on AWS. Check back weekly for new content. Feel free to comment below any questions you may have or reach out to me on twitter at @ivlo11
Latest comments (0)