Databases are the heart of every system. They also tend to be the source of many problems.
In this post I would like to summarise best practices for implementing databases and caching for cloud native apps.
Many developers and many architects start with RDB as a default data store. RDB market is the most mature one, there are countless database frameworks available, many developers know SQL really well.
You can use RDB as key-value store, full-text search engine, JSON documents store, or time-series store.
All above is true. But does one size really fit all?
As a refresher here are some database types you should remember about when designing/implementing your cloud native app. They are highly specialised databases designed to do their job really well.
- Key Value
- Full-text search
You can find an excellent summary of those in AWS database paper Enter the Purpose-Built Database Era. The paper contains description of each of the technology, examples, pros and cons. I highly recommend you read it!
If you run a cloud native app, perhaps even multi-tenant one, you are for sure using infrastructure-as-code paradigm. I touched on this already in my previous article in this series: Building cloud native apps: Codebase.
Database changes and database versioning should be no different. You must not apply changes to your databases in a manual way. Further every database change should be versioned, and this is especially important if you have, for example, 6 scrum teams working on the same product.
While database schema changes happen for all database types (including NoSQL ones) in case of RDB you must make explicit changes in both database schema and application code.
For relational database migrations and versioning, I'm using migrator. It's a super lightweight and super fast database migration tool written in Go. It runs as a docker container and exposes simple yet powerful GraphQL API. All migrations applied by migrator are grouped into versions for traceability, auditing, and compliance purposes.
migrator supports PostgreSQL, MySQL (and all its major flavours), and MS SQL. It supports multi-tenancy out of the box. It can read migrations from local disk, AWS S3, or Azure Blob Containers. It also out-performs other database migration frameworks like the leading Java tool FlywayDB. It can also sync already executed migrations from legacy frameworks.
I cannot stress enough how important caching is. Caching should be implemented at multiple levels. Just like CPU caches are organised into hierarchy of L1, L2, some even use L3 and L4 caches. You should investigate implementing caching strategy in your distributed applications. For a reference this is how I do it in my SaaS product:
- Hibernate L1 cache - session-level cache, enabled by default;
- Hibernate L2 cache - session factory-level cache, application-wide cache for data entities, we use Infinispan as a L2 cache implementation, other options include JBoss Cache2;
- Application cache - application-wide cache for business objects, implemented using Guava Cache;
- Distributed cache - implemented using a cluster of Redis nodes;
- Web content cache - not related to databases, but mentioning for completeness, all the static content like web pages, JS, CSS should be cached (and served by CDN).
Caching can drastically improve performance of your application. Depending on the scale of your operations this can be hundreds of millions of operations a day!
Important thing to remember is that different types of data may require different expiration time. It's best to write down all the requirements beforehand. You don't want to have stale data in your caches, on the other hand too short expiration times will lead to performance degradation.
Connection pooling is a must for session-based database like for example RDBs. Connection pooling prepares, opens, and maintains ready-to-use database connections. When application needs a database connection it fetches one from the pool and returns it when no longer needs it. Connection pooling helps to better manage the database resources, greatly removing database connection fluctuations.
For those running serverless applications (for example running thousands of functions) and using relational databases as backends the fluctuations of database connections can cause a lot of unnecessary processing on the DB side. If you are running on AWS I highly recommend you to use AWS RDS Proxy in front-of your AWS RDS database. Have your Lambdas talk to the proxy and the proxy will offload all the work required to pool and maintain database connections for your serverless application.
I assume you do the basic stuff already: CPU, memory, disk space, read IOPS, write IOPS.
If you see high CPU usage or free memory dropping to zero it's information that something is wrong. But these metrics won't tell you what.
You need to have a more in-depth monitoring.
We used to have a pretty detailed monitoring of our data access layer at the application level. However, it was producing inaccurate data because of Hibernate L1 and L2 caches. At the application level we were recording an event of a DB operation where in fact that operation was reading data from L2 cache.
So what we learnt is that the most accurate data regarding the database comes from the database itself.
I also went one step further and asked my DevOps team to redesign VPC so that every type of the component got its own subnet. Why? With these fine-grained subnets we could generate network traffic stats from the VPC Flow Logs in a nice visual way. We had a deployment diagram on top of which a simple Python script was adding text labels with the information about network traffic exchanged by all system components.
I would like to finish this post with a true story. I'm running thousands of machines on a weekly basis. All those machines needed to fetch some metadata from a database. There were different metadata for different jobs. The metadata changed only when a new version was released (they were in fact static). Fetching metadata from database for every job resulted in hundreds of millions of unnecessary database calls. This is what we did:
- We started with the obvious one: implemented an application cache for metadata, it helped a lot, still didn't solve the whole problem, because we still launched a few thousands of machines and the application cache for a newly launched machine was always empty;
- Second idea was to offload those operations to our Redis cluster, it helped too, our database was happy, but Redis wasn't happy that much;
- Third idea was to package the metadata along the application code, it was the best of all the options, we didn't have to make any remote calls to either database nor Redis, instead we read the metadata (from Java resources) into the in-memory cache;
The takeaway here is that not everything has to be put in a database. Very often the simplest solutions are the best ones, and I'm a huge fan of the KISS principle.