As Werner Vogels says: "Everything fails all the time."
Data is the new oil. We rely on it not only to make decisions but to operate as a business in general. Data loss can lead to significant financial consequences and damaged reputation. In this article, you can find ten actionable methods to protect your most valuable resources.
This goes without saying, and we all know it. We need a backup strategy and an automated way of taking regular snapshots of our databases. However, with today's large amounts of data, implementing a reliable backup plan that can quickly recover your databases becomes challenging. Therefore, it is crucial to define your Recovery Time Objective and Recovery Point Objective and to implement a solution that can satisfy your Business Continuity Plan.
Recovery Point Objective (RPO) describes how much data loss, measured in time, we can tolerate. An RPO of 10 hours would entail that your business can afford no more than 10 hours of data loss according to your Business Continuity Plan. You could think of RPO in terms of the "staleness" of your backup. With RPO=10, we allow our data to be 10 hours stale after restoration, i.e., not containing changes made within the last 10 hours before the failure.
In contrast, Recovery Time Objective (RTO) describes how quickly the database must be up again. An RTO of 3 hours would mean that, regardless of the backup freshness, the database must be up and running within 3 hours after the failure occurred.
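To make the two objectives concrete, here is a minimal sketch that checks a single incident against both of them. The function name, its parameters, and the timestamps are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

def meets_objectives(last_backup, failure, restored, rpo_hours, rto_hours):
    """Return (rpo_ok, rto_ok) for a single incident."""
    data_loss = failure - last_backup   # changes made in this window are lost
    downtime = restored - failure       # time the database was unavailable
    return (data_loss <= timedelta(hours=rpo_hours),
            downtime <= timedelta(hours=rto_hours))

result = meets_objectives(
    last_backup=datetime(2024, 1, 10, 4, 0),   # 8 hours before the failure
    failure=datetime(2024, 1, 10, 12, 0),
    restored=datetime(2024, 1, 10, 16, 0),     # 4 hours after the failure
    rpo_hours=10,
    rto_hours=3,
)
print(result)  # (True, False)
```

In this made-up incident the backup was fresh enough (8 hours of lost changes against an RPO of 10), but the restore took 4 hours and therefore missed the RTO of 3.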
Probably the worst-case scenario is that you developed a backup strategy, and you are regularly taking snapshots, but when the failure happens you notice that those backups aren't working as intended or that you can't find them. It's critical to test the recovery scenario.
Netflix pioneered "Chaos Engineering", a discipline of testing failure scenarios on production systems to be sure that your infrastructure is truly resilient.
Don't count on backups and recovery plans that have never been tested. Otherwise, you risk ending up in the "cross your fingers and hope for the best" strategy.
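A restore test can be as simple as copying the backup into a scratch database and checking that it opens and contains what you expect. The sketch below uses SQLite from the Python standard library as a stand-in for a real production database. The `orders` table and the `backup_and_verify` helper are hypothetical, and a real test would restore from the stored snapshot rather than from the live connection.

```python
import sqlite3

def backup_and_verify(source: sqlite3.Connection, table: str) -> bool:
    """Copy the database, then check that the copy is intact and complete."""
    restored = sqlite3.connect(":memory:")
    source.backup(restored)  # SQLite's online backup API
    # The table name is trusted input here; never build SQL from user input.
    query = f"SELECT COUNT(*) FROM {table}"
    expected = source.execute(query).fetchone()[0]
    actual = restored.execute(query).fetchone()[0]
    integrity = restored.execute("PRAGMA integrity_check").fetchone()[0]
    restored.close()
    return integrity == "ok" and expected == actual

# A tiny in-memory database standing in for production.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
prod.executemany("INSERT INTO orders (total) VALUES (?)",
                 [(9.5,), (20.0,), (3.25,)])
prod.commit()

restore_ok = backup_and_verify(prod, "orders")
```

Running a check like this on a schedule, and alerting when it fails, turns "we have backups" into "we have backups that restore".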
Note that if you rely on backups taken by some fully managed service where you don't have actual access to the snapshot, you risk that restoring your database may take longer than your RTO allows. It's possible that due to time-zone differences and a large volume of data that may need to be transferred over a long distance, the recovery may take longer than you expect. Therefore, it might help to take regular snapshots yourself rather than relying solely on backups from a specific provider.
If your database goes down, which processes are affected? It's valuable to have this information documented somewhere so that you can mitigate the impact of a failure and recover quickly by restarting the corresponding processes.
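Such documentation can be kept in a machine-readable form so that the list of affected processes is one lookup away. The following is a minimal sketch. The dependency map and every resource name in it are hypothetical.

```python
from collections import deque

# Hypothetical dependency map: each process lists the resources it depends on.
DEPENDS_ON = {
    "orders-api": ["prod-orders-db"],
    "billing-job": ["prod-orders-db", "prod-billing-db"],
    "sales-dashboard": ["orders-api"],
    "newsletter": [],
}

def affected_by(resource, depends_on):
    """Return every process that depends on `resource`, directly or indirectly."""
    affected, queue = set(), deque([resource])
    while queue:
        current = queue.popleft()
        for process, deps in depends_on.items():
            if current in deps and process not in affected:
                affected.add(process)
                queue.append(process)
    return sorted(affected)

impacted = affected_by("prod-orders-db", DEPENDS_ON)
print(impacted)  # ['billing-job', 'orders-api', 'sales-dashboard']
```

The dashboard shows up even though it never talks to the database directly, which is exactly the kind of indirect dependency that is easy to forget during an incident.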
We all want to trust people, but allowing too much access to developers without educating them on how to use those production resources may backfire. Only a few trusted people (likely DevOps or senior engineers) should have direct access to modify or terminate production resources. When building any IT solutions, it's best to work on a development database and have read-only permissions to production resources.
On top of that, it's advisable to check those permissions regularly. If you haven't done so in a while, take this as your sign. Perhaps somebody who has left the company still has access to production resources?
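A periodic permission review can be partly automated. The sketch below compares a list of grants against the current staff list and a list of approved writers. All names and the data layout are hypothetical. In practice, the grants would be exported from your IAM system or database.

```python
# Hypothetical data: in practice, export this from IAM or from database grants.
GRANTS = [
    {"user": "alice", "resource": "prod-orders-db", "access": "write"},
    {"user": "bob", "resource": "prod-orders-db", "access": "read"},
    {"user": "carol", "resource": "prod-orders-db", "access": "write"},
    {"user": "dave", "resource": "prod-orders-db", "access": "write"},
]
ACTIVE_EMPLOYEES = {"alice", "bob", "dave"}
APPROVED_WRITERS = {"alice"}

def audit(grants, active, approved_writers):
    """Flag grants held by former employees or by unapproved writers."""
    findings = []
    for grant in grants:
        if grant["user"] not in active:
            findings.append((grant["user"], "no longer employed"))
        elif grant["access"] != "read" and grant["user"] not in approved_writers:
            findings.append((grant["user"], "unapproved write access"))
    return findings

findings = audit(GRANTS, ACTIVE_EMPLOYEES, APPROVED_WRITERS)
print(findings)
# [('carol', 'no longer employed'), ('dave', 'unapproved write access')]
```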
What if your production database is not named as a "prod" resource and somebody confuses it for something else? It's best practice to name production resources so that anyone looking at them immediately knows that they must be treated with great care.
It may seem obvious to you, but without proper communication and user education, somebody could confuse a poorly named production database for some temporary resource (for instance, a playground cluster) that can be shut down.
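A naming convention is easier to enforce when a script can check it. The sketch below assumes a hypothetical convention of `<environment>-<name>` with the environments `prod`, `staging`, and `dev`. Anything that doesn't match is treated as unknown and therefore not safe to delete.

```python
import re

# Hypothetical convention: "<environment>-<name>", lowercase, hyphen-separated.
NAME_PATTERN = re.compile(r"^(prod|staging|dev)-[a-z0-9]+(-[a-z0-9]+)*$")

def environment_of(name):
    """Return the environment encoded in a resource name, or None if unclear."""
    match = NAME_PATTERN.match(name)
    return match.group(1) if match else None

def safe_to_delete(name):
    """Only resources that are clearly marked as dev may be removed freely."""
    return environment_of(name) == "dev"

for name in ["prod-orders-db", "dev-playground-cluster", "cluster-2"]:
    print(name, environment_of(name), safe_to_delete(name))
```

Treating unrecognized names as unsafe is a deliberate choice: a poorly named resource should trigger a question, not a deletion.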
If your resources are configured manually, it becomes more difficult to reproduce the configuration in a failure scenario. Modern DevOps and GitOps culture introduced a highly useful paradigm of Infrastructure as Code, which can significantly help to build an exact copy of a specific resource for development or recovery scenarios.
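As an illustration, the sketch below builds a small AWS CloudFormation template for a DynamoDB table in Python and prints it as JSON. The table name and key are hypothetical, and a real setup would more likely keep the template in a version-controlled file or use a tool such as Terraform or the AWS CDK.

```python
import json

# Hypothetical table definition; names and keys are for illustration only.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "OrdersTable": {
            "Type": "AWS::DynamoDB::Table",
            # Keep the table and its data even if the stack is deleted.
            "DeletionPolicy": "Retain",
            "UpdateReplacePolicy": "Retain",
            "Properties": {
                "TableName": "prod-orders",
                "BillingMode": "PAY_PER_REQUEST",
                "AttributeDefinitions": [
                    {"AttributeName": "order_id", "AttributeType": "S"}
                ],
                "KeySchema": [
                    {"AttributeName": "order_id", "KeyType": "HASH"}
                ],
                # Continuous backups with point-in-time recovery.
                "PointInTimeRecoverySpecification": {
                    "PointInTimeRecoveryEnabled": True
                },
            },
        }
    },
}

template_json = json.dumps(template, indent=2)
print(template_json)
```

Because the definition lives in code, the same table can be recreated in a development account or after a failure without anyone having to remember which options were clicked in the console.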
It can be challenging to recover any specific system if the only person who knows how to configure and use it is not available when the failure happens. Knowledge silos are particularly dangerous in such cases. It's beneficial to have at least one additional person who can take over this responsibility. If those people work in different time zones, somebody is more likely to be available when a failure occurs, which helps to fix production downtime faster and, therefore, to meet your RTO.
This point is related to preventing knowledge silos but more directed towards educating developers. Anytime we give somebody more than just read-only access to production resources, we should educate them on using this resource properly and what impact a potential downtime of a single table may have. As always, effective communication is our best friend.
Using data stores such as AWS RDS is great, but it has a downside that, in the end, we are still responsible for ensuring that our database remains healthy. When using serverless data stores such as DynamoDB, we can rely on AWS DevOps experts to monitor and keep the underlying servers healthy.
If you leverage an observability platform, such as Dashbird, you can quickly identify misconfigured resources or failures within your serverless infrastructure. Dashbird has recently released a feature called Well-Architected Insights that continuously scans your resources for anomalies. For instance, it will alert you about any DynamoDB table that doesn't have a continuous backup and Point-In-Time-Recovery enabled. This is one of the easiest ways of ensuring that your data store remains healthy and resilient because:
- AWS takes care of serverless compute and storage behind the service, ensuring High Availability and Fault Tolerance,
- Dashbird will alert you if your architecture deviates from standards defined within the Well-Architected Framework, such as when your resources are not properly configured or lack backup.
In the image below, you can see that Dashbird automatically detected that backup is not enabled:
Well-Architected Insights ensures that your DynamoDB tables have a continuous backup enabled for a quick point-in-time recovery --- Image: courtesy of Dashbird
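If you want to run a similar check yourself, the DynamoDB API exposes the backup status of each table. The sketch below is not how Dashbird implements its check. It assumes a boto3 DynamoDB client (`boto3.client("dynamodb")`) and its `describe_continuous_backups` call. The stub class and table names are hypothetical and exist only so that the example runs without AWS credentials.

```python
def tables_without_pitr(dynamodb_client, table_names):
    """Return the tables whose point-in-time recovery is not enabled."""
    missing = []
    for name in table_names:
        response = dynamodb_client.describe_continuous_backups(TableName=name)
        description = response["ContinuousBackupsDescription"]
        pitr = description.get("PointInTimeRecoveryDescription", {})
        if pitr.get("PointInTimeRecoveryStatus") != "ENABLED":
            missing.append(name)
    return missing

class StubDynamoDB:
    """Stand-in for boto3.client("dynamodb") so the sketch runs offline."""
    def __init__(self, status_by_table):
        self._status = status_by_table

    def describe_continuous_backups(self, TableName):
        return {
            "ContinuousBackupsDescription": {
                "ContinuousBackupsStatus": "ENABLED",
                "PointInTimeRecoveryDescription": {
                    "PointInTimeRecoveryStatus": self._status[TableName]
                },
            }
        }

client = StubDynamoDB({"prod-orders": "ENABLED", "prod-users": "DISABLED"})
unprotected = tables_without_pitr(client, ["prod-orders", "prod-users"])
print(unprotected)  # ['prod-users']
```

With real credentials, passing `boto3.client("dynamodb")` instead of the stub should work the same way, assuming the caller has permission to describe continuous backups.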
In addition to recovery information, you can discover many more insights about your serverless resources, as demonstrated in the image below. For instance, you will be informed any time your real-time data streams have write-throttles. In the end, you are presented with a score of how well your architecture adheres to the Well-Architected Framework.
Well architected lens --- Image: courtesy of Dashbird
And if the only reason that holds you back from using DynamoDB is that you still want to use SQL, you may have a look at PartiQL. This query language, developed by AWS, allows you to query your DynamoDB tables (and many other data stores) directly from the AWS Management Console, as demonstrated in the image below.
This point is related to analytical databases. In analytical data stores, it's good practice to keep compute and storage independent of each other. Imagine that your data is durably stored in object storage such as S3, and you can query it with a serverless engine such as AWS Athena or Presto. The separation of how your data is stored and how it's queried makes it easier to ensure the resilience of your analytical infrastructure.
You can establish automatic replication between S3 buckets, enable versioning (which allows you to restore deleted or overwritten objects), or even prevent anyone from overwriting or deleting anything in S3 by leveraging S3 Object Lock. Then, even if your Athena table definition is deleted, your data persists and can easily be queried again once you redefine the schema in AWS Glue.
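Enabling versioning is a single API call. The sketch below assumes a boto3 S3 client (`boto3.client("s3")`) and its `get_bucket_versioning` and `put_bucket_versioning` calls. The stub class and bucket name are hypothetical and only make the example runnable without an AWS account.

```python
def ensure_versioning(s3_client, bucket):
    """Enable versioning if it is not already on; return True if changed."""
    current = s3_client.get_bucket_versioning(Bucket=bucket).get("Status")
    if current == "Enabled":
        return False
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    return True

class StubS3:
    """Stand-in for boto3.client("s3") so the sketch runs offline."""
    def __init__(self):
        self._status = {}

    def get_bucket_versioning(self, Bucket):
        # S3 omits "Status" for buckets where versioning was never configured.
        status = self._status.get(Bucket)
        return {"Status": status} if status else {}

    def put_bucket_versioning(self, Bucket, VersioningConfiguration):
        self._status[Bucket] = VersioningConfiguration["Status"]

s3 = StubS3()
first_call = ensure_versioning(s3, "prod-data-lake")   # enables versioning
second_call = ensure_versioning(s3, "prod-data-lake")  # already enabled
print(first_call, second_call)  # True False
```

Making the helper idempotent means it can run on every deployment without side effects once the bucket is configured.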
I'm a big fan of storing raw extracted data for ETL purposes in object storage before loading it into any database. The object storage then serves as a staging area or data lake and makes analytical pipelines more resilient. Relational database connections are fragile. Imagine that you are loading large amounts of data from some source system directly into a data warehouse. Then, shortly before the ETL job finishes, it fails because the connection was forcibly closed by the remote host due to some network issue. Having to redo the extraction step can put an additional burden on the source system or may even be impossible due to API request limits.
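The pattern can be sketched in a few lines: persist the raw extract first, then load from the staged copy so that a failed load can be retried without touching the source system again. In the example below, a temporary local folder stands in for an S3 bucket, and the function names, the record layout, and the flaky loader are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

def stage_raw(records, staging_dir):
    """Persist extracted records untouched; a local folder stands in for S3."""
    path = Path(staging_dir) / "extract.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path

def load_from_stage(path, load_batch, max_attempts=3):
    """Load staged rows, retrying on connection errors; return attempts used."""
    lines = path.read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines]
    for attempt in range(1, max_attempts + 1):
        try:
            load_batch(rows)
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise

# Simulated warehouse whose connection drops on the first attempt.
warehouse, calls = [], {"count": 0}

def flaky_load(rows):
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("connection forcibly closed by remote host")
    warehouse.extend(rows)

with tempfile.TemporaryDirectory() as staging:
    staged = stage_raw([{"id": 1}, {"id": 2}], staging)  # extraction runs once
    attempts_used = load_from_stage(staged, flaky_load)

print(attempts_used, warehouse)  # 2 [{'id': 1}, {'id': 2}]
```

The source system is queried only once, and the second load attempt reads from the staged file instead.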
In this article, we examined ten ways to protect your mission-critical data store. These days, data is such a critical resource that downtime can cause significant financial and reputational losses. Make sure to approach it strategically and test your recovery scenario.