Understand and Optimize Amazon Aurora Global

#aws #aurora #database #architecture

Introduction

Amazon Aurora supports a global, multi-region setup consisting of a primary region and a secondary region. When engineering with Aurora Global Database, the default settings are a good start, but understanding the available configuration options and how they fit together saves time and effort. This article explains Global Write Forwarding and its effects in detail. It is a handy setting that lets applications running in both the primary and secondary regions read from and write to the database.
Note that this article is not about Aurora DSQL, which is a different service that supports an active-active setup out of the box.

Aurora Defaults

Aurora Global Database sets up a writer and a reader in the primary region, plus a reader and a standby writer instance in the secondary region. The standby writer is promoted to writer during a region failover, when the secondary region becomes the primary.
Endpoints are created for three of the four instances (shown in Figure 1): a read-write endpoint for the primary writer, a read-only endpoint for the primary reader, and a read-replica endpoint for the secondary reader.

Depending on how the security groups are configured, these endpoints can be accessed from both regions, but each one interacts only with its respective DB instance.
This default setup works best as an active-standby setup, where the application in the secondary region can be available but does not actively serve any traffic that involves a database connection. Configuring cross-region access is possible but inconvenient to maintain.

Global Endpoint

A global endpoint can be enabled, which proxies the primary region's writer endpoint by default. Its purpose is that an application using it does not need to change when a region failover occurs: the global endpoint switches to point at the newly promoted writer.

This setup can work with active-active compute when the same global endpoint is used in both regions and the application can tolerate the added cross-region latency on every database access from the secondary region.

The global endpoint connects only to the primary writer. The reader instances in either region are not used unless their endpoints are explicitly coded as application data sources. See Figure 2.
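The split described above, with writes going through the global endpoint and reads going to a reader in the same region, can be sketched as a small data source lookup. This is a minimal illustration, not a definitive implementation: the hostnames, `RegionDataSources` and `data_sources_for` are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionDataSources:
    """Database hosts that an application instance in one region connects to."""
    write_host: str
    read_host: str


# Hypothetical hostnames, for illustration only.
GLOBAL_ENDPOINT = "global-writer.example.internal"
LOCAL_READER_ENDPOINTS = {
    "primary": "reader.primary.example.internal",
    "secondary": "reader.secondary.example.internal",
}


def data_sources_for(region: str) -> RegionDataSources:
    """Writes always use the global endpoint; reads stay in the caller's region."""
    return RegionDataSources(
        write_host=GLOBAL_ENDPOINT,
        read_host=LOCAL_READER_ENDPOINTS[region],
    )
```

With this shape, a failover changes nothing in the application: the write host stays the same in both regions, and only the instance behind the global endpoint moves.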

Global Write Forwarding

We have established that the default settings, combined with the global endpoint, work well for region failover scenarios.
If you coded the application to use the global endpoint for writes and local reader endpoints for reads, you care about sharing the load in read-heavy or write-heavy workloads.
One of the settings available for this is write forwarding.

When this setting is enabled, an application in the secondary region that uses the global endpoint reads locally and writes to the primary region, without needing any additional application code to switch between reads and writes.

This setting has an effect only on the secondary cluster. Applications in the primary region still need additional code to separate reads and writes.

This is great for read-heavy applications, letting them remain active-active while providing low-latency local database reads. However, it brings the challenge of distributed data consistency.
See Figure 3 for updated connectivity patterns with and without global write forwarding.
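As a rough sketch of how the setting might be turned on for the secondary cluster, the snippet below builds the arguments for the RDS `ModifyDBCluster` call, which accepts an `EnableGlobalWriteForwarding` flag. It assumes boto3 is installed and AWS credentials are configured. The function names are hypothetical, and the cluster identifier and region are values you supply.

```python
def write_forwarding_request(secondary_cluster_id: str, enabled: bool = True) -> dict:
    """Build the keyword arguments for an RDS ModifyDBCluster call."""
    return {
        "DBClusterIdentifier": secondary_cluster_id,
        "EnableGlobalWriteForwarding": enabled,
    }


def apply_write_forwarding(secondary_cluster_id: str, region_name: str) -> dict:
    """Send the request to RDS in the secondary region (needs boto3 and credentials)."""
    import boto3  # imported lazily so the request builder works without the AWS SDK

    rds = boto3.client("rds", region_name=region_name)
    return rds.modify_db_cluster(**write_forwarding_request(secondary_cluster_id))
```

Keeping the request builder separate from the API call makes the configuration easy to review or unit test before it touches a real cluster.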

Consistency

If you observe the path 1->2->3 in Figure 3, which is the data flow from the secondary region when write forwarding is enabled, you see that we are writing to and reading from different clusters, relying on Aurora to keep the data consistent. Thankfully, Aurora handles this by default, but the question remains: are the default settings good enough for your use case?
Aurora provides three read consistency modes: session, eventual, and global. For write forwarding to work, the consistency setting must be configured correctly. The defaults also differ between MySQL and PostgreSQL.
Session consistency is the default setting in PostgreSQL. It limits blocking and works in most cases, but review how global and eventual consistency behave before choosing one.
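To make the three modes concrete, here is a minimal sketch that builds the session-level statement selecting a consistency mode. The parameter names (`aurora_replica_read_consistency` for Aurora MySQL, `apg_write_forward.consistency_mode` for Aurora PostgreSQL) reflect my understanding of the AWS documentation, so verify them against your engine version. `consistency_statement` is a hypothetical helper.

```python
VALID_MODES = {"SESSION", "EVENTUAL", "GLOBAL"}

# Assumed parameter names; confirm them in the AWS docs for your engine version.
CONSISTENCY_PARAMETER = {
    "mysql": "aurora_replica_read_consistency",
    "postgres": "apg_write_forward.consistency_mode",
}


def consistency_statement(engine: str, mode: str) -> str:
    """Return the SET statement that selects a read consistency mode for a session."""
    normalized = mode.upper()
    if normalized not in VALID_MODES:
        raise ValueError(f"unknown consistency mode: {mode!r}")
    parameter = CONSISTENCY_PARAMETER[engine]
    return f"SET {parameter} = '{normalized}';"
```

An application would run the returned statement on its connection to the secondary cluster before issuing forwarded writes.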

Replication lag

If there is a delay in replication between the regions, write forwarding, depending on its consistency configuration, will not work correctly. If the application can tolerate it, there are settings to block writes until the lag clears, which prevents additional damage. The recovery point objective (RPO) setting controls the lag and is set to 1 minute by default, meaning Aurora tries to keep the lag under 1 minute. It can be adjusted but cannot go lower than 20 seconds.
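The relationship between lag and RPO can be modelled in a few lines. This is only an application-side illustration of the rule described above, using the 1 minute default and 20 second floor from this article. Aurora enforces the RPO itself, and the helper names here are hypothetical.

```python
MIN_RPO_SECONDS = 20      # lower bound stated in this article
DEFAULT_RPO_SECONDS = 60  # default stated in this article


def validate_rpo(rpo_seconds: int) -> int:
    """Reject RPO values below the minimum and return valid ones unchanged."""
    if rpo_seconds < MIN_RPO_SECONDS:
        raise ValueError(f"RPO must be at least {MIN_RPO_SECONDS} seconds")
    return rpo_seconds


def should_block_writes(lag_seconds: float, rpo_seconds: int = DEFAULT_RPO_SECONDS) -> bool:
    """Writes are blocked while the observed replication lag exceeds the RPO."""
    return lag_seconds > validate_rpo(rpo_seconds)
```

For example, a lag of 75 seconds against the default RPO means writes are held back, while a lag of 5 seconds does not affect them.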

Conclusion

Understanding the configuration options available with Amazon Aurora and their trade-offs helps in choosing the optimal settings for your workload. Default settings are a good starting point, but sometimes we learn the trade-offs the hard way, such as during a production incident when the settings fail under specific conditions. Fully understand your access patterns and the related configuration before relying on them.
