
Paul Elliott


Data inconsistency in AWS Amazon Aurora Postgres solved with Local Write Forwarding?

I'm a big fan of Postgres. I'm also a big fan of AWS Aurora Postgres. While working as a consultant optimising databases for clients, I witnessed first-hand the amazing scalability that's possible with these two technologies. But it's not all sunshine and roses.

YouTube has many excellent videos on the architecture behind AWS Aurora. At a very high level, AWS takes upstream Postgres, replaces the storage engine and makes some other modifications to the query planner. The storage engine is the biggest change though, offloading the data into dedicated storage clusters isolated from the underlying compute. This provides a huge number of benefits, but it also caused one major problem, until recently.

Aurora clusters are formed of a single writer instance and zero or more reader instances. I'm disregarding Aurora Serverless, as that's a whole other beast and a topic for another day. In the simplest setup, the cluster provides a single writer endpoint and a single reader endpoint. Clients send read-only queries to the reader endpoint, which uses DNS round robin to (dumbly) route traffic across the reader instances. Clients send INSERT, UPDATE and DELETE traffic, unsurprisingly, to the writer endpoint, which always resolves via DNS to the current writer instance.
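As a quick illustration of that split (a minimal sketch only: the endpoint hostnames, credentials and table are placeholders, and psycopg2 is just one of many possible client libraries):

```python
import psycopg2

# Placeholder endpoints: the cluster exposes one writer endpoint and one reader
# endpoint, and the reader endpoint round-robins across reader instances via DNS.
WRITER_ENDPOINT = "mycluster.cluster-abc123.eu-west-1.rds.amazonaws.com"
READER_ENDPOINT = "mycluster.cluster-ro-abc123.eu-west-1.rds.amazonaws.com"

def connect(host):
    # Credentials are placeholders for illustration only.
    return psycopg2.connect(host=host, dbname="app", user="app", password="...")

writer = connect(WRITER_ENDPOINT)
reader = connect(READER_ENDPOINT)

# Writes go to the writer endpoint...
with writer, writer.cursor() as cur:
    cur.execute("UPDATE counters SET value = value + 1 WHERE id = 1")

# ...and read-only queries go to the reader endpoint.
with reader, reader.cursor() as cur:
    cur.execute("SELECT value FROM counters WHERE id = 1")
    print(cur.fetchone())
```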

The writer instance and the reader instances all point to the exact same storage backend, as it's shared across all instances. This means that when the writer successfully commits a change to storage, that change is available through the storage layer to all the reader instances synchronously, with zero lag.

So if we commit a change via the writer and then immediately query a reader, we'll get our new data back, right? It depends.

Although it's true that the underlying data on disk is always consistent between the writer and reader instances, as it's the exact same blocks of storage referenced by both, there's one area that's not always consistent.

Aurora uses Linux on EC2 for the underlying compute. The Linux kernel uses a page cache to retain data from recently accessed blocks in volatile memory (RAM). And this is what can lead to inconsistent query results. Consider this scenario:

  1. The writer instance receives a query updating a row.
  2. The writer sends the update to the storage layer.
  3. The storage layer commits the change and returns a success to the writer instance.
  4. The writer instance returns to the client that the transaction was successful.
  5. A reader instance receives a read-only query for the same row that was just updated.
  6. The reader has a large amount of RAM, and that row is already in the page cache, so it skips going to storage and simply returns the row from the page cache.

See the problem? The reader instance skipped the lookup to the backend storage as it believed it already had the latest available data to return. Meanwhile, a background process runs between the storage layer and the compute layer, invalidating page cache blocks when the underlying block has changed. Unfortunately there's a small, but significant, latency to this process, which is what causes the issue. Let's measure that latency.


I'll be using an Aurora Postgres cluster running engine version 16.4, with a single writer instance and a single reader instance, both on db.t4g.medium. As a test client I'll be using a t3a.micro EC2 instance. The writer, reader and test client are all in different AZs within the same eu-west-1 region.
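For anyone wanting to reproduce the setup, provisioning a cluster like this with boto3 looks roughly as follows (a sketch only: identifiers and credentials are placeholders, networking details are omitted, and the parameter names are my reading of the RDS API, so check the current documentation):

```python
import boto3

rds = boto3.client("rds", region_name="eu-west-1")

# Aurora Postgres cluster on engine version 16.4 (identifiers and credentials
# are placeholders; subnet groups, security groups etc. are omitted for brevity).
rds.create_db_cluster(
    DBClusterIdentifier="consistency-test",
    Engine="aurora-postgresql",
    EngineVersion="16.4",
    MasterUsername="postgres",
    MasterUserPassword="change-me",
)

# One writer and one reader; the first instance created becomes the writer.
for name in ("consistency-test-1", "consistency-test-2"):
    rds.create_db_instance(
        DBInstanceIdentifier=name,
        DBClusterIdentifier="consistency-test",
        DBInstanceClass="db.t4g.medium",
        Engine="aurora-postgresql",
    )
```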

For testing I'll be using a simple Python script. The script will perform the following actions:

  1. Open database connections to the writer, reader or both, depending on the test scenario.
  2. Update a row within a table with a counter starting at 0 and increasing sequentially up to the maximum number of repetitions.
  3. Wait an increasing amount of time, starting with no wait and going up to 100 milliseconds.
  4. Read the same row back and check if the counter has the new value (consistent), or still has the old value (inconsistent).
  5. Repeat steps 2-4 until reaching the maximum number of 10,000 repetitions.
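Putting those steps together, a minimal sketch of the test loop looks something like this (endpoints, credentials, the table and the exact set of delays are placeholders; the full script also records the timings between the commit and the read that are discussed below):

```python
import time
import psycopg2

WRITER = "mycluster.cluster-abc123.eu-west-1.rds.amazonaws.com"
READER = "mycluster.cluster-ro-abc123.eu-west-1.rds.amazonaws.com"
REPETITIONS = 10_000
DELAYS_MS = [0, 12.5, 25, 50, 100]  # illustrative values for the step 3 wait

def connect(host):
    conn = psycopg2.connect(host=host, dbname="test", user="test", password="...")
    conn.autocommit = True  # each statement commits immediately
    return conn

writer = connect(WRITER)  # step 1: open connections to the writer and the reader
reader = connect(READER)

for delay_ms in DELAYS_MS:
    failures = 0
    for counter in range(REPETITIONS):
        # Step 2: update the row with the next counter value via the writer.
        with writer.cursor() as cur:
            cur.execute("UPDATE consistency_test SET counter = %s WHERE id = 1", (counter,))
        # Step 3: artificial delay between the write and the read.
        time.sleep(delay_ms / 1000)
        # Step 4: read the row back from the reader and check the value.
        with reader.cursor() as cur:
            cur.execute("SELECT counter FROM consistency_test WHERE id = 1")
            (value,) = cur.fetchone()
        if value != counter:
            failures += 1  # stale read: the reader returned the old counter
    # Step 5 is the loop itself; report the failure rate for this delay.
    print(f"delay={delay_ms}ms failure rate={failures / REPETITIONS:.2%}")
```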

The time taken between steps 2 and 4 will also include the time it takes the script itself to run, which needs to be taken into consideration. So let's work out how long that is. Running without any delay in step 3, with both writes and reads going to the writer instance, gives our absolute best case scenario in terms of code latency. In this scenario, the gap from the successful commit of the transaction in step 2 to the issuing of the read request in step 4 is M=0.12795ms (SD=0.000155) across 100,000 repetitions. Pretty quick, and more than quick enough for our testing.

Now that we know how quickly we can read data back, we can start to see if a read immediately after a write will return the data we expect. Here are the results when writing to the writer and reading from the reader.

Writes to writer, reads to reader. No write forwarding.

Let's walk through what we can see here, as it's hugely significant.

In our scenario, when we make an update through the writer instance and then read the data back from a reader instance with less than 25 milliseconds between the queries, there's a significant chance of getting stale data back. When updating and then immediately reading back, we see a failure rate (the red line) of 99.36%, dropping to 59.96% when an artificial delay of 12.5 milliseconds is added. This is fundamentally at odds with what an ACID-compliant database should be doing. (Although impossible to see on the chart, we still get a failure rate of 0.02% with a 50 millisecond delay added.)

Having previously identified what the issue is likely to be here, the page cache, let's repeat the same test but sending both the updates and the reads to the writer instance.

Writes to writer, reads to writer. No write forwarding.

Problem solved, and theory confirmed! We no longer see any inconsistencies when reading data back immediately after updating it.

Except, this is a really bad idea.

In Aurora Postgres, you're typically limited to a single writer instance and up to 15 reader instances. I say typically because Aurora Postgres Limitless recently went GA, which provides horizontal autoscaling of writer instances, but that's a topic for another day as it has some significant design details which need to be taken into consideration. Putting the Limitless product aside, this means you will always be limited to a single writer instance within any single Aurora Postgres cluster. So the writer instance should only be used for queries performing updates, with all other queries handled by autoscaling reader instances. But what other option do we have? If we're performing updates and then needing to read that data back with consistency, based on these findings we have to use the writer instance, or add in an artificial delay, don't we?

This was true until the general availability launch of Aurora Postgres local write forwarding, which solves this particular problem, with some caveats.

Local write forwarding is a feature of Aurora Postgres (and Aurora MySQL) that allows clients to send writes to reader instances, along with a requested consistency level. The reader instance receiving the traffic identifies the query as a write and forwards it on to the writer instance. Depending on the consistency level requested, it then optionally waits for the change to become consistent to the specified level before confirming the transaction as a success to the client.

Local write forwarding is enabled at the cluster level, and once enabled clients can specify the level of consistency they need when using the feature:

OFF: Disabled; updates sent to a reader instance will fail immediately.
SESSION: The default on a cluster with local write forwarding enabled. Any changes made within a single session will always be consistent within that session, but may not yet be consistent in other sessions.
EVENTUAL: Updates can be sent to reader instances for forwarding to the writer instance, but there's absolutely no guarantee that the data will be immediately consistent when read back.
GLOBAL: The sledgehammer setting. This ensures that all updates sent through the session are replicated to all reader instances before the transaction returns.
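As a sketch of what this looks like from a client's perspective (the EnableLocalWriteForwarding API parameter and the apg_write_forward_consistency session setting are my reading of the AWS documentation at the time of writing, so verify them against the current docs):

```python
import boto3
import psycopg2

# Enable local write forwarding at the cluster level (assumed RDS API parameter
# name; check the current AWS documentation).
rds = boto3.client("rds", region_name="eu-west-1")
rds.modify_db_cluster(
    DBClusterIdentifier="consistency-test",
    EnableLocalWriteForwarding=True,
    ApplyImmediately=True,
)

# Connect to the reader endpoint and choose a consistency level for this session
# (assumed session parameter name for Aurora Postgres write forwarding).
conn = psycopg2.connect(host="mycluster.cluster-ro-abc123.eu-west-1.rds.amazonaws.com",
                        dbname="test", user="test", password="...")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("SET apg_write_forward_consistency TO session")
    # This UPDATE arrives at the reader, which forwards it to the writer.
    cur.execute("UPDATE consistency_test SET counter = counter + 1 WHERE id = 1")
    # With SESSION consistency, a read in the same session sees the update.
    cur.execute("SELECT counter FROM consistency_test WHERE id = 1")
    print(cur.fetchone())
```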

This should solve our page cache data consistency issue, and it even allows us to set the desired consistency on a per-client basis, which is fantastic. Let's check that it works by repeating our previous tests. We'll start with EVENTUAL consistency, where we should still expect to see failures.

All queries to reader. Eventual write forwarding.

As expected, we continue to see failures at a similar rate to before. Now let's try with SESSION consistency.

All queries to reader. Session write forwarding.

No more failures! And finally, let's try with GLOBAL consistency.

All queries to reader. Global write forwarding.

As expected, no failures. But at what cost? Let's have a look at the latency figures for the zero-added-delay scenario.

Local write forwarding latency comparison.

Let's look a bit closer at the numbers:

| Consistency Level | Latency (ms) | Increase from disabled |
| --- | --- | --- |
| disabled | 4.461590695 | 0% |
| EVENTUAL | 5.70614152 | 27.89% |
| SESSION | 5.927728486 | 32.86% |
| GLOBAL | 6.418921375 | 43.87% |
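The increase column is simply the relative change from the write-forwarding-disabled baseline:

```python
# Relative latency increase over the baseline with local write forwarding disabled.
baseline = 4.461590695
for level, latency in [("EVENTUAL", 5.70614152),
                       ("SESSION", 5.927728486),
                       ("GLOBAL", 6.418921375)]:
    print(f"{level}: {(latency - baseline) / baseline:.2%}")  # 27.89%, 32.86%, 43.87%
```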

A whopping 43.87% increase in latency compared to not using local write forwarding, and this is on an otherwise completely empty, isolated and idle cluster, which is an entirely unrealistic prospect in the real world.

Now, that sounds like a big increase, but the latency figures are still under 10ms across the board. How that scales with a real-life production workload is entirely dependent on the workload in question. Using load testing tools such as Locust, along with some careful modelling of query patterns, it's possible to simulate such a workload and answer that scaling question.
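As one possible starting point for that kind of simulation (a sketch only, assuming psycopg2 and a recent Locust release; the endpoint, credentials, table and traffic mix are all placeholders):

```python
import time
import psycopg2
from locust import User, task, between

READER = "mycluster.cluster-ro-abc123.eu-west-1.rds.amazonaws.com"

class WriteThenReadUser(User):
    """Simulates clients that update a row and immediately read it back via the reader endpoint."""
    wait_time = between(0.05, 0.5)

    def on_start(self):
        self.conn = psycopg2.connect(host=READER, dbname="test", user="test", password="...")
        self.conn.autocommit = True
        with self.conn.cursor() as cur:
            # Assumed session parameter name, as above.
            cur.execute("SET apg_write_forward_consistency TO session")

    @task
    def write_then_read(self):
        start = time.perf_counter()
        exc = None
        try:
            with self.conn.cursor() as cur:
                cur.execute("UPDATE consistency_test SET counter = counter + 1 WHERE id = 1")
                cur.execute("SELECT counter FROM consistency_test WHERE id = 1")
                cur.fetchone()
        except Exception as e:
            exc = e
        # Report the round trip to Locust's statistics engine.
        self.environment.events.request.fire(
            request_type="SQL",
            name="write_then_read",
            response_time=(time.perf_counter() - start) * 1000,
            response_length=0,
            exception=exc,
            context={},
        )
```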

Alongside local write forwarding, Amazon have added new wait states to Performance Insights. The chart below shows a reader instance actively forwarding traffic to the writer instance. These new metrics will be really useful as production workloads move onto clusters with local write forwarding enabled, helping to diagnose situations where the feature is causing unexpected bottlenecks.

Local write forwarding new metrics.


Overall I would consider local write forwarding a big win, even with the latency penalty shown above. The ability to remove all direct traffic from the writer instance and throw everything at the readers makes life a lot simpler for developers, without having to worry about consistency issues. I highly recommend people have a play and see how it performs.

If you're interested in the raw data behind this blog post, spot any inaccuracies, or would like to add anything, please do get in touch.
