DynamoDB Global Tables replicate data across regions in seconds, but replication is still asynchronous.
That means a simple read from a replica region can occasionally return stale data. This is acceptable in most applications, where users don't need the very latest data at every moment, but in some systems stale reads can break important processes and destabilize a platform.
So the question becomes:
How do you detect and react to stale replica reads automatically?
So I set out to find the answer to this question. That is when I found AWS Fault Injection Simulator (FIS), a service that, in combination with DynamoDB Global Tables, can intentionally create failure scenarios and let us observe how a resilient read path responds.
Instead of only testing infrastructure availability, this experiment also focuses on application-level resilience. To be more precise, I'm going to show you code that can detect stale reads, reroute traffic, and recover from the chaos scenario automatically.
I'm also going to show you how to experiment with the Lambdas in this stack: we will look at how to inject delays and errors into your Lambda invocations.
The full project code is available in this GitHub repository link.
What This Project Demonstrates
The CDK-based stack deploys a multi-region architecture designed to simulate and handle real-world replication issues.
The stack consists of:
- A primary DynamoDB region (us-east-1) that handles all writes
- A replica region (eu-central-1) that normally serves reads
- A reader Lambda that detects stale data and falls back to the primary when needed
- A rerouter Lambda that automatically shifts read routing when replication becomes unhealthy
- AWS Fault Injection Simulator experiments that intentionally introduce failure scenarios:
  - DynamoDB Global Table replication failures
  - Lambda delays
  - Lambda failures
Together, these components allow us to observe how applications behave under replication lag, Lambda failures, and latency injections.
Core Design: Resilient Read Routing
The most interesting part of this project's architecture is the reader Lambda, which acts as the system's resilience layer.
Under normal conditions:
- the reader queries the replica region
- the system returns the result with low latency
However, if replication falls behind, the Lambda detects stale data using the written_at timestamp.
When stale data is detected, the Lambda reacts as follows:
- If the replica data is older than 10 seconds, it is considered stale
- The reader falls back to a strongly consistent read from the primary
- The system self-escalates by updating an SSM parameter (/fis-global/read-routing)
- Future reads temporarily bypass the replica entirely
This spares the user from repeated stale reads and avoids performing per-request staleness checks during degraded conditions.
The following snippet is just part of the reader Lambda; the full code is in the GitHub repository linked in the intro of this blog post:
STALENESS_THRESHOLD_MS = 10_000  # 10 seconds

def is_stale(item: dict) -> tuple[bool, int]:
    """
    Returns (is_stale, age_ms).
    written_at is stored as a Decimal by DynamoDB — convert safely.
    Only flags stale if age exceeds STALENESS_THRESHOLD_MS.
    """
    written_at = item.get("written_at")
    if not written_at:
        return False, 0
    written_at_ms = int(Decimal(str(written_at)))
    now_ms = int(time.time() * 1000)
    age_ms = now_ms - written_at_ms
    stale = age_ms > STALENESS_THRESHOLD_MS
    return stale, age_ms
def escalate_to_primary():
    """
    Write /fis-global/read-routing = 'primary' to SSM and update the
    local cache. The rerouter Lambda resets it to 'replica' when the
    CloudWatch alarm clears.
    """
    try:
        ssm.put_parameter(
            Name=READ_ROUTING_PARAM,
            Value="primary",
            Type="String",
            Overwrite=True,
        )
        _ssm_cache["value"] = "primary"
        _ssm_cache["fetched_at"] = time.time()
        logger.warning("SSM escalated to 'primary' — stale replication detected")
    except Exception as exc:
        logger.warning("SSM escalation failed (will retry next invocation): %s", exc)
# In the handler: replica read → staleness check → fallback → escalate
if routing == "primary":
    item = read_item(table_primary, pk, sk, consistent=True)
    read_from = "primary"
    routing_reason = "ssm_flag"
else:
    item = read_item(table_replica, pk, sk, consistent=False)
    read_from = "replica"
    if item:
        stale, age_ms = is_stale(item)
        if stale:
            logger.warning("STALE DATA: pk=%s age=%dms — falling back to primary", pk, age_ms)
            item = read_item(table_primary, pk, sk, consistent=True)
            read_from = "primary"
            routing_reason = "stale_data"
            stale_detected = True
            escalate_to_primary()
    # ... replica_miss_fallback when item not on replica yet ...
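If you want to sanity-check the staleness logic locally before deploying, you can exercise the same is_stale function with hand-built items. This sketch re-declares the function so it runs standalone, without any AWS access:

```python
import time
from decimal import Decimal

STALENESS_THRESHOLD_MS = 10_000  # mirrors the reader Lambda's constant

def is_stale(item: dict) -> tuple[bool, int]:
    """Same check as in the reader: returns (is_stale, age_ms)."""
    written_at = item.get("written_at")
    if not written_at:
        return False, 0
    written_at_ms = int(Decimal(str(written_at)))
    age_ms = int(time.time() * 1000) - written_at_ms
    return age_ms > STALENESS_THRESHOLD_MS, age_ms

now_ms = int(time.time() * 1000)
fresh_item = {"written_at": Decimal(now_ms)}          # written just now
old_item = {"written_at": Decimal(now_ms - 30_000)}   # written 30 s ago

fresh, _ = is_stale(fresh_item)
stale, age_ms = is_stale(old_item)
print(fresh, stale)  # False True
```

The fresh item sits well under the 10-second threshold, while the 30-second-old item trips the stale branch that triggers the primary fallback.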
Automated Read Rerouting
While the reader Lambda handles per-request resilience, the architecture also includes automated, infrastructure-level routing recovery: a CloudWatch alarm monitors a specific DynamoDB metric, ReplicationLatency.
If the latency exceeds 5 seconds, the alarm publishes an SNS notification, which invokes the rerouter Lambda. The latency threshold can of course be customized to your architecture's needs.
The rerouter Lambda updates the routing parameter:
- primary during degraded replication
- replica once replication returns to normal
Here is the rerouter Lambda code:
READ_ROUTING_PARAM = os.environ["READ_ROUTING_PARAM"]
ssm = boto3.client("ssm", region_name=PRIMARY_REGION)

def handler(event, context):
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        alarm_state = message.get("NewStateValue")
        alarm_name = message.get("AlarmName", "unknown")
        alarm_reason = message.get("NewStateReason", "")
        logger.info(
            "Alarm notification: name=%s state=%s reason=%s",
            alarm_name, alarm_state, alarm_reason,
        )
        if alarm_state == "ALARM":
            new_value = "primary"
            logger.warning(
                "Replication latency alarm FIRING — rerouting reads to primary (%s)",
                PRIMARY_REGION,
            )
        elif alarm_state == "OK":
            new_value = "replica"
            logger.info(
                "Replication latency alarm RESOLVED — restoring reads to replica"
            )
        else:
            logger.info("Ignoring alarm state '%s', no routing change", alarm_state)
            continue
        ssm.put_parameter(
            Name=READ_ROUTING_PARAM,
            Value=new_value,
            Type="String",
            Overwrite=True,
        )
        logger.info("SSM '%s' updated to '%s'", READ_ROUTING_PARAM, new_value)
    return {"statusCode": 200}
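For a local dry run of the rerouter's parsing logic, you can feed it a hand-built SNS event. The alarm name and reason below are made up for illustration; the field names follow the standard CloudWatch-to-SNS notification format:

```python
import json

# Hypothetical SNS event shaped like the payload CloudWatch delivers
# when the ReplicationLatency alarm changes state.
event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "AlarmName": "ReplicationLatencyAlarm",
                        "NewStateValue": "ALARM",
                        "NewStateReason": "Threshold Crossed: latency > 5000 ms",
                    }
                )
            }
        }
    ]
}

# The same decision the rerouter makes, minus the SSM write
message = json.loads(event["Records"][0]["Sns"]["Message"])
new_value = "primary" if message["NewStateValue"] == "ALARM" else "replica"
print(new_value)  # primary
```

Swapping NewStateValue to "OK" should flip the routing decision back to the replica.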
Defining Chaos Experiments with AWS Fault Injection Simulator Templates
To validate the design, the project defines three chaos experiments using AWS FIS. Each experiment targets a different way the architecture in this stack can fail.
1. Replication Pause
The first experiment pauses replication for 3 minutes on the DynamoDB Global Table.
This creates a real-world scenario where:
- writes continue in the primary
- the replica falls behind
- the reader detects stale data
- the system falls back to the primary
This experiment validates the staleness detection and routing escalation logic.
from aws_cdk import aws_fis as fis

# ...

# FIS Experiment Template: Pause Replication + Reroute (3 minutes)
fis.CfnExperimentTemplate(
    self,
    "PauseAndRerouteExperiment",
    description=(
        "Pause replication for 3 min — reader self-escalates SSM to primary "
        "on first stale detect; CW alarm resets it on recovery"
    ),
    role_arn=fis_role.role_arn,
    stop_conditions=[
        fis.CfnExperimentTemplate.ExperimentTemplateStopConditionProperty(
            source="none"
        )
    ],
    targets={
        "globalTable": fis.CfnExperimentTemplate.ExperimentTemplateTargetProperty(
            resource_type="aws:dynamodb:global-table",
            resource_arns=[global_table.table_arn],
            selection_mode="ALL",
        )
    },
    actions={
        "pauseReplication": fis.CfnExperimentTemplate.ExperimentTemplateActionProperty(
            action_id="aws:dynamodb:global-table-pause-replication",
            description="Stop replication to replica for 3 minutes",
            parameters={"duration": "PT3M"},
            targets={"Tables": "globalTable"},
        ),
    },
    tags={"Experiment": "GlobalTable-PauseAndReroute"},
)
2. Lambda Latency Injection
The second experiment introduces artificial startup delays in the Lambda reader.
Using the FIS Lambda extension, we inject:
- 12-second invocation delays
- lasting 2 minutes
Even when replication is healthy, this delay can cause the reader to observe stale data relative to new writes.
This tests whether the system can handle latency-induced inconsistencies.
from aws_cdk import aws_fis as fis

# ...

fis.CfnExperimentTemplate(
    self,
    "InvocationDelayExperiment",
    description=(
        "Inject 12 s startup delay into 100% of reader Lambda invocations for 2 min. "
        "Delay > STALENESS_THRESHOLD_MS → stale-data fallback fires even with healthy replication."
    ),
    role_arn=fis_role.role_arn,
    stop_conditions=[
        fis.CfnExperimentTemplate.ExperimentTemplateStopConditionProperty(
            source="none"
        )
    ],
    targets={
        "readerFn": fis.CfnExperimentTemplate.ExperimentTemplateTargetProperty(
            resource_type="aws:lambda:function",
            resource_arns=[reader_lambda.function_arn],
            selection_mode="ALL",
        )
    },
    actions={
        "addDelay": fis.CfnExperimentTemplate.ExperimentTemplateActionProperty(
            action_id="aws:lambda:invocation-add-delay",
            description="Add 12 s pre-handler delay to every reader invocation",
            parameters={
                "duration": "PT2M",
                "invocationPercentage": "100",
                "startupDelayMilliseconds": "12000",
            },
            targets={"Functions": "readerFn"},
        ),
    },
    tags={"Experiment": "Lambda-InvocationDelay"},
)
3. Lambda Invocation Errors
The final experiment forces 100% invocation failures for the targeted Lambda (in our case, the reader Lambda), so the handler never executes. This simulates:
- dependency outages
- runtime crashes during the Lambda initialization phase
- IAM permission errors
This validates the replica error fallback logic, ensuring the application gracefully handles failures instead of returning errors to clients.
from aws_cdk import aws_fis as fis

# ...

fis.CfnExperimentTemplate(
    self,
    "InvocationErrorExperiment",
    description=(
        "Force reader Lambda to return an error on 100% of invocations for 2 min. "
        "Triggers replica_error_fallback path in reader."
    ),
    role_arn=fis_role.role_arn,
    stop_conditions=[
        fis.CfnExperimentTemplate.ExperimentTemplateStopConditionProperty(
            source="none"
        )
    ],
    targets={
        "readerFn": fis.CfnExperimentTemplate.ExperimentTemplateTargetProperty(
            resource_type="aws:lambda:function",
            resource_arns=[reader_lambda.function_arn],
            selection_mode="ALL",
        )
    },
    actions={
        "injectError": fis.CfnExperimentTemplate.ExperimentTemplateActionProperty(
            action_id="aws:lambda:invocation-error",
            description="Return error response without executing reader handler",
            parameters={
                "duration": "PT2M",
                "invocationPercentage": "100",
                "preventExecution": "true",
            },
            targets={"Functions": "readerFn"},
        ),
    },
    tags={"Experiment": "Lambda-InvocationError"},
)
How to combat possible Lambda initialization issues when starting the FIS experiment
During the development of this project, I kept getting errors from the reader Lambda saying that it doesn't support the AWS FIS experiment. The solution, however, is described in the documentation.
Every Lambda that is a target in an AWS FIS experiment template needs a special Lambda layer added to its configuration. The ARN of the layer, in my case, was:
"arn:aws:lambda:us-east-1:211125607513:layer:aws-fis-extension-x86_64:280"
and it can be found in the AWS Systems Manager Parameter Store. Keep in mind that:
- your region
- your Lambda architecture (x86 or ARM)
determine which layer ARN you need.
In addition, to run a FIS experiment on a Lambda, you need an S3 bucket to store the fault configuration, and you must define the AWS_FIS_CONFIGURATION_LOCATION and AWS_LAMBDA_EXEC_WRAPPER environment variables.
AWS_FIS_CONFIGURATION_LOCATION needs to point to that S3 location, while AWS_LAMBDA_EXEC_WRAPPER must be set to /opt/aws-fis/bootstrap.
More information can be found in official AWS docs by clicking on the link here.
The final CDK code for the Lambda looks as follows:
reader_lambda = lambda_.Function(
    self,
    "ReaderLambda",
    function_name="fis-global-reader",
    runtime=lambda_.Runtime.PYTHON_3_12,
    handler="reader.handler",
    code=lambda_.Code.from_asset(
        "lambda",
        exclude=[".venv", "venv", "__pycache__", "cdk.out", "*.pyc", "node_modules", ".git"],
    ),
    role=lambda_role,
    # Timeout must exceed max FIS startup delay (12 000 ms) + execution time.
    timeout=Duration.seconds(30),
    memory_size=256,
    architecture=lambda_.Architecture.X86_64,  # Must match the FIS layer's architecture!
    layers=[
        lambda_.LayerVersion.from_layer_version_arn(
            self,
            "FisExtensionLayer",
            # Be sure that you have the layer below defined,
            # the FIS experiment won't work otherwise!
            "arn:aws:lambda:us-east-1:211125607513:layer:aws-fis-extension-x86_64:280",
        )
    ],
    environment={
        # ...
        # FIS writes fault config to this S3 ARN; the extension layer
        # reads it before each invocation.
        "AWS_FIS_CONFIGURATION_LOCATION": fis_config_location_arn,
        # Required alongside the FIS extension layer to intercept invocations.
        "AWS_LAMBDA_EXEC_WRAPPER": "/opt/aws-fis/bootstrap",
    },
    log_group=reader_log_group,
)
Also, please make sure that you give the Lambda the necessary IAM permissions! The following role will do the trick:
lambda_role = iam.Role(
    self,
    "LambdaRole",
    assumed_by=iam.ServicePrincipal("lambda.amazonaws.com"),
    managed_policies=[
        iam.ManagedPolicy.from_aws_managed_policy_name(
            "service-role/AWSLambdaBasicExecutionRole"
        )
    ],
)
global_table.grant_read_write_data(lambda_role)
read_routing_param.grant_read(lambda_role)
read_routing_param.grant_write(lambda_role)
chaos_param.grant_read(lambda_role)

# grant_read_write_data only covers the primary ARN; the reader Lambda
# also talks to the eu-central-1 replica which has a distinct ARN.
lambda_role.add_to_policy(
    iam.PolicyStatement(
        actions=[
            "dynamodb:GetItem",
            "dynamodb:Query",
            "dynamodb:Scan",
            "dynamodb:BatchGetItem",
        ],
        resources=[
            f"arn:aws:dynamodb:{replica_region}:{self.account}:table/{global_table.table_name}",
            f"arn:aws:dynamodb:{replica_region}:{self.account}:table/{global_table.table_name}/index/*",
        ],
    )
)

# The FIS Lambda extension layer (attached to reader Lambda) must be
# able to ListBucket and GetObject from the FIS config S3 bucket.
# Required by: https://docs.aws.amazon.com/fis/latest/userguide/use-lambda-actions.html#lambda-prerequisites
lambda_role.add_to_policy(
    iam.PolicyStatement(
        sid="AllowListingFisConfigLocation",
        actions=["s3:ListBucket"],
        resources=[fis_config_bucket.bucket_arn],
        conditions={
            "StringLike": {
                "s3:prefix": [f"{fis_config_prefix}*"]
            }
        },
    )
)
lambda_role.add_to_policy(
    iam.PolicyStatement(
        sid="AllowReadingFisConfigObjects",
        actions=["s3:GetObject"],
        resources=[f"{fis_config_bucket.bucket_arn}/{fis_config_prefix}*"],
    )
)

# The FIS extension running inside the reader Lambda needs to call back
# to the FIS service to check active experiment parameters.
lambda_role.add_to_policy(
    iam.PolicyStatement(
        actions=["fis:GetExperiment", "fis:ListExperiments"],
        resources=["*"],
    )
)
Testing the System
To exercise the architecture, the project includes a Python test harness.
The script test_global_chaos.py runs several scenarios:
- Baseline testing under healthy replication
- Immediate read-after-write conditions
- Version drift during replication pauses
- Latency injection experiments
- Lambda error injection tests
These tests validate both data correctness and routing behavior under chaos conditions.
Starting the “DynamoDB Pause Replication” experiment
The test harness also includes a watch mode that loops through the designed steps:
- write the data to the primary
- read the data from the replica, as long as no issues are reported
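Conceptually, the watch loop looks something like this. This is a simplified sketch with stand-in write/read functions, not the harness's actual code:

```python
import time

def watch_loop(write_fn, read_fn, iterations: int = 3, pause_s: float = 0.01):
    """Simplified shape of the harness's watch mode: write to the
    primary, pause briefly for replication, then read back and record
    where the read was served from."""
    results = []
    for i in range(iterations):
        pk = f"watch-{i}"
        write_fn(pk, f"value-{i}")   # write goes to the primary region
        time.sleep(pause_s)          # give replication a moment
        results.append(read_fn(pk))  # normally served by the replica
    return results

# Toy in-memory stand-ins so the sketch runs without AWS access
store = {}

def write_item(pk, value):
    store[pk] = value

def read_item(pk):
    return {"pk": pk, "value": store.get(pk), "read_from": "replica"}

results = watch_loop(write_item, read_item)
print(len(results), results[0]["read_from"])  # 3 replica
```

In the real harness, read_fn invokes the reader Lambda, so the read_from field flips to "primary" once the staleness fallback kicks in during an experiment.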
While an FIS experiment runs, you can observe the system transitioning through multiple states:
- Normal replica reads
- Replica becomes stale
- Fallback to primary
- SSM routing escalation
- Replica recovery
- Routing reset
This creates a clear visualization of resilience behavior over time. You can run this experiment with the following commands:
# Deploy the stack first and make sure you have the region bootstrapped!
cdk bootstrap
cdk deploy
# Install the needed dependencies
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
# Run the Python script
python3 test_global_chaos.py --watch
Here is the transition: first the system works as expected without the experiment running, then I manually start the experiment in the FIS console. You can see the data becoming stale and the Lambda switching to the primary database right away:
After stopping the experiment, the system goes back to normal, reading data from the replica database:
I made this script not start the experiment on purpose, so you can start and stop it as you wish; the other experiments are designed to be shorter.
Simulating the Lambda delay
Now we are going to give our reader Lambda a 12-second delay before the handler code runs. You can start this experiment with:
python3 test_global_chaos.py --scenario delay
and everything will be started for you; the output should look like this:
It is clear that the experiment is working: the Lambda reports the data as stale, and the execution time is well above the 12-second mark.
Simulating the Lambda errors
Finally, we are going to simulate failure of the reader Lambda by running:
python3 test_global_chaos.py --scenario error
again, the script will run everything for you:
Pricing
Now it’s time to look at pricing.
AWS Fault Injection Simulator
Based on the AWS pricing docs, AWS FIS is priced at $0.10 per action-minute.
So, for example:
3 experiments × 3 minutes (one action each) would cost you $0.90.
This is definitely the largest cost in the project; building it from scratch cost me just over $2.
DynamoDB Global Tables
The pricing components are:
- write request units
- storage
- replication write units
Example assumption:
- 10k test writes
- 10k reads
- small dataset (<100 MB)
Approximate cost would be:
- Writes: 10k × $1.25 per million ≈ $0.0125
- Reads: 10k × $0.25 per million ≈ $0.0025
- Replication: similar to the write cost ≈ $0.01
Total DynamoDB cost is around $0.03 USD.
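As a quick sanity check of that estimate, using the per-million on-demand prices quoted above:

```python
# Back-of-the-envelope check of the DynamoDB estimate
# (on-demand prices per million requests, as quoted above).
write_cost = 10_000 * 1.25 / 1_000_000   # write request units
read_cost = 10_000 * 0.25 / 1_000_000    # read request units
replication_cost = write_cost            # replicated writes ≈ write cost
total = write_cost + read_cost + replication_cost
print(f"${total:.4f}")  # ≈ $0.03
```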
Lambda
One of the cheapest parts of this project:
- $0.0000166667 for every GB-second
- $0.20 per 1M requests
If you are not using your AWS account often, this will most likely be covered by AWS Free Tier.
SSM Parameter Store
As we are using the standard parameters, they are free as per these docs!
Final Thoughts
Many distributed architectures assume that multi-region databases provide seamless replication with minimal design considerations.
In reality, applications must still handle:
- replication delays
- partial failures
- latency anomalies
- degraded infrastructure
By combining:
- DynamoDB Global Tables
- AWS Fault Injection Simulator
- Application-level fallback logic
we can design systems that are explicitly tested against real-world failure scenarios, not just theoretical ones. With these code snippets and the whole GitHub repository as your reference, you have the tools to make your architecture even better.



