DEV Community


When DynamoDB Global Tables Go Stale: Chaos Testing Replication Lag with AWS FIS

DynamoDB Global Tables replicate data across regions in seconds, but replication is still asynchronous.

That means a simple read from a replica region can occasionally return stale data. In most applications that's acceptable, since users don't always need the very latest data, but in some systems stale reads can break important processes and destabilize the platform.

So the question becomes:

How do you detect and react to stale replica reads automatically?

I set out to find the answer, and that's when I found AWS Fault Injection Simulator (FIS), a service that, in combination with DynamoDB Global Tables, lets you intentionally create failure scenarios and observe how a resilient read path responds.

Instead of only testing infrastructure availability, this experiment also focuses on application-level resilience. To be more precise, I'm going to show you code that detects stale reads, reroutes traffic, and recovers from the chaos scenario automatically.

I'm also going to show you how to experiment with the Lambdas in this stack: we'll look at how to inject delays and errors into your Lambda invocations.

The full project code is available in this GitHub repository.

What This Project Demonstrates

The CDK-based stack deploys a multi-region architecture designed to simulate and handle real-world replication issues.

The stack consists of:

  • A primary DynamoDB region (us-east-1) that handles all writes
  • A replica region (eu-central-1) that normally serves reads
  • A reader Lambda that detects stale data and falls back to the primary when needed
  • A rerouter Lambda that automatically shifts read routing when replication becomes unhealthy
  • AWS Fault Injection Simulator experiments that intentionally introduce failure scenarios, which are:
    • DynamoDB Global Table replication failures
    • Lambda delays
    • Lambda failures

Together, these components allow us to observe how applications behave under replication lag, Lambda failures, and latency injections.

Core Design: Resilient Read Routing

The most interesting part of this project's architecture is the reader Lambda, which acts as the system's resilience layer.

Under normal conditions:

  1. the reader queries the replica region
  2. the system returns the result with low latency

However, if replication falls behind, the Lambda detects the stale data using the written_at timestamp.

When stale data is detected, the Lambda reacts as follows:

  • If the replica data is older than 10 seconds, it is considered stale
  • The reader falls back to a strongly consistent read from the primary
  • The system self-escalates by updating an SSM parameter (/fis-global/read-routing)
  • Subsequent reads temporarily bypass the replica entirely

This prevents the user from getting repeated stale reads and avoids performing per-request staleness checks during degraded conditions.

The following snippet is just a part of the reader Lambda; the full code is in the GitHub repository linked in the intro of this post:

import time
from decimal import Decimal

STALENESS_THRESHOLD_MS = 10_000  # 10 seconds

def is_stale(item: dict) -> tuple[bool, int]:
    """
    Returns (is_stale, age_ms).
    written_at is stored as a Decimal by DynamoDB — convert safely.
    Only flags stale if age exceeds STALENESS_THRESHOLD_MS.
    """
    written_at = item.get("written_at")
    if not written_at:
        return False, 0
    written_at_ms = int(Decimal(str(written_at)))
    now_ms = int(time.time() * 1000)
    age_ms = now_ms - written_at_ms
    stale = age_ms > STALENESS_THRESHOLD_MS
    return stale, age_ms

def escalate_to_primary():
    """
    Write /fis-global/read-routing = 'primary' to SSM and update the
    local cache. The rerouter Lambda resets it to 'replica' when the
    CloudWatch alarm clears.
    """
    try:
        ssm.put_parameter(
            Name=READ_ROUTING_PARAM,
            Value="primary",
            Type="String",
            Overwrite=True,
        )
        _ssm_cache["value"] = "primary"
        _ssm_cache["fetched_at"] = time.time()
        logger.warning("SSM escalated to 'primary' — stale replication detected")
    except Exception as exc:
        logger.warning("SSM escalation failed (will retry next invocation): %s", exc)

# In the handler: replica read → staleness check → fallback → escalate
if routing == "primary":
    item = read_item(table_primary, pk, sk, consistent=True)
    read_from = "primary"
    routing_reason = "ssm_flag"
else:
    item = read_item(table_replica, pk, sk, consistent=False)
    read_from = "replica"
    if item:
        stale, age_ms = is_stale(item)
        if stale:
            logger.warning("STALE DATA: pk=%s age=%dms — falling back to primary", pk, age_ms)
            item = read_item(table_primary, pk, sk, consistent=True)
            read_from = "primary"
            routing_reason = "stale_data"
            stale_detected = True
            escalate_to_primary()
    # ... replica_miss_fallback when item not on replica yet ...
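Notice that escalate_to_primary updates a small in-memory cache (_ssm_cache) so the routing flag isn't fetched from SSM on every invocation. Here is a minimal sketch of how such a TTL cache can work; the TTL value and helper name are illustrative, not the repo's exact code:

```python
import time

SSM_CACHE_TTL_S = 5  # illustrative TTL; tune to your tolerance for routing lag

_ssm_cache = {"value": "replica", "fetched_at": 0.0}

def get_routing(ssm_client, param_name: str) -> str:
    """Return the current routing flag, calling SSM at most once per TTL window."""
    now = time.time()
    if now - _ssm_cache["fetched_at"] < SSM_CACHE_TTL_S:
        return _ssm_cache["value"]
    resp = ssm_client.get_parameter(Name=param_name)
    _ssm_cache["value"] = resp["Parameter"]["Value"]
    _ssm_cache["fetched_at"] = now
    return _ssm_cache["value"]
```

The cache matters here: without it, every read would pay an SSM round trip, and during an experiment the parameter would be hammered by both the reader and the rerouter.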

Automated Read Rerouting

While the reader Lambda handles per-request resilience, the architecture also includes automated, infrastructure-level routing recovery: a CloudWatch alarm monitors the DynamoDB ReplicationLatency metric.

If the latency exceeds 5 seconds, the alarm publishes an SNS notification, which invokes the rerouter Lambda. The latency threshold can, of course, be customized to your architecture's needs.

The rerouter Lambda updates the routing parameter:

  • primary during degraded replication
  • replica once replication returns to normal

Here is the rerouter Lambda code:

import json
import logging
import os

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Both values come from the stack's environment configuration
PRIMARY_REGION = os.environ["PRIMARY_REGION"]
READ_ROUTING_PARAM = os.environ["READ_ROUTING_PARAM"]
ssm = boto3.client("ssm", region_name=PRIMARY_REGION)

def handler(event, context):
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        alarm_state = message.get("NewStateValue")
        alarm_name = message.get("AlarmName", "unknown")
        alarm_reason = message.get("NewStateReason", "")

        logger.info(
            "Alarm notification: name=%s state=%s reason=%s",
            alarm_name, alarm_state, alarm_reason,
        )

        if alarm_state == "ALARM":
            new_value = "primary"
            logger.warning(
                "Replication latency alarm FIRING — rerouting reads to primary (%s)",
                PRIMARY_REGION,
            )
        elif alarm_state == "OK":
            new_value = "replica"
            logger.info(
                "Replication latency alarm RESOLVED — restoring reads to replica"
            )
        else:
            logger.info("Ignoring alarm state '%s', no routing change", alarm_state)
            continue

        ssm.put_parameter(
            Name=READ_ROUTING_PARAM,
            Value=new_value,
            Type="String",
            Overwrite=True,
        )
        logger.info("SSM '%s' updated to '%s'", READ_ROUTING_PARAM, new_value)

    return {"statusCode": 200}

Defining Chaos Experiments with AWS Fault Injection Simulator Templates

To validate the design, the project defines three chaos experiments using AWS FIS. Each experiment targets a different way the architecture in this stack can fail.

1. Replication Pause

The first experiment pauses replication for 3 minutes on the DynamoDB Global Table.

This creates a real-world scenario where:

  • writes continue in the primary
  • the replica falls behind
  • the reader detects stale data
  • the system falls back to the primary

This experiment validates the staleness detection and routing escalation logic.

from aws_cdk import aws_fis as fis

# ...

# FIS Experiment Template: Pause Replication + Reroute (3 minutes)
fis.CfnExperimentTemplate(
    self,
    "PauseAndRerouteExperiment",
    description=(
        "Pause replication for 3 min — reader self-escalates SSM to primary "
        "on first stale detect; CW alarm resets it on recovery"
    ),
    role_arn=fis_role.role_arn,
    stop_conditions=[
        fis.CfnExperimentTemplate.ExperimentTemplateStopConditionProperty(
            source="none"
        )
    ],
    targets={
        "globalTable": fis.CfnExperimentTemplate.ExperimentTemplateTargetProperty(
            resource_type="aws:dynamodb:global-table",
            resource_arns=[global_table.table_arn],
            selection_mode="ALL",
        )
    },
    actions={
        "pauseReplication": fis.CfnExperimentTemplate.ExperimentTemplateActionProperty(
            action_id="aws:dynamodb:global-table-pause-replication",
            description="Stop replication to replica for 3 minutes",
            parameters={"duration": "PT3M"},
            targets={"Tables": "globalTable"},
        ),
    },
    tags={"Experiment": "GlobalTable-PauseAndReroute"},
)

2. Lambda Latency Injection

The second experiment introduces artificial startup delays in the Lambda reader.

Using the FIS Lambda extension, we inject:

  • 12-second invocation delays
  • lasting 2 minutes

Even when replication is healthy, this delay can cause the reader to observe stale data relative to new writes.

This tests whether the system can handle latency-induced inconsistencies.

from aws_cdk import aws_fis as fis

# ...

fis.CfnExperimentTemplate(
    self,
    "InvocationDelayExperiment",
    description=(
        "Inject 12 s startup delay into 100% of reader Lambda invocations for 2 min. "
        "Delay > STALENESS_THRESHOLD_MS → stale-data fallback fires even with healthy replication."
    ),
    role_arn=fis_role.role_arn,
    stop_conditions=[
        fis.CfnExperimentTemplate.ExperimentTemplateStopConditionProperty(
            source="none"
        )
    ],
    targets={
        "readerFn": fis.CfnExperimentTemplate.ExperimentTemplateTargetProperty(
            resource_type="aws:lambda:function",
            resource_arns=[reader_lambda.function_arn],
            selection_mode="ALL",
        )
    },
    actions={
        "addDelay": fis.CfnExperimentTemplate.ExperimentTemplateActionProperty(
            action_id="aws:lambda:invocation-add-delay",
            description="Add 12 s pre-handler delay to every reader invocation",
            parameters={
                "duration": "PT2M",
                "invocationPercentage": "100",
                "startupDelayMilliseconds": "12000",
            },
            targets={"Functions": "readerFn"},
        ),
    },
    tags={"Experiment": "Lambda-InvocationDelay"},
)
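To see why this experiment trips the fallback, compare the injected delay against the reader's threshold. A simplified version of the staleness check makes the arithmetic explicit:

```python
# Values from the reader Lambda and the FIS template above
STALENESS_THRESHOLD_MS = 10_000
STARTUP_DELAY_MS = 12_000

def is_stale(age_ms: int) -> bool:
    """Simplified version of the reader's staleness check."""
    return age_ms > STALENESS_THRESHOLD_MS

# An item written right before the invocation is already STARTUP_DELAY_MS
# old by the time the delayed handler reads it, so the stale-data
# fallback fires even though replication itself is healthy.
assert is_stale(STARTUP_DELAY_MS)
assert not is_stale(9_000)
```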

3. Lambda Invocation Errors

The final experiment forces 100% invocation failures for the targeted Lambda (in this case, our reader Lambda), so the handler never executes. This simulates:

  • dependency outages
  • runtime crashes during the Lambda initialization phase
  • IAM permission errors

This validates the replica error fallback logic, ensuring the application gracefully handles failures instead of returning errors to clients.

from aws_cdk import aws_fis as fis

# ...

fis.CfnExperimentTemplate(
    self,
    "InvocationErrorExperiment",
    description=(
        "Force reader Lambda to return an error on 100% of invocations for 2 min. "
        "Triggers replica_error_fallback path in reader."
    ),
    role_arn=fis_role.role_arn,
    stop_conditions=[
        fis.CfnExperimentTemplate.ExperimentTemplateStopConditionProperty(
            source="none"
        )
    ],
    targets={
        "readerFn": fis.CfnExperimentTemplate.ExperimentTemplateTargetProperty(
            resource_type="aws:lambda:function",
            resource_arns=[reader_lambda.function_arn],
            selection_mode="ALL",
        )
    },
    actions={
        "injectError": fis.CfnExperimentTemplate.ExperimentTemplateActionProperty(
            action_id="aws:lambda:invocation-error",
            description="Return error response without executing reader handler",
            parameters={
                "duration": "PT2M",
                "invocationPercentage": "100",
                "preventExecution": "true",
            },
            targets={"Functions": "readerFn"},
        ),
    },
    tags={"Experiment": "Lambda-InvocationError"},
)
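The fallback path this experiment exercises can be sketched as a small caller-side pattern. The read functions below are stubs and the helper names are illustrative; the real reader's helpers live in the repo:

```python
def read_with_fallback(read_replica, read_primary, pk: str):
    """Try the replica first; on any error, fall back to a strongly
    consistent read from the primary."""
    try:
        return read_replica(pk), "replica", "ok"
    except Exception:
        # Mirrors the replica_error_fallback path exercised by the
        # FIS invocation-error experiment.
        return read_primary(pk), "primary", "replica_error_fallback"

# Stubs simulating an injected failure on the replica side
def failing_replica(pk):
    raise RuntimeError("FIS injected error")

def healthy_primary(pk):
    return {"pk": pk, "payload": "fresh"}

item, source, reason = read_with_fallback(failing_replica, healthy_primary, "user#1")
```

The client sees a successful response served from the primary instead of an error, which is exactly the behavior the experiment is meant to validate.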

Combating Lambda initialization issues when starting the FIS experiment

During development, I kept getting errors from the reader Lambda saying that it doesn't support the AWS FIS experiment. The solution, it turns out, is in the documentation.

Every Lambda that is a target in an AWS FIS experiment template needs a special Lambda layer added to its configuration. In my case, the ARN of the layer was:

"arn:aws:lambda:us-east-1:211125607513:layer:aws-fis-extension-x86_64:280"

and it can be found in the Parameter Store service on AWS. Keep in mind that your region and your Lambda architecture (x86 or ARM) both determine which layer ARN you need.
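Assuming the ARN pattern above generalizes, you can assemble it from its parts. Note that the account ID and layer version are region-specific, so treat this builder as illustrative and resolve the real ARN from Parameter Store:

```python
def fis_layer_arn(region: str, account_id: str, arch: str, version: int) -> str:
    """Build a FIS extension layer ARN following the pattern seen above.
    The account ID and version vary by region; look them up in SSM
    Parameter Store rather than hard-coding them."""
    return (
        f"arn:aws:lambda:{region}:{account_id}:layer:"
        f"aws-fis-extension-{arch}:{version}"
    )

# Reproduces the ARN used in this post (us-east-1, x86_64):
arn = fis_layer_arn("us-east-1", "211125607513", "x86_64", 280)
```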

In addition, to run an FIS experiment on a Lambda, you need an S3 bucket where FIS writes its fault configuration, and you need to define the AWS_FIS_CONFIGURATION_LOCATION and AWS_LAMBDA_EXEC_WRAPPER environment variables.

AWS_FIS_CONFIGURATION_LOCATION needs to be set to the ARN of your S3 location, while AWS_LAMBDA_EXEC_WRAPPER needs to be set to /opt/aws-fis/bootstrap.

More information can be found in the official AWS docs.

The final CDK code for the Lambda looks as follows:

reader_lambda = lambda_.Function(
    self,
    "ReaderLambda",
    function_name="fis-global-reader",
    runtime=lambda_.Runtime.PYTHON_3_12,
    handler="reader.handler",
    code=lambda_.Code.from_asset(
        "lambda",
        exclude=[".venv", "venv", "__pycache__", "cdk.out", "*.pyc", "node_modules", ".git"],
    ),
    role=lambda_role,
    # Timeout must exceed the max FIS startup delay (12,000 ms) plus execution time.
    timeout=Duration.seconds(30),
    memory_size=256,
    architecture=lambda_.Architecture.X86_64,  # must match the FIS layer's architecture!
    layers=[
        lambda_.LayerVersion.from_layer_version_arn(
            self,
            "FisExtensionLayer",
            # This layer is required: the FIS experiment won't work without it!
            "arn:aws:lambda:us-east-1:211125607513:layer:aws-fis-extension-x86_64:280",
        )
    ],
    environment={
        # ...
        # FIS writes fault config to this S3 ARN; the extension layer
        # reads it before each invocation.
        "AWS_FIS_CONFIGURATION_LOCATION": fis_config_location_arn,
        # Required alongside the FIS extension layer to intercept invocations.
        "AWS_LAMBDA_EXEC_WRAPPER": "/opt/aws-fis/bootstrap",
    },
    log_group=reader_log_group,
)

Also, please make sure you give the Lambda the necessary IAM permissions! The following role will do the trick:

lambda_role = iam.Role(
    self,
    "LambdaRole",
    assumed_by=iam.ServicePrincipal("lambda.amazonaws.com"),
    managed_policies=[
        iam.ManagedPolicy.from_aws_managed_policy_name(
            "service-role/AWSLambdaBasicExecutionRole"
        )
    ],
)
global_table.grant_read_write_data(lambda_role)
read_routing_param.grant_read(lambda_role)
read_routing_param.grant_write(lambda_role)
chaos_param.grant_read(lambda_role)

# grant_read_write_data only covers the primary ARN; the reader Lambda
# also talks to the eu-central-1 replica which has a distinct ARN.
lambda_role.add_to_policy(
    iam.PolicyStatement(
        actions=[
            "dynamodb:GetItem",
            "dynamodb:Query",
            "dynamodb:Scan",
            "dynamodb:BatchGetItem",
        ],
        resources=[
            f"arn:aws:dynamodb:{replica_region}:{self.account}:table/{global_table.table_name}",
            f"arn:aws:dynamodb:{replica_region}:{self.account}:table/{global_table.table_name}/index/*",
        ],
    )
)

# The FIS Lambda extension layer (attached to reader Lambda) must be
# able to ListBucket and GetObject from the FIS config S3 bucket.
# Required by: https://docs.aws.amazon.com/fis/latest/userguide/use-lambda-actions.html#lambda-prerequisites
lambda_role.add_to_policy(
    iam.PolicyStatement(
        sid="AllowListingFisConfigLocation",
        actions=["s3:ListBucket"],
        resources=[fis_config_bucket.bucket_arn],
        conditions={
            "StringLike": {
                "s3:prefix": [f"{fis_config_prefix}*"]
            }
        },
    )
)
lambda_role.add_to_policy(
    iam.PolicyStatement(
        sid="AllowReadingFisConfigObjects",
        actions=["s3:GetObject"],
        resources=[f"{fis_config_bucket.bucket_arn}/{fis_config_prefix}*"],
    )
)

# The FIS extension running inside the reader Lambda needs to call back
# to the FIS service to check active experiment parameters.
lambda_role.add_to_policy(
    iam.PolicyStatement(
        actions=["fis:GetExperiment", "fis:ListExperiments"],
        resources=["*"],
    )
)

Testing the System

To exercise the architecture, the project includes a Python test harness.

The script test_global_chaos.py runs several scenarios:

  • Baseline testing under healthy replication
  • Immediate read-after-write conditions
  • Version drift during replication pauses
  • Latency injection experiments
  • Lambda error injection tests

These tests validate both data correctness and routing behavior under chaos conditions.

Starting the “DynamoDB Pause Replication” experiment

The test harness also includes a watch mode, which loops through the following steps:

  1. write the data
  2. read the data back from the replica, as long as no issues are reported

While an FIS experiment runs, you can observe the system transitioning through multiple states:

  1. Normal replica reads
  2. Replica becomes stale
  3. Fallback to primary
  4. SSM routing escalation
  5. Replica recovery
  6. Routing reset
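These transitions can be modeled as a toy state function, using stubbed inputs and the same 10-second threshold as the reader Lambda:

```python
STALENESS_THRESHOLD_MS = 10_000  # same threshold as the reader Lambda

def next_routing(routing: str, replica_age_ms: int, alarm_state: str) -> str:
    """Toy model of the routing transitions listed above."""
    if routing == "replica" and replica_age_ms > STALENESS_THRESHOLD_MS:
        return "primary"   # reader self-escalates via SSM
    if routing == "primary" and alarm_state == "OK":
        return "replica"   # rerouter restores replica reads on recovery
    return routing

# healthy → replica goes stale → experiment still running → recovered
timeline = [(500, "OK"), (15_000, "ALARM"), (20_000, "ALARM"), (800, "OK")]
routing, history = "replica", []
for age_ms, alarm in timeline:
    routing = next_routing(routing, age_ms, alarm)
    history.append(routing)
# history == ["replica", "primary", "primary", "replica"]
```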

This creates a clear visualization of resilience behavior over time. You can run this experiment with the following commands:

# Deploy the stack first and make sure you have the region bootstrapped!
cdk bootstrap
cdk deploy

# Install the needed dependencies
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt

# Run the Python script
python3 test_global_chaos.py --watch

Here is the transition: first the system works as expected without the experiment running, then I manually start the experiment in the FIS console. You can see the data start to go stale and the Lambda switch to the primary database right away:

Reader Lambda moving from Replica to Primary Database

After stopping the experiment, the system goes back to normal, taking data from the replica database:

Reader Lambda moving back to Replica Database

I've made this script not start the experiment on purpose, so you can start and stop it as you wish; the other experiments are designed to be shorter.

Simulating the Lambda delay

Now, we are going to give our reader Lambda a 12-second delay before the handler code runs. You can start this experiment with:

python3 test_global_chaos.py --scenario delay

and everything will be started for you; the output should look like:

Running the AWS FIS Lambda Delay Experiment

It is clear the experiment is working: the Lambda reports that the data is stale, and the execution time is well above the 12-second mark.

Simulating the Lambda errors

Finally, we are going to simulate failure of the reader Lambda by running:

python3 test_global_chaos.py --scenario error

again, the script will run everything for you:

Running the AWS FIS Lambda Error Experiment

Pricing

Now it’s time to look at pricing.

AWS Fault Injection Simulator

Based on the AWS pricing docs, AWS FIS is priced at $0.10 per action-minute.

So, for example:

3 experiments × 3 minutes, with one action each, is 9 action-minutes, which would cost you $0.90.

This is definitely the largest cost of the project; it cost me just over $2 to build from scratch.

DynamoDB Global Tables

The pricing components are:

  • write request units
  • storage
  • replication write units

Example assumption:

  • 10k test writes
  • 10k reads
  • small dataset (<100 MB)

Approximate cost would be:

  • Writes: 10k × $1.25 / million ≈ $0.0125
  • Reads: 10k × $0.25 / million ≈ $0.0025
  • Replication: similar to write cost ≈ $0.0125

Total DynamoDB cost is around $0.03 USD.
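The arithmetic behind this estimate can be reproduced in a few lines. The rates are the ones assumed above (verify against current AWS pricing for your region), and the sketch treats replicated-write cost as equal to the write cost:

```python
# Assumed on-demand rates in USD per million request units
WRITE_PER_MILLION = 1.25
READ_PER_MILLION = 0.25

writes_cost = 10_000 * WRITE_PER_MILLION / 1_000_000   # $0.0125
reads_cost = 10_000 * READ_PER_MILLION / 1_000_000     # $0.0025
replication_cost = writes_cost                         # replicated writes
total = writes_cost + reads_cost + replication_cost    # ≈ $0.0275, i.e. "around $0.03"
```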

Lambda

Lambda is one of the cheaper services in this project:

  • $0.0000166667 for every GB-second
  • $0.20 per 1M requests

If you are not using your AWS account often, this will most likely be covered by AWS Free Tier.

SSM Parameter Store

As we are using the standard parameters, they are free as per these docs!

Final Thoughts

Many distributed architectures assume that multi-region databases provide seamless replication with minimal design considerations.

In reality, applications must still handle:

  • replication delays
  • partial failures
  • latency anomalies
  • degraded infrastructure

By combining:

  • DynamoDB Global Tables
  • AWS Fault Injection Simulator
  • Application-level fallback logic

we can design systems that are explicitly tested against real-world failure scenarios, not just theoretical ones. With these code snippets and the whole GitHub repository as your reference, you have the tools to make your architecture even better.
