DEV Community

kienly
kienly

Posted on

Building a Near Real-Time Analytics Pipeline with AWS Zero-ETL

TL;DR: This guide shows how to replicate transactional data from Amazon Aurora (MySQL) to Amazon Redshift using AWS Zero-ETL, including Aurora binlog configuration, a sample SAM template, and deployment notes.


Introduction

AWS Zero-ETL enables near real-time replication from Aurora to Redshift without custom ETL code. This guide covers the required Aurora settings, a Lambda ingestion example, deployment with AWS SAM, and the steps to configure the Zero-ETL integration for analytics in Redshift.

What is Zero-ETL?

Zero-ETL is AWS's managed integration that reduces the need for custom extract-transform-load pipelines by streaming change data from Aurora into Redshift.

flowchart LR
    subgraph Traditional
        S1[Source] --> E[Extract] --> T[Transform] --> L[Load] --> D1[DW]
    end
Enter fullscreen mode Exit fullscreen mode
flowchart LR
    subgraph ZeroETL["Zero-ETL"]
        S2[Aurora] -->|CDC Auto| D2[Redshift]
    end
Enter fullscreen mode Exit fullscreen mode

Key benefits:

  • No pipeline code to maintain
  • Near real-time replication (seconds to minutes latency)
  • Transactional consistency
  • Automatic schema mapping between source and target

Prerequisites

Ensure you have the following before proceeding:

  • AWS CLI v2 configured with appropriate permissions
  • AWS SAM CLI installed (pip install aws-sam-cli)
  • Python 3.12+
  • Docker (for SAM local testing)
  • An AWS account with permissions for RDS, Redshift, Lambda, VPC

Step 1: Project Structure

zero-etl-demo/
├── template.yaml
├── samconfig.toml
├── src/lambda/
│   ├── handler.py
│   ├── container.py
│   ├── service.py
│   └── requirements.txt
├── db/schema.sql
└── scripts/setup_zero_etl.sh
Enter fullscreen mode Exit fullscreen mode

Step 2: SAM Template (Infrastructure as Code)

Aurora MySQL Configuration

DBClusterParameterGroup:
  Type: AWS::RDS::DBClusterParameterGroup
  Properties:
    Family: aurora-mysql8.0
    Parameters:
      binlog_format: ROW
      aurora_enhanced_binlog: "1"
      binlog_row_image: FULL

AuroraDBCluster:
  Type: AWS::RDS::DBCluster
  Properties:
    Engine: aurora-mysql
    EngineVersion: 8.0.mysql_aurora.3.04.0
    StorageEncrypted: true
    DBClusterParameterGroupName: !Ref DBClusterParameterGroup
Enter fullscreen mode Exit fullscreen mode

Note: After CloudFormation creates the cluster with the parameter group, reboot the writer instance via Console or CLI for binlog parameters to apply.

Zero-ETL requirements:

  • Engine version must be Aurora MySQL 3.04.0+
  • Storage encryption must be enabled
  • Binary logging must be ROW format
  • Writer instance must be rebooted after parameter group changes

Lambda Function

IngestFunction:
  Type: AWS::Serverless::Function
  Properties:
    Runtime: python3.12
    Handler: handler.lambda_handler
    VpcConfig:
      SecurityGroupIds:
        - !Ref LambdaSecurityGroup
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
    Events:
      ScheduledIngestion:
        Type: Schedule
        Properties:
          Schedule: rate(5 minutes)
Enter fullscreen mode Exit fullscreen mode

Step 3: Lambda Code with Dependency Injection

flowchart TB
    Handler --> Container
    Container --> APIClient
    Container --> Database
    Container --> Processor
    APIClient --> Service[ZeroETLService]
    Database --> Service
    Processor --> Service
Enter fullscreen mode Exit fullscreen mode

DI Container

class ServiceContainer:
    def __init__(self):
        self._factories = {}
        self._instances = {}

    def register(self, name: str, factory: Callable):
        self._factories[name] = factory

    def resolve(self, name: str):
        if name not in self._instances:
            self._instances[name] = self._factories[name](self)
        return self._instances[name]
Enter fullscreen mode Exit fullscreen mode

Lambda Handler

def lambda_handler(event, context):
    container = None
    try:
        container = create_container()
        service = container.resolve("zero_etl_service")
        summary = service.run_full_ingestion()

        return {
            "statusCode": 200 if summary.success else 207,
            "body": json.dumps({
                "run_id": summary.run_id,
                "total_records": summary.total_records,
            })
        }
    finally:
        if container:
            container.dispose()
Enter fullscreen mode Exit fullscreen mode

Step 4: Database Schema

Note: Foreign keys with CASCADE are not supported by Zero-ETL.

-- This will fail with Zero-ETL
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE

-- Use RESTRICT instead
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE RESTRICT
Enter fullscreen mode Exit fullscreen mode
CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL
) ENGINE=InnoDB;

CREATE TABLE posts (
    id INT PRIMARY KEY,
    user_id INT NOT NULL,
    title VARCHAR(500) NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE RESTRICT
) ENGINE=InnoDB;
Enter fullscreen mode Exit fullscreen mode

Step 5: Deploy the Stack

sam build
sam deploy --guided
Enter fullscreen mode Exit fullscreen mode

After deployment, obtain the Aurora endpoint, reboot the writer instance to apply binlog settings, and run the schema script.

# Get Aurora endpoint
ENDPOINT=$(aws cloudformation describe-stacks \
    --stack-name zero-etl-demo \
    --query 'Stacks[0].Outputs[?OutputKey==`AuroraClusterEndpoint`].OutputValue' \
    --output text)

# Reboot writer instance
aws rds reboot-db-instance --db-instance-identifier zero-etl-demo-instance-1

# Wait for instance to be available, then run schema
mysql -h $ENDPOINT -u admin -p < db/schema.sql
Enter fullscreen mode Exit fullscreen mode

Step 6: Set Up Zero-ETL Integration

aws rds create-integration \
    --integration-name zero-etl-demo \
    --source-arn arn:aws:rds:us-east-1:123456789:cluster:my-cluster \
    --target-arn arn:aws:redshift-serverless:us-east-1:123456789:namespace/my-ns
Enter fullscreen mode Exit fullscreen mode

Note: Initial synchronization may take 20–30 minutes.

Step 7: Query Analytics in Redshift

CREATE DATABASE zero_etl_db FROM INTEGRATION 'arn:aws:rds:...';

-- Posts per user
SELECT u.name, COUNT(p.id) as post_count
FROM zero_etl_db.public.users u
LEFT JOIN zero_etl_db.public.posts p ON u.id = p.user_id
GROUP BY u.id, u.name
ORDER BY post_count DESC;
Enter fullscreen mode Exit fullscreen mode

Data Flow Summary

sequenceDiagram
    participant EB as EventBridge
    participant L as Lambda
    participant API as JSONPlaceholder
    participant A as Aurora
    participant R as Redshift

    EB->>L: Trigger (5 min)
    L->>API: GET /users, /posts
    API-->>L: JSON
    L->>A: Batch INSERT
    Note over A,R: Zero-ETL CDC (auto)
    A-->>R: Replicate
Enter fullscreen mode Exit fullscreen mode

Monitoring

Metrics to monitor include:

  • Lambda: Duration, Errors
  • RDS: ZeroETLIntegrationLatency (or equivalent)

Cost Considerations

Service Est. Monthly Cost
Aurora MySQL ~$200
Lambda ~$5
NAT Gateway ~$35
Redshift ~$50
Total ~$290

Limitations

  1. Foreign keys with CASCADE are not supported — use RESTRICT instead.
  2. Reboot is required after parameter group changes for binlog settings to apply.
  3. Initial synchronization can take 20–30 minutes.
  4. Aurora engine version requirements apply (3.04.0+).

Conclusion

AWS Zero-ETL provides a managed replication path between Aurora and Redshift that reduces operational overhead and supports near-real-time analytics. It is suitable for cases where replication without custom ETL logic is desired.


Resources


Questions or feedback are welcome. Connect via LinkedIn for further discussion.

Top comments (0)