kienly

Posted on Jan 16

Building a Near Real-Time Analytics Pipeline with AWS Zero-ETL

#analytics #aws #database #dataengineering

TL;DR: This guide shows how to replicate transactional data from Amazon Aurora (MySQL) to Amazon Redshift using AWS Zero-ETL, including Aurora binlog configuration, a sample SAM template, and deployment notes.

Introduction

AWS Zero-ETL enables near real-time replication from Aurora to Redshift without custom ETL code. This guide covers the required Aurora settings, a Lambda ingestion example, deployment with AWS SAM, and the steps to configure the Zero-ETL integration for analytics in Redshift.

What is Zero-ETL?

Zero-ETL is AWS's managed integration that reduces the need for custom extract-transform-load pipelines by streaming change data from Aurora into Redshift.

flowchart LR
    subgraph Traditional
        S1[Source] --> E[Extract] --> T[Transform] --> L[Load] --> D1[DW]
    end

flowchart LR
    subgraph ZeroETL["Zero-ETL"]
        S2[Aurora] -->|CDC Auto| D2[Redshift]
    end

Key benefits:

No pipeline code to maintain
Near real-time replication (seconds to minutes latency)
Transactional consistency
Automatic schema mapping between source and target

Prerequisites

Ensure you have the following before proceeding:

AWS CLI v2 configured with appropriate permissions
AWS SAM CLI installed (pip install aws-sam-cli)
Python 3.12+
Docker (for SAM local testing)
An AWS account with permissions for RDS, Redshift, Lambda, VPC

Step 1: Project Structure

zero-etl-demo/
├── template.yaml
├── samconfig.toml
├── src/lambda/
│   ├── handler.py
│   ├── container.py
│   ├── service.py
│   └── requirements.txt
├── db/schema.sql
└── scripts/setup_zero_etl.sh

Step 2: SAM Template (Infrastructure as Code)

Aurora MySQL Configuration

DBClusterParameterGroup:
  Type: AWS::RDS::DBClusterParameterGroup
  Properties:
    Family: aurora-mysql8.0
    Parameters:
      binlog_format: ROW
      aurora_enhanced_binlog: "1"
      binlog_row_image: FULL

AuroraDBCluster:
  Type: AWS::RDS::DBCluster
  Properties:
    Engine: aurora-mysql
    EngineVersion: 8.0.mysql_aurora.3.04.0
    StorageEncrypted: true
    DBClusterParameterGroupName: !Ref DBClusterParameterGroup

Note: After CloudFormation creates the cluster with the parameter group, reboot the writer instance via Console or CLI for binlog parameters to apply.

Zero-ETL requirements:

Engine version must be Aurora MySQL 3.04.0+
Storage encryption must be enabled
Binary logging must be ROW format
Writer instance must be rebooted after parameter group changes

Lambda Function

IngestFunction:
  Type: AWS::Serverless::Function
  Properties:
    Runtime: python3.12
    Handler: handler.lambda_handler
    VpcConfig:
      SecurityGroupIds:
        - !Ref LambdaSecurityGroup
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
    Events:
      ScheduledIngestion:
        Type: Schedule
        Properties:
          Schedule: rate(5 minutes)

Step 3: Lambda Code with Dependency Injection

flowchart TB
    Handler --> Container
    Container --> APIClient
    Container --> Database
    Container --> Processor
    APIClient --> Service[ZeroETLService]
    Database --> Service
    Processor --> Service

DI Container

class ServiceContainer:
    def __init__(self):
        self._factories = {}
        self._instances = {}

    def register(self, name: str, factory: Callable):
        self._factories[name] = factory

    def resolve(self, name: str):
        if name not in self._instances:
            self._instances[name] = self._factories[name](self)
        return self._instances[name]

Lambda Handler

def lambda_handler(event, context):
    container = None
    try:
        container = create_container()
        service = container.resolve("zero_etl_service")
        summary = service.run_full_ingestion()

        return {
            "statusCode": 200 if summary.success else 207,
            "body": json.dumps({
                "run_id": summary.run_id,
                "total_records": summary.total_records,
            })
        }
    finally:
        if container:
            container.dispose()

Step 4: Database Schema

Note: Foreign keys with CASCADE are not supported by Zero-ETL.

-- This will fail with Zero-ETL
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE

-- Use RESTRICT instead
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE RESTRICT

CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL
) ENGINE=InnoDB;

CREATE TABLE posts (
    id INT PRIMARY KEY,
    user_id INT NOT NULL,
    title VARCHAR(500) NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE RESTRICT
) ENGINE=InnoDB;

Step 5: Deploy the Stack

sam build
sam deploy --guided

After deployment, obtain the Aurora endpoint, reboot the writer instance to apply binlog settings, and run the schema script.

# Get Aurora endpoint
ENDPOINT=$(aws cloudformation describe-stacks \
    --stack-name zero-etl-demo \
    --query 'Stacks[0].Outputs[?OutputKey==`AuroraClusterEndpoint`].OutputValue' \
    --output text)

# Reboot writer instance
aws rds reboot-db-instance --db-instance-identifier zero-etl-demo-instance-1

# Wait for instance to be available, then run schema
mysql -h $ENDPOINT -u admin -p < db/schema.sql

Step 6: Set Up Zero-ETL Integration

aws rds create-integration \
    --integration-name zero-etl-demo \
    --source-arn arn:aws:rds:us-east-1:123456789:cluster:my-cluster \
    --target-arn arn:aws:redshift-serverless:us-east-1:123456789:namespace/my-ns

Note: Initial synchronization may take 20–30 minutes.

Step 7: Query Analytics in Redshift

CREATE DATABASE zero_etl_db FROM INTEGRATION 'arn:aws:rds:...';

-- Posts per user
SELECT u.name, COUNT(p.id) as post_count
FROM zero_etl_db.public.users u
LEFT JOIN zero_etl_db.public.posts p ON u.id = p.user_id
GROUP BY u.id, u.name
ORDER BY post_count DESC;

Data Flow Summary

sequenceDiagram
    participant EB as EventBridge
    participant L as Lambda
    participant API as JSONPlaceholder
    participant A as Aurora
    participant R as Redshift

    EB->>L: Trigger (5 min)
    L->>API: GET /users, /posts
    API-->>L: JSON
    L->>A: Batch INSERT
    Note over A,R: Zero-ETL CDC (auto)
    A-->>R: Replicate

Monitoring

Metrics to monitor include:

Lambda: Duration, Errors
RDS: ZeroETLIntegrationLatency (or equivalent)

Cost Considerations

Service	Est. Monthly Cost
Aurora MySQL	~$200
Lambda	~$5
NAT Gateway	~$35
Redshift	~$50
Total	~$290

Limitations

Foreign keys with CASCADE are not supported — use RESTRICT instead.
Reboot is required after parameter group changes for binlog settings to apply.
Initial synchronization can take 20–30 minutes.
Aurora engine version requirements apply (3.04.0+).

Conclusion

AWS Zero-ETL provides a managed replication path between Aurora and Redshift that reduces operational overhead and supports near-real-time analytics. It is suitable for cases where replication without custom ETL logic is desired.

Resources

Questions or feedback are welcome. Connect via LinkedIn for further discussion.

DEV Community