MWAA + Glue + Iceberg + Snowflake
Batch data pipelines are often far more expensive and complex than they need to be.
Many teams still operate always-on schedulers, persistent Spark clusters, and long-running infrastructure for workloads that:
- Run a few times per day
- Complete in minutes
- Are triggered by data arrival, not time
This post walks through a pragmatic, event-driven serverless data architecture on AWS that focuses on real cost reduction and operational simplification, not architectural theory.
The Core Problem: Paying for Idle Data Infrastructure
A traditional batch pipeline commonly includes:
- Always-running Airflow workers
- Persistent EMR or Spark clusters
- Cron-based scheduling for event-driven data
- Infrastructure sized for peak usage
In practice, this means teams pay for:
- Idle CPU and memory
- Idle orchestration capacity
- Ongoing patching and operational overhead
For many pipelines, most of the cost is spent waiting.
Design Principles
This architecture is built around a few non-negotiable principles:
- Event-driven first, schedule only when necessary
- Fully serverless wherever possible
- Task-level isolation
- Pay only when something executes
- Open storage formats to avoid lock-in
The system reacts to data. It does not sit idle waiting for a clock.
Architecture Overview
High-level flow:
- Data arrives in Amazon S3 or an upstream system
- An event (for example, S3 object creation) triggers orchestration
- Amazon MWAA (Serverless) coordinates the workflow
- AWS Glue (Serverless) executes transformations
- Data is written as Apache Iceberg tables in Amazon S3
- Tables are registered in the AWS Glue Data Catalog
- Snowflake queries the data using external tables
The key shift is reactive execution — pipelines run because data changed, not because time passed.
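To make the trigger at the top of this flow concrete, here is a minimal sketch of an EventBridge rule that matches S3 object-creation events, using boto3. The bucket name, prefix, and rule name are placeholders, and wiring the rule to its target (the orchestration entry point) is a separate, account-specific step.

```python
import json
import boto3

events = boto3.client("events")

# Hypothetical bucket and prefix; EventBridge notifications must be enabled on the bucket.
event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": ["raw-landing-bucket"]},
        "object": {"key": [{"prefix": "orders/"}]},
    },
}

events.put_rule(
    Name="orders-landed",                    # placeholder rule name
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)
# The rule's target (the workflow that should run) is attached separately with put_targets.
```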
Event-Driven Orchestration with MWAA Serverless
Airflow is still used, but only for what it does best:
- Dependency management
- Retry semantics
- Visibility and auditability
- Coordinating multiple services
With MWAA Serverless:
- There are no always-on workers
- There is no capacity planning
- There is no idle orchestration cost
Events (for example, S3 notifications via EventBridge) trigger DAG runs only when new data arrives. MWAA spins up to coordinate execution and scales back down afterward.
Airflow becomes control flow, not infrastructure.
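A minimal DAG sketch shows what "control flow only" looks like in practice. It assumes the Amazon provider package's GlueJobOperator and pre-existing Glue jobs; the DAG and job names are illustrative, and the DAG has no schedule, so it runs only when an event triggers it.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

# Illustrative DAG: no schedule, runs only when triggered by an upstream event.
with DAG(
    dag_id="orders_ingest",
    start_date=datetime(2024, 1, 1),
    schedule=None,          # event-triggered, never cron-driven
    catchup=False,
) as dag:

    clean = GlueJobOperator(
        task_id="clean_orders",
        job_name="clean_orders",              # hypothetical Glue job
        script_args={"--source_prefix": "orders/"},
    )

    publish = GlueJobOperator(
        task_id="publish_orders",
        job_name="publish_orders",            # hypothetical Glue job
    )

    clean >> publish
```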
Glue Serverless as Event-Driven Compute
Each transformation step is implemented as a small, purpose-built Glue job:
- One responsibility per job
- No shared cluster assumptions
- Independent scaling and retries
From a cost perspective:
- Jobs run only when triggered
- There is no idle cluster time
- Failures are isolated and cheap to retry
Instead of paying for a Spark cluster all day, you pay per execution.
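As a sketch of what one of these small jobs might look like, assuming the job is configured for Iceberg (for example via the `--datalake-formats iceberg` job parameter and the matching Spark catalog configuration), and with all bucket, catalog, database, and table names as placeholders:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_prefix"])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read only the newly arrived objects (the prefix is passed in by the triggering event).
incoming = spark.read.json(f"s3://raw-landing-bucket/{args['source_prefix']}")

# Append to an Iceberg table registered in the Glue Data Catalog.
# "glue_catalog", "analytics", and "orders" are illustrative names.
incoming.writeTo("glue_catalog.analytics.orders").append()

job.commit()
```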
Why Apache Iceberg Enables Cost Reduction
Apache Iceberg is foundational to making this architecture efficient.
Iceberg enables:
- Schema evolution without rewriting entire tables
- Partition evolution without backfills
- Snapshot-based time travel for recovery
- Multiple engines reading the same data
From a cost perspective:
- No duplicate datasets per consumer
- No full-table rewrites for small schema changes
- No tight coupling between producers and consumers
Iceberg supports incremental, event-driven writes without downstream reprocessing.
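These properties are exposed as ordinary SQL against table metadata. A few hedged examples, assuming the same illustrative glue_catalog.analytics.orders table, a recent Spark version, and the Iceberg SQL extensions enabled:

```python
# Schema evolution: adding a column is a metadata-only change, no data files are rewritten.
spark.sql("ALTER TABLE glue_catalog.analytics.orders ADD COLUMN discount_pct double")

# Partition evolution: new writes use the new spec; existing files stay where they are.
spark.sql("ALTER TABLE glue_catalog.analytics.orders ADD PARTITION FIELD days(order_ts)")

# Time travel: read the table as of an earlier snapshot for recovery or auditing.
spark.sql(
    "SELECT * FROM glue_catalog.analytics.orders "
    "VERSION AS OF 8744736658442914487"   # illustrative snapshot id
).show()
```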
Surfacing Data to Snowflake Without Duplication
Snowflake reads the Iceberg tables in place through external tables backed by Amazon S3 and the Glue Data Catalog.
This approach:
- Avoids copying data into Snowflake-managed storage
- Makes data available immediately after it is written
- Keeps storage costs centralized in S3
If performance requirements change later, data can still be materialized — but duplication becomes a deliberate choice, not a default.
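One way to realize this on the Snowflake side is through Snowflake's Glue catalog integration and externally managed Iceberg tables, sketched below with the Python connector. The exact DDL options depend on your Snowflake edition and setup; the integration name, role ARN, account ID, external volume, and table names are all placeholders.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # placeholder connection details
    user="pipeline_user",
    password="...",
)
cur = conn.cursor()

# One-time setup: point Snowflake at the Glue Data Catalog.
cur.execute("""
    CREATE CATALOG INTEGRATION glue_int
      CATALOG_SOURCE = GLUE
      CATALOG_NAMESPACE = 'analytics'
      TABLE_FORMAT = ICEBERG
      GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-glue-access'
      GLUE_CATALOG_ID = '123456789012'
      ENABLED = TRUE
""")

# Expose the Iceberg table without copying data into Snowflake-managed storage.
cur.execute("""
    CREATE ICEBERG TABLE orders
      EXTERNAL_VOLUME = 'lake_s3_volume'
      CATALOG = 'glue_int'
      CATALOG_TABLE_NAME = 'orders'
""")

cur.close()
conn.close()
```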
Where the Cost Savings Actually Come From
This architecture removes several major cost drivers.
Traditional Pipeline Costs
- 24/7 Airflow workers
- Always-on Spark or EMR clusters
- Idle compute between scheduled runs
- Operational effort maintaining infrastructure
Even small clusters add up over time.
Costs Removed and Introduced by This Architecture
Eliminated
- Idle Airflow capacity
- Persistent Spark clusters
- Long-running EC2 instances
- Custom metastore infrastructure
Introduced
- Per-event MWAA execution cost
- Per-job Glue runtime cost
- Object storage costs in S3
In practice, teams often see:
- Near-zero idle compute spend
- Costs directly proportional to data volume
- Predictable per-run pricing
For short-lived batch workloads, this frequently results in meaningful cost reduction without sacrificing capability.
Operational Simplification (The Hidden Savings)
Cost is not just dollars.
This architecture also reduces:
- On-call surface area
- Patch and upgrade cycles
- Capacity planning work
- Failure blast radius
Fewer always-on systems means fewer things that can fail silently.
Tradeoffs to Be Aware Of
This pattern does introduce responsibilities:
- Event-driven pipelines require idempotent design
- Iceberg requires schema and table discipline
- External tables may not suit all query patterns
These are engineering tradeoffs, not infrastructure problems.
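The first tradeoff, idempotency, is worth making concrete: if the same S3 event is delivered twice, the pipeline should not duplicate rows. One common approach is to upsert by key with MERGE instead of blindly appending. A hedged sketch, reusing the illustrative incoming DataFrame and table from the Glue example above:

```python
# Idempotent load: re-running the same batch produces the same table state.
incoming.createOrReplaceTempView("incoming_orders")

spark.sql("""
    MERGE INTO glue_catalog.analytics.orders AS t
    USING incoming_orders AS s
      ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```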
When This Pattern Works Best
Strong fit
- Event-driven or near-real-time batch ingestion
- Teams optimizing for cost and simplicity
- Lakehouse or multi-engine environments
Less ideal
- Ultra-low-latency streaming
- Always-on interactive workloads
- Extremely large, tightly coupled transformations
Final Thoughts
Serverless data architectures are not about removing structure.
They are about aligning cost and complexity with reality.
By combining MWAA Serverless, Glue Serverless, and Apache Iceberg, teams can build pipelines that:
- React to data instead of schedules
- Eliminate idle compute
- Scale naturally
- Remain flexible as requirements evolve
In many cases, the simplest architecture is also the most cost-effective one.