MWAA + Glue + Iceberg + Snowflake
Batch data pipelines are often far more expensive and complex than they need to be.
Many teams still operate always-on schedulers, persistent Spark clusters, and long-running infrastructure for workloads that:
- Run a few times per day
- Complete in minutes
- Are triggered by data arrival, not time
This post walks through a pragmatic, event-driven serverless data architecture on AWS that focuses on real cost reduction and operational simplification, not architectural theory.
The Core Problem: Paying for Idle Data Infrastructure
A traditional batch pipeline commonly includes:
- Always-running Airflow workers
- Persistent EMR or Spark clusters
- Cron-based scheduling for event-driven data
- Infrastructure sized for peak usage
In practice, this means teams pay for:
- Idle CPU and memory
- Idle orchestration capacity
- Ongoing patching and operational overhead
For many pipelines, most of the cost is spent waiting.
Design Principles
This architecture is built around a few non-negotiable principles:
- Event-driven first, schedule only when necessary
- Fully serverless wherever possible
- Task-level isolation
- Pay only when something executes
- Open storage formats to avoid lock-in
The system reacts to data. It does not sit idle waiting for a clock.
Architecture Overview
High-level flow:
- Data arrives in Amazon S3 or an upstream system
- An event (for example, S3 object creation) triggers orchestration
- Amazon MWAA (Serverless) coordinates the workflow
- AWS Glue (Serverless) executes transformations
- Data is written as Apache Iceberg tables in Amazon S3
- Tables are registered in the AWS Glue Data Catalog
- Snowflake queries the data using external tables
The key shift is reactive execution — pipelines run because data changed, not because time passed.
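To make the trigger at the top of this flow concrete, here is a minimal sketch of an EventBridge rule that matches S3 object-creation events, using boto3. The bucket name, prefix, and rule name are placeholders, and wiring the rule to its target (the orchestration entry point) is a separate, account-specific step.

```python
import json
import boto3

events = boto3.client("events")

# Hypothetical bucket and prefix; EventBridge notifications must be enabled on the bucket.
event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": ["raw-landing-bucket"]},
        "object": {"key": [{"prefix": "orders/"}]},
    },
}

events.put_rule(
    Name="orders-landed",                    # placeholder rule name
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)
# The rule's target (the workflow that should run) is attached separately with put_targets.
```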
Event-Driven Orchestration with MWAA Serverless
Airflow is still used, but only for what it does best:
- Dependency management
- Retry semantics
- Visibility and auditability
- Coordinating multiple services
With MWAA Serverless:
- There are no always-on workers
- There is no capacity planning
- There is no idle orchestration cost
Events (for example, S3 notifications via EventBridge) trigger DAG runs only when new data arrives. MWAA spins up to coordinate execution and scales back down afterward.
Airflow becomes control flow, not infrastructure.
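A minimal DAG sketch shows what "control flow only" looks like in practice. It assumes the Amazon provider package's GlueJobOperator and pre-existing Glue jobs; the DAG and job names are illustrative, and the DAG has no schedule, so it runs only when an event triggers it.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

# Illustrative DAG: no schedule, runs only when triggered by an upstream event.
with DAG(
    dag_id="orders_ingest",
    start_date=datetime(2024, 1, 1),
    schedule=None,          # event-triggered, never cron-driven
    catchup=False,
) as dag:

    clean = GlueJobOperator(
        task_id="clean_orders",
        job_name="clean_orders",              # hypothetical Glue job
        script_args={"--source_prefix": "orders/"},
    )

    publish = GlueJobOperator(
        task_id="publish_orders",
        job_name="publish_orders",            # hypothetical Glue job
    )

    clean >> publish
```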
Glue Serverless as Event-Driven Compute
Each transformation step is implemented as a small, purpose-built Glue job:
- One responsibility per job
- No shared cluster assumptions
- Independent scaling and retries
From a cost perspective:
- Jobs run only when triggered
- There is no idle cluster time
- Failures are isolated and cheap to retry
Instead of paying for a Spark cluster all day, you pay per execution.
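As a sketch of what one of these small jobs might look like, assuming the job is configured for Iceberg (for example via the `--datalake-formats iceberg` job parameter and the matching Spark catalog configuration), and with all bucket, catalog, database, and table names as placeholders:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_prefix"])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read only the newly arrived objects (the prefix is passed in by the triggering event).
incoming = spark.read.json(f"s3://raw-landing-bucket/{args['source_prefix']}")

# Append to an Iceberg table registered in the Glue Data Catalog.
# "glue_catalog", "analytics", and "orders" are illustrative names.
incoming.writeTo("glue_catalog.analytics.orders").append()

job.commit()
```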
Why Apache Iceberg Enables Cost Reduction
Apache Iceberg is foundational to making this architecture efficient.
Iceberg enables:
- Schema evolution without rewriting entire tables
- Partition evolution without backfills
- Snapshot-based time travel for recovery
- Multiple engines reading the same data
From a cost perspective:
- No duplicate datasets per consumer
- No full-table rewrites for small schema changes
- No tight coupling between producers and consumers
Iceberg supports incremental, event-driven writes without downstream reprocessing.
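These properties are exposed as ordinary SQL against table metadata. A few hedged examples, assuming the same illustrative glue_catalog.analytics.orders table, a recent Spark version, and the Iceberg SQL extensions enabled:

```python
# Schema evolution: adding a column is a metadata-only change, no data files are rewritten.
spark.sql("ALTER TABLE glue_catalog.analytics.orders ADD COLUMN discount_pct double")

# Partition evolution: new writes use the new spec; existing files stay where they are.
spark.sql("ALTER TABLE glue_catalog.analytics.orders ADD PARTITION FIELD days(order_ts)")

# Time travel: read the table as of an earlier snapshot for recovery or auditing.
spark.sql(
    "SELECT * FROM glue_catalog.analytics.orders "
    "VERSION AS OF 8744736658442914487"   # illustrative snapshot id
).show()
```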
Surfacing Data to Snowflake Without Duplication
Snowflake reads the Iceberg tables in place through external tables backed by Amazon S3 and the Glue Data Catalog.
This approach:
- Avoids copying data into Snowflake-managed storage
- Makes data available immediately after it is written
- Keeps storage costs centralized in S3
If performance requirements change later, data can still be materialized — but duplication becomes a deliberate choice, not a default.
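One way to realize this on the Snowflake side is through Snowflake's Glue catalog integration and externally managed Iceberg tables, sketched below with the Python connector. The exact DDL options depend on your Snowflake edition and setup; the integration name, role ARN, account ID, external volume, and table names are all placeholders.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # placeholder connection details
    user="pipeline_user",
    password="...",
)
cur = conn.cursor()

# One-time setup: point Snowflake at the Glue Data Catalog.
cur.execute("""
    CREATE CATALOG INTEGRATION glue_int
      CATALOG_SOURCE = GLUE
      CATALOG_NAMESPACE = 'analytics'
      TABLE_FORMAT = ICEBERG
      GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-glue-access'
      GLUE_CATALOG_ID = '123456789012'
      ENABLED = TRUE
""")

# Expose the Iceberg table without copying data into Snowflake-managed storage.
cur.execute("""
    CREATE ICEBERG TABLE orders
      EXTERNAL_VOLUME = 'lake_s3_volume'
      CATALOG = 'glue_int'
      CATALOG_TABLE_NAME = 'orders'
""")

cur.close()
conn.close()
```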
Where the Cost Savings Actually Come From
This architecture removes several major cost drivers.
Traditional Pipeline Costs
- 24/7 Airflow workers
- Always-on Spark or EMR clusters
- Idle compute between scheduled runs
- Operational effort maintaining infrastructure
Even small clusters add up over time.
Costs Removed and Introduced by This Architecture
Eliminated
- Idle Airflow capacity
- Persistent Spark clusters
- Long-running EC2 instances
- Custom metastore infrastructure
Introduced
- Per-event MWAA execution cost
- Per-job Glue runtime cost
- Object storage costs in S3
In practice, teams often see:
- Near-zero idle compute spend
- Costs directly proportional to data volume
- Predictable per-run pricing
For short-lived batch workloads, this frequently results in meaningful cost reduction without sacrificing capability.
Operational Simplification (The Hidden Savings)
Cost is not just dollars.
This architecture also reduces:
- On-call surface area
- Patch and upgrade cycles
- Capacity planning work
- Failure blast radius
Fewer always-on systems means fewer things that can fail silently.
Tradeoffs to Be Aware Of
This pattern does introduce responsibilities:
- Event-driven pipelines require idempotent design
- Iceberg requires schema and table discipline
- External tables may not suit all query patterns
These are engineering tradeoffs, not infrastructure problems.
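The first tradeoff, idempotency, is worth making concrete: if the same S3 event is delivered twice, the pipeline should not duplicate rows. One common approach is to upsert by key with MERGE instead of blindly appending. A hedged sketch, reusing the illustrative incoming DataFrame and table from the Glue example above:

```python
# Idempotent load: re-running the same batch produces the same table state.
incoming.createOrReplaceTempView("incoming_orders")

spark.sql("""
    MERGE INTO glue_catalog.analytics.orders AS t
    USING incoming_orders AS s
      ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```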
When This Pattern Works Best
Strong fit
- Event-driven or near-real-time batch ingestion
- Teams optimizing for cost and simplicity
- Lakehouse or multi-engine environments
Less ideal
- Ultra-low-latency streaming
- Always-on interactive workloads
- Extremely large, tightly coupled transformations
Final Thoughts
Serverless data architectures are not about removing structure.
They are about aligning cost and complexity with reality.
By combining MWAA Serverless, Glue Serverless, and Apache Iceberg, teams can build pipelines that:
- React to data instead of schedules
- Eliminate idle compute
- Scale naturally
- Remain flexible as requirements evolve
In many cases, the simplest architecture is also the most cost-effective one.