The 3 AM Wake-Up Call
It was a Tuesday morning during AWS re:Invent 2024 in Las Vegas when my phone lit up with a PagerDuty alert. Our analytics dashboard had timed out, and the culprit was an Athena query scanning 47 TB of S3 data, costing us $235 in a single failed attempt. The query was looking for a week's worth of user events, but due to poor partitioning, Athena had to scan three years of historical data.
That incident taught me an expensive lesson: a well-architected data lake isn't just about storing data cheaply in S3. It's about making that data queryable, maintainable, and cost-effective at scale.
Over four years of building data platforms on AWS, I've learned that the difference between a $500/month data lake and a $15,000/month one often comes down to a handful of architectural decisions. In this article, I'll share the battle-tested patterns that have helped me and the teams I've worked with build production-grade data lakes that are both performant and economical.
Table of Contents
- Architecture foundations: The multi-zone approach that separates concerns
- Partitioning strategies: The single biggest lever for query performance and cost
- Schema evolution: How to change schemas without breaking downstream systems
- Query optimization: Techniques that reduced our query times by 85%
- Cost optimization: Real tactics that saved thousands per month
Architecture Overview: The Three-Zone Data Lake
The foundation of a production data lake is separation of concerns. I've found the three-zone architecture to be the most practical approach:
Raw Zone (Bronze):
- Stores data exactly as received from source systems
- No transformations, no schema enforcement
- Serves as the "source of truth" for reprocessing
Example path: s3://datalake-raw/source_system/table_name/year=2024/month=11/day=23/
Processed Zone (Silver):
- Cleaned, deduplicated, and validated data
- Type conversions applied (strings to proper data types)
- Bad records filtered or quarantined
Example path: s3://datalake-processed/domain/table_name/year=2024/month=11/day=23/
Curated Zone (Gold):
- Business-level aggregations and joins
- Optimized for specific analytics use cases
- Often denormalized for query performance
Example path: s3://datalake-curated/analytics/user_activity_summary/year=2024/month=11/
The Glue Data Catalog:
The AWS Glue Data Catalog sits at the center, acting as a centralized metadata repository:
- Stores table schemas, partition information, and statistics
- Shared across Athena, Glue ETL jobs, Redshift Spectrum, and EMR
- Enables schema-on-read: define structure at query time, not ingestion time
Typical Data Flow
1. Ingestion: Data lands in the Raw Zone (JSON, CSV, or streaming via Kinesis).
2. Processing: A Glue ETL job reads from Raw, transforms, and writes Parquet to the Processed Zone.
3. Curation: Another Glue job aggregates and joins data, then writes to the Curated Zone.
4. Query: Athena queries the Curated Zone for fast, cost-effective analytics.
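To make this flow concrete, here is a minimal Glue ETL sketch for the Raw → Processed step. The database, table, and S3 paths are placeholders, and the epoch-millisecond timestamp is my assumption, not the actual job behind this article.

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import functions as F

glueContext = GlueContext(SparkContext.getOrCreate())

# Read raw JSON as registered in the Glue Data Catalog (placeholder database/table names)
raw = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="user_events").toDF()

# Derive the year/month/day partition columns from the event timestamp
ts = F.from_unixtime(F.col("timestamp").cast("long") / 1000)
cleaned = (raw
    .withColumn("year", F.date_format(ts, "yyyy"))
    .withColumn("month", F.date_format(ts, "MM"))
    .withColumn("day", F.date_format(ts, "dd")))

# Write partitioned Parquet to the Processed Zone
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(cleaned, glueContext, "cleaned"),
    connection_type="s3",
    connection_options={
        "path": "s3://datalake-processed/events/user_events/",
        "partitionKeys": ["year", "month", "day"],
    },
    format="parquet",
)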
Partitioning Strategies: The Biggest Cost Lever
Proper partitioning is the difference between a $5 query and a $500 query.
Here's what I've learned works.
Time-Based Partitioning (The Foundation)
For time-series data (logs, events, transactions), partition by date:
CREATE EXTERNAL TABLE user_events (
user_id STRING,
event_type STRING,
`timestamp` BIGINT,
properties STRING
)
PARTITIONED BY (
year STRING,
month STRING,
day STRING
)
STORED AS PARQUET
LOCATION 's3://datalake-processed/events/user_events/';
Why year/month/day separately? Flexibility.
You can query:
- A single day: WHERE year='2024' AND month='11' AND day='23'
- An entire month: WHERE year='2024' AND month='11'
- Multiple months: WHERE year='2024' AND month IN ('10','11')
Partition Projection: The Game Changer
Without partition projection, you need to run MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION every time new data arrives. With projection, Athena generates partition values automatically:
CREATE EXTERNAL TABLE user_events (
user_id STRING,
event_type STRING,
properties STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://datalake-processed/events/user_events/'
TBLPROPERTIES (
'projection.enabled' = 'true',
'projection.dt.type' = 'date',
'projection.dt.range' = '2023-01-01,NOW',
'projection.dt.format' = 'yyyy-MM-dd',
'storage.location.template' = 's3://datalake-processed/events/user_events/dt=${dt}'
);
Real impact: This eliminated our nightly partition maintenance job and reduced query planning time from 10 seconds to sub-second.
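One detail worth spelling out: projection only works if files actually land under the prefixes that storage.location.template describes. Here is a minimal Spark sketch of the writer side, with placeholder paths and an assumed epoch-millisecond timestamp column:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder input location for data that still needs to be laid out by dt
events = spark.read.parquet("s3://datalake-processed/events/user_events_staging/")

# Derive the dt partition value (assumes an epoch-millisecond timestamp column)
events = events.withColumn(
    "dt", F.date_format(F.from_unixtime(F.col("timestamp").cast("long") / 1000), "yyyy-MM-dd"))

# partitionBy("dt") writes .../dt=2024-11-23/ prefixes, which is exactly what the
# projection template expects, so no ALTER TABLE ADD PARTITION is ever needed
(events.write
    .mode("append")
    .partitionBy("dt")
    .parquet("s3://datalake-processed/events/user_events/"))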
Schema Evolution: Adding Columns Without Breaking Everything
The cardinal rule: append columns to the end, never remove or reorder.
Adding a new column (Safe):
ALTER TABLE user_events
ADD COLUMNS (device_type STRING);
Old Parquet files without this column? No problem. Athena returns NULL for missing columns.
Changing data types (don't do this):
-- DON'T DO THIS:
ALTER TABLE user_events
CHANGE COLUMN user_id user_id BIGINT; -- Was STRING
If old files have non-numeric user IDs, queries will fail. Instead: add a new column (user_id_v2) and backfill gradually.
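One way to do that backfill is a small Spark job that populates the new column with a safe cast. This is a sketch with placeholder paths, not a drop-in migration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://datalake-processed/events/user_events/")  # placeholder path

# cast() returns NULL for non-numeric user IDs instead of failing the query
df = df.withColumn("user_id_v2", F.col("user_id").cast("bigint"))

# Write to a separate location rather than mutating the original data in place
df.write.mode("overwrite").parquet("s3://datalake-processed/events/user_events_v2/")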
The Golden Rules
- Add, don't modify - New columns are safe, changing existing ones is dangerous.
- Make new columns nullable - Old data won't have values.
- Use separate tables for breaking changes - user_events_v2 is better than breaking user_events.
- Document schema versions - Add version info in table properties or a separate registry.
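For that last rule, one lightweight option is to stash a version marker in the Glue table's parameters. A boto3 sketch, with placeholder database and table names:

import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="analytics", Name="user_events")["Table"]

# update_table only accepts a subset of the fields that get_table returns
allowed = ("Name", "Description", "Owner", "Retention", "StorageDescriptor",
           "PartitionKeys", "TableType", "Parameters")
table_input = {k: v for k, v in table.items() if k in allowed}

# Record the schema version alongside the table's other properties
table_input.setdefault("Parameters", {})["schema.version"] = "2"

glue.update_table(DatabaseName="analytics", TableInput=table_input)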
File Formats & Query Optimization: Speed and Cost in Harmony
Why Parquet Wins
Example:
100 GB of JSON logs -> 12 GB as Parquet with Snappy compression
Athena query cost at $5/TB scanned: about $0.50 for the JSON vs $0.06 for the Parquet
Parquet's columnar storage means you only read the columns you need, not entire rows.
File Size Matters
Sweet spot: 128 MB - 512 MB per file
Too many small files (< 10 MB)? Athena spends more time listing files than reading them. Compact them with a simple Glue job:
# Read the fragmented Parquet files (paths are illustrative)
df = spark.read.parquet("s3://path/to/small/files/")
# Rewrite as fewer, larger files; pick the count so each lands in the 128-512 MB range
df.repartition(10).write.mode("overwrite").parquet("s3://path/to/compacted/")
Three Query Wins
1. Select only what you need
-- Scans every column
SELECT * FROM user_events WHERE year='2024';
-- Scans only the 3 columns you need (83% cheaper in our case)
SELECT user_id, event_type, timestamp FROM user_events WHERE year='2024';
2. Use CTAS for repeated aggregations
Running the same complex query daily? Materialize it once:
CREATE TABLE daily_summary
WITH (format='PARQUET', partitioned_by=ARRAY['dt']) AS
SELECT user_id, COUNT(*) AS events, dt  -- partition column (dt) must come last in the SELECT
FROM user_events GROUP BY dt, user_id;
3. Enable query result reuse for your Athena queries - if an identical query ran within the reuse window you configure, Athena returns the cached result and scans nothing (a sketch follows).
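Result reuse is requested per query. A boto3 sketch of what that looks like, with a placeholder workgroup and database:

import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="SELECT dt, COUNT(*) AS events FROM daily_summary GROUP BY dt",
    QueryExecutionContext={"Database": "analytics"},  # placeholder database
    WorkGroup="analytics",                            # placeholder workgroup
    ResultReuseConfiguration={
        "ResultReuseByAgeConfiguration": {"Enabled": True, "MaxAgeInMinutes": 60}
    },
)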
Top 5 Cost Optimization Wins
Here are the tactics that saved the most money in my projects:
1. Partition Projection → Eliminated 80% of "full table scans"
Saved: ~$2,000/month on a moderate data lake
2. Convert to Parquet → 10x reduction in data scanned
Saved: ~$3,500/month (migrated 500 GB daily ingestion from JSON)
3. S3 Lifecycle Policies → Move old data to cheaper storage
{
  "Rules": [{
    "ID": "Archive old partitions",
    "Status": "Enabled",
    "Filter": {"Prefix": ""},
    "Transitions": [{
      "Days": 90,
      "StorageClass": "GLACIER_IR"
    }]
  }]
}
Saved: ~$800/month on storage costs
4. Athena Workgroup Data Scan Limits → Prevent runaway queries
Set per-query limit: 1 TB
Set per-workgroup monthly limit: 50 TB
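The per-query cap can be set programmatically; here is a minimal boto3 sketch with a placeholder workgroup name (the workgroup-wide monthly control is configured separately as a data usage alert and isn't shown here):

import boto3

athena = boto3.client("athena")

# Cap every query in the workgroup at ~1 TB scanned (the value is in bytes)
athena.update_work_group(
    WorkGroup="analytics",
    ConfigurationUpdates={"BytesScannedCutoffPerQuery": 1_000_000_000_000},
)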
5. Glue Job Bookmarks → Process only new data (enable the --job-bookmark-option job-bookmark-enable job parameter)
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Only processes files added since the last successful run
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="raw_events",
    transformation_ctx="datasource"
)
job.commit()  # persists the bookmark so the next run skips these files
Conclusion
- Partition by time, always. Use partition projection.
- Parquet isn't optional—it's essential for cost control.
- Schema evolution requires discipline: add columns, don't modify them.
- Monitor data scanned per query. It's your cost meter.
Questions from your own data lake implementations? I'd love to hear them.
