Building a Real-Time Data Lake on AWS: S3, Glue, and Athena in Production

The 3 AM Wake-Up Call
It was a Tuesday morning at AWS re:Invent 2024 in Las Vegas when my phone lit up with a PagerDuty alert. Our analytics dashboard had timed out, and the culprit was an Athena query scanning 47 TB of S3 data, costing us $235 in a single failed attempt. The query was looking for a week's worth of user events, but due to poor partitioning, Athena had to scan three years of historical data.

That incident taught me an expensive lesson: a well-architected data lake isn't just about storing data cheaply in S3. It's about making that data queryable, maintainable, and cost-effective at scale.
Over four years of building data platforms on AWS, I've learned that the difference between a $500/month data lake and a $15,000/month one often comes down to a handful of architectural decisions. In this article, I'll share the battle-tested patterns that have helped me and the teams I've worked with build production-grade data lakes that are both performant and economical.

Table of Contents

  • Architecture foundations: The multi-zone approach that separates concerns
  • Partitioning strategies: The single biggest lever for query performance and cost
  • Schema evolution: How to change schemas without breaking downstream systems
  • Query optimization: Techniques that reduced our query times by 85%
  • Cost optimization: Real tactics that saved thousands per month

Architecture Overview: The Three-Zone Data Lake

The foundation of a production data lake is separation of concerns. I've found the three-zone architecture to be the most practical approach:

(Architecture diagram: the three zones with the Glue Catalog at the center)

Raw Zone (Bronze):

  • Stores data exactly as received from source systems
  • No transformations, no schema enforcement
  • Serves as the "source of truth" for reprocessing
  • Example path: s3://datalake-raw/source_system/table_name/year=2024/month=11/day=23/

Processed Zone (Silver):

  • Cleaned, deduplicated, and validated data
  • Type conversions applied (strings to proper data types)
  • Bad records filtered or quarantined
  • Example path: s3://datalake-processed/domain/table_name/year=2024/month=11/day=23/

Curated Zone (Gold):

  • Business-level aggregations and joins
  • Optimized for specific analytics use cases
  • Often denormalized for query performance
  • Example path: s3://datalake-curated/analytics/user_activity_summary/year=2024/month=11/

The Glue Catalog:

AWS Glue Catalog sits at the center, acting as a centralized metadata repository:

  • Stores table schemas, partition information, and statistics
  • Shared across Athena, Glue ETL jobs, Redshift Spectrum, and EMR
  • Enables schema-on-read: define structure at query time, not ingestion time
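
Because the Catalog is just an API, you can inspect the same metadata Athena relies on. A minimal boto3 sketch (the database and table names are placeholders):

import boto3

glue = boto3.client("glue")

# Fetch the table definition that Athena, Glue ETL, and Redshift Spectrum all share
table = glue.get_table(DatabaseName="my_database", Name="user_events")["Table"]

print([c["Name"] for c in table["StorageDescriptor"]["Columns"]])   # data columns
print([k["Name"] for k in table.get("PartitionKeys", [])])          # partition columns
print(table["StorageDescriptor"]["Location"])                       # S3 location of the data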

Typical Data Flow
  • Ingestion: Data lands in the Raw Zone (JSON, CSV, or streaming via Kinesis)
  • Processing: A Glue ETL job reads from Raw, transforms, and writes Parquet to the Processed Zone (sketched below)
  • Curation: Another Glue job aggregates and joins data, then writes to the Curated Zone
  • Query: Athena queries the Curated Zone for fast, cost-effective analytics
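
To make the Raw-to-Processed step concrete, here is a minimal PySpark sketch of a Glue job; the paths follow the example layout above, while the dedup keys and the epoch-millisecond timestamp column are assumptions for illustration:

from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F

spark = GlueContext(SparkContext.getOrCreate()).spark_session

# Read raw JSON exactly as it landed in the Raw Zone
raw = spark.read.json("s3://datalake-raw/source_system/user_events/")

# Clean: derive a proper timestamp, drop duplicates, add the dt partition column
processed = (
    raw.withColumn("event_time", F.from_unixtime(F.col("timestamp") / 1000).cast("timestamp"))
       .dropDuplicates(["user_id", "event_type", "timestamp"])
       .withColumn("dt", F.to_date("event_time").cast("string"))
)

# Write Parquet to the Processed Zone, partitioned so Athena can prune by dt
(processed.write.mode("append")
    .partitionBy("dt")
    .parquet("s3://datalake-processed/events/user_events/"))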

Partitioning Strategies: Biggest Cost Lever

Proper partitioning is the difference between a $5 query and a $500 query.
Here's what I've learned works.
Time-Based Partitioning (The Foundation)
For time-series data (logs, events, transactions), partition by date:

CREATE EXTERNAL TABLE user_events (
    user_id STRING,
    event_type STRING,
    `timestamp` BIGINT,
    properties STRING
)
PARTITIONED BY (
    year STRING,
    month STRING,
    day STRING
)
STORED AS PARQUET
LOCATION 's3://datalake-processed/events/user_events/';

Why year/month/day separately? Flexibility.
You can query:

  • A single day: WHERE year='2024' AND month='11' AND day='23'
  • An entire month: WHERE year='2024' AND month='11'
  • Multiple months: WHERE year='2024' AND month IN ('10','11')

Partition Projection: The Game Changer
Without partition projection, you need to run MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION every time new data arrives. With projection, Athena generates partition values automatically:

CREATE EXTERNAL TABLE user_events (
    user_id STRING,
    event_type STRING,
    properties STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://datalake-processed/events/user_events/'
TBLPROPERTIES (
    'projection.enabled' = 'true',
    'projection.dt.type' = 'date',
    'projection.dt.range' = '2023-01-01,NOW',
    'projection.dt.format' = 'yyyy-MM-dd',
    'storage.location.template' = 's3://datalake-processed/events/user_events/dt=${dt}'
);

Real impact: This eliminated our nightly partition maintenance job and reduced query planning time from 10 seconds to sub-second.

Schema Evolution: Adding Columns Without Breaking Everything

The cardinal rule: append columns to the end, never remove or reorder.

Adding a new column (Safe):

ALTER TABLE user_events 
ADD COLUMNS (device_type STRING);

Old Parquet files without this column? No problem. Athena returns NULL for missing columns.

Changing data types (Don't do this):

-- DON'T DO THIS:
ALTER TABLE user_events 
CHANGE COLUMN user_id user_id BIGINT;  -- Was STRING

If old files have non-numeric user IDs, queries will fail. Instead: add a new column (user_id_v2) and backfill gradually.
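
A sketch of what that gradual backfill can look like in PySpark, assuming an existing SparkSession named spark and a hypothetical staging location:

from pyspark.sql import functions as F

# Rewrite one old partition with the new column populated;
# non-numeric user_ids become NULL instead of breaking queries
df = spark.read.parquet("s3://datalake-processed/events/user_events/dt=2023-06-01/")
df = df.withColumn("user_id_v2", F.col("user_id").cast("bigint"))
df.write.mode("overwrite").parquet("s3://datalake-backfill/user_events/dt=2023-06-01/")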

The Golden Rules

  • Add, don't modify - New columns are safe, changing existing ones is dangerous.
  • Make new columns nullable - Old data won't have values.
  • Use separate tables for breaking changes - user_events_v2 is better than breaking user_events.
  • Document schema versions - Add version info in table properties or a separate registry (see the sketch below).
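
For that last rule, one option is to keep a version marker in the table's parameters. A hedged boto3 sketch (update_table replaces the whole table definition, so the existing one is copied first; the schema.version key is just a convention):

import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="my_database", Name="user_events")["Table"]

# update_table expects a TableInput, so keep only the fields it accepts
allowed = {"Name", "Description", "Owner", "Retention", "StorageDescriptor",
           "PartitionKeys", "TableType", "Parameters"}
table_input = {k: v for k, v in table.items() if k in allowed}

# Bump a custom schema version property alongside the existing parameters
table_input["Parameters"] = {**table.get("Parameters", {}), "schema.version": "2"}
glue.update_table(DatabaseName="my_database", TableInput=table_input)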

File Formats & Query Optimization: Speed and Cost in Harmony
Why Parquet Wins
Example:

  • 100 GB of JSON logs -> 12 GB as Parquet with Snappy compression
  • Cost of a full scan in Athena (at $5 per TB scanned): ~$0.50 for the JSON vs ~$0.06 for the Parquet

Parquet's columnar storage means you only read the columns you need, not entire rows.
File Size Matters
Sweet spot: 128 MB - 512 MB per file
Too many small files (< 10 MB)? Athena spends more time listing files than reading them. Compact them with a simple Glue job:

df = spark.read.parquet("s3://path/to/small/files/")
# Choose the partition count so each output file lands in the 128-512 MB sweet spot
df.repartition(10).write.mode("overwrite").parquet("s3://path/to/compacted/")

Three Query Wins
1. Select only what you need

-- Scans 10 columns
SELECT * FROM user_events WHERE year='2024';

-- Scans 3 columns (83% cheaper)
SELECT user_id, event_type, timestamp FROM user_events WHERE year='2024';

2. Use CTAS for repeated aggregations
Running the same complex query daily? Materialize it once:

CREATE TABLE daily_summary
WITH (format='PARQUET', partitioned_by=ARRAY['dt']) AS
-- partitioned columns must come last in the SELECT list
SELECT user_id, COUNT(*) AS events, dt
FROM user_events GROUP BY dt, user_id;

3. Enable query result caching in your Athena Workgroup - Identical queries are free for 24 hours.
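
At the API level, result reuse is requested per query. A minimal boto3 sketch, assuming a workgroup named "analytics" with a result location already configured:

import boto3

athena = boto3.client("athena")

# Ask Athena to reuse a cached result if an identical query ran in the last 24 hours
athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM user_events WHERE dt = '2024-11-23'",
    WorkGroup="analytics",
    ResultReuseConfiguration={
        "ResultReuseByAgeConfiguration": {"Enabled": True, "MaxAgeInMinutes": 1440}
    },
)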

Top 5 Cost Optimization Wins
Here are the tactics that saved the most money in my projects:

  1. Partition Projection → Eliminated 80% of "full table scans"

Saved: ~$2,000/month on a moderate data lake

  2. Convert to Parquet → 10x reduction in data scanned

Saved: ~$3,500/month (migrated 500 GB daily ingestion from JSON)

  3. S3 Lifecycle Policies → Move old data to cheaper storage
{
  "Rules": [{
    "Id": "Archive old partitions",
    "Status": "Enabled",
    "Transitions": [{
      "Days": 90,
      "StorageClass": "GLACIER_IR"
    }]
  }]
}
Saved: ~$800/month on storage costs
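
Applying the rule is a single API call. A boto3 sketch, where the bucket name is an assumption and the empty prefix applies the rule to the whole bucket (the API requires a Filter, or the legacy Prefix field, on each rule):

import boto3

s3 = boto3.client("s3")

# Transition objects to Glacier Instant Retrieval 90 days after creation
s3.put_bucket_lifecycle_configuration(
    Bucket="datalake-raw",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "Archive old partitions",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER_IR"}],
        }]
    },
)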

  4. Athena Workgroup Data Scan Limits → Prevent runaway queries

  • Set per-query limit: 1 TB
  • Set per-workgroup monthly limit: 50 TB
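
The per-query cutoff can be set on the workgroup itself; a minimal boto3 sketch (the workgroup name is an assumption, and the monthly workgroup-level control is configured separately as a data usage control):

import boto3

athena = boto3.client("athena")

# Cancel any query in this workgroup that would scan more than 1 TB
athena.update_work_group(
    WorkGroup="analytics",
    ConfigurationUpdates={"BytesScannedCutoffPerQuery": 1 * 1024**4},
)
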
  5. Glue Job Bookmarks → Process only new data
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Only processes new files since last run (bookmarks enabled on the job)
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="raw_events",
    transformation_ctx="datasource"
)
job.commit()  # advance the bookmark past the files just processed

Conclusion

  • Partition by time, always. Use partition projection.
  • Parquet isn't optional—it's essential for cost control.
  • Schema evolution requires discipline: add columns, don't modify them.
  • Monitor data scanned per query. It's your cost meter.

Questions from your own data lake implementations? I'd love to hear them.
