TL;DR:
- Serverless data lakes scale well, but can fail silently.
- Without observability, you risk incomplete or incorrect data.
- Add a DLQ to capture failed events.
- Use Amazon CloudWatch + Amazon SNS for real visibility.
- Trade-off: More components, but far more reliable pipelines.
The Moment I Stopped Trusting "Successful Pipelines"
It was 2 AM.
The pipeline had "completed successfully."
Amazon Athena was returning results.
But the numbers didn’t match.
Digging into Amazon CloudWatch logs, I found the issue:
Messages were stuck in a queue no one was monitoring.
No alerts. No visible errors. Just missing data.
Serverless systems don’t fail loudly. They fail silently.
The Typical Setup (and the Hidden Risk)
Most people build serverless data lakes like this:
Amazon S3 → storage
AWS Glue → transformations
Amazon Athena → querying
It works.
But it assumes that if the pipeline runs… the data is correct.
That assumption is dangerous.
What Was Missing: Observability
The problem wasn’t compute or storage. It was visibility.
I couldn’t answer basic questions:
- Did all events get processed?
- Did anything fail permanently?
- Is data delayed or missing?
If you can’t answer those, you don’t have a production system.
The Fix: Design for Failure
I reworked the architecture for an e-commerce analytics demo with one rule: Every failure must be visible.
1. Add a Buffer (S3 → SQS)
Instead of triggering jobs directly:
Amazon S3 emits events
Amazon SQS captures them
Why it matters:
- Decoupling
- Retry control
- No lost events on spikes
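To make this concrete: S3 delivers its event notifications to the queue as JSON, and the consumer has to unpack bucket and key from each message body. A minimal sketch in pure Python (no AWS SDK required); the bucket and key values in the usage below are illustrative, but the message shape is the standard S3 event notification format:

```python
import json
from urllib.parse import unquote_plus

def extract_s3_objects(sqs_message_body: str) -> list[tuple[str, str]]:
    """Parse an S3 event notification delivered via SQS and return
    (bucket, key) pairs. Object keys arrive URL-encoded, so decode them."""
    event = json.loads(sqs_message_body)
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects
```

Because the queue sits between S3 and your processing, a burst of uploads just deepens the queue instead of overwhelming downstream jobs.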
2. Add a DLQ (Non-Negotiable)
Every queue has a Dead Letter Queue.
After retries are exhausted, the message goes to the DLQ.
Now:
- Nothing disappears
- You can inspect failures
- You can replay data
Without a DLQ, you’re guessing.
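In SQS, the DLQ is wired up with a redrive policy on the source queue. A small sketch of building that attribute (the DLQ ARN and retry count here are examples, not prescriptions):

```python
import json

def redrive_policy(dlq_arn: str, max_receives: int = 3) -> dict:
    """Build the SQS queue attribute that routes a message to the DLQ
    after max_receives failed processing attempts."""
    return {
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receives,
        })
    }

# Passed as Attributes to sqs.set_queue_attributes(QueueUrl=..., Attributes=...)
```

Pick `maxReceiveCount` deliberately: too low and transient errors land in the DLQ, too high and poison messages block retries for longer than they should.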
3. Keep Orchestration Simple
AWS Lambda polls SQS
Triggers AWS Glue jobs
No heavy orchestrators needed for this use case.
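The whole orchestration layer can be one small handler. A sketch of what that Lambda might look like; the Glue job name and the `--input_keys` argument are hypothetical choices for this demo, and the client is injectable so the logic can be tested without AWS:

```python
import json

def handler(event, context, glue_client=None):
    """SQS-triggered Lambda: collect the S3 keys from the batch and
    start a single Glue job run for them."""
    if glue_client is None:          # real client in Lambda,
        import boto3                 # a fake in local tests
        glue_client = boto3.client("glue")

    keys = []
    for msg in event["Records"]:     # SQS batch records
        s3_event = json.loads(msg["body"])
        for rec in s3_event.get("Records", []):
            keys.append(rec["s3"]["object"]["key"])

    # Job name and argument key are illustrative, not prescribed.
    run = glue_client.start_job_run(
        JobName="transform-orders",
        Arguments={"--input_keys": ",".join(keys)},
    )
    return {"glue_job_run_id": run["JobRunId"]}
```

If the Glue call raises, Lambda leaves the messages on the queue, SQS retries them, and after enough failures the redrive policy moves them to the DLQ. Failure handling falls out of the architecture, not out of extra code.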
4. Optimize for Analytics
Raw data in S3 (CSV/JSON)
Transform to Parquet
Partition by date
This keeps costs down and queries fast in Amazon Athena.
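Partitioning by date just means writing output under Hive-style prefixes, so Athena scans only the partitions a query touches. A tiny sketch of building such a prefix (the table path is an example):

```python
from datetime import date

def partition_prefix(table: str, day: date) -> str:
    """Hive-style date partition prefix, e.g. for writing Parquet under
    s3://bucket/<table>/date=YYYY-MM-DD/ so Athena can prune partitions."""
    return f"{table}/date={day.isoformat()}/"
```

A query filtered on `date = '2024-01-15'` then reads one prefix instead of the whole table, which is where most of the Athena cost savings come from.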
Observability (The Part Most People Skip)
This is the difference between "it works" and "it’s reliable".
- Metrics (Amazon CloudWatch)
- Queue depth
- DLQ size
- Glue job failures
- Lambda errors
- Alerts (Amazon SNS)
- DLQ > 0 → alert
- Glue job fails → alert
- Pipeline inactivity → alert
If something breaks, you should know immediately.
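The "DLQ > 0 → alert" rule maps directly onto one CloudWatch alarm on the DLQ's visible-message metric, with SNS as the action. A sketch of the parameters such an alarm might use; the queue and topic names are placeholders:

```python
def dlq_alarm(queue_name: str, topic_arn: str) -> dict:
    """Parameters for cloudwatch.put_metric_alarm(**...): fire as soon
    as any message is visible in the DLQ."""
    return {
        "AlarmName": f"{queue_name}-dlq-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,               # evaluate every minute
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # SNS topic that pages you
    }
```

The same pattern covers the other alerts: alarm on Glue job failure metrics, Lambda `Errors`, and an inactivity alarm on the main queue's throughput.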
Trade-Offs
What you gain:
- Reliable data pipelines
- Full visibility
- Faster debugging
- Confidence in your data
What you pay:
- More moving parts (SQS, DLQ, Lambda)
- A slight increase in cost
- Extra setup for monitoring
The Real Decision
You’re not choosing between simple and complex.
You’re choosing between:
- A simple system that hides failures
- A system that tells you when it breaks
For production systems, that’s not optional.
Final Thought
Serverless removes infrastructure.
It does NOT remove responsibility.
If you don’t design for observability:
Your system will fail quietly, and you won't know when.
How are you handling failures in your pipelines?
Do you have a DLQ… or are you trusting logs? 👇