Look Ma, No Servers!

#aws #serverless #etl #datawarehouse

A Year of Daily Failures in Our “Serverless” Data Warehouse

The Setup
Remember when you learned to ride a bike and yelled “Look ma, no hands!” right before eating pavement? That’s basically our journey with AWS Glue.

We adopted serverless data warehousing expecting fewer infrastructure headaches. Instead, we traded visible problems for invisible ones. Now we’re debugging Linux kernel OOM killers, YARN memory management, and VPC networking, all without the visibility we’d have if we actually managed the servers.

The Crashes We Took So You Don’t Have To
Every morning at 3 AM, when our batch jobs kick off, two or three of them fail. Not from bugs in our code, but from infrastructure limits that serverless abstractions were supposed to hide:

- “No IP addresses available in subnet”: ENI cleanup takes 10-15 minutes while jobs launch immediately, exhausting our /24 subnet during batch windows
- “Command failed with exit code 137”: SIGKILL from YARN when executors exceed memory limits, but we can’t see which executor or why
- “No space left on device”: Checkpoints materialize to local disk (not memory!), filling 64GB drives when shuffle volumes spike
- “No such file or directory”: Race conditions between Glue catalog updates and S3 operations

Here’s the kicker: the same job succeeds Monday, fails Tuesday, succeeds Wednesday. It depends on daily data volume variance, partition skew, and resource contention we can’t predict. Some failures happen 90 minutes into a 2-hour job—after we’ve burned expensive compute and missed our SLAs.
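The subnet exhaustion, at least, can be reasoned about with back-of-the-envelope arithmetic. Here is a minimal sketch, assuming each worker holds one IP address for the life of its job plus the ENI cleanup lag. The job schedule, worker counts, and function names are hypothetical and only there to illustrate the effect. The one hard fact is that AWS reserves five addresses in every subnet, so a /24 gives you 251 usable IPs, not 256.

```python
import ipaddress

# AWS reserves the first four addresses and the last one in every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of addresses actually assignable in an AWS subnet."""
    network = ipaddress.ip_network(cidr)
    return network.num_addresses - AWS_RESERVED_PER_SUBNET


def peak_ip_demand(jobs, cleanup_lag_min: int) -> int:
    """Peak number of IPs held at once.

    jobs: list of (start_min, duration_min, workers) tuples.
    Assumption: one IP per worker, released only after the job ends
    AND the ENI cleanup lag has passed.
    """
    events = []
    for start, duration, workers in jobs:
        events.append((start, workers))                                # IPs acquired
        events.append((start + duration + cleanup_lag_min, -workers))  # IPs released
    # Sort by time; at equal times, process releases before acquisitions.
    events.sort(key=lambda e: (e[0], e[1]))
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Hypothetical batch window: three back-to-back 10-minute jobs, 100 workers each.
batch = [(0, 10, 100), (10, 10, 100), (20, 10, 100)]
capacity = usable_ips("10.0.0.0/24")
for lag in (0, 15):
    peak = peak_ip_demand(batch, lag)
    status = "fits" if peak <= capacity else "EXHAUSTED"
    print(f"cleanup lag {lag:>2} min: peak {peak} IPs vs {capacity} usable -> {status}")
```

With instant cleanup the three jobs never need more than 100 addresses at once. With a 15-minute lag, all three jobs' IPs are still held at minute 20, so the peak is 300 and the /24 runs dry, even though no two jobs actually overlap.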

The Serverless Lie
These problems only emerged as our data grew. Everything worked fine in dev and staging with small datasets. Production at scale? That’s where “serverless” shows its true colors.

Turns out you still need to understand:

- VPC networking and ENI lifecycle management
- Spark executor memory tuning and GC behavior
- YARN resource allocation and container limits
- Local disk vs memory trade-offs in distributed systems

Except now you debug these without server access, without useful metrics, and with CloudWatch logs that show up 15 minutes late.
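Two bits of arithmetic would have saved us some head-scratching on the memory side. The sketch below assumes the open-source Spark-on-YARN defaults (memory overhead of 10% of executor memory, with a 384 MiB floor); whatever Glue configures per worker type may well differ, and the function names are made up for illustration. It also decodes exit code 137, which is simply 128 + 9, meaning the process was killed with SIGKILL.

```python
import signal

# Assumed open-source Spark defaults; Glue's real settings may differ.
MIN_OVERHEAD_MIB = 384
OVERHEAD_PERCENT = 10


def yarn_container_mib(executor_memory_mib: int) -> int:
    """Container size YARN enforces: executor heap plus off-heap overhead."""
    overhead = max(MIN_OVERHEAD_MIB, executor_memory_mib * OVERHEAD_PERCENT // 100)
    return executor_memory_mib + overhead


def decode_exit_code(code: int) -> str:
    """Translate a shell-style exit code into a signal name where possible."""
    if code > 128:
        # By shell convention, 128 + N means "terminated by signal N".
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            return f"unknown signal {code - 128}"
    return f"exit status {code} (not a signal)"


print(decode_exit_code(137))
print(yarn_container_mib(10240))  # hypothetical 10 GiB executor heap
```

The point of the first function: the heap you configure is not the limit that gets you killed. Off-heap usage counts against the container too, and nothing in the job output tells you how close you were.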

The Bottom Line
“Look Ma, No Servers!” sounded great in the architecture review. Twelve months and hundreds of failed jobs later, we’ve learned that serverless just means debugging infrastructure you can’t see.

Spoiler alert: There are servers. You just can’t access them when things go wrong.

DEV Community
