Gunnar Grosch for AWS

DEV Track Spotlight: The Art of Embracing Failures in Serverless Architectures (DEV312)

Serverless architectures promise simplicity wrapped in the immense power of distributed systems. But as Anahit Pogosova, AWS Data Hero and Lead Cloud Architect at F-Secure, reminded us in her DEV312 session: that simplicity is an illusion.

"Serverless managed services are a step up in the abstraction ladder," Anahit explained. "They make the underlying infrastructure seem almost invisible, almost magical. But by using serverless services, we didn't just magically teleport to a different reality. We are still living in the very same messy physical world with all its underlying complexities."

Her session took us on a journey through the hidden pitfalls of distributed systems, armed with real-world war stories and practical strategies for building resilience.

Watch the full session:

The False Sense of Security

The serverless abstraction layer creates a dangerous illusion. When we pick services, connect them together, and watch everything "just work," we might forget about the distributed systems complexity lurking beneath. As Anahit put it: "A serverless architecture is one in which the failure of the computer you definitely didn't know was there can render your entire architecture unusable."

This higher level of abstraction makes spotting potential issues harder because the failures are abstracted away from us too. But those failures didn't go anywhere - they're still embedded in the underlying distributed system, waiting to manifest.

Real-World Battle Scars: The Data Loss Story

Anahit shared her own experience building a near real-time data streaming architecture at scale. The setup seemed simple: a producer sending data to Amazon Kinesis Data Streams, with AWS Lambda consuming and processing the records. It worked perfectly - until they realized they had been losing data without ever noticing.

The culprit? Three interconnected issues that turned small problems into full-blown outages:

Unconfigured Timeouts - The JavaScript SDK's default request timeout is effectively infinite in v3 (it was two minutes in SDK v2). When requests to Kinesis hung on network glitches, the producer application exhausted its resources waiting and became incapable of processing new incoming data.

Unhandled Partial Failures - Batch operations like Kinesis PutRecords aren't atomic. Part of a batch might succeed while the rest fails, for example when a traffic spike hits shard limits. The SDK call returns successfully, but it's your responsibility to detect and handle those partial failures (see the sketch below).

Default Retry Behavior - When Lambda failed to process a bad record, it retried the entire batch indefinitely (until records expired after 24 hours). One poison pill record blocked an entire shard, causing cascading data loss as records expired faster than Lambda could catch up.
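To illustrate the partial-failure point, here is a minimal sketch of a producer that inspects the PutRecords response and re-sends only the failed records. The stream name, retry limit, and backoff delay are hypothetical; FailedRecordCount and the per-record ErrorCode are the fields the Kinesis API returns for exactly this purpose.

```typescript
import {
  KinesisClient,
  PutRecordsCommand,
  PutRecordsRequestEntry,
} from "@aws-sdk/client-kinesis";

const kinesis = new KinesisClient({});

// Hypothetical stream name, used for illustration only.
const STREAM_NAME = "example-stream";

async function putRecordsWithRetry(
  entries: PutRecordsRequestEntry[],
  maxAttempts = 3
): Promise<void> {
  let pending = entries;

  for (let attempt = 1; attempt <= maxAttempts && pending.length > 0; attempt++) {
    const response = await kinesis.send(
      new PutRecordsCommand({ StreamName: STREAM_NAME, Records: pending })
    );

    // The call "succeeds" even if some records were throttled or rejected,
    // so check FailedRecordCount and each record's ErrorCode ourselves.
    if (!response.FailedRecordCount) return;

    // Keep only the records that failed; result entries align with the input order.
    pending = pending.filter(
      (_, i) => response.Records?.[i]?.ErrorCode !== undefined
    );

    // Back off before retrying only the failed records (delay is illustrative).
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, 200 * attempt));
    }
  }

  if (pending.length > 0) {
    throw new Error(`${pending.length} records still failed after ${maxAttempts} attempts`);
  }
}
```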

The Hidden Superpowers: Timeouts and Retries

Anahit called timeouts and retries "hidden superpowers" because they're incredibly powerful for resilience - but can backfire spectacularly if misused.

Timeouts: Taking Control

Never blindly trust default timeout values. For AWS SDK requests, you must configure appropriate timeouts based on your service and latency expectations. Too long wastes resources; too short triggers premature retries that overwhelm downstream systems.

"When you go back to your code, please check all the requests that go over the network," Anahit urged. "Make sure that you know what those timeout values are. Make sure that you are controlling them."

Retries: The Double-Edged Sword

Retries are inherently selfish - they assume your request is more important than anything else. Poorly implemented retries can amplify small problems into cascading failures that bring entire systems down.

Anahit shared Gregor Hohpe's sobering quote: "Retries have brought more distributed systems down than all the other causes together."

The key principles for safe retries:

Only retry transient failures - Don't retry if it won't help (like overloaded systems or operations with side effects)

Set upper limits - Stop retrying when it's not helping to avoid cascading failures

Use exponential backoff with jitter - Spread retry attempts out over time so you don't overwhelm the system you're calling. Jitter adds randomness to exponential backoff, dramatically increasing the chance that retries succeed (see the sketch below)
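To make those principles concrete, here is a minimal, hypothetical helper showing capped exponential backoff with full jitter; the function name, attempt limit, and delay values are illustrative, not from the session.

```typescript
// Hypothetical helper: capped exponential backoff with "full jitter".
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxAttempts = 3,
  baseDelayMs = 100,
  maxDelayMs = 2_000
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Only retry transient failures, and stop once the limit is reached.
      if (!isRetryable(err) || attempt >= maxAttempts) throw err;

      // Exponential backoff capped at maxDelayMs, then full jitter:
      // pick a random delay between 0 and the capped exponential value.
      const expDelay = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
      await new Promise((resolve) => setTimeout(resolve, Math.random() * expDelay));
    }
  }
}

// Example usage (hypothetical): retry only throttling errors from a downstream call.
// const result = await retryWithBackoff(
//   () => callDownstreamService(),
//   (err) => err instanceof Error && err.name === "ThrottlingException"
// );
```

Keep in mind that the AWS SDKs already apply backoff and jitter to their own retries, so a helper like this mainly matters for calls the SDK doesn't retry for you.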

Lambda Event Source Mapping: The Hidden Component

Most developers have never heard of Lambda's event source mapping, yet it's critical when using Lambda with Kinesis, DynamoDB Streams, or similar event sources. This hidden component reads records, batches them, and invokes your Lambda function.

By default, if Lambda fails to process a batch, it retries indefinitely until records expire (24+ hours for Kinesis). One bad record creates a "poison pill" that blocks the entire shard, causing:

  • Useless invocations you're still paying for
  • Reprocessing of the same data repeatedly
  • Complete shard blockage while retries continue
  • Cascading data loss as records expire faster than Lambda can catch up

The solution? Configure the event source mapping parameters (a CDK sketch follows below):

  • MaximumRetryAttempts - Set a limit (default: -1 means infinite)
  • MaximumRecordAgeInSeconds - Cap how old a record may be before it is skipped (default: -1 means no limit)
  • BisectBatchOnFunctionError - Split failed batches to isolate bad records
  • DestinationConfig - Route failed records to SQS or SNS for analysis
  • ParallelizationFactor - Scale processing (but watch Lambda concurrency limits!)

"Whatever you do, please do not go with the defaults," Anahit emphasized.

Service Limits and Throttling: The Reality Check

Serverless promises scalability, but we often mistake that for infinite scalability. The reality: we're sharing resources with everyone else, and service limits prevent individual users from monopolizing capacity.

Kinesis shards have limits (1 MB or 1,000 records per second). Lambda has concurrency limits (default: 1,000 per account/region). Hit these limits, and your requests get throttled - they fail.

Understanding and planning for these limits is crucial. A Kinesis stream with 100 shards using a parallelization factor of 10 consumes your entire Lambda concurrency limit, potentially causing unrelated Lambda functions elsewhere in your account to fail.
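If a single stream consumer risks eating that shared pool, one option is to reserve (and thereby cap) its concurrency so other functions keep running. A minimal CDK sketch with an illustrative cap, assuming that's the approach you choose:

```typescript
import { Stack, StackProps } from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { Construct } from "constructs";

class StreamingStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Cap this consumer so it cannot exhaust the account-level concurrency pool.
    // 200 is illustrative; size it from shard count x parallelization factor.
    new lambda.Function(this, "StreamConsumer", {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: "index.handler",
      code: lambda.Code.fromAsset("lambda"),
      reservedConcurrentExecutions: 200,
    });
  }
}
```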

Key Takeaways

Embrace failures as reality - In distributed systems, everything fails all the time. Plan for it.

Configure timeouts explicitly - Never trust defaults. Set appropriate values for your use case.

Implement safe retries - Only retry transient failures, set limits, use exponential backoff with jitter.

Handle partial failures - Batch operations aren't atomic. Detect and retry failed portions.

Know your service limits - Understand capacity constraints and throttling behavior.

Configure event source mapping - Don't use default settings for Lambda with Kinesis/DynamoDB Streams.

Be paranoid (in a good way) - As Martin Kleppmann said: "In distributed systems, suspicion, pessimism, and paranoia pay off."

Anahit's closing advice resonated deeply: "Distributed systems and architectures are hard, but they can teach us a valuable skill - to embrace the chaos of the real world. Each failure is an opportunity to do things better, to make our systems even more resilient."

As Dr. Werner Vogels reminds us: "Everything fails, all the time." The best thing we can do is keep calm and be prepared when those failures happen.


About This Series

This post is part of DEV Track Spotlight, a series highlighting the incredible sessions from the AWS re:Invent 2025 Developer Community (DEV) track.

The DEV track featured 60 unique sessions delivered by 93 speakers from the AWS Community - including AWS Heroes, AWS Community Builders, and AWS User Group Leaders - alongside speakers from AWS and Amazon. These sessions covered cutting-edge topics including:

  • ๐Ÿค– GenAI & Agentic AI - Multi-agent systems, Strands Agents SDK, Amazon Bedrock
  • ๐Ÿ› ๏ธ Developer Tools - Kiro, Kiro CLI, Amazon Q Developer, AI-driven development
  • ๐Ÿ”’ Security - AI agent security, container security, automated remediation
  • ๐Ÿ—๏ธ Infrastructure - Serverless, containers, edge computing, observability
  • โšก Modernization - Legacy app transformation, CI/CD, feature flags
  • ๐Ÿ“Š Data - Amazon Aurora DSQL, real-time processing, vector databases

Each post in this series dives deep into one session, sharing key insights, practical takeaways, and links to the full recordings. Whether you attended re:Invent or are catching up remotely, these sessions represent the best of our developer community sharing real code, real demos, and real learnings.

Follow along as we spotlight these amazing sessions and celebrate the speakers who made the DEV track what it was!
