David Lyon

AWS Lambda Silent Crash – A Platform Failure, Not an Application Bug

What happens when a production-ready startup proves a runtime failure beyond doubt – and is still told it’s their fault?

Over a seven-week investigation, I uncovered and proved a silent, platform-level crash in AWS Lambda — affecting Node.js functions in a VPC making outbound HTTPS calls. The failure occurred mid-execution, after the function had returned a success response. No logs. No errors. No telemetry. No way to catch it.
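
To give a sense of the shape of the workload involved, here is a simplified, hypothetical sketch (not our production code; the hostname, path, and payload are illustrative): a VPC-attached Node.js handler that fires an outbound HTTPS call and returns a 201.

  // Simplified, hypothetical sketch of the workload described above;
  // hostname, path, and payload are illustrative, not our production code.
  const https = require('https');

  exports.handler = async (event) => {
    const payload = JSON.stringify({ id: event.id });

    // Outbound HTTPS call to a downstream service, made from inside a VPC.
    const req = https.request(
      {
        hostname: 'api.example.internal',
        path: '/v1/records',
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
      },
      (res) => {
        res.on('data', () => {});
        res.on('end', () => console.log('downstream call completed'));
      }
    );

    req.on('error', (err) => console.error('downstream call failed', err));
    req.write(payload);
    req.end();

    // The 201 goes back to the caller here; the failure we observed
    // happened after this point, with nothing in the logs.
    return { statusCode: 201, body: JSON.stringify({ status: 'created' }) };
  };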

From day one, I did what AWS claims to value in a partner.

  • I stripped the function down to minimal reproducible code.
  • I tested across runtimes, regions, and infrastructure baselines.
  • I rebuilt on EC2 and proved that the issue vanished entirely.
  • I shared logs, traces, metrics, and internal observations.

I escalated through every official channel:

  • Support dismissed it.
  • My Account Executive ignored it.
  • Formal complaints were met with silence.
  • Internal re-escalations led nowhere.
  • AWS Activate — the startup programme — refused to engage.
  • And executive outreach yielded nothing but a two-line response weeks later.

At every stage, I remained professional. I kept the tone restrained. I offered AWS every opportunity to engage constructively.

Instead, they claimed the bug was in my code — despite the function crashing after returning a 201.
They claimed I had forgotten a reject() — despite the error occurring deep inside https.request(), and their own reproduction missing the handler.
They suggested I move to EC2 — by then, I already had.
I asked for Lambda engineering — they gave me sales. Then silence.
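
For the record, this is the shape of error handling that was in dispute (a hypothetical sketch; the helper name is illustrative): a promisified https.request with both the resolve path and the reject path wired up, including the 'error' handler.

  // Hypothetical sketch: a promisified https.request with both the resolve
  // and reject paths wired up, including the 'error' handler that the
  // reproduction on the other side reportedly omitted.
  const https = require('https');

  function postJson(options, body) {
    return new Promise((resolve, reject) => {
      const req = https.request(options, (res) => {
        let data = '';
        res.on('data', (chunk) => { data += chunk; });
        res.on('end', () => resolve({ statusCode: res.statusCode, body: data }));
      });

      // Without this handler, a socket-level failure inside
      // https.request() has nowhere to surface.
      req.on('error', reject);

      req.write(JSON.stringify(body));
      req.end();
    });
  }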

Even AWS Activate, whose sole purpose is to support startups like ours, refused to take part. Their response wasn’t technical — it was procedural. A polite copy-paste directing us back to the same failing support system we were already trapped in.

This wasn’t just a Lambda bug. It was a platform-level failure, misdiagnosed through a broken support process, and left to rot in plain sight.

AWS’s internal systems failed. Their support model failed. Their startup engagement model failed. And above all, their cultural commitment to ownership — the thing they claim defines them — was nowhere to be found.

So, we left.
We migrated everything.
Lambda was decommissioned.
Critical services were refactored for Azure.
And our engineering culture now lives on a platform that still understands trust has to be earned — not assumed.

If you're building on Lambda: know this.
It may fail silently.
And if it does — AWS may blame you, even after they reproduce the failure themselves.


This is just the beginning of the story.

For the full deep dive into the silent AWS Lambda crash, our complete diagnostic process, AWS's contradictory responses, and why we ultimately decided to migrate our entire infrastructure, please read the full article on our website:

Read the Full Post: AWS Lambda Silent Crash – A Platform Failure, Not an Application Bug


Top comments (6)

rosswilliams

I don't see your minimal reproduction of the issue. Can you post the reproduction and your expected output?

rosswilliams

I couldn't reproduce your issue based on the sample AWS code. Two important points to note:

  1. AWS shuts down the runtime after your function returns.
  2. When emitting an event in Node, the event is not processed until the next tick.

Your initial code block returns after triggering the event. That return triggers the runtime to shut down; it does not wait for your emitted event handlers to process the event. This also explains why it works on EC2: an EC2 instance does not shut down after returning a response.

If you want to run code after returning a response you should use a Lambda extension.
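
Roughly, the difference looks like this (a minimal sketch with illustrative names, not the code from the article):

  // Minimal sketch, assuming an EventEmitter-based handler like the one described.
  // Names (emitter, doPostProcessing) are illustrative, not the article's code.
  const { EventEmitter } = require('events');
  const emitter = new EventEmitter();

  emitter.on('record-created', async (event) => {
    await doPostProcessing(event); // e.g. the outbound HTTPS call
  });

  // Problematic shape: emit() does not wait for the async listener to finish,
  // so the handler returns while that work is still pending and the runtime
  // is shut down before it completes.
  exports.fireAndForget = async (event) => {
    emitter.emit('record-created', event);
    return { statusCode: 201 };
  };

  // Safer shape: keep the async work inside the invocation and await it
  // before returning (or move it to a Lambda extension).
  exports.awaitBeforeReturn = async (event) => {
    await doPostProcessing(event);
    return { statusCode: 201 };
  };

  async function doPostProcessing(event) {
    // ...the outbound HTTPS call, wrapped in a promise and awaited...
  }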

I can see both sides of the support issue: the language you used rested on a flawed assumption about what is going on inside Lambda, and support did not see a problem with how Lambda operates. On the other hand, support should have been clear with you that they are not there to inspect your code and advise you on programming errors.

David Lyon

Thanks! I appreciate you taking the time to read it — happy to dive deeper whenever you're ready. Looking forward to the conversation.

Mauro Rezende

Thanks for sharing. Would you share the rationale of choosing Azure as the provider replacement over others?

David Lyon

Hi Mauro,

Thanks for the question — happy to clarify.
We chose Azure as our AWS replacement after an exhaustive evaluation that included:

  1. Service Parity + Runtime Stability: Azure Functions and App Services offered more predictable behaviour for the Node.js + VNet workloads that triggered the AWS platform failure. The same code that silently crashed on AWS Lambda ran flawlessly on Azure, even under identical VPC-like conditions and outbound HTTPS constraints.
  2. Support Accountability: We had direct contact with Azure engineers within 48 hours — no deflection, no obfuscation. The escalation path was clear and accountable, unlike the 4-week loop we endured with AWS, where runtime-level logs were withheld, and engineering access was denied.
  3. Multi-Region + Compliance Alignment: As a UK-based health platform, Azure’s compliance ecosystem (NHS DSPT, UK Cyber Essentials, ISO, etc.) and sovereign data options made it an easier fit long-term — especially when considering upcoming NHS partnerships and public sector alignment.
  4. Integrated Ecosystem: With our stack spanning PostgreSQL, Redis, Key Vault, monitoring, and CI/CD, Azure offered a coherent native experience, reducing reliance on third-party glue. The transition also opened the door for future AI integration via Azure OpenAI and Bicep/Terraform-native infra-as-code.

We tested alternatives — including GCP, Cloudflare Workers, and bare-metal — but Azure offered the most stable, responsive, and partnership-oriented path forward.
What matters most after a platform failure is trust — and Azure showed up when AWS didn’t.
That said, we're keeping them under a watchful eye. Once bitten, twice instrumented.