ANKUSH CHOUDHARY JOHAL

Originally published at johal.in

Postmortem: Cloud Run Instance Exhaustion Caused a 500 Error Spike

Executive Summary

On May 15, 2024, a production user authentication service hosted on Cloud Run experienced severe instance exhaustion, driving the 500 Internal Server Error rate to a peak of 72% over a 45-minute window. The incident impacted ~12,000 user requests and degraded performance for three downstream high-traffic services. No data loss or corruption was reported.

Incident Impact

The incident lasted from 14:22 UTC to 15:07 UTC on May 15, 2024 (total 45 minutes). Key impact metrics:

  • Baseline 500 error rate: 0.2% → Peak error rate: 72%
  • Affected services: User Auth API, Payment Processing Endpoint, Dashboard Data Service
  • Total failed requests: ~12,000
  • No customer data loss or financial impact reported

Root Cause Analysis

Investigation revealed that the incident was caused by four compounding factors:

  1. Uncommunicated traffic surge: An unannounced marketing campaign launched at 14:20 UTC drove 2x normal traffic to the service, with no prior notice to the infrastructure team.
  2. Inadequate autoscaling configuration: The Cloud Run service was configured with a CPU utilization scaling threshold of 80% (delaying autoscaling triggers) and a maximum instance cap of 10, insufficient for the peak load.
  3. Long instance startup time: The service container image was 1.2GB (bloated with unused dev dependencies and test artifacts), resulting in a 45-second startup time for new instances.
  4. Undetected memory leak: A memory leak in the request handler caused existing instances to crash after ~10 minutes of sustained high load, further reducing available capacity (a simplified reproduction of this leak pattern is sketched below).
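
The report does not include the leaking handler itself, but a common shape for this kind of leak is an unbounded module-level cache that grows with every unique request. The sketch below is a hypothetical reproduction in a Flask-style handler, followed by the shape of a bounded-cache fix; the route, names, and validation logic are illustrative, not the actual service code.

```python
# Hypothetical sketch of the request-handler leak pattern described above.
# Route, names, and validation logic are illustrative, not the actual service code.
from functools import lru_cache

from flask import Flask, jsonify, request

app = Flask(__name__)

# BUG: unbounded module-level cache; every unique token adds an entry that is never
# evicted, so memory climbs under sustained load until the instance is killed.
_session_cache = {}


def expensive_validate(token: str) -> bool:
    """Placeholder for the real token validation call."""
    return bool(token)


@app.route("/auth/verify")
def verify_unbounded():
    token = request.headers.get("Authorization", "")
    if token not in _session_cache:
        _session_cache[token] = expensive_validate(token)
    return jsonify(valid=_session_cache[token])


# FIX (the general shape of an emergency patch): bound the cache so entries are evicted.
@lru_cache(maxsize=10_000)
def cached_validate(token: str) -> bool:
    return expensive_validate(token)


@app.route("/auth/verify-patched")
def verify_bounded():
    token = request.headers.get("Authorization", "")
    return jsonify(valid=cached_validate(token))
```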

Incident Timeline (UTC)

  • 14:22: Traffic surge begins, 500 error rate starts climbing from baseline 0.2%
  • 14:25: Error rate hits 72% peak, on-call engineer alerted via PagerDuty
  • 14:27: Engineer confirms 500 errors originate from Cloud Run service, begins metric review
  • 14:30: Identifies instance count stuck at 2, existing instances at 95% CPU utilization
  • 14:32: Manually increases max instance cap to 20, lowers CPU scaling threshold to 60%
  • 14:35: New instances begin spinning up, error rate drops to 45%
  • 14:40: 8 instances running, error rate down to 15%
  • 14:45: Memory leak causes 2 existing instances to crash, error rate ticks up to 22%
  • 14:50: Engineer deploys emergency memory leak patch, rolls out to 10% of instances first
  • 14:57: Full rollout of memory leak fix completed, all instances stable
  • 15:07: Error rate returns to baseline 0.2%, incident declared resolved

Mitigation Steps

During the incident, the on-call team took the following immediate actions:

  • Manually adjusted Cloud Run autoscaling parameters (max instances to 20, CPU threshold to 60%)
  • Deployed emergency patch to fix the request handler memory leak
  • Temporarily rate-limited incoming traffic from the marketing campaign source to 1.5x normal load (one application-layer approach to this kind of cap is sketched below)
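
The postmortem does not say where this rate limit was applied; one application-layer way to cap a single source at roughly 1.5x its normal load is a token bucket. The sketch below is generic, and the baseline rate, burst size, and source identification are assumptions rather than details from the incident.

```python
# Hypothetical token-bucket limiter for capping one traffic source at ~1.5x baseline.
# The baseline rate, burst size, and source-identification value are assumptions.
import threading
import time


class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec          # tokens added per second
        self.capacity = burst             # maximum bucket size
        self.tokens = float(burst)
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False


BASELINE_RPS = 200  # assumed normal request rate for the campaign source
campaign_bucket = TokenBucket(rate_per_sec=BASELINE_RPS * 1.5, burst=int(BASELINE_RPS * 1.5))


def handle_request(source: str) -> tuple[int, str]:
    # Requests tagged as coming from the campaign source are throttled; others pass through.
    if source == "marketing-campaign" and not campaign_bucket.allow():
        return 429, "Too Many Requests"
    return 200, "OK"
```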

Follow-Up Actions

To prevent recurrence, the team has committed to the following long-term fixes:

  1. Implement a pre-launch traffic coordination process: all marketing campaigns and feature launches require 48 hours' notice to the infrastructure team, including traffic estimates.
  2. Optimize the container image: remove unused dependencies and test artifacts to reduce image size from 1.2GB to <300MB, cutting startup time to <15 seconds.
  3. Update Cloud Run autoscaling config: set CPU threshold to 60%, max instances to 30, add concurrent request count as a secondary scaling metric.
  4. Add memory utilization alerts for Cloud Run instances, triggering warnings at 70% usage and critical alerts at 85%.
  5. Add load testing to CI/CD pipelines: simulate 3x normal traffic for all production-bound deployments.
  6. Implement circuit breakers for all downstream service dependencies to prevent cascading failures (a minimal sketch follows this list).
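
For item 6, a circuit breaker wraps each downstream call, fails fast after a run of errors, and probes again after a cooldown. The following is a minimal sketch under assumed thresholds; the breaker parameters and the downstream call are placeholders, not the team's actual implementation.

```python
# Minimal circuit-breaker sketch for downstream calls (thresholds are illustrative).
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a probe call
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast instead of calling downstream")
            # Cooldown elapsed: allow a single probe call ("half-open").
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result


def fetch_payment_api(user_id: str) -> dict:
    """Placeholder for the real downstream payment client call."""
    return {"user_id": user_id, "status": "charged"}


# Usage: wrap each downstream dependency with its own breaker.
payments_breaker = CircuitBreaker(failure_threshold=5, reset_timeout=30.0)


def charge_user(user_id: str) -> dict:
    return payments_breaker.call(fetch_payment_api, user_id)
```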

Conclusion

This incident highlighted gaps in cross-team communication, autoscaling configuration, and testing rigor. By addressing each root cause through the follow-up actions above, we expect to significantly reduce the risk of similar instance exhaustion incidents in the future.
