Postmortem: Cloud Run Instance Exhaustion Caused 500 Error Spike
Executive Summary
On May 15, 2024, a production user authentication service hosted on Cloud Run experienced severe instance exhaustion, causing the 500 Internal Server Error rate to spike to a peak of 72% over a 45-minute window. The incident affected roughly 12,000 user requests and degraded performance for three high-traffic downstream services. No data loss or corruption was reported.
Incident Impact
The incident lasted from 14:22 UTC to 15:07 UTC on May 15, 2024 (total 45 minutes). Key impact metrics:
- Baseline 500 error rate: 0.2% → Peak error rate: 72%
- Affected services: User Auth API, Payment Processing Endpoint, Dashboard Data Service
- Total failed requests: ~12,000
- No customer data loss or financial impact reported
Root Cause Analysis
Investigation revealed four compounding factors:
- Uncommunicated traffic surge: A marketing campaign launched at 14:20 UTC without notice to the infrastructure team drove roughly twice normal traffic to the service.
- Inadequate autoscaling configuration: The Cloud Run service was configured with a CPU utilization scaling threshold of 80% (delaying autoscaling triggers) and a maximum instance cap of 10, insufficient for the peak load.
- Long instance startup time: The service container image was 1.2GB (bloated with unused dev dependencies and test artifacts), resulting in a 45-second startup time for new instances.
- Undetected memory leak: A memory leak in the request handler caused existing instances to crash after ~10 minutes of sustained high load, further reducing available capacity.
Incident Timeline (UTC)
- 14:22: Traffic surge begins, 500 error rate starts climbing from baseline 0.2%
- 14:25: Error rate hits 72% peak, on-call engineer alerted via PagerDuty
- 14:27: Engineer confirms 500 errors originate from Cloud Run service, begins metric review
- 14:30: Identifies instance count stuck at 2, existing instances at 95% CPU utilization
- 14:32: Manually increases max instance cap to 20, lowers CPU scaling threshold to 60%
- 14:35: New instances begin spinning up, error rate drops to 45%
- 14:40: 8 instances running, error rate down to 15%
- 14:45: Memory leak causes 2 existing instances to crash, error rate ticks up to 22%
- 14:50: Engineer deploys emergency memory leak patch, rolls out to 10% of instances first
- 14:57: Full rollout of memory leak fix completed, all instances stable
- 15:07: Error rate returns to baseline 0.2%, incident declared resolved
Mitigation Steps
During the incident, the on-call team took the following immediate actions:
- Manually adjusted Cloud Run autoscaling parameters (max instances to 20, CPU threshold to 60%)
- Deployed emergency patch to fix the request handler memory leak
- Temporarily rate-limited incoming traffic from the marketing campaign source to 1.5x normal load
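The postmortem does not say how the 1.5x cap was enforced; one standard mechanism is a token bucket, which allows a sustained rate (here, 1.5x the normal request rate) while absorbing short bursts. A minimal sketch, with an injectable clock for testability:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: admits a sustained `rate` of requests per
    second while allowing bursts up to `capacity`. Illustrative sketch only;
    the incident's actual rate-limiting mechanism is not documented."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill tokens proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # admit the request
            return True
        return False  # over the limit: reject or queue
```

Setting `rate` to 1.5x the baseline request rate reproduces the cap applied during the incident; requests beyond it are rejected (or shed to a queue) instead of overwhelming the instances.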
Follow-Up Actions
To prevent recurrence, the team has committed to the following long-term fixes:
- Implement a pre-launch traffic coordination process: all marketing campaigns and feature launches require 48 hours' notice to the infrastructure team, including expected traffic estimates.
- Optimize the container image: remove unused dependencies and test artifacts to reduce image size from 1.2GB to <300MB, cutting startup time to <15 seconds.
- Update Cloud Run autoscaling config: set CPU threshold to 60%, max instances to 30, add concurrent request count as a secondary scaling metric.
- Add memory utilization alerts for Cloud Run instances, triggering warnings at 70% usage and critical alerts at 85%.
- Add load testing to CI/CD pipelines: simulate 3x normal traffic for all production-bound deployments.
- Implement circuit breakers for all downstream service dependencies to prevent cascading failures.
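The circuit-breaker action in the list above can be sketched as follows. This is a minimal illustration of the pattern, not the team's actual implementation: after a run of consecutive failures the breaker opens and fails fast instead of piling load onto a struggling downstream service, then allows a trial call after a cooldown.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch. After `max_failures` consecutive
    failures the circuit opens and calls fail fast; after `reset_timeout`
    seconds one trial call is allowed (half-open), and a success closes
    the circuit again."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0       # success resets the failure count
        self.opened_at = None   # and closes the circuit
        return result
```

Wrapping calls to the Payment Processing Endpoint and Dashboard Data Service this way would have let them degrade gracefully during the auth outage instead of queuing requests against a failing dependency.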
Conclusion
This incident highlighted gaps in cross-team communication, autoscaling configuration, and testing rigor. By addressing each root cause through the follow-up actions above, we expect to significantly reduce the risk of similar instance exhaustion incidents in the future.