ANKUSH CHOUDHARY JOHAL

Originally published at johal.in

Postmortem: Cloud Run Instance Exhaustion Caused a 500 Error Spike

Executive Summary

On May 15, 2024, a production user authentication service hosted on Cloud Run experienced severe instance exhaustion, driving the 500 Internal Server Error rate to a peak of 72% over a 45-minute window. The incident impacted ~12,000 user requests and degraded performance for three downstream high-traffic services. No data loss or corruption was reported.

Incident Impact

The incident lasted from 14:22 UTC to 15:07 UTC on May 15, 2024 (total 45 minutes). Key impact metrics:

  • Baseline 500 error rate: 0.2% → Peak error rate: 72%
  • Affected services: User Auth API, Payment Processing Endpoint, Dashboard Data Service
  • Total failed requests: ~12,000
  • No customer data loss or financial impact reported

Root Cause Analysis

Investigation revealed that the incident was caused by four compounding factors:

  1. Uncommunicated traffic surge: An unannounced marketing campaign launched at 14:20 UTC drove 2x normal traffic to the service, with no prior notice to the infrastructure team.
  2. Inadequate autoscaling configuration: The Cloud Run service was configured with a CPU utilization scaling threshold of 80% (delaying autoscaling triggers) and a maximum instance cap of 10, insufficient for the peak load.
  3. Long instance startup time: The service container image was 1.2GB (bloated with unused dev dependencies and test artifacts), resulting in a 45-second startup time for new instances.
  4. Undetected memory leak: A memory leak in the request handler caused existing instances to crash after ~10 minutes of sustained high load, further reducing available capacity (a simplified reproduction of this leak pattern is sketched below).
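
The report does not include the leaking handler itself, but a common shape for this kind of leak is an unbounded module-level cache that grows with every unique request. The sketch below is a hypothetical reproduction in a Flask-style handler, followed by the shape of a bounded-cache fix; the route, names, and validation logic are illustrative, not the actual service code.

```python
# Hypothetical sketch of the request-handler leak pattern described above.
# Route, names, and validation logic are illustrative, not the actual service code.
from functools import lru_cache

from flask import Flask, jsonify, request

app = Flask(__name__)

# BUG: unbounded module-level cache; every unique token adds an entry that is never
# evicted, so memory climbs under sustained load until the instance is killed.
_session_cache = {}


def expensive_validate(token: str) -> bool:
    """Placeholder for the real token validation call."""
    return bool(token)


@app.route("/auth/verify")
def verify_unbounded():
    token = request.headers.get("Authorization", "")
    if token not in _session_cache:
        _session_cache[token] = expensive_validate(token)
    return jsonify(valid=_session_cache[token])


# FIX (the general shape of an emergency patch): bound the cache so entries are evicted.
@lru_cache(maxsize=10_000)
def cached_validate(token: str) -> bool:
    return expensive_validate(token)


@app.route("/auth/verify-patched")
def verify_bounded():
    token = request.headers.get("Authorization", "")
    return jsonify(valid=cached_validate(token))
```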

Incident Timeline (UTC)

  • 14:22: Traffic surge begins, 500 error rate starts climbing from baseline 0.2%
  • 14:25: Error rate hits 72% peak, on-call engineer alerted via PagerDuty
  • 14:27: Engineer confirms 500 errors originate from Cloud Run service, begins metric review
  • 14:30: Identifies instance count stuck at 2, existing instances at 95% CPU utilization
  • 14:32: Manually increases max instance cap to 20, lowers CPU scaling threshold to 60%
  • 14:35: New instances begin spinning up, error rate drops to 45%
  • 14:40: 8 instances running, error rate down to 15%
  • 14:45: Memory leak causes 2 existing instances to crash, error rate ticks up to 22%
  • 14:50: Engineer deploys emergency memory leak patch, rolls out to 10% of instances first
  • 14:57: Full rollout of memory leak fix completed, all instances stable
  • 15:07: Error rate returns to baseline 0.2%, incident declared resolved

Mitigation Steps

During the incident, the on-call team took the following immediate actions:

  • Manually adjusted Cloud Run autoscaling parameters (max instances to 20, CPU threshold to 60%)
  • Deployed emergency patch to fix the request handler memory leak
  • Temporarily rate-limited incoming traffic from the marketing campaign source to 1.5x normal load (one application-layer approach to this kind of cap is sketched below)
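
The postmortem does not say where this rate limit was applied; one application-layer way to cap a single source at roughly 1.5x its normal load is a token bucket. The sketch below is generic, and the baseline rate, burst size, and source identification are assumptions rather than details from the incident.

```python
# Hypothetical token-bucket limiter for capping one traffic source at ~1.5x baseline.
# The baseline rate, burst size, and source-identification value are assumptions.
import threading
import time


class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec          # tokens added per second
        self.capacity = burst             # maximum bucket size
        self.tokens = float(burst)
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False


BASELINE_RPS = 200  # assumed normal request rate for the campaign source
campaign_bucket = TokenBucket(rate_per_sec=BASELINE_RPS * 1.5, burst=int(BASELINE_RPS * 1.5))


def handle_request(source: str) -> tuple[int, str]:
    # Requests tagged as coming from the campaign source are throttled; others pass through.
    if source == "marketing-campaign" and not campaign_bucket.allow():
        return 429, "Too Many Requests"
    return 200, "OK"
```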

Follow-Up Actions

To prevent recurrence, the team has committed to the following long-term fixes:

  1. Implement a pre-launch traffic coordination process: all marketing campaigns and feature launches require 48 hours' notice to the infrastructure team, including traffic estimates.
  2. Optimize the container image: remove unused dependencies and test artifacts to reduce image size from 1.2GB to <300MB, cutting startup time to <15 seconds.
  3. Update Cloud Run autoscaling config: set CPU threshold to 60%, max instances to 30, add concurrent request count as a secondary scaling metric.
  4. Add memory utilization alerts for Cloud Run instances, triggering warnings at 70% usage and critical alerts at 85%.
  5. Add load testing to CI/CD pipelines: simulate 3x normal traffic for all production-bound deployments.
  6. Implement circuit breakers for all downstream service dependencies to prevent cascading failures (a minimal sketch follows this list).
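
For item 6, a circuit breaker wraps each downstream call, fails fast after a run of errors, and probes again after a cooldown. The following is a minimal sketch under assumed thresholds; the breaker parameters and the downstream call are placeholders, not the team's actual implementation.

```python
# Minimal circuit-breaker sketch for downstream calls (thresholds are illustrative).
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a probe call
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast instead of calling downstream")
            # Cooldown elapsed: allow a single probe call ("half-open").
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result


def fetch_payment_api(user_id: str) -> dict:
    """Placeholder for the real downstream payment client call."""
    return {"user_id": user_id, "status": "charged"}


# Usage: wrap each downstream dependency with its own breaker.
payments_breaker = CircuitBreaker(failure_threshold=5, reset_timeout=30.0)


def charge_user(user_id: str) -> dict:
    return payments_breaker.call(fetch_payment_api, user_id)
```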

Conclusion

This incident highlighted gaps in cross-team communication, autoscaling configuration, and testing rigor. By addressing each root cause through the follow-up actions above, we expect to significantly reduce the risk of similar instance exhaustion incidents in the future.
