Postmortem: A Vercel Edge Function Timeout Caused Our Global API to Fail for 30 Minutes
On October 17, 2024, at 14:22 UTC, our global API experienced a complete outage lasting 30 minutes, caused by a misconfigured timeout setting on a Vercel Edge Function. This postmortem details the incident timeline, root cause, resolution steps, and long-term prevention measures.
Incident Timeline (All Times UTC)
- 14:22: Monitoring alerts trigger for 100% error rate on global API endpoints. All user requests return 504 Gateway Timeout errors.
- 14:24: On-call engineering team acknowledges the alert and begins investigating. Initial checks show Vercel Edge Function logs reporting timeout errors for the core API handler.
- 14:27: Team identifies the recent deployment of a configuration change to the Edge Function that reduced the execution timeout from 10 seconds to 1 second, below the average 2.8-second response time of downstream database queries.
- 14:29: Team rolls back the Edge Function configuration to the previous 10-second timeout setting.
- 14:30: API health checks return to 100% success rate. User access is fully restored.
- 14:45: Post-incident review kickoff meeting held to document findings.
Root Cause
The outage was caused by a human error during a routine performance optimization deployment. An engineer incorrectly updated the maxDuration setting for the core Vercel Edge Function from 10 seconds to 1 second, intending to reduce idle timeout waste. However, the team failed to account for the average 2.8-second response time of downstream PostgreSQL queries and third-party auth service calls, which regularly exceeded the 1-second threshold. Vercel Edge Functions terminate execution when the timeout is reached, returning 504 errors to all callers.
Contributing Factors
- Lack of pre-deployment validation for Edge Function timeout settings against historical performance metrics.
- No staged rollout for configuration changes to Edge Functions, leading to immediate global propagation of the misconfiguration.
- Insufficient alerting for Edge Function timeout threshold breaches before 100% error rate was reached.
Impact
- 100% of global API requests failed with 504 errors for 30 minutes (14:22–14:52 UTC).
- All user-facing products (web app, mobile apps, third-party integrations) were inaccessible during the outage.
- Estimated 12,000 failed user requests; no data loss occurred.
Resolution
The team resolved the incident by rolling back the Edge Function configuration to the previous 10-second timeout setting at 14:29 UTC. Full service restoration was confirmed at 14:30 UTC after 1 minute of health check validation. No additional code changes were required, as the issue was purely configuration-based.
Follow-Up Actions
To prevent similar incidents, we have implemented the following changes:
- Added pre-deployment checks for Edge Function timeout settings, which compare the new timeout value against the 95th percentile of historical execution times for the past 7 days. Deployments will block if the timeout is set below the 95th percentile.
- Mandated staged rollouts for all Edge Function configuration changes, starting with 1% of traffic, then 10%, then 100%, with automatic rollback if error rates exceed 1%.
- Added new Vercel Edge Function alerts for timeout rates above 5%, triggering immediate Slack notifications to the on-call team before full outage occurs.
- Updated internal deployment training to include a mandatory review step for all timeout and execution limit changes.
- Added a runbook for Edge Function timeout incidents, including step-by-step rollback procedures.
Top comments (0)