ANKUSH CHOUDHARY JOHAL

Originally published at johal.in

Postmortem: A Vercel Edge Function Timeout Caused Our Global API to Fail for 30 Minutes

On October 17, 2024, at 14:22 UTC, our global API experienced a complete outage lasting 30 minutes, caused by a misconfigured timeout setting on a Vercel Edge Function. This postmortem details the incident timeline, root cause, resolution steps, and long-term prevention measures.

Incident Timeline (All Times UTC)

  • 14:22: Monitoring alerts trigger for 100% error rate on global API endpoints. All user requests return 504 Gateway Timeout errors.
  • 14:24: On-call engineering team acknowledges the alert and begins investigating. Initial checks show Vercel Edge Function logs reporting timeout errors for the core API handler.
  • 14:27: Team identifies the recent deployment of a configuration change to the Edge Function that reduced the execution timeout from 10 seconds to 1 second, below the average 2.8-second response time of downstream database queries.
  • 14:29: Team rolls back the Edge Function configuration to the previous 10-second timeout setting.
  • 14:30: API health checks return to 100% success rate. User access is fully restored.
  • 14:45: Post-incident review kickoff meeting held to document findings.

Root Cause

The outage was caused by human error during a routine performance-optimization deployment. An engineer lowered the maxDuration setting for the core Vercel Edge Function from 10 seconds to 1 second, intending to reduce time wasted on idle executions. The change did not account for the downstream PostgreSQL queries and third-party auth service calls, which average 2.8 seconds and regularly exceeded the new 1-second limit. Vercel Edge Functions terminate execution when the timeout is reached, so every caller received a 504 error.
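The arithmetic behind the failure can be shown with a small model. This is only a sketch, not Vercel's runtime: the handle and errorRate functions and the sample latencies are hypothetical, chosen so that the samples average the 2.8 seconds reported above.

```typescript
// Simplified model of a request that is cut off at a timeout.
// Hypothetical names and data; this does not call any Vercel API.

type Outcome = { status: 200 | 504; elapsedMs: number };

// A request succeeds only if the downstream work finishes within the limit.
function handle(downstreamMs: number, maxDurationMs: number): Outcome {
  if (downstreamMs > maxDurationMs) {
    // Execution is terminated at the limit and the caller sees a 504.
    return { status: 504, elapsedMs: maxDurationMs };
  }
  return { status: 200, elapsedMs: downstreamMs };
}

// Fraction of requests that would time out under a given limit.
function errorRate(latenciesMs: number[], maxDurationMs: number): number {
  const failures = latenciesMs.filter(
    (ms) => handle(ms, maxDurationMs).status === 504
  ).length;
  return failures / latenciesMs.length;
}

// Assumed sample of downstream latencies (ms); the average is 2800.
const sampleLatenciesMs = [2100, 2800, 3500];

const rateAt1s = errorRate(sampleLatenciesMs, 1_000); // every sample exceeds 1 s
const rateAt10s = errorRate(sampleLatenciesMs, 10_000); // every sample fits in 10 s
```

Under these assumed latencies, every request times out at the 1-second limit and none do at 10 seconds, which matches the jump to a 100% error rate once the change went live.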

Contributing Factors

  • Lack of pre-deployment validation for Edge Function timeout settings against historical performance metrics.
  • No staged rollout for configuration changes to Edge Functions, leading to immediate global propagation of the misconfiguration.
  • Insufficient alerting on Edge Function timeouts: no alert fired until the error rate had already reached 100%.
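To illustrate the alerting gap, here is a minimal sketch of a sliding-window timeout-rate check that would fire well before the error rate reaches 100%. The TimeoutRateMonitor class, the window size, and the way outcomes are fed in are all assumptions for illustration; the 5% threshold is the one named in the follow-up actions.

```typescript
// Hypothetical sliding-window monitor; not part of any Vercel SDK.
class TimeoutRateMonitor {
  private readonly windowSize: number;
  private readonly threshold: number;
  private outcomes: boolean[] = [];

  constructor(windowSize: number, threshold: number) {
    this.windowSize = windowSize;
    this.threshold = threshold;
  }

  // Records one request outcome and returns true when an alert should fire.
  record(timedOut: boolean): boolean {
    this.outcomes.push(timedOut);
    if (this.outcomes.length > this.windowSize) {
      this.outcomes.shift(); // drop the oldest outcome
    }
    // Wait for a full window so one early timeout does not trigger an alert.
    if (this.outcomes.length < this.windowSize) {
      return false;
    }
    const timeouts = this.outcomes.filter(Boolean).length;
    return timeouts / this.outcomes.length > this.threshold;
  }
}

// Assumed settings: last 20 requests, alert above a 5% timeout rate.
const monitor = new TimeoutRateMonitor(20, 0.05);
```

A real setup would more likely compute this rate in the monitoring system from log or metric data; the class above only shows the threshold logic.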

Impact

  • 100% of global API requests failed with 504 errors for 30 minutes (14:22–14:52 UTC).
  • All user-facing products (web app, mobile apps, third-party integrations) were inaccessible during the outage.
  • Estimated 12,000 failed user requests; no data loss occurred.

Resolution

The team resolved the incident by rolling back the Edge Function configuration to the previous 10-second timeout setting at 14:29 UTC. Full service restoration was confirmed at 14:30 UTC after 1 minute of health check validation. No additional code changes were required, as the issue was purely configuration-based.

Follow-Up Actions

To prevent similar incidents, we have implemented the following changes:

  • Added pre-deployment checks for Edge Function timeout settings, which compare the new timeout value against the 95th percentile of execution times over the past 7 days. A deployment is blocked if its timeout is set below that 95th percentile.
  • Mandated staged rollouts for all Edge Function configuration changes, starting with 1% of traffic, then 10%, then 100%, with automatic rollback if error rates exceed 1%.
  • Added new Vercel Edge Function alerts for timeout rates above 5%, which send immediate Slack notifications to the on-call team before a full outage occurs.
  • Updated internal deployment training to include a mandatory review step for all timeout and execution limit changes.
  • Added a runbook for Edge Function timeout incidents, including step-by-step rollback procedures.
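The first two follow-up actions can be sketched as follows. This is an illustration under stated assumptions rather than our actual tooling: the function names, the nearest-rank percentile method, and the sample history are hypothetical, while the 95th-percentile rule, the 1% / 10% / 100% stages, and the 1% error-rate rollback trigger come from the list above.

```typescript
// Hypothetical helpers; none of these are Vercel APIs.

// Nearest-rank percentile of a non-empty list of durations (ms).
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Pre-deployment gate: block when the proposed timeout is below the p95.
function validateTimeout(
  proposedMs: number,
  historyMs: number[]
): { ok: boolean; p95Ms: number } {
  const p95Ms = percentile(historyMs, 95);
  return { ok: proposedMs >= p95Ms, p95Ms };
}

const STAGES = [1, 10, 100]; // percent of traffic per rollout stage
const MAX_ERROR_RATE = 0.01; // roll back above a 1% error rate

// Decides what to do after observing the error rate at the current stage.
function nextStage(
  currentPercent: number,
  errorRate: number
): number | "rollback" | "done" {
  if (errorRate > MAX_ERROR_RATE) return "rollback";
  const index = STAGES.indexOf(currentPercent);
  if (index === -1) throw new Error(`unknown stage: ${currentPercent}`);
  return index === STAGES.length - 1 ? "done" : STAGES[index + 1];
}

// Assumed execution-time history (ms) standing in for 7 days of metrics.
const historyMs = [2100, 2500, 2800, 3000, 3400];
const rejected = validateTimeout(1_000, historyMs); // 1 s is below the p95
const accepted = validateTimeout(10_000, historyMs); // 10 s is above the p95
```

With this assumed history, the 1-second timeout that caused the incident would have been rejected before deployment, and even if it had shipped, the 1% stage would have returned "rollback" before the change reached all traffic.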