Elizabeth Fuentes L for AWS

Fix MCP Timeouts: Async HandleId Pattern

MCP tools freeze AI agents when external APIs are slow, causing 424 errors. The async handleId pattern returns immediately with a job ID and polls for results without blocking.

MCP tool timeout occurs when an AI agent calls a Model Context Protocol (MCP) tool that depends on a slow external API. The tool blocks the agent indefinitely instead of returning an error. The result is a 424 (Failed Dependency) error or a frozen workflow with no user feedback. This post shows the problem with real scenarios and how the async handleId pattern provides immediate responses.

This demo uses Strands Agents with MCP (Model Context Protocol). The async pattern is framework-agnostic and applies to any agent that calls external APIs through MCP.

Working code: github.com/aws-samples/sample-why-agents-fail

Series: Why AI Agents Fail

  1. Context Window Overflow — Memory Pointer Pattern for large data
  2. MCP Tools That Never Respond (this post) — Async pattern for slow external APIs
  3. AI Agent Reasoning Loops — Detect and block repeated tool calls

The Problem: MCP Tools That Never Respond

The Model Context Protocol (MCP) enables AI agents to call external tools. But when those tools depend on slow APIs, the entire agent workflow freezes. The agent waits. The user waits. Nothing happens.

Community observation from Octopus (Resilient AI Agents With MCP, 2025) identifies the core issue: as external system integrations increase, so does the likelihood of failure. Systems become unavailable, slow to respond, or return errors. Agents have no built-in strategy to handle this.

OpenAI Community reports confirm the real-world impact:

  • 424 errors when MCP tools take too long
  • Unresponsive states where requests neither succeed nor fail
  • Tools that pass handshake validation but timeout during execution

Why This Happens

MCP expects tools to respond quickly. When a tool calls a slow external API, it blocks until that API returns, and the protocol's expectations break down.

The MCP protocol has implicit timeout expectations. If the tool doesn't respond within ~7-10 seconds, the connection may drop with a 424 (Failed Dependency) error. The agent receives an error instead of data, and the user gets no useful response.

Three failure modes:

  1. Slow API — Tool waits 15+ seconds, poor UX but eventually responds
  2. Failing API — External service unavailable, 424 error after timeout
  3. Unresponsive state — Request accepted but never returns, requires session restart
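These failure modes are easy to reproduce with plain asyncio. A scaled-down sketch: a 0.3s "API" against a 0.1s deadline stands in for a 15s API against the ~7-10s MCP window (the function names here are illustrative, not from the demo repo):

```python
import asyncio

# Hypothetical stand-in for a slow external dependency (data pipeline, batch job)
async def slow_external_api() -> str:
    await asyncio.sleep(0.3)  # simulates the slow upstream call
    return "data"

async def call_tool_with_timeout() -> str:
    # The client enforces a deadline; the 0.1s here models MCP's implicit window
    try:
        return await asyncio.wait_for(slow_external_api(), timeout=0.1)
    except asyncio.TimeoutError:
        # This is the moment the agent sees a failure instead of data
        return "424 Failed Dependency: tool did not respond in time"

print(asyncio.run(call_tool_with_timeout()))
```

The same shape plays out at real timescales: the agent never sees the data, only the dependency failure.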

The Demo: Simulating Real Timeout Scenarios

We built an MCP server that simulates these real-world scenarios:

from mcp.server import FastMCP
import asyncio

# FastMCP is a lightweight MCP server framework — tools are registered with @mcp.tool()
mcp = FastMCP("Timeout Demo Server")

# Baseline: responds in 1s, well within MCP's implicit timeout threshold (~7-10s)
@mcp.tool(description="Fast API - responds in 1 second")
async def fast_api(query: str) -> str:
    await asyncio.sleep(1)
    return f"Fast result for: {query}"

# Problem case: 15s delay exceeds MCP timeout — the agent freezes waiting for this
@mcp.tool(description="Slow API - responds in 15 seconds")
async def slow_api(query: str) -> str:
    await asyncio.sleep(15)  # Simulates a slow external service (data pipeline, batch job)
    return f"Slow result for: {query}"

# Failure case: 7s delay triggers the timeout, then raises Failed Dependency (424)
@mcp.tool(description="Failing API - returns 424 after delay")
async def failing_api(query: str) -> str:
    await asyncio.sleep(7)
    raise Exception("Failed Dependency: External service unavailable")

The Async HandleId Solution

Comparison of synchronous MCP tool call blocked for 17.2 seconds versus async handleId pattern completing in 1.7 seconds

Instead of waiting for slow operations, return immediately with a tracking ID:

import uuid

# In-memory job store: maps job_id → {status, query, result}
# For production, replace with a persistent store (Redis, DynamoDB) for durability across restarts
JOBS = {}

# The handleId pattern: return a tracking ID immediately instead of blocking
@mcp.tool(description="Start a long-running job, returns immediately with job ID")
async def start_async_job(query: str) -> str:
    job_id = str(uuid.uuid4())[:8]  # Short ID the LLM can pass in follow-up calls
    JOBS[job_id] = {"status": "processing", "query": query}

    # Fire-and-forget: slow work runs in the background, tool returns before it finishes
    asyncio.create_task(do_work(job_id, query))

    # The agent receives this in under 1s: no timeout, no frozen UI
    return f"Job started: {job_id}. Use check_job_status to poll for results."

# Background worker: runs the slow external call, then stores the result for polling
async def do_work(job_id: str, query: str):
    await asyncio.sleep(15)  # stands in for the slow external API call
    JOBS[job_id]["status"] = "completed"
    JOBS[job_id]["result"] = f"Processed result for: {query}"

# Polling endpoint: the agent calls this repeatedly until status is "completed"
@mcp.tool(description="Check status of a running job")
async def check_job_status(job_id: str) -> str:
    job = JOBS.get(job_id)
    if not job:
        return f"Job {job_id} not found"
    if job["status"] == "completed":
        return f"COMPLETED: {job['result']}"  # Return the actual result to the agent
    return f"PROCESSING: Job {job_id} still running"  # Agent polls again after a short wait

Demo Results

We tested all four scenarios with a Strands Agent connected to the MCP server:

| Scenario | Response Time | User Experience | Research Finding |
| --- | --- | --- | --- |
| Fast API (1s delay) | 3.2s total | ✅ Good UX | Baseline |
| Slow API (15s delay) | 17.8s total | ❌ Poor UX (agent waits) | Octopus: "agent waits indefinitely" |
| Failing API (424) | 7.7s total | ❌ Error after wait | OpenAI Community: 424 errors |
| Async pattern (handleId) | 3.7s total | ✅ Immediate response | Solution: "respond ASAP with handleId" |

Bar chart comparing MCP tool response times across fast API, slow API, failing API, and async handleId scenarios

The async pattern transforms a 17.8s wait into a 3.7s immediate response. The agent tells the user "job started" and can check status later, with no frozen UI and no timeout errors.
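From the agent's side, the pattern is just a poll loop. A minimal sketch, assuming some call_tool() wrapper for invoking MCP tools; the fake below returns PROCESSING twice and then COMPLETED, purely to exercise the loop:

```python
import asyncio
import itertools

# Fake tool responses: two PROCESSING polls, then COMPLETED forever after
_responses = itertools.chain(
    ["PROCESSING", "PROCESSING"],
    itertools.repeat("COMPLETED: Result ready"),
)

# Hypothetical MCP call wrapper (a real agent would go through its MCP client)
async def call_tool(name: str, job_id: str) -> str:
    return next(_responses)

async def poll_until_done(job_id: str, interval: float = 0.05, max_polls: int = 20) -> str:
    for _ in range(max_polls):
        status = await call_tool("check_job_status", job_id)
        if status.startswith("COMPLETED"):
            return status
        await asyncio.sleep(interval)  # wait before polling again
    return f"TIMEOUT: job {job_id} never completed"

print(asyncio.run(poll_until_done("a1b2c3d4")))
```

A real client would also add backoff between polls and an overall deadline; max_polls plays that role here.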

Why Strands Agents for MCP Integration?

Strands' MCPClient connects to any MCP server in a couple of lines. The agent discovers available tools at runtime through list_tools_sync(), so you don't maintain a hardcoded tool list. When the MCP server implements the async handleId pattern, the agent polls automatically without extra orchestration code.

Strands supports multiple model providers (OpenAI, Amazon Bedrock, Anthropic, Ollama). The MCP timeout patterns shown here work identically across all providers.

When to Use Each Pattern

Direct call (fast tools < 5s):

  • Lookups, calculations, small API calls
  • No timeout risk

Async handleId (slow tools > 5s):

  • External API calls with unpredictable latency
  • Data processing, report generation
  • Any operation that might exceed MCP timeout

Retry with backoff (intermittent failures):

  • Services that occasionally fail but recover
  • Network-dependent operations
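For the intermittent-failure case, a retry helper with exponential backoff and jitter is enough. A minimal sketch (not from the demo repo; delays are scaled down for illustration):

```python
import asyncio
import random

# Retry a flaky async call with exponentially growing delays plus jitter
async def retry_with_backoff(func, max_retries: int = 4, base_delay: float = 0.05):
    for attempt in range(max_retries):
        try:
            return await func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            # 0.05s, 0.1s, 0.2s, ... plus random jitter to avoid thundering herds
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)

# Flaky stand-in for a recovering service: fails twice, then succeeds
_calls = {"n": 0}
async def flaky_api() -> str:
    _calls["n"] += 1
    if _calls["n"] < 3:
        raise ConnectionError("service temporarily unavailable")
    return "ok"

print(asyncio.run(retry_with_backoff(flaky_api)))
```

Backoff suits transient faults; for calls that are reliably slow rather than flaky, the handleId pattern above is the better fit.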

Try It Yourself

You need Python 3.9+, uv, and an OpenAI API key. The MCP server runs locally as a subprocess, so no external services are needed.

git clone https://github.com/aws-samples/sample-why-agents-fail
cd sample-why-agents-fail/stop-ai-agents-wasting-tokens/02-mcp-timeout-demo
uv venv && uv pip install -r requirements.txt
export OPENAI_API_KEY="your-key-here"

uv run python test_mcp_timeout.py   # Runs all 4 scenarios

Or open test_mcp_timeout.ipynb in Jupyter, JupyterLab, VS Code, or your preferred notebook environment.

Key Takeaways

  1. MCP tools timeout silently — 424 errors with no recovery
  2. Slow APIs freeze the entire agent — 17.8s wait with no feedback
  3. Async handleId pattern solves it — immediate response, poll for results
  4. Design for failure — every external call can timeout, plan accordingly

Frequently Asked Questions

What causes 424 errors in MCP tool calls?

A 424 (Failed Dependency) error occurs when an MCP tool takes longer than the implicit timeout threshold (typically 7-10 seconds) to respond. The MCP protocol expects tools to return quickly. When an external API blocks the tool beyond this threshold, the connection drops and the agent receives a 424 error instead of data.

When should I use the async handleId pattern instead of a direct MCP tool call?

Use the async handleId pattern for any tool that calls an external API with unpredictable latency: data processing, report generation, third-party service calls, or any operation that might exceed 5 seconds. For fast lookups, calculations, and small API calls under 5 seconds, direct calls work fine.

Does the async handleId pattern work with any MCP server, not only Strands?

Yes. The async handleId pattern is an MCP server design pattern, not a framework feature. Any MCP-compatible agent can call start_async_job and check_job_status tools. The pattern works with OpenAI Agents, LangChain MCP integrations, and any client that supports the Model Context Protocol.



Thanks!

