Shivnath Tathe

I Built an AI Agent That Thinks Out Loud While Using Your APIs—Here's the Non-Obvious Part

How MCP + Claude turned a boring hotel search into a reasoning machine


You've probably seen the demos. An AI assistant that calls tools, fetches data, and returns an answer. Clean. Impressive. But here's what those demos don't show you — the part where the model thinks between tool calls.

That's what I actually built. And the bugs I hit along the way taught me more about agentic AI than any tutorial ever could.


What Is MCP and Why Should You Care

The Model Context Protocol (MCP) is Anthropic's open standard for connecting AI models to external tools and data sources. Think of it as a USB-C port for AI—one protocol, any tool.

Instead of hardcoding tool logic into your LLM app, you expose tools via an MCP server. The model discovers them, decides when to use them, calls them, and reasons about the results. Your backend stays clean. Your AI gets smarter.

I built an MCP server on top of a corporate travel API—hotels, weather, city lookups—and connected it to a Claude-powered chat interface. What came out the other side surprised me.


The Part Nobody Talks About—Agentic Loops

Most tutorials show you a single tool call. User asks, model calls tool, model answers. Done.

Real agentic behavior is messier and far more interesting.

Here is what actually happens when a user asks, "Is Mumbai good to visit this month with the best hotels?"

User:    Is Mumbai good to visit this month? Find best hotels too.

Claude:  I'll help plan your trip. Let me first look up Mumbai's city code,
         then fetch the weather and hotel listings.

[calls city_lookup -> gets city code MUI]

Claude:  Found Mumbai with code MUI. Now let me check the weather and pull hotel listings.

[calls get_weather_forecast -> 5 day forecast]
[calls get_hotels -> 20 luxury properties]

Claude:  Perfect. Here's your complete trip summary...
         [full structured answer with weather table + hotel list]

The model narrates what it is about to do, executes the tool, interprets the result, decides if it needs more data, and continues. That is not a chatbot. That is a reasoning agent.


The Bug That Took Half a Day to Find

When I first implemented streaming, tool calls were silently disappearing. The model would narrate perfectly—"Let me search for hotels..."—then stop. No tool call. No error. Just silence.

The culprit was buried in the Anthropic Python SDK.

When a tool-use block completes during streaming, the SDK emits a ParsedContentBlockStopEvent. My code was only handling RawContentBlockStopEvent. Two different event types, distinguished by nothing but a prefix in a string comparison. Half a day gone.

# What I expected
elif etype == "RawContentBlockStopEvent":
    # handle tool completion

# What the SDK actually sends for tool_use blocks
elif etype == "ParsedContentBlockStopEvent":
    block = event.content_block
    if block.type == "tool_use":
        # THIS is where the tool data lives
        tool_block = {
            "type": "tool_use",
            "id": block.id,
            "name": block.name,
            "input": block.input,
        }

The fix was five lines. The discovery was everything.
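One way to avoid this class of bug is to match on the shared suffix of the event class name instead of one exact string, so both variants are caught. The event classes below are stand-ins for the real SDK types—this is a sketch of the pattern, not the Anthropic SDK's API:

```python
# Stand-in event classes; the real ones live in the Anthropic SDK.
class RawContentBlockStopEvent:
    pass

class ParsedContentBlockStopEvent:
    def __init__(self, content_block=None):
        self.content_block = content_block

class TextDeltaEvent:
    pass

def is_block_stop(event) -> bool:
    # Catch both the Raw... and Parsed... variants by matching the suffix,
    # instead of comparing against a single hardcoded class name.
    return type(event).__name__.endswith("ContentBlockStopEvent")
```

Had the dispatch been written this way, the tool-use blocks would have been handled regardless of which stop-event variant the SDK emitted.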


The Architecture That Actually Works

Here is the full stack I landed on after a lot of iteration:

Frontend (React + Vite)
    │  SSE stream
Backend (FastAPI)
    │  Anthropic API (claude-sonnet-4-5)
    │  MCP client
MCP Server (FastAPI + MCP SDK)
    │  REST APIs (Hotels, Weather, City Lookup)

The key design decision—the backend runs a multi-turn agentic loop:

stop_reason = None
for iteration in range(MAX_ITERATIONS):
    tool_use_events = []                       # reset per iteration
    async for event in llm.stream_agentic(messages, tools, system):
        if event["type"] == "text_delta":
            yield to_frontend(event["text"])   # stream narration live
        elif event["type"] == "tool_use":
            tool_use_events.append(event)
        elif event["type"] == "end":
            stop_reason = event["stop_reason"]

    if stop_reason != "tool_use":
        break                                  # model is done

    # execute tools, feed results back, loop again
    messages.append({"role": "assistant", "content": content_blocks})
    messages.append({"role": "user", "content": tool_results})

Every iteration, narration streams to the frontend in real time. Tools execute server-side. Results feed back into the next iteration. The loop continues until the model says it is done.
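For reference, the `yield to_frontend(...)` step boils down to Server-Sent Events framing: each payload goes out as a `data:` line terminated by a blank line, which the browser's `EventSource` (or a fetch-based reader) parses back into events. A minimal sketch—the event payload shape here is assumed, not taken from the actual backend:

```python
import json

def sse_event(payload: dict) -> str:
    # SSE wire format: a "data: <json>" line followed by a blank line.
    return f"data: {json.dumps(payload)}\n\n"

# Example: stream one narration chunk to the browser.
frame = sse_event({"type": "text_delta", "text": "Let me pull hotels now."})
```

Because each frame is self-delimiting, the frontend can render narration character by character while tool calls are still executing server-side.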


The Frontend Problem Nobody Mentions

Streaming an agentic response to a UI is harder than it looks.

The naive approach — separate arrays for tool calls and text — breaks immediately. When the model narrates, calls a tool, narrates again, calls another tool, then gives a final answer, you need the UI to reflect that exact sequence.

I refactored the Turn data model from this:

// Bad -- loses ordering information
interface Turn {
  toolCalls: ToolCall[]       // all tools, no position
  assistantMessage: string    // all text, no position
}

To this:

// Good -- preserves arrival order
type ContentBlock =
  | { kind: "text"; text: string }
  | { kind: "tool"; tool: ToolCall }

interface Turn {
  blocks: ContentBlock[]      // interleaved, in order
}

Now the UI renders exactly what happened—narration, tool call, more narration, another tool call, and final answer—in the sequence the model produced.
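The ordering logic itself is a simple fold: text deltas extend the trailing text block, tool events always start a new block. Sketched here in Python for brevity (mirroring the TypeScript `ContentBlock` union above; the event field names are assumptions):

```python
def append_event(blocks: list, event: dict) -> list:
    """Fold one stream event into an ordered list of content blocks."""
    if event["type"] == "text_delta":
        # Coalesce consecutive text deltas into the trailing text block.
        if blocks and blocks[-1]["kind"] == "text":
            blocks[-1]["text"] += event["text"]
        else:
            blocks.append({"kind": "text", "text": event["text"]})
    elif event["type"] == "tool_use":
        # Tool calls always start a new block, preserving arrival order.
        blocks.append({"kind": "tool", "tool": event["name"]})
    return blocks
```

The same fold works on the frontend as a state reducer: each SSE frame dispatches one `append_event`, and the render is just a map over `blocks`.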


Conversation Memory Across Tool Calls

Here is the subtle bug that broke follow-up questions.

After a full agentic turn I was saving history like this:

# Wrong -- loses all tool data
await append_message(session_id, Message(role="user", content=user_text))
await append_message(session_id, Message(role="assistant", content=response_text))

So when a user asked "Give me those hotels in a table," Claude had no idea which hotels they meant. The formatted summary text was there, but the actual JSON tool results were gone.

The fix—save the full message array, including every tool use block and tool result:

# Right -- full context preserved
await set_messages(session_id, messages)
# messages contains: user -> assistant+tool_use -> tool_result -> assistant -> ...

Now Claude can answer follow-up questions using the actual data from previous tool calls. No re-fetching, no guessing.
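What "the full message array" looks like in practice, sketched with made-up values in Anthropic's message format (the tool id, tool name, and payloads here are illustrative, not the article's actual data):

```python
# One complete agentic turn as it should be persisted:
# user -> assistant+tool_use -> tool_result -> assistant.
messages = [
    {"role": "user", "content": "Best hotels in Mumbai?"},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Let me look up Mumbai's city code."},
        {"type": "tool_use", "id": "toolu_01", "name": "city_lookup",
         "input": {"city_name": "Mumbai"}},
    ]},
    # Tool results go back in a *user* message, keyed to the tool_use id.
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_01",
         "content": '{"city_code": "MUI"}'},
    ]},
    {"role": "assistant", "content": "Mumbai's city code is MUI. Here are the hotels..."},
]
```

Because the `tool_result` blocks survive in history, a follow-up like "show those hotels in a table" can be answered from data already in context.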


What Surprised Me Most

The model is a better orchestrator than I expected. Given a vague question like "Is Mumbai good to visit?", it independently decides to look up the city code first, then the weather, then hotels—in the right order, with the right parameters—without being told the sequence.

That is not prompt engineering. That is the model reasoning about what it needs and going to get it.

MCP makes this possible because the tools are described, not hardcoded. The model reads the tool definitions and figures out the rest.
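"Described, not hardcoded" means every tool ships with a machine-readable schema the model reads at runtime. A sketch of what such a description looks like, in the JSON-Schema style that Anthropic tool definitions use (the field values are illustrative, not the article's actual server):

```python
# A tool the model can discover and reason about. The description and
# schema are the model's only knowledge of this tool.
city_lookup_tool = {
    "name": "city_lookup",
    "description": "Resolve a city name to the travel API's internal city code.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city_name": {
                "type": "string",
                "description": "City to resolve, e.g. 'Mumbai'",
            },
        },
        "required": ["city_name"],
    },
}
```

The `description` fields do the heavy lifting: they are all the model has when deciding when to call the tool and what to pass it, which is why vague descriptions produce vague tool use.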


Try It Yourself

The MCP server exposes tools over HTTP using the streamable-HTTP transport—no websockets, no special infrastructure. Any HTTP client can call it.

# Your MCP server is just a FastAPI app
uvicorn main:app --host 0.0.0.0 --port 8000

# Claude Desktop connects via config
{
  "mcpServers": {
    "my-server": {
      "url": "http://localhost:8000/mcp"
    }
  }
}

Where This Goes Next

A few things I am actively working on:

  • Precise vs Expressive mode—user-configurable system prompts so the model gives short direct answers or detailed, friendly responses depending on context
  • City code disambiguation—when a user says "Dubai," the model should confirm which Dubai before searching
  • Smooth streaming—React 18 automatic batching creates interesting problems when you want character-level typewriter effects on SSE streams

MCP is still early. The tooling is moving fast. But the core idea—give the model well-described tools and let it reason about when and how to use them—is already working in production.


About the Author

Shivnath Tathe—Software Engineer and Independent Researcher working at the intersection of LLMs and production systems. Published work on 4-bit quantized neural network training and continual learning on arXiv.

If you are building something with MCP or agentic AI, I would love to connect.
