<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: erlangb</title>
    <description>The latest articles on DEV Community by erlangb (@erlangb).</description>
    <link>https://dev.to/erlangb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3817288%2F615db6e4-f7e0-4a7c-b3e5-5319cb2e5980.jpg</url>
      <title>DEV Community: erlangb</title>
      <link>https://dev.to/erlangb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/erlangb"/>
    <language>en</language>
    <item>
      <title>A Movie Finder with AI Reflexion using GoLang</title>
      <dc:creator>erlangb</dc:creator>
      <pubDate>Wed, 11 Mar 2026 21:34:36 +0000</pubDate>
      <link>https://dev.to/erlangb/a-movie-finder-with-ai-reflexion-using-golang-3n0i</link>
      <guid>https://dev.to/erlangb/a-movie-finder-with-ai-reflexion-using-golang-3n0i</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: The "Vibes-Based" Engineering Trap
&lt;/h2&gt;

&lt;p&gt;We’ve all been there. You ask an LLM for "underground 80s sci-fi," and it starts strong with Blade Runner (hardly underground). Then, desperate to please, it hallucinates: "Have you seen Neon Shadows (1984)?" It sounds perfect. It sounds real. It doesn’t exist.&lt;/p&gt;

&lt;p&gt;In a side project, that’s a "lol." In production, that’s a total failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: The Confidence Gap
&lt;/h2&gt;

&lt;p&gt;LLMs aren't stupid; they are just pathologically helpful. They prioritize being pleasant over being factual because they lack a Skepticism Layer. Most developers fall into the trap of Linear Prompting:&lt;/p&gt;

&lt;p&gt;Send request.&lt;/p&gt;

&lt;p&gt;Hope for the best.&lt;/p&gt;

&lt;p&gt;But hope is not an engineering strategy. To build reliable Agentic AI, we need to move from "sending prompts" to "building pipelines that verify."&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution: Reflexion and Orchestration
&lt;/h3&gt;

&lt;p&gt;To solve this for my "Movie Finder" use case, I didn't just write a better prompt. I implemented the &lt;strong&gt;Reflexion Pattern&lt;/strong&gt;: an architectural loop where one agent's output is treated as a "draft" that must survive a rigorous audit by a second, skeptical agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cgnfyjem2yj7p3pb06o.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cgnfyjem2yj7p3pb06o.jpg" alt="loop-movie"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To bridge this gap, I used the &lt;strong&gt;EINO framework&lt;/strong&gt;. EINO (pronounced 'ay-no') is a Go-native orchestration framework designed specifically for LLM workflows. It allows you to model complex agentic logic as a graph of nodes, which was perfect for implementing the &lt;strong&gt;Reflexion Pattern&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;🛠️ Open Source &amp;amp; Local Setup&lt;br&gt;
The full source code for this project is available on GitHub. To visualize and monitor the agent's reasoning steps, I used two libraries I developed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/erlangb/agent_monitor" rel="noopener noreferrer"&gt;agent_monitor&lt;/a&gt;: The core Go project to run usecases&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/erlangb/agentmeter" rel="noopener noreferrer"&gt;agentmeter&lt;/a&gt;: a library for capturing and printing agent internals.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;"Why did I build an AI agent in &lt;strong&gt;Go&lt;/strong&gt; instead of Python?&lt;br&gt;
The honest answer: &lt;strong&gt;It's what I know.&lt;/strong&gt; But beyond familiarity, I wanted to explore the current 'state of the art' for Agentic AI in the Go ecosystem.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this article, I’ll show you how I moved beyond simple prompts to a dual-agent system. By pitting a Cinephile against a Clerk, I’ve built an adversarial loop where agents "argue" their way toward grounded truth.&lt;/p&gt;

&lt;p&gt;It’s not just about getting an answer; it’s about building a system that uses systematic skepticism to virtually eliminate hallucinations.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Concept: What is the Reflexion Pattern?
&lt;/h2&gt;

&lt;p&gt;At its core, the &lt;strong&gt;Reflexion Pattern&lt;/strong&gt; is a design pattern for LLM agents that introduces a "self-correction" loop.&lt;br&gt;
Think of a standard AI agent as a solo freelancer working without an editor. They produce work, and you get what you get. The Reflexion Pattern turns that solo freelancer into a &lt;strong&gt;team of two&lt;/strong&gt;: one who creates, and one who audits.&lt;/p&gt;
&lt;h3&gt;
  
  
  How it works (The 3-Step Dance)
&lt;/h3&gt;

&lt;p&gt;In my movie finder, the loop follows a specific cycle of &lt;strong&gt;Generation&lt;/strong&gt;, &lt;strong&gt;Critique&lt;/strong&gt;, and &lt;strong&gt;Correction&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Generation (The Draft):&lt;/strong&gt; The first agent (The Cinephile) receives the user's request and generates a response. It operates purely on its internal training data—which, as we know, can be prone to "stochastic dreaming" (hallucinations).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Critique (The Fact-Check):&lt;/strong&gt; Instead of showing the user the result, the output is passed to the second agent (The Clerk). This agent is given a specific "Skeptic" persona and, crucially, access to &lt;strong&gt;External Tools&lt;/strong&gt; (in this case, the &lt;strong&gt;Tavily Search API&lt;/strong&gt;). Its only job is to find reasons why the first agent might be wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correction (The Iteration):&lt;/strong&gt; If the Clerk finds an error, it doesn't just fail the process. It generates a &lt;strong&gt;feedback signal&lt;/strong&gt;—a structured message explaining &lt;em&gt;what&lt;/em&gt; was wrong and &lt;em&gt;why&lt;/em&gt;. This feedback is fed back into the first agent, which now has a "second chance" to get it right.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  The Architecture: Mapping the Graph
&lt;/h2&gt;

&lt;p&gt;To implement the Reflexion pattern in Go, I used EINO's Graph composition. This allows us to treat our agents as independent nodes connected by edges, including a conditional "branch" that creates our self-correction loop.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Pipeline Logic
&lt;/h3&gt;

&lt;p&gt;Here is the simplified implementation of the FindMoviesPipeline. Notice how the "Loop" isn't a complex for loop in the code, but a visual branch in the graph:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;NewFindMoviesPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cinephile&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;CinephileAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;clerk&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ClerkAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;curator&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;CuratorChain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FindMoviesPipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// 1. Initialize the Graph with a shared State&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;compose&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewGraph&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;appmodel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FindMoviesState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;appmodel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FindMoviesState&lt;/span&gt;&lt;span class="p"&gt;]()&lt;/span&gt;

    &lt;span class="c"&gt;// 2. Add the Agent Nodes&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddLambdaNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cinephile"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compose&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InvokableLambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cinephile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Invoke&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddLambdaNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"clerk"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compose&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InvokableLambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clerk&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Invoke&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddLambdaNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"curator"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compose&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InvokableLambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;curator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Invoke&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c"&gt;// 3. Define the linear flow&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddEdge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compose&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"cinephile"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddEdge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cinephile"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"clerk"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// 4. The Reflexion Branch: The "Skeptic" decides where to go next&lt;/span&gt;
    &lt;span class="n"&gt;branch&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;compose&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewGraphBranch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;appmodel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FindMoviesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="c"&gt;// If the Clerk is happy OR we've tried too many times, move to curation&lt;/span&gt;
          &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsSatisfied&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RetryCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MaxRetries&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
             &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;"curator"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="c"&gt;// Otherwise, send it back to the Cinephile for correction&lt;/span&gt;
          &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;"cinephile"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
       &lt;span class="p"&gt;},&lt;/span&gt;
       &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"curator"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"cinephile"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddBranch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"clerk"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;branch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddEdge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"curator"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compose&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compose&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithGraphName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"find_movies_graph"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why this code matters:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The State Object: The FindMoviesState acts as the "Short-term Memory." It carries the current list of movies, the Clerk's critiques, and the RetryCount.&lt;/li&gt;
&lt;li&gt;Decoupled Logic: The cinephile doesn't know the clerk exists. It just knows it receives a state and returns a state. This makes &lt;strong&gt;testing individual agents&lt;/strong&gt; much easier.&lt;/li&gt;
&lt;li&gt;The Branch is the Brain: The NewGraphBranch function is where the Reflexion happens. It forces the system to be honest: if state.IsSatisfied is false, the data cannot reach the curator node.&lt;/li&gt;
&lt;/ul&gt;
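&lt;p&gt;To make the state object concrete, here is a sketch of what FindMoviesState might contain. Only IsSatisfied, RetryCount, and MaxRetries appear in the pipeline code above; the other field names are my assumptions about the "short-term memory":&lt;/p&gt;

```go
package main

import "fmt"

// FindMoviesState is a sketch of the shared state passed between nodes;
// the real project's field names may differ.
type FindMoviesState struct {
	Query       string   // refined user request (assumed field)
	Movies      []string // the Cinephile's current draft list (assumed field)
	Critiques   []string // the Clerk's objections from the last audit (assumed field)
	IsSatisfied bool     // set by the Clerk when every movie passes
	RetryCount  int      // incremented on every trip back to the Cinephile
	MaxRetries  int      // hard stop so the loop always terminates
}

// shouldCurate mirrors the branch condition in the pipeline above.
func (s *FindMoviesState) shouldCurate() bool {
	return s.IsSatisfied || s.RetryCount >= s.MaxRetries
}

func main() {
	s := &FindMoviesState{RetryCount: 3, MaxRetries: 3}
	fmt.Println(s.shouldCurate()) // true: retries exhausted
}
```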

&lt;h2&gt;
  
  
  The Architecture: Inside the FindMoviesUseCase
&lt;/h2&gt;

&lt;p&gt;In my FindMoviesUseCase, the data moves through four distinct stages.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Node Breakdown
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. The Refiner (The Translator)&lt;/strong&gt;&lt;br&gt;
Before any "thinking" happens, we need structure. The &lt;strong&gt;RefinerChain&lt;/strong&gt; takes the user's messy, natural language input—&lt;em&gt;"I want some weird 70s space movies that feel like David Bowie's music"&lt;/em&gt;—and converts it into a clean Go struct.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; Structured parameters like &lt;em&gt;primary_genre&lt;/em&gt;, &lt;em&gt;secondary_genres&lt;/em&gt;, &lt;em&gt;end_year&lt;/em&gt;, &lt;em&gt;start_year&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Input: &amp;gt; "I want some weird 70s space movies that feel like David Bowie's music"

RefinerChain Output:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"primary_genre"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"science fiction"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"secondary_genres"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"weird"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"space"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"musical vibe"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"start_year"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1970&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"end_year"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1979&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"is_classic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"original_text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"I want some weird 70s space movies that feel like David Bowie's music"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"query_info"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"weird space science fiction 1970s Bowie vibe"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;2. The Cinephile (The Creative Brain)&lt;/strong&gt;&lt;br&gt;
This is our primary Generator. It uses the refined parameters to search its internal knowledge and propose a curated list of films.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Risk:&lt;/strong&gt; This is where hallucinations live. If the LLM "remembers" a movie that doesn't exist, it will confidently include it here.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. The Clerk (The Auditor &amp;amp; Tool User)&lt;/strong&gt;&lt;br&gt;
This is the heart of the &lt;strong&gt;Reflexion Loop&lt;/strong&gt;. The Clerk is a "Skeptic" node equipped with the &lt;strong&gt;Tavily Search tool&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Process:&lt;/strong&gt; It takes every movie from the Cinephile's list and verifies it against the real world.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Decision:&lt;/strong&gt; If the movies match the user's query, it sets isSatisfied to true; otherwise, it returns a critique for every flawed movie it found. If one or more movies pass the audit, the Cinephile keeps them and adds new candidates for the Clerk to verify.&lt;/li&gt;
&lt;/ul&gt;
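&lt;p&gt;The Clerk's decision can be sketched like this. The type and function names are illustrative, not the project's actual code; only the kept/critiqued/satisfied split comes from the behaviour described above:&lt;/p&gt;

```go
package main

import "fmt"

// MovieCheck is the result of verifying one candidate against web data.
type MovieCheck struct {
	Title    string
	Verified bool
	Critique string // empty when the movie passed
}

// summarize splits an audit into kept movies and a feedback signal:
// satisfied is true only when every single movie passed verification.
func summarize(checks []MovieCheck) (kept []string, critiques []string, satisfied bool) {
	for _, c := range checks {
		if c.Verified {
			kept = append(kept, c.Title)
		} else {
			critiques = append(critiques, c.Title+": "+c.Critique)
		}
	}
	return kept, critiques, len(critiques) == 0
}

func main() {
	checks := []MovieCheck{
		{Title: "Ecco fatto (1998)", Verified: true},
		{Title: "Tutto l'amore che c'è", Verified: false, Critique: "released in 2000, outside range"},
	}
	kept, critiques, ok := summarize(checks)
	fmt.Println(kept, len(critiques), ok)
}
```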

&lt;p&gt;&lt;strong&gt;4. The Curator (The Final Editor)&lt;/strong&gt;&lt;br&gt;
Once the loop is broken (either through success or reaching the MaxRetries limit), the data hits the &lt;strong&gt;CuratorChain&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It performs a final prune: it reads the Clerk's last response and the movie list, and finalizes the result for the end user.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  The Loop in Action: A Real-World "Argument"
&lt;/h2&gt;

&lt;p&gt;To see the value of the Reflexion pattern, we have to look at how the agents interact when things go wrong. In this example, I asked the system for:&lt;br&gt;
&lt;em&gt;"Italian movies from the late 90s about the new millennium."&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Round 1: The Cinephile's "Stochastic Dreaming"
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;CinephileAgent&lt;/strong&gt; generated three suggestions. They looked plausible, but there were hidden hallucinations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;Ecco fatto&lt;/em&gt; (1998)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Tutto l'amore che c'è&lt;/em&gt; (1999)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Luna e l'altra&lt;/em&gt; (1996)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Round 2: The Clerk's Skepticism (Tavily Search)
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;ClerkAgent&lt;/strong&gt; immediately triggered a series of parallel &lt;strong&gt;Tavily searches&lt;/strong&gt; to verify these titles.&lt;br&gt;
It didn't just check if the movies existed; it checked the metadata against the user's specific constraints (Year: 1997–1999). Here is the "Correction Note" it generated:&lt;br&gt;
&lt;strong&gt;Clerk:&lt;/strong&gt; &lt;em&gt;"isSatisfied: false. 'Tutto l'amore che c'è' is actually a 2000 film, not 1999. 'Luna e l'altra' is from 1996, which is outside the requested range. Replace these."&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Round 3: The Correction
&lt;/h3&gt;

&lt;p&gt;The graph routed the state back to the &lt;strong&gt;Cinephile&lt;/strong&gt;. Crucially, the Cinephile was "aware" of its previous mistakes because they were stored in the shared FindMoviesState.&lt;br&gt;
&lt;strong&gt;Cinephile:&lt;/strong&gt; &lt;em&gt;"Previous critiques to fix: Replace Tutto l'amore... and Luna e l'altra."&lt;/em&gt; It kept &lt;em&gt;Ecco fatto&lt;/em&gt; (which passed) and proposed new candidates: &lt;em&gt;Cose che non ti ho mai detto&lt;/em&gt; and &lt;em&gt;I piccoli maestri&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Round 4: Deep Verification
&lt;/h3&gt;

&lt;p&gt;The Clerk is a tough critic. It rejected the new suggestions too, but for deeper reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;National Identity:&lt;/strong&gt; It caught that &lt;em&gt;Cose che non ti ho mai detto&lt;/em&gt; is actually a Spanish-American film, not Italian.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thematic Alignment:&lt;/strong&gt; It caught that &lt;em&gt;I piccoli maestri&lt;/em&gt; is a WWII resistance film—technically Italian and from 1998, but it has &lt;strong&gt;nothing&lt;/strong&gt; to do with the "new millennium" theme.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  The Final Result: Deterministic Success
&lt;/h3&gt;

&lt;p&gt;Finally, the loop closed on a verified, accurate list:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ecco fatto&lt;/strong&gt; (1998)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tutti giù per terra&lt;/strong&gt; (1997)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without the Reflexion loop, the user would have received a list where 66% of the data was technically wrong.&lt;/p&gt;

&lt;p&gt;Below is an embedded slider where you can browse the full output.&lt;/p&gt;

&lt;p&gt;

&lt;iframe height="600" src="https://codepen.io/daniele-dangeli/embed/VYKmVRp?height=600&amp;amp;default-tab=result&amp;amp;embed-version=2"&gt;
&lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: From "Stochastic" to "Deterministic"
&lt;/h2&gt;

&lt;p&gt;LLMs are "stochastic" (probabilistic) by nature. They are built to predict the next word, not to tell the truth. By implementing the &lt;strong&gt;Reflexion Pattern&lt;/strong&gt;, we transform that probability into a more "deterministic" system. If the Clerk doesn't find a factual match on the web, the data simply does not pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verification is the New Optimization
&lt;/h2&gt;

&lt;p&gt;Instead of spending weeks fine-tuning a model or "begging" a prompt to be accurate, we can achieve better results by giving agents &lt;strong&gt;tools&lt;/strong&gt; (like Tavily) and &lt;strong&gt;feedback&lt;/strong&gt;. The "Cinephile vs. Clerk" interaction shows that two specialized agents working in a loop can outperform a single "Generalist" agent trying to do everything at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Building this wasn't just about finding niche Italian movies; it was about exploring how we can trust the software we build in the age of AI. If you are a Go developer, don't wait for a "Python-equivalent" to emerge. The tools are already here.&lt;br&gt;
The next time your LLM hallucinates, don't just change the prompt. &lt;strong&gt;Change the architecture.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>go</category>
      <category>llm</category>
    </item>
    <item>
      <title>The Hidden Cost of MCP Tools: a 2.5x Token Reduction to Save 50% in Costs</title>
      <dc:creator>erlangb</dc:creator>
      <pubDate>Wed, 11 Mar 2026 00:38:27 +0000</pubDate>
      <link>https://dev.to/erlangb/the-hidden-cost-of-mcp-tools-a-25x-token-reduction-to-save-50-in-costs-3d21</link>
      <guid>https://dev.to/erlangb/the-hidden-cost-of-mcp-tools-a-25x-token-reduction-to-save-50-in-costs-3d21</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I want to be clear: I'm not an AI guru. I'm just a developer running experiments with "agentic" programming in Go, trying to see what actually works once you move past the "Hello World" phase.&lt;/p&gt;

&lt;p&gt;After finishing this &lt;a href="https://www.coursera.org/learn/agentic-ai-with-langchain-and-langgraph" rel="noopener noreferrer"&gt;course&lt;/a&gt; about &lt;em&gt;Langchain&lt;/em&gt; and &lt;em&gt;LangGraph&lt;/em&gt;, I wanted to find a practical way to build an agent. Since most of my experience is in &lt;em&gt;Go&lt;/em&gt;, I started exploring the Go ecosystem and came across the &lt;a href="https://github.com/cloudwego/eino" rel="noopener noreferrer"&gt;EINO&lt;/a&gt; framework.&lt;/p&gt;

&lt;p&gt;Almost immediately, I hit a wall: how do you actually keep track of the steps, actions, and results in a system where more than one actor is involved? I started with the usual approach—debugging and following logs—but I quickly realized that logs weren't enough to see the full picture.&lt;/p&gt;

&lt;p&gt;To help me see what was happening under the hood, I built two small projects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/erlangb/agent_monitor" rel="noopener noreferrer"&gt;agent_monitor&lt;/a&gt;&lt;/strong&gt;: A Go observability playground for inspecting and running agentic pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/erlangb/agentmeter" rel="noopener noreferrer"&gt;agentmeter&lt;/a&gt;&lt;/strong&gt;: A library specifically designed to track tokens and reasoning traces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm not mentioning these to show off the code; they are simply the projects I used to run these experiments.&lt;/p&gt;

&lt;p&gt;This is the first in a series of articles where I'll share what I'm learning about building agents in Go (though the logic applies to any language). In this post, we're looking at &lt;strong&gt;MCP tool optimization&lt;/strong&gt;. In the next one, I'll dive into a &lt;strong&gt;movie reflection system&lt;/strong&gt; I built to help agents double-check their own decisions and limit hallucinations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Aha!" Moment
&lt;/h2&gt;

&lt;p&gt;While using these tools to inspect my own agents, I noticed something. Initially, I took the easy route: I injected a raw MCP client connection directly into the Agent. Since an MCP connection is designed to expose all available tools, I figured, "Let the agent have everything; it's smart enough to handle it."&lt;/p&gt;

&lt;p&gt;I was wrong. After running tests with Tavily Search and MapBox MCPs, I realised that giving an agent raw, unfiltered access to an MCP connection is usually a bad idea.&lt;/p&gt;

&lt;p&gt;If you're an expert in the field, this first post might seem trivial. But if you're just starting to approach MCP, I hope these findings save you some time. Even for a simple pipeline, you must consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A Tool Filter&lt;/strong&gt;: To control exactly which tools the agent can see for a specific task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Tool Overlay&lt;/strong&gt;: A custom layer that uses a "Tolerant Reader" approach to prune the tool's response before the LLM ever sees it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's dive into the Tool Overlay. Tool filtering matters too: it keeps you from overloading the agent's context with tools it doesn't need.&lt;/p&gt;
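&lt;p&gt;As a preview, here is a minimal "Tolerant Reader" overlay in Go: it decodes only the fields the agent actually needs from a raw tool response and drops everything else (raw_content, favicon, scores, timings). The types are my own sketch; only the JSON field names come from Tavily's response shape:&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LeanResult keeps only the fields the LLM can act on; unknown fields
// in the raw response are silently ignored by encoding/json.
type LeanResult struct {
	Title   string `json:"title"`
	URL     string `json:"url"`
	Content string `json:"content"`
}

type LeanResponse struct {
	Results []LeanResult `json:"results"`
}

// prune re-serializes the raw tool output with only the lean fields,
// so the LLM never pays tokens for metadata it cannot use.
func prune(raw []byte) (string, error) {
	var r LeanResponse
	if err := json.Unmarshal(raw, &r); err != nil {
		return "", err
	}
	out, err := json.Marshal(r)
	return string(out), err
}

func main() {
	raw := []byte(`{"results":[{"title":"Top 10","url":"https://example.com","content":"...","score":0.8,"favicon":"..."}],"response_time":"1.67"}`)
	lean, err := prune(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(lean)
}
```

&lt;p&gt;The "tolerant" part is doing the work here: the struct never fails on extra fields, so the overlay keeps working even if the MCP server adds new metadata later.&lt;/p&gt;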




&lt;h2&gt;
  
  
  The Setup: A Simple Test
&lt;/h2&gt;

&lt;p&gt;I set up an LLM travel agent with a single job: "Suggest 10 places to visit in Rome." I used the Tavily MCP search tool to get the data.&lt;/p&gt;

&lt;p&gt;I ran the experiment twice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Run 1 (The Lazy Way):&lt;/strong&gt; I let the agent use the MCP client to access the tool directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run 2 (The Structured Approach):&lt;/strong&gt; I added a layer between the tool and the LLM to parse the response and remove the fields I didn't need.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Run 1: The Raw MCP Response
&lt;/h2&gt;

&lt;p&gt;MCP tools return verbose responses by design. Here is a snippet of what Tavily actually sends back for a single search:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"top 10 places to visit in Florence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Top 10 Must See Places in Florence.."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.romecabs.com/blog/docs/top-10-must-see-places-rome/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"**Piazza del Duomo**, the **Gallery of the Academy**, **Uffizi** **Gallery**"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.80405265&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"raw_content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"favicon"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="err"&gt;....&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"response_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.67"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"auto_parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"topic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"travel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"search_depth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"basic"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"usage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"credits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"request_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"123e4567-e89b-12d3-a456-426614174111"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reasoning &amp;amp; Results (Run 1):&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvz9evucigy43yjc2bqzv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvz9evucigy43yjc2bqzv.png" alt="reasoning full tavily body " width="800" height="389"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnpwjs9fjupma5cnodyd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnpwjs9fjupma5cnodyd.png" alt="result full tavily body" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The "Tolerant Reader" Overlay
&lt;/h2&gt;

&lt;p&gt;In my second run, I defined a custom tool over the same Tavily MCP connection but overrode the search function, unmarshaling only the &lt;code&gt;content&lt;/code&gt; and &lt;code&gt;score&lt;/code&gt; fields.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// TavilyResult holds only the fields we care about&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;TavilyResult&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Content&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;  &lt;span class="s"&gt;`json:"content"`&lt;/span&gt;
    &lt;span class="n"&gt;Score&lt;/span&gt;   &lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="s"&gt;`json:"score"`&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// ... inside the tool call ...&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TavilySearchResponse&lt;/span&gt;
&lt;span class="c"&gt;// We Unmarshal only the essential fields&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sonic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnmarshalString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mcpRawResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sonic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Marshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
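&lt;p&gt;Here is a self-contained version of the same "Tolerant Reader" idea, using the standard &lt;code&gt;encoding/json&lt;/code&gt; instead of sonic (the pruning behavior is identical; the raw payload below is a shortened stand-in for the real Tavily body):&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Result mirrors only the fields the agent needs; every other
// field in the raw payload is silently dropped on Unmarshal.
type Result struct {
	Content string  `json:"content"`
	Score   float64 `json:"score"`
}

type SearchResponse struct {
	Results []Result `json:"results"`
}

// prune re-serializes only the whitelisted fields.
func prune(raw string) (string, error) {
	resp := new(SearchResponse)
	if err := json.Unmarshal([]byte(raw), resp); err != nil {
		return "", err
	}
	out, err := json.Marshal(resp.Results)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	raw := `{"query":"q","results":[{"title":"t","url":"u","content":"Piazza del Duomo","score":0.8,"favicon":"f"}],"request_id":"id"}`
	pruned, _ := prune(raw)
	fmt.Println(pruned) // [{"content":"Piazza del Duomo","score":0.8}]
}
```

&lt;p&gt;The "tolerance" is the point: unknown fields are ignored rather than rejected, so the overlay keeps working even when the upstream API adds new fields.&lt;/p&gt;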



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reasoning &amp;amp; Results (Run 2):&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsofzkn53lkfoa242u6vo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsofzkn53lkfoa242u6vo.png" alt="reasoning running result structured tavily" width="800" height="356"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F55ig92d81iwtmlw0rwmx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F55ig92d81iwtmlw0rwmx.png" alt="running result structured tavily" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Analyzing the Results
&lt;/h2&gt;

&lt;p&gt;When I looked at the output, the difference was significant—especially considering this is just a single, simple interaction.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Raw MCP (&lt;code&gt;tavily_raw&lt;/code&gt;)&lt;/th&gt;
&lt;th&gt;Parsed (&lt;code&gt;tavily_parsed&lt;/code&gt;)&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tool payload&lt;/td&gt;
&lt;td&gt;12,184 bytes&lt;/td&gt;
&lt;td&gt;5,115 bytes&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.4x smaller (−58%)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokens in&lt;/td&gt;
&lt;td&gt;4,157&lt;/td&gt;
&lt;td&gt;1,446&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.9x fewer&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$0.0102&lt;/td&gt;
&lt;td&gt;$0.0051&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50% cheaper&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Test: &lt;em&gt;"Suggest 10 places to visit in Florence"&lt;/em&gt; — model: gpt-4.1&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The MCP Overhead in Numbers
&lt;/h3&gt;

&lt;p&gt;I repeated the experiment several times, and the results were consistent. The &lt;strong&gt;7,069 bytes stripped&lt;/strong&gt; per tool call are not just wasted bandwidth—they are converted directly into input tokens that the LLM must read, that you must pay for, and that must fit into its context window.&lt;/p&gt;

&lt;p&gt;The raw Tavily response carries fields the agent simply doesn't need for this task: &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;url&lt;/code&gt;, &lt;code&gt;favicon&lt;/code&gt;, &lt;code&gt;raw_content&lt;/code&gt;, &lt;code&gt;request_id&lt;/code&gt;, &lt;code&gt;response_time&lt;/code&gt;, &lt;code&gt;auto_parameters&lt;/code&gt;, and &lt;code&gt;usage&lt;/code&gt;. Once you remove them, the payload drops by &lt;strong&gt;58%&lt;/strong&gt;, mapping almost perfectly to the 2.9x token reduction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why a "Small" Saving Matters
&lt;/h3&gt;

&lt;p&gt;In a complex system, a 2.9x token reduction compounds quickly. If your agent makes 10 tool calls in a single session, you aren't just saving a few cents—you are effectively preventing your context window from exploding. By keeping the input lean, you leave more room for the actual reasoning and long-term memory the agent needs to finish the job.&lt;/p&gt;
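&lt;p&gt;A quick back-of-the-envelope check, using the measured per-call numbers from the table above, shows what ten tool calls cost you in context:&lt;/p&gt;

```go
package main

import "fmt"

func main() {
	const (
		rawTokens    = 4157 // input tokens per raw tool call (measured above)
		parsedTokens = 1446 // input tokens per parsed tool call
		calls        = 10   // tool calls in one agent session
	)
	saved := (rawTokens - parsedTokens) * calls
	fmt.Printf("tokens saved over %d calls: %d\n", calls, saved)
}
```

&lt;p&gt;That's roughly 27k tokens of context reclaimed in a single session—room that goes back to reasoning and memory instead of favicons.&lt;/p&gt;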

&lt;p&gt;Real-world systems work in loops: the agent searches, reasons, calls another tool, summarizes, and then responds. A 50% cost reduction on one call might look like pocket change, but we rarely stop at one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;MCP servers return everything because they are built for interoperability. They don’t know if your agent is a travel bot or a data scientist, so they send the "kitchen sink" to be safe.&lt;/p&gt;

&lt;p&gt;However, as a developer, the space between that server and your Agent is your responsibility. There is a clear tradeoff here: if you prune too much data, you might limit the agent's capacity to find unexpected connections. But for most specific tasks, forcing an LLM to read favicon URLs and request_ids is just paying a "tax for noise."&lt;/p&gt;

&lt;p&gt;This becomes even more critical in an enterprise environment where you might be wrapping your own internal APIs with an MCP server. It is increasingly evident that MCP tools should be wrapped and executed in a layer outside the agent's direct context.&lt;/p&gt;
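&lt;p&gt;One way to keep that layer reusable is a generic decorator that wraps any tool call with a pruning step (the &lt;code&gt;ToolFunc&lt;/code&gt; and &lt;code&gt;Pruner&lt;/code&gt; signatures here are hypothetical, not taken from EINO or any MCP SDK):&lt;/p&gt;

```go
package main

import "fmt"

// Pruner transforms a raw tool response before the LLM sees it.
type Pruner func(raw string) (string, error)

// ToolFunc is a hypothetical signature for an agent tool call.
type ToolFunc func(input string) (string, error)

// WithOverlay wraps any tool so its output passes through a
// pruning layer outside the agent's direct context.
func WithOverlay(tool ToolFunc, prune Pruner) ToolFunc {
	return func(input string) (string, error) {
		raw, err := tool(input)
		if err != nil {
			return "", err
		}
		return prune(raw)
	}
}

func main() {
	// A toy tool and a toy pruner that strips a "raw:" prefix.
	echo := func(in string) (string, error) { return "raw:" + in, nil }
	strip := func(raw string) (string, error) { return raw[4:], nil }

	wrapped := WithOverlay(echo, strip)
	out, _ := wrapped("hello")
	fmt.Println(out) // hello
}
```

&lt;p&gt;Registering &lt;code&gt;wrapped&lt;/code&gt; instead of the bare tool means every internal API you expose through MCP gets the same treatment for free.&lt;/p&gt;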

&lt;p&gt;If you want to dive deeper into the "Tool Overload" problem, these articles were instrumental in my research:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/engineering/code-execution-with-mcp" rel="noopener noreferrer"&gt;https://www.anthropic.com/engineering/code-execution-with-mcp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.lunar.dev/post/why-is-there-mcp-tool-overload-and-how-to-solve-it-for-your-ai-agents" rel="noopener noreferrer"&gt;https://www.lunar.dev/post/why-is-there-mcp-tool-overload-and-how-to-solve-it-for-your-ai-agents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm still just scratching the surface of agentic programming in Go, but this was an important lesson: don't just "plug and play" your MCP tools. Apply a layer between MCP and agents and save the tokens for the reasoning that actually matters.&lt;/p&gt;

&lt;p&gt;In the next post, I'll dive into how I built a "movie reflection system" to improve agent accuracy and reduce hallucinations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Check my work
&lt;/h2&gt;

&lt;p&gt;The code for both variants is available in &lt;a href="https://github.com/erlangb/agent_monitor" rel="noopener noreferrer"&gt;agent_monitor&lt;/a&gt;.&lt;br&gt;
You can use it to run the pre-filled simple use cases, or write your own using EINO.&lt;/p&gt;

</description>
      <category>agentic</category>
      <category>ai</category>
      <category>go</category>
      <category>mcp</category>
    </item>
  </channel>
</rss>
