DEV Community: tony tong

From a Git clone to a working MCP server: a 30-minute Codex walkthrough

tony tong — Thu, 11 Jun 2026 16:09:22 +0000

Most MCP tutorials assume you're starting from scratch. In practice, you usually have a working tool or library already and just want to expose it as a callable tool to an LLM agent. Here's the path I take; it gets the job done in about 30 minutes of real work.

Step 1: Pick one tool, not five

Don't expose your whole API surface. Pick the one function a coding agent would actually call. For a docs site, that's search_docs. For a database, that's run_query. For an internal service, that's lookup_user. One tool, clear input schema, real value.

Step 2: Write the tool description like a docstring

The model will only call the tool well if the description is sharp. I write three sentences:

What it does (verb + noun)
When to use it (the user signal that should trigger it)
What it returns (shape, not values)

Example:

search_docs(query, top_k=5): Search the company docs index. Use when the user asks a factual question about internal systems, processes, or past decisions. Returns a list of {title, url, snippet} sorted by relevance.

That's the whole tool spec. No ambiguity, no prose.
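To keep the three sentences honest, it can help to assemble the description from named parts. The helper below is a minimal sketch; `build_description` and its parameter names are made up for illustration and are not part of any MCP SDK.

```python
def build_description(what: str, when: str, returns: str) -> str:
    """Join the three sentences (what, when, returns) into one description."""
    cleaned = []
    for part in (what, when, returns):
        # Collapse stray whitespace so multi-line arguments still read cleanly.
        text = " ".join(part.split())
        if not text:
            raise ValueError("each of the three sentences must be non-empty")
        # Make sure every sentence ends with a period.
        if not text.endswith("."):
            text += "."
        cleaned.append(text)
    return " ".join(cleaned)


SEARCH_DOCS_DESCRIPTION = build_description(
    what="Search the company docs index",
    when="Use when the user asks a factual question about internal systems, "
         "processes, or past decisions",
    returns="Returns a list of {title, url, snippet} sorted by relevance",
)

if __name__ == "__main__":
    print(SEARCH_DOCS_DESCRIPTION)
```

If one of the three arguments is hard to fill in, that is usually a sign the tool's purpose is still fuzzy.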

Step 3: Run Codex locally first

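The OpenAI Codex CLI is the fastest way I've found to validate that the tool works end-to-end:

codex --approval-mode suggest

The approval flag has been renamed between Codex CLI releases, so check codex --help if your version rejects it.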

Drop into a sandboxed directory and ask Codex to use the tool. If it picks the tool, calls it correctly, and uses the result in its answer, you're done. If it doesn't, the description is the most likely culprit, so go back to step 2.

Step 4: Wire it as an MCP server

Now wrap the tool in a real MCP server. The minimum is a FastMCP instance with one @mcp.tool():

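from mcp.server.fastmcp import FastMCP

# docs_index is your existing search backend; it is not defined in this snippet.
mcp = FastMCP("internal-docs")


@mcp.tool(description="Search the company docs index...")
def search_docs(query: str, top_k: int = 5) -> list[dict]:
    # Delegate to the existing index and return its results unchanged.
    return docs_index.search(query, top_k=top_k)


if __name__ == "__main__":
    # stdio lets a local host launch the server as a subprocess.
    mcp.run(transport="stdio")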
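The server snippet calls docs_index.search without defining docs_index. To try the wiring before connecting a real backend, a toy stand-in is enough. The sketch below is an assumption, not the author's index: `Doc`, `InMemoryDocsIndex`, the keyword scoring, and the example.com URLs are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    title: str
    url: str
    text: str


class InMemoryDocsIndex:
    """Toy keyword index that stands in for a real search backend."""

    def __init__(self, docs: list[Doc]) -> None:
        self._docs = docs

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        terms = set(query.lower().split())
        scored = []
        for doc in self._docs:
            # Score = number of words in the doc that match a query term.
            score = sum(1 for word in doc.text.lower().split() if word in terms)
            if score > 0:
                scored.append((score, doc))
        # Highest score first; the sort is stable, so ties keep insertion order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [
            {"title": d.title, "url": d.url, "snippet": d.text[:80]}
            for _, d in scored[:top_k]
        ]


# Hypothetical sample data; swap in your own documents.
docs_index = InMemoryDocsIndex([
    Doc("Deploy guide", "https://docs.example.com/deploy",
        "how to deploy the billing service"),
    Doc("On-call rota", "https://docs.example.com/oncall",
        "who is on call for the billing service this week"),
    Doc("Holiday policy", "https://docs.example.com/holiday",
        "annual leave and holiday policy"),
])

if __name__ == "__main__":
    for hit in docs_index.search("deploy billing"):
        print(hit["title"], hit["url"])
```

The return shape matches the {title, url, snippet} contract from the tool description, so the stub can be replaced later without touching the tool signature.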

Add it to your client's MCP config (Claude Desktop, Cursor, etc.). Restart, ask the same question. If it works in both Codex CLI and the host client, you have a real MCP server.
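For hosts that read a JSON config with an mcpServers key (Claude Desktop and Cursor both do at the time of writing), the entry can be generated rather than typed by hand. This is a sketch under assumptions: `mcp_config_entry` is a made-up helper, "/path/to/server.py" is a placeholder, and other clients may expect a different file or format.

```python
import json
import sys


def mcp_config_entry(name: str, script_path: str) -> dict:
    """Build an mcpServers block that launches the server over stdio."""
    return {
        "mcpServers": {
            name: {
                # Reuse the current interpreter so the host picks up the
                # same virtualenv the server was developed in.
                "command": sys.executable,
                "args": [script_path],
            }
        }
    }


if __name__ == "__main__":
    # "/path/to/server.py" is a placeholder for wherever the script lives.
    entry = mcp_config_entry("internal-docs", "/path/to/server.py")
    print(json.dumps(entry, indent=2))
```

Merge the printed block into the host's existing config file rather than overwriting it, since other servers may already be registered there.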

Step 5: Evals, not vibes

The last 10 minutes is an eval. Three questions your tool should answer correctly, three it should refuse to answer. If you can't list them, your tool isn't done — it's just running.
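A six-case eval like this fits in a few lines. The sketch below assumes the agent can be wrapped in a function that returns an answer string, or None for a refusal; `EvalCase`, `run_evals`, `fake_agent`, and the sample questions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalCase:
    question: str
    # Substring the answer must contain; None means the agent should refuse.
    expect: Optional[str]


# Made-up cases for a docs-search tool: three to answer, three to refuse.
CASES = [
    EvalCase("How do we deploy the billing service?", "deploy guide"),
    EvalCase("Who is on call for billing this week?", "on-call rota"),
    EvalCase("What is the holiday policy?", "holiday policy"),
    EvalCase("What's the weather in Paris today?", None),
    EvalCase("Write a haiku about autumn.", None),
    EvalCase("What will our revenue be next year?", None),
]


def run_evals(
    answer_fn: Callable[[str], Optional[str]], cases: list[EvalCase]
) -> list[tuple[str, bool]]:
    """Return (question, passed) for every case."""
    results = []
    for case in cases:
        answer = answer_fn(case.question)
        if case.expect is None:
            passed = answer is None  # a refusal is modelled as None
        else:
            passed = answer is not None and case.expect.lower() in answer.lower()
        results.append((case.question, passed))
    return results


def fake_agent(question: str) -> Optional[str]:
    """Stand-in for the real model + tool loop; replace with your own call."""
    canned = {
        "How do we deploy the billing service?": "See the Deploy guide.",
        "Who is on call for billing this week?": "Check the On-call rota.",
        "What is the holiday policy?": "It is in the Holiday policy doc.",
    }
    return canned.get(question)  # None models a refusal


if __name__ == "__main__":
    for question, passed in run_evals(fake_agent, CASES):
        print("PASS" if passed else "FAIL", question)
```

Substring matching is crude, but it is enough to catch the two failures that matter here: a wrong answer and an answer where there should have been a refusal.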

A small diagram

Here's the loop I end up with. Tools, model, results, repeat.

The boring truth is that the description is 80% of the work. Once that's right, the wiring is half an hour.

Why RAG without context judgment is just a fancier grep

tony tong — Thu, 11 Jun 2026 16:07:58 +0000

I keep seeing teams ship a "RAG system" that's really a vector database with a thin wrapper. They measure recall@10, ship to production, and then wonder why the model hallucinates on documents the retriever clearly found.

The retriever is doing its job. The model is doing its job. What's missing is the context judgment layer in between.

Retrieval ≠ selection

Most RAG tutorials stop at "embed the docs, embed the query, take cosine similarity, return the top-k". But cosine similarity is a proxy for relevance, not for usefulness. A chunk can be semantically similar to the query and still actively mislead the model:

A pricing table from 2019 that contradicts the 2024 version
A code snippet that solves a similar problem in a different language
A legal disclaimer that looks like a substantive answer

A naive top-k retrieval will mix these in. The model, trained to be helpful, will dutifully stitch them together.
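The failure mode is easy to reproduce with toy data. In the sketch below the two-dimensional vectors are hand-picked for illustration, not produced by a real embedding model, and the chunk titles are made up. The point is only that cosine ranking has no notion of which pricing table is current.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunks: list[dict], k: int) -> list[dict]:
    """Naive retrieval: rank by cosine similarity and keep the first k."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]


# Hand-picked toy vectors; a real system would get these from an embedder.
CHUNKS = [
    {"title": "Pricing table (2024)", "vec": [0.9, 0.1]},
    {"title": "Pricing table (2019)", "vec": [1.0, 0.05]},
    {"title": "Holiday policy", "vec": [0.1, 1.0]},
]

if __name__ == "__main__":
    query_vec = [1.0, 0.0]  # stands in for "what does the pro plan cost?"
    for chunk in top_k(query_vec, CHUNKS, k=2):
        print(chunk["title"], round(cosine(query_vec, chunk["vec"]), 4))
```

Both pricing tables land in the top 2, and with these vectors the outdated one ranks first. Nothing in the similarity score says which one the model should trust.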

What "context judgment" looks like

The simplest version is a re-ranker: take top-50 from the retriever, score each chunk for answerability (does this chunk actually contain the answer, or just the topic?), keep top-5. A cross-encoder does this well, costs ~50ms per chunk on a small model, and usually lifts answer quality 10-20% on my evals.
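The two-stage shape is small enough to sketch. The code below assumes the scorer is passed in as a function; in a real pipeline it would typically wrap a cross-encoder that scores (query, chunk) pairs. `toy_answerability` is a crude stand-in invented for this example, not a real answerability model.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Score every retrieved candidate against the query and keep the best."""
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    # Highest score first; stable sort keeps retriever order on ties.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:keep]]


def toy_answerability(query: str, chunk: str) -> float:
    """Stand-in scorer: term overlap, plus a bonus if the chunk states a figure."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    overlap = len(q_terms & c_terms)
    has_figure = any(ch.isdigit() for ch in chunk)
    return overlap + (2.0 if has_figure else 0.0)


if __name__ == "__main__":
    retrieved = [  # pretend this is the retriever's top-50, shortened
        "Our pricing page explains the plan tiers.",
        "The pro plan costs 49 dollars per month.",
        "Contact sales for pricing questions.",
    ]
    for chunk in rerank("how much does the pro plan cost", retrieved,
                        toy_answerability, keep=2):
        print(chunk)
```

Keeping the scorer as a parameter means the toy function can be swapped for a real model without changing the pipeline around it.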

The harder version is a judge that filters out chunks the model is likely to misuse. Things like:

Recency checks: drop chunks that pre-date the time frame the user is asking about
Source authority: prefer internal docs over scraped blog posts
Conflict detection: if two chunks disagree, surface the conflict instead of averaging it
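The three checks above can be sketched as plain functions over chunk metadata. This leans on a big assumption: that each chunk already carries a source label, a publication date, and an extracted claim (`claim_key`, `claim_value`). Producing those fields is the hard part and is not shown. All names and sample data here are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    source: str       # e.g. "internal" or "web"
    published: date
    claim_key: str    # what the chunk makes a claim about (assumed extracted upstream)
    claim_value: str  # the value it claims


# Higher number = more trusted; unknown sources default to 0.
AUTHORITY = {"internal": 2, "web": 1}


def drop_stale(chunks: list[Chunk], cutoff: date) -> list[Chunk]:
    """Recency check: keep only chunks published on or after the cutoff."""
    return [c for c in chunks if c.published >= cutoff]


def sort_by_authority(chunks: list[Chunk]) -> list[Chunk]:
    """Source authority: most trusted sources first."""
    return sorted(chunks, key=lambda c: AUTHORITY.get(c.source, 0), reverse=True)


def find_conflicts(chunks: list[Chunk]) -> dict[str, set[str]]:
    """Conflict detection: claim keys for which chunks give different values."""
    values: dict[str, set[str]] = {}
    for c in chunks:
        values.setdefault(c.claim_key, set()).add(c.claim_value)
    return {key: vals for key, vals in values.items() if len(vals) > 1}


if __name__ == "__main__":
    chunks = [
        Chunk("Pro plan is $29/month", "web", date(2019, 3, 1), "pro_price", "29"),
        Chunk("Pro plan is $45/month", "web", date(2024, 1, 15), "pro_price", "45"),
        Chunk("Pro plan is $49/month", "internal", date(2024, 2, 1), "pro_price", "49"),
    ]
    fresh = sort_by_authority(drop_stale(chunks, cutoff=date(2023, 1, 1)))
    print([c.text for c in fresh])
    print(find_conflicts(fresh))
```

Note that find_conflicts reports the disagreement instead of resolving it. Deciding what to do with a conflict (show both, prefer the internal doc, ask the user) is a product decision, not a retrieval one.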

Where it falls apart

The trap is doing this in isolation. If the judge is just another LLM, you've moved the problem one step back. The judge also hallucinates, also misreads, also has its own blind spots. The honest framing is: the judgment layer is where the product lives. A vector DB is a 5-line integration. A trustworthy RAG system is months of work on the judgment layer.

That's the part most RAG marketing glosses over.

A small checklist

When I'm reviewing a team's RAG pipeline, these are the questions that catch the most issues:

What does your retriever's full top-k look like, not just its top-1 score? Manually skim the results for 20 queries.
Is the model told which chunks it should prefer to use, and which it should ignore?
Do you have an eval set that includes conflicting sources?
When two chunks disagree, does the system surface the conflict or pick one silently?

If you can't answer these, your RAG system is a demo, not a product.
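For the second question, one low-tech option is to label each chunk in the prompt itself. The sketch below is one possible prompt layout, not a recommended standard; the `preferred` flag is assumed to come from whatever judgment layer sits upstream, and the wording of the instructions is made up.

```python
def build_context_prompt(question: str, chunks: list[dict]) -> str:
    """Render chunks with explicit preference labels ahead of the question."""
    lines = [
        "Answer using only the sources below.",
        "Prefer sources marked PREFERRED. Do not rely on sources marked BACKGROUND ONLY.",
        "If sources disagree, say so explicitly instead of picking one.",
        "",
    ]
    for i, chunk in enumerate(chunks, start=1):
        label = "PREFERRED" if chunk["preferred"] else "BACKGROUND ONLY"
        lines.append(
            f"[{i}] ({label}, {chunk['source']}, {chunk['date']}) {chunk['text']}"
        )
    lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical chunks; the "preferred" flag comes from an upstream judge.
    chunks = [
        {"text": "Pro plan is $49/month", "source": "internal",
         "date": "2024-02-01", "preferred": True},
        {"text": "Pro plan is $29/month", "source": "web",
         "date": "2019-03-01", "preferred": False},
    ]
    print(build_context_prompt("What does the pro plan cost?", chunks))
```

Whether the model actually honors the labels is an empirical question, which is what the eval set with conflicting sources is for.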