How I Built an MCP Server That Lets Claude Code Talk to Every LLM I Pay For

I have subscriptions to ChatGPT Plus, Claude MAX, and Gemini. I also run local models through Ollama. That's four different ecosystems, four browser tabs, and a lot of copy-pasting whenever I want to compare how different models handle the same question.

It was getting ridiculous. I'd ask Claude something, then open ChatGPT to see if GPT-5 agreed, then check Gemini for a third opinion. Every time, I'd lose context, reformat the prompt, and waste five minutes on what should be a ten-second comparison.

So I built HydraMCP. It's an MCP server that routes queries from Claude Code to any model I have access to, cloud or local, through a single interface. One prompt, multiple models, parallel execution.

What It Actually Does

HydraMCP exposes five tools to Claude Code:

list_models shows everything available across all your providers. One command, full inventory.

ask_model queries any single model. Want GPT-5's take on something without leaving your terminal? Just ask.

compare_models is the one I use the most. Same prompt to 2-5 models in parallel, results side by side. Here's what that looks like in practice:

```
> compare gpt-5-codex, gemini-3, claude-sonnet, and local qwen on this function review

## Model Comparison (4 models, 11637ms total)

| Model                      | Latency         | Tokens |
|----------------------------|-----------------|--------|
| gpt-5-codex                | 1630ms fastest  | 194    |
| gemini-3-pro-preview       | 11636ms         | 1235   |
| claude-sonnet-4-5-20250929 | 3010ms          | 202    |
| ollama/qwen2.5-coder:14b   | 8407ms          | 187    |
```

All four independently found the same async bug. Then each caught something different the others missed. GPT-5 was fastest, Gemini was most thorough, Claude gave the clearest fix, Qwen explained the root cause. Different training data, different strengths.

consensus polls 3-7 models on a question and has a separate judge model evaluate whether they actually agree. It returns a confidence score and groups responses by agreement.

synthesize fans out to multiple models, collects their responses, and then a synthesizer model combines the best insights into one answer. The result is usually better than any individual response.

The Architecture

The design is pretty straightforward:

```
Claude Code
    |
    HydraMCP (MCP Server)
    |
    Provider Interface
    |-- CLIProxyAPI  -> cloud models (GPT, Gemini, Claude, etc.)
    |-- Ollama       -> local models (your hardware)
```

HydraMCP sits between Claude Code and your model providers. It communicates over stdio using JSON-RPC (the MCP protocol), routes requests to the right backend, and formats everything to keep your context window manageable.

The provider interface is the core abstraction. Every backend implements three methods: healthCheck(), listModels(), and query(). That's it. Adding a new provider means implementing those three functions and registering it.

```typescript
interface Provider {
  name: string;
  healthCheck(): Promise<boolean>;
  listModels(): Promise<ModelInfo[]>;
  query(model: string, prompt: string, options?: QueryOptions): Promise<QueryResponse>;
}
```

For cloud models, CLIProxyAPI turns your existing subscriptions into a local OpenAI-compatible API. You authenticate once per provider through a browser login, and it handles the rest. No per-token billing - you're using the subscriptions you already pay for.

For local models, Ollama runs on localhost and provides models like Qwen, Llama, and Mistral. Zero API keys, zero cost beyond your electricity bill.
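To make the abstraction concrete, here's a rough sketch of what an Ollama-backed provider could look like. It isn't HydraMCP's actual source - the ModelInfo and QueryResponse shapes are my assumptions - but the /api/tags and /api/generate endpoints are Ollama's standard HTTP API.

```typescript
// Sketch only: the ModelInfo/QueryResponse field names are assumptions, not HydraMCP's real types.
class OllamaProvider implements Provider {
  name = "ollama";

  constructor(private baseUrl: string = "http://localhost:11434") {}

  // Ollama is "healthy" if its tag-listing endpoint answers at all.
  async healthCheck(): Promise<boolean> {
    try {
      return (await fetch(`${this.baseUrl}/api/tags`)).ok;
    } catch {
      return false;
    }
  }

  // GET /api/tags returns { models: [{ name, ... }] } for everything pulled locally.
  async listModels(): Promise<ModelInfo[]> {
    const res = await fetch(`${this.baseUrl}/api/tags`);
    const data = (await res.json()) as { models: { name: string }[] };
    return data.models.map((m) => ({ id: m.name, provider: this.name }));
  }

  // POST /api/generate with stream: false returns one JSON body with the full response.
  async query(model: string, prompt: string, _options?: QueryOptions): Promise<QueryResponse> {
    const started = Date.now();
    const res = await fetch(`${this.baseUrl}/api/generate`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, prompt, stream: false }),
    });
    const data = (await res.json()) as { response: string };
    return { model, text: data.response, latencyMs: Date.now() - started };
  }
}
```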

The Interesting Parts

Parallel Execution

When you compare four models, all four queries fire simultaneously using Promise.allSettled(). Total time equals the slowest model, not the sum of all of them. That four-model comparison above? 11.6 seconds total instead of the roughly 25 seconds the queries would take sequentially.

And if one model fails, you still get results from the others. Graceful degradation instead of all-or-nothing.
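The fan-out itself is only a few lines. This is an illustrative sketch rather than the project's code - resolveProvider() is a hypothetical lookup - but it shows the Promise.allSettled pattern:

```typescript
// Illustrative sketch; resolveProvider() is a hypothetical helper mapping a model id to its provider.
declare function resolveProvider(model: string): Provider;

async function compareModels(models: string[], prompt: string) {
  const settled = await Promise.allSettled(
    models.map((m) => resolveProvider(m).query(m, prompt))
  );
  // A rejected query becomes an error row instead of taking down the whole comparison.
  return settled.map((result, i) =>
    result.status === "fulfilled"
      ? { model: models[i], response: result.value }
      : { model: models[i], error: String(result.reason) }
  );
}
```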

Consensus With an LLM Judge

This is the part I'm most interested in. Naive keyword matching fails at determining if models agree. If one says "start with a monolith" and another says "monolith because it's simpler," they agree - but keyword overlap is low.

So the consensus tool picks a model that's not in the poll and asks it to evaluate agreement. The judge reads all responses and groups them semantically:

```
Three cloud models polled, local Qwen judging.
Strategy: majority (needed 2/3)
Agreement: 3/3 models (100%)
Judge latency: 686ms
```

Using a local model as judge means zero cloud quota used for the evaluation step.
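The judge step itself is just another query. Here's the rough shape - the prompt wording and JSON contract below are illustrative, not what HydraMCP actually sends:

```typescript
// Illustrative only: the real judge prompt and response format may differ.
async function judgeConsensus(
  judge: Provider,
  judgeModel: string,
  question: string,
  answers: { model: string; text: string }[]
) {
  const prompt = [
    `Question: ${question}`,
    ...answers.map((a, i) => `Response ${i + 1} (${a.model}):\n${a.text}`),
    `Group these responses by semantic agreement (same conclusion in different wording = same group).`,
    `Reply with JSON only: {"groups": [[response numbers]], "agreementRatio": <0 to 1>}`,
  ].join("\n\n");

  const verdict = await judge.query(judgeModel, prompt);
  // Parsing can still fail if the judge doesn't return clean JSON, so real code needs a guard here.
  return JSON.parse(verdict.text) as { groups: number[][]; agreementRatio: number };
}
```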

Honestly, the keyword-based fallback (for when no judge is available) is pretty broken. It works for factual questions sometimes but falls apart on anything subjective. The LLM judge approach is significantly better, but it's still an area I want to improve.

Ollama Warmup

One thing I noticed during testing: local models through Ollama have a significant cold-start penalty.

First request to Qwen 32B: 24 seconds (loading the model into memory). By the fourth request: 3 seconds. That's an 8x improvement just from the model being warm. After that warmup period, local models genuinely compete with cloud on latency.

If you're using HydraMCP regularly, your local models stay warm and the experience is seamless. The first query of the day might be slow, but everything after that is fast.
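You can also hide the cold start yourself with a warmup ping before the first comparison. Ollama loads a model when it receives a generate request with an empty prompt, and keep_alive controls how long it stays resident - whether HydraMCP does this for you or not, something like this works:

```typescript
// Warmup ping: an empty prompt tells Ollama to load the model without generating anything,
// and keep_alive keeps it resident (here for 30 minutes).
async function warmUp(baseUrl: string, model: string): Promise<void> {
  await fetch(`${baseUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt: "", keep_alive: "30m" }),
  });
}

// warmUp("http://localhost:11434", "qwen2.5-coder:14b");
```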

Synthesis

The synthesize tool is probably the most ambitious feature. It collects responses from multiple models, then feeds them all to a synthesizer model with instructions to combine the best insights and drop the filler.

When possible, the synthesizer is deliberately a model that isn't in the source list. The prompt is straightforward: "Here are responses from four models. Write one definitive answer. Take the best from each."

In practice, the synthesized result usually has better structure than any individual response and catches details that at least one model missed.
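Mechanically it's close to consensus, just with a different instruction for the final model. A rough sketch - the prompt text paraphrases the idea above rather than quoting the actual code:

```typescript
// Rough sketch of the synthesis step; the prompt paraphrases the idea, not HydraMCP's exact wording.
async function synthesize(
  synthesizer: Provider,
  synthesizerModel: string,
  question: string,
  answers: { model: string; text: string }[]
) {
  const prompt = [
    `Question: ${question}`,
    `Here are ${answers.length} responses from different models:`,
    ...answers.map((a) => `--- ${a.model} ---\n${a.text}`),
    `Write one definitive answer. Take the best from each response and drop the filler.`,
  ].join("\n\n");

  return synthesizer.query(synthesizerModel, prompt);
}
```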

The Stack

It's about 1,500 lines of TypeScript. Dependencies are minimal:

  • @modelcontextprotocol/sdk for the MCP protocol
  • zod for input validation
  • Node 18+

That's it. No Express, no database, no build framework beyond TypeScript's compiler. Every tool input is validated with Zod schemas, and all logging goes to stderr (stdout is reserved for the JSON-RPC protocol - send anything else there and you break MCP).
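For a sense of how those pieces fit together, registering a tool with the SDK looks roughly like this - the handler body is stubbed out with a hypothetical queryBackend() helper, so treat it as a shape, not the project's source:

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// Hypothetical helper standing in for the provider routing described above.
declare function queryBackend(model: string, prompt: string): Promise<string>;

const server = new McpServer({ name: "hydramcp", version: "1.0.0" });

// Tool inputs are validated by Zod before the handler ever runs.
server.tool(
  "ask_model",
  { model: z.string(), prompt: z.string() },
  async ({ model, prompt }) => {
    console.error(`ask_model -> ${model}`); // stderr only; stdout belongs to JSON-RPC
    const answer = await queryBackend(model, prompt);
    return { content: [{ type: "text" as const, text: answer }] };
  }
);

// stdio transport: Claude Code spawns the process and speaks MCP over stdin/stdout.
await server.connect(new StdioServerTransport());
```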

Setup

The whole thing takes about five minutes:

  1. Set up CLIProxyAPI and/or Ollama as backends
  2. Clone, install, build HydraMCP
  3. Add your backend URLs to .env (example sketch below)
  4. Register with Claude Code: claude mcp add hydramcp -s user -- node /path/to/dist/index.js
  5. Restart Claude Code, say "list models"
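Step 3 is just a couple of base URLs. The variable names below are placeholders I'm using for illustration - check the repo's README for the real ones:

```typescript
// Placeholder variable names for illustration; the repo's README documents the real ones.
const config = {
  // CLIProxyAPI's local OpenAI-compatible endpoint (port depends on your CLIProxyAPI setup)
  cliProxyBaseUrl: process.env.CLIPROXY_BASE_URL ?? "http://localhost:8317",
  // Ollama's default local endpoint
  ollamaBaseUrl: process.env.OLLAMA_BASE_URL ?? "http://localhost:11434",
};
```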

From there you just talk naturally. "Ask GPT-5 to review this." "Compare three models on this approach." "Get consensus on whether this is thread-safe." Claude Code routes it through HydraMCP automatically.

What I'd Like to Add

The provider interface makes this extensible by design. The backends I want to see next:

  • LM Studio for another local model option
  • OpenRouter for pay-per-token access to models you don't subscribe to
  • Direct API keys for OpenAI, Anthropic, and Google without needing CLIProxyAPI

Each one is roughly 100 lines of TypeScript. Implement the three interface methods, register it, done.
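To give a feel for what "register it" means, here's an illustrative sketch - the registry structure and cliProxyProvider name are mine, not the actual source:

```typescript
// Illustrative wiring, reusing the OllamaProvider sketch from earlier; cliProxyProvider is a
// hypothetical stand-in for the cloud-side implementation.
declare const cliProxyProvider: Provider;

const providers: Provider[] = [
  cliProxyProvider,     // cloud models via the local OpenAI-compatible proxy
  new OllamaProvider(), // local models
  // an LM Studio or OpenRouter provider would slot in here the same way
];

// list_models is then just every healthy provider's inventory, concatenated.
async function listAllModels(): Promise<ModelInfo[]> {
  const healthy: Provider[] = [];
  for (const p of providers) {
    if (await p.healthCheck()) healthy.push(p);
  }
  return (await Promise.all(healthy.map((p) => p.listModels()))).flat();
}
```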

Why This Matters

The real value isn't any single feature. It's the workflow change. Instead of trusting one model's opinion, you can cheaply verify it against others. Instead of wondering if GPT or Claude is better for a specific task, you can just compare them and see.

Different models have genuinely different strengths. I've seen GPT-5 catch performance issues that Claude missed, and Claude suggest architectural patterns that GPT didn't consider. Gemini sometimes gives the most thorough analysis. Local Qwen is surprisingly good at explaining why something is wrong, not just what is wrong.

Having all of them available from one terminal, with parallel execution and structured comparison, changes how you think about using AI for code. It goes from "ask my preferred model" to "ask the right model for this task" - or just ask all of them and see what shakes out.

The Repo

HydraMCP on GitHub

MIT licensed. If you have subscriptions collecting dust or local models sitting idle, this puts them to work. And if you want to add a provider, the interface is documented and the examples are there.
