<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jeanfrancois Arcand</title>
    <description>The latest articles on DEV Community by Jeanfrancois Arcand (@jfarcand).</description>
    <link>https://dev.to/jfarcand</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3817009%2F2ab1b17e-9dff-42fe-9bad-7916c4e3d9fe.jpeg</url>
      <title>DEV Community: Jeanfrancois Arcand</title>
      <link>https://dev.to/jfarcand</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jfarcand"/>
    <language>en</language>
    <item>
      <title>I built an OpenAI-compatible server that routes to whichever AI CLI you have installed</title>
      <dc:creator>Jeanfrancois Arcand</dc:creator>
      <pubDate>Tue, 10 Mar 2026 14:42:15 +0000</pubDate>
      <link>https://dev.to/jfarcand/what-happens-when-you-treat-ai-clis-as-llm-backends-k0h</link>
      <guid>https://dev.to/jfarcand/what-happens-when-you-treat-ai-clis-as-llm-backends-k0h</guid>
      <description>&lt;p&gt;I've been building dev tools that need LLM completions. The usual path is: pick a provider, wire up their SDK, manage API keys, handle their specific response format. Repeat for each provider you want to support.&lt;/p&gt;

&lt;p&gt;But I already had &lt;code&gt;claude code&lt;/code&gt;, &lt;code&gt;copilot&lt;/code&gt;, and &lt;code&gt;codex&lt;/code&gt; installed on my machine. They're authenticated, they have access to the models, they even have their own billing. Why am I signing up for separate API keys?&lt;/p&gt;

&lt;p&gt;So I tried something different: what if I just spawned those CLI tools as subprocesses and parsed their output?&lt;/p&gt;

&lt;h2&gt;The idea&lt;/h2&gt;

&lt;p&gt;Every AI coding CLI emits structured output in some format — JSON, NDJSON, JSONL — when you ask nicely. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Claude Code&lt;/span&gt;
claude &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"hello"&lt;/span&gt; &lt;span class="nt"&gt;--output-format&lt;/span&gt; json

&lt;span class="c"&gt;# GitHub Copilot&lt;/span&gt;
copilot &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"hello"&lt;/span&gt;

&lt;span class="c"&gt;# Codex&lt;/span&gt;
codex &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="s2"&gt;"hello"&lt;/span&gt; &lt;span class="nt"&gt;--json&lt;/span&gt; &lt;span class="nt"&gt;--full-auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output varies wildly between tools, but the information is the same: a text completion, maybe a model name, maybe token counts. So I wrote parsers for each one and put a common trait on top.&lt;/p&gt;
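
&lt;p&gt;To give a feel for that normalization step, here's a minimal sketch. The per-provider field names below are invented for illustration (the real schemas vary by tool and version, which is exactly why the parsers exist):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use serde_json::Value;

/// The common shape every parser produces (simplified).
struct Completion {
    text: String,
    model: Option&amp;lt;String&amp;gt;,
}

/// Sketch: pull a completion out of whatever JSON a CLI printed.
/// The field names below are illustrative, not the tools' real schemas.
fn parse_cli_output(provider: &amp;amp;str, raw: &amp;amp;str) -&amp;gt; Option&amp;lt;Completion&amp;gt; {
    let v: Value = serde_json::from_str(raw).ok()?;
    let text = match provider {
        "claude_code" =&amp;gt; v.get("result")?.as_str()?.to_string(),
        "copilot" =&amp;gt; v.get("response")?.as_str()?.to_string(),
        _ =&amp;gt; v.get("text")?.as_str()?.to_string(),
    };
    let model = v.get("model").and_then(|m| m.as_str()).map(String::from);
    Some(Completion { text, model })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
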

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbb6y2425cjr08kx3mo6g.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbb6y2425cjr08kx3mo6g.jpeg" alt="Ski sign in Québec: RALENTISSEZ — slow down, tricky turns ahead" width="800" height="1066"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That became &lt;a href="https://github.com/dravr-ai/dravr-embacle" rel="noopener noreferrer"&gt;embacle&lt;/a&gt;, a Rust library. But honestly, most people don't need the Rust library directly. What turned out to be more useful is what sits on top of it.&lt;/p&gt;
&lt;h2&gt;embacle-server: drop-in replacement for OpenAI's API&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;embacle-server&lt;/code&gt; is an OpenAI-compatible HTTP server. You start it, point your existing OpenAI client at it, and it routes requests to whichever CLI tool you want.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo &lt;span class="nb"&gt;install &lt;/span&gt;embacle-server
embacle-server &lt;span class="nt"&gt;--provider&lt;/span&gt; claude_code &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now any OpenAI client works against it. Python, TypeScript, curl — doesn't matter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "claude:opus",
    "messages": [{"role": "user", "content": "hello"}]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get back a standard OpenAI-shaped response. SSE streaming works too — same &lt;code&gt;data: [DONE]&lt;/code&gt; protocol.&lt;/p&gt;
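
&lt;p&gt;Or from Rust with a plain HTTP client. A sketch assuming &lt;code&gt;reqwest&lt;/code&gt; (with its &lt;code&gt;json&lt;/code&gt; feature) and &lt;code&gt;tokio&lt;/code&gt; as dependencies; nothing here is embacle-specific:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use serde_json::{json, Value};

// Minimal client sketch: POST an OpenAI-shaped request, read the reply.
#[tokio::main]
async fn main() -&amp;gt; Result&amp;lt;(), Box&amp;lt;dyn std::error::Error&amp;gt;&amp;gt; {
    let body = json!({
        "model": "claude:opus",
        "messages": [{"role": "user", "content": "hello"}]
    });
    let resp: Value = reqwest::Client::new()
        .post("http://localhost:8080/v1/chat/completions")
        .json(&amp;amp;body)
        .send()
        .await?
        .json()
        .await?;
    // Standard OpenAI shape: the completion text lives at
    // choices[0].message.content.
    println!("{}", resp["choices"][0]["message"]["content"]);
    Ok(())
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
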

&lt;p&gt;&lt;strong&gt;Important caveat:&lt;/strong&gt; this is not free. The CLI tools use your existing subscriptions. If you're paying for Claude Code or Copilot, embacle lets your other applications use those same tokens through a standard API. It doesn't bypass billing — it reuses the access you already have.&lt;/p&gt;

&lt;h3&gt;Model routing&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;model&lt;/code&gt; field determines which backend handles the request. Prefix with the provider name to be explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Route to Claude Code&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"model"&lt;/span&gt;: &lt;span class="s2"&gt;"claude:opus"&lt;/span&gt;, &lt;span class="s2"&gt;"messages"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;...]&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# Route to Copilot&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"model"&lt;/span&gt;: &lt;span class="s2"&gt;"copilot:gpt-4o"&lt;/span&gt;, &lt;span class="s2"&gt;"messages"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;...]&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# Route to Gemini&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"model"&lt;/span&gt;: &lt;span class="s2"&gt;"gemini:gemini-2.5-pro"&lt;/span&gt;, &lt;span class="s2"&gt;"messages"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;...]&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or pass a bare model name and the server's default provider handles it.&lt;/p&gt;

&lt;h3&gt;Multiplex: fan out to multiple providers&lt;/h3&gt;

&lt;p&gt;This one surprised me by being useful. You can send the same prompt to multiple providers simultaneously:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": ["claude:opus", "copilot:gpt-4o"], "messages": [...]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get back an array of responses. Good for comparing outputs or doing consensus-based validation.&lt;/p&gt;
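
&lt;p&gt;A consensus check on top of that takes a few lines. This sketch assumes each element of the returned array is OpenAI-shaped, with the text at &lt;code&gt;choices[0].message.content&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use serde_json::Value;

/// Sketch: do all providers in a multiplexed response agree?
/// Assumes each array element is an OpenAI-shaped completion object.
fn all_agree(responses: &amp;amp;[Value]) -&amp;gt; bool {
    let mut answers = responses.iter().map(|r| {
        r["choices"][0]["message"]["content"]
            .as_str()
            .unwrap_or("")
            .trim()
    });
    match answers.next() {
        Some(first) =&amp;gt; answers.all(|a| a == first),
        None =&amp;gt; false,
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
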

&lt;h3&gt;Tool calling (even when the CLI doesn't support it)&lt;/h3&gt;

&lt;p&gt;Here's where it gets interesting. Most CLI tools don't have native function calling. They just emit text. But the OpenAI API expects &lt;code&gt;tool_calls&lt;/code&gt; in the response.&lt;/p&gt;

&lt;p&gt;embacle bridges this with a text-based simulation layer. It injects a tool catalog into the system prompt, asks the LLM to emit &lt;code&gt;&amp;lt;tool_call&amp;gt;&lt;/code&gt; XML blocks, parses them out, and returns proper OpenAI-shaped &lt;code&gt;tool_calls&lt;/code&gt; in the response. The caller never knows the difference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"function"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"function"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"get_weather"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"city"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;}}}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool_choice"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"auto"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What's the weather in Montréal?"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works with any of the 10 supported CLI backends. The simulation isn't perfect — it depends on the LLM following the XML format — but in practice it works reliably with Claude Code, GPT-5.3-codex, and GitHub Copilot.&lt;/p&gt;
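
&lt;p&gt;The extraction side of that simulation is conceptually simple. Here's a sketch of the idea; embacle's actual tag grammar and error handling live in the library and may differ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// Sketch: pull &amp;lt;tool_call&amp;gt;...&amp;lt;/tool_call&amp;gt; blocks out of raw LLM text.
/// If nothing parses, the caller falls back to returning plain text.
fn extract_tool_calls(text: &amp;amp;str) -&amp;gt; Vec&amp;lt;&amp;amp;str&amp;gt; {
    let (open, close) = ("&amp;lt;tool_call&amp;gt;", "&amp;lt;/tool_call&amp;gt;");
    let mut calls = Vec::new();
    let mut rest = text;
    while let Some(start) = rest.find(open) {
        let after = &amp;amp;rest[start + open.len()..];
        match after.find(close) {
            Some(end) =&amp;gt; {
                calls.push(after[..end].trim());
                rest = &amp;amp;after[end + close.len()..];
            }
            None =&amp;gt; break, // unterminated block: keep what we have
        }
    }
    calls
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
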

&lt;h2&gt;embacle-mcp: MCP server&lt;/h2&gt;

&lt;p&gt;If you're in the MCP ecosystem (Claude Desktop, Cursor, etc.), &lt;code&gt;embacle-mcp&lt;/code&gt; exposes the same providers via JSON-RPC 2.0 over stdio or HTTP/SSE. Install it, add it to your MCP config, and all 10 CLI backends become reachable as MCP tools.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo &lt;span class="nb"&gt;install &lt;/span&gt;embacle-mcp
embacle-mcp &lt;span class="nt"&gt;--transport&lt;/span&gt; stdio &lt;span class="nt"&gt;--provider&lt;/span&gt; claude_code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
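
&lt;p&gt;For Claude Desktop, the registration typically looks something like this (the exact config file and schema depend on your MCP client, so treat it as a sketch):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "mcpServers": {
    "embacle": {
      "command": "embacle-mcp",
      "args": ["--transport", "stdio", "--provider", "claude_code"]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
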



&lt;h2&gt;For Rust developers&lt;/h2&gt;

&lt;p&gt;OK, if you've read this far and you write Rust, here's the library-level view.&lt;/p&gt;

&lt;p&gt;All 10 runners implement the same trait:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nd"&gt;#[async_trait]&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;LlmProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;'static&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;default_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ChatRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ChatResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RunnerError&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;complete_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ChatRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ChatStream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RunnerError&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;health_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RunnerError&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using it is straightforward. First, add the dependency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[dependencies]&lt;/span&gt;
&lt;span class="py"&gt;embacle&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.8.1"&lt;/span&gt;
&lt;span class="py"&gt;tokio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;features&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"full"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then pick a runner and call &lt;code&gt;complete()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;path&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;PathBuf&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;embacle&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;ClaudeCodeRunner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RunnerConfig&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;embacle&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;types&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChatRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LlmProvider&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nb"&gt;Box&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;dyn&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;error&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;RunnerConfig&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;PathBuf&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"claude"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;ClaudeCodeRunner&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;ChatRequest&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nn"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Respond with exactly: PING_OK"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;]);&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="nf"&gt;.complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="py"&gt;.content&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, this spawns &lt;code&gt;claude -p "Respond with exactly: PING_OK" --output-format json&lt;/code&gt;, reads stdout, parses the JSON, and gives you back a typed &lt;code&gt;ChatResponse&lt;/code&gt;. If the process times out, exits non-zero, or emits garbage, you get a typed &lt;code&gt;RunnerError&lt;/code&gt; — not a panic.&lt;/p&gt;
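
&lt;p&gt;Stripped of embacle's timeout and retry handling, that mechanism is roughly the following. The &lt;code&gt;"result"&lt;/code&gt; field name is an assumption for illustration; the library's parsers know each tool's real schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use tokio::process::Command;

// Sketch of the spawn-and-parse mechanism, not embacle's actual code.
async fn raw_complete(prompt: &amp;amp;str) -&amp;gt; Result&amp;lt;String, Box&amp;lt;dyn std::error::Error&amp;gt;&amp;gt; {
    let out = Command::new("claude")
        .args(["-p", prompt, "--output-format", "json"])
        .output()
        .await?;
    if !out.status.success() {
        return Err(format!("claude exited with {}", out.status).into());
    }
    let v: serde_json::Value = serde_json::from_slice(&amp;amp;out.stdout)?;
    // "result" is an assumed field name for illustration.
    Ok(v["result"].as_str().unwrap_or_default().to_string())
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
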

&lt;p&gt;Because every runner implements the same trait, you can swap providers, build fallback chains, add metrics decorators, or run structured output extraction on top — all through the same interface. The &lt;a href="https://docs.rs/embacle" rel="noopener noreferrer"&gt;docs.rs page&lt;/a&gt; has working examples for each of these.&lt;/p&gt;
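
&lt;p&gt;For example, a fallback chain needs nothing from the library beyond the trait. A sketch, assuming &lt;code&gt;ChatResponse&lt;/code&gt; is exported from &lt;code&gt;embacle::types&lt;/code&gt; like the other types and that &lt;code&gt;RunnerError&lt;/code&gt; derives &lt;code&gt;Debug&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use embacle::types::{ChatRequest, ChatResponse, LlmProvider};

/// Sketch: try each provider in order; return the first success.
async fn complete_with_fallback(
    providers: &amp;amp;[Box&amp;lt;dyn LlmProvider&amp;gt;],
    request: &amp;amp;ChatRequest,
) -&amp;gt; Option&amp;lt;ChatResponse&amp;gt; {
    for p in providers {
        match p.complete(request).await {
            Ok(resp) =&amp;gt; return Some(resp),
            Err(e) =&amp;gt; eprintln!("{} failed ({e:?}); trying next", p.name()),
        }
    }
    None
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
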

&lt;h2&gt;The tradeoffs&lt;/h2&gt;

&lt;p&gt;I should be honest about what this approach gives up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency.&lt;/strong&gt; Spawning a subprocess per request is slower than a persistent HTTP connection. For interactive use it's fine. For high-throughput batch processing, you'll feel it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No streaming everywhere.&lt;/strong&gt; Some CLIs don't support streaming output. embacle falls back to buffered mode, but you won't get token-by-token delivery from every provider.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Depends on installed tools.&lt;/strong&gt; If &lt;code&gt;claude&lt;/code&gt; isn't on the machine, the Claude runner won't work. The &lt;code&gt;discovery&lt;/code&gt; module probes for binaries at startup, but you're still coupled to what's installed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool simulation is best-effort.&lt;/strong&gt; The XML-based tool calling works well in practice but isn't a guarantee. If the LLM doesn't follow the format, the parse fails and you get the raw text back.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For my use case — an internal platform where multiple services need LLM completions and the machine already has authenticated CLI tools — the tradeoffs are worth it. Zero API key management, one interface, and I can switch between Claude and Copilot by changing a string.&lt;/p&gt;




&lt;p&gt;Source: &lt;a href="https://github.com/dravr-ai/dravr-embacle" rel="noopener noreferrer"&gt;github.com/dravr-ai/dravr-embacle&lt;/a&gt; — Apache-2.0.&lt;/p&gt;

&lt;p&gt;Install from source with &lt;code&gt;cargo install embacle-server&lt;/code&gt; (or &lt;code&gt;embacle-mcp&lt;/code&gt;), or grab a prebuilt binary from the &lt;a href="https://github.com/dravr-ai/dravr-embacle/releases" rel="noopener noreferrer"&gt;GitHub releases page&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>openai</category>
      <category>mcp</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
