Werner Kasselman

Posted on Jun 14

Going Remote, Without Going Reckless: Multi-LLM Orchestration and the New Front Door in llm-cli-gateway 2.9.0

#ai #llm #cli #opensource

The earlier posts in this series were about what the gateway lets you call (cache-aware spawning across five providers, the Codex review gate, the CLI-versus-API argument), and the one before this was about the parts that do not show up as a tool: the upstream tracking and the website that became the project's front door. This one is about a different kind of front door: the gateway can now listen over HTTP, behind real authentication, and serve more than one caller without those callers being able to read each other's work. That sounds like a small toggle. It is not. Moving an MCP server off a local pipe and onto a network port changes the trust boundary completely, and 2.9.0 is the release where I sat down and remediated all seventeen findings from a multi-LLM red-team of exactly that surface before telling anyone the remote path was ready.

Short version: llm-cli-gateway is one Model Context Protocol server that wraps five vendor CLIs (Claude Code 2.1.177, Codex 0.139.0, Google's Antigravity agy 1.0.8, Grok 0.2.51, and Mistral Vibe 2.14.1) behind a single, uniform tool surface, so one orchestrating agent can fan a task out to several models, collect independent opinions, run a red-team or a consensus check, and keep durable session and job state across all of it. Until recently that only made sense on localhost over stdio. As of 2.9.0 the same server runs over HTTP with a static bearer token or a built-in OAuth 2.0 authorisation server (PKCE on by default, an opt-in human-consent gate, and a trusted-principal-header seam for when you front it with your own identity-aware proxy); every session, job, and stored request is stamped with an owner principal and access is enforced per principal; remote provider calls are refused unless a workspace is registered; and the whole thing fails closed rather than open when the configuration is dangerous.

Long version is below, same shape as always: what it enables, what the remote options actually are, a stack of worked scenarios, and the caveats named up front rather than buried at the bottom.

What the gateway actually enables

Before the remote story, it is worth being precise about what the thing does, because "multi-LLM orchestration" is the kind of phrase that means nothing until you name the tools.

Each provider gets two tools, a synchronous one and an asynchronous one: claude_request and claude_request_async, and the matching pairs for codex, gemini, grok, and mistral. Codex additionally exposes codex_fork_session, because forking a reasoning session into a branch is a Codex-specific capability worth surfacing rather than hiding. An orchestrating agent (Claude Code on your laptop, say) calls these the way it would call any other MCP tool, and the gateway handles the spawn, the timeout, the output-size cap, the retry-with-circuit-breaker, the token accounting, and the structured error.

On top of the raw provider calls there is a validation layer, and this is the part I reach for most: validate_with_models sends the same claim to several models independently and reports where they agree and where they do not; second_opinion is the one-model version; red_team_review runs an adversarial pass; consensus_check asks whether a set of models actually agrees on a specific claim rather than vaguely nodding along; compare_answers does a local diff with no provider calls at all; synthesize_validation runs a judge over the collected results; and ask_model is the deliberately simple "just ask one model this" entry point. The reason this matters is that a single model's confident answer is a sample of one, and on anything that carries risk (a migration, a security claim, an architecture decision) I would rather pay three to four times the tokens and get three independent reads, two of which come from a different vendor family entirely, than trust one well-phrased paragraph.

Underneath all of that sits the state. Sessions persist conversation continuity per CLI (Claude's --continue, Codex's exec resume, Grok's --resume, and the rest), stored minimally with no conversation content on disk. Async requests are durable: a synchronous call auto-defers at 45 seconds (configurable via SYNC_DEADLINE_MS) and hands you a job id, the job runs to completion in the background, and you collect it later with llm_job_status and llm_job_result, or cancel it with llm_job_cancel. The job store is SQLite by default (Node's built-in node:sqlite, no native module to compile), results are retained for thirty days, and an identical re-issued call inside a one-hour window reattaches to the live job rather than starting a second one. Frankly, that durability is the unglamorous feature that saves the most grief: a long red-team sweep that runs for twelve minutes does not vanish because your polling wrapper timed out or your orchestrator restarted. Every request, sync or async, is also logged to a flight recorder (logs.db) with timing, token usage, cache statistics, the approval decision, and the exit code, with secrets redacted before they are written.
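To make the reattach behaviour concrete, here is a minimal sketch of a dedup window, not the gateway's actual job store. The JobDeduper class, the request_key composition, and the in-memory dictionary are all assumptions made for illustration; the real store is SQLite-backed and also tracks whether the job is still live, which this sketch ignores.

```python
import hashlib
import json
import time
from typing import Callable, Dict, Tuple

DEDUP_WINDOW_SECONDS = 3600  # the one-hour window described above


class JobDeduper:
    """Hypothetical in-memory stand-in for a durable job store."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        # request key -> (job id, creation timestamp)
        self._jobs: Dict[str, Tuple[str, float]] = {}
        self._next_id = 0

    @staticmethod
    def request_key(provider: str, prompt: str, options: dict) -> str:
        # Canonical JSON, so the order of keys in options does not change the hash.
        payload = json.dumps([provider, prompt, options], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def submit(self, provider: str, prompt: str, options: dict) -> str:
        key = self.request_key(provider, prompt, options)
        now = self._clock()
        existing = self._jobs.get(key)
        if existing is not None and now - existing[1] < DEDUP_WINDOW_SECONDS:
            return existing[0]  # identical call inside the window: reuse the job id
        self._next_id += 1
        job_id = f"job-{self._next_id}"
        self._jobs[key] = (job_id, now)
        return job_id
```

The clock is injectable only so the window can be exercised without waiting an hour; two identical submissions inside the window return the same job id, and the same submission after the window gets a fresh one.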

Token optimisation (opt-in prompt and response compression) and an approval gate (approvalStrategy:"mcp_managed", which scores each request and can deny it) round it out. None of this is exotic. It is the plumbing you would build anyway if you wanted to orchestrate five CLIs seriously, written once, in one place, so you do not rebuild it five times.

Why remote is the genuinely hard part

On localhost over stdio, the trust model is trivial: there is one user, it is you, the pipe is private to your process tree, and the only principal that exists is "the person sitting at this machine." The gateway has always called that principal local, and a local caller can see everything, because everything is theirs.

The moment the server binds a TCP port, every one of those assumptions evaporates. Now there can be more than one caller, the callers may not trust each other, the bytes arriving on the socket are untrusted until proven otherwise, and "can this caller read this session?" stops being a question with an obvious answer. The temptation, and I have seen plenty of projects give in to it, is to ship the HTTP transport with a "set a token if you like" note and call authorisation someone else's problem. I did not want to do that, because the gateway spawns real CLIs that read and write real files, and a remote surface that leaks one tenant's jobs into another tenant's view, or lets an unauthenticated caller trigger a provider spawn against your repository, is not a convenience; it is a liability with a friendly README.

So before documenting the remote path as ready, I ran an extensive red-team of the external and internal MCP surface across four models (Claude orchestrating, with Codex, Grok, and Mistral reviewing independently, each verifying claims against the code rather than accepting a summary), consolidated it down to seventeen confirmed findings, and remediated every one of them. 2.9.0 is that remediation. The numbered findings below (F1, F3, F14, and so on) are the internal labels, kept here only because they map cleanly onto the features.

The new remote options, concretely

Transports. The default is still stdio, and if you never set anything you get exactly the local behaviour you had before. Set the transport to HTTP and the gateway stands up a Streamable-HTTP MCP endpoint, binding to 127.0.0.1 on port 3333 at path /mcp by default (LLM_GATEWAY_HTTP_HOST, LLM_GATEWAY_HTTP_PORT, LLM_GATEWAY_HTTP_PATH). There is also a dormant Agent Client Protocol transport whose foundations shipped in 2.8.0, off by default, which I will write about separately once the later phases land.

Static bearer authentication. The simplest remote auth: set LLM_GATEWAY_AUTH_TOKEN, and every request must present it as a bearer token, compared in constant time so the check does not leak the token through its own timing. This is the right choice for a single-user remote setup (your own server, reachable only by you).
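A constant-time bearer check can be sketched in a few lines. This is an illustration of the technique rather than the gateway's code: the bearer_matches function and the choice to hash both sides first are assumptions; hmac.compare_digest is the Python standard-library primitive for timing-safe comparison.

```python
import hashlib
import hmac


def bearer_matches(authorization_header: str, expected_token: str) -> bool:
    """Return True only if the header carries the expected bearer token."""
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return False
    # Hash both sides first so the comparison always runs over equal-length
    # inputs, then compare in constant time.
    presented_digest = hashlib.sha256(presented.strip().encode("utf-8")).digest()
    expected_digest = hashlib.sha256(expected_token.encode("utf-8")).digest()
    return hmac.compare_digest(presented_digest, expected_digest)
```

A plain == on the two strings can return as soon as the first differing character is found, which is the timing leak the constant-time comparison avoids.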

A built-in OAuth 2.0 authorisation server. For anything multi-client, the gateway can be its own OAuth server, configured under [http.oauth] in ~/.llm-cli-gateway/config.toml. It exposes /oauth/authorize, /oauth/token, /oauth/register, and the /.well-known/oauth-protected-resource metadata an MCP client reads to configure itself. PKCE is required by default (require_pkce = true), plain PKCE is off, public clients are off, the client-registration policy defaults to the conservative static_clients (with shared_secret and a dev-only open_dev available when you opt in), and tokens live for an hour by default. The point of the built-in server is that an MCP client which speaks OAuth can discover the metadata, register or use a pre-configured client, run the authorisation-code flow, and start calling tools, without you bolting on a separate identity product just to try the remote path.
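For readers who have not implemented PKCE, the S256 method from RFC 7636 is small enough to show in full. The function names below are hypothetical and this is a sketch of the standard, not the gateway's implementation: the client keeps a random verifier, sends its hashed challenge with the authorisation request, and proves possession by sending the verifier at the token endpoint.

```python
import base64
import hashlib
import hmac
import secrets


def make_code_verifier() -> str:
    # 32 random bytes encode to 43 URL-safe characters, the RFC 7636 minimum.
    return secrets.token_urlsafe(32)


def s256_challenge(code_verifier: str) -> str:
    # challenge = BASE64URL(SHA256(verifier)) with the padding stripped
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verify_pkce(code_verifier: str, stored_challenge: str) -> bool:
    # Server side: recompute the challenge from the presented verifier and
    # compare it with the one stored when the code was issued.
    return hmac.compare_digest(s256_challenge(code_verifier), stored_challenge)
```

With "plain" PKCE the challenge is the verifier itself, so anyone who observes the authorisation request can redeem the code; that is why keeping plain off by default is the conservative choice.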

An opt-in human-consent gate (F14b). An authorisation-code flow that mints a token the instant a client asks, with no human in the loop, is the kind of thing that looks fine until someone points out that "the client asked nicely" is not authentication of the resource owner. So there is now an optional consent gate: set require_consent = true (or LLM_GATEWAY_OAUTH_REQUIRE_CONSENT=1) with a dedicated, scrypt-hashed consent password, and before any code is issued the operator has to approve at a small consent screen, protected against cross-site request forgery with a double-submit cookie. It is off by default, because not every deployment wants a human in the loop, but it is there for the ones that do.
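The two mechanisms named here, a scrypt-hashed password and a double-submit cookie, can be sketched as follows. Everything in the block is illustrative: the function names, the 16-byte salt, and the scrypt cost parameters are assumptions, not the gateway's actual values.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

# Illustrative cost parameters; a real deployment would tune these.
SCRYPT_PARAMS = {"n": 2**14, "r": 8, "p": 1, "dklen": 32}


def hash_consent_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest) suitable for storing instead of the password."""
    salt = secrets.token_bytes(16)
    digest = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return salt, digest


def consent_password_ok(password: str, salt: bytes, expected: bytes) -> bool:
    candidate = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return hmac.compare_digest(candidate, expected)


def csrf_ok(cookie_token: str, form_token: str) -> bool:
    # Double-submit: the token in the cookie must equal the token posted in
    # the form. A cross-site page can cause the cookie to be sent but cannot
    # read it, so it cannot supply a matching form value.
    if not cookie_token or not form_token:
        return False
    return hmac.compare_digest(cookie_token.encode("utf-8"), form_token.encode("utf-8"))
```

Approval would only proceed when both checks pass: the form token matches the cookie, and the submitted password matches the stored scrypt digest.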

A trusted-principal-header seam (F14a). Plenty of people already run a proper identity-aware proxy in front of their services (an OAuth2 proxy, an identity-aware gateway, your own front door) and would rather terminate identity there than have the gateway reinvent it. For them there is a seam: set LLM_GATEWAY_TRUSTED_PRINCIPAL_HEADER to the name of a header your proxy injects, and the gateway will adopt that header's value as the calling principal, but only when the request also authenticated as the static gateway bearer, and only when the value matches a tight character set so it cannot smuggle anything into the logs. This is the deliberately IdP-agnostic part of the design: the gateway ships the seam, you bring whatever identity provider you already trust, and the gateway never needs to know which one it is.
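The adoption rule reads more clearly as code. This is a sketch under stated assumptions: the resolve_principal function, the exact character set and length limit, and the decision to refuse a malformed value outright are all hypothetical; only the gateway-bearer principal name and the two conditions (bearer-authenticated, tight character set) come from the description above.

```python
import re
from typing import Mapping, Optional

# Assumed character set and length; the gateway's real pattern may differ.
PRINCIPAL_PATTERN = re.compile(r"[A-Za-z0-9._@-]{1,128}")


def resolve_principal(
    headers: Mapping[str, str],      # assumed to have lower-cased keys
    bearer_authenticated: bool,
    trusted_header: Optional[str],   # value of LLM_GATEWAY_TRUSTED_PRINCIPAL_HEADER
) -> Optional[str]:
    """Return the calling principal, or None if the request is refused."""
    if not bearer_authenticated:
        return None  # the header is never trusted without the bearer
    if trusted_header is None:
        return "gateway-bearer"
    value = headers.get(trusted_header.lower())
    if value is None:
        return "gateway-bearer"
    if PRINCIPAL_PATTERN.fullmatch(value) is None:
        return None  # assumption: a malformed value is refused, not sanitised
    return value
```

fullmatch is used rather than match so that a trailing newline or control character cannot ride along behind an otherwise valid name.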

Fail-closed posture (F17). If you enable OAuth with public clients or the dev-only open registration and then try to bind the server to a non-loopback address, the gateway refuses to start without an explicit operator override. It will not let you accidentally expose an open registration endpoint to the internet, and it no longer trusts the Host header to decide whether it is "really" on loopback. The default for a dangerous combination is to stop, not to shrug and serve.
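A startup guard of this shape might look like the sketch below. The function names and the boolean parameters are hypothetical; the point is that the decision is made from the configured bind address, never from a request header, and that anything unparseable counts as non-loopback.

```python
import ipaddress


def is_loopback_bind(host: str) -> bool:
    """True only for a literal loopback IP such as 127.0.0.1 or ::1."""
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Hostnames and garbage are treated as non-loopback: fail closed.
        return False


def check_startup(
    bind_host: str,
    public_clients: bool,
    open_dev_registration: bool,
    operator_override: bool,
) -> None:
    """Raise instead of serving when a dangerous combination is configured."""
    dangerous = public_clients or open_dev_registration
    if dangerous and not is_loopback_bind(bind_host) and not operator_override:
        raise RuntimeError(
            f"refusing to bind {bind_host!r}: public clients or open registration "
            "on a non-loopback address needs an explicit operator override"
        )
```

Note that this sketch treats the name "localhost" as non-loopback too, since only literal addresses are accepted; that is stricter than necessary but errs in the safe direction.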

Per-principal isolation, and an honest limit (F3)

Authentication tells you who is calling. Isolation decides what they can see, and the two are not the same thing. In 2.9.0 every session, every async job, and every persisted request is stamped with an owner principal at creation, and the read and mutate paths enforce it: a caller can see its own rows, a local caller can additionally see the legacy rows that predate isolation (those have no owner and would otherwise become invisible), and that is all. session_list is filtered to the caller, session_get and session_delete and the job and request-readback tools resolve the owner and return "not found" rather than someone else's data, and a remote OAuth client never sees a pre-isolation row at all.
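The visibility rule is compact enough to write down. The Row type and the function names are hypothetical, and a real store would push the filter into the query; the rule itself (own rows, plus ownerless legacy rows for the local principal only, and "not found" for everything else) is the one described above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

LOCAL_PRINCIPAL = "local"


@dataclass(frozen=True)
class Row:
    row_id: str
    owner: Optional[str]  # None marks a legacy row that predates isolation


def can_access(caller: str, row: Row) -> bool:
    if row.owner is None:
        return caller == LOCAL_PRINCIPAL  # legacy rows: local caller only
    return row.owner == caller


def list_visible(caller: str, rows: Iterable[Row]) -> List[Row]:
    return [row for row in rows if can_access(caller, row)]


def get_row(caller: str, rows: Iterable[Row], row_id: str) -> Optional[Row]:
    for row in rows:
        if row.row_id == row_id and can_access(caller, row):
            return row
    # Same answer whether the row is missing or belongs to someone else,
    # so a caller cannot probe for the existence of other principals' rows.
    return None
```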

Here is the honest limit, named up front: a single shared static bearer token is one principal, by design. If five engineers all authenticate with the same LLM_GATEWAY_AUTH_TOKEN, the gateway correctly treats them as one principal (gateway-bearer) and they share a view, because from the server's side they are indistinguishable. Genuine multi-tenancy, where each human is their own principal, needs either an OAuth client per user or the trusted-principal-header seam behind an identity-aware proxy. The static bearer is not a tenancy boundary, and 2.9.0 documents that.

Workspace gating for remote calls

A local stdio caller can pass any workingDir it likes, because it is you, on your machine, and the working directory is your own. A remote caller is a different proposition entirely, and letting an HTTP request name an arbitrary directory to spawn a provider against is precisely the sort of thing a red-team finding is made of. So remote provider requests (HTTP transport, or an OAuth-authenticated call) are refused unless the work is anchored to a registered workspace: a named alias from the [workspaces] registry, a workspace bound to the session, or a configured [workspaces].default. The registry itself is operator-controlled, and the tools that mint new workspaces (workspace_create, workspace_register_existing_repo) require both an environment opt-in (LLM_GATEWAY_WORKSPACE_ADMIN=1) and a workspace:admin OAuth scope, and can only operate inside explicitly allowed roots. Local stdio keeps the prior unrestricted behaviour, because there the restriction would buy nothing.
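As a sketch of the gating logic: a remote request resolves to a registered workspace or is refused, while a local one passes its directory through untouched. The function, the exception, and in particular the precedence order (explicit alias, then session-bound workspace, then default) are assumptions for illustration; the text above names the three sources but not their order.

```python
from typing import Mapping, Optional


class WorkspaceRefused(Exception):
    """Raised when a remote request is not anchored to a registered workspace."""


def resolve_working_dir(
    *,
    remote: bool,
    requested_dir: Optional[str],
    alias: Optional[str],
    session_workspace: Optional[str],
    default_alias: Optional[str],
    registry: Mapping[str, str],  # alias -> absolute path, operator-controlled
) -> Optional[str]:
    if not remote:
        return requested_dir  # local stdio keeps the unrestricted behaviour
    # Assumed precedence: explicit alias, then session-bound, then default.
    for candidate in (alias, session_workspace, default_alias):
        if candidate is not None:
            if candidate not in registry:
                raise WorkspaceRefused(f"unknown workspace {candidate!r}")
            return registry[candidate]
    # A remote caller's raw requested_dir is never honoured on its own.
    raise WorkspaceRefused("remote requests need a registered workspace")
```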

What the same release did to the approval gate (F15)

Two more things landed in 2.9.0 that are worth a paragraph, because they change a default. First, under approvalStrategy:"mcp_managed" a full permission or sandbox bypass request is now denied by default regardless of how the heuristic scored it, unless the operator explicitly opts back in. Second, and this is the behaviour change to be aware of, the managed strategy no longer force-sets each provider to its most permissive mode. It used to push Claude to bypassPermissions, Gemini to yolo, Grok to --always-approve, and Mistral to auto-approve on every single managed request, which meant a child could obtain a fully unlocked provider through the managed path without ever asking for a bypass. Now each provider defaults to an accept-edits-level mode instead: Claude and Grok to --permission-mode acceptEdits, Mistral to --agent accept-edits, and Gemini to its prompted default (the agy CLI has no accept-edits rung at all, so the safe default is prompted execution; the honest trade-off is that Gemini cannot auto-approve mutating tools under the managed strategy unless you opt in). Setting LLM_GATEWAY_APPROVAL_ALLOW_BYPASS=1 restores the old full-auto-approve behaviour across the board, for the headless operator who genuinely wants it. Secrets are also redacted from the flight recorder now (F4), so prompts and responses stored for tracing no longer carry tokens or keys in the clear.
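The before-and-after can be summarised as a lookup. The mode strings are copied from the paragraph above and should be read as labels, not as verified command-line arguments; the managed_mode function is hypothetical, and Codex is left out because its mode is not stated here.

```python
from typing import Dict, Mapping, Optional

# 2.9.0 defaults under the managed strategy. None means no auto-approve
# mode is applied, so the provider falls back to prompted execution.
MANAGED_DEFAULT: Dict[str, Optional[str]] = {
    "claude": "--permission-mode acceptEdits",
    "grok": "--permission-mode acceptEdits",
    "mistral": "--agent accept-edits",
    "gemini": None,
}

# The pre-2.9.0 behaviour, now reachable only through the operator opt-in.
LEGACY_BYPASS: Dict[str, Optional[str]] = {
    "claude": "bypassPermissions",
    "gemini": "yolo",
    "grok": "--always-approve",
    "mistral": "auto-approve",
}


def managed_mode(provider: str, env: Mapping[str, str]) -> Optional[str]:
    """Pick the mode label for a provider; raises KeyError for unknown names."""
    allow_bypass = env.get("LLM_GATEWAY_APPROVAL_ALLOW_BYPASS") == "1"
    table = LEGACY_BYPASS if allow_bypass else MANAGED_DEFAULT
    return table[provider]
```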

Worked scenarios

Abstract capability lists are easy to nod along to and hard to act on, so here are concrete situations where this actually earns its keep.

Scenario one: the solo developer's three-reviewer loop, all local. You are writing something security-sensitive on your laptop, an auth handler, say. You have Claude Code open as your orchestrator, with the gateway configured over stdio. You implement with codex_request (full-auto, sandboxed workspace-write), then fan the result out to three independent reviewers in parallel with claude_request_async, gemini_request_async, and grok_request_async, each told to end with PASS or FAIL and findings. You poll every sixty seconds, union the findings, and treat anything two reviewers independently flag as high-confidence. No network, no auth, no tenancy questions, just five CLIs orchestrated from one place, and the durable jobs mean a fifteen-minute Codex pass survives you wandering off to make coffee. This is the case the gateway has always served, and remote changes nothing about it.
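The union-and-threshold step in that loop is done by the orchestrator, not by a gateway tool, and a minimal sketch of it looks like this. The merge_findings function and the naive normalisation (trim and lower-case) are assumptions; matching findings that different reviewers phrase differently is a harder problem than this shows.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Set, Tuple


def merge_findings(
    reviews: Mapping[str, Iterable[str]],  # reviewer name -> finding labels
    threshold: int = 2,
) -> Tuple[List[str], List[str]]:
    """Return (all findings, findings flagged by at least `threshold` reviewers)."""
    flagged_by: Dict[str, Set[str]] = defaultdict(set)
    for reviewer, findings in reviews.items():
        for finding in findings:
            # Naive normalisation so trivially different labels still match.
            flagged_by[finding.strip().lower()].add(reviewer)
    union = sorted(flagged_by)
    high_confidence = sorted(
        finding for finding, who in flagged_by.items() if len(who) >= threshold
    )
    return union, high_confidence
```

Counting distinct reviewers rather than raw mentions matters: one reviewer repeating the same finding three times is still a sample of one.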

Scenario two: a small team, each engineer their own principal. Now there are four of you, and you want a shared gateway running on a box in the corner so everyone benefits from one warmed-up configuration and one flight recorder, but you absolutely do not want Jess's sessions showing up in Soumik's session_list. You front the gateway with the identity-aware proxy you already run, terminate each engineer's identity there, and have the proxy inject the authenticated username into the header named by LLM_GATEWAY_TRUSTED_PRINCIPAL_HEADER, with the proxy-to-gateway hop authenticated by the static bearer. Now each engineer is their own owner principal, isolation is enforced per person, and the gateway never had to learn the first thing about your identity provider. The bit that makes this safe rather than for show is that the header is only trusted on the bearer-authenticated hop, so a caller cannot simply set the header themselves and impersonate a colleague.

Scenario three: calling the gateway from a phone, over OAuth, with the consent gate on. You want to reach your own orchestration server from an MCP client on your phone while you are away from the desk. You enable the built-in OAuth server, set LLM_GATEWAY_PUBLIC_URL so the issuer metadata is correct, turn on require_consent with a consent password only you know, and put the server behind TLS (the fail-closed posture will stop you binding an open-registration server to a public address by accident anyway). The client discovers the metadata, runs the PKCE authorisation-code flow, and pauses at the consent screen; you approve it once, from your phone, and the token it receives lives for an hour. Every job that client creates is owned by that client's principal, so even if you later add a second client it cannot read the first one's history. This is the scenario the consent gate exists for: a token is not minted just because something on the network asked politely; a human approved it.

Scenario four: durable async jobs from CI or a cron, headless. You want a nightly job that asks three models to review the day's merged diffs and files anything serious. A headless runner has no human to approve per-action prompts, so you set LLM_GATEWAY_APPROVAL_ALLOW_BYPASS=1 deliberately (you are the operator, you know the box is yours, and you want full-auto), fire the three *_request_async calls, and let the run end. The jobs are durable for thirty days, so a separate collector step hours later pulls the results by job id, and if the nightly job and a manual run happen to issue an identical request inside the dedup window they reattach to the same job rather than paying for it twice. The flight recorder gives you the token spend and timing per call afterwards, with no secrets in it.

Scenario five: a full red and blue cycle on a risky change. You have written something you are nervous about. You run the red-team-assessment flow: dispatch the same target to Claude (architecture and trust boundaries), Codex (logic bugs and races), Gemini (known vulnerability classes and dependency CVEs via the research tools), and Grok (the independent perspective that contradicts the other three's shared assumptions), each ending in PASS or FAIL. On a FAIL you send the findings to a different model for the blue-team response (the defender should not be the model that found the attack), implement the fixes, and re-dispatch to the original red-teamers until they all pass. Worth knowing in 2.9.0: under the managed strategy these reviewers now get accept-edits-level access by default (Gemini gets its prompted default instead), which is plenty for reading code with Read, Grep, and Glob plus the sqry, exa, and ref_tools MCP servers, but a reviewer that needs to run shell commands headlessly needs the operator opt-in. For pure analysis, the safer default is the right default.

Scenario six: cross-vendor consensus before a one-way-door decision. You are about to commit to a database migration that is genuinely hard to reverse. Rather than ask one model and convince yourself, you put the migration plan through validate_with_models across all five providers and then consensus_check on the specific claim that matters ("this migration is online-safe and reversible within the deploy window"). Where the five agree, you have a real signal; where one dissents, you have found the question you had not asked yet. synthesize_validation gives you the judge's-eye summary over the lot. I know this sounds like over-engineering a decision, but the asymmetry is the whole point: the consensus pass costs a few minutes and some tokens, and the migration going wrong costs a weekend, so the trade is not close.
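What "where the five agree" and "where one dissents" mean in practice can be shown with a small tally. This is a sketch of the idea on the orchestrator side, not the output format of consensus_check; the summarise_consensus function and the verdict strings are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Mapping, Optional, Union


def summarise_consensus(
    verdicts: Mapping[str, str],  # provider name -> verdict on one specific claim
) -> Dict[str, Union[Optional[str], bool, List[str]]]:
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts.values())
    top_verdict, top_count = counts.most_common(1)[0]
    # A tie for first place means there is no majority to report.
    tied = sum(1 for count in counts.values() if count == top_count) > 1
    majority = None if tied else top_verdict
    dissenters = sorted(
        model for model, verdict in verdicts.items() if verdict != majority
    )
    return {
        "majority": majority,
        "unanimous": len(counts) == 1,
        "dissenters": dissenters,
    }
```

The dissenters list is the useful output: each name in it points at a model whose reasoning is worth reading before the migration goes ahead.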

Caveats, named up front

The static bearer is one principal, not a tenancy boundary; if you need per-user isolation you need OAuth clients per user or the proxy-plus-header seam, and sharing one token means sharing one view, full stop. The built-in OAuth server is a real authorisation server but it is not a substitute for an enterprise identity platform, and the trusted-principal-header seam exists precisely so you can put your real one in front; the gateway is opinionated about being IdP-agnostic. The consent gate and the bypass opt-in are both real on/off switches with real consequences, so read what they do before flipping them. The Postgres persistence backend is an interface only today, not a shipped implementation; SQLite is the durable default and memory is the ephemeral one you must explicitly acknowledge. And the accept-edits default under the managed strategy is a behaviour change from 2.8.0, so if you were relying on managed mode handing your providers a full bypass, that now needs the explicit opt-in. These are the sharp edges, and I would rather you meet them in a list here than in production.

What's next

The Agent Client Protocol transport is in the tree but dormant, and the later phases of that work are where my attention goes next; the per-provider accept-edits tightening is complete across all five; and the remote surface, having been red-teamed down to zero open findings, is now the thing I would actually trust to face a network. As ever, the honest position is that wrapping five vendor CLIs that each move on their own cadence is a moving target, and the work is never finished so much as kept current. But the front door is real now, it locks, and it knows who is knocking.

The gateway is on npm as llm-cli-gateway (2.9.0 is current), the source and signed installer artefacts are on the public mirror, and the website at llm-cli-gateway.dev is built so an MCP client can read one URL and configure itself. If you orchestrate more than one model, or you are about to, I think it is worth the ten minutes.
