DEV Community: Lynkr

Stuffing the Context Window Is Making Your Agent Dumber: What the Research Says

Lynkr — Fri, 17 Jul 2026 01:10:15 +0000

Disclosure: I maintain Lynkr, an open-source gateway that (among other things) compresses agent tool outputs — so I have a horse in this race. This piece, though, is about the research, and every number in it is cited to its primary source.

There's an intuition almost every LLM user shares: more context = better answers. Million-token context windows are marketed as a capability. We paste in whole files "just in case." Our coding agents accumulate every grep result, every file read, every test log, on the theory that the model might need it.

The research says this intuition is not just wrong — it's backwards, and the size of the effect is embarrassing.

The result that should change how you build

The cleanest demonstration comes from the Hindsight memory system (arXiv:2512.12818, demo at ACL 2026). On LongMemEval — a benchmark of questions over long conversational histories — the same open-source 20B model scores:

39.0% when handed the full context — everything, the whole history
83.6% when handed a curated slice selected by a structured memory system

Same model. Same available information. The difference is that one setup made the model read everything, and the other selected what mattered. +44.6 points from subtraction.

It gets more uncomfortable: that 20B model with curated context also beats full-context GPT-4o, which scores 60.2% on the same benchmark. A model a fraction of the size, winning because someone cleaned its desk. As the authors put it, the memory architecture — not model scale — drives the performance.

One benchmark, one paper? No — this is a pile-on:

"Lost in the middle" (Liu et al., 2023) established the shape of the problem early: models attend well to the start and end of long contexts and poorly to the middle — exactly where your agent's fifteenth tool result lives.
Context-rot studies (notably Chroma's 2025 report) showed performance degrading as context grows even when the added tokens are relevant, and degrading faster when they're distractors.
A whole 2026 research wave now treats context as a resource to be managed, not maximized: "Agentic Context Engineering" was accepted at ICLR 2026, active context compression systems prune their own working memory (arXiv:2601.07190), and two consolidating surveys (arXiv:2512.13564, arXiv:2603.07670) formalize memory as a write–manage–read loop — with "manage" doing the heavy lifting.

The marketing said "bigger window." The research says "better librarian."

Why more context makes things worse

Three mechanisms, all well-documented:

1. Attention is a budget, not a spotlight. Every token in context competes for attention mass. Pack in 50k tokens of tool output and the three lines that matter are now competing with 49,900 tokens of noise. Needle-in-a-haystack benchmarks — the ones vendors publish — test retrieval of a planted string, which models are good at. Real tasks require reasoning over the context, which degrades much faster.

2. Distractors don't just dilute — they actively mislead. The context-rot findings show semantically-similar-but-irrelevant content is worse than random filler. Your agent's context is full of this: old versions of the file it's editing, error messages from an already-fixed bug, grep hits from a deprecated module. Each is a plausible-looking wrong answer sitting one attention head away.

3. Position effects compound over turns. Agents append. Every turn pushes the important early material (the task! the constraints!) toward the middle of the context — the attention dead zone — while burying the recent signal under boilerplate tool output. A long agent session is a machine for constructing worst-case attention layouts.

Coding agents are the pathological case of all three at once: they generate enormous, distractor-dense, structurally-repetitive context (JSON tool results, file dumps, test logs) at machine speed, across dozens of turns.

What the research says works instead

The successful systems in the literature share a shape — they spend compute deciding what the model sees instead of showing it everything:

Selection over inclusion. Hindsight's four memory networks (facts vs. experiences vs. summaries vs. beliefs) exist so retrieval pulls the right kind of memory for each question. The general lesson: retrieval into a small context beats residence in a big one.
Compression as a first-class operation. Active-context-compression agents treat "shrink my working set" as an action the agent itself takes, on par with tool calls. Summarize the resolved, drop the superseded, keep the live.
Structure beats soup. Tabular, labeled, deduplicated context consistently outperforms raw dumps of the same information — the model spends attention on content, not parsing.

What you can do about it today

You don't need a research memory system to benefit:

Treat context as a liability with interest, not an asset. Every tool result you leave in the window is re-read (and re-billed) every subsequent turn, while making each turn slightly dumber.
Compact aggressively and early. Whatever your agent's compaction/clear mechanism is (/compact, /clear, fresh sessions per task), use it before quality degrades — by the time you notice the agent going in circles, the context has been hurting you for many turns.
Scope sessions to tasks. One task, one session. The 40-turn omnibus session is the exact scenario the position-effect research warns about.
Isolate research from execution. Subagents (or separate sessions) that read a lot and report a little are context firewalls: the summary crosses over; the 30k tokens of grep output don't.
Compress tool outputs before they enter context. Test logs, JSON blobs, and directory listings compress 40–90% with zero information the model actually needs lost. Whether you do it with a proxy layer (this is the part where I mention that's what Lynkr does), a harness setting, or a wrapper script — do it somewhere.

The next time a model launch leads with context-window size, remember the 20B model with a good librarian beating GPT-4o with a hoard. Capacity isn't capability. Curation is.

Primary sources: Hindsight (arXiv:2512.12818) · benchmark data · Lost in the Middle (Liu et al.) · Active Context Compression (arXiv:2601.07190) · Memory in the Age of AI Agents — survey (arXiv:2512.13564) · Memory for Autonomous LLM Agents — survey (arXiv:2603.07670)

How We Built an Agentic-Task Detector for LLM Routing

Lynkr — Tue, 14 Jul 2026 04:18:47 +0000

Disclosure: I maintain Lynkr, the open-source LLM router whose agentic detector this post dissects. Every snippet below is real, shipping code — read the whole file here, it's 350 lines.

"Fix the auth bug in session.js."

Eight words. Every token-count heuristic on earth routes this to the small, cheap model — it's short. And every one of them is wrong, because those eight words are about to unleash a grep → read → edit → test loop with exact-string file edits, the precise workload where small models fumble tool calls and kill sessions.

The inverse request — three paragraphs asking for a detailed comparison of locking strategies — looks expensive and routes safely to a free local model, because it's pure text generation. Size and stakes are nearly uncorrelated in coding-agent traffic. So the router's real job is detecting agentic intent, and this post is a tour of how Lynkr's detector does it: the signals, the weights, the classification ladder — and the embarrassing false positive that almost made the whole thing useless.

Not "agentic: yes/no" — a ladder

The first design decision: agentic-ness isn't boolean. The detector classifies requests into four types, each with a minimum tier floor and a score boost fed into the complexity scorer:

const AGENT_TYPES = {
  SINGLE_SHOT: { minTier: 'SIMPLE',    scoreBoost: 0 },   // request-response, no tools
  TOOL_CHAIN:  { minTier: 'MEDIUM',    scoreBoost: 15 },  // read -> edit -> test
  ITERATIVE:   { minTier: 'COMPLEX',   scoreBoost: 25 },  // retry loops, debugging cycles
  AUTONOMOUS:  { minTier: 'REASONING', scoreBoost: 35 },  // "figure it out", full autonomy
};

The minTier is a floor, not a suggestion: even if every other dimension scores low, an ITERATIVE request cannot route below the COMPLEX tier. Mid-debugging-loop is the worst possible moment to hand the session to a 7B model.

The six signals

Each request accumulates a score from six independent signals. The interesting part is why each one exists:

1. Tool count (up to +25). Many tools attached usually means the client is prepared for multi-step work. Usually. This signal is also the source of the great false positive — hold that thought.

2. Agentic tools specifically (up to +25). Not all tools are equal evidence. Bash, Write, Edit, Task, git and test runners form an explicit set — these mutate state, and their presence signals mutation work. A request that can only Read/Grep/WebSearch sits in a separate read-only set and earns nothing here. Two requests with five tools each can be night and day.

3. Prior tool results (up to +30 — the heaviest signal). If the conversation already contains tool_result blocks, you're not predicting an agentic loop — you're inside one. More than five results means a deep loop with accumulated exact state (file contents, error strings); downgrading the model now throws away the context discipline keeping that loop convergent.

4. Language patterns (up to +25 each). Regexes over the last user message:

tool-chain: "then use", "after that", "step 2"
iterative: "keep trying", "until", "retry", "debug"
autonomous: "figure out", "make it work", "on your own", "whatever it takes"
multi-file: "across the codebase", "refactor entire", "everywhere"

Plus a combination rule: "implement" alone is +10-ish planning noise, but "implement" and "test/verify/make sure" in the same request is +15 — build-and-verify phrasing is a reliable tell of real work.

5. Conversation depth (up to +20). Fifteen-plus messages means established context and momentum.

6. Prompt length (+10). The weakest signal, deliberately — see the opening paragraph.

Score ≥ 25 → the request is agentic. The classification ladder then applies both thresholds and signal combinations — AUTONOMOUS needs score ≥ 60, or an explicit autonomous phrase with score ≥ 40. A phrase alone doesn't do it; a high score without autonomous language doesn't either, unless it's overwhelming.

The false positive that almost sank it

Early versions had a humiliating problem: every single Claude Code request scored agentic. Including "hello."

Why? Claude Code attaches its full tool loadout — Read, Write, Edit, Bash, Grep, Glob, Task, and friends — to every request, even a greeting. Signals 1 and 2 saw 11+ tools, four of them mutating, on everything. Every request cleared the threshold, every request routed to expensive tiers, and the router's entire value proposition — savings — evaporated. The detector was technically working and practically useless.

The fix ships as client profiles: known harnesses (Claude Code, Cursor, Codex CLI) have documented baseline loadouts, and the tool-count signals score only the tools beyond that baseline:

// Signals 1 & 2 score only tools BEYOND the harness's baseline loadout —
// Claude Code's 11 always-attached tools shouldn't count as "agentic
// intent" on their own.
toolsForScoring = clientProfiles.effectiveTools(payload, profile);

Crucially, signals 3–6 still use the full payload — prior tool results and conversational language are genuine evidence regardless of which harness sent them. Only the tool-presence signals get the subtraction, because only they are polluted by the harness's constant.

And for traffic from harnesses we've never seen? A guard: if every attached tool looks like a standard baseline and there are 10+ of them, the tool-count signals zero out rather than fire:

} else if (clientProfiles.allToolsAreBaseline(payload) && rawTools.length >= 10) {
  // Unknown harness that looks like Claude Code / Cursor / Codex —
  // zero out the tool-count signals to avoid the same trap.
  toolsForScoring = [];
  scoringNote = 'unknown_harness_guard';
}

Better to under-detect and lean on the five uncorrupted signals than to re-create the everything-is-agentic bug for unknown clients.

One subtle consequence, preserved as a comment in the source: with the baseline subtracted, tool counts rarely reach the AUTONOMOUS threshold on their own — so the autonomous phrase pattern becomes the primary path to the top classification. The signal design acknowledges its own post-fix physics.

What it still gets wrong

Honesty section. Known limitations, from the code itself:

It reads only the last user message. "Do what I described above" carries the intent of an earlier message the regexes never see. Conversation-depth and tool-result signals partially compensate — but pattern detection is myopic by design (scanning full history was too noisy).
Regexes can't tell mention from intent. "Why did the retry loop break?" trips the iterative pattern despite being a read-only question. In practice this fails safe — over-routing a question up-tier costs cents, under-routing an edit session down-tier costs the session — but it's still a false positive.
English only. The patterns are English regexes; agentic intent in other languages leans entirely on the structural signals.

Every detection returns its full evidence — score, signal list with weights, classification, and a scoringNote explaining any baseline subtraction — so when the router misjudges, the telemetry shows exactly which signal lied. Debuggability was a design requirement: a routing layer you can't interrogate is a routing layer you'll eventually rip out.

Takeaways if you're building anything similar

Subtract the constant before reading the signal. Whatever your equivalent of "the harness always attaches 11 tools" is — find it and remove it, or every request looks the same.
Separate "prepared for tools" from "already using tools." Attached tools are weak evidence; tool_result blocks in the conversation are near-proof.
Fail toward the expensive model. Asymmetric costs mean your threshold should be calibrated so mistakes over-spend pennies rather than break sessions.
Make the detector explain itself. A score without a signal list is a black box you'll never be able to tune.

The whole detector is 350 lines of dependency-free JavaScript: src/routing/agentic-detector.js. Steal it, or tell me which of your prompts it would misjudge — the failure cases are the roadmap.

Choosing a Local Tier for Your Coding Agent (July 2026 Edition)

Lynkr — Wed, 08 Jul 2026 00:14:16 +0000

Disclosure: I maintain Lynkr, the open-source router used in the config examples. The benchmark figures below are third-party or vendor-reported (flagged where vendor-only) — I haven't independently benchmarked these models yet; the point of this post is to help you match models to request classes and test on your own workload.

June 2026 was the busiest month for open-weight coding models in recent memory: GLM-5.2, MiniMax M3, Kimi K2.7 Code, Gemma 4, and NVIDIA's Nemotron 3 Ultra all landed within weeks. If you route your coding agent's simple requests to a local model — the "cloud architect, local coder" pattern — your options just changed meaningfully.

Here's how I'd map the current field onto routing tiers, by hardware budget and by what each model can safely own.

First, the trap: "best open model" ≠ "your local tier"

The headline model of the month, GLM-5.2, scores 62.1% on SWE-bench Pro — above GPT-5.5. It is also a 744B-parameter MoE whose 2-bit quant alone wants ~245 GB of memory. That's an open-weight model, not a local model; for self-hosters it's a $40k-rig proposition (one published build runs it on four RTX PRO 6000s). The same goes for DeepSeek-V4 Pro and MiniMax M3: superb models you'll realistically consume via API, where they belong in your COMPLEX/REASONING tiers, not your local one.

Your local tier is decided by a harsher question: what fits in your VRAM and still makes reliable tool calls?

The local field, by hardware budget

~16 GB RAM (ordinary laptop): Gemma 4 12B. Released June 3 as a dense 12B that genuinely fits consumer RAM (SitePoint's guide). Apache-2.0-class licensing with no usage clauses. This is a SIMPLE-tier model: explanations, one-liners, commit messages, "what does this error mean." I would not hand it an Edit tool.

24 GB GPU (RTX 3090/4090 class): Qwen3.6-27B — still the default answer. The community's consensus "local Claude" since April: within a few points of frontier models on SWE-bench Verified (77.2 reported vs Claude's 80.9 — analysis), Apache-2.0, runs quantized on a single 24 GB card or a ~$2k build. Its known weakness is exactly the one that matters for agents: tool-call reliability drifts in long contexts — fine as a supervised MEDIUM tier, risky as an unsupervised COMPLEX one.

Agentic multi-file edits on similar hardware: Devstral Small 2. Purpose-built for multi-file, tool-driven coding rather than chat (KDnuggets roundup). If your traffic is edit-heavy, it can arguably take MEDIUM-tier mutation requests that I'd keep away from general chat models.

Autocomplete-shaped work: Codestral 22B is fast and good at it — but mind the non-commercial license before using it for work.

One rule that keeps proving out (Pinggy's guide): within the same memory budget, a bigger model at Q4 usually beats a smaller one at Q8. Quantization choice matters nearly as much as model family.

Mapping to tiers

Putting that together into a routing config (Lynkr shown; the mapping logic applies to any router):

# 24 GB GPU + API keys for the hard stuff
TIER_SIMPLE=ollama:gemma4:12b            # trivia, explanations, greetings
TIER_MEDIUM=ollama:qwen3.6:27b           # code questions, supervised edits
TIER_COMPLEX=deepseek:deepseek-v4-flash  # tool-heavy mutations, via API
TIER_REASONING=deepseek:deepseek-v4      # architecture, multi-step planning

Why V4 Flash for COMPLEX: it's the first open-weight model teams report dropping into real agentic pipelines as a frontier substitute on price (OpenRouter's June analysis) — the cheapest "won't break the session" option right now. Kimi K2.7 Code (vendor-reported 58.6% SWE-bench Pro at ~30% fewer reasoning tokens) and GLM-5.2 are strong API-tier alternatives; all the June day-one numbers are vendor-reported, so treat them as directional until LiveBench catches up.

The key discipline: the boundary between MEDIUM and COMPLEX should not be "how big is the request" but "will tools mutate state." Local models in this class handle read-and-explain reliably; exact-match edits and bash execution are where they still break sessions — I wrote up those failure modes here.

What changed vs three months ago

The floor rose. A 16 GB laptop now runs a genuinely useful SIMPLE tier (Gemma 4). Six months ago that tier meant 3B models that couldn't be trusted with a paragraph.
The open-weight ceiling now beats proprietary on some coding benchmarks (GLM-5.2 > GPT-5.5 on SWE-bench Pro) — but at server scale, which strengthens the hybrid pattern: open models via cheap APIs up top, small open models on your metal below.
MoE won. Every serious June release is Mixture-of-Experts. For self-hosters this cuts both ways: better quality-per-active-param, but total memory footprints that keep the top tier out of reach.
Licensing is consolidating around MIT (DeepSeek) and Apache-2.0 (Qwen, Gemma) for the models you'd actually build on.

Test on your traffic, not on benchmarks

Every number above is someone else's workload. The honest way to pick your local tier: route a week of your real traffic through whatever candidates fit your hardware, and count session survival — how often the local model's tool calls held up — not just benchmark deltas. That's a one-line config change per candidate, and your own telemetry will contradict at least one thing this post told you.

Lynkr is Apache-2.0, self-hosted, and treats every model above as a first-class routing tier: github.com/Fast-Editor/Lynkr.

Routing Down Is Easy. Knowing When Not To Is Hard: Why Cheap Models Break Your Coding Agent

Lynkr — Wed, 08 Jul 2026 00:08:36 +0000

Disclosure: I maintain Lynkr, an open-source router whose design decisions this post explains. The failure modes described are patterns widely reported across router issue trackers and local-LLM forums — the examples are representative reconstructions, not captured transcripts. The problem is real either way; ask anyone who's routed a coding agent to a 7B model.

Everyone who gets their first LLM router working does the same thing within the hour: point the expensive coding agent at a free local model and watch the bill drop to zero.

Then the agent tries to edit a file.

The graveyard of downgraded sessions

If you browse the issue tracker of any Claude Code router — or r/LocalLLaMA on any given week — you'll find the same story in a hundred variations. The routing works perfectly. The session dies anyway. The killers, in rough order of frequency:

1. Malformed tool arguments. The agent decides to call Edit, and the model produces arguments that are almost JSON:

{"file_path": "src/auth.js", "old_string": "if (token) {", "new_string": "if (token && !expired) {"

One missing brace. The harness rejects the call, the model retries, produces a different malformation, and you're three turns deep into fixing nothing. Frontier models emit structurally valid tool calls with boring reliability; sub-10B models do it most of the time — and "most of the time," at 30 tool calls per session, means every session breaks.

2. Stale string matching. Edit-style tools require the old_string to match the file exactly. Small models paraphrase from memory instead of quoting — they'll "remember" the line as if (token) { when the file says if (accessToken) {. The edit fails, the model re-reads the file, burns 2,000 tokens, tries again with a different paraphrase. This is the single most reported failure, because it looks like the router's fault and is actually a capability cliff.

3. Hallucinated context. Ask a small model to run tests and it may confidently call Bash with npm test -- --grep "auth" in a repo that uses pytest. It's not being stupid — it's pattern-completing from training data instead of the conversation, because instruction-following degrades faster than fluency as models shrink.

4. The infinite loop. The subtlest one: the model calls Read on the same file five times in a row, or greps, reads, greps the same term again. Weak models lose the thread of what they already know in long agentic contexts. Nothing errors — the session just stops converging while tokens burn.

Here's the uncomfortable part: none of these are the router's bug, and all of them are the router's fault. The router made a judgment — "this request is cheap-model-safe" — and the judgment was wrong.

Why the obvious heuristics misjudge

Most routing setups decide with static rules: token thresholds, keyword lists, scenario slots. These fail in a specific, predictable way: they measure the request's size, not its stakes.

"Fix the auth bug in session.js" is eight words. Every token-based rule on earth routes it to the small model. But those eight words unleash a read-grep-edit-test loop — the exact workload where small models faceplant. Meanwhile, "explain the difference between optimistic and pessimistic locking, with examples" looks expensive (long answer, technical vocabulary) and is actually perfectly cheap-model-safe: it's pure text generation, no tool calls, no exact string matching, nothing to break.

Size and stakes are almost uncorrelated in agentic traffic. That's the whole problem.

What "stakes-aware" routing looks like

When I built Lynkr's router, most of the design ended up being about when not to save money. The parts that matter:

Weight the tools, not just their count. A request where Grep and Read are in play is research — paraphrase-tolerant, failure-tolerant, ideal for a local model. A request where Bash, Write, or Edit will fire is a mutation with exact-match requirements. Lynkr assigns each tool a risk weight (Bash 0.9, Write 0.8, Edit 0.7 … Grep 0.2) and scores the request's effective toolset. Two requests with five tools each can land tiers apart.

Treat mid-session as a signal. If the conversation already contains three tool results, you're inside an agentic flow with accumulated exact-state (file contents, error strings). Downgrading the model mid-flow throws away the one thing that was keeping the loop convergent. Prior tool usage and conversation depth push requests up-tier even when the latest message is short.

Subtract the harness baseline. Claude Code ships ~14 tool schemas with every request — including "hello." Count them naively and everything looks agentic, so nothing ever routes local and you save nothing. Score only the tools the request could plausibly use, and the safe majority routes down while the risky minority stays up.

Some patterns override everything. Greetings and "what does X do" questions force-route local, always. Security-sensitive analysis force-routes to the strong tier, always — a JWT architecture question is short, toolless, and precisely the wrong place to save four cents.

The result on my own traffic: 70–90% of requests route to free local models — but they're the right 70–90%, which is the entire difference between "my bill dropped" and "my agent broke."

Takeaways, router-agnostic

Route research down, mutations up. If your router can't tell a Grep request from an Edit request, it isn't routing — it's gambling on which sessions break.
Never downgrade mid-loop. Model consistency across an agentic sequence is worth more than the marginal savings of one cheap turn.
Measure session survival, not just cost. A routing setup that saves 60% and breaks one session in five is more expensive than the bill it replaced — you're paying in re-runs and rage.
The ceiling is rising. Local models' tool-calling improves every quarter; the set of safely-downgradable requests grows with it. A router with per-tool judgment gets to expand that set gradually. A token threshold has to guess again from scratch.

The router's job was never "pick the cheapest model." It's "pick the cheapest model that won't break the session" — and those five extra words are where all the engineering lives.

The scorer described here is ~1,000 lines of readable Apache-2.0 JavaScript: src/routing/complexity-analyzer.js. Steal the design, or file an issue telling me where it misjudges — the failure cases are the interesting part.

The 5% Router Tax: What Hosted LLM Gateways Charge For (and How to Self-Host It)

Lynkr — Sun, 05 Jul 2026 08:48:56 +0000

Disclosure: I maintain Lynkr, the self-hosted gateway discussed in the second half. OpenRouter and Requesty are good products — this post is about understanding what you're paying for so you can decide whether you need to.

Hosted LLM routers had a huge 2026 — OpenRouter alone pushes 25 trillion tokens a week. The pitch is real: one API key, 400+ models, automatic failover. The price is a ~5% fee on every token you route (5.5% on OpenRouter credits, 5% on Requesty), plus a subtler cost: every prompt, every file your coding agent reads, every secret that leaks into a context window transits their infrastructure.

For a hobby project, 5% of a small bill is nothing and the convenience wins. For an agentic coding workload — where teams routinely spend $500–$2,000 per engineer per month — 5% is real money, and the data-transit question stops being academic. So it's worth asking precisely: what does the hosted router actually do for that fee, and which parts can you self-host?

What the fee buys

Unified API across providers — one format in, translated per-provider out.
Failover — a provider 500s, your request retries elsewhere.
Model marketplace — new models available the day they launch.
Consolidated billing — one invoice instead of six provider accounts.
(Sometimes) smart routing — OpenRouter's auto router picks a model per-request.

Items 1, 2, and 5 are software. Items 3 and 4 are genuinely hard to self-host — if you want day-one access to every new model with zero account setup, the marketplace earns its fee. But most coding workloads use a handful of models, not four hundred.

The parts a hosted router structurally can't give you

Local models as a tier. No hosted router will route your easy requests to the Ollama instance on your own machine — free, private, zero latency to first byte on cached weights. For coding traffic, where (in my instrumented sessions) 70–90% of requests are simple enough for a good local model, this is the single biggest cost lever, and it's only available to something running on your side of the wire.
Your data staying home. Self-hosted means prompts, code, and keys never transit a third party. For anyone with a compliance requirement — or code they'd rather not ship to a router's logs — this isn't a preference, it's a prerequisite.
Token optimization before the bill. A hosted router bills you for the tokens you send it — it has no incentive to shrink them. A self-hosted proxy can strip unusable tool schemas (measured: −53% on tool-heavy requests) and compress JSON tool results (measured: 3,458 → 427 tokens on a grep result) before any provider bills you. That's not a routing saving; it stacks on top of routing.
No availability dependency. Hosted routers go down (OpenRouter's outages have their own HN threads) and offer no SLA at consumer tiers. A local proxy fails independently of anyone's status page.

What self-hosting costs you

Honesty cuts both ways:

You run a process. npm install -g lynkr && lynkr init && lynkr start — but it's yours now: updates, logs, the works.
You manage provider accounts. Two or three API keys instead of one. The consolidated invoice is genuinely gone.
Model lag. A new provider means waiting for support (or a PR) instead of it appearing in a dropdown.
Nobody to email. Self-hosted support is a GitHub issue tracker.

If those trade-offs read as "fine," the math is straightforward: the 5% fee disappears, the local-tier routing removes the easy majority of requests from your bill entirely, and compression shrinks what's left.

The hybrid that actually makes sense

This isn't either/or. A pattern I see working:

Coding tool → self-hosted proxy (Lynkr)
                ├─ SIMPLE/MEDIUM  → local Ollama/llama.cpp   (free)
                ├─ COMPLEX        → direct provider API keys  (no fee)
                └─ exotic models  → OpenRouter               (5% on the long tail only)

Keep a hosted router as one backend for the long tail of models you rarely need, route the bulk directly or locally, and let the proxy's classifier decide per-request. You get the marketplace when you want it without paying the tax on your entire volume.

Lynkr is Apache-2.0, self-hosted, supports 13 providers including Ollama, llama.cpp, LM Studio, Bedrock, Azure, Databricks — and OpenRouter itself as a tier: github.com/Fast-Editor/Lynkr. Benchmarks with methodology are in the repo; run them on your own workload before believing anyone's percentages, including mine.

How a 13-Dimension Complexity Scorer Decides Which Model Gets Your Request

Lynkr — Sun, 05 Jul 2026 08:48:40 +0000

Disclosure: I'm the author of Lynkr, the open-source proxy whose internals this post walks through. All code shown is real and Apache-2.0 — read it here.

The most expensive default in AI coding tools is that model choice is a setting, not a decision. You pick a model once; every request — "what does git stash do?" and "refactor this auth module" alike — goes there. Routing each request to the cheapest model that can actually handle it is worth 50%+ of most bills, but it only works if the "can actually handle it" judgment is reliable. Get it wrong downward and a small model fumbles your file edits; get it wrong upward and you've saved nothing.

Here's how Lynkr makes that judgment, in enough detail that you could reimplement it.

Why not just count tokens?

The obvious heuristics fail in both directions:

"Long request → big model" fails on a 60k-token context that's mostly grep output around a trivial question.
"Short request → small model" fails catastrophically on "fix the auth bug in session.js" — eight words that unleash a tool-heavy agentic session a 7B model will faceplant on.

Token count is one signal. The failure cases all come from treating it as the only one.

The architecture: weighted dimensions, then overrides

Every request gets a 0–100 score from 13 dimensions in four groups. The weights are configurable; these are the defaults:

const DIMENSION_WEIGHTS = {
  // Content Analysis (35%)
  tokenCount: 0.08,
  promptComplexity: 0.10,   // avg sentence length/structure
  technicalDepth: 0.10,     // technical keyword density
  domainSpecificity: 0.07,  // security/ML/distributed/db/frontend/devops
  // Tool Analysis (25%)
  toolCount: 0.08,
  toolComplexity: 0.10,     // which tools, not how many
  toolChainPotential: 0.07, // "first...then", "step 2", sequencing language
  // Reasoning Requirements (25%)
  multiStepReasoning: 0.10,
  codeGeneration: 0.08,
  analysisDepth: 0.07,      // trade-off/comparison markers
  // Context Factors (15%)
  conversationDepth: 0.05,
  priorToolUsage: 0.05,     // tool_results already in the conversation
  ambiguity: 0.05,
};

A few design decisions worth stealing:

Not all tools are equal. A request that can Grep is not like a request that can Bash. Each tool carries a hand-tuned risk weight — Bash 0.9, Write 0.8, Edit 0.7, down to Grep at 0.2. A request whose available toolset averages 0.8 is an agentic mutation session; one averaging 0.25 is read-only research. Same tool count, completely different stakes.

Subtract the harness baseline. Claude Code ships ~14 tool schemas with every request, including "hello". If you count them naively, everything looks agentic and nothing routes local. The scorer subtracts the client's constant baseline and scores only the effective tools the request could plausibly use — one of those fixes that sounds trivial and changed everything.

Conversation history is a signal. Three tool_result blocks already in the conversation means you're mid-agentic-flow — this is not the moment to downgrade models and break the session's momentum. priorToolUsage and conversationDepth push mid-session requests up-tier.

Ambiguity cuts the other way. "file X, line 42, this error" is specific — a small model can act on it. "Something feels slow sometimes" needs interpretation before action. Specificity markers (paths, line numbers, error strings) lower the score.

Overrides: the classifier knows what it can't know

Two pattern lists short-circuit the whole scoring pipeline:

Force-local: greetings, acknowledgments, "what does X do" one-liners. Score 0, never leave the machine, no cloud tokens ever.
Force-cloud: security-critical analysis, architecture decisions, anything matching high-risk patterns. Straight to the top tier regardless of how cheap it looks. A JWT-vs-cookies security question is short and toolless — every naive heuristic routes it local. This is the wrong request to save $0.004 on.

On top of the regex dimensions, an AST pass (tree-sitter) scores actual code structure in the payload — cyclomatic signals beat keyword counting when real code is present.

From score to model

score < threshold        → SIMPLE   (e.g. ollama:qwen2.5:7b, free)
threshold..~65           → MEDIUM   (e.g. ollama:qwen2.5-coder, free)
above                    → COMPLEX  (your API key: Sonnet, GPT-4o...)
reasoning markers heavy  → REASONING (o3, DeepSeek R1...)

The threshold moves with a single mode switch — aggressive (60) routes more local, conservative (25) routes more to the cloud, default is 40. Multi-turn conversations score with a recency-weighted sliding window, so a short follow-up ("now add tests") inherits the complexity of the work it refers to instead of scoring as a trivial one-liner.

Crucially, the classifier only chooses among models you listed. It's not an autonomous agent picking providers — you define the tiers, it picks the tier.

Does it work?

In my instrumented sessions, 70–90% of requests score SIMPLE or MEDIUM and run free on local models, while tool-heavy and security-flagged requests reliably escalate. The failure mode everyone fears — cheap model breaking an agentic session — is exactly what the tool weights, baseline subtraction, and prior-tool-usage dimensions exist to prevent.

Is 13 hand-weighted dimensions the optimal design? Almost certainly not — a learned router trained on outcome data would beat it eventually. But it's transparent (every routing decision logs its per-dimension breakdown), it's tunable, it runs in-process in microseconds, and it never sends your prompts to a third-party classifier API.

The whole thing is readable in one sitting: src/routing/complexity-analyzer.js. Steal the design or use the proxy — either outcome means fewer frontier-model tokens spent on git stash questions.

The 21,000-Token Typo: Where Agentic Coding Budgets Actually Die

Lynkr — Sun, 05 Jul 2026 08:48:24 +0000

Disclosure: I maintain Lynkr, an open-source proxy mentioned at the end. The first 80% of this post is tool-agnostic and the takeaways apply whether or not you ever use it.

There's a documented case of a coding agent burning 21,000+ input tokens to fix a one-line README typo. Not a bug. Not a runaway loop. That's the normal cost structure of agentic coding, and once you see why, you can't unsee it on your own bill.

Stanford's Digital Economy Lab measured it: agentic tasks consume on the order of 1000x the tokens of ordinary code chat, and the same task with the same agent can vary 30x in cost depending on how the session unfolds. Teams running heavy automation report $500–$2,000 per engineer per month. So where does it go?

The anatomy of one "small" agentic task

Say you ask your agent to fix a typo. Here's what actually crosses the wire:

Turn 1: Your one-line prompt... plus the system prompt, plus ~14 tool schemas (Write, Edit, Bash, Grep, Git — a couple thousand tokens before anyone thinks).

Turn 2: The agent greps for the file. The result comes back as JSON — paths, line numbers, match context, metadata. A modest grep is easily 1,000–3,000 tokens. It's now in the context and gets re-sent on every subsequent turn.

Turn 3: The agent reads the file. Add the full file contents to the context. Re-sent every turn from now on.

Turn 4: The edit itself — the cheapest part of the entire session.

Turn 5: Verification: re-read, maybe run a linter, another JSON blob of output.

Five turns, and your one-line fix carried: 5x the tool schemas, 4x the grep results, 3x the file contents. Input tokens dominate output roughly 25:1 in typical sessions. You're not paying for intelligence — you're paying for cargo.

The three structural leaks

Leak 1: Tool schemas on every request. The agent might use two tools this session. You ship fourteen schemas every turn anyway, because the client doesn't know which ones matter. Measured on a realistic Claude Code request: schemas the request couldn't use accounted for 53% of billed input tokens.

Leak 2: Raw JSON in the context window. JSON is the least token-efficient format your context will ever hold — keys repeated per element, quotes, braces, whitespace. A 60-match grep result: ~3,400 tokens raw, 427 after conversion to a tabular token-oriented format with redundant fields stripped. Nothing lost that the model needed.

Leak 3: Frontier models on non-frontier requests. "What does git stash do?" does not need the same model as "refactor this auth module." But your client sends both to the same place, because model choice is a config setting, not a per-request decision. In my instrumented sessions, 70–90% of requests scored as simple or medium complexity — they'd be fine (and free) on a local model.

What to do about it — tool-agnostic

Instrument before optimizing. Log tokens per request by category (schemas / tool results / conversation). You cannot fix a leak you haven't sized. Most people find their intuition about their own spend is wrong.
Never let raw JSON accumulate in a context window. Compact it, tabularize it, or summarize it. Tabular JSON is nearly free compression — same information, a fraction of the tokens.
Keep sessions short and contexts clean. Every tool result you leave in the context is a recurring charge, billed again on every turn until the session ends.
Match model to request, not to workflow. Route the easy 80% somewhere cheap or local; reserve the frontier model for the requests that actually exercise it. Bring your own API keys and the routing is entirely within your control.

The plumbing version

Everything above can be done manually. I got tired of doing it manually, so I built it into a proxy: Lynkr sits between your coding tool (Claude Code, Cursor, Codex CLI) and your providers, strips unusable tool schemas, compresses JSON tool results, caches semantically, and scores each request on 13 dimensions to route it to a tier you define — local Ollama for the easy stuff, your API keys for the hard stuff. Self-hosted, Apache-2.0, no markup, zero client changes.

But the numbers above aren't about my tool. They're about a cost structure every agentic workflow shares. The 21,000-token typo isn't an outlier — it's the default. Measure yours.

Lynkr vs claude-code-router: Static Rules vs a Complexity Classifier

Lynkr — Sun, 05 Jul 2026 08:48:08 +0000

Disclosure: I'm the author of Lynkr. claude-code-router is a genuinely good project that pioneered this category — this is a technical comparison of two different approaches, not a takedown. Where CCR is the better choice, I say so.

If you want to keep the Claude Code harness but route requests to other models, you have two main self-hosted options today: claude-code-router (CCR, ~35k stars, the incumbent) and Lynkr. They solve the same problem with fundamentally different architectures, and which one fits you depends on how much you want to configure versus delegate.

The core difference in one paragraph

CCR routes by scenario rules you write. It has slots — default, background, think, longContext (triggered above a token threshold), webSearch, image — and you assign a model to each. It's predictable, transparent, and entirely under your control.

Lynkr routes by scoring the request itself. Every request gets a 0–100 complexity score computed from 13 weighted dimensions — token count, technical keyword density, tool complexity, multi-step reasoning markers, conversation depth, ambiguity, and so on — and lands in a tier (SIMPLE/MEDIUM/COMPLEX/REASONING) you've mapped to models. You configure the tiers once; the classifier decides per-request.

Where CCR wins

Maturity and ecosystem. 35k stars, ~730k monthly npm downloads, 20+ provider transformers, custom JS plugins, a web UI, and in-session /model switching. If you hit a weird provider quirk, someone has already hit it.
Predictability. A rule is a rule. If you want "long contexts always go to Gemini", CCR expresses that in one line and never surprises you.
Claude Code specialization. CCR does one client deeply. Lynkr supports Claude Code, Cursor, Codex CLI, Cline, and Continue — breadth costs some depth.

Where the rule-based approach breaks down

Browse CCR's issue tracker (~1,000 open issues) and one complaint dominates: tool-calling breakage on downgraded models — failed file edits, broken git operations, agents going in circles. The root cause usually isn't CCR's code. It's that static rules can't see what the request needs:

A short prompt ("fix the auth bug in session.js") looks cheap by token count — but it's an agentic, tool-heavy task that a small local model will fumble.
A long context triggers the longContext rule — but if it's 60k tokens of grep output around a trivial question, an expensive long-context model is wasted money.

Token counts and scenario names are proxies. The thing you actually care about — can a cheap model handle this without breaking the session? — requires looking at the request's structure.

What Lynkr does differently

Three things, all absent from CCR by design (it aims to be a lean router):

1. The complexity classifier. Requests with agentic signals (write/edit/bash tool availability, prior tool results in the conversation, sequential-step language) score into higher tiers even when they're short. Trivia stays local even when the context is long. Force-patterns short-circuit both ways: greetings never hit the cloud; security-critical analysis never gets downgraded. The design goal is exactly the failure mode above — route down only when the answer will still work.

2. Token optimization on the wire. Lynkr strips tool schemas the request can't use (measured: 53% fewer tokens on a realistic 14-tool Claude Code request) and compresses large JSON tool results before they hit the model (measured: 3,458 → 427 tokens on a 60-match grep result). CCR forwards requests as-is.

3. Semantic caching. Paraphrased repeat questions are served from an embedding cache in ~171ms with zero tokens billed.

Honest comparison table

	claude-code-router	Lynkr
Routing logic	Scenario rules + token threshold	13-dimension complexity score → tiers
Configuration	Per-scenario, per-provider (flexible, verbose)	Pick 4 tier models via `lynkr init` wizard
Tool-schema stripping	No	Yes (−53% measured)
JSON tool-result compression	No	Yes (TOON + field stripping)
Semantic cache	No	Yes
Clients	Claude Code (deep)	Claude Code, Cursor, Codex CLI, Cline, Continue
Provider transformers/plugins	20+, custom JS	13 providers built-in
Ecosystem maturity	~35k stars, huge community	Young (~500 stars), one maintainer
In-session model switching	Yes (`/model`)	No (automatic per-request)
License	MIT	Apache-2.0

Which should you use?

You want explicit control and a battle-tested ecosystem → CCR. It's the safe default and its community is unmatched.
You're tired of tuning rules, or your cheap-model sessions keep breaking → try Lynkr. The classifier exists precisely because static rules degrade on agentic workloads.
Your bill is dominated by tool output and repeated context → Lynkr, regardless of routing preference; the compression and caching layers work even if you route everything to one model.

Both are self-hosted, free, and take five minutes to try. Run your own workload through each and compare the token logs — that's the only benchmark that matters. Mine are reproducible here: github.com/Fast-Editor/Lynkr.

Where Claude Code's Tokens Actually Go (and How I Cut My Bill in Half)

Lynkr — Sun, 05 Jul 2026 08:21:25 +0000

Disclosure up front: I'm the author of Lynkr, the open-source (Apache-2.0) proxy discussed below. All numbers come from a benchmark you can reproduce yourself — methodology linked at the end.

I spent a few weeks instrumenting my own Claude Code sessions to answer one question: where do the tokens actually go?

The answer surprised me. It wasn't my prompts. It wasn't even the model's responses. The bulk of my spend was overhead I never looked at:

Tool schemas sent on every single request. Claude Code ships ~14 tool definitions (Write, Edit, Bash, Git, Grep...) with every message — even when you're asking a read-only question that can only ever use two of them.
Raw JSON tool results. A single grep returning 60 matches came back as a ~3,400-token JSON array. File reads, test output, ls results — all shipped verbatim into the context, on every turn, forever.
Paying full price for trivial requests. "What does git stash do?" was hitting the same expensive model as "refactor this auth module."

A famous example of this failure mode: an agent burned 21,000+ input tokens fixing a one-line README typo. Stanford's Digital Economy Lab found agentic coding tasks consume ~1000x the tokens of ordinary code chat. This is not a niche problem — it's the cost structure of every agentic coding tool.

The fix: put intelligence between the agent and the model

None of this requires changing your tools. Claude Code, Cursor, and Codex CLI all let you override the API base URL. So I built a proxy that sits in the middle and does four things:

1. Strip tools the request can't use

Classify each request; a read-only question doesn't need Write/Edit/Bash schemas, so don't send them.

Measured result: 959 tokens vs 2,085 for the identical request — 53% fewer tokens, same model, same answer.

2. Compress JSON tool results

Large JSON payloads (grep output, file listings, test results) get converted to TOON, a token-oriented format, plus redundant-field stripping before they're forwarded to the model. Plain text passes through untouched.

Measured result: that 60-match grep result went from 3,458 tokens to 427 — 87.6% smaller. (Honest caveat: TOON alone typically saves ~40%; the 87.6% is TOON stacked with field-stripping on a tabular payload. Deeply nested data compresses less. Run the benchmark on your own workload.)

3. Semantic caching

If you ask "explain TCP vs UDP" and later "what's the difference between TCP and UDP?", that's the same question. Embedding similarity ≥ 0.85 → serve the cached response. 171ms, zero tokens billed.

4. Route by complexity, not by config

This is the part I haven't seen anywhere else done automatically. Each request is scored on 15 dimensions — token count, code complexity, reasoning markers, agentic signals, risk patterns — and routed to a tier you define:

TIER_SIMPLE=ollama:qwen2.5:7b          # free, local
TIER_MEDIUM=ollama:qwen2.5-coder:latest # free, local
TIER_COMPLEX=your-cloud-provider        # your API key
TIER_REASONING=your-cloud-provider

In my sessions, 70–90% of requests scored SIMPLE or MEDIUM and never left my machine. Only genuinely hard problems — architecture, tricky refactors, security analysis — hit a paid backend.

The routing is deliberately conservative in one direction: tool-heavy agentic requests don't get downgraded, because the #1 complaint with every static routing setup is cheap models fumbling tool calls (failed edits, broken git operations). Routing down is only a saving if the answer still works.

What this looks like in practice

npm install -g lynkr
lynkr init          # interactive wizard: pick your tiers and providers
lynkr start

Then point your tool at it — for Cursor it's Settings → Models → Override Base URL → http://localhost:8081/v1; for Codex CLI it's two lines in ~/.codex/config.toml. No code changes, no plugins.

Everything is self-hosted: your prompts and code never transit a third-party SaaS, there's no markup fee, and the whole thing is Apache-2.0 on GitHub.

The numbers, side by side

Benchmarked against LiteLLM v1.87.1 on identical workloads, same backend providers:

Scenario	Through Lynkr	Baseline	Delta
Tool-heavy request (14 schemas)	959 tokens	2,085 tokens	−53%
60-result grep (JSON tool output)	427 tokens	3,458 tokens	−87.6%
Repeated paraphrased query	171ms, 0 tokens	3,282ms, full price	11x faster
Complexity routing	simple→local, hard→cloud	cheapest-model-always	correctness

Projected over 100k requests/month on a tool-heavy workload: roughly half the bill, same backend, same models for the requests that matter.

Takeaways even if you never use my tool

Audit your tool schemas. They're the silent tax on every agentic request.
Never ship raw JSON into a context window. Tabular JSON is the single most compressible thing in your token stream.
Most of your requests are simple. You don't need a frontier model to explain git stash. Bring your own API keys, keep the easy 80% local, and spend where it counts.

If you try Lynkr and the numbers don't hold on your workload, open an issue with your benchmark output — I want the counterexamples: github.com/Fast-Editor/Lynkr.

I Built an LLM Gateway That Extends Claude Pro/Max Users with Azure AI Foundry, Amazon Bedrock, Local Models

Lynkr — Tue, 30 Jun 2026 22:28:50 +0000

AI coding tools have gotten very good.

But the infrastructure behind them is still weirdly inefficient.

Most tools assume one provider, one lane, one billing path.

That means the same expensive model or subscription ends up handling everything:

reading files
summarizing logs
quick repo questions
multi-file refactors
architecture planning
long debugging sessions

That is the wrong abstraction.

A coding workflow is not one type of problem. So it should not be forced through one type of model path.

That idea is what pushed me to build Lynkr.

Lynkr is an open-source LLM gateway for AI coding tools that lets me combine:

Claude Pro/Max subscription access
Azure AI Foundry-hosted models
Amazon Bedrock-hosted models
and local/free models like Ollama

behind one routing layer.

The problem with single-lane AI coding

If you use a premium coding assistant every day, you have probably seen this already.

A lot of the workload is not actually premium reasoning work.

For example:

"open this file"
"search for auth middleware"
"summarize this module"
"show me where this class is used"
"read these test failures"

These are useful requests, but they are not the same as:

"refactor this subsystem"
"design a safer auth flow"
"debug this multi-step failure"
"trace this agent loop bug"
"rewrite this implementation across five files"

Yet most tools send both classes of work through the same expensive path.

That creates three problems:

1) You waste premium capacity

If a subscription-backed or premium model handles every tiny prompt, you burn good capacity on low-value tasks.

2) You stay locked into one provider

Even if you already have access to Azure, AWS, or local models, your coding workflow is often tied to one vendor path.

3) You lose resilience

If one provider is rate-limited, degraded, or just not the best fit for a task, you have no routing layer to adjust.

The idea behind Lynkr

Lynkr sits between AI coding tools and model providers.

It works as an LLM gateway, which means the coding tool talks to Lynkr, and Lynkr decides what to do next.

That lets the gateway:

route by complexity
compress bulky tool outputs
cache repeated requests
switch providers without changing the client workflow
use different backends for different classes of tasks

The part I am most excited about is hybrid routing across:

Claude Pro/Max
Azure AI Foundry
Amazon Bedrock

What "extending Claude Pro/Max" means

The simplest version looks like this:

simple tasks → local/free model
hard coding tasks → Claude Pro/Max subscription
enterprise workloads → Azure AI Foundry
fallback or alternate routing → Amazon Bedrock

So instead of replacing Claude, Azure, or Bedrock, the gateway combines them.

This is the key idea: extend your Claude Pro/Max usage instead of burning it on everything.

Example workflow

Imagine a coding session that looks like this:

"Read the auth middleware and summarize it."

Route to a cheap local model.
"Search all routes that call this helper."

Still cheap/local.
"Refactor this auth flow to support tenant isolation."

Route to Claude Pro/Max.
"Generate an enterprise-safe variant for our internal stack."

Route to Azure AI Foundry.
"Azure is unavailable or rate-limited."

Fallback to Bedrock.

That is a much more natural way to run coding agents than pretending every prompt deserves the same model path.

Why Claude Pro/Max + Azure + Bedrock is interesting

This combination matters because each lane solves a different problem.

Claude Pro/Max

Great for high-quality coding and reasoning tasks where you already have subscription value.

Azure AI Foundry

Useful when a team wants enterprise-hosted models, internal approvals, or Azure-aligned infrastructure.

Amazon Bedrock

Useful for AWS-native orgs, alternate model access, or fallback when you want another enterprise provider path.

Local models

Useful for cheap, frequent, low-stakes tasks that should not consume premium capacity at all.

Putting these together in one gateway gives you a better operational model than any one of them alone.

Why this matters for coding agents specifically

I think coding is one of the best use cases for an LLM gateway because coding workflows are:

tool-heavy
repetitive
multi-step
full of structured outputs
sensitive to token waste
often spread across many turns

That means a gateway can add value in several ways.

1) Complexity-based routing

Not every prompt deserves the same model.

2) Cost control

Cheap requests stay cheap.

3) Better use of subscriptions

Premium capacity gets reserved for tasks that actually need it.

4) Enterprise compatibility

Teams can use Azure AI Foundry or Bedrock where policy or procurement matters.

5) Resilience

If one provider path fails, the workflow can continue.

Where MCP and agent workflows fit in

Another reason this matters is MCP and agentic tooling.

As coding tools become more agentic, they use more:

tool schemas
file reads
command outputs
structured results
long multi-turn sessions

That creates a lot of overhead and a lot of repeated context.

A gateway is the right place to optimize that.

That is also why I think the future is not just better models.

It is better routing, caching, tool handling, and workload separation around those models.

What I wanted Lynkr to do

I did not want just another OpenAI-compatible endpoint.

I wanted a gateway that could actually help with real coding economics and workflow design.

For me, that means:

keeping the coding tool workflow the same
preserving subscription value
combining subscription + cloud + local lanes
supporting enterprise backends
reducing waste on easy tasks

Who this is for

I think this is especially useful for:

Claude Code users who want more mileage from Pro/Max
teams using Azure AI Foundry for approved enterprise model access
AWS teams already standardizing on Bedrock
developers mixing local models with premium coding assistants
MCP and agent workflow builders who need an LLM gateway

Final thought

I do not think the next big improvement in AI coding comes only from stronger base models.

A lot of value will come from better infrastructure around them:

better routing
better caching
better cost control
better tool handling
better use of multiple model lanes in one workflow

That is the direction I am building toward with Lynkr.

GitHub: https://github.com/Fast-Editor/Lynkr
Ps:- This is fully following Anthropic TOS because lynkr wraps around your existing claude code

How to Use T3 Code With Claude Code and an Open-Source LLM Gateway

Lynkr — Thu, 25 Jun 2026 07:55:29 +0000

If I were setting up T3 Code for serious daily use, the stack I would want looks like this:

T3 Code
   ↓
Claude Code
   ↓
Lynkr
   ↓
Anthropic / OpenAI / Ollama / OpenRouter / Bedrock / Azure / Databricks

That flow is interesting because each layer is doing a different job:

T3 Code is the workflow and interface layer
Claude Code is the coding agent
Lynkr is the gateway layer under the agent
the model providers sit behind that gateway

That separation is the whole point.

T3 Code gives me the UX I want.
Claude Code gives me the coding behavior I want.
Lynkr gives me control over how model traffic actually gets handled.

That is a much better stack than treating the model layer as an afterthought.

Quick demo

I also recorded a short walkthrough of this setup in action:

YouTube: How to use T3Code with any model @t3dotgg

If you want the faster visual version before reading the rest, start there. The architecture is the same:

T3 Code
   ↓
Claude Code
   ↓
Lynkr
   ↓
Your actual model/provider

Why T3 Code is a useful surface

T3 Code is interesting because it is not trying to become a new model or a new lab-specific harness.

It is building a better way to work with coding agents people already use.

That is a smarter product decision than trying to replace everything at once.

Its current support includes:

Codex
Claude
Cursor
OpenCode

That means the value of T3 Code is not “one more coding assistant.”

It is more like:

one place to manage coding sessions
one place to manage projects and threads
one cleaner interface across multiple agent backends
less context-switching between separate tools

That makes a lot of sense.

But once you pick Claude Code as the coding agent inside that stack, the next problem becomes obvious:

the model layer under Claude Code matters just as much as the top-level UX.

Because once the agent is doing real work, cost and reliability stop being invisible plumbing.

They become part of the product experience.

Why Claude Code is the right example

Claude Code is a good example because it exposes the problem very clearly.

A real Claude Code session does not look like a single “generate code” call.

It looks more like:

inspect the repo
read a few files
plan a fix
call tools
generate or edit code
hit an issue
retry with more context
inspect another file
summarize the result
do another pass

That creates a traffic pattern that is very different from plain chat:

repeated system instructions
repeated repo context
repeated tool schemas
repeated state
large tool outputs
retries that quietly multiply tokens
easy turns mixed with hard reasoning turns

This is exactly why coding-agent workflows need a stronger model layer than “just point it directly at one provider.”

Once Claude Code is being used as an actual coding agent, the model path underneath it becomes infrastructure.

And infrastructure decisions compound.

The problem with wiring Claude Code directly forever

Direct setup is fine for testing.

But it gets worse as the workflow becomes more serious.

If Claude Code is always wired straight to one provider path, you get a few problems:

1. Every turn gets treated like it needs the same model

That is usually false.

Some steps are lightweight:

summarize a file
extract the likely cause of an error
choose the next action
interpret logs
reformat an answer
produce a short structured response

Some steps are genuinely expensive:

debug a multi-file integration break
reason across a large codebase
recover after several failed tool loops
refactor something deep without breaking behavior

If those all hit the same expensive path, you overpay.

2. Retries become cost multipliers

Coding agents retry all the time.

That is not a bug. That is how they work.

But retries mean the same or almost-the-same context gets resent over and over.

Without a caching layer or routing control, you keep paying full price for repeated work.

3. Tool-heavy traffic becomes the silent token killer

The expensive part is often not the user’s prompt.

It is everything around it:

tool definitions
file reads
logs
stack traces
JSON blobs
repeated state
structured outputs

That is where a lot of token waste hides.

4. Provider changes become annoying

Maybe today you want Claude for everything.

Later maybe you want:

local Ollama for cheap exploratory passes
Anthropic for hard reasoning
OpenRouter for overflow
Bedrock or Azure for enterprise constraints
a different mix for different teams

If the setup is too tightly wired, those changes become more painful than they should be.

5. Reliability problems leak into workflow

Latency spikes, rate limits, auth weirdness, provider outages, degraded outputs — eventually you hit all of them.

If there is no gateway layer, every one of those issues becomes a client-side problem.

That is exactly the kind of thing I would rather solve once in the model layer.

The split I want

This is the mental model that makes sense to me.

T3 Code handles

threads
projects
top-level UX
session management
coding workflow surface

Claude Code handles

code reasoning
edits
tool usage
the coding loop itself

Lynkr handles

routing
caching
fallback
token optimization
local/cloud backend mix
provider switching
cost control under one stable endpoint

That is a clean stack.

The interface stays separate from the agent.
The agent stays separate from the gateway.
The gateway stays separate from the providers.

That separation is valuable because it lets each layer evolve independently.

Why Lynkr fits under Claude Code

Lynkr is an open-source LLM gateway built for coding assistants, MCP-heavy workflows, and tool-heavy traffic.

That last part matters.

A lot of model-routing products talk about general-purpose requests. But coding traffic is different. It is noisier, more repetitive, and much more likely to carry large tool payloads.

That is why the fit is real here.

The role of Lynkr in this stack is not to replace Claude Code.

It is to sit under Claude Code and decide how model traffic should actually be handled.

That gives you a few levers that matter a lot in coding workflows.

1. Tier routing changes the economics

The biggest mistake people make with coding agents is asking the wrong question.

They ask:

“Which is the best coding model?”

The more useful question is:

“Which parts of my coding workflow actually deserve the expensive model?”

That is what a gateway lets you answer.

For example:

low-risk summarization can go to a cheaper/faster model
repeated inspection steps can stay local
simple classification or extraction steps do not need frontier pricing
hard debugging or refactors can escalate to a stronger path

That is a much better economic model than treating every Claude Code turn as if it deserves maximum spend.

And once that logic sits in the gateway, you do not need to keep rebuilding it at the app layer.

2. Caching matters more in coding than people think

Coding agents repeat themselves constantly.

The same instructions, the same repo background, similar prompts, similar recovery steps, similar tool outputs — they come up again and again.

That means a caching layer is not a “nice optimization.”

It is one of the biggest obvious wins in the stack.

Lynkr’s current benchmark claims are the part that stand out here:

53% fewer tokens on tool-heavy requests
87.6% compression on large JSON tool results
171ms semantic cache hits

That is exactly the kind of traffic Claude Code creates during real multi-step work.

The point is not just lower cost.

The point is lower cost and lower latency on repeated work.

That compounds very quickly.

3. Tool payload optimization is a real lever

This is one of the most under-discussed parts of coding-agent economics.

People spend a lot of time comparing model prices, but a huge amount of waste comes from the payload shape itself.

In coding workflows, the model is often seeing:

large tool schemas
verbose JSON results
long command outputs
repeated file excerpts
repeated structured state

That means reducing payload size is often just as important as picking the right provider.

This is why gateway-level optimization makes sense.

It is solving a real problem in the actual traffic pattern, not just shuffling providers around.

4. T3 Code stays stable while the model layer evolves

This is maybe the biggest architectural reason I like this stack.

If T3 Code points to Claude Code, and Claude Code points to Lynkr, then the top-level workflow can remain stable while the backend policy changes underneath.

That means I can change:

default providers
local/cloud mix
fallback policy
cache behavior
cost policy
model tiers

…without having to rethink the interface and workflow every time.

That is a better long-term design.

The UI layer should not be where I want model policy to live.

5. Local-first and fallback become much easier

There are plenty of steps in a coding workflow that can be handled locally or by a cheaper model path.

There are also plenty of steps where I want a stronger cloud model.

A gateway makes that hybrid model much easier.

For example:

local model for lightweight repo inspection
stronger provider for hard debugging
cloud fallback when local output is not good enough
alternate provider when the main path is slow or unavailable

That kind of setup is a lot harder to maintain cleanly when every client is wired directly.

Example of the architecture in practice

The point is not that T3 Code itself becomes the gateway.

The point is that the stack stays layered:

T3 Code
   ↓
Claude Code
   ↓
Lynkr
   ↓
Anthropic / OpenAI / Ollama / OpenRouter / Bedrock / Azure / Databricks

That gives you:

a clean interface at the top
a strong coding agent in the middle
one stable gateway layer underneath
swappable providers behind that

That is the shape I would trust more over time.

Why this matters for people using T3 Code seriously

If you are trying T3 Code casually, none of this matters much.

But if you are actually using it for repeated coding workflows, then it starts to matter fast.

Because daily coding-agent usage means:

lots of repeated calls
lots of tool-heavy turns
more retries than you expected
more context repetition than you expected
more need for backend flexibility than you expected

That is when the gateway stops being optional architecture theory and starts becoming the practical layer that controls cost and reliability.

Final take

If I were using T3 Code with Claude Code, I would not want Claude Code wired directly to one backend forever.

I would want:

T3 Code for workflow
Claude Code for coding behavior
Lynkr for routing, caching, fallback, and cost control
multiple providers behind that gateway

That feels like the right stack for where coding tools are going.

Better UX at the top.

Better agent behavior in the middle.

Better economics and control underneath.

If you want to check the projects:

T3 Code: https://github.com/pingdotgg/t3code
Lynkr: https://github.com/Fast-Editor/Lynkr

Why I’d Use a LLM gateway with Goose

Lynkr — Wed, 17 Jun 2026 07:15:15 +0000

Open-source coding agents are getting a lot more useful, and Goose is one of the clearest examples of that shift.

Goose is an open-source AI agent that goes beyond autocomplete. It can inspect code, execute tasks, edit files, and work through real development loops that look much closer to install → execute → edit → test than traditional code assistance.

That also means Goose creates the exact kind of workload where the model layer starts to matter a lot.

Once an agent is reading files, retrying commands, generating code, reasoning across context, and iterating through multi-step tasks, the cost and reliability of your model setup stops being a background detail. It becomes part of the product experience.

That’s why I think the cleaner architecture is:

Goose
  ↓
Lynkr
  ↓
OpenAI / Anthropic / Ollama / OpenRouter / Bedrock / Azure

In other words: use Goose as the coding agent, and use Lynkr as the LLM gateway underneath it.

What Goose is

If you haven’t looked at it yet, Goose is an open-source, extensible AI agent built for more than just code suggestions. The project describes it as an agent that can install, execute, edit, and test with any LLM, which is exactly why it’s interesting.

That framing matters.

A lot of developer AI tooling still assumes the model is mostly there to answer questions or generate snippets. Goose is part of the newer wave where the model is expected to participate in a real workflow. That means the token pattern changes too:

more repeated context
more tool-style back and forth
more retries
more multi-step reasoning
more chances to waste expensive model calls on easy tasks

That’s where a gateway helps.

What Lynkr does in this setup

Lynkr is an open-source LLM gateway. Instead of wiring Goose directly to a single provider, you point Goose at Lynkr and let Lynkr handle the model layer underneath.

That gives you one control point for:

provider switching
local + cloud model setups
fallback handling
routing
caching
cleaner long-term infrastructure

Goose stays focused on the agent workflow. Lynkr stays focused on how requests should reach the right model.

Why this matters for coding agents specifically

If you only make occasional direct API calls, model choice is simple.

If you use an agent heavily, it isn’t.

A Goose session can easily include:

reading repo context
planning a change
generating code
fixing an error
retrying with more context
running another step
revisiting earlier files

That is not one request. It is a chain of requests with different complexity levels.

Some of those steps can run on a cheaper or local model. Some need a stronger cloud model. Some repeat enough context that caching matters. Some need a fallback path because a provider slows down or fails mid-session.

Without a gateway, that logic ends up scattered or simply ignored.

With a gateway, you can manage it in one place.

Basic idea: point Goose at Lynkr instead of a raw provider

The exact Goose setup may vary depending on how you run it, but the architecture is straightforward:

Goose talks to one model endpoint
that endpoint is Lynkr
Lynkr forwards to the real provider you want underneath

A typical environment setup looks like this:

export OPENAI_API_BASE=http://localhost:3000/v1
export OPENAI_API_KEY=dummy

Then run Goose normally:

goose

Or for a direct task:

goose run "Review this repo and suggest 3 refactors"

In this flow, Goose thinks it’s talking to its configured LLM endpoint. Lynkr handles what happens next.

Example 1: Run Goose on a local model through Lynkr

Let’s say you want Goose to use a local coding model first.

A simple Lynkr config might look like this:

providers:
  - name: local-coder
    type: ollama
    model: qwen2.5-coder:14b

routing:
  default: local-coder

Then:

export OPENAI_API_BASE=http://localhost:3000/v1
export OPENAI_API_KEY=dummy

goose run "Explain this repository structure and identify dead code"

Why do this instead of connecting Goose directly to Ollama?

Because once Goose is pointed at Lynkr, you can change the backend later without changing the Goose-side integration.

That means you can start local, then later:

switch to a better coding model
add a cloud fallback
route specific workloads differently
keep the same stable endpoint for Goose

Example 2: Local-first, cloud fallback

A more realistic setup is usually local-first with a stronger cloud fallback.

providers:
  - name: local-fast
    type: ollama
    model: qwen2.5-coder:14b

  - name: cloud-strong
    type: anthropic
    model: claude-sonnet-4

routing:
  default: local-fast
  fallback: cloud-strong

Then configure Goose to talk to Lynkr:

export ANTHROPIC_BASE_URL=http://localhost:3000
export ANTHROPIC_API_KEY=dummy

goose run "Debug why the integration tests are failing and propose a patch"

This gives you a much nicer operating model:

cheap/local by default
stronger cloud help when needed
Goose workflow stays the same

Example 3: One Goose workflow, multiple providers behind it

One of the biggest advantages of putting a gateway under a coding agent is that your model preferences change all the time.

Sometimes you want:

a fast model for lighter steps
a stronger model for code generation
a local model for private work
a backup provider when your main one rate-limits

With Lynkr, you don’t need to keep reworking Goose every time you change that strategy.

Example:

providers:
  - name: fast
    type: openrouter
    model: openai/gpt-4o-mini

  - name: coder
    type: anthropic
    model: claude-sonnet-4

  - name: local
    type: ollama
    model: qwen2.5-coder:14b

routing:
  default: coder
  fallback: fast

Goose still uses the same top-level environment variables:

export OPENAI_API_BASE=http://localhost:3000/v1
export OPENAI_API_KEY=dummy

That’s the part I like most about the gateway pattern: the agent stays stable while the model layer evolves underneath it.

Where Lynkr becomes especially useful

There are a few situations where this setup becomes much more valuable than direct provider wiring.

1. You want to avoid vendor lock-in

If Goose is wired straight to one provider, every change becomes a reconfiguration problem.

If Goose is wired to Lynkr, provider changes happen underneath the same gateway layer.

2. You want local + cloud flexibility

A lot of developers want a local-first workflow but still need access to stronger cloud models when tasks get harder.

That’s much cleaner when Goose talks to one gateway instead of multiple provider-specific setups.

3. You want better cost control

Agent workflows can burn tokens in places that don’t need premium models.

A gateway gives you a place to route easier work more cheaply.

4. You want a more future-proof stack

Coding agents are changing fast. Model providers are changing fast too.

A stable gateway layer gives you a cleaner architecture than coupling every tool directly to every provider.

A practical mental model

The easiest way to think about this is:

Goose = behavior layer
Lynkr = model control layer

Goose decides what work to do.
Lynkr decides where that work should go.

That separation gets more useful as your workflows get more agentic.

Final thoughts

Goose is part of a bigger shift in developer tools. We’re moving from AI assistants that mostly answer questions to coding agents that can actually work through tasks.

As that shift happens, the model layer matters more.

If you connect Goose directly to a provider, it works.

If you connect Goose to Lynkr, you get a cleaner long-term setup:

one stable gateway
easier provider switching
local/cloud flexibility
fallback support
better control over how your coding agent uses models

That’s why I’d rather put Goose on top of an LLM gateway than wire it straight to a raw provider.

If you’re already experimenting with Goose, this is one of the simplest ways to make the setup more flexible without changing the agent workflow itself.

GitHub

Goose: https://github.com/aaif-goose/goose
Lynkr: https://github.com/Fast-Editor/Lynkr