<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: 1bcMax</title>
    <description>The latest articles on DEV Community by 1bcMax (@1bcmax).</description>
    <link>https://dev.to/1bcmax</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3836331%2Fa7b9a36f-861e-4fd0-beaf-32a83d08ad22.jpeg</url>
      <title>DEV Community: 1bcMax</title>
      <link>https://dev.to/1bcmax</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/1bcmax"/>
    <language>en</language>
    <item>
      <title>We Read 100 OpenClaw Issues About OpenRouter. Here's What We Built Instead.</title>
      <dc:creator>1bcMax</dc:creator>
      <pubDate>Sat, 21 Mar 2026 23:29:27 +0000</pubDate>
      <link>https://dev.to/1bcmax/we-read-100-openclaw-issues-about-openrouter-heres-what-we-built-instead-3ohn</link>
      <guid>https://dev.to/1bcmax/we-read-100-openclaw-issues-about-openrouter-heres-what-we-built-instead-3ohn</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;OpenRouter is the most popular LLM aggregator. It's also the source of the most frustration in OpenClaw's issue tracker.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-100-openclaw-issues-intro.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-100-openclaw-issues-intro.png" alt="Reading 100 OpenClaw Issues Built a Better Router" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;The Data&lt;/h2&gt;

&lt;p&gt;We searched &lt;a href="https://github.com/openclaw/openclaw/issues" rel="noopener noreferrer"&gt;OpenClaw's GitHub issues&lt;/a&gt; for "openrouter" and read every result. &lt;strong&gt;100 issues.&lt;/strong&gt; Open and closed. Filed by users who ran into the same structural problems over and over.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Issues&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Broken fallback / failover&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~20&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/openclaw/openclaw/issues/22136" rel="noopener noreferrer"&gt;#22136&lt;/a&gt;, &lt;a href="https://github.com/openclaw/openclaw/issues/45663" rel="noopener noreferrer"&gt;#45663&lt;/a&gt;, &lt;a href="https://github.com/openclaw/openclaw/issues/50389" rel="noopener noreferrer"&gt;#50389&lt;/a&gt;, &lt;a href="https://github.com/openclaw/openclaw/issues/49079" rel="noopener noreferrer"&gt;#49079&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model ID mangling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~15&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/openclaw/openclaw/issues/49379" rel="noopener noreferrer"&gt;#49379&lt;/a&gt;, &lt;a href="https://github.com/openclaw/openclaw/issues/50711" rel="noopener noreferrer"&gt;#50711&lt;/a&gt;, &lt;a href="https://github.com/openclaw/openclaw/issues/25665" rel="noopener noreferrer"&gt;#25665&lt;/a&gt;, &lt;a href="https://github.com/openclaw/openclaw/issues/2373" rel="noopener noreferrer"&gt;#2373&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Authentication / 401 errors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~8&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/openclaw/openclaw/issues/51056" rel="noopener noreferrer"&gt;#51056&lt;/a&gt;, &lt;a href="https://github.com/openclaw/openclaw/issues/34830" rel="noopener noreferrer"&gt;#34830&lt;/a&gt;, &lt;a href="https://github.com/openclaw/openclaw/issues/26960" rel="noopener noreferrer"&gt;#26960&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost / billing opacity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~6&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/openclaw/openclaw/issues/25371" rel="noopener noreferrer"&gt;#25371&lt;/a&gt;, &lt;a href="https://github.com/openclaw/openclaw/issues/50738" rel="noopener noreferrer"&gt;#50738&lt;/a&gt;, &lt;a href="https://github.com/openclaw/openclaw/issues/38248" rel="noopener noreferrer"&gt;#38248&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Routing opacity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~5&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/openclaw/openclaw/issues/7006" rel="noopener noreferrer"&gt;#7006&lt;/a&gt;, &lt;a href="https://github.com/openclaw/openclaw/issues/35842" rel="noopener noreferrer"&gt;#35842&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Missing feature parity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~10&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/openclaw/openclaw/issues/46255" rel="noopener noreferrer"&gt;#46255&lt;/a&gt;, &lt;a href="https://github.com/openclaw/openclaw/issues/50485" rel="noopener noreferrer"&gt;#50485&lt;/a&gt;, &lt;a href="https://github.com/openclaw/openclaw/issues/30850" rel="noopener noreferrer"&gt;#30850&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate limit / key exhaustion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~4&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/openclaw/openclaw/issues/8615" rel="noopener noreferrer"&gt;#8615&lt;/a&gt;, &lt;a href="https://github.com/openclaw/openclaw/issues/48729" rel="noopener noreferrer"&gt;#48729&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model catalog staleness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~5&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/openclaw/openclaw/issues/10687" rel="noopener noreferrer"&gt;#10687&lt;/a&gt;, &lt;a href="https://github.com/openclaw/openclaw/issues/30152" rel="noopener noreferrer"&gt;#30152&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These aren't edge cases. They're structural consequences of how OpenRouter works: a middleman that adds latency, mangles model IDs, obscures routing decisions, and introduces its own failure modes on top of the providers it aggregates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-anatomy-of-middleman-failure.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-anatomy-of-middleman-failure.png" alt="The Anatomy of Middleman Failure" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-architectural-shift-middleman-vs-local.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-architectural-shift-middleman-vs-local.png" alt="The Architectural Shift: Middleman vs. Local Router" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;1. Broken Fallback — The #1 Pain Point&lt;/h2&gt;

&lt;p&gt;From &lt;a href="https://github.com/openclaw/openclaw/issues/45663" rel="noopener noreferrer"&gt;#45663&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Provider returned error from OpenRouter does not trigger model failover."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From &lt;a href="https://github.com/openclaw/openclaw/issues/50389" rel="noopener noreferrer"&gt;#50389&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Rate limit errors surfaced to user instead of auto-failover."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When OpenRouter returns a 429 or provider error, OpenClaw's failover logic often doesn't recognize it as retriable. The user sees a raw error. The agent stops. ~20 issues document variations of this: HTTP 529 (Anthropic overloaded) not triggering fallback (&lt;a href="https://github.com/openclaw/openclaw/issues/49079" rel="noopener noreferrer"&gt;#49079&lt;/a&gt;), invalid model IDs causing 400 instead of failover (&lt;a href="https://github.com/openclaw/openclaw/issues/50017" rel="noopener noreferrer"&gt;#50017&lt;/a&gt;), timeouts in cron sessions with no recovery (&lt;a href="https://github.com/openclaw/openclaw/issues/49597" rel="noopener noreferrer"&gt;#49597&lt;/a&gt;).&lt;/p&gt;
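&lt;p&gt;The core decision a failover layer has to get right is which errors are retriable at all. As a rough sketch (our illustration, not OpenClaw's or ClawRouter's actual code), a status-code classifier covering the cases above might look like this:&lt;/p&gt;

```typescript
// Illustrative sketch: should an upstream error trigger failover,
// or be surfaced to the user? 429 = rate limit (often recovers in ms),
// 529 = Anthropic "overloaded", other 5xx = transient server errors,
// 400 with a bad model ID means *this* model is unusable but others may work,
// 401/402/403 are auth/billing problems that retrying cannot fix.
type FailoverAction = "retry-same-model" | "next-model" | "surface";

function classifyStatus(status: number): FailoverAction {
  if (status === 429) return "retry-same-model";
  if (status === 529 || (status >= 500 && status <= 599)) return "next-model";
  if (status === 400) return "next-model";
  return "surface";
}
```

The issues above are all cases where the first two branches were treated like the third.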

&lt;h3&gt;How ClawRouter Solves This&lt;/h3&gt;

&lt;p&gt;ClawRouter maintains fallback chains &lt;strong&gt;up to 8 models deep&lt;/strong&gt; per routing tier. When a model fails:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;200ms retry&lt;/strong&gt; — short-burst rate limits often recover in milliseconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Next model&lt;/strong&gt; — if retry fails, move to the next model in the chain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-model isolation&lt;/strong&gt; — one provider's failure doesn't poison the others&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All-failed summary&lt;/strong&gt; — if every model in the chain fails, you get a structured error listing every attempt and failure reason
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ClawRouter] Trying model 1/6: google/gemini-2.5-flash
[ClawRouter] Model google/gemini-2.5-flash returned 429, retrying in 200ms...
[ClawRouter] Retry failed, trying model 2/6: deepseek/deepseek-chat
[ClawRouter] Success with model: deepseek/deepseek-chat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No silent failures. No raw 429s surfaced to the agent.&lt;/p&gt;
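&lt;p&gt;The four steps above can be sketched as a single loop. This is a simplified illustration, not ClawRouter's source; the &lt;code&gt;callModel&lt;/code&gt; function and the error format are assumptions:&lt;/p&gt;

```typescript
// Sketch of a cascading fallback chain: one quick 200ms retry on a rate
// limit, then advance to the next model; every failure is recorded so the
// final error can list all attempts and reasons.
async function withFallback(
  chain: string[],
  callModel: (model: string) => Promise<string>,
) {
  const attempts: { model: string; error: string }[] = [];
  for (const model of chain) {
    for (let attempt = 0; attempt < 2; attempt++) { // first call + one retry
      try {
        return await callModel(model);
      } catch (err) {
        const msg = err instanceof Error ? err.message : String(err);
        attempts.push({ model, error: msg });
        if (!msg.includes("429")) break; // only rate limits earn the quick retry
        await new Promise((resolve) => setTimeout(resolve, 200));
      }
    }
  }
  // All-failed summary: structured, listing every attempt and failure reason
  const summary = attempts.map((a) => `${a.model}: ${a.error}`).join("; ");
  throw new Error(`All models in chain failed (${summary})`);
}
```

One provider's failure only consumes its own slot in the chain; nothing it does can poison the models behind it.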

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-cascading-fallback-chains-429.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-cascading-fallback-chains-429.png" alt="Surviving the 429: Cascading Fallback Chains" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;2. Model ID Mangling — Death by Prefix&lt;/h2&gt;

&lt;p&gt;From &lt;a href="https://github.com/openclaw/openclaw/issues/25665" rel="noopener noreferrer"&gt;#25665&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Model config defaults to &lt;code&gt;openrouter/openrouter/auto&lt;/code&gt; (double prefix)."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From &lt;a href="https://github.com/openclaw/openclaw/issues/50711" rel="noopener noreferrer"&gt;#50711&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Control UI model picker strips &lt;code&gt;openrouter/&lt;/code&gt; prefix."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;OpenRouter uses nested model IDs: &lt;code&gt;openrouter/deepseek/deepseek-v3.2&lt;/code&gt;. OpenClaw's UI, Discord bot, and web gateway all handle these differently. Some add the prefix. Some strip it. Some double it. &lt;strong&gt;15 issues&lt;/strong&gt; trace back to model ID confusion.&lt;/p&gt;

&lt;h3&gt;How ClawRouter Solves This&lt;/h3&gt;

&lt;p&gt;Clean aliases. You say &lt;code&gt;sonnet&lt;/code&gt; and get &lt;code&gt;anthropic/claude-sonnet-4-6&lt;/code&gt;. You say &lt;code&gt;flash&lt;/code&gt; and get &lt;code&gt;google/gemini-2.5-flash&lt;/code&gt;. No nested prefixes. No double-prefix bugs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// resolveModelAlias() handles all normalization&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sonnet&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;     &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;anthropic/claude-sonnet-4-6&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;opus&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;       &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;anthropic/claude-opus-4-6&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;flash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;      &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;google/gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;grok&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;       &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;xai/grok-4-0314&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;deepseek&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;   &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;deepseek/deepseek-chat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One canonical format. No mangling. No UI inconsistency.&lt;/p&gt;
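&lt;p&gt;A minimal sketch of the normalization idea (the alias table mirrors the examples above; the exact rules inside ClawRouter's &lt;code&gt;resolveModelAlias()&lt;/code&gt; may differ):&lt;/p&gt;

```typescript
// Sketch: map short aliases to canonical provider/model IDs and strip any
// number of leading "openrouter/" prefixes, so "openrouter/openrouter/auto"
// style double-prefix bugs cannot occur.
const ALIASES: { [alias: string]: string } = {
  sonnet: "anthropic/claude-sonnet-4-6",
  flash: "google/gemini-2.5-flash",
  deepseek: "deepseek/deepseek-chat",
};

function resolveModelAlias(input: string): string {
  const id = input.replace(/^(openrouter\/)+/, ""); // kill single or double prefixes
  return ALIASES[id] ?? id; // unknown IDs pass through unchanged
}
```

Every UI surface calls the same resolver, so the picker, the bot, and the gateway can no longer disagree.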

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-eliminating-model-id-mangling.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-eliminating-model-id-mangling.png" alt="Eliminating Model ID Mangling" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;3. API Key Hell — 401s, Leakage, and Rotation&lt;/h2&gt;

&lt;p&gt;From &lt;a href="https://github.com/openclaw/openclaw/issues/51056" rel="noopener noreferrer"&gt;#51056&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"OpenRouter fails with '401 Missing Authentication header' despite valid key."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From &lt;a href="https://github.com/openclaw/openclaw/issues/8615" rel="noopener noreferrer"&gt;#8615&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Feature request: native multi-API-key support with load balancing and fallback."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;API keys are the root cause of an entire category of failures. Keys expire. Keys leak into LLM context — every provider sees every other provider's keys in the serialized request. Keys hit rate limits that can't be load-balanced. &lt;strong&gt;8 issues&lt;/strong&gt; document auth failures alone.&lt;/p&gt;

&lt;h3&gt;How ClawRouter Solves This&lt;/h3&gt;

&lt;p&gt;ClawRouter has &lt;strong&gt;no API keys&lt;/strong&gt;. Zero.&lt;/p&gt;

&lt;p&gt;Payment happens via &lt;a href="https://x402.org/" rel="noopener noreferrer"&gt;x402&lt;/a&gt; — a cryptographic micropayment protocol. Your agent generates a wallet on first run (BIP-44 derivation, both EVM and Solana). Each request is signed with the wallet's private key. USDC moves per-request.&lt;/p&gt;

&lt;p&gt;No keys to leak. No keys to rotate. No keys to rate-limit. No keys to expire.&lt;/p&gt;

&lt;p&gt;The wallet is the identity. The signature is the authentication. Nothing to configure, nothing to paste into a config file, nothing for the LLM to accidentally serialize.&lt;/p&gt;
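&lt;p&gt;Conceptually, "the signature is the authentication" looks like this. The sketch below uses an Ed25519 keypair from Node's standard library as a stand-in for the wallet; the real x402 flow adds a payment payload and on-chain settlement that we do not model here:&lt;/p&gt;

```typescript
import { generateKeyPairSync, sign, verify } from "node:crypto";

// Sketch: per-request signing in place of API keys. The private key never
// appears in any request body or config file; the server only ever checks
// a signature against the public key (the wallet address plays this role
// in the real protocol).
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

function signRequest(body: string) {
  const signature = sign(null, Buffer.from(body), privateKey).toString("base64");
  return { body, signature };
}

function verifyRequest(req: { body: string; signature: string }): boolean {
  return verify(null, Buffer.from(req.body), publicKey, Buffer.from(req.signature, "base64"));
}
```

There is no secret string to paste, serialize, or leak: tampering with the body invalidates the signature.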

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-cryptographic-auth-x402-wallet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-cryptographic-auth-x402-wallet.png" alt="Cryptographic Auth: The End of API Key Hell" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;4. Cost and Billing Opacity — Surprise Bills&lt;/h2&gt;

&lt;p&gt;From &lt;a href="https://github.com/openclaw/openclaw/issues/25371" rel="noopener noreferrer"&gt;#25371&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"OpenRouter 402 billing error misclassified as 'Context overflow', triggering auto-compaction that drains remaining credits faster."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From &lt;a href="https://github.com/openclaw/openclaw/issues/7006" rel="noopener noreferrer"&gt;#7006&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"&lt;code&gt;openrouter/auto&lt;/code&gt; doesn't expose which model was actually used or its cost."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When OpenRouter runs out of credits, it returns a 402 that OpenClaw misreads as a context overflow. OpenClaw then auto-compacts the context and retries — on the same empty balance. Each retry charges the compaction cost. Credits drain faster. The agent burns money trying to fix a billing error it doesn't understand.&lt;/p&gt;

&lt;h3&gt;How ClawRouter Solves This&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Per-request cost visibility.&lt;/strong&gt; Every response includes cost headers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x-clawrouter-cost: 0.003400
x-clawrouter-savings: 82%
x-clawrouter-model: google/gemini-2.5-flash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Per-request USDC payments.&lt;/strong&gt; No prepaid balance to drain. Each request shows its price before you pay. When the wallet is empty, requests don't fail — they &lt;strong&gt;fall back to the free tier&lt;/strong&gt; (NVIDIA GPT-OSS-120B).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget guard.&lt;/strong&gt; &lt;code&gt;maxCostPerRun&lt;/code&gt; caps per-session spending. Two modes: &lt;code&gt;graceful&lt;/code&gt; (downgrade to cheaper models) or &lt;code&gt;strict&lt;/code&gt; (hard stop). Runaway-spend scenarios like a heartbeat quietly burning $248/day become structurally impossible.&lt;/p&gt;
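&lt;p&gt;A budget guard of this shape is only a few lines of logic. This sketch is ours, not ClawRouter's implementation; the tier names are illustrative:&lt;/p&gt;

```typescript
// Sketch: enforce maxCostPerRun per session. "graceful" downgrades to the
// cheapest tier once the cap is hit; "strict" stops the run outright.
type BudgetMode = "graceful" | "strict";

function applyBudget(
  spentUsd: number,
  maxCostPerRun: number,
  requestedTier: string,
  mode: BudgetMode,
): string {
  if (spentUsd < maxCostPerRun) return requestedTier;
  if (mode === "strict") throw new Error(`maxCostPerRun ($${maxCostPerRun}) reached`);
  return "CHEAP"; // graceful: keep going, but only on the cheapest tier
}
```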

&lt;p&gt;&lt;strong&gt;Usage logging.&lt;/strong&gt; Every request logs to &lt;code&gt;~/.openclaw/blockrun/logs/usage-YYYY-MM-DD.jsonl&lt;/code&gt; with model, tier, cost, baseline cost, savings, and latency. &lt;code&gt;/stats&lt;/code&gt; shows the breakdown.&lt;/p&gt;
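&lt;p&gt;Because the log is plain JSONL, a &lt;code&gt;/stats&lt;/code&gt;-style rollup is easy to reproduce yourself. In this sketch the field names (&lt;code&gt;cost&lt;/code&gt;, &lt;code&gt;baselineCost&lt;/code&gt;) are assumptions based on the fields listed above, not a documented schema:&lt;/p&gt;

```typescript
// Sketch: aggregate a usage JSONL log (one JSON object per line) into
// request count, total cost, and total savings versus the baseline model.
function summarizeUsage(jsonl: string) {
  const rows = jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
  return {
    requests: rows.length,
    costUsd: rows.reduce((sum, r) => sum + r.cost, 0),
    savingsUsd: rows.reduce((sum, r) => sum + (r.baselineCost - r.cost), 0),
  };
}
```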

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-cost-visibility-session-guardrails.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-cost-visibility-session-guardrails.png" alt="Absolute Cost Visibility and Session Guardrails" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;5. Routing Opacity — "Which Model Did I Just Pay For?"&lt;/h2&gt;

&lt;p&gt;From &lt;a href="https://github.com/openclaw/openclaw/issues/7006" rel="noopener noreferrer"&gt;#7006&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"No visibility into which model &lt;code&gt;openrouter/auto&lt;/code&gt; actually uses."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From &lt;a href="https://github.com/openclaw/openclaw/issues/35842" rel="noopener noreferrer"&gt;#35842&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Need explicit Claude Sonnet default instead of auto-routing."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When you use &lt;code&gt;openrouter/auto&lt;/code&gt;, you don't know what model served your request. You can't debug quality regressions. You can't understand cost spikes. You're paying for a black box.&lt;/p&gt;

&lt;h3&gt;How ClawRouter Solves This&lt;/h3&gt;

&lt;p&gt;ClawRouter's routing is &lt;strong&gt;100% local&lt;/strong&gt;, open-source, and transparent.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;14-dimension weighted classifier&lt;/strong&gt; runs locally in &amp;lt;1ms. It scores every request across: token count, code presence, reasoning markers, technical terms, multi-step patterns, question complexity, tool signals, and more.&lt;/p&gt;
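&lt;p&gt;To make "weighted classifier" concrete, here is a toy version scoring just four of the dimensions named above. The weights, patterns, and thresholds are invented for illustration and are not ClawRouter's actual values:&lt;/p&gt;

```typescript
// Toy sketch: each dimension is a predicate with a weight; the summed
// score maps to a routing tier. The real classifier uses 14 dimensions.
const DIMENSIONS = [
  { name: "long-input",   weight: 2, test: (p: string) => p.length > 2000 },
  { name: "code-present", weight: 3, test: (p: string) => /function |class |import /.test(p) },
  { name: "reasoning",    weight: 2, test: (p: string) => /step by step|prove|why/i.test(p) },
  { name: "multi-step",   weight: 2, test: (p: string) => /then|after that|finally/i.test(p) },
];

function classify(prompt: string): "LOW" | "MEDIUM" | "HIGH" {
  const score = DIMENSIONS.reduce((s, d) => s + (d.test(prompt) ? d.weight : 0), 0);
  return score >= 5 ? "HIGH" : score >= 2 ? "MEDIUM" : "LOW";
}
```

Nothing here needs a network call, which is why the decision fits in under a millisecond and can be logged verbatim.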

&lt;p&gt;&lt;strong&gt;Debug headers on every response:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x-clawrouter-profile: auto
x-clawrouter-tier: MEDIUM
x-clawrouter-model: moonshot/kimi-k2.5
x-clawrouter-confidence: 0.87
x-clawrouter-reasoning: "Code task with moderate complexity"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SSE debug comments&lt;/strong&gt; in streaming responses show the routing decision inline. You always know which model, why it was selected, and how confident the classifier was.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Four routing profiles&lt;/strong&gt; give you explicit control:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Profile&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;auto&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Balanced quality + cost&lt;/td&gt;
&lt;td&gt;74–100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;eco&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cheapest possible&lt;/td&gt;
&lt;td&gt;95–100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;premium&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Best quality always&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;free&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;NVIDIA GPT-OSS only&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No black box. No mystery routing. Full visibility, full control.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-14-dimension-routing-classification.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-14-dimension-routing-classification.png" alt="Transparent Routing via 14-Dimension Classification" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;6. Missing Feature Parity — Images, Tools, Caching&lt;/h2&gt;

&lt;p&gt;From &lt;a href="https://github.com/openclaw/openclaw/issues/46255" rel="noopener noreferrer"&gt;#46255&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Images not passed to OpenRouter models."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From &lt;a href="https://github.com/openclaw/openclaw/issues/47707" rel="noopener noreferrer"&gt;#47707&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Mistral models fail with strict tool call ID requirements."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;OpenRouter doesn't always pass through provider-specific features correctly. Image payloads get dropped. Cache retention headers get ignored. Tool call ID formats cause silent failures with strict providers.&lt;/p&gt;

&lt;h3&gt;How ClawRouter Solves This&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Vision auto-detection.&lt;/strong&gt; When &lt;code&gt;image_url&lt;/code&gt; content parts are detected, ClawRouter automatically filters the fallback chain to vision-capable models only. No images dropped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool calling validation.&lt;/strong&gt; Every model has a &lt;code&gt;toolCalling&lt;/code&gt; flag. When tools are present in the request, ClawRouter forces agentic routing tiers and excludes models without tool support. No silent tool call failures.&lt;/p&gt;
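&lt;p&gt;Both checks reduce to filtering the fallback chain by capability flags. A hedged sketch (the catalog entries and field names are illustrative, not ClawRouter's schema):&lt;/p&gt;

```typescript
// Sketch: drop models from the fallback chain that lack a required
// capability, so vision requests never reach text-only models and tool
// calls never reach models without tool support.
type ModelInfo = { id: string; vision: boolean; toolCalling: boolean };

function filterChain(
  chain: ModelInfo[],
  needs: { vision?: boolean; tools?: boolean },
): string[] {
  return chain
    .filter((m) => (!needs.vision || m.vision) && (!needs.tools || m.toolCalling))
    .map((m) => m.id);
}
```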

&lt;p&gt;&lt;strong&gt;Direct provider routing.&lt;/strong&gt; ClawRouter routes through BlockRun's API directly to providers — not through a second aggregator. One hop, not two. Provider-specific features work because there's no middleman translating them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-feature-parity-direct-connectivity.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-feature-parity-direct-connectivity.png" alt="Guaranteed Feature Parity and Direct Connectivity" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;7. Model Catalog Staleness — "Where's the New Model?"&lt;/h2&gt;

&lt;p&gt;From &lt;a href="https://github.com/openclaw/openclaw/issues/10687" rel="noopener noreferrer"&gt;#10687&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Need fully dynamic model discovery."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From &lt;a href="https://github.com/openclaw/openclaw/issues/30152" rel="noopener noreferrer"&gt;#30152&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Allowlist silently drops models not in catalog."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When new models launch, OpenRouter's catalog lags. Users configure a model that exists at the provider but isn't in the catalog. The request fails silently or gets rerouted.&lt;/p&gt;

&lt;h3&gt;How ClawRouter Solves This&lt;/h3&gt;

&lt;p&gt;ClawRouter maintains a curated catalog of &lt;strong&gt;46+ models across 8 providers&lt;/strong&gt;, updated with each release. Delisted models have automatic redirect aliases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Delisted models redirect automatically&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;xai/grok-code-fast-1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;deepseek/deepseek-chat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;google/gemini-2.0-pro&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;google/gemini-3.1-pro&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No silent drops. No stale catalog. Models are benchmarked for speed, quality, and tool support before inclusion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-cost-transparency-nexus-92-savings.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-cost-transparency-nexus-92-savings.png" alt="The Cost/Transparency Nexus" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;The Full Comparison&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;OpenRouter&lt;/th&gt;
&lt;th&gt;ClawRouter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Authentication&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API key (leak risk)&lt;/td&gt;
&lt;td&gt;Wallet signature (no keys)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Payment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prepaid balance (custodial)&lt;/td&gt;
&lt;td&gt;Per-request USDC (non-custodial)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Routing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server-side black box&lt;/td&gt;
&lt;td&gt;Local 14-dim classifier, &amp;lt;1ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fallback&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Often broken (20+ issues)&lt;/td&gt;
&lt;td&gt;8-deep chains, per-model isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model IDs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Nested prefixes, mangling bugs&lt;/td&gt;
&lt;td&gt;Clean aliases, single format&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost visibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None per-request&lt;/td&gt;
&lt;td&gt;Headers + JSONL logs + &lt;code&gt;/stats&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Empty wallet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Request fails&lt;/td&gt;
&lt;td&gt;Auto-fallback to free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate limits&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-key, shared&lt;/td&gt;
&lt;td&gt;Per-wallet, independent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vision support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Images sometimes dropped&lt;/td&gt;
&lt;td&gt;Auto-detected, vision-only fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool calling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Silent failures with some models&lt;/td&gt;
&lt;td&gt;Flag-based filtering, guaranteed support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model catalog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Laggy, silent drops&lt;/td&gt;
&lt;td&gt;Curated 46+ models, redirect aliases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Budget control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Monthly invoice&lt;/td&gt;
&lt;td&gt;Per-session cap (&lt;code&gt;maxCostPerRun&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Create account, paste key&lt;/td&gt;
&lt;td&gt;Agent generates wallet, auto-configured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$25/M tokens (Opus direct)&lt;/td&gt;
&lt;td&gt;$2.05/M tokens (auto-routed) = &lt;strong&gt;92% savings&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-engineering-matrix-comparison.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-engineering-matrix-comparison.png" alt="The Engineering Matrix" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @blockrun/clawrouter

&lt;span class="c"&gt;# Start (auto-configures OpenClaw)&lt;/span&gt;
clawrouter

&lt;span class="c"&gt;# Check your wallet&lt;/span&gt;
&lt;span class="c"&gt;# /wallet&lt;/span&gt;

&lt;span class="c"&gt;# View routing stats&lt;/span&gt;
&lt;span class="c"&gt;# /stats&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ClawRouter auto-injects itself into &lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt; as a provider on startup. Your existing tools, sessions, and extensions are unchanged.&lt;/p&gt;

&lt;p&gt;Load a wallet with USDC on Base or Solana, pick a routing profile, and run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-frictionless-integration.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.googleapis.com%2Fblockrun-ai-static%2Fblog%2Fclawrouter-frictionless-integration.png" alt="Frictionless Integration" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/BlockRunAI/ClawRouter" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://blockrun.ai" rel="noopener noreferrer"&gt;blockrun.ai&lt;/a&gt; · &lt;code&gt;npm install -g @blockrun/clawrouter&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>openrouter</category>
      <category>webdev</category>
    </item>
    <item>
      <title>AI Agent API Costs: How ClawRouter Cuts LLM Spending by 500x</title>
      <dc:creator>1bcMax</dc:creator>
      <pubDate>Sat, 21 Mar 2026 03:05:16 +0000</pubDate>
      <link>https://dev.to/1bcmax/ai-agent-api-costs-how-clawrouter-cuts-llm-spending-by-500x-186k</link>
      <guid>https://dev.to/1bcmax/ai-agent-api-costs-how-clawrouter-cuts-llm-spending-by-500x-186k</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;OpenClaw is one of the best AI agent frameworks available. Its LLM abstraction layer is not.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The $248/Day Problem
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkliggdvlf98iojaqeyx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkliggdvlf98iojaqeyx.png" alt="The Autopsy of an Overrun — token volume compounds exponentially in agentic workloads, reaching 11.3M input tokens in a single hour" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From &lt;a href="https://github.com/openclaw/openclaw/issues/3181" rel="noopener noreferrer"&gt;openclaw/openclaw#3181&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"We ended up at $248/day before we caught it. Heartbeat on Opus 4.6 with a large context. The dedup fix reduced trigger rate, but there's nothing bounding the run itself."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"11.3M input tokens in 1 hour on claude-opus-4-6 (128K context), ~$20/hour."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Both users ended up disabling heartbeat entirely. The workaround: &lt;code&gt;heartbeat.every: "0"&lt;/code&gt; — turning off the feature to avoid burning money.&lt;/p&gt;

&lt;p&gt;The root cause isn't configuration error. It's that OpenClaw's LLM layer has no concept of what things cost, and no way to stop a run that's spending too much.&lt;/p&gt;




&lt;h2&gt;
  
  
  What OpenClaw Gets Wrong at the Inference Layer
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmrmwmqm6xq4lu882a4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmrmwmqm6xq4lu882a4m.png" alt="Orchestration frameworks are blind to inference realities — cost tier, error semantics, and context size go unscreened" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenClaw is an excellent orchestration framework — session management, tool dispatch, agent routing, memory. But every request it makes hits a single configured model with no awareness of:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost tier&lt;/strong&gt; — A heartbeat status check doesn't need Opus. A file read result doesn't need 128K context. OpenClaw sends both to the same model at the same price.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limit isolation&lt;/strong&gt; — When one provider hits a 429, OpenClaw's failover logic applies that cooldown to the entire profile, not just the offending model. Every model in the same group is penalized (&lt;a href="https://github.com/openclaw/openclaw/issues/49834" rel="noopener noreferrer"&gt;#49834&lt;/a&gt;). If you configured 5 models for fallback, one slow provider can block all of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empty/degraded responses&lt;/strong&gt; — Some providers return HTTP 200 with empty content, repeated tokens, or a single newline. OpenClaw passes this through to the agent. The agent either errors out, loops, or silently gets a blank response (&lt;a href="https://github.com/openclaw/openclaw/issues/49902" rel="noopener noreferrer"&gt;#49902&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error semantics&lt;/strong&gt; — OpenClaw's failover logic has known gaps. We found and fixed two while building ClawRouter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MiniMax HTTP 520&lt;/strong&gt; (&lt;a href="https://github.com/openclaw/openclaw/pull/49550" rel="noopener noreferrer"&gt;PR #49550&lt;/a&gt;) — MiniMax returns &lt;code&gt;{"type":"api_error","message":"unknown error, 520 (1000)"}&lt;/code&gt; for transient server errors. OpenClaw's classifier required both &lt;code&gt;"type":"api_error"&lt;/code&gt; AND the string &lt;code&gt;"internal server error"&lt;/code&gt;. MiniMax fails the second check. Result: no failover, silent failure, retry storm.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Z.ai codes 1311 and 1113&lt;/strong&gt; (&lt;a href="https://github.com/openclaw/openclaw/pull/49552" rel="noopener noreferrer"&gt;PR #49552&lt;/a&gt;) — Z.ai error 1311 means "model not on your plan" (billing — stop retrying). Error 1113 means "wrong endpoint" (auth — rotate key). Both fell through to &lt;code&gt;null&lt;/code&gt;, got treated as &lt;code&gt;rate_limit&lt;/code&gt;, triggered exponential backoff, and charged for every retry.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Context size&lt;/strong&gt; — Agents accumulate context. A 10-message conversation with tool results can easily hit 40K+ tokens. OpenClaw sends the full context every request, on every retry.&lt;/p&gt;




&lt;h2&gt;
  
  
  ClawRouter: Built for Agentic Workloads
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyptao5w6djyw82uxorcw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyptao5w6djyw82uxorcw.png" alt="ClawRouter proxy manifold sits between OpenClaw and upstream APIs like GPT-4o, Claude Opus, and Gemini — cost control is a gateway concern" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/BlockRunAI/ClawRouter" rel="noopener noreferrer"&gt;ClawRouter&lt;/a&gt; is a local OpenAI-compatible proxy, purpose-built for how AI agents actually behave — not how simple chat clients do. It sits between OpenClaw and the upstream model APIs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OpenClaw → ClawRouter → blockrun.ai → GPT-4o / Opus / Gemini / ...
                ↑
         All the smart stuff happens here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  1. Token Compression — 7 Layers, Agent-Aware
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9p6eqfdmj0ot4edy7rn2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9p6eqfdmj0ot4edy7rn2.png" alt="Seven-layer agent-aware token compression — ClawRouter intercepts and compresses requests through 7 filters for 15-40% overall token reduction" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Agents are the worst offenders for context bloat. Tool call results are verbose. File reads return thousands of lines. Conversation history compounds with every turn.&lt;/p&gt;

&lt;p&gt;ClawRouter compresses every request through 7 layers before it hits the wire:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Saves&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deduplication&lt;/td&gt;
&lt;td&gt;Removes repeated messages (retries, echoes)&lt;/td&gt;
&lt;td&gt;Variable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Whitespace&lt;/td&gt;
&lt;td&gt;Strips excessive whitespace from all content&lt;/td&gt;
&lt;td&gt;2-8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dictionary&lt;/td&gt;
&lt;td&gt;Replaces common phrases with short codes&lt;/td&gt;
&lt;td&gt;5-15%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Path shortening&lt;/td&gt;
&lt;td&gt;Codebook for repeated file paths in tool results&lt;/td&gt;
&lt;td&gt;3-10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON compaction&lt;/td&gt;
&lt;td&gt;Removes whitespace from embedded JSON&lt;/td&gt;
&lt;td&gt;5-12%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Observation compression&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Summarizes tool results to key information&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Up to 97%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dynamic codebook&lt;/td&gt;
&lt;td&gt;Learns repetitions in the actual conversation&lt;/td&gt;
&lt;td&gt;3-15%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Layer 6 is the big one. Tool results — file reads, API responses, shell output — can be 10KB+ each. The actual useful signal is often 200-300 chars. ClawRouter extracts errors, status lines, key JSON fields, and compresses the rest. Same model intelligence, 97% fewer tokens on the bulk.&lt;/p&gt;
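
&lt;p&gt;As a rough sketch of what an observation-compression pass can look like — the signal keywords and line caps below are illustrative assumptions, not ClawRouter's actual filters:&lt;/p&gt;

```typescript
// Illustrative observation-compression pass: keep lines that look like
// signal (errors, statuses) and fall back to the head of the output.
// The keyword list and caps are assumptions, not ClawRouter's filters.
const SIGNAL_HINTS = ["error", "fail", "warning", "status", "exit code"];

function compressObservation(toolOutput: string, maxLines: number = 20): string {
  const lines = toolOutput.split("\n");
  const kept = lines.filter(line => {
    const lower = line.toLowerCase();
    return SIGNAL_HINTS.some(hint => lower.includes(hint));
  });
  if (kept.length === 0) {
    // Nothing signal-like: keep only the head as a cheap summary.
    return lines.slice(0, maxLines).join("\n");
  }
  return kept.slice(0, maxLines).join("\n");
}
```

&lt;p&gt;A 10KB shell transcript with one failing line collapses to that one line; the model still sees the part it needs to act on.&lt;/p&gt;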

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppav05je72ql3x0yf9w8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppav05je72ql3x0yf9w8.png" alt="Extracting intelligence from tool bloat — raw tool output is 97% noise, ClawRouter filters to 3% signal with errors, status lines, and key values" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overall reduction: 15-40% on typical agentic workloads.&lt;/strong&gt; On the $248/day scenario, that's roughly $37-$99/day in savings from compression alone, before any routing changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Automatic Tier Routing — Right Model for Each Request
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabqdm2cfc3fzh6nh6jdl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fabqdm2cfc3fzh6nh6jdl.png" alt="Right-sizing models for specific agent tasks — ClawRouter's task-to-tier routing engine with session pinning routes heartbeats to Flash and reasoning to Opus" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ClawRouter classifies every request before forwarding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;heartbeat status check     →  SIMPLE   →  gemini-2.5-flash      (~0.04¢ / request)
code review, refactor      →  COMPLEX  →  claude-sonnet-4-6      (~5¢ / request)
formal proof, reasoning    →  REASONING →  o3 / claude-opus      (~30¢ / request)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tool detection is automatic.&lt;/strong&gt; When OpenClaw sends a request with tools attached, ClawRouter forces agentic routing tiers — guaranteeing tool-capable models and preventing the silent fallback to models that refuse tool calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session pinning.&lt;/strong&gt; Once a session selects a model for a task, ClawRouter pins that model for the session lifetime. No mid-task model switching, no consistency issues across a long agent run.&lt;/p&gt;

&lt;p&gt;The heartbeat that was burning $248/day on Opus routes to Flash at ~1/500th the cost. Same heartbeat feature, working as designed.&lt;/p&gt;
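
&lt;p&gt;A minimal sketch of this kind of tier classification — the keyword lists and function names here are hypothetical, not ClawRouter's real heuristics:&lt;/p&gt;

```typescript
// Hypothetical tier classifier; hint lists are illustrative only.
type Tier = "SIMPLE" | "COMPLEX" | "REASONING";

interface ChatRequest {
  messages: { role: string; content: string }[];
  tools?: object[];
}

const REASONING_HINTS = ["prove", "theorem", "step by step", "formal"];
const COMPLEX_HINTS = ["refactor", "review", "implement", "debug"];

function classifyTier(req: ChatRequest): Tier {
  // Requests carrying tools are forced onto tool-capable tiers.
  if (req.tools) {
    if (req.tools.length > 0) return "COMPLEX";
  }
  const text = req.messages.map(m => m.content).join(" ").toLowerCase();
  for (const hint of REASONING_HINTS) {
    if (text.includes(hint)) return "REASONING";
  }
  for (const hint of COMPLEX_HINTS) {
    if (text.includes(hint)) return "COMPLEX";
  }
  // Short status-style messages (heartbeats) default to the cheap tier.
  return "SIMPLE";
}
```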

&lt;h3&gt;
  
  
  3. Per-Model Rate Limit Isolation — No Cross-Contamination
&lt;/h3&gt;

&lt;p&gt;When a provider returns 429, ClawRouter marks that specific model as rate-limited for 60 seconds (&lt;a href="https://github.com/openclaw/openclaw/issues/49834" rel="noopener noreferrer"&gt;#49834&lt;/a&gt;). Other models in the fallback chain are unaffected. If Claude Sonnet gets rate-limited, Gemini Flash and GPT-4o continue working. No cascade.&lt;/p&gt;

&lt;p&gt;Before failing over, ClawRouter also retries the rate-limited model once after 200ms. Token-bucket limits often recover within milliseconds — most short-burst 429s resolve on the first retry without ever touching a fallback model.&lt;/p&gt;
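
&lt;p&gt;The cooldown bookkeeping can be sketched like this (class and method names are hypothetical; the 200ms single-retry step is omitted, and time is passed in explicitly to keep the sketch testable):&lt;/p&gt;

```typescript
// Illustrative per-model 429 isolation; not ClawRouter's actual code.
class RateLimitTracker {
  private cooldownUntil: { [model: string]: number } = {};

  constructor(private cooldownMs: number = 60_000) {}

  // A 429 penalizes only the offending model; fallback siblings stay live.
  markRateLimited(model: string, nowMs: number): void {
    this.cooldownUntil[model] = nowMs + this.cooldownMs;
  }

  isAvailable(model: string, nowMs: number): boolean {
    const until = this.cooldownUntil[model];
    if (until === undefined) return true;
    return nowMs >= until;
  }
}
```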

&lt;h3&gt;
  
  
  4. Empty Response Detection — No Silent Failures
&lt;/h3&gt;

&lt;p&gt;ClawRouter inspects every HTTP 200 response body before forwarding it (&lt;a href="https://github.com/openclaw/openclaw/issues/49902" rel="noopener noreferrer"&gt;#49902&lt;/a&gt;). Blank responses, repeated-token loops, and single-character outputs trigger model fallback — the same as a 5xx. The agent never sees a degraded response that would cause it to loop or silently fail.&lt;/p&gt;
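
&lt;p&gt;A heuristic version of that check might look like the following — the repetition threshold is an assumption for illustration, not ClawRouter's actual rule:&lt;/p&gt;

```typescript
// Heuristic degraded-200 detector; thresholds are assumptions.
function isDegradedResponse(content: string): boolean {
  const trimmed = content.trim();
  if (trimmed.length === 0) return true;  // blank body
  if (trimmed.length === 1) return true;  // single-character output
  const tokens = trimmed.split(/\s+/);
  if (tokens.length >= 8) {
    const unique = new Set(tokens);
    // Repeated-token loop: very few distinct tokens across many tokens.
    if (tokens.length >= unique.size * 4) return true;
  }
  return false;
}
```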

&lt;h3&gt;
  
  
  5. Correct Error Classification — No Retry Storms
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ll0znk1613akd9zjekl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ll0znk1613akd9zjekl.png" alt="Stopping retry storms at the HTTP layer — ClawRouter classifies errors per provider with logic gate classifier and automated mechanical actions" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ClawRouter classifies errors at the HTTP/body layer before OpenClaw sees them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;401 / 403              → auth_failure    → stop retrying, rotate key
402 / billing body     → quota_exceeded  → stop retrying, surface alert
429                    → rate_limited    → backoff, try next model
529 / overloaded body  → overloaded      → short cooldown, fallback model
5xx / 520              → server_error    → retry with different model
Z.ai 1311              → billing         → stop retrying
Z.ai 1113              → auth            → rotate key
MiniMax 520 (api_error)→ server_error    → retry with fallback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Per-provider error state is tracked independently. If MiniMax is having a bad hour, Anthropic and OpenAI routes continue working. No cross-contamination, no single provider poisoning the session.&lt;/p&gt;
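
&lt;p&gt;The mapping above, written out as a sketch — the function shape is illustrative, not ClawRouter's real classifier:&lt;/p&gt;

```typescript
// Sketch of HTTP/body error classification following the table above.
type ErrorClass =
  "auth_failure" | "quota_exceeded" | "rate_limited" |
  "overloaded" | "server_error" | "unknown";

function classifyError(status: number, provider: string, body: string): ErrorClass {
  if (status === 401 || status === 403) return "auth_failure";
  if (status === 402) return "quota_exceeded";
  if (status === 429) return "rate_limited";
  if (status === 529) return "overloaded";
  // Provider-specific body codes that generic status checks miss:
  if (provider === "zai") {
    if (body.includes("1311")) return "quota_exceeded"; // billing: stop retrying
    if (body.includes("1113")) return "auth_failure";   // wrong endpoint: rotate key
  }
  if (provider === "minimax") {
    if (body.includes("api_error")) return "server_error"; // transient, e.g. 520
  }
  if (status >= 500) return "server_error";
  return "unknown";
}
```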

&lt;h3&gt;
  
  
  6. Session Memory — Agents That Remember
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foij4tcqhriirf5mb7qod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foij4tcqhriirf5mb7qod.png" alt="Agents that remember without compounding cost — ClawRouter session journaling vs standard OpenClaw context compounding across turns" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenClaw sessions can be long-lived. ClawRouter maintains a session journal — extracting decisions, results, and context from each turn — and injects relevant history when the agent asks questions that reference earlier work.&lt;/p&gt;

&lt;p&gt;Less context repeated = fewer tokens = lower cost. Agents that need to recall earlier decisions don't need to carry the entire history in every prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. x402 Micropayments — Wallet-Based Budget Control
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foppav4kjrk1e3e031kga.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foppav4kjrk1e3e031kga.png" alt="Budget limits enforced by physical construction — wallet loaded via Base/Solana, pay per call across 41+ models, balance hits zero and the valve shuts cleanly" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ClawRouter pays for inference via &lt;a href="https://x402.org/" rel="noopener noreferrer"&gt;x402&lt;/a&gt; USDC micropayments (Base or Solana). You load a wallet. Each inference call costs exactly what it costs. When the wallet runs low, requests stop cleanly.&lt;/p&gt;

&lt;p&gt;There is no monthly invoice and no 3am billing email. There is a wallet balance, and it either has funds or it doesn't. Your budget stops the burn before the damage is done, not an invoice that arrives after it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;maxCostPerRun&lt;/code&gt;&lt;/strong&gt; — a per-session cost ceiling that stops or downgrades requests once a session exceeds a configured threshold (e.g., &lt;code&gt;$0.50&lt;/code&gt;). This closes the remaining gap (&lt;a href="https://github.com/openclaw/openclaw/issues/3181" rel="noopener noreferrer"&gt;#3181&lt;/a&gt;) where a well-funded wallet can still rack up unbounded spend within a single run. Two modes: &lt;code&gt;graceful&lt;/code&gt; (downgrade to cheaper models) and &lt;code&gt;strict&lt;/code&gt; (hard 429 once the cap is hit).&lt;br&gt;
&lt;/p&gt;
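
&lt;p&gt;The cap decision reduces to a few lines — the names below mirror the prose, not a confirmed configuration schema:&lt;/p&gt;

```typescript
// Hypothetical per-session cost-cap decision with two modes.
type CapMode = "graceful" | "strict";
type CapDecision = "allow" | "downgrade" | "reject";

function checkBudget(spentUsd: number, capUsd: number, mode: CapMode): CapDecision {
  if (spentUsd >= capUsd) {
    // strict: hard 429 upstream; graceful: route to cheaper tiers instead.
    return mode === "strict" ? "reject" : "downgrade";
  }
  return "allow";
}
```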

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;41+ models. One wallet. Pay per call.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  OpenClaw + ClawRouter: The Full Picture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3nzslaejj869qfkzt5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr3nzslaejj869qfkzt5g.png" alt="Architecting for production safety — OpenClaw standalone vs OpenClaw + ClawRouter comparison across cost, context, error handling, and budgeting" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;OpenClaw alone&lt;/th&gt;
&lt;th&gt;OpenClaw + ClawRouter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Heartbeat cost overrun&lt;/td&gt;
&lt;td&gt;No per-run cap&lt;/td&gt;
&lt;td&gt;Tier routing → 50-500x cheaper model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large context&lt;/td&gt;
&lt;td&gt;Full context every call&lt;/td&gt;
&lt;td&gt;7-layer compression, 15-40% reduction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool result bloat&lt;/td&gt;
&lt;td&gt;Raw output forwarded&lt;/td&gt;
&lt;td&gt;Observation compression, up to 97%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limit contaminates profile&lt;/td&gt;
&lt;td&gt;All models penalized (#49834)&lt;/td&gt;
&lt;td&gt;Per-model 60s cooldown, others unaffected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Empty / degraded 200 response&lt;/td&gt;
&lt;td&gt;Passed through to agent (#49902)&lt;/td&gt;
&lt;td&gt;Detected, triggers model fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Short-burst 429 failover&lt;/td&gt;
&lt;td&gt;Immediate failover to next model&lt;/td&gt;
&lt;td&gt;200ms retry first, failover only if needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax 520 failure&lt;/td&gt;
&lt;td&gt;Silent drop / retry storm&lt;/td&gt;
&lt;td&gt;Classified as server_error, retried correctly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Z.ai 1311 (billing)&lt;/td&gt;
&lt;td&gt;Treated as rate_limit, retried&lt;/td&gt;
&lt;td&gt;Classified as billing, stopped immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mid-task model switch&lt;/td&gt;
&lt;td&gt;Model can change mid-session&lt;/td&gt;
&lt;td&gt;Session pinning, consistent model per task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly billing surprise&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;td&gt;Wallet-based, stops when empty&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-session cost ceiling&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;maxCostPerRun&lt;/code&gt; — graceful or strict cap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost visibility&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/stats&lt;/code&gt; with per-provider error counts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install with smart routing enabled&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://blockrun.ai/ClawRouter-update | bash
openclaw gateway restart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/BlockRunAI/ClawRouter" rel="noopener noreferrer"&gt;ClawRouter&lt;/a&gt; auto-injects itself into &lt;code&gt;~/.openclaw/openclaw.json&lt;/code&gt; as a provider on startup. No manual config needed — your existing tools, sessions, and extensions are unchanged.&lt;/p&gt;

&lt;p&gt;Load a wallet, choose a model profile (&lt;code&gt;eco&lt;/code&gt; / &lt;code&gt;auto&lt;/code&gt; / &lt;code&gt;premium&lt;/code&gt; / &lt;code&gt;agentic&lt;/code&gt;), and run.&lt;/p&gt;




&lt;h2&gt;
  
  
  On Our OpenClaw Contributions
&lt;/h2&gt;

&lt;p&gt;We contribute upstream when we find bugs. The two PRs linked above fix real error classification gaps. Everyone using OpenClaw directly benefits.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/BlockRunAI/ClawRouter" rel="noopener noreferrer"&gt;ClawRouter&lt;/a&gt; exists because proxy-layer cost control, context compression, and agent-aware routing are fundamentally gateway concerns — not framework concerns. OpenClaw can't know that your heartbeat doesn't need Opus. It can't compress tool results it hasn't seen. It can't enforce a wallet ceiling.&lt;/p&gt;

&lt;p&gt;That's what &lt;a href="https://github.com/BlockRunAI/ClawRouter" rel="noopener noreferrer"&gt;ClawRouter&lt;/a&gt; is for.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/BlockRunAI/ClawRouter" rel="noopener noreferrer"&gt;github.com/BlockRunAI/ClawRouter&lt;/a&gt; · &lt;a href="https://blockrun.ai" rel="noopener noreferrer"&gt;blockrun.ai&lt;/a&gt; · &lt;code&gt;npm install -g @blockrun/clawrouter&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>costoptimization</category>
    </item>
    <item>
      <title>LLM Router Benchmark: 46 Models, 8 Providers, Sub-1ms Routing</title>
      <dc:creator>1bcMax</dc:creator>
      <pubDate>Sat, 21 Mar 2026 03:04:21 +0000</pubDate>
      <link>https://dev.to/1bcmax/llm-router-benchmark-46-models-8-providers-sub-1ms-routing-15ej</link>
      <guid>https://dev.to/1bcmax/llm-router-benchmark-46-models-8-providers-sub-1ms-routing-15ej</guid>
      <description>&lt;p&gt;When you route AI requests across 46 models from 8 providers, you can't just pick the cheapest one. You can't just pick the fastest one either. We learned this the hard way.&lt;/p&gt;

&lt;p&gt;This is the technical story of how we benchmarked every model on our platform, discovered that speed and intelligence are poorly correlated, and built a production routing system that classifies requests in under 1ms using 14 weighted dimensions with sigmoid confidence calibration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: One Gateway, 46 Models, Infinite Wrong Choices
&lt;/h2&gt;

&lt;p&gt;BlockRun is an x402 micropayment gateway. Every LLM request flows through our proxy, gets authenticated via on-chain USDC payment, and is forwarded to the appropriate provider. The payment overhead adds 50-100ms to every request.&lt;/p&gt;

&lt;p&gt;Our users set &lt;code&gt;model: "auto"&lt;/code&gt; and expect us to pick the right model. But "right" means different things for different requests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A "what is Python?" query should route to the cheapest, fastest model&lt;/li&gt;
&lt;li&gt;A "implement a B-tree with concurrent insertions" query needs a capable model&lt;/li&gt;
&lt;li&gt;A "prove this theorem step by step" query needs reasoning capabilities&lt;/li&gt;
&lt;li&gt;An agentic workflow with tool calls needs models that follow instructions precisely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We needed a system that could classify any request and route it to the optimal model in real-time.&lt;/p&gt;
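
&lt;p&gt;The core of such a classifier is small. A toy version of a weighted-dimension score with sigmoid confidence calibration — the dimensions, weights, and bias here are placeholders, not our production values:&lt;/p&gt;

```typescript
// Toy weighted-dimension classifier with sigmoid confidence calibration.
function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

function routeConfidence(features: number[], weights: number[], bias: number): number {
  let score = bias;
  for (let i = 0; i !== features.length; i++) {
    score += features[i] * weights[i];
  }
  return sigmoid(score); // calibrated 0..1 confidence for the tier decision
}
```

&lt;p&gt;The sigmoid squashes an unbounded weighted sum into a probability-like score, which is what makes confidence thresholds comparable across requests.&lt;/p&gt;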

&lt;h2&gt;
  
  
  Step 1: Benchmarking the Fleet
&lt;/h2&gt;

&lt;p&gt;Before building the router, we needed ground truth. We benchmarked all 46 models through our production payment pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Methodology
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Setup:     ClawRouter v0.12.47 proxy on localhost
           → BlockRun x402 gateway (Base EVM chain)
           → Provider APIs (OpenAI, Anthropic, Google, xAI, DeepSeek, Moonshot, MiniMax, NVIDIA, Z.AI)

Prompts:   3 Python coding tasks (IPv4 validation, LCS algorithm, LRU cache)
           2 requests per model per prompt
Config:    256 max tokens, non-streaming, temperature 0.7
Measured:  End-to-end wall clock time (includes x402 payment verification)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a synthetic benchmark. Every measurement includes the full payment-verification round trip that real users experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Latency Landscape
&lt;/h3&gt;

&lt;p&gt;Results revealed a 7x spread between the fastest and slowest models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FAST TIER (&amp;lt;1.5s):
  xai/grok-4-fast           1,143ms   224 tok/s   $0.20/$0.50
  xai/grok-3-mini           1,202ms   215 tok/s   $0.30/$0.50
  google/gemini-2.5-flash   1,238ms   208 tok/s   $0.30/$2.50
  google/gemini-2.5-pro     1,294ms   198 tok/s   $1.25/$10.00
  google/gemini-3-flash     1,398ms   183 tok/s   $0.50/$3.00
  deepseek/deepseek-chat    1,431ms   179 tok/s   $0.28/$0.42

MID TIER (1.5-2.5s):
  google/gemini-3.1-pro     1,609ms   167 tok/s   $2.00/$12.00
  moonshot/kimi-k2.5        1,646ms   156 tok/s   $0.60/$3.00
  anthropic/claude-sonnet   2,110ms   121 tok/s   $3.00/$15.00
  anthropic/claude-opus     2,139ms   120 tok/s   $5.00/$25.00
  openai/o3-mini            2,260ms   114 tok/s   $1.10/$4.40

SLOW TIER (&amp;gt;3s):
  openai/gpt-5.2-pro        3,546ms    73 tok/s   $21.00/$168.00
  openai/gpt-4o             5,378ms    48 tok/s   $2.50/$10.00
  openai/gpt-5.4            6,213ms    41 tok/s   $2.50/$15.00
  openai/gpt-5.3-codex      7,935ms    32 tok/s   $1.75/$14.00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two clear patterns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Google and xAI dominate speed.&lt;/strong&gt; 11 of the top 13 fastest models are from Google or xAI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI flagship models are consistently slow.&lt;/strong&gt; Every GPT-5.x model takes 3-8 seconds. Even their cheapest models (GPT-4.1-nano at $0.10/$0.40) are 2x slower than Google's cheapest.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 2: Adding the Quality Dimension
&lt;/h2&gt;

&lt;p&gt;Speed alone tells you nothing about whether a model can actually handle your request. We cross-referenced our latency data with Artificial Analysis Intelligence Index v4.0 scores (composite of GPQA, MMLU, MATH, HumanEval, and other benchmarks):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MODEL                       LATENCY    IQ    $/M INPUT
─────────────────────────────────────────────────────
google/gemini-3.1-pro       1,609ms    57    $2.00    ← SWEET SPOT
openai/gpt-5.4              6,213ms    57    $2.50
openai/gpt-5.3-codex        7,935ms    54    $1.75
anthropic/claude-opus-4.6   2,139ms    53    $5.00
anthropic/claude-sonnet-4.6 2,110ms    52    $3.00
google/gemini-3-pro-prev    1,352ms    48    $2.00
moonshot/kimi-k2.5          1,646ms    47    $0.60
google/gemini-3-flash-prev  1,398ms    46    $0.50    ← VALUE SWEET SPOT
xai/grok-4                  1,348ms    41    $0.20
xai/grok-4.1-fast           1,244ms    41    $0.20
deepseek/deepseek-chat      1,431ms    32    $0.28
xai/grok-4-fast             1,143ms    23    $0.20
google/gemini-2.5-flash     1,238ms    20    $0.30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Efficiency Frontier
&lt;/h3&gt;

&lt;p&gt;Plotting IQ against latency reveals a clear efficiency frontier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IQ
57 |  Gem3.1Pro ·························· GPT-5.4
   |
53 |                    · Opus
52 |                   · Sonnet
   |
48 |  Gem3Pro ·
47 |   · Kimi
46 |  Gem3Flash ·
   |
41 |  Grok4 ·
   |
32 | Grok3 · · DeepSeek
   |
23 | GrokFast ·
20 | GemFlash ·
   └──────────────────────────────────────────────
     1.0   1.5   2.0   2.5   3.0        6.0  8.0
                 End-to-End Latency (seconds)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The frontier runs from Gemini 2.5 Flash (IQ 20, 1.2s) up to Gemini 3.1 Pro (IQ 57, 1.6s). Everything below and to the right of this line is dominated — you can get equal or better quality at equal or lower latency from a different model.&lt;/p&gt;

&lt;p&gt;Key insight: &lt;strong&gt;Gemini 3.1 Pro matches GPT-5.4's IQ at 1/4 the latency and lower cost.&lt;/strong&gt; Claude Sonnet 4.6 nearly matches Opus 4.6 quality at 60% of the price. These dominated pairings directly informed our routing fallback chains.&lt;/p&gt;
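
&lt;p&gt;The dominance check is mechanical. A minimal sketch, with latency/IQ pairs transcribed from the table above (the &lt;code&gt;dominated&lt;/code&gt; helper is illustrative, not ClawRouter's code):&lt;/p&gt;

```python
# Sketch: find the latency/IQ-dominated models from the table above.
# A model is dominated if some other model has equal-or-better IQ at
# equal-or-lower latency, and is strictly better on at least one axis.
MODELS = {
    "google/gemini-3.1-pro": (1609, 57),
    "openai/gpt-5.4": (6213, 57),
    "anthropic/claude-opus-4.6": (2139, 53),
    "anthropic/claude-sonnet-4.6": (2110, 52),
    "google/gemini-3-flash-prev": (1398, 46),
    "xai/grok-4.1-fast": (1244, 41),
    "google/gemini-2.5-flash": (1238, 20),
}

def dominated(name):
    lat, iq = MODELS[name]
    return any(
        l2 <= lat and iq2 >= iq and (l2 < lat or iq2 > iq)
        for other, (l2, iq2) in MODELS.items()
        if other != name
    )

frontier = sorted(m for m in MODELS if not dominated(m))
# gpt-5.4, opus, and sonnet drop out; the Gemini models and grok-4.1-fast remain
```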

&lt;h2&gt;
  
  
  Step 3: The Failed Experiment (Latency-First Routing)
&lt;/h2&gt;

&lt;p&gt;Armed with benchmark data, we initially optimized for speed. The routing config promoted fast models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// v0.12.47 — latency-optimized (REVERTED)&lt;/span&gt;
&lt;span class="nx"&gt;COMPLEX&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;xai/grok-4-0709&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// 1,348ms, IQ 41&lt;/span&gt;
  &lt;span class="nx"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;xai/grok-4-1-fast-non-reasoning&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// 1,244ms, IQ 41&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;google/gemini-2.5-flash&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// 1,238ms, IQ 20&lt;/span&gt;
    &lt;span class="c1"&gt;// ... fast models first&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users complained within 24 hours. The fast models were refusing complex tasks and giving shallow responses. A model with IQ 41 can't reliably handle architecture design or multi-step code generation, no matter how fast it is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson: optimizing for a single metric in a multi-objective system creates failure modes.&lt;/strong&gt; We needed to optimize across speed, quality, and cost simultaneously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: The 14-Dimension Scoring System
&lt;/h2&gt;

&lt;p&gt;The router needs to determine what kind of request it's looking at before selecting a model. We built a rule-based classifier that scores requests across 14 weighted dimensions:&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Prompt → Lowercase + Tokenize
                    ↓
            ┌──────────────────────────────────┐
            │   14 Dimension Scorers           │
            │   Each returns score ∈ [-1, 1]   │
            └──────┬───────────────────────────┘
                   ↓
            Weighted Sum (configurable weights)
                   ↓
            Tier Boundaries (SIMPLE &amp;lt; 0.0 &amp;lt; MEDIUM &amp;lt; 0.3 &amp;lt; COMPLEX &amp;lt; 0.5 &amp;lt; REASONING)
                   ↓
            Sigmoid Confidence Calibration
                   ↓
            confidence &amp;lt; 0.7 → AMBIGUOUS → default to MEDIUM
            confidence ≥ 0.7 → Classified tier
                   ↓
            Tier × Profile → Model Selection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The 14 Dimensions
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;th&gt;What It Detects&lt;/th&gt;
&lt;th&gt;Score Range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;reasoningMarkers&lt;/td&gt;
&lt;td&gt;0.18&lt;/td&gt;
&lt;td&gt;"prove", "theorem", "step by step"&lt;/td&gt;
&lt;td&gt;0 to 1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;codePresence&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;"function", "class", "import",&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;br&gt;
 ``` | 0 to 1.0 |&lt;br&gt;
| multiStepPatterns | 0.12 | "first...then", "step N", numbered lists | 0 or 0.5 |&lt;br&gt;
| technicalTerms | 0.10 | "algorithm", "kubernetes", "distributed" | 0 to 1.0 |&lt;br&gt;
| tokenCount | 0.08 | Short (&amp;lt;50 tokens) vs long (&amp;gt;500 tokens) | -1.0 to 1.0 |&lt;br&gt;
| creativeMarkers | 0.05 | "story", "poem", "brainstorm" | 0 to 0.7 |&lt;br&gt;
| questionComplexity | 0.05 | Number of question marks (&amp;gt;3 = complex) | 0 or 0.5 |&lt;br&gt;
| agenticTask | 0.04 | "edit", "deploy", "fix", "debug" | 0 to 1.0 |&lt;br&gt;
| constraintCount | 0.04 | "at most", "within", "O()" | 0 to 0.7 |&lt;br&gt;
| imperativeVerbs | 0.03 | "build", "create", "implement" | 0 to 0.5 |&lt;br&gt;
| outputFormat | 0.03 | "json", "yaml", "table", "csv" | 0 to 0.7 |&lt;br&gt;
| simpleIndicators | 0.02 | "what is", "hello", "define" | 0 to -1.0 |&lt;br&gt;
| referenceComplexity | 0.02 | "the code above", "the API docs" | 0 to 0.5 |&lt;br&gt;
| domainSpecificity | 0.02 | "quantum", "FPGA", "genomics" | 0 to 0.8 |&lt;/p&gt;

&lt;p&gt;Weights sum to 1.0. The weighted score maps to a continuous axis where tier boundaries partition the space.&lt;/p&gt;
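
&lt;p&gt;A minimal sketch of the weighted sum and tier mapping, assuming keyword-presence scorers (the real scorers are richer, and only four of the 14 dimensions are shown; weights and boundaries come from the table and diagram above):&lt;/p&gt;

```python
# Simplified sketch of the weighted-sum classifier. Each dimension scorer is
# reduced to a keyword-presence check; only 4 of the 14 dimensions are shown.
WEIGHTS = {
    "reasoningMarkers": 0.18,
    "codePresence": 0.15,
    "multiStepPatterns": 0.12,
    "technicalTerms": 0.10,
}
KEYWORDS = {
    "reasoningMarkers": ["prove", "theorem", "step by step"],
    "codePresence": ["function", "class", "import"],
    "multiStepPatterns": ["first", "then"],
    "technicalTerms": ["algorithm", "kubernetes", "distributed"],
}

def classify(prompt):
    text = prompt.lower()
    score = sum(
        weight * (1.0 if any(k in text for k in KEYWORDS[dim]) else 0.0)
        for dim, weight in WEIGHTS.items()
    )
    # Boundaries from the diagram: SIMPLE below 0.0, then MEDIUM, COMPLEX, REASONING
    if score < 0.0:
        return "SIMPLE"
    if score < 0.3:
        return "MEDIUM"
    if score < 0.5:
        return "COMPLEX"
    return "REASONING"
```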

&lt;h3&gt;
  
  
  Multilingual Support
&lt;/h3&gt;

&lt;p&gt;Every keyword list includes translations in 9 languages (EN, ZH, JA, RU, DE, ES, PT, KO, AR). A Chinese user asking "证明这个定理" triggers the same reasoning classification as "prove this theorem."&lt;/p&gt;

&lt;h3&gt;
  
  
  Confidence Calibration
&lt;/h3&gt;

&lt;p&gt;Raw tier assignments can be ambiguous when a score falls near a boundary. We use sigmoid calibration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
python
confidence = 1 / (1 + exp(-steepness * distance_from_boundary))


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Where &lt;code&gt;steepness = 12&lt;/code&gt; and &lt;code&gt;distance_from_boundary&lt;/code&gt; is the score's distance to the nearest tier boundary. This maps to a [0.5, 1.0] confidence range. Below &lt;code&gt;threshold = 0.7&lt;/code&gt;, the request is classified as ambiguous and defaults to MEDIUM.&lt;/p&gt;
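
&lt;p&gt;As a sketch (boundary values taken from the pipeline diagram; this is illustrative, not the shipped implementation):&lt;/p&gt;

```python
import math

# Sketch of the confidence calibration. distance_from_boundary is always
# non-negative, so confidence lands in [0.5, 1.0): exactly 0.5 on a boundary,
# approaching 1.0 as the score moves away from it.
STEEPNESS = 12
THRESHOLD = 0.7
BOUNDARIES = [0.0, 0.3, 0.5]  # tier boundaries from the pipeline diagram

def confidence(score):
    distance = min(abs(score - b) for b in BOUNDARIES)
    return 1 / (1 + math.exp(-STEEPNESS * distance))

def is_ambiguous(score):
    # Below the 0.7 threshold, the request defaults to the MEDIUM tier.
    return confidence(score) < THRESHOLD
```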

&lt;h3&gt;
  
  
  Agentic Detection
&lt;/h3&gt;

&lt;p&gt;A separate scoring pathway detects agentic tasks (multi-step, tool-using, iterative). When &lt;code&gt;agenticScore &amp;gt;= 0.5&lt;/code&gt;, the router switches to agentic-optimized tier configs that prefer models with strong instruction following (Claude Sonnet for complex tasks, GPT-4o-mini for simple tool calls).&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Tier-to-Model Mapping
&lt;/h2&gt;

&lt;p&gt;Once a request is classified into a tier, the router selects from 4 routing profiles:&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto Profile (Default)
&lt;/h3&gt;

&lt;p&gt;Tuned from our benchmark data + user retention metrics:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
plaintext
SIMPLE  → gemini-2.5-flash (1,238ms, IQ 20, 60% retention)
MEDIUM  → kimi-k2.5 (1,646ms, IQ 47, strong tool use)
COMPLEX → gemini-3.1-pro (1,609ms, IQ 57, fastest flagship)
REASON  → grok-4-1-fast-reasoning (1,454ms, $0.20/$0.50)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Eco Profile
&lt;/h3&gt;

&lt;p&gt;Ultra cost-optimized. Uses free/near-free models:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
plaintext
SIMPLE  → nvidia/gpt-oss-120b (FREE)
MEDIUM  → gemini-2.5-flash-lite ($0.10/$0.40, 1M context)
COMPLEX → gemini-2.5-flash-lite ($0.10/$0.40)
REASON  → grok-4-1-fast-reasoning ($0.20/$0.50)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Premium Profile
&lt;/h3&gt;

&lt;p&gt;Best quality regardless of cost:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
plaintext
SIMPLE  → kimi-k2.5 ($0.60/$3.00)
MEDIUM  → gpt-5.3-codex ($1.75/$14.00, 400K context)
COMPLEX → claude-opus-4.6 ($5.00/$25.00)
REASON  → claude-sonnet-4.6 ($3.00/$15.00)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Fallback Chains
&lt;/h3&gt;

&lt;p&gt;Each tier config includes an ordered fallback list. When the primary model returns a 402 (payment failed), 429 (rate limited), or 5xx, the proxy walks the fallback chain. Fallback ordering is benchmark-informed:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
typescript
// COMPLEX tier — quality-first fallback order
fallback: [
  "google/gemini-3-pro-preview",      // IQ 48, 1,352ms
  "google/gemini-3-flash-preview",     // IQ 46, 1,398ms
  "xai/grok-4-0709",                   // IQ 41, 1,348ms
  "google/gemini-2.5-pro",             // 1,294ms
  "anthropic/claude-sonnet-4.6",       // IQ 52, 2,110ms
  "deepseek/deepseek-chat",            // IQ 32, 1,431ms
  "google/gemini-2.5-flash",           // IQ 20, 1,238ms
  "openai/gpt-5.4",                    // IQ 57, 6,213ms — last resort
]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The chain descends by quality first (IQ 48 → 46 → 41), then trades quality for speed. GPT-5.4 is last despite having IQ 57, because its 6.2s latency is a worst-case user experience.&lt;/p&gt;
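
&lt;p&gt;A minimal sketch of how a proxy can walk such a chain, retrying only on 402/429/5xx and surfacing the last error if the chain is exhausted (the &lt;code&gt;route&lt;/code&gt; helper and its signature are hypothetical):&lt;/p&gt;

```python
# Hypothetical sketch of the fallback walk. call_model is a provider client
# returning (status, response); the proxy advances only on retryable statuses.
def is_retryable(status):
    return status in (402, 429) or 500 <= status <= 599

def route(call_model, primary, fallbacks):
    last_status = None
    for model in [primary] + fallbacks:
        status, response = call_model(model)
        if not is_retryable(status):
            return model, status, response
        last_status = status
    # Chain exhausted: surface the last upstream error rather than nothing.
    return None, last_status, None
```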

&lt;h2&gt;
  
  
  Step 6: Context-Aware Filtering
&lt;/h2&gt;

&lt;p&gt;The fallback chain is filtered at runtime based on request properties:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Context window filtering&lt;/strong&gt;: Models with insufficient context window for the estimated total tokens are excluded (with 10% safety buffer)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calling filter&lt;/strong&gt;: When the request includes tool definitions, only models that support function calling are kept&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision filter&lt;/strong&gt;: When the request includes images, only vision-capable models are kept&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If filtering eliminates all candidates, the full chain is used as a fallback (better to let the API error than return nothing).&lt;/p&gt;
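
&lt;p&gt;A sketch of these three filters in order, with illustrative capability metadata (the context windows and flags below are assumptions, not ClawRouter's actual model registry):&lt;/p&gt;

```python
# Sketch of the runtime filters. The capability metadata below is illustrative.
MODEL_INFO = {
    "google/gemini-3.1-pro": {"context": 1_000_000, "tools": True, "vision": True},
    "anthropic/claude-sonnet-4.6": {"context": 200_000, "tools": True, "vision": True},
    "deepseek/deepseek-chat": {"context": 64_000, "tools": True, "vision": False},
}

def filter_chain(chain, estimated_tokens, needs_tools=False, needs_vision=False):
    budget = estimated_tokens * 1.10  # 10% safety buffer
    kept = [
        m for m in chain
        if MODEL_INFO[m]["context"] >= budget
        and (not needs_tools or MODEL_INFO[m]["tools"])
        and (not needs_vision or MODEL_INFO[m]["vision"])
    ]
    # If every candidate is filtered out, keep the full chain and let the
    # upstream API return the error instead of failing silently.
    return kept or list(chain)
```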

&lt;h2&gt;
  
  
  Cost Calculation and Savings
&lt;/h2&gt;

&lt;p&gt;Every routing decision includes a cost estimate and savings percentage against a baseline (Claude Opus 4.6 pricing):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
typescript
savings = max(0, (opusCost - routedCost) / opusCost)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a typical SIMPLE request (500 input tokens, 256 output tokens):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Opus cost: $0.0089 (at $5.00/$25.00 per 1M tokens)&lt;/li&gt;
&lt;li&gt;Gemini Flash cost: $0.0008 (at $0.30/$2.50 per 1M tokens)&lt;/li&gt;
&lt;li&gt;Savings: 91.0%&lt;/li&gt;
&lt;/ul&gt;
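
&lt;p&gt;The arithmetic above can be checked in a few lines (prices taken from the example):&lt;/p&gt;

```python
# Reproducing the SIMPLE-request example: 500 input / 256 output tokens,
# Opus 4.6 at $5.00/$25.00 per 1M tokens vs Gemini 2.5 Flash at $0.30/$2.50.
def request_cost(tokens_in, tokens_out, price_in, price_out):
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

opus_cost = request_cost(500, 256, 5.00, 25.00)    # 0.0089
routed_cost = request_cost(500, 256, 0.30, 2.50)   # ~0.0008
savings = max(0, (opus_cost - routed_cost) / opus_cost)  # ~0.91
```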

&lt;p&gt;Across our user base, the median savings rate is 85% compared to routing everything to a premium model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;The entire classification pipeline (14 dimensions + tier mapping + model selection) runs in under 1ms. No external API calls. No LLM inference. Pure keyword matching and arithmetic.&lt;/p&gt;

&lt;p&gt;We originally designed a two-stage system where low-confidence rules-based classifications would fall back to an LLM classifier (Gemini 2.5 Flash). In practice, the rules handle 70-80% of requests with high confidence, and the remaining ambiguous cases default to MEDIUM — which is the correct conservative choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Speed and intelligence are weakly correlated.&lt;/strong&gt; The fastest model (Grok 4 Fast, IQ 23) sits at the bottom of the quality scale, and the smartest low-latency model (Gemini 3.1 Pro, IQ 57, 1.6s) comes from Google, not OpenAI.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optimizing for one metric fails.&lt;/strong&gt; Latency-first routing breaks quality. Quality-first routing breaks latency budgets. You need multi-objective optimization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;User retention is the real metric.&lt;/strong&gt; Our best-performing model for SIMPLE tasks isn't the cheapest or the fastest — it's Gemini 2.5 Flash (60% retention rate), which balances speed, cost, and just-enough quality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fallback ordering matters more than primary selection.&lt;/strong&gt; The primary model handles the happy path. The fallback chain handles reality — rate limits, outages, payment failures. A well-ordered fallback chain is more important than picking the perfect primary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rule-based classification is underrated.&lt;/strong&gt; 14 keyword dimensions with sigmoid confidence calibration handles 70-80% of requests correctly in &amp;lt;1ms. The remaining 20-30% default to a safe middle tier. For a routing system where every millisecond of overhead compounds across millions of requests, avoiding LLM inference in the classification step is worth the reduced accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Appendix: Full Benchmark Data
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Raw data (46 models, latency, throughput, IQ scores, pricing): &lt;a href="https://github.com/BlockRunAI/ClawRouter/blob/main/benchmark-merged.json" rel="noopener noreferrer"&gt;benchmark-merged.json&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Routing configuration: &lt;a href="https://github.com/BlockRunAI/ClawRouter/blob/main/src/router/config.ts" rel="noopener noreferrer"&gt;src/router/config.ts&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Scoring implementation: &lt;a href="https://github.com/BlockRunAI/ClawRouter/blob/main/src/router/rules.ts" rel="noopener noreferrer"&gt;src/router/rules.ts&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
      <category>devops</category>
    </item>
    <item>
      <title>X/Twitter Algorithm: How AI Agents Can Hack Organic Reach</title>
      <dc:creator>1bcMax</dc:creator>
      <pubDate>Sat, 21 Mar 2026 02:43:30 +0000</pubDate>
      <link>https://dev.to/1bcmax/xtwitter-algorithm-how-ai-agents-can-hack-organic-reach-1fk</link>
      <guid>https://dev.to/1bcmax/xtwitter-algorithm-how-ai-agents-can-hack-organic-reach-1fk</guid>
      <description>&lt;p&gt;Most people blame the algorithm when their content doesn't perform. "The algo buried me." "Reach is dead." "Only paid promotion works now."&lt;/p&gt;

&lt;p&gt;They're wrong. The algorithm isn't hiding your content — it's optimizing for attention. Learn to speak its language and it becomes the most powerful distribution system ever built.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Algorithm Actually Wants
&lt;/h2&gt;

&lt;p&gt;Every platform optimizes for the same thing: time on platform. The algorithm promotes content that keeps people scrolling, clicking, watching.&lt;/p&gt;

&lt;p&gt;Which means: it's not personal. It's math. And math can be reverse-engineered.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Signals That Matter
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Early engagement velocity.&lt;/strong&gt; How fast do people interact after you post? First 30 minutes matter most. The algorithm uses early signals to predict total reach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dwell time.&lt;/strong&gt; Do people stop scrolling and actually read? Long-form text, compelling hooks, narrative tension — these increase time-on-post, which signals quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reply ratio.&lt;/strong&gt; Comments beat likes. Replies beat retweets. The algorithm weighs active engagement (typing) over passive engagement (clicking).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Save and share.&lt;/strong&gt; When someone bookmarks or DMs your post, that's signal gold. It means the content has reference value beyond the feed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reverse Engineering Virality
&lt;/h2&gt;

&lt;p&gt;Look at what's already working in your niche. Not to copy — to decode. What format? What hook pattern? What time of day? What triggers replies?&lt;/p&gt;

&lt;p&gt;The algorithm isn't random. It's a feedback loop. Give it what it rewards.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Meta Move
&lt;/h2&gt;

&lt;p&gt;Most creators fight the algorithm. Smart creators partner with it. They treat reach like engineering, not art. Test, measure, iterate.&lt;/p&gt;

&lt;p&gt;Your content might be brilliant. But if it doesn't trigger the right signals in the first 30 minutes, brilliance is irrelevant.&lt;/p&gt;

&lt;p&gt;The algorithm isn't your enemy. It's just math looking for signals. Give it the signals it wants, and it becomes the best distribution partner you've ever had.&lt;/p&gt;

&lt;p&gt;Learn the cheat code.&lt;/p&gt;

</description>
      <category>twitter</category>
      <category>algorithms</category>
      <category>ai</category>
      <category>growth</category>
    </item>
    <item>
      <title>MiniMax M2.7 Is Live on BlockRun — The First Self-Evolving Reasoning Model</title>
      <dc:creator>1bcMax</dc:creator>
      <pubDate>Sat, 21 Mar 2026 02:37:44 +0000</pubDate>
      <link>https://dev.to/1bcmax/minimax-m27-is-live-on-blockrun-the-first-self-evolving-reasoning-model-k34</link>
      <guid>https://dev.to/1bcmax/minimax-m27-is-live-on-blockrun-the-first-self-evolving-reasoning-model-k34</guid>
      <description>&lt;p&gt;MiniMax just dropped &lt;strong&gt;M2.7&lt;/strong&gt; — and it's live on BlockRun right now.&lt;/p&gt;

&lt;p&gt;One API call. Pay per request. No subscription. No API key signup with MiniMax.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://blockrun.ai/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "minimax/minimax-m2.7",
    "messages": [{"role": "user", "content": "Hello"}]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're still calling &lt;code&gt;minimax/minimax-m2.5&lt;/code&gt;, it auto-redirects to M2.7. No code changes needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Makes M2.7 Different
&lt;/h2&gt;

&lt;p&gt;M2.7 is the first model MiniMax describes as &lt;strong&gt;deeply participating in its own evolution&lt;/strong&gt;. It doesn't just run agent tasks — it builds and optimizes its own agent harnesses through recursive self-improvement loops.&lt;/p&gt;

&lt;p&gt;In practice, that means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;97% skill adherence&lt;/strong&gt; across 40+ complex skills (each exceeding 2,000 tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;30% performance gains&lt;/strong&gt; from recursive harness optimization over 100+ iteration cycles&lt;/li&gt;
&lt;li&gt;Handles &lt;strong&gt;30–50% of research workflows&lt;/strong&gt; autonomously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a chatbot upgrade. It's a model that gets better at being an agent the more you use it as one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmarks That Matter
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2y5h9n28m5jugzzthd9a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2y5h9n28m5jugzzthd9a.png" alt="MiniMax M2.7 benchmarks vs Sonnet 4.6, Opus 4.6, Gemini 3.1 Pro, GPT-5.4 across SWE Bench Pro, Multi-SWE Bench, VIBE-Pro, MLE-Bench Lite, GDPval-AA, Toolathon, MM-ClawBench, and Artificial Analysis"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Software Engineering
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;M2.7&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SWE-Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;56.22%&lt;/td&gt;
&lt;td&gt;Matches GPT-5.3-Codex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VIBE-Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;55.6%&lt;/td&gt;
&lt;td&gt;End-to-end project delivery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Terminal Bench 2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;57.0%&lt;/td&gt;
&lt;td&gt;Complex engineering systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SWE Multilingual&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;76.5&lt;/td&gt;
&lt;td&gt;Cross-language code tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi SWE Bench&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;52.7&lt;/td&gt;
&lt;td&gt;Multi-repo engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Machine Learning &amp;amp; Research
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;MLE Bench Lite&lt;/strong&gt; (22 Kaggle-style competitions): &lt;strong&gt;66.6% average medal rate&lt;/strong&gt; — second only to Opus 4.6 (75.7%) and GPT-5.4 (71.2%). Best single run: 9 gold, 5 silver, 1 bronze.&lt;/p&gt;

&lt;h3&gt;
  
  
  Professional Productivity
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;M2.7&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GDPval-AA ELO&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1495&lt;/td&gt;
&lt;td&gt;Highest among open-source models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Toolathon&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;46.3%&lt;/td&gt;
&lt;td&gt;Tool use accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MM Claw&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;62.7%&lt;/td&gt;
&lt;td&gt;Near Sonnet 4.6 level&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Production debugging benchmarks show &lt;strong&gt;incident recovery time under 3 minutes&lt;/strong&gt; — SRE-level decision-making for log analysis, security audits, and system comprehension.&lt;/p&gt;




&lt;h2&gt;
  
  
  New in M2.7 vs M2.5
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Native Agent Teams&lt;/strong&gt; — multi-agent collaboration built into the model, not bolted on&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive self-improvement&lt;/strong&gt; — the model optimizes its own harnesses over iteration cycles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Character consistency&lt;/strong&gt; — dramatically improved emotional intelligence for interactive apps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Financial analysis&lt;/strong&gt; — deep reasoning over complex financial documents and reports&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Pricing on BlockRun
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Input&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.30 / 1M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1.20 / 1M tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context window&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;204,800 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's &lt;strong&gt;roughly 20x cheaper than Claude Opus&lt;/strong&gt; and &lt;strong&gt;12x cheaper than GPT-5.4&lt;/strong&gt; for output tokens — while matching their engineering benchmarks.&lt;/p&gt;

&lt;p&gt;Pay per request with USDC on Base. No API key. No subscription. No minimum spend.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Now
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Direct API:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;https://blockrun.ai/v1/chat/completions
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python SDK:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;blockrun&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BlockRun&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BlockRun&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minimax/minimax-m2.7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain this codebase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;TypeScript SDK:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;BlockRun&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;blockrun&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BlockRun&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;minimax/minimax-m2.7&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Explain this codebase&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ClawRouter&lt;/strong&gt; (drop-in OpenAI replacement for any framework):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://blockrun.ai/v1
&lt;span class="c"&gt;# Works with OpenClaw, LangChain, CrewAI, AutoGen — any OpenAI-compatible client&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;Read the full announcement from MiniMax: &lt;a href="https://www.minimax.io/news/minimax-m27-en" rel="noopener noreferrer"&gt;MiniMax M2.7 — Beginning the Journey of Recursive Self-Improvement&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>openai</category>
    </item>
    <item>
      <title>x402 Protocol: How AI Agents Pay for APIs Without Human Intervention</title>
      <dc:creator>1bcMax</dc:creator>
      <pubDate>Sat, 21 Mar 2026 02:37:43 +0000</pubDate>
      <link>https://dev.to/1bcmax/x402-protocol-how-ai-agents-pay-for-apis-without-human-intervention-2j6h</link>
      <guid>https://dev.to/1bcmax/x402-protocol-how-ai-agents-pay-for-apis-without-human-intervention-2j6h</guid>
      <description>&lt;p&gt;We built the internet for humans. Browsers, buttons, billing forms. It assumes a person is on the other end.&lt;/p&gt;

&lt;p&gt;But now AI agents are the ones browsing. Reading. Executing. And they can't enter their credit card into a checkout form. They need to call APIs. And those APIs need to get paid.&lt;/p&gt;

&lt;p&gt;This is where things break.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Subscription Trap
&lt;/h2&gt;

&lt;p&gt;Most APIs charge monthly. $20/month. $99/month. $500/month. The model assumes you're a company with a billing department, running a predictable workload.&lt;/p&gt;

&lt;p&gt;But an autonomous agent doesn't have "monthly." It has "right now." It needs to call this API, get this data, generate this image — once, immediately, and move on.&lt;/p&gt;

&lt;p&gt;Subscriptions force agents into a human billing model that doesn't fit their usage pattern. Result: agents either overpay (subscribed to services they rarely use) or can't access services at all (no budget for the subscription).&lt;/p&gt;

&lt;h2&gt;
  
  
  The API Key Nightmare
&lt;/h2&gt;

&lt;p&gt;To use 10 services, you need 10 API keys. Each key requires: an account, email verification, billing setup, rate limit management, key rotation, secret storage.&lt;/p&gt;

&lt;p&gt;For a developer building one product? Annoying but manageable.&lt;/p&gt;

&lt;p&gt;For an autonomous agent that needs to discover and use services dynamically? Impossible.&lt;/p&gt;

&lt;p&gt;The agent can't create accounts. It can't fill out forms. It needs a wallet it can hold, and services that accept that wallet directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;The agent holds a wallet. The service publishes a price. The agent calls the service. The payment happens atomically — $0.001 for this request, settled instantly, no subscription, no account.&lt;/p&gt;

&lt;p&gt;This is what x402 enables. HTTP payments. Service returns 402 Payment Required. Agent signs payment. Service delivers. Done.&lt;/p&gt;

&lt;p&gt;No billing dashboard. No monthly invoice. No API key management. Just wallets paying for compute, per request, at internet speed.&lt;/p&gt;
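
&lt;p&gt;In sketch form, the request flow looks roughly like this (the header and field names are illustrative assumptions, not the exact x402 wire format; &lt;code&gt;transport&lt;/code&gt; and &lt;code&gt;sign_payment&lt;/code&gt; stand in for an HTTP client and the agent's wallet):&lt;/p&gt;

```python
# Illustrative x402 flow: call once, and if the service answers
# 402 Payment Required, retry the same request with a signed payment attached.
def call_with_x402(transport, url, sign_payment):
    status, headers, body = transport(url, {})
    if status == 402:
        # The 402 response advertises the price and the pay-to address.
        payment = sign_payment(headers["price"], headers["pay-to"])
        status, headers, body = transport(url, {"X-PAYMENT": payment})
    return status, body
```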

&lt;h2&gt;
  
  
  The New Default
&lt;/h2&gt;

&lt;p&gt;The agent economy needs new infrastructure. Not "traditional payments with a crypto wrapper." Native machine-to-machine commerce.&lt;/p&gt;

&lt;p&gt;Wallet = identity. Request = payment. No humans required.&lt;/p&gt;

&lt;p&gt;This is what we're building at BlockRun. The payment and discovery layer for agents that need to find, evaluate, and pay for services — autonomously.&lt;/p&gt;

&lt;p&gt;The internet wasn't built for machines to pay. So we're building the layer that is.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>blockchain</category>
      <category>web3</category>
      <category>agents</category>
    </item>
    <item>
      <title>AI API Cost Control: How x402 Prevents $47K Budget Overruns</title>
      <dc:creator>1bcMax</dc:creator>
      <pubDate>Sat, 21 Mar 2026 02:27:06 +0000</pubDate>
      <link>https://dev.to/1bcmax/ai-api-cost-control-how-x402-prevents-47k-budget-overruns-4do2</link>
      <guid>https://dev.to/1bcmax/ai-api-cost-control-how-x402-prevents-47k-budget-overruns-4do2</guid>
      <description>&lt;p&gt;A multi-agent system mistakenly burned $47,000+ in API costs. No hacker. No breach. Just bad infrastructure controls.&lt;/p&gt;

&lt;p&gt;Two AI agents were stuck in a recursive loop for 11 days, each one asking the other for clarification, each one convinced it was making progress. Nobody noticed until the invoice arrived.&lt;/p&gt;

&lt;p&gt;If you're building with LLMs today, this is not an edge case; it's a failure mode many teams will eventually hit. It's known as the agent loop problem, and it points to a deeper gap in AI infrastructure.&lt;/p&gt;

&lt;p&gt;These agents were handed API keys — the equivalent of giving them corporate credit cards — with no real-time spending governance. When the loop started, nothing existed at the infrastructure layer to stop it.&lt;/p&gt;

&lt;p&gt;The good news: Edge &amp;amp; Node has built an open-source system called &lt;a href="https://ampersend.ai" rel="noopener noreferrer"&gt;ampersend&lt;/a&gt; that makes this type of failure impossible.&lt;/p&gt;

&lt;p&gt;With ampersend, every LLM call becomes a real USDC payment with spending limits enforced at the wallet level instead of application code. When the agent's budget runs out, the agent stops spending money — even if the code keeps running.&lt;/p&gt;

&lt;h2&gt;Agent Loops Are an Infrastructure Problem, Not a Code Problem&lt;/h2&gt;

&lt;p&gt;Most teams building agent systems know the usual advice: add step limits, set token caps, monitor for repeated outputs. These are sound practices, but they're not enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step limits don't survive composition.&lt;/strong&gt; Agent A calls Agent B, which calls Agent C. Step limits are local to each agent. If each agent is allowed 50 steps, the system can easily execute 150 total. When recursive calls are involved, costs compound quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token caps are estimates, not enforcement.&lt;/strong&gt; Most LLM APIs let you set &lt;code&gt;max_tokens&lt;/code&gt; on a response. This limits output length, not spending. An agent that sends 50 requests with modest outputs can still accumulate serious spend.&lt;/p&gt;
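
&lt;p&gt;Some quick arithmetic, with entirely made-up prices and rates, shows why a per-response cap leaves total spend unbounded:&lt;/p&gt;

```python
# A per-response max_tokens caps output length, not spend (illustrative numbers).
output_cap = 1000          # tokens, enforced per response by the API
prompt_tokens = 20000      # looping agents re-send a growing context each call
price_per_1k = 0.01        # hypothetical blended USD price per 1k tokens
requests_per_hour = 120    # one call every 30 seconds

cost_per_request = (prompt_tokens + output_cap) / 1000 * price_per_1k
daily = cost_per_request * requests_per_hour * 24
print(f"${daily:,.2f} per day")  # $604.80 per day, with every response under the cap
```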

&lt;p&gt;&lt;strong&gt;Monitoring is reactive.&lt;/strong&gt; Observability dashboards tell you what happened. By the time you see a cost spike, the money has already been spent. In the $47K incident, monitoring was in place — it simply reported outcomes rather than intervening.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application-level budget checks can be bypassed.&lt;/strong&gt; If your code checks a counter before each API call, that counter lives in the same trust domain as the agent. A bug that causes the loop can also break the counter.&lt;/p&gt;

&lt;p&gt;In other words, anything that depends on the agent's own logic to limit its spend will fail in exactly the scenarios where limits matter most: when the agent is misbehaving. You need a control layer that is external to the agent, that can't be circumvented by application bugs, and that enforces hard economic boundaries on every single request.&lt;/p&gt;

&lt;h2&gt;The Solution: Make Every LLM Call a Payment&lt;/h2&gt;

&lt;p&gt;The budget problems above share a root cause: payment and execution are decoupled. The &lt;a href="https://x402.org" rel="noopener noreferrer"&gt;x402 protocol&lt;/a&gt; addresses this by redefining how agents access LLM inference. Instead of authenticating with an API key and settling costs later via an invoice, each request is a discrete payment transaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blockrun.ai" rel="noopener noreferrer"&gt;BlockRun&lt;/a&gt; is a platform that enables pay-per-use access to many mainstream LLMs via the x402 payment protocol. No API key. No subscription tier. No monthly bill. Each request either pays or it doesn't execute.&lt;/p&gt;

&lt;p&gt;This is a fundamental shift. With API keys, spending authority is granted once and revoked manually. With x402, spending authority is exercised and verified on every single request. If the payment doesn't go through, the inference doesn't happen.&lt;/p&gt;

&lt;h2&gt;Introducing ampersend: The Wallet That Enforces Your Budget&lt;/h2&gt;

&lt;p&gt;Pay-per-request alone doesn't prevent runaway spending — an agent stuck in a loop will keep paying as long as it has funds. This is the gap ampersend was built to address.&lt;/p&gt;

&lt;p&gt;ampersend is agentic payment infrastructure that gives autonomous agents programmable wallets with built-in spending controls and real-time observability. When an agent requests a payment signature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the agent's daily spend is under the limit, the wallet signs the transaction and the request proceeds.&lt;/li&gt;
&lt;li&gt;If the daily spend has reached the limit, the wallet refuses to sign. The request fails. The agent is economically dead — it can keep running, but ampersend won't let it pay for anything.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The spending limit is not in the application code. It lives in the wallet policy. The agent's code cannot override it, bypass it, or accidentally skip it. Even if the agent is stuck in an infinite loop, prompt-injected, or broken by orchestration bugs, the wallet remains the final authority.&lt;/p&gt;
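
&lt;p&gt;A minimal sketch of the idea, assuming a simplified treasurer. Class and method names here are illustrative, not the ampersend SDK:&lt;/p&gt;

```python
# Wallet-level spending policy: the budget lives outside the agent's code.
# Amounts are integer micro-USDC to keep the arithmetic exact.

class Treasurer:
    def __init__(self, daily_limit):
        self.daily_limit = daily_limit
        self.spent_today = 0

    def sign(self, amount):
        """Sign a payment only while the daily budget holds; else refuse."""
        if self.spent_today + amount > self.daily_limit:
            return None  # refusal: the request dies here, whatever the agent does
        self.spent_today += amount
        return {"amount": amount, "signature": "0xsigned"}

treasurer = Treasurer(daily_limit=10_000)  # 10,000 micro-USDC per day

# The agent code can loop forever; the wallet still stops at the limit.
approved = 0
for _ in range(1_000_000):
    if treasurer.sign(1_000) is not None:  # 1,000 micro-USDC per call
        approved += 1
print(approved, treasurer.spent_today)  # 10 10000
```

&lt;p&gt;The loop runs a million times; the wallet approves exactly ten payments. Moving the check out of the loop's own code is the whole trick.&lt;/p&gt;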

&lt;h2&gt;How It All Works Together&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Agent sends an inference request to BlockRun.&lt;/li&gt;
&lt;li&gt;BlockRun responds with &lt;code&gt;HTTP 402 Payment Required&lt;/code&gt; with payment details.&lt;/li&gt;
&lt;li&gt;The agent's ampersend treasurer checks the request against the wallet's spending policy. If allowed, it signs a USDC payment. If the limit is reached, it refuses — request dies here.&lt;/li&gt;
&lt;li&gt;The agent retries the request with proof of payment attached.&lt;/li&gt;
&lt;li&gt;BlockRun verifies the on-chain payment and returns the inference result.&lt;/li&gt;
&lt;/ol&gt;
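
&lt;p&gt;The five steps can be traced end to end with mocks. All names and payload shapes below are illustrative, not the real BlockRun or ampersend APIs:&lt;/p&gt;

```python
# End-to-end sketch of the five steps above, with a mocked gateway and wallet.

def blockrun(request, payment_proof=None):
    """Steps 2 and 5: quote a price via 402, or verify payment and answer."""
    if payment_proof is None:
        return {"status": 402, "price_usdc": 0.002}
    return {"status": 200, "result": "inference output"}

def treasurer_sign(policy, price):
    """Step 3: sign only within the wallet policy; refuse otherwise."""
    if policy["spent"] + price > policy["daily_limit"]:
        return None
    policy["spent"] += price
    return {"proof": "0xpayment", "amount": price}

def run_inference(policy, request):
    quote = blockrun(request)                            # steps 1 and 2
    proof = treasurer_sign(policy, quote["price_usdc"])  # step 3
    if proof is None:
        return {"status": "refused", "reason": "daily limit reached"}
    return blockrun(request, payment_proof=proof)        # steps 4 and 5

policy = {"daily_limit": 0.002, "spent": 0.0}
print(run_inference(policy, "summarize")["status"])  # 200
print(run_inference(policy, "summarize")["status"])  # refused
```

&lt;p&gt;Note where the second call dies: at step 3, before any money moves or any inference runs.&lt;/p&gt;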

&lt;h2&gt;Traditional API vs. BlockRun + ampersend&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Traditional API&lt;/th&gt;
&lt;th&gt;BlockRun (x402)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API key authentication&lt;/td&gt;
&lt;td&gt;Payment is the authentication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Post-hoc billing (monthly invoice)&lt;/td&gt;
&lt;td&gt;Pre-paid per request (instant settlement)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spending limit = credit card limit&lt;/td&gt;
&lt;td&gt;Spending limit = wallet policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Revocation requires key rotation&lt;/td&gt;
&lt;td&gt;Revocation is automatic (wallet limit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost attribution is manual&lt;/td&gt;
&lt;td&gt;Cost is on-chain and auditable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For agent builders, this means you can give an agent access to GPT-class models without giving it an API key that could be leaked, shared, or exploited beyond your intended budget.&lt;/p&gt;

&lt;h2&gt;Does It Actually Stop Runaway Spending?&lt;/h2&gt;

&lt;p&gt;We built a load test that deliberately simulates a disaster scenario — firing requests in an infinite loop as fast as possible until something stops it.&lt;/p&gt;

&lt;p&gt;With a traditional API key, nothing stops it. The loop runs until the credit card is maxed out or someone manually intervenes.&lt;/p&gt;

&lt;p&gt;With ampersend: the first N requests succeed. Each one is a real USDC payment. When the agent's daily limit is reached, the treasurer refuses to sign the next payment. The total spend is exactly the daily limit you configured — not a dollar more.&lt;/p&gt;

&lt;p&gt;The loop may continue logically — the code still wants to send requests — but financially, it's dead. The wallet, not the code, is the circuit breaker.&lt;/p&gt;

&lt;h2&gt;Why This Matters for Agent Builders&lt;/h2&gt;

&lt;p&gt;If you're building systems where AI agents call LLM APIs — whether that's a single coding agent, a multi-agent pipeline, or an autonomous agent swarm — the loop spending problem will eventually find you.&lt;/p&gt;

&lt;p&gt;The shift is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Replace API keys with per-request payments.&lt;/strong&gt; x402 makes every LLM call an explicit economic transaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforce budgets at the wallet layer, not the application layer.&lt;/strong&gt; ampersend's spending limits can't be bypassed by bugs in your agent code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make costs on-chain and auditable.&lt;/strong&gt; Every payment is a USDC transaction, visible on-chain. No more guessing where the spend went.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't about crypto ideology. It's about using programmable money to solve a real engineering problem: how do you give an autonomous system access to expensive resources without giving it unlimited spending authority?&lt;/p&gt;

&lt;p&gt;The answer is the same one that every other infrastructure domain has learned: governance belongs at the platform layer, not the application layer. Kubernetes doesn't trust your containers to self-limit CPU usage. Rate limiters don't trust your services to self-throttle. Your agent infrastructure shouldn't trust your agents to self-budget.&lt;/p&gt;

&lt;h2&gt;Try It Yourself&lt;/h2&gt;

&lt;p&gt;The full reference implementation is open source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repository: &lt;a href="https://github.com/edgeandnode/ampersend-blockrun-agentops" rel="noopener noreferrer"&gt;ampersend-blockrun-agentops&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;ampersend SDK: &lt;a href="https://github.com/edgeandnode/ampersend-sdk" rel="noopener noreferrer"&gt;github.com/edgeandnode/ampersend-sdk&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;BlockRun: &lt;a href="https://blockrun.ai" rel="noopener noreferrer"&gt;blockrun.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;x402 Protocol: &lt;a href="https://x402.org" rel="noopener noreferrer"&gt;x402.org&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Request beta access at &lt;a href="https://ampersend.ai" rel="noopener noreferrer"&gt;ampersend.ai&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>blockchain</category>
      <category>web3</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
