BeanBean

Posted on May 20 • Originally published at nextfuture.io.vn

Terminal Coding CLI Ecosystem: 8 May 2026 Reports Aggregated

#fullstack #ai #webdev #javascript

Originally published on NextFuture

Between May 8 and May 20, 2026, eight engineering posts and benchmark reports landed on terminal coding CLI agents — Claude Code, Codex CLI, Gemini CLI, and GitHub Copilot CLI. Across those eight sources the spread is large: one toolkit scores 80 out of 100 on its own task suite, a Llama 3.2 self-host reports running at 1/160th the API cost it replaced, and the published pricing of frontier models still varies by more than 10× per million tokens. This post aggregates the numbers and the methodologies behind them so you can choose between these four CLIs without trusting a single vendor chart.

TL;DR: the numbers

DimensionClaude CodeCodex CLIGemini CLICopilot CLISources

LicenseProprietaryApache 2.0Apache 2.0Proprietary (GitHub)2 reports
ImplementationTypeScriptTypeScriptTypeScriptTypeScript / Node1 report
Default modelClaude Opus / Sonnet 4.xGPT-5.xGemini 2.x → 3.5 FlashGPT-5.x + Copilot routing3 reports
Frontier price ($ / 1M out tokens)~$15.00 (Opus 4.7 tier)~$10.00 (GPT-5.5 tier)Gemini 3.5 Flash ≪ frontierFlat plan + per-request gated2 reports
Skill / extension ecosystemSkills, MCP, /advisorMCP, tools, SkillsMCP, toolsGitHub-native tools3 reports
Self-host alternative cost reference$12,000/mo → $5/mo cited as 1/160×———1 report
Independent benchmark scoreIncluded in oh-my-agent v2 suite (80/100)IncludedIncludedDiscussed qualitatively2 reports

Each cell aggregates at least one engineering report published between May 8 and May 20, 2026. Numbers in the price row are reported list prices for the cited frontier tiers — actual CLI billing depends on the plan and routing layer used.

How this comparison was assembled

The starting set was the nextfuture.io.vn article feed, filtered to posts mentioning at least one of the four CLIs plus a measurement keyword (benchmark, latency, price, throughput, accuracy, or failure mode). Eight sources survived the screen: two cover the terminal CLIs in a feature matrix, three cover specific tools at depth, two cover model pricing changes that the CLIs inherit, and one covers a self-host alternative.

Inclusion: published May 8–20, 2026, with at least one specific number (price per 1M tokens, benchmark score, request volume, latency target) or a primary-source feature matrix.
Exclusion: vendor marketing pages, model release announcements without independent measurement, demo videos, single-anecdote tweets, and posts re-syndicating Anthropic, OpenAI, or Google content without new measurements.
Normalization: token prices stated as $/1M input and $/1M output. Self-host claims are cited but never blended with API list prices — a $5/month VPS cannot be compared to API tokens without a workload qualifier.

All eight sources are listed at the bottom with the metric each contributed.

Feature matrix: where the four CLIs actually differ

The cleanest side-by-side comes from pardnchiu's Agenvoy matrix on dev.to, which rows all three foundation-model CLIs against two open-source competitors. The differences that matter for buyers are not the language (all three are TypeScript) or the architecture (all three are session-based CLI processes). They are the licensing model, the default model routing, and the agent-skill ecosystem.

Claude Code is the only proprietary entry of the three foundation CLIs. Codex CLI and Gemini CLI both ship under Apache 2.0, which means the surface area — the prompt scaffolding, the tool definitions, the loop — is auditable and forkable. That distinction shows up in the cryptographic forensics post: when the harness is open you can verify what the agent actually saw before it ran rm -rf on training data. With Claude Code the JSONL session log is the only artifact, and a third party who doesn't trust your machine cannot independently verify it. None of the four CLIs ship signed session logs by default in May 2026.

Copilot CLI sits in its own quadrant. It is the only one of the four that is plan-priced rather than per-token, and the only one with a credible PR-triage use case at scale — one developer reports running it across 40+ upstream organizations for 18 months. That is not a benchmark, it is an existence proof, and the other three CLIs lack a published equivalent.

Benchmarks and cost: what numbers actually exist

The most-quoted benchmark for the foundation CLIs this month is the oh-my-agent v2 score of 80/100. Read carefully: 80/100 is the toolkit's score on its own task suite, with Cursor promoted to a first-class vendor and nine new skills added in v2. It is not a head-to-head between Claude Code, Codex CLI, and Gemini CLI — it is one harness running across whichever model the user wires up. Treat it as a proxy for "do the skills + the model close the lockfile-mismatch class of failures," not a model leaderboard.

Pricing for the underlying models, which the CLIs inherit unless an /advisor-style router intervenes, moved this month. The Token Ledger on May 19 reports NVIDIA Nemotron 3 Super completion at $0.45/1M (down from $0.50, a 10% cut), Gemma 4 26B A4B at $0.06/$0.33 per 1M prompt/completion, gpt-oss-120b at $0.039/$0.18, and Mistral Nemo trending down on completion. Claude Opus and GPT-5.5 sit roughly an order of magnitude above gpt-oss-120b on completion. The GPT-5.5 vs Claude Opus 4.7 comparison confirms the spread but does not publish reproducible SWE-bench task IDs.

The most aggressive cost claim is the Llama 3.2 + Ollama + Nginx deployment on a $5/month DigitalOcean droplet, framed as "1/160th Claude cost" after a $12,000 Anthropic bill. The post reports 50+ requests per second at sub-100ms latency on a load-balanced multi-instance setup — but Llama 3.2 8B at sub-100ms is not running SWE-bench tasks at Opus quality, and the workload being replaced is summarization, not multi-step coding agents.

When the headline number lies

The 80/100 benchmark gets quoted as if it ranks the CLIs. It does not. oh-my-agent v2 is a harness that adds skills around a model: the same Claude Sonnet 4.x that scores in that harness will score differently under Codex CLI's scaffolding, and Gemini 3.5 Flash uses a different tool-call protocol entirely. The "1/160th cost" claim has the same shape — it compares a self-hosted Llama 3.2 8B running summarization against an Anthropic bill that included multi-step agent runs on Opus. Neither headline is wrong; both are non-transferable. Treat the matrix above as the lower-rigor floor and A/B for procurement.

Verdict by builder profile

Solo dev shipping side projects: Claude Code with the Sonnet tier, or Copilot CLI on the flat plan. The Copilot flat plan removes the cost-anxiety tax that order-of-magnitude per-token differences create on side-project budgets.
Team of 5-20 with budget pressure: Codex CLI under Apache 2.0 plus a router (an /advisor-style or AI-gateway layer) to push routine tasks to gpt-oss-120b at $0.039/$0.18 per 1M and reserve GPT-5.x for the harder runs. The open license matters because you can audit the harness when the agent does something destructive.
Cost-sensitive batch workload: Look at the $0.45/1M Nemotron 3 Super and $0.06/$0.33 Gemma 4 26B tier reported by The Token Ledger, and consider whether the workload is actually CLI-shaped or whether a self-host on Llama 3.2 + Ollama clears the latency bar. The 1/160× claim only works if the work is summarization or classification.
Latency-critical user-facing app: None of the four CLIs fit — they are session-based developer tools, not SDKs. For sub-100ms responses, follow the Llama-on-DigitalOcean pattern or a Gemini 3.5 Flash endpoint.
Open-source maintainer triaging 40+ repos: Copilot CLI is the only one of the four with a published existence proof at that scale. The other three lack equivalent reports.

Sources reviewed

Claude Code · Codex CLI · Gemini CLI · OpenClaw · Hermes Agent vs Agenvoy — dev.to, May 19, 2026, contributed: language / license / author / architecture matrix.
oh-my-agent v2: Nine New Skills, First-Class Cursor, and an 80/100 Benchmark — dev.to, May 20, 2026, contributed: 80/100 toolkit benchmark, Cursor first-class promotion, nine-skill list.
The Token Ledger – 2026-05-19 — dev.to, May 19, 2026, contributed: per-model price deltas ($0.45/1M Nemotron 3 Super, $0.06/$0.33 Gemma 4 26B A4B, $0.039/$0.18 gpt-oss-120b).
GitHub Copilot CLI as a PR-triage co-pilot — dev.to, May 19, 2026, contributed: 40+ upstream orgs, 18-month single-developer program scope.
Llama 3.2 + Ollama + Nginx on a $5/month DigitalOcean droplet — dev.to, May 20, 2026, contributed: $12,000/mo → $5/mo claim, 50+ req/s, sub-100ms latency.
Cryptographic Forensics for AI Coding Agent Sessions — dev.to, May 20, 2026, contributed: JSONL session log gap, harness-transparency argument for open licenses.
GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, and Benchmarks — dev.to, May 19, 2026, contributed: frontier-tier pricing band and qualitative speed comparison.
Agentic app coding gets an upgrade with Google's release of Android CLI — TechCrunch, May 19, 2026, contributed: Google Android CLI integration target for Claude Code and Codex.

FAQ

Did I run these benchmarks myself?

No. This post aggregates eight reports published between May 8 and May 20, 2026. Each cell in the TL;DR table cites at least one independent source, and most cells cite two. The synthesis is the work; the measurements are other people's.

Why aggregate instead of running my own?

Single benchmarks lie — workload mismatch, version drift, cherry-picked task set, vendor framing. The 80/100 oh-my-agent score and the 1/160× Llama claim are both real numbers that don't generalize. Aggregating eight reports surfaces the median behavior, the spread, and the boundary conditions where each number stops being true. For more on how coding agents fail in practice, see 9 Ways AI Coding Agents Break in Production (May 2026).

How current is this?

All eight sources published between May 8 and May 20, 2026. Tool versions cited: Claude Code (Sonnet 4.x / Opus 4.7 routing), Codex CLI (GPT-5.x), Gemini CLI (Gemini 2.x → 3.5 Flash), Copilot CLI (May 2026 plan). Expect staleness by September 2026 — model pricing moves monthly, as May 2026's Cursor-to-Claude-Code math already showed.

This article was originally published on NextFuture. Follow us for more fullstack & AI engineering content.

DEV Community