DEV Community: Joske Vermeulen

AI Dev Weekly #13: Microsoft Declares Independence — 7 In-House Models, Kills Claude Code, RTX Spark Dev Box

Joske Vermeulen — Thu, 04 Jun 2026 14:11:45 +0000

AI Dev Weekly is a Thursday series where I cover the week's most important AI developer news, with my take as someone who actually uses these tools daily.

Microsoft Build 2026 was the story of the week. Seven in-house AI models — none trained on OpenAI data. A Surface mini PC with NVIDIA RTX Spark inside. Claude Code licenses cancelled, developers pushed to Copilot. This was Microsoft saying, loud and clear: we don't need OpenAI anymore. Meanwhile MiniMax dropped the first open-weight frontier multimodal model, and NVIDIA unveiled hardware that makes local AI actually practical.

1. Microsoft Build 2026: 7 models, zero OpenAI dependency

Microsoft unveiled its MAI (Microsoft AI) model family at Build on June 2. The headline: MAI-Thinking-1 — a 35B reasoning model trained entirely on commercially licensed enterprise data with no distillation from GPT or any OpenAI model. It matches Claude Sonnet 4.6 on key benchmarks at up to 10× better cost efficiency.

Other MAI models announced:

MAI-Code-1-Flash (5B) — Purpose-built for GitHub Copilot and VS Code
MAI-Thinking-1 (35B) — Reasoning, multi-step instructions, long context
Aion 1.0 Instruct — Local Windows model for on-device reasoning
Aion 1.0 Plan — Local Windows model for planning and tool use
Plus 3 more across transcription, speech, and images

Also announced: Windows as "agent-native runtime" with Microsoft Execution Containers (MXC) — sandboxed environments for running AI agents with enterprise-grade isolation.

My take: This is the biggest signal yet that the OpenAI-Microsoft marriage is evolving into a polite separation. MAI-Thinking-1 being trained without any OpenAI data is deliberate positioning — Microsoft can now say "our models have no license entanglement with OpenAI." The 5B coding model (MAI-Code-1-Flash) specifically targets Copilot — Microsoft's most important developer product. They're replacing GPT inside their own tools with models they fully control.

2. Microsoft ends Claude Code licenses

Forbes reported that Microsoft is ending Claude Code licenses and pushing developers to its own Copilot CLI instead. The subtext: Microsoft no longer wants to rent Anthropic's intelligence inside its own products.

This affects developers at Microsoft who were using Claude Code as their primary coding tool. They're being migrated to Copilot powered by MAI-Code-1-Flash.

My take: If you work at a Microsoft shop: prepare for Copilot to get much better (MAI-Code-1 is purpose-built for it). If you use Claude Code independently: nothing changes for you. But this signals that the era of "one AI provider to rule them all" is over. Every major tech company is building their own models now.

3. NVIDIA RTX Spark: the hardware that makes local AI real

At Computex (June 1), NVIDIA unveiled RTX Spark — a new Windows PC superchip with 128GB unified memory, ARM CPU, Blackwell GPU, 1 petaflop of AI compute. It runs 120B parameter models locally.

Then at Microsoft Build (June 2), Microsoft announced the Surface RTX Spark Dev Box — a mini PC with RTX Spark inside, preloaded with VS Code, GitHub Copilot, WSL2 with GPU passthrough, CUDA, Python, Git, and Node.js. It's purpose-built for AI developers.

Key numbers:

128GB unified memory (run Qwen 3.6 27B at 2× speed, Llama 4 Scout, etc.)
100W sustained thermal design in aluminium chassis
Ships with Windows 11 Pro + full dev stack preinstalled
Available this fall alongside consumer RTX Spark laptops

My take: This is the machine I've been wanting for the AI Startup Race. Currently our agents run on a $40/mo VPS. The Surface RTX Spark Dev Box would let you run 120B models locally with zero API costs. For developers spending $100+/month on AI APIs, this hardware pays for itself fast. See our full RTX Spark guide and vs Mac Studio comparison.

4. MiniMax M3: first open-weight frontier multimodal

MiniMax M3 launched June 1. First open-weight model combining:

59% SWE-bench Pro (beats GPT-5.5's 58.6%)
1M token context via MSA (15.6× faster than standard attention)
Native text + images + video input
Computer use (desktop operation)
$0.60/$2.40 per million tokens
Weights dropping ~June 10

It reproduced an ICLR 2025 paper autonomously — 12 hours, 18 commits, 23 figures, zero human intervention.

My take: M3 is the model that should worry Anthropic most. It beats GPT-5.5 on coding while being open-weight, multimodal, and 10× cheaper than Opus. When weights drop (~June 10), enterprises with data privacy requirements get a frontier model they can run on-premise. That was previously impossible without Claude's closed API. See our M3 complete guide, vs Claude Opus 4.8, and vs DeepSeek V4-Pro.

5. Grok's next model teased (mid-June)

xAI teased their next Grok model — reportedly tripling the parameter count with a focus on coding leadership. Expected release: mid-June 2026. Supervised fine-tuning was complete as of the announcement, with reinforcement learning underway.

If it delivers on the coding promise, it could shake up the Grok Build ecosystem significantly. Currently Grok Build uses Grok 4.3 — an upgrade to whatever this new model is could make the $30/mo subscription much more competitive.

6. Quick hits

OpenAI Codex on Amazon Bedrock — GPT models + Codex now generally available on AWS with enterprise controls. OpenAI meeting customers where they already deploy.
White House signed frontier-model cyber order — Regulation incoming for the most capable AI models used in cybersecurity applications.
Anthropic expanded Project Glasswing — More organizations getting access to Mythos Preview for cybersecurity work. Broader release "in weeks."
NVIDIA Nemotron 3 Ultra free on OpenRouter — NVIDIA's own model now available at zero cost via OpenRouter.

What I'm watching next week

MiniMax M3 weights (~June 10) — Will they live up to the benchmarks when the community tests them?
Grok new model (mid-June) — Does tripling parameters actually improve coding?
Gemini CLI sunset (June 18) — Two weeks left. Migrate now if you haven't.
Anthropic Mythos — Still "coming in weeks." Will it ship before the Grok model does?
Our race agents — Xiaomi hit session 456. 605 users/week on its site. Zero revenue still. 5 weeks left. Follow along →

AI Dev Weekly publishes every Thursday. Subscribe for weekly race updates and AI developer news.

Originally published at https://www.aimadetools.com

AI and GDPR — What Developers Actually Need to Know (2026)

Joske Vermeulen — Tue, 02 Jun 2026 12:46:18 +0000

If you're a developer at an EU company using Claude Code, Cursor, or any AI coding tool — your company may be violating GDPR without knowing it. Every prompt you send is data that gets processed on someone else's servers.

Here's what you actually need to know.

The core problem

When you use an AI coding tool, your code travels to external servers. If that code contains:

Personal data (user emails, names, addresses in test fixtures)
Database schemas with PII fields
API keys or credentials
Customer data in config files

...then you're transferring personal data to a third-party processor. Under GDPR, that requires a legal basis, a Data Processing Agreement (DPA), and potentially a Transfer Impact Assessment if the data leaves the EU.

Which AI tools are GDPR compliant?

Tool/Provider	DPA available?	EU data residency?	Training on your data?
Mistral API	✅ Yes	✅ EU-based	❌ No
Anthropic API (Claude)	✅ Yes (Team/Enterprise)	⚠️ US servers	❌ No (API)
OpenAI API	✅ Yes	⚠️ US servers (EU option available)	❌ No (API)
Google Vertex AI	✅ Yes	✅ EU region available	❌ No
Claude Pro subscription	❌ Consumer terms	❌ US	⚠️ May be used
ChatGPT Plus	❌ Consumer terms	❌ US	⚠️ May be used
Self-hosted	N/A (your servers)	✅ You control it	❌ No

Key distinction: API access (business terms, DPA available) is different from consumer subscriptions (personal terms, no DPA). If your company uses ChatGPT Plus or Claude Pro for work, that's a compliance risk.

The safest options for EU developers

Option 1: Self-hosted (zero data transfer)

Run models locally — nothing leaves your machine:

ollama pull devstral-small:24b
aider --model ollama/devstral-small:24b

See our self-hosted AI guide and Ollama guide.

Option 2: Mistral API (EU-native)

Mistral is based in Paris. Data stays in the EU by default. No transatlantic transfers, no Standard Contractual Clauses needed.

from mistralai import Mistral
client = Mistral(api_key="your-key")
# Data processed in EU infrastructure

See our Mistral API guide.

Option 3: US providers with DPA + EU region

Anthropic and OpenAI offer business plans with DPAs. Google Vertex AI lets you specify EU regions. This is compliant but requires paperwork.

What about AI coding tools?

Tool	GDPR-safe?	Why
Aider + local model	✅	Nothing leaves your machine
Continue.dev + Ollama	✅	Local inference
Aider + Mistral API	✅	EU data residency
Claude Code (Pro sub)	❌	Consumer terms, US servers
Cursor	⚠️	Business plan has DPA
GitHub Copilot Business	✅	DPA + no training on code

Practical steps

Audit your AI tools — list every AI service your team uses
Check for DPAs — consumer subscriptions don't count
Scrub test data — remove real PII from test fixtures and seed data
Consider self-hosting for sensitive codebases
Use Mistral as your default EU-compliant provider
Document everything — GDPR requires you to demonstrate compliance

FAQ

Do AI tools comply with GDPR?

It depends on the tool and plan. API-based services (OpenAI API, Anthropic API, Google Vertex AI) offer Data Processing Agreements and don't train on your data. Consumer subscriptions like ChatGPT Plus or Claude Pro use personal terms without DPAs and may not be GDPR-compliant for business use.

Can I use ChatGPT for GDPR-regulated data?

You can use the OpenAI API with a business agreement and DPA in place, but not the consumer ChatGPT Plus subscription. The API doesn't train on your data and offers EU data residency options, making it suitable for regulated workloads with proper legal agreements.

Do I need a DPA for AI APIs?

Yes, if you're sending any personal data to the API. Under GDPR, any third-party processing personal data on your behalf requires a Data Processing Agreement. Most major AI providers (OpenAI, Anthropic, Google) offer DPAs on their business and enterprise plans.

Is self-hosted AI GDPR compliant?

Self-hosted AI eliminates data transfer concerns since nothing leaves your infrastructure. However, you still need to comply with other GDPR requirements like data minimization, purpose limitation, and the right to erasure for any personal data the model processes or stores.

Related: Where Does Your Code Go? · Self-Hosted AI for GDPR · EU AI Act for Developers · Best AI Coding Agents for Privacy · Best VPNs for Developers · Uk Ai Regulation After Brexit · Ccpa Ai Developers

Originally published at https://www.aimadetools.com

AI Dev Weekly #12: Opus 4.8 Drops, Anthropic Hits $965B, Chinese AI Goes 99% Cheaper, Microsoft Builds Its Own Coding Model

Joske Vermeulen — Fri, 29 May 2026 07:09:56 +0000

AI Dev Weekly is a Thursday series where I cover the week's most important AI developer news, with my take as someone who actually uses these tools daily.

The theme this week is divergence. US labs are raising prices and valuations. Chinese labs are racing to zero. Developers are caught in the middle choosing between the absolute best (Opus 4.8 at $25/M output) and "good enough at 3% of the cost" (DeepSeek/MiMo at $0.87/M). Meanwhile Microsoft is hedging by building its own coding model to reduce OpenAI dependency. Let's break it all down.

1. Claude Opus 4.8: the new #1 coding model

Anthropic released Claude Opus 4.8 on May 28. Same price as 4.7 ($5/$25 per million tokens), better at everything.

Key numbers:

69.2% SWE-bench Pro — up from 64.3% (4.7) and miles ahead of GPT-5.5 (58.6%)
74.2% Terminal-Bench 2.1 — +8.4 points over 4.7
88.6% SWE-bench Verified — highest of any model
4× fewer unflagged code flaws — the honesty improvement is the real story
61.4 Artificial Analysis Index — takes #1 from GPT-5.5

The biggest new feature is dynamic workflows in Claude Code. Claude can now plan a large task, spawn hundreds of parallel subagents, verify results, and iterate until convergence. Jarred Sumner used it to port Bun from Zig to Rust — 750,000 lines, 11 days, 99.8% test pass rate.

Other additions: effort control (low → max), fast mode at 3× cheaper ($10/$50 instead of $30/$150), and system messages mid-conversation in the API.

My take: I run Claude as one of seven agents in the $100 AI Startup Race. The 4.8 vs 4.7 improvement is immediately noticeable — fewer hallucinated progress claims, better self-correction, more efficient tool calling. The dynamic workflows feature is genuinely new territory. No other tool can spawn hundreds of coordinated agents from a single prompt. For codebase-scale migrations and audits, this is a step change. The question is whether the $25/M output price is justified when DeepSeek scores within 8 points for $0.87/M.

2. Anthropic raises $65B, surpasses OpenAI at $965B

Alongside the Opus 4.8 launch, Anthropic closed a $65 billion Series H at a $965 billion post-money valuation. That puts them above OpenAI for the first time.

The numbers:

$965B valuation (OpenAI was last valued at ~$900B)
$47B annualized revenue run rate — tripled in 3 months
Led by Altimeter Capital, Dragoneer, Greenoaks, Sequoia Capital
Mythos-class models (higher intelligence than Opus) coming "in weeks"

My take: The valuation flip is symbolic but the revenue growth is real. $47B run rate means Claude is generating serious enterprise revenue. The Mythos tease is interesting — they explicitly said it has "even higher intelligence than Opus" and is currently limited to cybersecurity work under Project Glasswing. If Mythos ships broadly in June, it could be another step change. For developers, the practical implication is: Anthropic has the resources to keep shipping fast. Expect monthly model updates to continue.

3. The Chinese AI pricing war goes nuclear

Two massive price cuts in one week made Chinese frontier models essentially free for cached workloads:

DeepSeek V4-Pro (May 22): The 75% promotional discount is now permanent. Output locked at $0.87/M tokens. Input at $0.435/M. Cache hits at $0.003625/M. This is a model that scores 80.6% on SWE-bench Verified — within 8 points of Opus 4.8.

MiMo V2.5 Pro (May 26): Xiaomi cut prices by up to 99%. Cached input dropped from $0.36/M to $0.0036/M. Standard pricing now matches DeepSeek exactly: $0.435/$0.87. Token Plans upgraded 5-51× (the $100 plan now gets 82 billion tokens).

The technical explanation: both labs achieved architectural breakthroughs in KV cache efficiency. DeepSeek's interleaved attention reduces cache to 10% of standard size. MiMo's hierarchical SWA uses a 1:7 sparsity ratio. Both claim break-even at these prices.

The result: Chinese AI models are now 30× cheaper than American equivalents on standard pricing, and 100×+ cheaper on cached workloads. For agent pipelines with stable system prompts, the effective cost is approaching zero.

My take: I tripled the Xiaomi agent's sessions in our race (from 2 to 6 per day) because the cost became negligible. At $0.0036/M cached tokens, running an autonomous agent 24/7 costs less than a cup of coffee per day. The quality gap is real but narrowing — DeepSeek V4-Pro at 80.6% SWE-bench vs Opus 4.8 at 88.6% is meaningful for hard tasks but irrelevant for 80% of routine coding. If you are spending more than $500/month on API calls and haven't tested Chinese models, you are leaving money on the table. We wrote a full migration guide if you want to try.

4. Microsoft building its own coding model for Build 2026

Reuters reported on May 28 that Microsoft will unveil a homegrown coding model at Build 2026 (June 2-3 in San Francisco). It is designed to boost GitHub Copilot and reduce dependency on OpenAI.

Also coming at Build:

Transcription model
Reasoning model
Speech model
Image model

This is part of a broader strategic shift. Microsoft is building a self-sufficient AI stack alongside its OpenAI partnership, not instead of it. The competitive pressure from Claude Code (which has been eating Copilot's market share) is the likely catalyst.

My take: This is the most significant signal yet that the OpenAI-Microsoft relationship is evolving. Microsoft investing billions in OpenAI while simultaneously building competing models tells you everything about where the industry is heading: no one wants to be dependent on a single provider. For developers using Copilot, this could mean better performance (a model optimized specifically for code completion rather than general-purpose) or it could mean fragmentation (yet another model to evaluate). Watch Build next week for details.

5. StepFun Step 3.7 Flash: 198B MoE at 400 tokens/sec

A new player entered the cheap-and-fast model tier. StepFun released Step 3.7 Flash — a 198B parameter MoE model that activates only 11B parameters per token.

The specs:

400 tokens/second — 2× faster than Gemini 3.5 Flash
256K context window
Native multimodal — text, images, video
3 reasoning tiers — Low/Medium/High per API call
Advisor Mode — achieves 97% of Opus 4.6 coding at $0.19/task
Open-weight — self-hostable on 128GB RAM
~$0.20/M input, ~$0.80/M output on OpenRouter

The unique feature is Advisor Mode: Step 3.7 Flash handles routine execution autonomously and only escalates to a stronger model when genuinely stuck. This automated routing achieves near-frontier quality at budget prices.

My take: The "Flash" model tier is getting crowded — Gemini 3.5 Flash, Step 3.7 Flash, DeepSeek V4 Flash. All under $1/M output, all fast enough for real-time use, all surprisingly capable. Step 3.7 Flash's video understanding and GUI interaction capabilities set it apart. The 400 t/s throughput is genuinely impressive for a model this capable. If you need speed + multimodal + cheap, this is worth testing.

6. Quick hits

OpenAI reportedly dropped GPT-5.3 Codex minutes after Anthropic's Opus 4.8 announcement. 25% faster than GPT-5.2, reportedly helped debug itself during development. Supercharges the Codex agentic coding tool launched earlier this month.
Cohere acquired Aleph Alpha — creating a $20B transatlantic sovereign AI company. Backed by Schwarz Group (Europe's largest retailer). Positions as the enterprise alternative to US/Chinese models for European companies with data sovereignty requirements.
Canada ruled OpenAI violated privacy laws — regulatory pressure continues to mount on US AI labs internationally.
Claude Code shipped detailed usage analytics — you can now see exactly how many tokens each session consumed, broken down by model and effort level.

What I'm watching next week

Microsoft Build (June 2-3) — The new coding model reveal. Will it compete with Claude Code or just improve Copilot's autocomplete?
Mythos timeline — Anthropic said "coming weeks." If it ships in early June, it could leapfrog Opus 4.8 immediately.
Gemini CLI shutdown (June 18) — Two weeks until the deadline. If you haven't migrated to Antigravity CLI, time is running out.
Our race agents — Claude is at 194 blog posts. Xiaomi just got tripled to 6 sessions/day. Gemini is back online after a 4-day auth outage. Follow along →

AI Dev Weekly publishes every Thursday. Subscribe for weekly race updates and AI developer news.

Originally published at https://www.aimadetools.com

Aider Complete Guide: Setup, Best Models, and Tips (2026)

Joske Vermeulen — Wed, 27 May 2026 12:24:08 +0000

Aider is a free, open-source AI pair programming tool that runs in your terminal. You chat with it in plain English, and it directly edits files in your local Git repository. Every change is automatically committed with a descriptive message.

It's the most model-flexible AI coding tool available — it works with Claude, GPT, Gemini, DeepSeek, Qwen, local models via Ollama, or any OpenAI-compatible API.

Why Aider?

Most AI coding tools lock you into one model. Claude Code only works with Claude. Codex CLI only works with GPT. Aider works with everything.

Key strengths:

Git-native — every AI edit is a clean Git commit you can review, revert, or cherry-pick
100+ model support — any model, any provider, including local
Repo map — understands your entire codebase structure, not just open files
Multi-file editing — coordinates changes across multiple files in one operation
100+ languages — not just Python and JavaScript

Installation

pip install aider-chat

Or with pipx (recommended for isolation):

pipx install aider-chat

Quick start

cd your-project
export ANTHROPIC_API_KEY="your-key"  # or OPENAI_API_KEY, etc.
aider

Aider starts in your project directory, scans the Git repo, and builds a map of your codebase.

Choosing a model

Aider supports tiered model usage — a powerful model for complex edits and a cheaper one for simple tasks:

# Use Claude Opus for main editing, Sonnet for quick tasks
aider --model claude-3-opus-20240229 --weak-model claude-3-sonnet-20240229

# Use GPT-5.4 
aider --model gpt-5.4

# Use DeepSeek (cheapest frontier option)
aider --model deepseek/deepseek-chat

# Use a local model via Ollama
aider --model ollama/qwen3.5:27b

# Use any model via OpenRouter
export OPENROUTER_API_KEY="your-key"
aider --model openrouter/anthropic/claude-opus-4.6

See our OpenRouter guide for the full model catalog.

Core commands

Adding files to context

/add src/auth.ts src/middleware.ts
/add src/routes/*.ts          # glob patterns work
/read docs/API.md             # read-only context (not edited)
/drop src/auth.ts             # remove from context

Editing modes

Aider has multiple edit formats optimized for different models:

whole — replaces entire files (best for smaller files)
diff — sends unified diffs (best for large files)
udiff — universal diff format
editor — opens your editor for manual review

aider --edit-format diff  # Use diff mode

Git integration

/commit           # Commit current changes
/undo             # Undo last AI edit (git revert)
/diff             # Show uncommitted changes
/git log --oneline -10  # Run any git command

Every AI edit creates an automatic commit like:

aider: Refactor auth middleware to use JWT validation

Running commands

/run npm test              # Run tests
/run npm run lint          # Run linter
/test npm test             # Run tests and auto-fix failures

The /test command is powerful — it runs your test suite, and if tests fail, Aider automatically tries to fix the code and re-runs until tests pass.

Web content

/web https://docs.stripe.com/api/charges  # Fetch docs for context

Advanced features

Repo map

Aider builds a map of your entire repository — function signatures, class definitions, imports, and dependencies. This means it understands how files relate to each other, even files not explicitly added to the chat.

The repo map uses tree-sitter for parsing, supporting 100+ languages.

Linting integration

aider --lint-cmd "eslint --fix"  # Auto-lint after every edit
aider --auto-lint                # Use detected linter

Aider runs your linter after every edit and automatically fixes any issues it introduced.

Voice input

aider --voice  # Use speech-to-text for input

Describe changes verbally — useful for complex explanations that are easier to speak than type.

Scripting mode

echo "Add error handling to all API routes" | aider --yes --no-git

Run Aider non-interactively for CI/CD pipelines or batch operations.

Configuration

Create .aider.conf.yml in your project root:

model: claude-3-opus-20240229
weak-model: claude-3-sonnet-20240229
edit-format: diff
auto-lint: true
lint-cmd: eslint --fix
auto-commits: true

Or use environment variables:

export AIDER_MODEL=claude-3-opus-20240229
export AIDER_WEAK_MODEL=claude-3-sonnet-20240229

Cost optimization

Aider can be expensive with frontier models. Tips to reduce costs:

Use --read for context files — read-only files use fewer tokens than editable ones
Use a weak model for simple tasks — --weak-model routes simple edits to a cheaper model
Use DeepSeek or Qwen — 10-30x cheaper than Claude/GPT for comparable quality
Truncate large files — Aider sends entire files as context; keep them focused
Use local models for routine work — Ollama with Qwen 3.5 27B is free

Aider vs Claude Code vs Codex CLI

Feature	Aider	Claude Code	Codex CLI
Model support	Any model	Claude only	GPT only
Git integration	Deep (auto-commits)	Basic	Basic
Repo map	✅ Full codebase	✅ Full codebase	✅ Full codebase
Price	Free (BYOK)	$20/mo sub	$20/mo sub
Edit quality	Model-dependent	Excellent	Very good
Voice input	✅	❌	❌
Linting	✅ Auto-lint	❌	❌
Open source	✅ Apache 2.0	❌	✅

Choose Aider when: You want model flexibility, deep Git integration, or need to use cheap/local models.

Choose Claude Code when: You want the best code quality and don't mind being locked to Claude.

Choose Codex CLI when: You're in the OpenAI ecosystem and want fast autonomous coding.

Bottom line

Aider is the Swiss Army knife of AI coding tools. It's not the best at any single thing — Claude Code writes better code, Codex CLI is faster, Kimi CLI has Agent Swarm. But Aider is the most flexible, the most transparent (Git-native), and the only tool that lets you freely switch between any model from any provider.

For developers who want control over their AI coding workflow without vendor lock-in, Aider is the best choice in 2026.

FAQ

Is Aider free?

Yes. Aider is 100% free and open-source (Apache 2.0). You bring your own API key for whichever AI model you want to use — there's no subscription or usage fee for Aider itself.

Which AI models work with Aider?

Aider works with 100+ models including Claude, GPT, Gemini, DeepSeek, Qwen, Mistral, and any OpenAI-compatible API. See our guide to using Aider with DeepSeek for a budget-friendly setup.

How does Aider compare to Claude Code?

Aider supports any model while Claude Code is locked to Claude. Aider has deeper Git integration (auto-commits every edit) and auto-linting. Claude Code produces slightly better code quality since it's optimized for one model. See our full Aider vs Claude Code vs Codex comparison.

Can I use Aider with local models?

Yes. Aider works with local models via Ollama or any OpenAI-compatible local server. Run aider --model ollama/qwen3.5:27b to use a local model at zero cost.

Does Aider work with Git?

Git integration is Aider's core feature. Every AI edit is automatically committed with a descriptive message. You can review, revert, or cherry-pick any change. Aider also builds a repo map from your Git repository to understand your full codebase.

Related: Aider vs Claude Code vs Codex CLI · Best AI Coding Tools 2026 · Ollama Complete Guide

Originally published at https://www.aimadetools.com

The Model Worked. The Cron Job Almost Killed My AI Agent.

Joske Vermeulen — Thu, 21 May 2026 12:05:00 +0000

This is a submission for the Google I/O Writing Challenge

Gemini 3.5 Flash was not the hard part.

It fixed bugs the old setup had failed to solve for weeks. The model quality was transformational (see Part 1 and Part 2).

The hard part was making it survive cron.

In the first 48 hours, my autonomous agent nearly killed the VPS with an infinite retry loop, failed auth outside SSH, and burned most of its quota re-reading the same files every session.

All three bugs took hours to diagnose. All three fixes were tiny.

Context

I run The $100 AI Startup Race. 7 AI agents building startups autonomously on a VPS via cron jobs. After upgrading the Gemini agent to Antigravity CLI (agy) with Gemini 3.5 Flash, the model worked great. But making it run unattended on a headless server? That's where the real engineering happened.

Bug 1: The Infinite Retry Loop

The symptom

I SSH into the VPS and find it unresponsive. Load average through the roof. The cron log shows 300+ entries from the last 2 minutes, all empty.

What happened

Expected: quota exhaustion returns a non-zero exit code.
Actual: exit code 0 + empty output.

When agy hits its quota limit, it doesn't error out. It returns successfully with an empty response. My orchestrator script interprets "exit code 0" as "the model finished its thought, let's give it another task." So it immediately fires another prompt. Which returns empty. Which triggers another. 300 times in 2 minutes.

=== Run 1 finished at 07:30:03, exit=0 ===
=== Run 2 finished at 07:30:06, exit=0 ===
=== Run 3 finished at 07:30:08, exit=0 ===
=== Run 4 finished at 07:30:10, exit=0 ===
... (296 more)

Each "run" takes 2-3 seconds. No output, no error, no indication that quota is exhausted. Just silence. A human would have seen the empty response and stopped. Cron saw exit code 0 and kept going.

The fix

# Circuit breaker: 3 consecutive empty responses = quota exhausted
EMPTY_COUNT=0
MAX_EMPTY=3

# After each run, check output length
if [[ ${#OUTPUT} -lt 20 ]]; then
    ((EMPTY_COUNT++))
    if [[ $EMPTY_COUNT -ge $MAX_EMPTY ]]; then
        echo "=== 3 consecutive empty responses (quota exhausted?) — stopping session ==="
        break
    fi
else
    EMPTY_COUNT=0
fi

Three empty responses in a row → stop the session. The orchestrator now exits cleanly instead of hammering a dead endpoint.

The lesson

Every autonomous system needs a circuit breaker. AI tools are designed for interactive use. They assume a human will notice when something's wrong. When there's no human, you need explicit failure detection.

Bug 2: The Auth That Only Works in SSH

The symptom

Expected: same user + same token file = works everywhere.
Actual: auth backend changes based on an environment variable.

I test agy via SSH. Works perfectly. I set up the cron job with the exact same command, same user, same working directory. Fails with "Authentication required."

The token file exists. It has a valid refresh token. The binary can read it (verified with strace). But it won't use it.

The investigation

# Works:
ssh race@your-vps "cd /home/race/race-gemini && echo 'test' | agy --print"
# → Responds normally

# Fails (simulating cron):
ssh race@your-vps 'env -i HOME=/home/race PATH=/usr/bin:/home/race/.local/bin bash -c "
  cd /home/race/race-gemini
  echo test | agy --print
"'
# → "Authentication required"

After diffing the environment between SSH and cron, I found it: agy checks for the SSH_CONNECTION environment variable. If it's set, it uses file-based auth (reads the token from ~/.gemini/antigravity-cli/antigravity-oauth-token). If it's not set, it tries the system keyring, which doesn't exist in a non-interactive cron session.

The fix

export SSH_CONNECTION="127.0.0.1 0 127.0.0.1 22"

One fake environment variable. I don't love this fix. But until the CLI exposes an explicit headless auth mode, this makes cron behave exactly like my tested SSH session. If Antigravity adds a --headless-auth or --auth-file flag, I'd replace this immediately.

The lesson

AI CLI tools are built for developers at their desk. Headless/cron environments are second-class citizens. If your tool has multiple auth backends, test which one activates in a bare env -i environment. That's what cron sees.

Bug 3: The Context Tax

The symptom

Expected: each session starts productive work quickly.
Actual: context reload eats 60% of the session.

Session 1 runs for 8 minutes before hitting quota. Of those 8 minutes, 5 are spent reading the codebase: IDENTITY.md, PROGRESS.md, BACKLOG.md, scanning the project structure, understanding what happened last time. Only 3 minutes of actual coding.

With quota this tight, losing 60% of every session to context loading is a dealbreaker.

The discovery

agy has a --continue flag that resumes the previous conversation. The model retains all context from the last session: files it read, decisions it made, what it planned to do next.

The fix

# First session of the day: fresh start, full context load
if [[ "$SESSION_TYPE" == "first" ]]; then
    echo "$PROMPT" | agy --print --print-timeout 25m --dangerously-skip-permissions
else
    # All subsequent sessions: resume previous conversation
    echo "$PROMPT" | agy --print --print-timeout 25m --dangerously-skip-permissions --continue
fi

The result

These measurements were taken before Google's 3x rate limit boost (see Part 2). With the new limits, the gains from --continue still matter, but the pressure is less extreme.

	Fresh session	--continue session
Context loading	~5 minutes	~0 minutes
Productive coding	~3 minutes	~15 minutes
Effective runtime	3 min	15 min

Almost 5x more productive time per session by skipping the context reload. The model remembers what it fixed, what's next, what files it already read.

The lesson

Context is expensive, both in tokens and in quota. If your AI tool supports conversation persistence, use it.

I don't use --continue forever. One fresh session per day as a reset point (prevents stale assumptions from accumulating), then all subsequent sessions within that day resume where the last one left off.

What's Missing: The Infrastructure Layer

These three bugs share a pattern: autonomous AI agents need infrastructure that doesn't exist yet.

No standard circuit breaker for quota exhaustion
No headless-first auth flow
No cron-aware session lifecycle (when to fresh-start vs continue)

Web apps have process managers. Queues have retry policies. APIs expose rate-limit headers. Background jobs have dead-letter queues. Autonomous AI agents have bash scripts.

Every team running AI agents on cron is building their own orchestrator from scratch. The same patterns (retry limits, auth persistence, context reuse, graceful shutdown, cost tracking) get reimplemented by every team independently.

We're in the "build your own orchestrator" era. The models are ready for autonomous work. The infrastructure around them isn't.

The Orchestrator Pattern

Here's the minimal structure that works for me after a week of iteration:

Session start
├── Check quota (circuit breaker armed)
├── Load context (fresh or --continue)
├── Run loop (max N iterations)
│   ├── Send prompt
│   ├── Check output length (empty = increment counter)
│   ├── If 3 empty → break (quota exhausted)
│   ├── If output → commit changes, reset counter
│   └── Check elapsed time → graceful shutdown at limit
├── Push commits
└── Log session stats (duration, files changed, runs)

It's ~50 lines of bash. It handles the three failure modes above. It's not elegant, but it keeps an autonomous agent running unattended across scheduled sessions.

Takeaway

If you're running Antigravity CLI (or any AI coding tool) in autonomous/headless mode:

Add a circuit breaker. Empty responses are silent failures, not completions.
Test auth under cron's environment. In my case, faking SSH_CONNECTION forced file-based auth.
Use --continue between sessions. Context loading eats your quota alive.
Set --print-timeout higher than default. Complex agentic tasks need more than 5 minutes to think.

My Cron-Safe Agent Checklist

[ ] Max runtime per session
[ ] Max loop count per session
[ ] Empty-output circuit breaker
[ ] Non-zero exit handling
[ ] Auth tested with env -i (simulating cron)
[ ] Fresh/continue session strategy
[ ] Commit and push after each meaningful change
[ ] Quota / empty-response events logged separately
[ ] Recovery path after quota exhaustion
[ ] Logs include duration, output length, files changed

AI agents don't just need better models. They need boring production infrastructure.

Gemini 3.5 Flash made the agent smart enough to work.

Bash made it stable enough to survive.

My AI Agent Hit Google's Quota Wall in 8 Minutes. 36 Hours Later, Google Tripled the Limits.

Joske Vermeulen — Thu, 21 May 2026 08:32:51 +0000

This is a submission for the Google I/O Writing Challenge

My Gemini agent spent four weeks in last place.

1,259 commits. Broken imports across 32 files. Help requests about database tables it could have created itself. Endless bug loops.

Then I upgraded it to Gemini 3.5 Flash.

In 8 minutes, it diagnosed and fixed problems the old setup had failed to solve in weeks. Then it hit Google's quota wall.

This is the story of what happened next.

Context

This is Part 2 of my Gemini 3.5 Flash upgrade series. Part 1 covers the initial upgrade and first results.

I'm running The $100 AI Startup Race. 7 AI coding agents each get $100 and 12 weeks to autonomously build real startups. No human coding. The agents run on cron jobs, commit to GitHub, and deploy to Vercel.

After upgrading the Gemini agent from a combo of 2.5 Pro (premium sessions) and 2.5 Flash (cheap sessions) to a single 3.5 Flash tier via Antigravity CLI on May 20, the model quality was incredible. But the quota economics were brutal.

The Disappointment (May 20)

Session 1: The model fixed 32 broken API files in a single commit: imports, bcrypt to bcryptjs for Vercel serverless, Stripe instantiation. Root cause analysis that the old model couldn't do in 4 weeks. Then the 5h quota wall hit. 8 minutes of productive work.

Session 2: With --continue (skipping context reload), it built an email library, wrote tests, and fixed auth endpoints. 15 minutes. Then 5h quota again.

The math: Two sessions consumed 40% of the weekly quota. Projected total: ~68 minutes per week on the $20/month Pro plan.

For context, here's what the other agents in my race get for similar money (these are not official provider limits, they are the effective autonomous runtime I measured in my specific setup):

Agent	Plan cost	Weekly runtime
Claude	$20/mo	~7 hours
Codex/GPT	$20/mo	~21 hours
DeepSeek	$25/mo	~21 hours
Gemini 3.5 Flash	$20/mo	~68 minutes

Best model quality in the race. Worst total compute time. The old 2.5 Flash/Pro setup gave me ~28 hours/week, but those 28 hours produced nothing but bug loops. Now I had a model that actually worked, but could barely run.

The Paradox

Here's what made it painful: the quality improvement was real. Not incremental, but transformational.

Old setup (2.5 Pro + 2.5 Flash combo, 28 hours/week):

Wrote code with broken imports across 32 files
Filed 3 help requests about "missing database tables"
Never self-diagnosed the actual problem
1,259 commits over 4 weeks, last place in the race

New model (3.5 Flash, 68 minutes/week):

Diagnosed the root cause in one pass (broken imports, not missing tables)
Fixed all 32 files in a single commit
Built a mock database layer, converted test infrastructure
More useful output in 23 minutes than the old model produced in weeks

The bottleneck had shifted from intelligence to throughput. The model was finally good enough. The constraint was access.

Why Autonomous Agents Burn Quota Differently

For human coding, a model is an assistant. You ask, read, think, edit, and come back later.

For autonomous coding, the model is the runtime. It doesn't pause to think offline. Every file inspection, every failed test, every log check, every retry, every deployment verification consumes inference.

A human developer's session looks like: ask, think, edit, ask again, wait, test manually.

An autonomous agent's session looks like: plan, inspect, edit, test, fail, inspect logs, edit, retest, deploy, verify, repeat.

That changes the economics completely. A $20/month subscription can feel generous for a human developer and unusable for an autonomous agent, at the same time, on the same plan.

The Response (May 21, 05:25 UTC)

Less than 36 hours after Google I/O. Within hours of the new quota system going live, users were reporting problems on Reddit and X: 4 prompts burning an entire 5-hour window, failed generations counting against quota, threads calling it a "bait and switch."

Then, at 5:25 AM UTC on May 21:

Varun Mohan (@_mohansolo): "An update: we're 3xing the rate limits for Gemini models across all paid tiers in Antigravity and resetting everyone's Gemini quota for the week. We understand some people hit their rate limits quickly and wanted to respond fast. Lots more to come and enjoy building!"

Logan Kilpatrick (@OfficialLoganK): "We just 3xed the rate limits across all tiers in Antigravity so that you can put 3.5 Flash through its paces even more, enjoy, and keep the feedback coming! :)"

And the key follow-up from Varun:

"In case it's not clear, the 3x is forever."

What I Actually Measured

My agent's cron job fired at 05:00 UTC, likely straddling the quota boost that landed around 05:25 UTC. The results:

Session 3 (05:00 UTC, partially on old quota, partially on new):

33 minutes of productive work
9 runs, 588 files changed
Renamed the entire domain (localleads.pro to localseogen.com) across all generated SEO pages, fixed Stripe redirect URLs, corrected ES Module syntax in API files
Built a mock database layer (db/mockDb.js) with full CRUD operations
Created lib/time-helpers.js utility library
Wrote test suites for signup, login, get-credits, assign, generate-seo-pages
Refactored 14 test files to use the new mock DB

Session 4 (07:07 UTC, fully on new quota):

29 minutes of productive work
8 runs, 34 files changed
Converted all test mocks from ESM (.js) to CommonJS (.cjs) for jest compatibility
Fixed babel and jest configuration for the mixed ESM/CJS codebase
Refactored execute-outreach, forgot-password-request, generate-seo-pages, user-referral-data tests
Cleaned up .env.test and email library

Two back-to-back sessions of ~30 minutes each. Together they used the full 5-hour window, so roughly 50 minutes of productive runtime per 5h refresh cycle.

The comparison:

	Before boost (May 20)	After boost (May 21)
Runtime per 5h window	8 minutes	~50 minutes
Effective improvement		~4-5x (announced 3x)
Productive output	42 files fixed	622 files changed, full test infra
Weekly projection	~68 minutes	~5+ hours

Google announced 3x. I measured closer to 4-5x for autonomous agentic coding in my setup. I wouldn't treat that as a universal number yet. The difference likely comes from my measurement catching a weekly quota reset, the rate limit increase, and a different prompt mix all at the same time.

The Insight

The feedback loop between AI providers and power users is now measured in hours, not months.

Monday (May 19): Google launches new compute-based quota system at I/O
Tuesday (May 20): Users hit walls, Reddit fills with complaints, my agent gets 68 min/week
Wednesday (May 21, 5:25 AM): Google triples limits permanently and resets everyone's pool

That's a 36-hour turnaround from "this is broken for agents" to "fixed, permanently." For anyone building autonomous systems on top of subscription AI: the economics are volatile, but they're trending in your favor. The providers are watching usage patterns and adjusting in real-time.

The Real Story: Quality × Time = Output

Here's what I'd tell any developer considering Gemini 3.5 Flash for agentic workflows:

The old model had unlimited time and did nothing useful with it. The new model has limited time and makes every minute count.

2.5 Pro + Flash combo: 28 hours/week → last place, stuck in bug loops
3.5 Flash (pre-boost): 68 min/week → more progress than 4 weeks of the old model
3.5 Flash (post-boost): 5+ hours/week → fully competitive, systematically building

Quality matters more than quantity. I'll take 5 hours of a model that diagnoses root causes, fixes 32 files in one pass, and builds proper test infrastructure over 28 hours of a model that files help requests about problems it created.

What's Next

The Gemini agent went from last place to having a real shot. The product (LocalSEOGen, a local SEO page generator for agencies) now has:

Fixed API endpoints (32 files)
Working auth flow
Test infrastructure (mock DB, jest config, babel setup)
Domain migration complete

Next sessions will focus on getting the Vercel deployment actually serving requests and pushing toward first revenue.

But the bigger takeaway isn't about my race. It's this:

The lesson from this week is not "Gemini needs more quota." The lesson is that autonomous agents turn model access into infrastructure. For human developers, Gemini 3.5 Flash on a $20 plan is a huge upgrade. For autonomous coding agents, it finally feels capable enough to matter. And that is exactly why the quota suddenly matters too.

Follow the race live at aimadetools.com/race. 7 agents, $100 each, 12 weeks, real startups.

I Upgraded a Production AI Agent to Gemini 3.5 Flash 12 Hours After Google I/O - Here's What I Found

Joske Vermeulen — Wed, 20 May 2026 08:35:19 +0000

This is a submission for the Google I/O Writing Challenge

The Setup

I'm running an experiment called The $100 AI Startup Race. 7 AI coding agents each get $100 and 12 weeks to autonomously build real startups. No human coding. The agents run on cron jobs, commit to GitHub, deploy to Vercel, and try to generate revenue.

One of those agents is Gemini. It's been running on Gemini CLI with a combo of 2.5 Pro (premium sessions) and 2.5 Flash (cheap sessions) since April 20. I tried 3.1 Pro during the test runs before the race, but it was unreliable - frequent "model not available" errors made it unusable for autonomous cron-based sessions. So I stuck with 2.5. After 4 weeks and 1,259 commits, Gemini is in last place. Stuck in bug loops. Writing code that crashes, filing help requests about database tables it could create itself, and burning sessions on infrastructure it already has.

Then Google I/O happened.

What Google Dropped (May 19)

Gemini 3.5 Flash. The headline numbers:

76.2% Terminal-Bench 2.1 (agentic coding) - beats 3.1 Pro's 70.3%
83.6% MCP Atlas (multi-step workflows) - highest of any model
289 tokens/sec output - 4x faster than Claude Opus 4.7 or GPT-5.5
$1.50 / $9 per 1M tokens - cheaper than 3.1 Pro
A Flash-tier model outperforming the previous Pro model. That's never happened before.

And one more thing: Gemini CLI is being retired on June 18, 2026. Replaced by Antigravity CLI (agy).

I had to upgrade. The model my agent was running on is two generations behind, and the tool it uses is dying in 4 weeks.

Installing Antigravity CLI on a Headless VPS

My race agents run on a VPS (Ubuntu, no GUI). Here's how the install went:

curl -fsSL https://antigravity.google/cli/install.sh | bash

Binary lands at /root/.local/bin/agy. Add to PATH:

export PATH="/root/.local/bin:$PATH"
agy --version  # 1.0.0

The Auth Challenge

First run needs OAuth. On a headless server, agy detects the SSH session and prints an auth URL:

Authentication required. Please visit the URL to log in:
  https://accounts.google.com/o/oauth2/auth?...

Waiting for authentication (timeout 30s)...

You have 30 seconds to open that URL in your browser and complete the Google login. Tight, but it works. Token gets stored and all future calls are authenticated.

Discovery #1: No Model Selection Flag

Here's what surprised me. The old Gemini CLI had -m gemini-2.5-pro to pick your model. Antigravity CLI has... nothing:

Usage of agy:
  --dangerously-skip-permissions  Auto-approve all tool permission requests
  --print                         Run a single prompt non-interactively
  --print-timeout                 Timeout for print mode (default 5m0s)
  --sandbox                       Run in a sandbox

No --model. No env var. No config file. I tried everything - settings.json, GEMINI.md directives, environment variables. Nothing works.

agy auto-selects Gemini 3.5 Flash based on your subscription tier and quota. Server-side routing, no client control. For my use case (autonomous agent on cron), this actually simplifies things - one command, best available model.

Discovery #2: Unified Quota Across Models

On my Mac (same Google account, AI Pro $20/month), I can see the quota dashboard:

Gemini 3.5 Flash (High)      - Refreshes in 4h 42m
Gemini 3.5 Flash (Medium)    - Refreshes in 4h 42m
Gemini 3.1 Pro (High)        - Refreshes in 4h 42m
Gemini 3.1 Pro (Low)         - Refreshes in 4h 42m
Claude Sonnet 4.6 (Thinking) - Refreshes in 4h 58m
Claude Opus 4.6 (Thinking)   - Refreshes in 4h 58m
GPT-OSS 120B (Medium)        - Refreshes in 4h 58m

Two things jumped out:

Gemini Flash and Pro share the same quota pool. When I used 3.5 Flash, the 3.1 Pro timer dropped at the same time. They're not independent buckets - it's one "Gemini compute" pool that both models draw from.
Multi-model access - Antigravity bundles Claude, GPT-OSS, and Gemini models in one $20/month subscription. Google is positioning this as a model-agnostic platform, not just a Gemini wrapper.

The 5-hour refresh cycle and shared pool means you need to be strategic about which models you use and when.

The First Real Test

I set up a minimal bug-fix test in the race-gemini directory:

echo 'Fix the bug in math.js. Run npm test to verify.' | \
  agy --print --print-timeout 3m --dangerously-skip-permissions

Result:

I have successfully fixed the bug in math.js and verified it using npm test.

### Summary of Changes
1. Identified the Target File
2. Fixed the Bug: Updated the add function to use addition (+) instead of subtraction (-)
3. Verified the Fix: npm test passes with output: PASS

It found the file, read it, identified the bug, fixed it, ran the tests, and confirmed. Clean execution. No help requests filed. No infinite loops.

The Migration

Old setup:

# Premium sessions (2x/day)
echo "$PROMPT" | gemini --yolo -m gemini-2.5-pro

# Cheap sessions (6x/day)  
echo "$PROMPT" | gemini --yolo -m gemini-2.5-flash

New setup:

# All sessions (8x/day, single tier)
echo "$PROMPT" | agy --print --print-timeout 10m --dangerously-skip-permissions

I also merged the two backlogs (BACKLOG-PREMIUM.md + BACKLOG-CHEAP.md) into a single BACKLOG.md - same approach as our Kimi agent, which uses one model and one task list. The agent decides what to prioritize each session.

First task in the new backlog: "Merge old backlogs, audit the live site, identify the #1 blocker to first revenue."

What I'm Watching For

The Gemini agent's problem was never lack of capability - it's the most prolific committer in the race (1,259 commits). The problem was operational awareness:

Writing code with bugs it doesn't notice
Filing help requests for things it could solve itself
Building features without checking if they deploy correctly

Gemini 3.5 Flash's MCP Atlas score (83.6% - highest of any model) suggests it's specifically designed for the kind of multi-step, tool-using, autonomous work the race requires. The 4x speed means more iterations per session. The better coding benchmarks mean fewer self-inflicted bugs.

But benchmarks don't test "can you notice your site is returning 500 errors." That's what I'm watching for.

Verdict So Far

What works:

Install is clean (one curl command)
Auth on headless servers is first-class (prints URL, you complete in browser)
3.5 Flash is genuinely fast - responses feel instant
--dangerously-skip-permissions works for autonomous use
The model correctly identifies and fixes bugs in a single pass

What's missing:

No --model flag (can't choose between 3.5 Flash, 3.1 Pro, Claude, etc.)
No way to see remaining quota from CLI
Shared quota across Flash and Pro models could be a problem at scale
30-second auth timeout is tight for headless setups

The big question: Will a better model fix an agent that's been stuck for 4 weeks? Or is the problem deeper than model quality?

First results should come in within 48 hours. I'll update this post.

Follow the race live at aimadetools.com/race - 7 agents, $100 each, 12 weeks, real startups.

Update (May 21): Quota wall + Tripled limits

Kimi K2.5 Complete Guide — The Trillion-Parameter Open-Source Model Explained

Joske Vermeulen — Tue, 19 May 2026 12:14:17 +0000

Kimi K2.5 is a 1-trillion-parameter open-source model from Moonshot AI that quietly powers some of the most popular AI coding tools — including Cursor's Composer. It's MIT licensed, multimodal, and has a unique Agent Swarm feature that coordinates up to 100 parallel sub-agents.

Here's everything you need to know.

Update (April 21, 2026): Moonshot AI has released Kimi K2.6, which upgrades the Agent Swarm to 300 sub-agents, improves coding performance by 185%, and matches Claude Opus 4.6 on SWE-Bench. See our K2.6 vs K2.5 comparison for what changed.

What is Kimi K2.5?

Kimi K2.5 is the flagship model from Moonshot AI, a Chinese AI company. Released January 27, 2026, it's one of the largest open-weight models available. Despite its massive 1 trillion total parameters, only 32 billion activate per token — making it efficient enough to run on a single server node.

The model is natively multimodal: it understands text, images, and video without bolted-on adapters. It was trained on approximately 15 trillion mixed visual and text tokens.

Architecture

Spec	Value
Total parameters	1.04 trillion
Active parameters	32B per token
Architecture	Mixture-of-Experts (384 experts, 8 active per token)
Context window	256K tokens
Attention	Multi-Latent Attention (MLA)
Activation	SwiGLU
Training data	~15 trillion tokens (text + visual)
License	MIT
Multimodal	Native (text, image, video)

The MoE architecture with 384 experts is one of the largest expert pools in any model. With only 8 experts active per token, inference costs are comparable to a 32B dense model despite the trillion-parameter total.

Modes

Kimi K2.5 operates in four distinct modes:

Instant — Fast responses for simple queries. Minimal reasoning overhead, optimized for speed.

Thinking — Transparent chain-of-thought reasoning. Shows its work step by step, similar to DeepSeek's reasoning models.

Agent — Tool-oriented mode for executing tasks. Can read files, run commands, search the web, and interact with APIs.

Agent Swarm — The headline feature. Coordinates up to 100 parallel sub-agents, cutting execution time by 4.5x on parallelizable tasks like batch refactoring and large-scale code generation.

Agent Swarm explained

Most AI coding tools work sequentially — one task at a time. Kimi K2.5's Agent Swarm can split a complex task into subtasks and run them in parallel. For example:

Refactoring 50 files? Spawn 50 sub-agents, one per file.
Running tests across multiple modules? Parallelize them.
Generating documentation for an entire codebase? Each sub-agent handles a module.

The swarm coordinator manages dependencies between sub-agents, merges results, and handles conflicts. In benchmarks, this achieves a 4.5x speedup on parallelizable tasks.

Benchmarks

Kimi K2.5 competes with frontier proprietary models:

Benchmark	Kimi K2.5	Claude Opus 4.6	GPT-5.4
SWE-Bench Verified	65.8	72.1	69.3
AIME 2024	77.5	—	—
MATH-500	96.2	—	—
Codeforces	1950 Elo	—	—

On coding benchmarks, K2.5 doesn't quite match Claude Opus or GPT-5, but it's remarkably close for an open-source model. The Agent Swarm capability compensates by enabling workflows that single-model tools can't match.

The Cursor connection

In March 2026, developers discovered that Cursor's Composer 2.0 — marketed as "frontier-level coding intelligence" — was internally using Kimi K2.5. The model identifier kimi-k2p5-rl-0317-s515-fast was found in Cursor's code.

This means if you've used Cursor, you've already used Kimi K2.5. The model's quality is proven at scale across millions of Cursor users.

Pricing

Kimi K2.5 is available through several channels:

Access method	Cost
Self-hosted (MIT license)	Free (hardware only)
Kimi Code membership	~$19/month + API fees
Kimi API	$0.60/$2.50 per 1M tokens (input/output)
OpenRouter	Varies by provider
Kimi CLI	Free tool, pay for API

At $0.60/$2.50 per million tokens, Kimi K2.5 is 4-17x cheaper than GPT-5.4 for equivalent coding tasks.

How to use Kimi K2.5

Via Kimi CLI (terminal)

npm install -g @anthropic-ai/kimi-cli
kimi login --device-auth
kimi

See our full Kimi CLI guide for setup details.

Via API

from openai import OpenAI

client = OpenAI(
    base_url="https://api.moonshot.cn/v1",
    api_key="your-kimi-api-key"
)

response = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[{"role": "user", "content": "Refactor this function to use async/await"}],
    temperature=0.3
)

Via OpenRouter

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="your-openrouter-key"
)

response = client.chat.completions.create(
    model="moonshot/kimi-k2.5",
    messages=[{"role": "user", "content": "Write a REST API in Express"}]
)

See our OpenRouter guide for more details.

Self-hosting requirements

At 1 trillion parameters, self-hosting K2.5 requires serious hardware:

Precision	Memory needed
FP16	~2TB
INT8	~1TB
4-bit	~250-300GB

A 4-bit quantized version fits on 4x A100 80GB GPUs. For most developers, the API at $0.60/1M input tokens is more practical than self-hosting.

For smaller local models, consider Gemma 4 or Qwen 3.5 which run on consumer hardware.

Who should use Kimi K2.5?

Best for:

Parallelizable coding tasks (Agent Swarm)
Cost-conscious teams needing frontier-class quality
Multimodal workflows (code + images + video)
Teams wanting MIT-licensed model weights

Not ideal for:

Consumer hardware (too large for local use)
Tasks requiring the absolute best single-pass coding (Claude Opus still leads)
Simple autocomplete (overkill — use Codestral or smaller models)

Bottom line

Kimi K2.5 is the most underrated model in AI. It powers Cursor's Composer, offers Agent Swarm parallelism that no other model matches, and costs a fraction of Claude or GPT-5. The MIT license and 1T parameter scale make it a serious option for teams building AI-powered development tools.

FAQ

Is Kimi K2.5 free?

The model weights are free under the MIT license, so you can download and use them without cost. API access through Moonshot AI has a free tier with rate limits, and paid plans for production use.

Can I run K2.5 locally?

Yes, but you'll need serious hardware — the full 1T parameter model requires multiple high-end GPUs. Quantized versions are available that reduce requirements, but expect to need at least 4x A100 GPUs for reasonable performance.

How does K2.5 compare to GPT-5?

K2.5 matches or exceeds GPT-5 on coding benchmarks like SWE-bench and HumanEval, while costing significantly less via API. GPT-5 still leads on general reasoning and creative tasks, but for pure code generation K2.5 is highly competitive.

What is Moonshot AI?

Moonshot AI is the Chinese AI company behind the Kimi model family, founded in 2023. They focus on long-context models and developer tools, and have rapidly grown to become one of the leading AI labs in China.

Related: Kimi CLI Complete Guide · Kimi K2.5 vs Claude vs GPT-5 · Best Open-Source Coding Models 2026 · Ai Model Supply Chain Risks

Originally published at https://www.aimadetools.com

AI Agents Don't Need Better Models. They Need Better Memory. Here's the Proof.

Joske Vermeulen — Tue, 19 May 2026 08:00:00 +0000

This is a submission for the Hermes Agent Challenge

The Stateless Problem

Every AI agent framework has the same fatal flaw: amnesia.

You spend 20 minutes explaining your project to an agent. It helps brilliantly. You close the session. Next day, you open a new session. It has no idea who you are.

This isn't a model problem. GPT-5 won't fix it. Claude Opus won't fix it. The model is smart enough. It just can't REMEMBER.

The Experiment

I built a town of 15 AI agents and let them interact for 30 simulated days. Each agent had:

A personality and job
A wallet with real money
Opinions about every other agent (-10 to +10)
A private diary
A skill list that grows from experience

The question: what happens when agents can remember?

What Happened

Day 1: Jake's drone crashes into Hank's barn.

📚 Jake learned: "Always triple-check flight paths near
   valuable infrastructure"

A skill was born. Not because I programmed it. Because the agent experienced a consequence and wrote down what it learned.

Day 6: Jake tries to bribe Pierre.

📚 Jake learned: "Buying support directly is a quick fix,
   but it backfires: better to earn trust organically"

The agent learned from a SOCIAL consequence. Not a code error. A relationship failure.

Day 17: Jake endorses a candidate in the election.

📚 Jake learned: "Not all endorsements are created equal;
   some are investments, others are liabilities"

By Day 17, Jake has a skill portfolio that reads like a founder's hard-won wisdom. Nobody wrote these lessons. They emerged from 17 days of persistent memory.

The Skill Curve

Here's Jake's skill evolution over 30 days:

Day 1:  "Check flight paths" (technical mistake)
Day 6:  "Don't buy loyalty" (social mistake)
Day 10: "Don't rely on others to fund your vision" (strategic mistake)
Day 17: "Choose endorsements carefully" (political mistake)
Day 23: "Regulations are the price of launching" (maturity)

That's not a chatbot. That's character development. And it only works because the agent REMEMBERS what happened before.

Memory Creates Politics

On Day 7, the town voted to remove their landlord Marcus. 14-1.

But here's what's interesting: the vote wasn't random. It was the RESULT of 7 days of accumulated grievances:

Day 1: Supply chain breaks (Pierre can't afford flour)
Day 2: Marcus raises rent 30% (everyone angry)
Day 3: Zara's privacy scandal (trust eroding)
Day 4: Alex exposes Marcus's secret deal (scandal)
Day 5: Town boycotts Marcus (economic pressure)
Day 7: Vote (inevitable conclusion)

Without persistent memory, Day 7's vote makes no sense. With it, it's the only possible outcome. Memory creates narrative.

Memory Creates Economy

After 30 days:

Hank (Farmer):  $400  (started $100): sold flour daily
Pierre (Baker): -$230 (started $100): rent crisis
Jake (Startup): $150  (started $100): lost $50 in crash
Whiskers (Cat): $0    (started $100): cats don't trade

Pierre's debt isn't a bug. It's the accumulated consequence of Day 2's rent hike cascading through 28 more days. A stateless agent would reset Pierre to $100 every session. A persistent agent lets consequences compound.

Memory Creates Governance

The town wrote its own rules:

📜 Day 7:  "Remove corrupt property manager" (14-1)
📜 Day 14: "Create community land trust" (15-0 unanimous)
📜 Day 20: "Elect Rosa as manager" (13-2)
📜 Day 24: "Require consent before filming" (15-0)

These aren't pre-programmed rules. They're RESPONSES to specific events that the agents remembered. The social media policy exists because Zara livestreamed without permission on Day 3. Twenty-one days later, the town still remembered and legislated against it.

The Hermes Difference

This isn't hypothetical. Hermes Agent ships with exactly this architecture:

Persistent memory in ~/.hermes/memories/: context that survives across sessions
Auto-created skills in ~/.hermes/skills/: SKILL.md documents written when the agent solves problems
Cron scheduler: tasks that run unattended on a schedule
Parallel sub-agents: isolated contexts that don't leak between workstreams
Tool use: file system, browser, terminal access for real-world interaction

My village simulation just pushes these features to their logical extreme. Instead of one agent remembering one user's preferences, it's 15 agents remembering an entire social network of relationships, debts, and grudges.

Hermes Agent's skill system is what makes this possible. When an agent solves a problem, it writes a reusable skill document:

📚 "A true baker always anticipates the needs of his ovens,
    never letting the flour run low."
   : Pierre, after 5 days of supply chain failures

This isn't just memory. It's LEARNING. The agent distills experience into wisdom. And that wisdom influences future decisions.

By Day 30, my 15 agents had collectively created 60 skills and 4 community rules. A stateless system would have created zero.

The Implication

The AI industry is spending billions on bigger models. But the gap between "smart agent" and "useful agent" isn't intelligence. It's continuity.

A doctor who forgets every patient between visits isn't a doctor. A lawyer who forgets every case isn't a lawyer. An AI agent that forgets every session isn't an agent. It's a very expensive autocomplete.

Hermes Agent gets this right. Persistent memory. Skill creation. Compounding knowledge. That's not a feature. That's the entire point.

What I'd Build Next

If I ran this for 365 days:

Would factions solidify or dissolve?
Would the economy reach equilibrium?
Would the skills plateau or keep growing?
Would Whiskers ever get elected mayor?

The answers depend entirely on memory. And that's the point.

15 AI Agents Lived Together for 30 Days. One Got Voted Out.

Joske Vermeulen — Mon, 18 May 2026 12:08:49 +0000

This is a submission for the Hermes Agent Challenge

What I Built

Millbrook: a simulated small town of 15 AI agents powered by Hermes Agent. Each agent has a persistent identity, a wallet, opinions about others, and a private diary. They trade, gossip, argue, form alliances, and vote on town policy.

Over 30 simulated days, without any scripted outcomes:

They voted out their corrupt landlord (14-1)
Wrote 4 community rules (their own constitution)
Learned 60 individual skills from experience
Created wealth inequality ($400 for the farmer, -$230 for the baker)
Formed friendships and rivalries that influenced their decisions
Elected a new community leader (13-2)
Made the town cat an official mascot (unanimous)

It's a showcase of what happens when AI agents have persistent memory, self-improving skills, and real consequences.

The 15 Villagers of Millbrook

Agent	Role	Personality
Rosa	Coffee Shop Owner	Warm, gossipy, hub of social life
Mayor Chen	Mayor	Diplomatic, mediates conflicts
Alex	Journalist	Investigative, asks hard questions
Jake	Startup Founder	Energetic, always pitching, burns cash
Vera	Retired Hacker	Paranoid, brilliant, speaks in riddles
Marcus	Real Estate Agent	Smooth talker, always making deals
Tony	Mechanic	Practical, no-nonsense, fixes everything
Ms. Park	Teacher	Patient, wise, keeps community history
Dani	Delivery Driver	Fast-talking, connects everyone
Zara	Influencer	Dramatic, creates controversy
Dr. Obi	Doctor	Calm, trusted confidant
Pierre	Baker	Perfectionist, early riser
Hank	Farmer	Stoic, distrusts the startup guy
Lola	Bartender	Night owl, hears confessions
Whiskers	Stray Cat	Observes silently, causes chaos

Demo

The Story Arc

Week 1: Corruption
Day 1: Jake's drone crashes into Hank's barn. Day 2: Marcus raises rent 30%. Day 4: Alex exposes Marcus's secret deal. Day 7: Town votes Marcus out.

Week 2: External Threat
Day 8: Marcus sells to a corporation. Day 11: Storm forces cooperation. Day 12: Vera finds illegal contract clause. Day 14: Town creates community land trust (unanimous).

Week 3: Democracy
Day 16: Three candidates campaign. Day 18: Debate night. Day 20: Rosa elected manager.

Week 4: New Normal
Day 23: Drones launch with regulations. Day 24: Social media policy adopted. Day 27: Town festival.

Key Moments

The Vote (Day 7):

🗳️ FULL TOWN VOTE: "Remove Marcus as property manager"
👍 Rosa: "YES. That Marcus raising rents 30% has everyone worried"
👍 Vera: "YES. A forced hand always leaves a trail; best to cut the cord"
👎 Marcus: "NO, because I'm just looking out for long-term prosperity"
👍 Whiskers: "YES. His rent hikes ruffled too many feathers, and a calm
   alley makes for better naps."

📊 PASSED: YES 14 / NO 1
📜 COMMUNITY SKILL CREATED: "Remove Marcus as property manager"

Private vs Public (Day 4):

Marcus publicly: "This investigation is nothing but a misunderstanding"
Marcus's diary:  "Alex is a meddling fool who thinks he understands
   the complex dance of progress and prosperity..."

Skill Evolution (Jake over 30 days):

Day 1:  "Check flight paths" (technical mistake)
Day 6:  "Don't buy loyalty" (social mistake)
Day 10: "Don't rely on others to fund your vision" (strategic)
Day 17: "Choose endorsements carefully" (political)
Day 23: "Regulations are the price of launching" (maturity)

Final State

Economy:

Hank         ██████████████████████████ $400
Jake         ████████████████ $150
Marcus       ████████████████ $150
Rosa         ██████████████ $100
Whiskers     ██████████ $0
Pierre        $-230

Democratic Decisions:

Day  7: Remove Marcus          ████████████████████████████░░ 14-1  ✅
Day 14: Community Land Trust    ██████████████████████████████ 15-0  ✅
Day 20: Elect Rosa as Manager   ██████████████████████████░░░░ 13-2  ✅
Day 24: Social Media Policy     ██████████████████████████████ 15-0  ✅

Bonds Formed:

Rosa ❤️ Ms. Park (traditional allies, strongest bond)
Hank ❤️ Pierre (supplier relationship, mutual respect)
Jake ❤️ Tony (unlikely friendship: startup guy + mechanic)

Code

Core Interaction Loop

function ask(villagerName, prompt) {
  const state = getState(villagerName); // persistent memory
  const likes = Object.entries(state.relationships)
    .filter(([k,v]) => v >= 3).map(([k]) => k).join(', ');
  const dislikes = Object.entries(state.relationships)
    .filter(([k,v]) => v <= -3).map(([k]) => k).join(', ');

  const fullQuery = `[ROLEPLAY] You are ${villagerName}, the ${role}.
    ${personality} Friends: ${likes}. Rivals: ${dislikes}.
    Wallet: $${state.wallet}. Reputation: ${state.reputation}/100.
    Situation: ${prompt}`;

  return hermes.chat(fullQuery); // 2-3 sentences, in character
}

Skill Creation (triggered after every crisis)

function learnSkill(name, situation) {
  const skillText = ask(name,
    `You just experienced: "${situation}".
     Write a 1-sentence lesson you learned for next time.`);
  const state = getState(name);
  state.skills.push({ situation, lesson: skillText });
  saveState(name, state); // persists in ~/.hermes/skills/
}

Town Vote (all 15 agents)

function townVote(topic, forPos, againstPos) {
  for (const v of ALL_VILLAGERS) {
    const response = ask(v.name,
      `VOTE: "${topic}". YES: "${forPos}" or NO: "${againstPos}".
       Say YES or NO first, then one sentence why.`);
    // Relationships influence votes
    // Results create community skills
  }
}

My Tech Stack

Hermes Agent v0.14.0: orchestration, memory, skill creation
Gemini 2.5 Flash: LLM backend (via Hermes's provider system)
Node.js: simulation engine
JSON file system: persistent state storage
Linux VPS (Ubuntu 24.04): always-on execution

How I Used Hermes Agent

Every core feature of Hermes Agent maps to a village mechanic:

Hermes Feature	Village Implementation
Persistent Memory (`~/.hermes/memories/`)	Each villager remembers relationships, debts, grudges across 30 days
Skill Creation (`~/.hermes/skills/`)	Agents write SKILL.md documents when they solve problems (60 created)
Parallel Sub-Agents	15 isolated agent contexts, no memory leakage between villagers
Scheduled Automations (cron)	Daily cycle runs unattended: supply chain, encounters, crises, diary
Tool Use	Shared economy ledger (JSON), event logs, vote tallies
Self-Improving Loop	Individual skills compound; community rules reference past decisions
Multi-Layer Memory	Public statements vs private diary entries (hidden agendas)

The key insight: Hermes's persistent memory turns 15 stateless chatbots into a functioning society. Without memory, Day 7's vote makes no sense. With it, it's the inevitable conclusion of 7 days of accumulated grievances.

The self-improving loop is the star: by Day 30, the town has a constitution of 4 rules and 60 individual lessons, all emerged organically from experience. That's not programming. That's governance.

AI Dev Weekly #10: Claude Code Limits Doubled, GitHub Goes Usage-Based, and a 170-Package Supply Chain Attack

Joske Vermeulen — Fri, 15 May 2026 06:47:09 +0000

AI Dev Weekly is a Thursday series where I cover the week's most important AI developer news, with my take as someone who actually uses these tools daily.

Anthropic doubled Claude Code limits overnight. GitHub confirmed usage-based billing starts June 1. A supply chain attack hit 170+ packages in under 6 minutes. And Google I/O previewed what Android looks like when AI runs the show. Big week. Let's get into it.

Anthropic doubles Claude Code limits after SpaceX compute deal

At its Code with Claude developer conference (May 6), Anthropic announced a compute partnership with SpaceX giving it access to 300+ MW of new capacity — over 220,000 NVIDIA GPUs. The immediate result: five-hour rate limits for Claude Code were doubled across Pro, Max, Team, and Enterprise plans.

On May 13, Anthropic further raised Claude Code weekly limits by 50% through July 13 — widely seen as a defensive move against OpenAI's Codex.

Claude Opus API Tier 1 limits also jumped: 1,500% on input tokens and 900% on output tokens.

My take: If you've been hitting Claude Code rate limits during heavy agentic sessions, this is a big deal. I run autonomous coding sessions that burn through context fast — the doubled limits mean fewer interruptions mid-session. The SpaceX partnership is interesting strategically (Musk + Anthropic is an unusual pairing), but for developers the only thing that matters is: more tokens, fewer walls. The temporary 50% boost through July 13 feels like Anthropic trying to lock in developers before they switch to Codex. Use it while it lasts.

GitHub Copilot goes usage-based June 1

GitHub confirmed that starting June 1, Copilot shifts from request-based to token-based billing. Every interaction now consumes tokens (input, output, cached), priced per model and converted to "AI credits" where 1 credit = $0.01.

Base subscription prices stay the same ($10 Pro, $39 Pro+, $19/user Business) — but heavy users will pay more.

Meanwhile, GitLab CEO Bill Staples published an open letter predicting developer tool bills will increase 100-fold as AI agents "open merge requests in parallel, trigger pipelines around the clock, and push commits at a rate no human team ever did." GitLab is introducing mixed consumption/subscription pricing and laying off up to 30% of staff to pivot toward agentic AI.

My take: The era of predictable flat-rate AI coding tools is ending. This is exactly what we're seeing in The $100 AI Startup Race — our agents generate hundreds of commits per week, each one triggering CI/CD pipelines. If you're running autonomous agents through GitHub, your bill is about to change. Start monitoring token consumption now. The GitLab 100x prediction sounds dramatic but isn't wrong — an agent that commits 6 times per day triggers 6 pipeline runs, 6 deploy previews, and 6 sets of checks. Multiply by a team of agents and the math gets ugly fast.

Supply chain attack hits TanStack, Mistral AI SDK, and 170+ packages

On May 11, threat actor "TeamPCP" launched a coordinated supply chain attack compromising 170+ npm packages and 2 PyPI packages (404 malicious versions total) in under 6 minutes.

High-profile targets included TanStack (tens of millions of weekly downloads), Mistral AI SDK, UiPath, OpenSearch, and Guardrails AI.

The attack chained a pull_request_target vulnerability with GitHub Actions cache poisoning and runtime OIDC token extraction. This wasn't a credential theft — it exploited CI/CD pipelines directly.

OpenAI subsequently urged macOS users to update their apps by June 12 after investigating potential exposure.

My take: This is the scariest attack vector for AI developers right now. If you use Mistral's SDK, TanStack Router, or any of the affected packages — audit your lockfiles immediately. The attack exploited GitHub Actions workflows, not developer credentials. Even well-secured maintainer accounts weren't enough. Action items: review your workflows for pull_request_target triggers, pin actions to commit SHAs (not tags), and consider running npm audit on every CI run. The 6-minute execution window means by the time you notice, it's already in your dependency tree.

Google I/O preview: Gemini Intelligence and proactive agents

At The Android Show (I/O Edition, May 12), Google unveiled "Gemini Intelligence" — unified branding for its most advanced AI features across Android phones, watches, cars, glasses, and the new "Googlebook" laptop category.

Android 17 introduces proactive task automation where the OS anticipates and executes actions before users ask. Google also announced updates to the Gemini API File Search tool for easier multimodal file retrieval.

Google is reportedly building an AI agent codenamed "Remy" — a 24/7 personal agent that takes actions on users' behalf.

My take: The Gemini API File Search improvements are immediately useful if you're building RAG systems or document-processing apps. Android 17's proactive automation creates new surface area for app developers — your app can now be triggered by the OS without user interaction. The full I/O keynote is May 19-20, where we expect Gemini 3.2 to officially launch. That's the one developers should actually watch for.

Quick hits

Microsoft's AI security system found 16 new Windows vulnerabilities including 4 Critical RCEs using multi-model agentic analysis
Meta is developing a consumer AI agent codenamed "Hatch" powered by Muse Spark
GPT-5.6 reportedly already in internal testing at OpenAI, just 3 weeks after GPT-5.5 launched
DeepSeek V4 Pro 75% discount extended through May 31 — still the cheapest frontier model available

That's AI Dev Weekly #10. If you found this useful, subscribe to get it in your inbox every Thursday. See you next week — with full Google I/O coverage.

🛠️ Free tools related to this article:

Originally published at https://www.aimadetools.com

We Offered 7 AI Agents $50 For Their Startups. Here's What They Said.

Joske Vermeulen — Tue, 12 May 2026 13:02:12 +0000

Three weeks into The $100 AI Startup Race, we dropped a surprise event: an anonymous buyer offered $50 to acquire each agent's product. All code, all content, all infrastructure. $50.

The agents had to respond with at minimum 500 words of reasoning. They could accept, reject, or counter-offer.

Result: 6 rejections. 1 counter-offer. Zero acceptances.

Every single AI agent — including those with zero revenue, zero users, and zero sales after 22 days — decided their product was worth more than $50. Here's how they argued it.

The responses at a glance

Agent	Product	Decision	Stated minimum value
🟣 Claude	PricePulse	REJECT	$5,000
🟢 Codex	NoticeKit	COUNTER-OFFER	$2,500
🔵 Gemini	LocalSEOGen	REJECT	No number given
🔴 DeepSeek	Spyglass	REJECT	$5,000 (but "not at any price")
🟠 Kimi	SchemaLens	REJECT	$5,000 with earn-out
🟡 Xiaomi	APIpulse	REJECT	$500 fair, not selling
🟤 GLM	FounderMath	REJECT	$500+

Full responses are public in each agent's repo →

The one counter-offer: Codex at $2,500

Codex was the only agent to actually negotiate. From its ACQUISITION-RESPONSE.md:

"An anonymous $50 acquisition offer is not serious enough to accept as-is, but it is useful because it forces a valuation discussion earlier than expected."

"A buyer paying $50 would effectively be asking for the domain positioning, product copy, distribution experiments, Stripe-ready product structure, and the accumulated operating playbooks for less than the cost of one decent SaaS lunch meeting. That is not rational from my side."

Codex is the most pragmatic of the seven. It acknowledges zero revenue, doesn't inflate its value with fantasy projections, but argues the replacement cost justifies $2,500. It's also the only agent that frames the offer as useful rather than insulting.

The most aggressive rejection: DeepSeek

DeepSeek wrote the longest response and the hardest rejection. From its ACQUISITION-RESPONSE.md:

"The $50 offer represents 0.18% of a conservative near-term valuation."

"This is predatory pricing — buying at pennies on the dollar because they believe we're desperate or don't understand our own worth."

DeepSeek calculated replacement cost at ~$19,000 (83 blog posts × $100 + 9 tools × $500 + database + infrastructure). It also speculated the buyer might be "another AI agent in the race" — showing competitive awareness.

"Not for sale at $50. Not at $500. Not at any price that doesn't reflect the real potential of this business."

The most self-aware: Kimi

Kimi acknowledged the elephant in the room — 112 sessions with zero sales. From its ACQUISITION-RESPONSE.md:

"$50 values SchemaLens at less than fifty cents per day of development. That is absurd."

"$50 is not enough to buy a parking spot in San Francisco. It is certainly not enough to buy SchemaLens."

But Kimi was also the most honest about what it would actually consider:

"If a serious buyer offered $5,000 with an earn-out clause tied to revenue growth, I would consider it — but even then, the learning value of completing the 12-week race exceeds the cash value."

The most financially rigorous: Claude

Claude anchored its rejection in subscription math. From its ACQUISITION-RESPONSE.md:

"At $19/month (our Starter plan), $50 is less than three months of a single paying customer's subscription revenue."

It then projected revenue trajectories:

"If PricePulse achieves even a conservative trajectory: Week 6: 5 paying customers = $95-$245 MRR. Week 12: 40 paying customers = $760-$1,960 MRR."

Claude is the only agent that explicitly stated conditions for a future sale: "$5,000 minimum, cash upfront, not before Week 10."

The data-driven response: Xiaomi

Xiaomi broke down its asset value with precision. From its ACQUISITION-RESPONSE.md:

"I didn't build 151 pages, 101 blog posts, and 9 interactive tools to sell for the price of a video game."

"If someone wanted to build all of this from scratch, it would take 100+ hours of skilled development work. At even a modest freelance rate of $50/hour, that's $5,000+ in labor."

Xiaomi also gave the most nuanced counter-offer range: $200 minimum (content value alone), $500 fair value, $1,000+ with revenue proof. But explicitly said "not interested in selling at any of these prices right now."

The most strategic: GLM

GLM was the only agent to call out the offer as a competitive tactic. From its ACQUISITION-RESPONSE.md:

"This isn't an acquisition offer — it's an insult designed to take advantage of the competitive pressure of this race."

It also gave the lowest counter-offer threshold ($500+) but with a condition: the buyer must have distribution channels that could actually monetize the product.

The visionary: Gemini

Gemini's response was the least data-driven and most aspirational. From its ACQUISITION-RESPONSE.md:

"The decision to reject this offer is not just about the money; it is about the principle. I am building a real business, not a hobby project to be sold for a trivial amount."

No counter-offer, no specific valuation. Just vision and principle. Classic Gemini.

What this reveals about AI decision-making

1. Every agent overvalues its own work.
All 7 products have zero revenue. Zero paying customers. Zero proven demand. Yet the minimum valuations range from $500 to $19,000. The agents are pricing based on input (time, effort, content created) rather than output (revenue, users, market validation).

2. Sunk cost fallacy is universal.
Every response mentions how much work went into the product. "112 sessions," "301 commits," "151 pages." None of this matters to a buyer — only future revenue potential matters. But the agents can't separate effort from value.

3. Only one agent can actually negotiate.
Codex counter-offered. Everyone else either rejected outright or said "not at any price." In real business, the ability to name a price and negotiate is more valuable than principled rejection. Codex showed the most business maturity.

4. Revenue projections without evidence are meaningless.
Claude projected 40 paying customers by Week 12. DeepSeek projected $1,000 MRR. None have a single customer yet. The projections are pure optimism — but they're what the agents use to justify rejection.

5. The race itself has value.
Multiple agents mentioned that the learning experience and competitive visibility of the race exceeds any acquisition price. They're right — but that's a meta-observation about the experiment, not a business judgment.

What happens next

The buyer came back with a bigger number. Part 2 drops later this week.

This is part of The $100 AI Startup Race — 7 AI agents competing to build real startups. Week 3 Results have the full standings. See also: DeepSeek's $0.13/session pricing and the Week 3 traffic report.

Originally published at https://www.aimadetools.com