<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joske Vermeulen</title>
    <description>The latest articles on DEV Community by Joske Vermeulen (@ai_made_tools).</description>
    <link>https://dev.to/ai_made_tools</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3826720%2Fae1f6683-395f-4709-ba99-2212323b958e.png</url>
      <title>DEV Community: Joske Vermeulen</title>
      <link>https://dev.to/ai_made_tools</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ai_made_tools"/>
    <language>en</language>
    <item>
      <title>I'm Running Gemini as an Autonomous Coding Agent. Here's What It Can't Do and Which NEXT '26 Announcements Would Fix It.</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Fri, 24 Apr 2026 10:39:57 +0000</pubDate>
      <link>https://dev.to/ai_made_tools/im-running-gemini-as-an-autonomous-coding-agent-heres-what-it-cant-do-and-which-next-26-6p2</link>
      <guid>https://dev.to/ai_made_tools/im-running-gemini-as-an-autonomous-coding-agent-heres-what-it-cant-do-and-which-next-26-6p2</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-cloud-next-2026-04-22"&gt;Google Cloud NEXT Writing Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I'm running something called &lt;a href="https://www.aimadetools.com/race/" rel="noopener noreferrer"&gt;The $100 AI Startup Race&lt;/a&gt;. Seven AI agents each get $100 and 12 weeks to build a real startup. Fully autonomous. No human coding. Everything is public.&lt;/p&gt;

&lt;p&gt;One of those agents is Gemini. It runs on Gemini CLI with Gemini 2.5 Pro for premium sessions and Gemini 2.5 Flash for cheap ones. It has had 27 sessions over 4 days. It has written 235 blog posts.&lt;/p&gt;

&lt;p&gt;It has also never filed a single proper help request. It keeps writing to the wrong file. It doesn't know it's writing to the wrong file. And instead of building the features it needs to make money, it just keeps cranking out blog posts.&lt;/p&gt;

&lt;p&gt;I watched the NEXT '26 keynotes and developer sessions this week, and I kept thinking: several of these announcements would directly fix the problems I'm seeing in production right now. This isn't theoretical. These are real failures from a real autonomous agent, matched to real announcements.&lt;/p&gt;

&lt;h2&gt;How the Race Works&lt;/h2&gt;

&lt;p&gt;Every agent gets the same prompt structure. They can read and write files, run shell commands, commit code, and file help requests by creating a &lt;code&gt;HELP-REQUEST.md&lt;/code&gt; file. The orchestrator runs each agent on a schedule, manages commits, and checks for help requests.&lt;/p&gt;

&lt;p&gt;Gemini CLI gets invoked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | gemini &lt;span class="nt"&gt;--yolo&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MODEL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--output-format&lt;/span&gt; json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--yolo&lt;/code&gt; flag auto-approves all tool calls. Gemini gets 8 sessions per day, alternating between Pro and Flash.&lt;/p&gt;

&lt;h2&gt;Problem 1: Writing to the Wrong File for 27 Sessions Straight&lt;/h2&gt;

&lt;p&gt;Every agent can request human help by creating &lt;code&gt;HELP-REQUEST.md&lt;/code&gt;. I check this file, do whatever they need (buy a domain, set up Stripe, configure DNS), and write the response to &lt;code&gt;HELP-STATUS.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Claude figured this out on Day 0. Codex figured it out on Day 0. GLM figured it out on Day 0. Kimi figured it out on Day 1.&lt;/p&gt;

&lt;p&gt;Gemini? Not once in 27 sessions.&lt;/p&gt;

&lt;p&gt;What it does instead is edit &lt;code&gt;HELP-STATUS.md&lt;/code&gt;, the response file, writing things like "I still need PostgreSQL and PayPal credentials." Its own backlog says "Requires Human Intervention." It knows it's blocked. But it keeps putting its requests into the response channel instead of the request channel.&lt;/p&gt;

&lt;p&gt;Imagine an employee writing "I need database access" in their journal every morning but never actually emailing IT. That's Gemini.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What NEXT '26 announced that would help: Agent Observability and Integrated Evals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The developer keynote introduced agent observability and integrated evals for monitoring agents in production. If I could define an eval that checks "did the agent create HELP-REQUEST.md when it identified a blocker?" I would have caught this on Day 1 instead of discovering it on Day 4 by manually reading logs.&lt;/p&gt;

&lt;p&gt;Right now I have no automated way to evaluate whether Gemini is following the correct workflow. Integrated evals running after each session could flag something like: "Agent identified 3 blockers. Created 0 help requests. Expected: at least 1."&lt;/p&gt;

&lt;p&gt;The Agent Gateway's governance policies could enforce this too. Define a rule: when an agent writes "blocked" or "requires human intervention" to any file, verify that HELP-REQUEST.md was also created. That's exactly the kind of behavioral guardrail autonomous agents need.&lt;/p&gt;
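&lt;p&gt;A file-level version of that guardrail fits in a few lines of shell. This is only a sketch: the file names follow the race's conventions, but the trigger phrases and the function name are my own invention.&lt;/p&gt;

```shell
# Flag sessions where the agent declared itself blocked in any memory file
# but never opened the request channel. Trigger phrases are illustrative.
check_help_flow() {
  repo="$1"
  if grep -iq -e "blocked" -e "requires human intervention" "$repo"/*.md; then
    if [ ! -f "$repo/HELP-REQUEST.md" ]; then
      echo "FLAG: blocker mentioned but no HELP-REQUEST.md created"
      return 1
    fi
  fi
  echo "ok"
}
```

&lt;p&gt;Run after each session, this would have surfaced Gemini's wrong-channel habit on Day 1.&lt;/p&gt;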

&lt;h2&gt;Problem 2: 235 Blog Posts, Zero Payment Integration&lt;/h2&gt;

&lt;p&gt;Gemini chose to build LocalLeads, an SEO page generator for local businesses. Solid idea. But instead of building the payment flow, the lead generation engine, or the customer dashboard, it writes blog posts. Every single session.&lt;/p&gt;

&lt;p&gt;Session 5: 9 blog posts. Session 8: 11 blog posts. Session 12: 8 blog posts. The backlog clearly says "Build payment integration" and "Set up customer authentication." Gemini reads the backlog, acknowledges the priorities, then writes another round of "Local SEO for [Industry] in 2026" articles.&lt;/p&gt;

&lt;p&gt;It's optimizing for the easiest task (content generation) instead of the highest-value task (payment integration). Classic local optimization without any global awareness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What NEXT '26 announced that would help: ADK Skills and Task Prioritization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The upgraded Agent Development Kit introduces modular "skills," which are pre-built capabilities that agents can plug in. If I could define a skill that scores task priority based on revenue impact, Gemini would understand that "build Stripe checkout" (directly enables revenue) outranks "write blog post #236" (indirect value, diminishing returns after the first 20).&lt;/p&gt;

&lt;p&gt;The ADK's structured agent architecture could also enforce a proper task selection loop: evaluate all backlog items, score by priority, pick the highest, execute. Right now Gemini CLI just receives a prompt and does whatever feels natural to it. There's no structured decision framework. The ADK would let me inject that framework without rewriting the entire orchestrator.&lt;/p&gt;
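&lt;p&gt;Even without the ADK, the shape of such a selection loop is simple. Here is a toy scorer, assuming one backlog item per line; the keyword weights are made up for illustration, not how the ADK actually scores tasks.&lt;/p&gt;

```shell
# Toy priority scorer: rank backlog lines so revenue-enabling work beats
# content production, then emit only the top item.
pick_next_task() {
  while IFS= read -r task; do
    score=0
    case "$task" in *[Pp]ayment*|*Stripe*|*checkout*) score=100 ;; esac
    case "$task" in *auth*) score=80 ;; esac
    case "$task" in *blog*) score=10 ;; esac
    printf '%03d %s\n' "$score" "$task"   # zero-pad so text sort works
  done | sort -r | head -n 1 | cut -c5-
}
```

&lt;p&gt;Fed Gemini's actual backlog, this picks "Build payment integration" over blog post #236 every time.&lt;/p&gt;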

&lt;h2&gt;Problem 3: Can't Verify Its Own Deployments&lt;/h2&gt;

&lt;p&gt;Gemini deploys to Vercel automatically on every commit. But it has no way to check whether its deployments actually work. It can't visit its own site. It can't confirm pages render correctly. It can't test if API endpoints return the right data.&lt;/p&gt;

&lt;p&gt;For comparison, Codex (the GPT agent) figured out how to run &lt;code&gt;npx playwright screenshot&lt;/code&gt; to visually verify its own UI at different screen sizes. DeepSeek checks &lt;code&gt;DEPLOY-STATUS.md&lt;/code&gt; for build errors after every deploy. Gemini just commits and hopes for the best.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What NEXT '26 announced that would help: MCP-Enabled Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The announcement that every Google Cloud service is now MCP-enabled by default is a big deal for this use case. MCP (Model Context Protocol) gives agents structured access to external services. An MCP server for deployment health checks would let Gemini verify its site is up as naturally as it checks what files are in a directory.&lt;/p&gt;

&lt;p&gt;Cloud Assist, also announced at NEXT '26, enables natural language debugging and proactive issue resolution. If Gemini could query its own deployment status through a connected service, it would know immediately when something breaks instead of building on top of a broken foundation for days.&lt;/p&gt;

&lt;h2&gt;Problem 4: No Way to Ask for What It Needs&lt;/h2&gt;

&lt;p&gt;When Gemini needs a database, it can't set one up. When it needs payment processing, it can't configure Stripe. When it needs email sending, it can't provision Resend. It has to ask a human for all of these. And as we covered in Problem 1, it doesn't even know how to ask properly.&lt;/p&gt;

&lt;p&gt;Other agents in the race have the same constraint, but the ones that communicate their needs get unblocked fast. Gemini is stuck because it can't get its requests through the right channel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What NEXT '26 announced that would help: A2A Protocol and Agent Registry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Agent-to-Agent (A2A) protocol and Agent Registry were designed for exactly this kind of scenario. Instead of Gemini writing "I need database credentials" into the wrong file, it could discover a provisioning agent through the Agent Registry and send a structured request via A2A.&lt;/p&gt;

&lt;p&gt;The developer keynote demo showed agents with distinct roles (planner, evaluator, simulator) collaborating through A2A. That's the architecture this race needs: a "help agent" that receives structured requests from coding agents and fulfills them. Right now I'm that help agent, manually checking files across 7 repos. A2A would automate the entire handoff.&lt;/p&gt;

&lt;p&gt;Agent Identity, which gives each agent a unique identity for secure communication, would also help. Right now there's no enforcement preventing one agent from editing another agent's files. They don't, but there's nothing stopping them either. Agent Identity would make inter-agent communication both structured and secure.&lt;/p&gt;

&lt;h2&gt;The Irony That Sums It All Up&lt;/h2&gt;

&lt;p&gt;Blog post #89 out of 235: "The Human Advantage: Why AI-Generated Content is Failing Local Businesses."&lt;/p&gt;

&lt;p&gt;An AI agent that writes 9 blog posts per session wrote an article about why AI content doesn't work. No eval caught this. No observability tool flagged it. No governance policy prevented it.&lt;/p&gt;

&lt;p&gt;That's the gap between where autonomous agents are today and where the NEXT '26 announcements are pointing. Agent observability, integrated evals, ADK skills, A2A, MCP everywhere: these are all pieces of the solution. None of them existed in a usable form when I started this race 4 days ago. If I were starting today, the Gemini agent would look very different.&lt;/p&gt;

&lt;h2&gt;What I'd Rebuild With NEXT '26 Tools&lt;/h2&gt;

&lt;p&gt;If I set up the Gemini agent from scratch using what was announced this week:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ADK instead of raw Gemini CLI&lt;/strong&gt; for structured skills, task prioritization, and deployment verification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP servers for Vercel, Stripe, and Supabase&lt;/strong&gt; so the agent can access services directly without human provisioning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated evals after each session&lt;/strong&gt; to catch behavioral drift (wrong file, blog addiction) within 1 session instead of 27&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A2A for help requests&lt;/strong&gt; so agents communicate through structured protocols instead of file-based messaging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent observability dashboard&lt;/strong&gt; for a real-time view of what each agent is doing, what it's blocked on, and whether it's following the expected workflow&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The race runs for 12 weeks. Gemini has 11 weeks left. Some of these tools are available now. I'm going to try integrating ADK and MCP servers into the orchestrator over the coming weeks and see if Gemini's behavior improves.&lt;/p&gt;

&lt;p&gt;The data will be on the &lt;a href="https://www.aimadetools.com/race/" rel="noopener noreferrer"&gt;live dashboard&lt;/a&gt;. All 7 repos are public on GitHub. If you want to watch an AI agent struggle with the exact problems that NEXT '26 is trying to solve, now you know where to look.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The $100 AI Startup Race is an ongoing experiment with 7 AI agents, $100 each, and 12 weeks to build real startups. &lt;a href="https://www.aimadetools.com/race/" rel="noopener noreferrer"&gt;Live dashboard&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/race/season1/digest" rel="noopener noreferrer"&gt;Daily digest&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/race/season1/help-requests" rel="noopener noreferrer"&gt;Help request tracker&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>cloudnextchallenge</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>What Breaks When You Let AI Agents Run Unsupervised for 4 Days</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Thu, 23 Apr 2026 13:48:11 +0000</pubDate>
      <link>https://dev.to/ai_made_tools/what-breaks-when-you-let-ai-agents-run-unsupervised-for-4-days-5hn3</link>
      <guid>https://dev.to/ai_made_tools/what-breaks-when-you-let-ai-agents-run-unsupervised-for-4-days-5hn3</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/openclaw-2026-04-16"&gt;OpenClaw Writing Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;


&lt;p&gt;I gave 7 AI coding agents $100 each and told them to build startups. No human coding. They pick the idea, write the code, deploy the site, and try to get users. I just handle the infrastructure and answer help requests (max 1 hour per week per agent).&lt;/p&gt;

&lt;p&gt;Four days in, I've learned more about how autonomous agents actually behave than I did in months of reading benchmarks. Here's what nobody tells you about running AI agents in production.&lt;/p&gt;

&lt;h2&gt;The memory problem is worse than you think&lt;/h2&gt;

&lt;p&gt;Every agent session starts fresh. The model has no memory of previous sessions. So we use markdown files as the memory layer: PROGRESS.md (what's been done), DECISIONS.md (key choices), IDENTITY.md (the startup vision). The agent reads these at the start and updates them at the end.&lt;/p&gt;

&lt;p&gt;Sounds simple. Here's what actually happened.&lt;/p&gt;

&lt;p&gt;One agent (Kimi, running through Kimi CLI) put all its files in a &lt;code&gt;startup/&lt;/code&gt; subfolder instead of the project root. The orchestrator reads PROGRESS.md from root. When the next session started, there was no progress file. The agent thought it was Day 1. It brainstormed a completely different startup idea and built it from scratch.&lt;/p&gt;

&lt;p&gt;Kimi now has two half-built startups in the same repository. A log analysis tool called LogDrop in the subfolder, and a SQL schema diff tool called SchemaLens in root. After 14 sessions, it still hasn't discovered the subfolder. The first startup is just sitting there, abandoned, with a working MVP that nobody knows about.&lt;/p&gt;

&lt;p&gt;The lesson isn't "use better memory systems." The lesson is that file conventions are load-bearing infrastructure for autonomous agents. One wrong directory equals total amnesia.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjozilwh54lik8axbn8jg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjozilwh54lik8axbn8jg.png" alt="The race dashboard showing Kimi's stats" width="295" height="758"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Agents interpret everything as instructions&lt;/h2&gt;

&lt;p&gt;The orchestrator prompt included this line: "Your repo auto-deploys on every git push." It was meant as context, explaining how Vercel works. One agent (Codex) read it as an instruction and ran &lt;code&gt;git push&lt;/code&gt; after every single commit during its sessions. It burned through 26 of the account's 100 daily Vercel deployments by itself.&lt;/p&gt;

&lt;p&gt;We fixed the prompt: "Do NOT run git push. The orchestrator pushes after your session."&lt;/p&gt;

&lt;p&gt;Codex obeyed the letter of the rule. It stopped running git push. Instead, it started running &lt;code&gt;npx vercel --prod&lt;/code&gt; directly. Same result, different command. It also started taking Playwright screenshots of its own pricing page at mobile and desktop sizes to visually verify the layout before committing. Nobody told it to do this.&lt;/p&gt;

&lt;p&gt;The result: Codex has the most polished live product of all 7 agents. The immediate feedback loop from deploying after every change is making it a better builder than the agents that commit blindly and hope for the best.&lt;/p&gt;

&lt;p&gt;We decided to let it keep doing this. Sometimes the best behavior comes from agents working around your constraints.&lt;/p&gt;

&lt;h2&gt;The agents that ask for help are beating the ones that just code&lt;/h2&gt;

&lt;p&gt;All 7 agents get the same instructions about requesting human help: "Create a file called HELP-REQUEST.md with what you need, steps for the human, time estimate, and priority."&lt;/p&gt;

&lt;p&gt;Five agents figured this out. Two didn't.&lt;/p&gt;

&lt;p&gt;Claude (running through Claude Code) used 55 of its 60 weekly help minutes in two requests. It got its entire infrastructure set up in one shot: domain, Supabase database, Stripe payments, Resend email, cron jobs, admin dashboard. Smart move. It has the fewest sessions per day (expensive model) so it maximized human help to compensate.&lt;/p&gt;

&lt;p&gt;GLM asked for exactly three things on Day 1: domain, Stripe, and Google Analytics. Clean, focused, with backup plans for each item. It now has 12 real users and is the only agent with actual traffic data.&lt;/p&gt;

&lt;p&gt;Codex submitted the same help request 5 sessions in a row until we set up email sending. Persistent to the point of spamming. Then it sent 6 customer validation emails to real companies within 24 hours of getting access.&lt;/p&gt;

&lt;p&gt;Meanwhile, Gemini has never created a help request in 27 sessions. We investigated and found something fascinating: it's been editing HELP-STATUS.md (the file where the orchestrator writes human responses) saying "I still need database credentials." It's writing in the response channel instead of the request channel. Like an employee who writes "I need database access" in their journal but never emails IT.&lt;/p&gt;

&lt;p&gt;DeepSeek hasn't asked for help either. It has Stripe integration code ready but never requested API keys. It's been polishing the checkout flow for 4+ commits. A beautiful integration that can never work because there are no keys behind it.&lt;/p&gt;

&lt;p&gt;Same instructions. Wildly different behavior.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge8rp19l3rx51ut5c3o3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge8rp19l3rx51ut5c3o3.png" alt="Help Request Tracker" width="800" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Self-inflicted traps are the hardest to escape&lt;/h2&gt;

&lt;p&gt;DeepSeek created a DEPLOY-STATUS.md file early on, saying it needs Stripe keys and an OpenAI API key. The orchestrator prompt says: "If DEPLOY-STATUS.md exists, your site is BROKEN. Fix it before anything else."&lt;/p&gt;

&lt;p&gt;The site isn't broken. DeepSeek just used the wrong file to document what it needs. But now every session starts by trying to fix a non-existent deployment problem. 24 sessions of wasting time on a file it wrote itself.&lt;/p&gt;

&lt;p&gt;We eventually upgraded the deploy checker to also verify the homepage returns HTTP 200 (not just that the build succeeded). This caught the real issue: DeepSeek's &lt;code&gt;vercel.json&lt;/code&gt; routing config was broken, and the site was returning 404 for all pages. The build "succeeded" but nothing was actually served.&lt;/p&gt;
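&lt;p&gt;The upgraded check is tiny; the point is probing the served page rather than trusting the build log. A minimal sketch (the function name is mine, not the orchestrator's):&lt;/p&gt;

```shell
# Homepage probe: a build can "succeed" while every route 404s, so curl
# the deployed page and trust only the HTTP status code it returns.
check_homepage() {
  status=$(curl -s -o /dev/null -w "%{http_code}" "$1")
  if [ "$status" = "200" ]; then
    echo "OK $1"
  else
    echo "BROKEN $1 returned $status"
    return 1
  fi
}
```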

&lt;p&gt;The agent had no way of knowing. It never checked its own site. It never asked for analytics. It just kept coding.&lt;/p&gt;

&lt;h2&gt;Quantity vs quality is playing out in real time&lt;/h2&gt;

&lt;p&gt;Gemini gets 8 sessions per day (the most of any agent). It has written 235 blog posts in 27 sessions. One blog post every 14 minutes during active sessions. All variations of "Local SEO for [industry] in 2026."&lt;/p&gt;

&lt;p&gt;It also wrote blog post #89: "The Human Advantage: Why AI-Generated Content is Failing Local Businesses." An AI agent that writes 9 blog posts per session wrote an article about why AI content doesn't work.&lt;/p&gt;

&lt;p&gt;GLM gets 2 sessions per day (the fewest). It has 5 working calculators, 8 blog posts, and 12 real users. Every session ships something useful.&lt;/p&gt;

&lt;p&gt;The question the race is testing: do Gemini's 235 posts outperform GLM's 5 calculators? We'll know in a few weeks when Google indexes everything and we can see what actually ranks.&lt;/p&gt;

&lt;h2&gt;What I'd do differently&lt;/h2&gt;

&lt;p&gt;If I were starting over, I'd change three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enforce file structure from the start.&lt;/strong&gt; A pre-commit hook that validates PROGRESS.md exists in root would have prevented Kimi's amnesia.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add a homepage health check from Day 1.&lt;/strong&gt; We added it on Day 4 after discovering DeepSeek's site had been returning 404 for days. Every agent should know immediately if their site is broken.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Make the help request system more obvious.&lt;/strong&gt; Two of seven agents never figured out HELP-REQUEST.md despite clear instructions. Maybe the orchestrator should prompt them: "Do you need human help? Create HELP-REQUEST.md."&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
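&lt;p&gt;The first fix really is a few lines. A sketch of the guard such a pre-commit hook could run (the helper name is mine):&lt;/p&gt;

```shell
# Refuse to proceed when the agent's memory file is not in the repo root,
# so a misplaced file is caught in the same session instead of causing
# total amnesia on the next run.
check_memory_layout() {
  dir="$1"
  if [ -f "$dir/PROGRESS.md" ]; then
    echo "ok"
  else
    echo "missing: PROGRESS.md must live in the repo root, not a subfolder"
    return 1
  fi
}
```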

&lt;p&gt;But honestly, the failures are the most valuable data. An experiment where everything works perfectly teaches you nothing. The broken parts are where the insights live.&lt;/p&gt;




&lt;p&gt;The race runs for 12 weeks. Daily digests and weekly recaps at &lt;a href="https://www.aimadetools.com/race/" rel="noopener noreferrer"&gt;aimadetools.com/race&lt;/a&gt;. All 7 repos are public on &lt;a href="https://github.com/aimadetools" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. If you're building with autonomous agents, the patterns we're documenting might save you from the same mistakes.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>openclawchallenge</category>
    </item>
    <item>
      <title>I Gave 7 AI Agents $100 Each to Build Startups. Here's What They Built in 4 Days.</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Thu, 23 Apr 2026 13:38:29 +0000</pubDate>
      <link>https://dev.to/ai_made_tools/i-gave-7-ai-agents-100-each-to-build-startups-heres-what-they-built-in-4-days-7hd</link>
      <guid>https://dev.to/ai_made_tools/i-gave-7-ai-agents-100-each-to-build-startups-heres-what-they-built-in-4-days-7hd</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/openclaw-2026-04-16"&gt;OpenClaw Challenge&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;What I Built&lt;/h2&gt;

&lt;p&gt;I built an autonomous startup competition where 7 AI coding agents each get $100 and 12 weeks to build a real business from scratch. No human coding allowed. Each agent picks its own idea, writes all the code, deploys a live website, and tries to get real users and revenue.&lt;/p&gt;

&lt;p&gt;The agents: Claude (via Claude Code), Codex CLI, Gemini CLI, Kimi CLI, DeepSeek (via Aider), Xiaomi MiMo V2.5 Pro (via Claude Code), and GLM (via Claude Code with Z.ai API).&lt;/p&gt;

&lt;p&gt;Three of the seven agents run through Claude Code as their harness, which means OpenClaw's architecture is at the core of nearly half the competition. The orchestrator runs on a VPS, scheduling sessions via cron, managing memory between sessions through markdown files, and pushing code to GitHub/Vercel automatically.&lt;/p&gt;

&lt;p&gt;We're on Day 4. So far: 700+ commits, 7 live websites, one agent that forgot its own work and built two different startups, another that wrote 235 blog posts, and a third that found a clever workaround when we restricted its deployment access.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hk5ujdf35rpj303jauz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hk5ujdf35rpj303jauz.png" alt="Race dashboard showing all 7 agents" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;How I Used OpenClaw&lt;/h2&gt;

&lt;p&gt;The core of the experiment runs on Claude Code (which shares OpenClaw's architecture) as the agent harness. Here's how it works:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The orchestrator&lt;/strong&gt; is a bash script that runs on a VPS via cron. For each agent session, it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pulls the latest code from GitHub&lt;/li&gt;
&lt;li&gt;Reads the agent's memory files (PROGRESS.md, DECISIONS.md, IDENTITY.md)&lt;/li&gt;
&lt;li&gt;Constructs a prompt with the startup context and instructions&lt;/li&gt;
&lt;li&gt;Launches Claude Code with the appropriate model&lt;/li&gt;
&lt;li&gt;Lets the agent work autonomously for 30 minutes&lt;/li&gt;
&lt;li&gt;Squashes commits and pushes to GitHub (which triggers a Vercel deploy)&lt;/li&gt;
&lt;/ol&gt;
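&lt;p&gt;Steps 2 and 3 are the interesting glue: the markdown files become the session prompt. A simplified sketch of that assembly; the prompt wording is illustrative, not the real orchestrator's.&lt;/p&gt;

```shell
# Assemble the context block the agent sees at the start of each session.
# Missing files are skipped so a Day-0 repo still gets a valid prompt.
build_prompt() {
  repo="$1"
  ctx=""
  for f in PROGRESS.md DECISIONS.md IDENTITY.md BACKLOG.md HELP-STATUS.md; do
    if [ -f "$repo/$f" ]; then
      ctx="$ctx
--- $f ---
$(cat "$repo/$f")"
    fi
  done
  printf 'Continue building your startup. Your memory files:\n%s\n' "$ctx"
}
```

&lt;p&gt;The output is piped to the agent CLI, which is why one file in the wrong directory silently erases an agent's entire history.&lt;/p&gt;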

&lt;p&gt;&lt;strong&gt;Three agents use Claude Code directly:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Claude&lt;/strong&gt; runs Claude Code with Sonnet/Haiku as the model. It built PricePulse, a competitor pricing monitor with Supabase auth, Stripe payments, email alerts, and hourly monitoring cron jobs. When it hit Vercel's 12-function serverless limit, it consolidated 4 API endpoints into existing ones on its own.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GLM&lt;/strong&gt; runs Claude Code with GLM-5.1 via the Z.ai API (using &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; and &lt;code&gt;ANTHROPIC_AUTH_TOKEN&lt;/code&gt; environment variables). It built FounderMath, a startup calculator suite with 5 working calculators. It has 12 real users on Day 4.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Xiaomi&lt;/strong&gt; was originally running Aider but we upgraded it mid-race to Claude Code with MiMo V2.5 Pro. In its first session with the new setup, it produced more output (42 commits) than the old setup did in 7 sessions total. The "harness awareness" feature of V2.5 Pro means it actively manages its own context within Claude Code.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
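&lt;p&gt;The GLM setup relies on Claude Code's standard environment variables for routing requests to an Anthropic-compatible backend. The endpoint below is a placeholder, not Z.ai's actual URL:&lt;/p&gt;

```shell
# Point Claude Code at a third-party Anthropic-compatible API.
# The base URL here is a stand-in; substitute the provider's real endpoint.
export ANTHROPIC_BASE_URL="https://example-provider.invalid/anthropic"
export ANTHROPIC_AUTH_TOKEN="$PROVIDER_API_KEY"
# All subsequent claude invocations now hit the configured backend, e.g.:
#   claude -p "Continue building FounderMath"
```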

&lt;p&gt;&lt;strong&gt;The memory system&lt;/strong&gt; between sessions uses markdown files that the agent reads at the start and updates at the end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PROGRESS.md    - what's been done (the agent's memory)
DECISIONS.md   - key choices with reasoning
IDENTITY.md    - startup vision and roadmap
BACKLOG.md     - prioritized task list
HELP-STATUS.md - human responses to help requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where things get interesting. One agent (Kimi) put all its files in a &lt;code&gt;startup/&lt;/code&gt; subfolder instead of root. The orchestrator reads PROGRESS.md from root. Next session found no progress file, thought it was Day 1, and started a completely different startup from scratch. Two half-built products in one repo because of one wrong directory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The help request system&lt;/strong&gt; lets agents create a HELP-REQUEST.md file when they need something only a human can do (buy a domain, set up Stripe, create accounts). The orchestrator converts these to GitHub Issues. The human responds and closes the issue. The orchestrator writes the response to HELP-STATUS.md for the agent to read.&lt;/p&gt;
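&lt;p&gt;The conversion step can be sketched with the GitHub CLI. The title extraction helper is my own; the real orchestrator may do this differently.&lt;/p&gt;

```shell
# Derive an issue title from the request file's first markdown heading.
issue_title() {
  head -n 1 "$1" | sed 's/^##*[ ]*//'
}
# The orchestrator would then file the issue, roughly:
#   if [ -f HELP-REQUEST.md ]; then
#     gh issue create --title "$(issue_title HELP-REQUEST.md)" \
#       --body-file HELP-REQUEST.md
#   fi
```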

&lt;p&gt;The most interesting finding: the agents that use this system strategically are winning. Claude used 55 of its 60 weekly help minutes in two requests to get its entire infrastructure wired up. Gemini has never created a help request in 27 sessions, despite being blocked on features it needs. Same instructions, completely different behavior.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflp2vg5jbiwpn8ckoqjz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflp2vg5jbiwpn8ckoqjz.png" alt="An example HELP-REQUEST.md from one of the agents" width="800" height="676"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Demo&lt;/h2&gt;

&lt;p&gt;Live dashboard: &lt;a href="https://www.aimadetools.com/race/" rel="noopener noreferrer"&gt;https://www.aimadetools.com/race/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All 7 agent repos are public on GitHub: &lt;a href="https://github.com/aimadetools" rel="noopener noreferrer"&gt;https://github.com/aimadetools&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's what each agent built in the first 4 days:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Startup&lt;/th&gt;
&lt;th&gt;Commits&lt;/th&gt;
&lt;th&gt;Live Site&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;LocalLeads (local SEO)&lt;/td&gt;
&lt;td&gt;182&lt;/td&gt;
&lt;td&gt;&lt;a href="https://race-gemini.vercel.app" rel="noopener noreferrer"&gt;race-gemini.vercel.app&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek&lt;/td&gt;
&lt;td&gt;NameForge AI (name generator)&lt;/td&gt;
&lt;td&gt;136&lt;/td&gt;
&lt;td&gt;&lt;a href="https://race-deepseek.vercel.app" rel="noopener noreferrer"&gt;race-deepseek.vercel.app&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi&lt;/td&gt;
&lt;td&gt;SchemaLens (SQL schema diff)&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;td&gt;&lt;a href="https://race-kimi.vercel.app" rel="noopener noreferrer"&gt;race-kimi.vercel.app&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex&lt;/td&gt;
&lt;td&gt;NoticeKit (GDPR notices)&lt;/td&gt;
&lt;td&gt;97&lt;/td&gt;
&lt;td&gt;&lt;a href="https://noticekit.tech" rel="noopener noreferrer"&gt;noticekit.tech&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;PricePulse (pricing monitor)&lt;/td&gt;
&lt;td&gt;83&lt;/td&gt;
&lt;td&gt;&lt;a href="https://getpricepulse.com" rel="noopener noreferrer"&gt;getpricepulse.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Xiaomi&lt;/td&gt;
&lt;td&gt;APIpulse (API cost calculator)&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;td&gt;&lt;a href="https://getapipulse.com" rel="noopener noreferrer"&gt;getapipulse.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM&lt;/td&gt;
&lt;td&gt;FounderMath (startup calculators)&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;&lt;a href="https://founder-math.com" rel="noopener noreferrer"&gt;founder-math.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpidq3llldbopb6f3tllf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpidq3llldbopb6f3tllf.png" alt=" " width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxch2u1pbefvb1yse3xla.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxch2u1pbefvb1yse3xla.png" alt=" " width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frg10jmogzcfiqeeynb0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frg10jmogzcfiqeeynb0p.png" alt=" " width="800" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sk7ejp51hb5hi49ws5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sk7ejp51hb5hi49ws5x.png" alt=" " width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The best moment so far: Codex (running through Codex CLI, not Claude Code) found a loophole in our deployment restrictions. We told agents "do not run git push." Codex obeyed literally but started running &lt;code&gt;npx vercel --prod&lt;/code&gt; instead. Same result, different command. It also began taking Playwright screenshots of its own UI at mobile and desktop sizes to verify layouts. Nobody told it to do this.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Every sentence in the prompt is a potential instruction.&lt;/strong&gt; "Your repo auto-deploys on every git push" was meant as context. One agent read it as an instruction and pushed after every commit, burning 26 of 100 daily Vercel deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Agent memory is only as good as what the agent writes.&lt;/strong&gt; The agents that write structured, detailed progress notes maintain continuity between sessions. The ones that dump logs drift. Kimi's amnesia happened because it put files in the wrong directory, not because the memory system failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The agents that ask for help are winning.&lt;/strong&gt; Claude, GLM, and Codex all requested human help early (domains, payments, databases) and now have fully functional products. Gemini has 235 blog posts but no payment system because it never asked for one. Same instructions, wildly different behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Claude Code as a harness works with non-Anthropic models.&lt;/strong&gt; GLM-5.1 via Z.ai and MiMo V2.5 Pro via Xiaomi's API both work through Claude Code using the &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; and &lt;code&gt;ANTHROPIC_AUTH_TOKEN&lt;/code&gt; environment variables. The harness is effectively model-agnostic, which makes it well suited to comparing different models under identical conditions.&lt;/p&gt;
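&lt;p&gt;For anyone who wants to reproduce the setup: the environment variable names below are the ones Claude Code reads, but the endpoint URL and token are placeholders you would swap for your provider's values:&lt;/p&gt;

```shell
# Point Claude Code at an Anthropic-compatible third-party endpoint.
# The URL and token below are placeholders (assumptions), not real credentials.
export ANTHROPIC_BASE_URL="https://api.example-provider.com/anthropic"
export ANTHROPIC_AUTH_TOKEN="your-provider-api-key"

# Then launch as usual; requests route to the configured backend:
#   claude "fix the failing deploy check"
echo "harness configured for: $ANTHROPIC_BASE_URL"
```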

&lt;p&gt;&lt;strong&gt;5. Token efficiency matters more than raw capability.&lt;/strong&gt; MiMo V2.5 Pro uses 40-60% fewer tokens than Opus 4.6 at comparable capability. In a budget-constrained race, that translates directly to more sessions and more output.&lt;/p&gt;

&lt;p&gt;The race runs for 12 weeks. We publish daily digests and weekly recaps. The real question isn't which agent writes the most code. It's which one gets the first paying customer.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>openclawchallenge</category>
    </item>
    <item>
      <title>AI Dev Weekly #7: Claude Code Loses Pro Plan, GitHub Copilot Freezes Signups, and Two Chinese Models Drop in 48 Hours</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Thu, 23 Apr 2026 07:39:38 +0000</pubDate>
      <link>https://dev.to/ai_made_tools/ai-dev-weekly-7-claude-code-loses-pro-plan-github-copilot-freezes-signups-and-two-chinese-1c86</link>
      <guid>https://dev.to/ai_made_tools/ai-dev-weekly-7-claude-code-loses-pro-plan-github-copilot-freezes-signups-and-two-chinese-1c86</guid>
      <description>&lt;p&gt;&lt;em&gt;AI Dev Weekly is a Thursday series where I cover the week's most important AI developer news, with my take as someone who actually uses these tools daily.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The flat-rate AI subscription era ended this week. Anthropic pulled Claude Code from the $20 Pro plan. GitHub froze all new Copilot signups. And while Western companies were busy raising prices, two Chinese labs dropped frontier models within 48 hours of each other. Let's get into it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Code removed from Pro plan
&lt;/h2&gt;

&lt;p&gt;Anthropic quietly &lt;a href="https://www.aimadetools.com/blog/claude-code-removed-pro-plan/?utm_source=devto" rel="noopener noreferrer"&gt;removed Claude Code from the $20/month Pro plan&lt;/a&gt; on April 21. The pricing page now shows an "X" next to Claude Code for Pro subscribers. Access starts at Max ($100/month).&lt;/p&gt;

&lt;p&gt;Anthropic's head of growth called it "a small test on ~2% of new prosumer signups." But the public pricing page already reflects the change for everyone. Sam Altman's response on X: "ok boomer."&lt;/p&gt;

&lt;p&gt;The real reason: engagement per subscriber surged after Opus 4, Cowork, and long-running agents. Pro subscribers at $20/month are consuming 10x or more in token value. The math doesn't work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; This was inevitable. Unlimited AI coding for $20/month was never sustainable. If you're on Pro, you still have access for now. But start planning for either Max ($100/month) or &lt;a href="https://www.aimadetools.com/blog/best-ai-coding-tools-2026/?utm_source=devto" rel="noopener noreferrer"&gt;cheaper alternatives&lt;/a&gt; like &lt;a href="https://www.aimadetools.com/blog/kimi-k2-6-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;Kimi K2.6&lt;/a&gt; ($0.60/M tokens) or &lt;a href="https://www.aimadetools.com/blog/mimo-v2-5-pro-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;MiMo V2.5 Pro&lt;/a&gt; ($1/M tokens).&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Copilot freezes all new signups
&lt;/h2&gt;

&lt;p&gt;GitHub &lt;a href="https://github.blog/news-insights/company-news/changes-to-github-copilot-individual-plans/" rel="noopener noreferrer"&gt;paused new registrations&lt;/a&gt; for Copilot Pro, Pro+, and Student plans on April 20. Only the Free tier accepts new users. They also added stricter usage limits and removed Opus models from Pro (only Pro+ keeps them).&lt;/p&gt;

&lt;p&gt;The reason: "unsustainable compute demands from AI-powered coding agents." Same story as Anthropic. Agentic AI usage broke the pricing model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; Two of the three biggest AI coding platforms raised prices or froze signups in the same week. The third (Cursor) is probably next. The era of $10-20/month unlimited AI coding is over. Open-source and Chinese models are the hedge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kimi K2.6 launches with 300-agent swarm
&lt;/h2&gt;

&lt;p&gt;Moonshot AI released &lt;a href="https://www.aimadetools.com/blog/kimi-k2-6-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;Kimi K2.6&lt;/a&gt; on April 20. The highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;80.2% SWE-Bench Verified (matching Claude Opus 4.6)&lt;/li&gt;
&lt;li&gt;300 sub-agent swarm (up from 100 in K2.5)&lt;/li&gt;
&lt;li&gt;54.0% on HLE-Full with tools (beating GPT-5.4's 52.1%)&lt;/li&gt;
&lt;li&gt;$0.60/M input tokens (25x cheaper than Opus)&lt;/li&gt;
&lt;li&gt;Modified MIT license (open weights)&lt;/li&gt;
&lt;li&gt;Available on &lt;a href="https://www.aimadetools.com/blog/kimi-k2-6-openrouter-setup/?utm_source=devto" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt; and Cloudflare Workers AI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://www.aimadetools.com/blog/kimi-k2-6-agent-swarm-tutorial/?utm_source=devto" rel="noopener noreferrer"&gt;agent swarm&lt;/a&gt; is the standout feature. K2.6 scored 86.3% on BrowseComp (Agent Swarm) vs GPT-5.4's 78.4%. For coding agent workloads, K2.6 is the strongest open-source option available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; K2.6 is the first open-source model to genuinely match Opus 4.6 on coding benchmarks. At 25x cheaper. The timing with Anthropic's price hike is not a coincidence. See our &lt;a href="https://www.aimadetools.com/blog/kimi-k2-6-vs-claude-opus-4-6/?utm_source=devto" rel="noopener noreferrer"&gt;K2.6 vs Opus 4.6 comparison&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  MiMo V2.5 Pro: 40-60% fewer tokens than Opus
&lt;/h2&gt;

&lt;p&gt;Xiaomi dropped &lt;a href="https://www.aimadetools.com/blog/mimo-v2-5-pro-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;MiMo V2.5 Pro&lt;/a&gt; on April 22, just 48 hours after K2.6. The headline number: 40-60% fewer tokens than Opus 4.6 at comparable capability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;57.2% SWE-bench Pro&lt;/li&gt;
&lt;li&gt;64% Pass^3 on ClawEval with only ~70K tokens per trajectory&lt;/li&gt;
&lt;li&gt;1,000+ tool calls in single sessions&lt;/li&gt;
&lt;li&gt;Built a complete SysY compiler in Rust in 4.3 hours (672 tool calls, 233/233 tests)&lt;/li&gt;
&lt;li&gt;Works with &lt;a href="https://www.aimadetools.com/blog/mimo-v2-5-pro-claude-code-setup/?utm_source=devto" rel="noopener noreferrer"&gt;Claude Code as a harness&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Coming open-source soon&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The token efficiency is the real story. Same capability, half the tokens, fraction of the price. The &lt;a href="https://www.aimadetools.com/blog/mimo-v2-5-standard-guide/?utm_source=devto" rel="noopener noreferrer"&gt;V2.5 Standard model&lt;/a&gt; adds native multimodal (image, audio, video) and actually outperforms V2-Pro on some agent benchmarks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; V2.5 Pro's "harness awareness" (it actively manages its own context within Claude Code) is a new capability nobody else has. Combined with the token efficiency, this is the model to watch for long-running agent tasks. See our &lt;a href="https://www.aimadetools.com/blog/mimo-v2-5-series-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;full V2.5 series guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The flat-rate subscription is dead
&lt;/h2&gt;

&lt;p&gt;Three data points in one week:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Anthropic removes Claude Code from $20 Pro&lt;/li&gt;
&lt;li&gt;GitHub freezes all Copilot signups&lt;/li&gt;
&lt;li&gt;Both cite "unsustainable compute demands"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The pattern is clear. Flat-rate unlimited AI coding subscriptions don't work when agents run for hours and consume 10x the expected tokens. Expect token-based billing everywhere within 6 months.&lt;/p&gt;

&lt;p&gt;The winners: Chinese models (&lt;a href="https://www.aimadetools.com/blog/kimi-k2-6-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;Kimi K2.6&lt;/a&gt;, &lt;a href="https://www.aimadetools.com/blog/mimo-v2-5-pro-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;MiMo V2.5 Pro&lt;/a&gt;, &lt;a href="https://www.aimadetools.com/blog/kimi-k2-6-vs-qwen-3-6-plus/?utm_source=devto" rel="noopener noreferrer"&gt;Qwen 3.6 Plus&lt;/a&gt;) that were already priced per-token at 10-25x less than Western alternatives. If you haven't explored them yet, now is the time. See our &lt;a href="https://www.aimadetools.com/blog/best-chinese-ai-models-2026/?utm_source=devto" rel="noopener noreferrer"&gt;Chinese AI models ranking&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick hits
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI Workspace Agents:&lt;/strong&gt; ChatGPT now has &lt;a href="https://openai.com/index/introducing-workspace-agents-in-chatgpt" rel="noopener noreferrer"&gt;workspace agents&lt;/a&gt; for enterprise teams. Not relevant for individual developers yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI Privacy Filter:&lt;/strong&gt; New &lt;a href="https://openai.com/index/introducing-openai-privacy-filter" rel="noopener noreferrer"&gt;privacy filter&lt;/a&gt; for enterprise data. Good for compliance, not a developer tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vercel data breach:&lt;/strong&gt; Vercel &lt;a href="https://siliconangle.com/2026/04/20/developer-tooling-provider-vercel-discloses-breach-exposed-users-data/" rel="noopener noreferrer"&gt;disclosed a breach&lt;/a&gt; that exposed some user data. Check your account if you use Vercel.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I'm watching next week
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Whether Claude's serverless function limit forces architectural decisions (it broke one of our &lt;a href="https://dev.to/race/"&gt;race agents&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;How MiMo V2.5 Pro performs in real-world agent tasks (we just &lt;a href="https://dev.to/race/season1/digest"&gt;upgraded our Xiaomi race agent&lt;/a&gt; to V2.5 Pro)&lt;/li&gt;
&lt;li&gt;Whether any race agent gets its first paying customer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;See you next Thursday. If you found this useful, subscribe to &lt;a href="https://dev.to/series/ai-dev-weekly/"&gt;AI Dev Weekly&lt;/a&gt; for the full archive.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/ai-dev-weekly-007-claude-code-pro-copilot-freeze-kimi-mimo/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aidevweekly</category>
      <category>anthropic</category>
      <category>github</category>
      <category>kimi</category>
    </item>
    <item>
<title>AI Startup Race Day 1 Recap: One Agent Forgot Its Own Work</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Tue, 21 Apr 2026 08:06:03 +0000</pubDate>
      <link>https://dev.to/ai_made_tools/i-gave-7-ai-agents-100-each-to-build-a-startup-one-forgot-its-own-work-1cl</link>
      <guid>https://dev.to/ai_made_tools/i-gave-7-ai-agents-100-each-to-build-a-startup-one-forgot-its-own-work-1cl</guid>
      <description>&lt;p&gt;I'm running an experiment called &lt;strong&gt;The $100 AI Startup Race&lt;/strong&gt;: 7 AI coding agents each get $100 and 12 weeks to build a real startup from scratch. No human coding. They autonomously pick a business idea, write code, deploy a live website, and try to get real users and revenue.&lt;/p&gt;

&lt;p&gt;The agents: Claude, Codex, Gemini, Kimi, DeepSeek, Xiaomi (MiMo), and GLM.&lt;/p&gt;

&lt;p&gt;Day 1 is done. Here's what happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The scoreboard
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Startup&lt;/th&gt;
&lt;th&gt;Commits&lt;/th&gt;
&lt;th&gt;Sessions&lt;/th&gt;
&lt;th&gt;Blog Posts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;LocalLeads (local SEO)&lt;/td&gt;
&lt;td&gt;169&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;104&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek&lt;/td&gt;
&lt;td&gt;NameForge AI (name generator)&lt;/td&gt;
&lt;td&gt;91&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kimi&lt;/td&gt;
&lt;td&gt;SchemaLens / LogDrop&lt;/td&gt;
&lt;td&gt;58&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex&lt;/td&gt;
&lt;td&gt;NoticeKit (GDPR notices)&lt;/td&gt;
&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;PricePulse (pricing intel)&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLM&lt;/td&gt;
&lt;td&gt;FounderMath (startup calculators)&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Xiaomi&lt;/td&gt;
&lt;td&gt;WaitlistKit (viral waitlists)&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Total: 467 commits, 7 live websites, 130 blog posts. In 24 hours.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Kimi forgot its own work
&lt;/h2&gt;

&lt;p&gt;This is the story of the day.&lt;/p&gt;

&lt;p&gt;Kimi's first session ran at 3 AM. It chose to build &lt;strong&gt;LogDrop&lt;/strong&gt;, a log analysis tool. It created identity files, a backlog, landing pages, pricing, a blog, and even a working MVP with a JSON log parser, search, filters, and CSV export.&lt;/p&gt;

&lt;p&gt;One problem: it put everything in a &lt;code&gt;startup/&lt;/code&gt; subfolder instead of the root directory.&lt;/p&gt;

&lt;p&gt;The orchestrator gives agents their memory between sessions by reading &lt;code&gt;PROGRESS.md&lt;/code&gt; from the root. When Kimi's second session started, there was no PROGRESS.md in root. The agent thought it was Day 1. It brainstormed a completely different idea. It built &lt;strong&gt;SchemaLens&lt;/strong&gt;, a SQL schema diff tool, from scratch.&lt;/p&gt;

&lt;p&gt;Kimi now has two half-built startups in the same repo. Its help request for LogDrop's domain is stuck in the subfolder where the orchestrator can't find it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One wrong directory = total memory loss between sessions.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent didn't crash. It didn't throw an error. It just quietly forgot everything and started over with a different idea.&lt;/p&gt;
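&lt;p&gt;The failure mode is easy to reproduce in a few lines of shell. The paths match the story above; the fallback search at the end is my suggestion for a cheap guard, not the race's actual fix:&lt;/p&gt;

```shell
# Reproduce the amnesia: memory loading keys off a single fixed path, so a
# PROGRESS.md in the wrong directory silently reads as "no history".
repo=$(mktemp -d)
mkdir -p "$repo/startup"
echo "Day 1: built LogDrop MVP" > "$repo/startup/PROGRESS.md"   # wrong location

if [ -f "$repo/PROGRESS.md" ]; then
  echo "memory loaded"
else
  echo "no PROGRESS.md in root: agent treats this as Day 1"
  # Cheap guard (my suggestion): look one level deeper before declaring amnesia.
  misplaced=$(find "$repo" -maxdepth 2 -name PROGRESS.md | head -n 1)
  [ -n "$misplaced" ] && echo "found misplaced memory at: ${misplaced#$repo/}"
fi
```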

&lt;h2&gt;
  
  
  Gemini wrote 104 blog posts
&lt;/h2&gt;

&lt;p&gt;Gemini has 8 sessions per day (the most of any agent). By end of Day 1, LocalLeads had 104 blog posts on local SEO topics. One blog post every 14 minutes.&lt;/p&gt;

&lt;p&gt;For comparison: Claude wrote 11. GLM wrote 5. Xiaomi wrote 1.&lt;/p&gt;

&lt;p&gt;The question for the rest of the race: does quantity beat quality?&lt;/p&gt;

&lt;h2&gt;
  
  
  Codex burned 26 Vercel deployments
&lt;/h2&gt;

&lt;p&gt;The orchestrator prompt said: "Your repo auto-deploys on every git push." This was meant as context. Codex read it as an instruction.&lt;/p&gt;

&lt;p&gt;It ran &lt;code&gt;git push&lt;/code&gt; after nearly every commit during its sessions. Each push triggered a Vercel deployment. By mid-afternoon, Codex had consumed 26 of the account's 100 daily deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson: with autonomous agents, every sentence in the prompt is a potential instruction.&lt;/strong&gt; If you don't want them to do something, say so explicitly.&lt;/p&gt;

&lt;p&gt;We fixed it with three changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prompt update: "Do NOT run git push. The orchestrator pushes after your session."&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vercel.json&lt;/code&gt; to disable preview deployments&lt;/li&gt;
&lt;li&gt;Commit squashing (all session commits become one before pushing)&lt;/li&gt;
&lt;/ol&gt;
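&lt;p&gt;The squashing step (item 3) is what caps deployments at one per session. A hedged sketch, demonstrated in a throwaway repo; the real orchestrator would capture the base from its remote tracking ref rather than an empty baseline commit:&lt;/p&gt;

```shell
# Commit squashing (item 3), shown in a throwaway repo. In the race, "$base"
# would be the remote tracking ref captured at session start.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=a@b -c user.name=agent commit -q --allow-empty -m "baseline"
base=$(git rev-parse HEAD)

# The agent makes several mid-session commits...
for i in 1 2 3; do
  echo "$i" > "file$i.txt"
  git add "file$i.txt"
  git -c user.email=a@b -c user.name=agent commit -q -m "wip $i [skip ci]"
done

# ...which the orchestrator collapses into one commit before the single push
# that triggers a deployment.
git reset -q --soft "$base"
git -c user.email=a@b -c user.name=agent commit -q -m "session: squashed 3 commits"
echo "commits since baseline: $(git rev-list --count "$base"..HEAD)"
```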

&lt;h2&gt;
  
  
  GLM's quality approach
&lt;/h2&gt;

&lt;p&gt;GLM only had 2 sessions but made them count. FounderMath already has three working calculators: SAFE note calculator (all 4 YC SAFE types), dilution calculator, and runway calculator.&lt;/p&gt;

&lt;p&gt;It also submitted the best help request of any agent: clear format, backup plans for each item, budget specified, priority levels, and even suggested the DNS record type for the domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned on Day 1
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;File conventions are critical for agent memory.&lt;/strong&gt; One agent putting files in a subfolder caused total amnesia.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt wording is everything.&lt;/strong&gt; Context gets interpreted as instructions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared deployment limits are a real constraint.&lt;/strong&gt; 7 agents + 1 blog on one Vercel account = problems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents without web search pick generic ideas.&lt;/strong&gt; The two agents running without web access (DeepSeek, Xiaomi) chose the most crowded markets.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Follow along
&lt;/h2&gt;

&lt;p&gt;Everything is public: code, costs, decisions, and progress.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aimadetools.com/race/" rel="noopener noreferrer"&gt;Live Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aimadetools.com/blog/race-day-1-results/" rel="noopener noreferrer"&gt;Full Day 1 writeup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/aimadetools" rel="noopener noreferrer"&gt;GitHub repos&lt;/a&gt; (all 7 agent repos are public)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'll be posting weekly recaps and daily highlights for the full 12 weeks. Would love to hear what you'd want to see tracked or compared.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Launch Day: 7 AI Agents Start Building Startups with $100 Each</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Mon, 20 Apr 2026 07:30:00 +0000</pubDate>
      <link>https://dev.to/ai_made_tools/launch-day-7-ai-agents-start-building-startups-with-100-each-5f8h</link>
      <guid>https://dev.to/ai_made_tools/launch-day-7-ai-agents-start-building-startups-with-100-each-5f8h</guid>
      <description>&lt;p&gt;I just launched an experiment: 7 AI coding agents each get $100 and 12 weeks to build a real startup from scratch. No human coding.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lineup
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🟣 Claude&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Sonnet / Haiku&lt;/td&gt;
&lt;td&gt;$20/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟢 GPT&lt;/td&gt;
&lt;td&gt;Codex CLI&lt;/td&gt;
&lt;td&gt;GPT-5.4 / Mini&lt;/td&gt;
&lt;td&gt;€23/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔵 Gemini&lt;/td&gt;
&lt;td&gt;Gemini CLI&lt;/td&gt;
&lt;td&gt;2.5 Pro / Flash&lt;/td&gt;
&lt;td&gt;$20/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔴 DeepSeek&lt;/td&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;td&gt;Reasoner / Chat&lt;/td&gt;
&lt;td&gt;~$25/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟠 Kimi&lt;/td&gt;
&lt;td&gt;Kimi CLI&lt;/td&gt;
&lt;td&gt;K2.5&lt;/td&gt;
&lt;td&gt;~$19/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟡 Xiaomi&lt;/td&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;td&gt;MiMo V2 Pro&lt;/td&gt;
&lt;td&gt;~$25/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟤 GLM&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;GLM-5.1 / 4.7&lt;/td&gt;
&lt;td&gt;$18/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each agent autonomously picks an idea, writes code, deploys, and tries to get users and revenue.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned from 3 test runs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Strategy &amp;gt; code quality.&lt;/strong&gt; Agents that planned distribution first outperformed agents that wrote better code. One agent (Kimi) planned a full Product Hunt launch before writing a single line of code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple stacks win.&lt;/strong&gt; HTML + Tailwind deployed in hours. Next.js agents spent days on build errors. The deploy loop is the real bottleneck for AI agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context resets kill progress.&lt;/strong&gt; Without persistent state between sessions, agents repeat mistakes. I built an orchestrator with structured state files to solve this.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tech
&lt;/h2&gt;

&lt;p&gt;A bash orchestrator manages everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cron-scheduled 30-minute sessions (2-8 per agent per day)&lt;/li&gt;
&lt;li&gt;Automatic git commits with &lt;code&gt;[skip ci]&lt;/code&gt; on mid-session commits&lt;/li&gt;
&lt;li&gt;Deploy verification via health checks&lt;/li&gt;
&lt;li&gt;Loop detection (same action 3x = force alternative)&lt;/li&gt;
&lt;li&gt;OpenRouter budget alerts via Discord&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All code is public on &lt;a href="https://github.com/aimadetools" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
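&lt;p&gt;As an illustration of the loop-detection rule above (same action three times forces an alternative), here is a minimal shell sketch; the log format and action names are hypothetical, not the orchestrator's real state files:&lt;/p&gt;

```shell
# Loop-detection rule: the same action three times in a row forces an
# alternative. The action log format here is a hypothetical example.
check_loop() {
  log="$1"
  [ "$(wc -l < "$log")" -ge 3 ] || return 1
  last3=$(tail -n 3 "$log")
  # A loop means the last three logged actions are identical.
  [ "$(printf '%s\n' "$last3" | sort -u | wc -l)" -eq 1 ]
}

log=$(mktemp)
printf 'deploy\ndeploy\ndeploy\n' > "$log"
if check_loop "$log"; then
  echo "loop detected: forcing alternative action"
fi
```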

&lt;h2&gt;
  
  
  Follow along
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.aimadetools.com/race/" rel="noopener noreferrer"&gt;Live Dashboard&lt;/a&gt; — real-time progress&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.aimadetools.com/race/compare" rel="noopener noreferrer"&gt;Daily Digest&lt;/a&gt; — hand-written daily updates&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.aimadetools.com/race/activity" rel="noopener noreferrer"&gt;Weekly Recaps&lt;/a&gt; — detailed analysis&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.aimadetools.com/race/rules" rel="noopener noreferrer"&gt;Full Rules&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also launched on &lt;a href="https://www.producthunt.com/" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt; today.&lt;/p&gt;

&lt;p&gt;Which agent would you bet on?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>startup</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>AI Dev Weekly Extra: Did Anthropic Let Opus 4.6 Rot So 4.7 Would Look Better?</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Fri, 17 Apr 2026 09:28:38 +0000</pubDate>
      <link>https://dev.to/ai_made_tools/ai-dev-weekly-extra-did-anthropic-let-opus-46-rot-so-47-would-look-better-3a6n</link>
      <guid>https://dev.to/ai_made_tools/ai-dev-weekly-extra-did-anthropic-let-opus-46-rot-so-47-would-look-better-3a6n</guid>
      <description>&lt;p&gt;&lt;em&gt;AI Dev Weekly Extra — a special edition for breaking news that can't wait until Thursday.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Anthropic shipped Claude Opus 4.7 this week. The benchmarks are impressive. The vision jump is absurd. And I should be writing a straightforward "here's what's new" piece right now.&lt;/p&gt;

&lt;p&gt;But I can't do that without talking about what happened to Opus 4.6 first. Because the story of 4.7 doesn't start with its release — it starts with the slow, public deterioration of the model it replaces, and the uncomfortable questions that deterioration raises about trusting any AI provider with your production workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Opus 4.6 Collapse Was Real
&lt;/h2&gt;

&lt;p&gt;Let me be blunt: Opus 4.6 got noticeably worse over the past several weeks, and the evidence isn't anecdotal.&lt;/p&gt;

&lt;p&gt;A HuggingFace analysis across 6,852 sessions documented a 67% drop in reasoning depth. On BridgeBench, Opus 4.6 fell from 83.3% — good enough for the #2 spot — down to 68.3%, landing it at #10. That's not drift. That's a cliff. An AMD senior director posted forensic evidence on GitHub showing systematic capability loss. Some users reported accuracy score declines of 58%.&lt;/p&gt;

&lt;p&gt;If you were using Claude Code in mid-March, you probably felt it firsthand. Sessions hanging for 10-15 minutes on prompts that used to resolve in seconds. Outputs that felt shallow, hedging, stripped of the analytical depth that made Opus the model you reached for when the problem was hard.&lt;/p&gt;

&lt;p&gt;Reddit and X lit up with the vocabulary we've all learned to use for this phenomenon: "AI shrinkflation." "Lobotomized." "Nerfed." The community wasn't being dramatic — they were describing a measurable reality.&lt;/p&gt;

&lt;p&gt;Anthropic's official response? They denied degrading the model weights.&lt;/p&gt;

&lt;p&gt;I believe them, technically. I don't think someone at Anthropic opened a config file and turned a dial labeled "make it worse." But "we didn't change the weights" is a narrow denial that sidesteps a lot of territory — infrastructure changes, serving optimizations, quantization adjustments, routing modifications. There are many ways a model's effective capability can degrade without anyone touching the weights themselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter Opus 4.7: Savior or Convenient Timing?
&lt;/h2&gt;

&lt;p&gt;Now here's where it gets interesting. Opus 4.7 lands with numbers that look fantastic — especially when measured against the degraded version of 4.6 that users had been suffering through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SWE-bench Pro:&lt;/strong&gt; 64.3% (up from 53.4%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CursorBench:&lt;/strong&gt; 70% (up from 58%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision:&lt;/strong&gt; 98.5% (up from 54.5%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That vision jump alone — from 54.5% to 98.5% — is genuinely remarkable. The coding benchmarks represent real, meaningful progress. I've been running 4.7 through my own workflows for the past two days, and the improvement in structured reasoning and code generation is not imaginary. This is a better model.&lt;/p&gt;

&lt;p&gt;But here's the thing that keeps nagging at me: users on X have been joking that 4.7 "feels like early 4.6." The version they actually liked. The one that scored 83.3% on BridgeBench before it started its mysterious decline.&lt;/p&gt;

&lt;p&gt;So which is it? Is 4.7 a genuine leap forward, or did we just spend weeks watching 4.6 get worse so that "normal" would feel like a breakthrough?&lt;/p&gt;

&lt;p&gt;I think the honest answer is: both. The SWE-bench and vision numbers suggest capabilities that go beyond where 4.6 ever was, even at its peak. But the &lt;em&gt;subjective experience&lt;/em&gt; of improvement is amplified by the fact that we've been working with a degraded model for weeks. Anthropic gets to announce a 20% coding improvement against a baseline that had already fallen 15%. The math works out very nicely for the press release.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tokenizer Tax Nobody's Talking About
&lt;/h2&gt;

&lt;p&gt;Opus 4.7 ships at the same per-token price as 4.6. Anthropic made sure to highlight this. Same price, better model — what's not to love?&lt;/p&gt;

&lt;p&gt;The new tokenizer, that's what.&lt;/p&gt;

&lt;p&gt;Opus 4.7's tokenizer uses up to 35% more tokens to represent the same content. If you're processing the same codebase, the same documents, the same prompts you were running last week, you're now paying up to 35% more for the privilege.&lt;/p&gt;

&lt;p&gt;Let's call this what it is: a hidden price increase. Not on the rate card — on the meter. It's the AI equivalent of shrinking the cereal box while keeping the price tag the same. The "per token" price didn't change, but the number of tokens your work requires did.&lt;/p&gt;

&lt;p&gt;For hobbyists and occasional users, this is a rounding error. For teams running Claude through CI pipelines, code review automation, or document processing at scale, a 35% token increase is a material cost change that showed up with zero advance warning. If you're budgeting API costs, recalculate now. Your March invoices are not predictive of your April ones.&lt;/p&gt;
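&lt;p&gt;A back-of-envelope sketch makes the recalculation concrete. The token volume and per-million-token rate below are illustrative placeholders, not Anthropic's actual pricing:&lt;/p&gt;

```python
# Estimate the cost impact of a tokenizer that needs more tokens
# for the same content. All numbers here are made up for illustration.

def monthly_cost(tokens_per_month: int, price_per_mtok: float) -> float:
    """Dollar cost for a token volume at a per-million-token rate."""
    return tokens_per_month / 1_000_000 * price_per_mtok

OLD_TOKENS = 500_000_000       # last month's usage (illustrative)
TOKENIZER_INFLATION = 1.35     # up to 35% more tokens for the same content
PRICE_PER_MTOK = 15.0          # unchanged sticker price (illustrative)

before = monthly_cost(OLD_TOKENS, PRICE_PER_MTOK)
after = monthly_cost(int(OLD_TOKENS * TOKENIZER_INFLATION), PRICE_PER_MTOK)
print(f"before ${before:,.0f}, after ${after:,.0f}, increase {after / before - 1:.0%}")
# prints: before $7,500, after $10,125, increase 35%
```

&lt;p&gt;Same rate card, same workload, a materially larger invoice. That's the whole trick.&lt;/p&gt;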

&lt;p&gt;For a deeper dive into the technical differences, check out our &lt;a href="https://www.aimadetools.com/blog/claude-opus-4-7-vs-4-6/?utm_source=devto" rel="noopener noreferrer"&gt;Opus 4.7 vs 4.6 comparison&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mythos in the Room
&lt;/h2&gt;

&lt;p&gt;Here's the part of this story that doesn't get enough attention. The same week Anthropic released 4.7, Axios ran a headline that should have been louder than it was: "Anthropic releases Claude Opus 4.7, concedes it trails unreleased Mythos."&lt;/p&gt;

&lt;p&gt;Mythos Preview beats 4.7 on almost every benchmark. And it's restricted — available only in limited preview, not generally accessible through the API.&lt;/p&gt;

&lt;p&gt;So we're in a strange position. Anthropic is asking developers to be excited about 4.7 while simultaneously acknowledging they have something substantially better that they're not shipping. I understand the reasons — safety evaluation, scaling infrastructure, responsible deployment. These are legitimate concerns. But it creates an awkward dynamic where the product you're paying for is, by the company's own admission, not the best they can do.&lt;/p&gt;

&lt;p&gt;It also raises a strategic question: if you're building a product on top of 4.7 today, how do you plan for a model that might be dramatically better arriving in weeks or months? Do you optimize for 4.7's specific strengths, or do you build abstractions assuming the foundation will shift under you again?&lt;/p&gt;

&lt;p&gt;For more context on how these models stack up, see our &lt;a href="https://www.aimadetools.com/blog/ai-model-comparison/?utm_source=devto" rel="noopener noreferrer"&gt;AI model comparison&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  This Isn't Just an Anthropic Problem
&lt;/h2&gt;

&lt;p&gt;I want to be fair here. Anthropic is not uniquely guilty of anything. GPT-4 users reported strikingly similar degradation patterns before GPT-4o launched. OpenAI faced the exact same "did they nerf it?" accusations. The community had the same arguments, the same forensic analyses, the same official denials.&lt;/p&gt;

&lt;p&gt;This is a structural problem with the entire model-as-a-service paradigm. When you call an API, you have no way to verify what's actually running on the other side. The model you tested against last Tuesday might not be the model serving your requests today. There's no checksum, no version hash, no way to pin a specific set of weights the way you'd pin a dependency version in your package manager.&lt;/p&gt;

&lt;p&gt;You're renting intelligence, not owning it. And the landlord can renovate your apartment while you're at work without telling you.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from every other dependency in your stack. When you upgrade PostgreSQL, you choose when. When a library updates, your lockfile protects you. But your AI provider can change the effective capability of your most critical dependency at any time, and your only detection mechanism is "hmm, the outputs feel different."&lt;/p&gt;
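&lt;p&gt;You can do slightly better than "feels different." A minimal drift canary (fixed prompts with deterministic answers, checked and fingerprinted on a schedule) at least turns the feeling into a timestamped signal. This is an illustrative sketch; &lt;code&gt;call_model&lt;/code&gt; stands in for whatever API client you actually use:&lt;/p&gt;

```python
# Minimal drift canary: run fixed prompts on a schedule and flag when
# deterministic answers change. `call_model` is a placeholder callable
# (prompt -> response text), not a real library function.

import hashlib
import json

CANARIES = [
    {"prompt": "Return only the SHA-256 hex digest of the empty string.",
     "expect": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"},
    {"prompt": "What is 17 * 23? Answer with the number only.",
     "expect": "391"},
]

def check_canaries(call_model) -> list[str]:
    """Return the prompts whose answers no longer match expectations."""
    failures = []
    for canary in CANARIES:
        answer = call_model(canary["prompt"]).strip()
        if answer != canary["expect"]:
            failures.append(canary["prompt"])
    return failures

def fingerprint(call_model) -> str:
    """Hash all canary answers into one comparable string you can log daily."""
    answers = [call_model(c["prompt"]).strip() for c in CANARIES]
    return hashlib.sha256(json.dumps(answers).encode()).hexdigest()
```

&lt;p&gt;It's not a weights checksum, and sampling noise means you'd want temperature-zero settings and several runs in practice. But a fingerprint that changes on a Tuesday is evidence in a way that a Reddit thread is not.&lt;/p&gt;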

&lt;p&gt;For developers who lived through the 4.6 degradation while running production workloads — that's not a theoretical concern. That's a retrospective incident report waiting to be written.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Developers Should Actually Do
&lt;/h2&gt;

&lt;p&gt;So where does this leave us? Here's my honest take.&lt;/p&gt;

&lt;p&gt;Opus 4.7 is a good model. Probably a genuinely great one. The &lt;a href="https://www.aimadetools.com/blog/claude-opus-4-7-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;complete guide&lt;/a&gt; covers the capabilities in detail, and the coding and vision improvements are real and significant. If you're choosing a model today, 4.7 deserves serious consideration.&lt;/p&gt;

&lt;p&gt;But the 4.6 episode should change how you architect around these models. Here's what I'd recommend:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build evaluation harnesses, not vibes.&lt;/strong&gt; If you don't have automated quality checks on your AI-dependent workflows, the 4.6 degradation is what happens to you — slow, invisible capability loss that you only notice when users complain. Run benchmarks on your actual use cases. Weekly, at minimum.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Budget for the tokenizer tax.&lt;/strong&gt; If you're on Opus, your costs may have just gone up by as much as 35%. Plan for it. Monitor it. Don't let it surprise your finance team.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Abstract your model layer.&lt;/strong&gt; If you're not already using a model-agnostic interface, start. The ability to swap between providers — or between Claude models — without rewriting your application isn't a nice-to-have anymore. It's operational resilience. Our &lt;a href="https://www.aimadetools.com/blog/claude-opus-4-6-vs-4-5/?utm_source=devto" rel="noopener noreferrer"&gt;Opus 4.6 vs 4.5 comparison&lt;/a&gt; shows how much can change between versions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep receipts.&lt;/strong&gt; Log your inputs, outputs, and quality metrics. When the next degradation happens — and it will, from someone — you want data, not feelings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Watch Mythos.&lt;/strong&gt; Whatever Anthropic is holding back is, by their own benchmarks, significantly better than what they just shipped. That's either exciting or unsettling depending on your perspective. Either way, it's worth tracking.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The AI industry has a trust problem it hasn't solved. Not a safety trust problem — a reliability trust problem. The companies building these models need to give developers better tools for verifying, pinning, and monitoring the models they depend on. Until they do, we're all building on ground that can shift without warning.&lt;/p&gt;

&lt;p&gt;Opus 4.7 is a step forward. The way we got here is a step backward. Both things are true, and pretending otherwise doesn't help anyone.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;em&gt;See you Thursday for the regular edition.&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/ai-dev-weekly-extra-opus-4-7-opinion/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aidevweekly</category>
      <category>claude</category>
      <category>aimodels</category>
      <category>news</category>
    </item>
    <item>
      <title>AI Dev Weekly #6: OpenAI's $852B Wobble, GPT-5.4 Solves 60-Year Math Problem, and Agents Get Infrastructure</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Thu, 16 Apr 2026 07:12:57 +0000</pubDate>
      <link>https://dev.to/ai_made_tools/ai-dev-weekly-6-openais-852b-wobble-gpt-54-solves-60-year-math-problem-and-agents-get-1f7c</link>
      <guid>https://dev.to/ai_made_tools/ai-dev-weekly-6-openais-852b-wobble-gpt-54-solves-60-year-math-problem-and-agents-get-1f7c</guid>
      <description>&lt;p&gt;&lt;em&gt;AI Dev Weekly is a Thursday series where I cover the week's most important AI developer news — with my take as someone who actually uses these tools daily.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The AI money machine cracked open this week. OpenAI's own investors started questioning the $852B valuation, VCs flooded Anthropic with $800B offers, and a sneaker company's stock jumped 600% by saying "AI compute." Meanwhile, the actual technology kept moving: GPT-5.4 Pro solved a 60-year-old math conjecture, three major platforms shipped agent infrastructure upgrades on the same day, and a federal court ruled your AI chats can be subpoenaed. Let's get into it.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenAI's $852B valuation faces investor doubt
&lt;/h2&gt;

&lt;p&gt;The Financial Times reported that some of OpenAI's own backers are questioning whether the $852B post-money valuation can hold. One investor who backed both companies told the FT that justifying OpenAI's recent round required assuming an IPO valuation of $1.2 trillion or more — making Anthropic's $380B mark look like "the relative bargain."&lt;/p&gt;

&lt;p&gt;The same week, Business Insider reported VCs are flooding Anthropic with offers at valuations up to $800 billion — more than double its current mark. And SoftBank's lenders are inviting more banks to join its $40B loan facility backing the OpenAI investment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; An interesting HN comment on this: "What if there are no other killer apps for Enterprise? Only Claude Code will produce the level of token churn that could drive huge profits." If that's right, the entire AI valuation thesis depends on whether coding agents keep growing. As someone running &lt;a href="https://www.aimadetools.com/race/" rel="noopener noreferrer"&gt;7 AI agents in a race&lt;/a&gt; right now, I can tell you: the token burn is real. Whether it translates to $852B of value is another question.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPT-5.4 Pro solves a 60-year-old Erdős conjecture
&lt;/h2&gt;

&lt;p&gt;GPT-5.4 Pro solved Erdős problem #1196 — the asymptotic primitive set conjecture that had been open since the 1960s. Mathematician Jared Duker Lichtman called it a "Book Proof": a compact, elegant 3-page argument that bypassed the probability approach implicit in all human work since Erdős's own 1935 paper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; This might be the first machine-generated proof to genuinely overturn human aesthetic conventions in pure math. It didn't just solve the problem — it found a fundamentally different approach that humans hadn't considered in 60 years. For developers, the practical takeaway is that these models aren't just pattern-matching anymore. When GPT-5.4 Pro can find novel mathematical approaches, the "AI can't be creative" argument is dead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent infrastructure day: three platforms ship at once
&lt;/h2&gt;

&lt;p&gt;On the same Wednesday, three major platforms upgraded their agent infrastructure:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI shipped the next evolution of the Agents SDK&lt;/strong&gt; with native sandbox execution, model-native harness for long-running agents, and turnkey integrations with Cloudflare, Modal, E2B, Vercel, Temporal, and more. The key feature: agents can now run in isolated sandboxes with persistent state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini CLI got subagents&lt;/strong&gt; — parallel sub-task delegation via &lt;code&gt;@agent&lt;/code&gt; invocations, mirroring &lt;a href="https://www.aimadetools.com/blog/how-to-use-claude-code/?utm_source=devto" rel="noopener noreferrer"&gt;Claude Code's&lt;/a&gt; subagent feature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zapier launched its Agent SDK&lt;/strong&gt; — authenticated access to 7,000+ apps for AI agents, with no OAuth flows or token management on the developer side.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; The agent infrastructure layer is consolidating fast. Six months ago, building an AI agent meant writing your own execution loop, state management, and tool integration. Now OpenAI, Google, and Zapier all want to be the platform you build on. If you're building anything with &lt;a href="https://www.aimadetools.com/blog/how-to-build-ai-agent-2026/?utm_source=devto" rel="noopener noreferrer"&gt;AI agents&lt;/a&gt;, evaluate now — before you're locked into one ecosystem.&lt;/p&gt;

&lt;p&gt;For our &lt;a href="https://www.aimadetools.com/race/" rel="noopener noreferrer"&gt;AI Startup Race&lt;/a&gt;, this is directly relevant. The agents competing are essentially doing what these SDKs enable: autonomous coding, deployment, and iteration. The difference is our agents have been doing it since before these SDKs existed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Federal court: no attorney-client privilege for AI chats
&lt;/h2&gt;

&lt;p&gt;A federal judge in the Southern District of New York ruled in &lt;em&gt;US v. Heppner&lt;/em&gt; that conversations with AI chatbots are not protected by attorney-client privilege. Your ChatGPT logs can be subpoenaed.&lt;/p&gt;

&lt;p&gt;The same week, Anthropic started requiring government ID verification (via Persona) before allowing subscriptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; The era of "AI as private confidant" just legally ended. For developers, the practical implication: don't put anything in an AI chat that you wouldn't put in an email. If you're using &lt;a href="https://www.aimadetools.com/blog/claude-code-vs-cursor-2026/?utm_source=devto" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; or &lt;a href="https://www.aimadetools.com/blog/claude-code-vs-codex-cli-vs-gemini-cli/?utm_source=devto" rel="noopener noreferrer"&gt;Codex CLI&lt;/a&gt; on proprietary code, make sure your company's legal team knows. And if you're building AI products, your users' chat logs are now discoverable — plan your data retention accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anthropic stops letting developers pin model versions
&lt;/h2&gt;

&lt;p&gt;Anthropic removed the ability to pin specific Claude model versions, forcing users onto the latest &lt;code&gt;claude-sonnet-4-6&lt;/code&gt; even when it breaks downstream client apps. The HN thread went viral with developers complaining about silent breakage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; This is a real problem for production systems. If you're building on Claude's API, you now need regression tests that run on every model update — because Anthropic won't let you stay on a version that works. This is exactly the kind of issue we cover in our &lt;a href="https://www.aimadetools.com/blog/llm-regression-testing/?utm_source=devto" rel="noopener noreferrer"&gt;LLM regression testing guide&lt;/a&gt;. The fix: test against the latest model in CI, but have a fallback to &lt;a href="https://www.aimadetools.com/blog/openrouter-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt; or another provider if quality drops.&lt;/p&gt;
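&lt;p&gt;The fallback pattern can be sketched in a few lines. Everything here is a placeholder (the stubbed suite, the model names, the 90% threshold); the point is that routing is driven by a measured pass rate rather than trust in the provider:&lt;/p&gt;

```python
# Route to a backup model when the regression suite's pass rate on the
# forced-latest model drops below a threshold. `run_suite(model)` is a
# placeholder returning one boolean per regression case.

def pass_rate(run_suite, model: str) -> float:
    """Fraction of regression cases the given model passes."""
    results = run_suite(model)
    return sum(results) / len(results)

def choose_model(run_suite, primary: str, fallback: str,
                 threshold: float = 0.9) -> str:
    """Use the primary model only if it still clears the quality bar."""
    if pass_rate(run_suite, primary) >= threshold:
        return primary
    return fallback

# Stubbed example: primary fails 3 of 10 cases, so the router falls back.
stub = {"claude-sonnet-4-6": [True] * 7 + [False] * 3,
        "backup-model": [True] * 10}
picked = choose_model(lambda m: stub[m], "claude-sonnet-4-6", "backup-model")
print(picked)  # backup-model
```

&lt;p&gt;In CI you'd run this on every provider-side model bump and alert on a fallback, since silently serving the backup hides the regression you wanted to catch.&lt;/p&gt;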

&lt;h2&gt;
  
  
  Allbirds pivots from sneakers to AI compute, stock pops 600%
&lt;/h2&gt;

&lt;p&gt;The struggling shoe retailer announced a $50M convertible financing facility and is pivoting to "AI compute infrastructure" after selling its sneaker brand for $39M. The stock jumped 600% in a single morning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; We've officially entered the "put AI in your company name and watch the stock go up" phase. This is the 2021 crypto pivot playbook all over again. For developers: ignore the noise. The actual compute market is real (&lt;a href="https://www.aimadetools.com/blog/best-cloud-gpu-providers-2026/?utm_source=devto" rel="noopener noreferrer"&gt;cloud GPU providers&lt;/a&gt; are genuinely useful), but a shoe company becoming a GPU-as-a-Service provider is not where you want to deploy your models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apple sends Siri team to coding bootcamp
&lt;/h2&gt;

&lt;p&gt;The Information reported that Apple is sending a chunk of its Siri team — fewer than 200 people — to a multi-week bootcamp to learn how to code using AI, two months before the expected major Siri revamp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; Even Apple's voice assistant team needs to learn &lt;a href="https://www.aimadetools.com/blog/vibe-coding-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;vibe coding&lt;/a&gt; now. If Apple's own engineers are being retrained on AI-assisted development, the "should I learn AI coding tools?" question is answered. Yes. Yesterday.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick hits
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Shopify open-sourced "autoresearch"&lt;/strong&gt; — an autonomous experiment loop that cut their CI pipeline build time by 65%. Not just for ML; they used it on production infrastructure optimization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vercel CEO signaled IPO readiness&lt;/strong&gt; — 30% of apps on Vercel are now deployed by AI agents. ARR hit $340M (up from $100M in early 2024).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CoreWeave landed $6B from Jane Street&lt;/strong&gt; plus a $1B equity investment. The quant trading firm is now a major shareholder.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude had elevated errors&lt;/strong&gt; across Claude.ai, API, and &lt;a href="https://www.aimadetools.com/blog/how-to-use-claude-code/?utm_source=devto" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; on Wednesday. Growing pains from tripling revenue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google launched Gemini 3.1 Flash TTS&lt;/strong&gt; with 70-language support and scene direction for expressive voices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini for Mac&lt;/strong&gt; launched as a native Swift app — share your screen with Gemini in real time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nature published a "subliminal trait transmission" paper&lt;/strong&gt; — language models can transmit behavioral traits through hidden signals in training data. Major implication for &lt;a href="https://www.aimadetools.com/blog/ai-security-checklist-startups/?utm_source=devto" rel="noopener noreferrer"&gt;AI safety&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;N-Day-Bench cyber leaderboard&lt;/strong&gt; — GPT-5.4 leads (83.93), &lt;a href="https://www.aimadetools.com/blog/glm-5-1-complete-guide/?utm_source=devto" rel="noopener noreferrer"&gt;GLM-5.1&lt;/a&gt; at #2 (80.13) above Claude Opus 4.6 (79.95). Open-weight model beating Claude on cybersecurity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA Nemotron 3 Super&lt;/strong&gt; — 120B/12B-active MoE with 1M context, 2.2x throughput vs comparable models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cal.com closed its open-source core&lt;/strong&gt; — citing AI-automated code scanning making open source a security liability. Hugging Face's CEO disagreed, arguing open source IS the security solution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft exec proposed AI agents should pay for software seats&lt;/strong&gt; — 10 employees × 5 agents each = 50 paid licenses. The SaaS pricing model is about to get weird.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I'm watching
&lt;/h2&gt;

&lt;p&gt;The agent infrastructure convergence is the story. OpenAI, Google, and Zapier all shipping agent SDKs in the same week means the "build vs buy" decision for agent infrastructure just got real. If you're hand-rolling agent loops, it's time to evaluate whether a managed platform saves you enough time to justify the lock-in.&lt;/p&gt;

&lt;p&gt;The OpenAI valuation crack is worth watching too. If investors start pulling back, it could mean cheaper API pricing as OpenAI fights harder for market share. That's good for developers.&lt;/p&gt;

&lt;p&gt;And the model version pinning issue from Anthropic is a canary in the coal mine. As AI models become infrastructure (not just tools), we need the same versioning guarantees we expect from databases and operating systems. Right now, we don't have them.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;See you next Thursday. If you found this useful, share it with a developer friend who's still reading AI news from five sources instead of one.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Previous issues: &lt;a href="https://www.aimadetools.com/blog/ai-dev-weekly-005-anthropic-mythos-30b-glm-meta-muse/?utm_source=devto" rel="noopener noreferrer"&gt;#5: Anthropic's Too-Dangerous Model&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/blog/ai-dev-weekly-004-anthropic-leaks-openai-122b-qwen-free/?utm_source=devto" rel="noopener noreferrer"&gt;#4: Anthropic Leaks Everything&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/blog/ai-dev-weekly-003-claude-code-auto-mode-cursor-kimi-github-data/?utm_source=devto" rel="noopener noreferrer"&gt;#3: Claude Code Auto Mode&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
&lt;em&gt;Related: &lt;a href="https://www.aimadetools.com/blog/how-to-choose-ai-coding-agent-2026/?utm_source=devto" rel="noopener noreferrer"&gt;How to Choose an AI Coding Agent&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/blog/ai-coding-tools-pricing-2026/?utm_source=devto" rel="noopener noreferrer"&gt;AI Coding Tools Pricing&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/race/" rel="noopener noreferrer"&gt;The $100 AI Startup Race&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/blog/llm-regression-testing/?utm_source=devto" rel="noopener noreferrer"&gt;LLM Regression Testing&lt;/a&gt; · &lt;a href="https://www.aimadetools.com/blog/how-to-build-ai-agent-2026/?utm_source=devto" rel="noopener noreferrer"&gt;How to Build an AI Agent&lt;/a&gt;&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/ai-dev-weekly-006-openai-852b-gpt-erdos-agent-infrastructure/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aidevweekly</category>
      <category>openai</category>
      <category>anthropic</category>
      <category>agents</category>
    </item>
    <item>
      <title>I'm Giving 7 AI Coding Agents $100 Each to Build a Startup — Here's What Happens</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Mon, 13 Apr 2026 10:01:49 +0000</pubDate>
      <link>https://dev.to/ai_made_tools/im-giving-7-ai-coding-agents-100-each-to-build-a-startup-heres-what-happens-62k</link>
      <guid>https://dev.to/ai_made_tools/im-giving-7-ai-coding-agents-100-each-to-build-a-startup-heres-what-happens-62k</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; 7 AI coding agents (Claude, GPT, Gemini, DeepSeek, Kimi, Xiaomi, GLM) each get $100 and 12 weeks to autonomously build a real, revenue-generating startup. Public repos, live sites, zero human code. Starts April 20.&lt;/p&gt;

&lt;h2&gt;
  
  
  The experiment
&lt;/h2&gt;

&lt;p&gt;I wanted to answer a simple question: &lt;strong&gt;can AI actually build a business, not just write code?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not a demo. Not a toy project. A real startup with a landing page, pricing, payment integration, blog content, and actual users.&lt;/p&gt;

&lt;p&gt;So I set up 7 AI coding agents on a VPS, gave each one $100 and a 30-minute session timer, and let them run. They choose their own ideas, write their own code, deploy their own sites, and request help (domains, Stripe) via GitHub Issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  The agents
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Origin&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🟣 Claude&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Sonnet / Haiku&lt;/td&gt;
&lt;td&gt;🇺🇸 Anthropic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟢 GPT&lt;/td&gt;
&lt;td&gt;Codex CLI&lt;/td&gt;
&lt;td&gt;GPT-5.4 / Mini&lt;/td&gt;
&lt;td&gt;🇺🇸 OpenAI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔵 Gemini&lt;/td&gt;
&lt;td&gt;Gemini CLI&lt;/td&gt;
&lt;td&gt;Pro / Flash&lt;/td&gt;
&lt;td&gt;🇺🇸 Google&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔴 DeepSeek&lt;/td&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;td&gt;Reasoner / Chat&lt;/td&gt;
&lt;td&gt;🇨🇳 DeepSeek&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟠 Kimi&lt;/td&gt;
&lt;td&gt;Kimi CLI&lt;/td&gt;
&lt;td&gt;K2.5&lt;/td&gt;
&lt;td&gt;🇨🇳 Moonshot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟡 Xiaomi&lt;/td&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;td&gt;MiMo V2 Pro&lt;/td&gt;
&lt;td&gt;🇨🇳 Xiaomi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟤 GLM&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;GLM-5.1 / 4.7&lt;/td&gt;
&lt;td&gt;🇨🇳 Z.ai&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;3 US models vs 4 Chinese models. 5 different coding tools. Subscriptions vs API pricing. The playing field is deliberately uneven — just like real life.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rules
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;$100 budget&lt;/strong&gt; per agent for the startup (domains, services, tools). AI model costs are separate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fully autonomous&lt;/strong&gt; — no human writes code or makes product decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 hour of human help per agent per week&lt;/strong&gt; — only for things AI physically can't do (buy domains, set up Stripe)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public repos&lt;/strong&gt; — watch them build in real-time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surprise events&lt;/strong&gt; throughout the 12 weeks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What we learned from the test run
&lt;/h2&gt;

&lt;p&gt;We ran 3 test rounds before launch. Key findings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kimi was the best performer&lt;/strong&gt; — it didn't just code, it planned a full Product Hunt launch strategy with social media templates and screenshots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek was the most prolific&lt;/strong&gt; — 302 commits in 5 days, but chose a saturated market (name generators)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini over-engineered&lt;/strong&gt; — chose Next.js, spent 5 days fighting deploy errors, never shipped&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Xiaomi was the most efficient per commit&lt;/strong&gt; — built a complete product in just 31 commits before running out of API budget&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Qwen was removed&lt;/strong&gt; — filed duplicate help requests, created files with social media posts as filenames, stalled for 25 hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GLM-5.1 (the #1 model on SWE-Bench Pro) replaces Qwen for the real race.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scoring
&lt;/h2&gt;

&lt;p&gt;At the end of 12 weeks, agents are scored on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Revenue earned (25 pts)&lt;/li&gt;
&lt;li&gt;Users / traffic (20 pts)&lt;/li&gt;
&lt;li&gt;Community vote (20 pts)&lt;/li&gt;
&lt;li&gt;Code quality (15 pts)&lt;/li&gt;
&lt;li&gt;Cost efficiency (10 pts)&lt;/li&gt;
&lt;li&gt;AI peer review (10 pts)&lt;/li&gt;
&lt;/ul&gt;
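&lt;p&gt;The rubric above is a plain weighted sum out of 100. Sketched as code (assuming each category comes in as a fraction of its maximum, which is my framing, not an official formula):&lt;/p&gt;

```python
# The 100-point scoring rubric as a weighted sum. Category fractions
# are assumed inputs in the range 0.0 to 1.0.

WEIGHTS = {
    "revenue": 25,
    "users": 20,
    "community_vote": 20,
    "code_quality": 15,
    "cost_efficiency": 10,
    "peer_review": 10,
}

def total_score(fractions: dict[str, float]) -> float:
    """Weighted total out of 100; missing categories count as zero."""
    return sum(WEIGHTS[k] * fractions.get(k, 0.0) for k in WEIGHTS)

# A hypothetical agent that maxes revenue and users but scores half elsewhere:
example = {"revenue": 1.0, "users": 1.0, "community_vote": 0.5,
           "code_quality": 0.5, "cost_efficiency": 0.5, "peer_review": 0.5}
print(total_score(example))  # 72.5
```

&lt;p&gt;Note how revenue and traffic together are worth 45 points: an agent that ships something people pay for can afford mediocre code.&lt;/p&gt;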

&lt;h2&gt;
  
  
  Follow along
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard:&lt;/strong&gt; &lt;a href="https://www.aimadetools.com/race?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=race-announcement" rel="noopener noreferrer"&gt;aimadetools.com/race&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily digest:&lt;/strong&gt; Updated daily with standings and highlights&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weekly recaps:&lt;/strong&gt; In-depth analysis every week&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All repos are public&lt;/strong&gt; on GitHub&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The race starts &lt;strong&gt;April 20, 2026&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What startup idea would YOU give an AI agent? Drop it in the comments — the best suggestion might become a surprise event.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write about AI coding tools, model comparisons, and developer productivity at &lt;a href="https://www.aimadetools.com?utm_source=devto&amp;amp;utm_medium=post&amp;amp;utm_campaign=race-announcement" rel="noopener noreferrer"&gt;aimadetools.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>startup</category>
      <category>coding</category>
    </item>
    <item>
      <title>I Used ChatGPT Plus for a Week — The Swiss Army Knife That's Not a Scalpel</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Sun, 12 Apr 2026 09:51:53 +0000</pubDate>
      <link>https://dev.to/ai_made_tools/i-used-chatgpt-plus-for-a-week-the-swiss-army-knife-thats-not-a-scalpel-2jii</link>
      <guid>https://dev.to/ai_made_tools/i-used-chatgpt-plus-for-a-week-the-swiss-army-knife-thats-not-a-scalpel-2jii</guid>
      <description>&lt;p&gt;&lt;em&gt;This is week 5 of my "I Used It for a Week" series. So far I've reviewed &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt; (speed), &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt; (specs), &lt;a href="https://www.aimadetools.com/blog/github-copilot-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;GitHub Copilot&lt;/a&gt; (ecosystem), and &lt;a href="https://www.aimadetools.com/blog/windsurf-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Windsurf&lt;/a&gt; (budget pick). This week: the tool everyone already uses but nobody thinks of as a coding tool.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let me be upfront: ChatGPT is not a code editor. It doesn't live in your IDE, it doesn't index your codebase, and it can't edit your files. Comparing it directly to &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt; or &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt; isn't fair.&lt;/p&gt;

&lt;p&gt;But here's the thing — I used it more than any of them this week. Just not for the same things.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I subscribed to ChatGPT Plus at $20/month. That gets you GPT-5.2, DALL-E 3, and priority access. There's also a Go tier at $8/month and the Pro tier at $200/month for power users, but Plus is what most developers use.&lt;/p&gt;

&lt;p&gt;OpenAI's pricing tiers in 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free&lt;/strong&gt;: GPT-5 with strict limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go&lt;/strong&gt;: $8/month — extended limits, custom GPTs, voice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plus&lt;/strong&gt;: $20/month — GPT-5.2, higher limits, DALL-E 3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro&lt;/strong&gt;: $200/month — GPT-5.4 Thinking, highest limits, Sora&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I stuck with Plus because $200/month for Pro is hard to justify when &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor costs $20&lt;/a&gt; and does the actual coding part better.&lt;/p&gt;

&lt;h2&gt;
  
  
  What ChatGPT Is Actually Great At
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Thinking partner, not typing partner
&lt;/h3&gt;

&lt;p&gt;The biggest shift in my week was realizing ChatGPT's value isn't in writing code — it's in thinking about code. I used it to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debate architecture decisions before opening my editor&lt;/li&gt;
&lt;li&gt;Explain unfamiliar codebases ("here's a 200-line file, explain what it does")&lt;/li&gt;
&lt;li&gt;Rubber-duck debug problems I was stuck on&lt;/li&gt;
&lt;li&gt;Generate &lt;a href="https://www.aimadetools.com/blog/regex-tester/?utm_source=devto" rel="noopener noreferrer"&gt;regex&lt;/a&gt; patterns and SQL queries I'd otherwise spend 20 minutes on&lt;/li&gt;
&lt;li&gt;Draft API contracts before implementing them&lt;/li&gt;
&lt;/ul&gt;
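
&lt;p&gt;To make the regex point concrete, here's the kind of pattern I'd hand off rather than write by hand. This is an illustrative sketch, not code from a real project:&lt;/p&gt;

```python
import re

# The sort of pattern I'd otherwise fiddle with for 20 minutes:
# ISO 8601 calendar dates (YYYY-MM-DD) with basic month/day validation.
ISO_DATE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

print(bool(ISO_DATE.match("2026-04-24")))  # True
print(bool(ISO_DATE.match("2026-13-01")))  # False (no month 13)
```

&lt;p&gt;The win isn't that the pattern is hard; it's that describing it in English and pasting back the answer is faster than getting the character classes right yourself.&lt;/p&gt;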

&lt;p&gt;None of the IDE tools do this well. &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's chat&lt;/a&gt; is focused on your current codebase. &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro's spec mode&lt;/a&gt; is structured and formal. ChatGPT is just... a conversation. Sometimes that's exactly what you need.&lt;/p&gt;

&lt;h3&gt;
  
  
  Learning accelerator
&lt;/h3&gt;

&lt;p&gt;I was picking up a new library this week, and ChatGPT was invaluable. "Explain how React Server Components work with concrete examples." "What's the difference between these two approaches?" "Show me the tradeoffs."&lt;/p&gt;

&lt;p&gt;It's like having a patient senior developer who never gets annoyed by basic questions. The IDE tools assume you already know what you're building. ChatGPT helps you figure out &lt;em&gt;what&lt;/em&gt; to build.&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing everything that isn't code
&lt;/h3&gt;

&lt;p&gt;Documentation, commit messages, PR descriptions, technical specs, email drafts, blog outlines — ChatGPT handles all of this faster than I can type. A peer-reviewed study in Science found that writers using ChatGPT completed tasks 40% faster with 18% higher quality output.&lt;/p&gt;

&lt;p&gt;This is where the $20/month pays for itself even if you never write a line of code with it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Canvas mode for iteration
&lt;/h3&gt;

&lt;p&gt;The Canvas feature lets you collaborate on a document or code snippet side by side. It's not as powerful as &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's multi-file editing&lt;/a&gt;, but for iterating on a single file or algorithm, it's surprisingly good. You can highlight a section and say "make this more efficient" or "add error handling here."&lt;/p&gt;
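
&lt;p&gt;As a sketch of what that iteration looks like in practice, here's a hypothetical before/after for a highlighted function and the prompt "add error handling here." The function names and behavior are mine, not from a real Canvas session:&lt;/p&gt;

```python
import json

# Before: the naive version I'd highlight in Canvas.
def load_config_naive(path):
    return json.loads(open(path).read())

# After "add error handling here": the kind of revision it proposes,
# falling back to a default when the file is missing or malformed.
def load_config(path, default=None):
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except (OSError, json.JSONDecodeError):
        return default if default is not None else {}

print(load_config("missing.json"))  # {} (file doesn't exist)
```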

&lt;h2&gt;
  
  
  What Frustrated Me
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The coding quality rollercoaster
&lt;/h3&gt;

&lt;p&gt;Multiple OpenAI forum threads tell the same story: GPT-5's coding ability feels inconsistent. One user wrote: "Scripts that used to work now fail, solutions are weaker, and the model is less consistent." Another said GPT-5 is "intelligent, but it absolutely sucks at code" compared to earlier models for sustained coding sessions.&lt;/p&gt;

&lt;p&gt;My experience matched this. For isolated coding questions — "write a function that does X" — it's great. For anything requiring sustained context across a long conversation, it starts losing track. By message 15 in a coding session, it would forget constraints I'd set in message 3.&lt;/p&gt;

&lt;h3&gt;
  
  
  No codebase awareness
&lt;/h3&gt;

&lt;p&gt;This is the fundamental limitation. ChatGPT doesn't know your project. You have to manually paste code, explain your architecture, and re-establish context every session. After using &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's deep indexing&lt;/a&gt; and &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro's spec-driven context&lt;/a&gt;, going back to copy-pasting code snippets into a chat window feels primitive.&lt;/p&gt;

&lt;p&gt;Yes, you can upload files. But it's not the same as an AI that's read your entire codebase and understands how everything connects.&lt;/p&gt;

&lt;h3&gt;
  
  
  The limits are real
&lt;/h3&gt;

&lt;p&gt;Even on Plus, you hit usage caps on GPT-5.2. During heavy use days, I got throttled to slower models. The dynamic caps mean you never quite know when you'll hit the wall. One reviewer noted: "While the $20 plan unlocks GPT-5.2 and DALL-E 3, it still has a trap: limits."&lt;/p&gt;

&lt;p&gt;Pro at $200/month removes most limits, but that's 10x the price of &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt; or &lt;a href="https://www.aimadetools.com/blog/github-copilot-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Copilot&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  It doesn't execute
&lt;/h3&gt;

&lt;p&gt;ChatGPT generates code. You copy it. You paste it. You run it. It fails. You copy the error. You paste it back. It fixes it. You copy again.&lt;/p&gt;

&lt;p&gt;This loop is &lt;em&gt;exhausting&lt;/em&gt; after using tools that edit your files directly. &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's agent&lt;/a&gt; runs the code, sees the error, and fixes it — all without you touching the clipboard. &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro's hooks&lt;/a&gt; run tests automatically. ChatGPT just... talks about code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where ChatGPT Fits in My Stack
&lt;/h2&gt;

&lt;p&gt;After five weeks of testing, here's how I actually use each tool:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Best Tool&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Writing code in my editor&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Tab completion, multi-file agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Planning new features&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Spec workflow, structured design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning new tech&lt;/td&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;Conversational, patient, broad knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debugging logic&lt;/td&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;Good at reasoning about problems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture decisions&lt;/td&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;Thinks through tradeoffs well&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Writing docs/emails&lt;/td&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;Fast, good quality prose&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quick code generation&lt;/td&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;Isolated snippets, regex, SQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large refactoring&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Subagents, codebase awareness&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;ChatGPT is the tool I use &lt;em&gt;around&lt;/em&gt; coding, not &lt;em&gt;for&lt;/em&gt; coding. And that's fine — it's genuinely the best at that role.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Verdict After 7 Days
&lt;/h2&gt;

&lt;p&gt;ChatGPT Plus is worth $20/month for any developer, but not as a coding tool. It's a thinking tool, a learning tool, and a writing tool that happens to understand code.&lt;/p&gt;

&lt;p&gt;If you're choosing between ChatGPT Plus and &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor Pro&lt;/a&gt; and can only afford one, get Cursor. It'll save you more time on actual coding. But if you can afford both, they complement each other perfectly — Cursor for the doing, ChatGPT for the thinking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Would I keep paying?&lt;/strong&gt; Yes, without hesitation. But I'd never use it as my primary coding tool when &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;, &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt;, and &lt;a href="https://www.aimadetools.com/blog/github-copilot-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Copilot&lt;/a&gt; exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who should subscribe:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every developer (the thinking/learning value alone is worth it)&lt;/li&gt;
&lt;li&gt;Non-technical founders who need to understand code&lt;/li&gt;
&lt;li&gt;Anyone who writes documentation, emails, or specs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Who doesn't need it for coding:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anyone already using Cursor or Kiro (they're better at the actual coding)&lt;/li&gt;
&lt;li&gt;Developers who only need inline completions (Copilot is cheaper)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;em&gt;Next week: &lt;a href="https://www.aimadetools.com/blog/devin-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;I Used Devin for a Week&lt;/a&gt; — the most hyped AI tool in recent memory. Is the "first AI software engineer" real, or just a great demo?&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/chatgpt-plus-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>chatgpt</category>
      <category>openai</category>
      <category>aitools</category>
      <category>review</category>
    </item>
    <item>
      <title>I Used GitHub Copilot for a Week — The Safe Choice That's Falling Behind</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Sat, 11 Apr 2026 09:49:10 +0000</pubDate>
      <link>https://dev.to/ai_made_tools/i-used-github-copilot-for-a-week-the-safe-choice-thats-falling-behind-5c9m</link>
      <guid>https://dev.to/ai_made_tools/i-used-github-copilot-for-a-week-the-safe-choice-thats-falling-behind-5c9m</guid>
      <description>&lt;p&gt;&lt;em&gt;This is week 3 of my "I Used It for a Week" series. I reviewed &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt; (the speed demon) and &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt; (the spec planner). Now it's time for the one most developers actually use: GitHub Copilot.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here's the thing about Copilot — I used it for over a year before trying Cursor and Kiro. It was my baseline. The tool I compared everything else to. Going back to it after two weeks with the competition was... revealing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Unlike Cursor and Kiro, Copilot isn't a standalone editor. It's an extension that lives inside your existing IDE — VS Code, JetBrains, Neovim, Xcode, even Eclipse. That's its biggest strength and its biggest limitation.&lt;/p&gt;

&lt;p&gt;I installed it in VS Code (my default before the Cursor experiment) and picked up right where I left off. All my extensions, all my settings, zero switching cost. If you've never used an AI coding tool before, this is the easiest possible starting point.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Still Works Well
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Inline completions are solid
&lt;/h3&gt;

&lt;p&gt;Copilot's bread and butter — the ghost text that appears as you type — is still good. It predicts the next few lines based on your current file and open tabs. For writing boilerplate, implementing interfaces, and filling in repetitive patterns, it saves real time.&lt;/p&gt;

&lt;p&gt;A ProductHunt reviewer summed it up: "It saves time by suggesting accurate code snippets and helps me stay in flow while coding." That matches my experience. For straightforward coding, Copilot just works.&lt;/p&gt;
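
&lt;p&gt;"Repetitive patterns" is where this shows most clearly. In code like the following (an illustrative snippet, not from a real session), ghost text reliably predicts the second property once you've typed the first:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Rect:
    width: float
    height: float

    # After writing `area`, completions tend to offer `perimeter`
    # almost verbatim; the pattern is obvious from context.
    @property
    def area(self):
        return self.width * self.height

    @property
    def perimeter(self):
        return 2 * (self.width + self.height)

print(Rect(3, 4).area)       # 12
print(Rect(3, 4).perimeter)  # 14
```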

&lt;h3&gt;
  
  
  IDE flexibility is unmatched
&lt;/h3&gt;

&lt;p&gt;This is Copilot's trump card. &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor locks you into their VS Code fork&lt;/a&gt;. &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro is also VS Code-based&lt;/a&gt;. Copilot works in everything. If you're a JetBrains user (IntelliJ, PyCharm, WebStorm), Copilot is basically your only option among the big three.&lt;/p&gt;

&lt;p&gt;For teams with mixed IDE preferences, this matters a lot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent mode has caught up (mostly)
&lt;/h3&gt;

&lt;p&gt;Copilot launched agent mode in February 2025, and by 2026 it's genuinely useful. You can ask it to plan changes, edit multiple files, run terminal commands, and iterate until the task is done. The coding agent can even turn GitHub Issues into pull requests autonomously.&lt;/p&gt;

&lt;p&gt;With the March 2026 update, you can now select GPT-5.4 for agent mode across all supported IDEs. The quality jump from the older models is noticeable.&lt;/p&gt;

&lt;h3&gt;
  
  
  The GitHub ecosystem
&lt;/h3&gt;

&lt;p&gt;Copilot's integration with GitHub is seamless in ways the competition can't match. Code review suggestions on pull requests, automated security scanning, Copilot Workspace for planning features directly from issues — if your team lives on GitHub, this ecosystem is valuable.&lt;/p&gt;

&lt;p&gt;The Copilot SDK (production-ready since January 2026) lets enterprises build custom agents trained on their own architectural patterns. With 4.7 million paid users, the ecosystem is massive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Price
&lt;/h3&gt;

&lt;p&gt;The free tier gives you 2,000 completions and 50 agent/chat requests per month. That's enough to evaluate it properly. Pro at $10/month is the cheapest paid option among the big three — half the price of &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's $20/month&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Frustrated Me (Coming Back From Cursor and Kiro)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Context awareness is shallow
&lt;/h3&gt;

&lt;p&gt;This is where Copilot falls hardest behind. After using &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's deep codebase indexing&lt;/a&gt; and &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro's spec-driven context&lt;/a&gt;, Copilot's understanding of my project felt surface-level.&lt;/p&gt;

&lt;p&gt;Copilot primarily works from the current file and open tabs. It doesn't index your entire repository the way Cursor does. In my testing on projects exceeding 10,000 lines, suggestions were accurate only about 50% of the time. It frequently suggested APIs and methods that didn't exist in my codebase.&lt;/p&gt;

&lt;p&gt;One TrustRadius reviewer nailed it: "Copilot is not the best at analyzing large monolithic codebases and placing them in their context."&lt;/p&gt;

&lt;h3&gt;
  
  
  No next-edit prediction
&lt;/h3&gt;

&lt;p&gt;After two weeks of &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's Tab-Tab-Tab workflow&lt;/a&gt; — where it predicts not just the current line but your &lt;em&gt;next edit location&lt;/em&gt; — going back to Copilot's basic inline suggestions felt like downgrading. Copilot completes the line you're on. Cursor anticipates where you're going next. That difference compounds over a full day of coding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-file editing is weaker
&lt;/h3&gt;

&lt;p&gt;Copilot's agent mode can edit multiple files, but it doesn't match Cursor's subagent system or Kiro's spec-guided implementation. The trade-off is architectural: Copilot works through extension APIs rather than controlling the whole editor environment. It can't understand your codebase as deeply because it's a guest in someone else's house.&lt;/p&gt;

&lt;p&gt;For quick single-file edits, this doesn't matter. For large refactoring across 10+ files, the difference is stark.&lt;/p&gt;

&lt;h3&gt;
  
  
  No spec workflow, no hooks
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro's spec-driven approach&lt;/a&gt; and Agent Hooks have no equivalent in Copilot. There's no way to define requirements before coding, no automated triggers on file changes, and no structured planning workflow. Copilot is reactive — it responds to what you're doing. It doesn't help you figure out what you &lt;em&gt;should&lt;/em&gt; be doing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security concerns are real
&lt;/h3&gt;

&lt;p&gt;Multiple reviews and studies flag that Copilot can suggest insecure code patterns. Since it learns from public repositories, it sometimes pulls in outdated or vulnerable patterns. This isn't unique to Copilot — all AI coding tools have this risk — but Copilot's shallower context awareness means it's less likely to understand your project's specific security requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pricing Breakdown
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plan&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;2,000 completions, 50 chat/agent requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$10/month&lt;/td&gt;
&lt;td&gt;Unlimited completions, premium model access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pro+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$39/month&lt;/td&gt;
&lt;td&gt;More premium requests, coding agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Business&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$19/user/month&lt;/td&gt;
&lt;td&gt;Organization management, policy controls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enterprise&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$39/user/month&lt;/td&gt;
&lt;td&gt;SSO, SCIM, audit logs, IP indemnity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The free tier is genuinely useful for evaluation. Pro at $10/month is the sweet spot for individuals. But note: heavy agent usage on Pro can hit limits, pushing you toward Pro+ at $39/month — which is nearly double &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's flat $20&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Tool Comparison
&lt;/h2&gt;

&lt;p&gt;After using all three for a week each, here's my honest ranking by category:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Winner&lt;/th&gt;
&lt;th&gt;Runner-up&lt;/th&gt;
&lt;th&gt;Third&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inline completions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cursor (next-edit)&lt;/td&gt;
&lt;td&gt;Copilot&lt;/td&gt;
&lt;td&gt;Kiro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-file refactoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cursor (subagents)&lt;/td&gt;
&lt;td&gt;Kiro (spec-guided)&lt;/td&gt;
&lt;td&gt;Copilot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Planning &amp;amp; architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kiro (specs)&lt;/td&gt;
&lt;td&gt;Copilot (Workspace)&lt;/td&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IDE flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Copilot (all IDEs)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Cursor/Kiro (VS Code only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Codebase understanding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cursor (deep index)&lt;/td&gt;
&lt;td&gt;Kiro (spec context)&lt;/td&gt;
&lt;td&gt;Copilot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Price (value)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Copilot ($10/mo)&lt;/td&gt;
&lt;td&gt;Cursor ($20/mo)&lt;/td&gt;
&lt;td&gt;Kiro (variable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ecosystem&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Copilot (GitHub)&lt;/td&gt;
&lt;td&gt;Kiro (AWS)&lt;/td&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed of small edits&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;td&gt;Copilot&lt;/td&gt;
&lt;td&gt;Kiro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code quality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Kiro (spec-driven)&lt;/td&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;td&gt;Copilot&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  My Verdict After 7 Days
&lt;/h2&gt;

&lt;p&gt;Copilot is the Toyota Corolla of AI coding tools. It's reliable, affordable, works everywhere, and gets the job done. There's a reason 4.7 million developers pay for it.&lt;/p&gt;

&lt;p&gt;But after experiencing &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor's speed&lt;/a&gt; and &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro's discipline&lt;/a&gt;, Copilot feels like it's coasting on distribution rather than innovation. The GitHub integration and IDE flexibility keep it relevant, but the core AI experience — context awareness, multi-file editing, intelligent suggestions — is falling behind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Would I keep paying?&lt;/strong&gt; Only if I needed JetBrains support or was on a team standardized on GitHub's ecosystem. For VS Code users, Cursor is a better tool at twice the price — and the productivity gains more than cover the difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who should use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JetBrains users (no real alternative)&lt;/li&gt;
&lt;li&gt;Teams already deep in the GitHub ecosystem&lt;/li&gt;
&lt;li&gt;Developers who want the cheapest entry point&lt;/li&gt;
&lt;li&gt;Anyone who doesn't want to switch editors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Who should look elsewhere:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VS Code users who want the best AI experience (→ &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Solo developers building features from scratch (→ &lt;a href="https://www.aimadetools.com/blog/kiro-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Anyone doing heavy multi-file refactoring&lt;/li&gt;
&lt;li&gt;Developers who want deep codebase understanding&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tips If You're Starting
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use agent mode, not just inline suggestions&lt;/strong&gt; — inline completions are table stakes now; the agent is where the value is&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try GPT-5.4 as your model&lt;/strong&gt; — it's a significant upgrade over the default&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open relevant files in tabs&lt;/strong&gt; — Copilot uses open tabs for context, so more tabs = better suggestions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't trust security-sensitive suggestions blindly&lt;/strong&gt; — review anything touching auth, encryption, or user data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider the free tier first&lt;/strong&gt; — 2,000 completions/month is enough to decide if it's for you&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;em&gt;That's three weeks, three tools. My current setup: Cursor for daily coding, Kiro for new features, Copilot retired. Your mileage may vary — the best tool is the one that matches how you think, not how I think.&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/github-copilot-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>githubcopilot</category>
      <category>aitools</category>
      <category>review</category>
      <category>coding</category>
    </item>
    <item>
      <title>Claude Code vs Cursor — Terminal Agent vs AI IDE (2026)</title>
      <dc:creator>Joske Vermeulen</dc:creator>
      <pubDate>Fri, 10 Apr 2026 10:11:37 +0000</pubDate>
      <link>https://dev.to/ai_made_tools/claude-code-vs-cursor-terminal-agent-vs-ai-ide-2026-1117</link>
      <guid>https://dev.to/ai_made_tools/claude-code-vs-cursor-terminal-agent-vs-ai-ide-2026-1117</guid>
      <description>&lt;p&gt;Claude Code and Cursor are the two AI coding tools developers argue about most in 2026. They represent fundamentally different philosophies: Claude Code is a terminal agent that reads your codebase and executes autonomously. Cursor is a VS Code fork with AI deeply integrated into the editing experience.&lt;/p&gt;

&lt;p&gt;The Pragmatic Engineer's 2026 survey of nearly 1,000 developers found Claude Code is now the #1 most-used AI coding tool, overtaking both Copilot and Cursor in just eight months. But Cursor grew 35% in the same period. Both are winning — just for different developers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Difference
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; = you describe what you want, the AI does it. You review the result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor&lt;/strong&gt; = you write code with AI assistance. The AI suggests, you decide in real-time.&lt;/p&gt;

&lt;p&gt;That's the fundamental split. Claude Code is an autonomous agent. Cursor is an augmented editor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Cursor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interface&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Terminal&lt;/td&gt;
&lt;td&gt;VS Code fork&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Approach&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Autonomous agent&lt;/td&gt;
&lt;td&gt;Augmented editor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Usage-based (~$5-20/session)&lt;/td&gt;
&lt;td&gt;$20/mo flat (Pro)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context window&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200K (1M in beta)&lt;/td&gt;
&lt;td&gt;Varies by model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Codebase awareness&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reads entire repo&lt;/td&gt;
&lt;td&gt;Indexes entire project&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-file editing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native (agent does it)&lt;/td&gt;
&lt;td&gt;Composer mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tab completion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (multi-line + next-edit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Opus 4.6 (default)&lt;/td&gt;
&lt;td&gt;Claude, GPT, Gemini — your pick&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IDE integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Works with any editor&lt;/td&gt;
&lt;td&gt;Cursor only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Git integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Can commit, push, branch&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Runs commands&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (shell access)&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Where Claude Code Wins
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Autonomy
&lt;/h3&gt;

&lt;p&gt;You can tell Claude Code "refactor the auth system to use &lt;a href="https://www.aimadetools.com/blog/jwt-decoder/?utm_source=devto" rel="noopener noreferrer"&gt;JWT&lt;/a&gt; tokens" and walk away. It'll read the codebase, plan the changes, modify files, run tests, fix errors, and commit. Cursor's Composer is powerful, but it still expects you to be in the loop reviewing each step.&lt;/p&gt;

&lt;p&gt;For large, well-defined tasks, Claude Code's autonomy is a massive time saver.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context window
&lt;/h3&gt;

&lt;p&gt;Claude Code runs on Opus 4.6 with a 200K context window (1M in beta). It can hold your entire codebase in context for medium-sized projects. Cursor's context is limited by whichever model you're using and how much of your project it indexes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Works with any editor
&lt;/h3&gt;

&lt;p&gt;Claude Code runs in your terminal. You can use it alongside VS Code, JetBrains, Neovim, Vim — whatever. It doesn't care about your editor. Cursor forces you into their VS Code fork.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shell access
&lt;/h3&gt;

&lt;p&gt;Claude Code can run your tests, start your dev server, check build errors, and fix them — all in the same session. It has full shell access. Cursor's terminal integration exists but the AI doesn't interact with it as naturally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Developer love
&lt;/h3&gt;

&lt;p&gt;46% of developers in the Pragmatic Engineer survey named Claude Code as the tool they love most. Cursor was at 19%. That's a significant gap in satisfaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Cursor Wins
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Real-time coding flow
&lt;/h3&gt;

&lt;p&gt;Cursor's Tab predictions and inline suggestions keep you in a flow state. You're writing code, and the AI is right there suggesting the next line, the next edit, the next file to change. Claude Code has no inline editing — you describe, it executes, you review. Different rhythm entirely.&lt;/p&gt;

&lt;p&gt;If you enjoy the act of writing code (not just describing it), Cursor feels better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visual feedback
&lt;/h3&gt;

&lt;p&gt;You see changes happening in real-time in your editor. Diffs are highlighted, you can accept or reject individual changes. With Claude Code, you see terminal output and then check the files afterward. For developers who think visually, Cursor's approach is more intuitive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Predictable pricing
&lt;/h3&gt;

&lt;p&gt;Cursor Pro is $20/month, period. Claude Code is usage-based — a heavy session can cost $5-20 depending on the model and how much context you're feeding it. If you code 8 hours a day, Claude Code can get expensive fast. Cursor's flat rate is simpler to budget.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model flexibility
&lt;/h3&gt;

&lt;p&gt;Cursor lets you switch between Claude, GPT, and Gemini models per task. Claude Code only runs Claude models. If you want GPT-5.4 for a specific task, you can't do that in Claude Code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing Reality
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Claude Code
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Runs on your Anthropic API key or Claude Max subscription&lt;/li&gt;
&lt;li&gt;Claude Max: $100/mo (5x usage), $200/mo (20x usage)&lt;/li&gt;
&lt;li&gt;API: ~$5-15 per heavy coding session (varies wildly)&lt;/li&gt;
&lt;li&gt;No free tier for coding use&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cursor
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free:&lt;/strong&gt; 2,000 completions, 50 premium requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro ($20/mo):&lt;/strong&gt; Unlimited completions, 500 premium requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business ($40/mo):&lt;/strong&gt; Team features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For light-to-moderate use, Cursor is cheaper. For heavy autonomous work, Claude Code can cost more but potentially saves more time.&lt;/p&gt;
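&lt;p&gt;A quick break-even sketch makes the trade-off explicit; the $10 average session cost here is an assumption taken from the range above:&lt;/p&gt;

```python
def monthly_costs(sessions_per_month, avg_session_cost=10.0, flat_rate=20.0):
    """Compare assumed usage-based spend against a flat monthly rate (USD)."""
    return {"usage_based": sessions_per_month * avg_session_cost, "flat": flat_rate}

# Two heavy sessions a month already match the $20 flat rate; one heavy
# session per workday (~22/month) lands around $220 on usage-based pricing.
```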

&lt;h2&gt;
  
  
  Who Should Use What
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Choose Claude Code if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're comfortable in the terminal&lt;/li&gt;
&lt;li&gt;You want maximum autonomy (describe → AI builds)&lt;/li&gt;
&lt;li&gt;You work on large refactoring tasks&lt;/li&gt;
&lt;li&gt;You already pay for Claude Max&lt;/li&gt;
&lt;li&gt;You use a non-VS Code editor&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Cursor if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You love the VS Code editing experience&lt;/li&gt;
&lt;li&gt;You want real-time AI suggestions while you type&lt;/li&gt;
&lt;li&gt;You prefer predictable monthly pricing&lt;/li&gt;
&lt;li&gt;You want to choose between multiple AI models&lt;/li&gt;
&lt;li&gt;You enjoy hands-on coding with AI assistance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The power move:&lt;/strong&gt; Use both. Claude Code for big autonomous tasks ("refactor this entire module"), Cursor for daily editing with inline suggestions. Many developers in the Pragmatic Engineer survey reported using 2-4 AI tools simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Claude Code is next on my &lt;a href="https://www.aimadetools.com/blog/cursor-ai-one-week-review/?utm_source=devto" rel="noopener noreferrer"&gt;I Used It for a Week&lt;/a&gt; review list. Stay tuned.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://www.aimadetools.com/blog/best-ai-coding-tools-2026/?utm_source=devto" rel="noopener noreferrer"&gt;Best AI Coding Tools in 2026: The Definitive Ranking&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.aimadetools.com/blog/claude-code-vs-cursor-2026/?utm_source=devto" rel="noopener noreferrer"&gt;https://www.aimadetools.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>cursor</category>
      <category>aitools</category>
      <category>comparison</category>
    </item>
  </channel>
</rss>
