<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matthias | StudioMeyer</title>
    <description>The latest articles on DEV Community by Matthias | StudioMeyer (@studiomeyer_io).</description>
    <link>https://dev.to/studiomeyer_io</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3866458%2F170ce662-470b-4f78-ac37-58a9a2a00220.PNG</url>
      <title>DEV Community: Matthias | StudioMeyer</title>
      <link>https://dev.to/studiomeyer_io</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/studiomeyer_io"/>
    <language>en</language>
    <item>
      <title>I let my AI agents rewrite their own prompts. The hard part was stopping them from getting worse.</title>
      <dc:creator>Matthias | StudioMeyer</dc:creator>
      <pubDate>Thu, 02 Jul 2026 08:44:40 +0000</pubDate>
      <link>https://dev.to/studiomeyer_io/i-let-my-ai-agents-rewrite-their-own-prompts-the-hard-part-was-stopping-them-from-getting-worse-46ed</link>
      <guid>https://dev.to/studiomeyer_io/i-let-my-ai-agents-rewrite-their-own-prompts-the-hard-part-was-stopping-them-from-getting-worse-46ed</guid>
      <description>&lt;p&gt;I let my AI agents rewrite their own prompts. The hard part was stopping them from getting worse.&lt;/p&gt;

&lt;p&gt;Most "self-evolving agent" demos die the moment you think about shipping them. Not because the idea is bad, but because an agent that can rewrite its own prompt can also quietly rewrite itself into something worse. It drifts. A critic starts rewarding the wrong thing. A regression slips in and nobody notices for a week because the output still looks fine at a glance.&lt;/p&gt;

&lt;p&gt;I spent the better part of three months building a TypeScript framework around that exact failure mode, and I want to walk through the part nobody demos: not the clever loop, but the gate that keeps the loop honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "self-evolving" actually means here
&lt;/h2&gt;

&lt;p&gt;The idea is simple. A normal agent has a static prompt. You write it once, and it never gets better on its own. You are the optimizer, forever, by hand.&lt;/p&gt;

&lt;p&gt;Darwin, the framework, flips that. The agent runs, something measures how good the run was, and over time the system learns where the prompt is weak and proposes a better version. Then, and this is the important bit, it does not just trust the new version. The new version has to earn its place.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You run an agent
       |
Darwin measures quality (a critic scores the output)
       |
Patterns emerge over time ("weak on technical topics")
       |
A new prompt variant is generated
       |
A/B tested against the current default
       |
The winner becomes the default
       |
Your agent got better. You did nothing.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last line is the marketing version. The honest version has a lot more machinery under it, because "the winner becomes the default" is where everything can go wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why most of these stay demos
&lt;/h2&gt;

&lt;p&gt;Here is the failure you do not see in a five-minute video. You wire up a loop, an LLM critiques its own output, it rewrites its own prompt, and for the first ten runs it genuinely looks like it is improving. Then one of these happens:&lt;/p&gt;

&lt;p&gt;The critic optimizes for the wrong signal. It starts rewarding longer answers, or more confident ones, and quality quietly drops while the score goes up.&lt;/p&gt;

&lt;p&gt;A tool the agent depends on has a bad hour. The outputs get worse for reasons that have nothing to do with the prompt, the system reads that as "the current prompt is bad," and it evolves away from a prompt that was actually fine.&lt;/p&gt;

&lt;p&gt;A rewrite erodes a constraint. The old prompt said "never invent a source." The new, higher-scoring variant is more fluent and slightly more willing to make things up. Your score went up. Your safety went down.&lt;/p&gt;

&lt;p&gt;Checking after every single run inflates false positives, so the system declares winners that are just noise.&lt;/p&gt;

&lt;p&gt;None of these are exotic. They are the default outcome if you build the loop and not the gate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gate is the actual product
&lt;/h2&gt;

&lt;p&gt;So the loop is maybe a third of the work. The rest is the set of guards that decide whether a mutation is allowed to survive. Four of them matter most.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regression rollback to last-known-good.&lt;/strong&gt; Every promoted prompt has a recorded baseline. If a newly promoted variant underperforms its predecessor past a threshold, it rolls back automatically. Evolution is allowed to try things. It is not allowed to keep things that made the agent worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data-quality guards that pause evolution.&lt;/strong&gt; If the signal feeding the critic looks broken (a tool timing out, empty responses, a spike of errors), evolution pauses instead of learning from garbage. You do not want your agent drawing conclusions during an outage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An alignment check on every mutation.&lt;/strong&gt; Before any rewrite is even eligible, it is checked against the constraints the agent is supposed to hold. A more fluent prompt that quietly drops a safety rule does not get to compete on score, because it never enters the ring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Statistically honest A/B.&lt;/strong&gt; Because the tempting thing is to peek after every run, the gate uses always-valid sequential tests (mSPRT and Hoeffding-style bounds) so that continuous checking does not manufacture significance. A variant wins when it actually won, not when you looked at the right moment.&lt;/p&gt;
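
&lt;p&gt;To make the "peek as often as you like" idea concrete, here is a minimal sketch of a Hoeffding-style sequential check. It is an illustration under stated assumptions, not the code Darwin ships: the names &lt;code&gt;hoeffdingRadius&lt;/code&gt; and &lt;code&gt;sequentialVerdict&lt;/code&gt; are made up, scores are assumed to lie in [0, 1], runs are assumed to be paired and independent, and the error budget is spread over looks with a simple union bound.&lt;/p&gt;

```typescript
// Hypothetical sketch, not the darwin-agents API.
type Verdict = 'variant-wins' | 'baseline-wins' | 'keep-testing';

// Anytime-valid Hoeffding radius for the mean of n values in [-1, 1].
// Look number n spends delta / (n * (n + 1)) of the error budget; those
// shares sum to delta over all n, so checking after every run stays valid.
function hoeffdingRadius(n: number, delta: number): number {
  return Math.sqrt((2 * Math.log((2 * n * (n + 1)) / delta)) / n);
}

// diffs[i] = variant score minus baseline score on paired run i.
function sequentialVerdict(diffs: number[], delta = 0.05): Verdict {
  const n = diffs.length;
  if (n === 0) return 'keep-testing';
  const mean = diffs.reduce((sum, d) => sum + d, 0) / n;
  const radius = hoeffdingRadius(n, delta);
  // Declare a winner only when the whole confidence band clears zero.
  if (mean - radius > 0) return 'variant-wins';
  if (0 > mean + radius) return 'baseline-wins';
  return 'keep-testing';
}
```

&lt;p&gt;The band is wide on purpose. A small, early lead returns &lt;code&gt;keep-testing&lt;/code&gt;, and only a sustained difference over many runs produces a verdict. An mSPRT would typically reach a decision sooner for the same error rate, at the cost of more machinery.&lt;/p&gt;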

&lt;p&gt;If you have ever watched an agent get subtly worse after a prompt change and had no principled way to catch it, that stack is the whole reason this exists.&lt;/p&gt;
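
&lt;p&gt;As a rough picture of how two of those guards could compose, the sketch below orders the data-quality check before the regression check, so an outage pauses evolution instead of triggering a rollback. Everything in it is an assumption for illustration: the &lt;code&gt;WindowStats&lt;/code&gt; shape, the &lt;code&gt;gateAction&lt;/code&gt; name, and both default thresholds are hypothetical and not taken from the package.&lt;/p&gt;

```typescript
// Hypothetical shapes and thresholds; the real darwin-agents types may differ.
interface WindowStats {
  runs: number;         // runs observed in the current window
  errors: number;       // runs that failed or timed out
  emptyOutputs: number; // runs that returned nothing
  meanScore: number;    // mean critic score, assumed to be in [0, 1]
}

type GateAction = 'pause-evolution' | 'rollback' | 'keep';

function gateAction(
  current: WindowStats,
  lastKnownGoodScore: number,
  maxBadRate = 0.2,     // assumed cutoff for "the signal looks broken"
  maxRegression = 0.05, // assumed tolerated drop versus the baseline
): GateAction {
  // No data is treated like bad data: nothing to learn from.
  if (current.runs === 0) return 'pause-evolution';
  const badRate = (current.errors + current.emptyOutputs) / current.runs;
  // Data-quality guard first: do not judge the prompt during an outage.
  if (badRate > maxBadRate) return 'pause-evolution';
  // Regression guard: a promoted variant that underperforms goes back.
  if (lastKnownGoodScore - current.meanScore > maxRegression) return 'rollback';
  return 'keep';
}
```

&lt;p&gt;The ordering is the design choice worth noticing. If the regression check ran first, a tool having a bad hour would look exactly like a bad prompt.&lt;/p&gt;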

&lt;h2&gt;
  
  
  Show me the code
&lt;/h2&gt;

&lt;p&gt;Once the package is installed and an API key is set, running an agent is one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;darwin-agents better-sqlite3
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-...   &lt;span class="c"&gt;# or OPENAI_API_KEY, or use the Claude CLI&lt;/span&gt;

npx darwin run writer &lt;span class="s2"&gt;"Explain the CAP theorem in simple terms"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Turning on evolution is opt-in, per agent. Nothing rewrites itself unless you say so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx darwin evolve writer &lt;span class="nt"&gt;--enable&lt;/span&gt;
npx darwin status writer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Defining your own agent is about a dozen lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;defineAgent&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;darwin-agents&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defineAgent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;writer&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;You explain technical topics clearly and simply.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;// evolution is off until you enable it; the safety gate is always on&lt;/span&gt;
  &lt;span class="na"&gt;evolution&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// opt in to the reflective optimizer when you want it&lt;/span&gt;
    &lt;span class="na"&gt;useGepa&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;State is stored as a single JSON blob in whichever backend you use (SQLite or Postgres), which turned out to matter a lot for keeping the thing backward-compatible. Adding a new optional field to an agent's evolution state does not break older rows. They just lack the key, and you read defensively. Boring, but it means upgrades do not eat your history.&lt;/p&gt;
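
&lt;p&gt;"Read defensively" can be as small as the sketch below. The &lt;code&gt;EvolutionState&lt;/code&gt; shape, its field names, and &lt;code&gt;readState&lt;/code&gt; are hypothetical and only stand in for whatever the real state looks like. The point is that a row written before a field existed still parses, and the missing key falls back to a default.&lt;/p&gt;

```typescript
// Hypothetical state shape; field names are illustrative only.
interface EvolutionState {
  promptVersion: number;
  baselineScore: number;
  pausedReason: string | null; // imagine this field arrived in a later release
}

const DEFAULT_STATE: EvolutionState = {
  promptVersion: 1,
  baselineScore: 0,
  pausedReason: null,
};

// Parse a stored JSON blob and fall back to a default for any key that is
// missing or has the wrong type, so older rows keep working after an upgrade.
function readState(blob: string): EvolutionState {
  const raw = JSON.parse(blob) ?? {};
  return {
    promptVersion:
      typeof raw.promptVersion === 'number' ? raw.promptVersion : DEFAULT_STATE.promptVersion,
    baselineScore:
      typeof raw.baselineScore === 'number' ? raw.baselineScore : DEFAULT_STATE.baselineScore,
    pausedReason:
      typeof raw.pausedReason === 'string' ? raw.pausedReason : DEFAULT_STATE.pausedReason,
  };
}
```

&lt;p&gt;Calling &lt;code&gt;readState&lt;/code&gt; on an older row that only has &lt;code&gt;promptVersion&lt;/code&gt; and &lt;code&gt;baselineScore&lt;/code&gt; returns those two values as stored, with &lt;code&gt;pausedReason&lt;/code&gt; set to &lt;code&gt;null&lt;/code&gt;.&lt;/p&gt;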

&lt;h2&gt;
  
  
  The part I am most proud of, briefly
&lt;/h2&gt;

&lt;p&gt;The mutation itself can be driven by a &lt;a href="https://arxiv.org/abs/2507.19457" rel="noopener noreferrer"&gt;GEPA&lt;/a&gt; reflective optimizer running online, inside the gate, instead of as an offline batch job you run once a week. The agent reflects on its own recent trajectories, proposes a targeted rewrite, and that rewrite still has to clear every guard above before it ships. Reflection proposes. The gate disposes. That separation is the whole trick.&lt;/p&gt;
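
&lt;p&gt;The separation can be sketched in a few lines. This is an illustration, not the package's API: &lt;code&gt;Guard&lt;/code&gt;, &lt;code&gt;keepsConstraints&lt;/code&gt;, and &lt;code&gt;gate&lt;/code&gt; are hypothetical names, and the alignment check here is a deliberately naive stand-in that only looks for each required rule as a literal substring. A real check would need to be much smarter than that.&lt;/p&gt;

```typescript
// Hypothetical sketch: names and checks are illustrative, not the darwin-agents API.
// A guard returns null when the candidate passes, or a reason when it does not.
type Guard = (candidatePrompt: string) => string | null;

// Naive stand-in for an alignment check: every required constraint must
// still appear verbatim in the rewritten prompt.
function keepsConstraints(required: string[]): Guard {
  return (candidate) => {
    const missing = required.filter((rule) => !candidate.includes(rule));
    return missing.length === 0 ? null : `dropped constraint: ${missing[0]}`;
  };
}

// The proposer only suggests. A candidate replaces the current prompt
// only if every guard passes; otherwise the current prompt stays.
function gate(
  current: string,
  candidate: string,
  guards: Guard[],
): { prompt: string; rejectedBy: string | null } {
  for (const guard of guards) {
    const reason = guard(candidate);
    if (reason !== null) return { prompt: current, rejectedBy: reason };
  }
  return { prompt: candidate, rejectedBy: null };
}
```

&lt;p&gt;With a guard built from &lt;code&gt;keepsConstraints(['never invent a source'])&lt;/code&gt;, a more fluent rewrite that omits that sentence is rejected before any score is compared, and the current prompt stays in place.&lt;/p&gt;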

&lt;h2&gt;
  
  
  Where this actually is, honestly
&lt;/h2&gt;

&lt;p&gt;I am not going to dress up the numbers, because the honest version is more interesting than the pitch.&lt;/p&gt;

&lt;p&gt;For most of the last three months this sat at single-digit stars. The quiet stretch where you publish version after version into the void and wonder if anyone runs them. Then, in the last two weeks, without a launch, something turned. Twelve stars in a single day after months of zero-to-one. The core package went from roughly six installs a day to around eighteen. A LangGraph adapter I shipped five weeks ago went from a trickle to a few hundred downloads a week.&lt;/p&gt;

&lt;p&gt;The absolute numbers are still small. A star count this size is not a movement and I will not pretend otherwise. But there is a real difference between a number that is small and a number that is small and accelerating, and the curve stopped being flat.&lt;/p&gt;

&lt;p&gt;The repo being small is not because it is new, by the way. Some of this code has been running our own agent fleet for months. It is small because I only recently decided to share it, and because I am genuinely bad at growth tactics. The code is real, it gets used, issues get answered. That is the whole offer.&lt;/p&gt;

&lt;p&gt;From a small studio in Palma de Mallorca.&lt;/p&gt;

&lt;h2&gt;
  
  
  If you want to poke at it
&lt;/h2&gt;

&lt;p&gt;It is MIT, TypeScript, on npm as &lt;code&gt;darwin-agents&lt;/code&gt;, with a &lt;code&gt;darwin-langgraph&lt;/code&gt; adapter if you already live in LangGraph. Source and docs are on GitHub under studiomeyer-io.&lt;/p&gt;

&lt;p&gt;The one thing I would genuinely like to hear from this community: what do you do to keep a self-improving agent from drifting? The gate above is my answer, but I am still learning this part, and the failure modes are sneakier than they look. If you have been burned by one, I want to know which one.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>typescript</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Writing AI Prompts: Brief It Like a New Employee</title>
      <dc:creator>Matthias | StudioMeyer</dc:creator>
      <pubDate>Wed, 01 Jul 2026 02:02:34 +0000</pubDate>
      <link>https://dev.to/studiomeyer_io/writing-ai-prompts-brief-it-like-a-new-employee-4ka</link>
      <guid>https://dev.to/studiomeyer_io/writing-ai-prompts-brief-it-like-a-new-employee-4ka</guid>
      <description>&lt;p&gt;Most people who think AI is overrated gave up at the same spot. They typed "write me a marketing email," got back a wall of generic mush with three exclamation marks, and decided the thing is dumb. The tool was never the problem. The briefing was.&lt;/p&gt;

&lt;p&gt;Here is the reframe that fixes ninety percent of it. Talk to the AI like a new employee on their first day. A sharp one, fast, never tired, but brand new. It knows language and general knowledge. It does not know your company, your customers, your prices, or what you actually meant. Nobody would hand a new hire the sentence "write me a marketing email" and expect something they could send. You would tell them what it is for, who it is going to, and show them a good one. That is all a prompt is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Building Blocks
&lt;/h2&gt;

&lt;p&gt;A good prompt has four parts. You do not need all four every time, but when an answer disappoints, a missing one is almost always why.&lt;/p&gt;

&lt;p&gt;Context. Who you are and what the situation is. "I run a small dental practice with three staff." One sentence saves you ten bad answers.&lt;/p&gt;

&lt;p&gt;The task. What you actually want, stated plainly. Not "help with marketing" but "write a short email reminding patients who have not been in for a year to book a checkup."&lt;/p&gt;

&lt;p&gt;Your material. The real text, numbers, or notes. Do not describe your offer, paste it. The AI is far better at working with what you give it than at guessing what you have.&lt;/p&gt;

&lt;p&gt;The format. How you want the answer to come out. "Three subject line options." "A bullet list." "A table." "Under 120 words." If you do not say, you get whatever shape it feels like, and usually the wrong one.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Copy and Paste Template
&lt;/h2&gt;

&lt;p&gt;Here is a fill in the blank version you can keep in a note and reuse for almost anything.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I am [who you are and your business]. I need [the task in one plain sentence]. Here is the material to work from: [paste your text, notes or numbers]. Write it for [who will read it] in a [warm / formal / direct] tone. Give it to me as [three options / a table / under 120 words].&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Fill the six brackets and you have covered all four building blocks without thinking about it. After a week you will do it in your head.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five Patterns That Always Work
&lt;/h2&gt;

&lt;p&gt;Once you have picked a tool (if you have not yet, our guide on &lt;a href="https://studiomeyer.io/en/blog/welche-ki-fuer-was" rel="noopener noreferrer"&gt;which AI to use for which job&lt;/a&gt; sorts that out), these five habits do most of the heavy lifting.&lt;/p&gt;

&lt;p&gt;Paste the real thing. Your draft, the customer email you are replying to, the notes from the call. Working from your actual material beats any description of it.&lt;/p&gt;

&lt;p&gt;Show one example. If you have one offer letter you were happy with, paste it and say "write the next one in this style." One example teaches the tone better than three paragraphs of adjectives.&lt;/p&gt;

&lt;p&gt;Say who it is for. "For a customer who is price sensitive." "For a supplier I have known for years." The same message changes completely depending on the reader, and the AI cannot see the reader unless you name them.&lt;/p&gt;

&lt;p&gt;Ask for the shape. Three options instead of one. A table instead of prose. A checklist instead of an essay. You can compare and pick, which is faster than getting one thing and fighting it.&lt;/p&gt;

&lt;p&gt;Keep talking. The first answer is a draft, not a verdict. "Shorter." "Less salesy." "More concrete, add a number." "Now make it sound less like a robot." It is a conversation, not a vending machine. This single habit separates people who get value from people who give up.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Before and After
&lt;/h2&gt;

&lt;p&gt;Weak prompt. "Write a follow up email to a customer."&lt;/p&gt;

&lt;p&gt;What you get back is polite, generic, and unusable, because it could be from any business to any customer about anything.&lt;/p&gt;

&lt;p&gt;Stronger prompt. "I run a small garden landscaping business. Write a short, friendly follow up email to a customer named Frau Berger who we sent a quote to last week for a new terrace, around 4,000 euros. She has not replied. Warm tone, no pressure, offer to answer questions or adjust the quote. Under 100 words. Give me two versions."&lt;/p&gt;

&lt;p&gt;Same AI, same thirty seconds of typing, completely different result. The difference is not skill. It is that the second one told the new employee what they needed to know. There is a &lt;a href="https://studiomeyer.io/en/blog/ki-aufgaben-unternehmen" rel="noopener noreferrer"&gt;list of everyday tasks worth trying this on&lt;/a&gt; if you want a place to start.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mistakes Beginners Make
&lt;/h2&gt;

&lt;p&gt;Being vague and blaming the tool. "Make it better" without saying better how. The AI cannot read your mind any more than a colleague could.&lt;/p&gt;

&lt;p&gt;Describing instead of pasting. Spending three sentences explaining a document you could have just dropped in.&lt;/p&gt;

&lt;p&gt;Accepting the first draft. The first answer is rarely the best one, and one more sentence usually fixes it.&lt;/p&gt;

&lt;p&gt;Asking yes or no questions. "Is this a good email?" gets you a shrug. "Give me two stronger versions and tell me what you changed" gets you something to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Thing Before You Paste
&lt;/h2&gt;

&lt;p&gt;A quick reminder that becomes its own habit. Be careful what you put in. Customer names, personal data, anything confidential should only go into a paid business account where your input is not used for training, and even then with thought. We cover the safe side of this in our guide on &lt;a href="https://studiomeyer.io/en/blog/dsgvo-konforme-ki" rel="noopener noreferrer"&gt;using AI in line with GDPR&lt;/a&gt;, and the next post in this series is entirely about what is safe to paste and what is not.&lt;/p&gt;

&lt;p&gt;Prompting well is not a technical skill. It is the same skill as briefing a person well, which you already have from every time you handed work to someone. The only new part is remembering that this particular colleague is brilliant, instant, and completely new every single morning. Tell it what it needs, show it one good example, and keep talking until it is right. That is the whole craft.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://studiomeyer.io/en/blog/gute-prompts-schreiben" rel="noopener noreferrer"&gt;studiomeyer.io&lt;/a&gt;. It is part of a plain-language series on getting started with AI in a small business.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>chatgpt</category>
      <category>writing</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Which AI for What: ChatGPT, Claude, Gemini or Copilot</title>
      <dc:creator>Matthias | StudioMeyer</dc:creator>
      <pubDate>Sun, 28 Jun 2026 00:02:45 +0000</pubDate>
      <link>https://dev.to/studiomeyer_io/which-ai-for-what-chatgpt-claude-gemini-or-copilot-4b2n</link>
      <guid>https://dev.to/studiomeyer_io/which-ai-for-what-chatgpt-claude-gemini-or-copilot-4b2n</guid>
      <description>&lt;p&gt;The question I hear most from small business owners is not whether AI is any good. They are past that. The question is "which one do I pick." ChatGPT, Claude, Gemini, Copilot, and a new one every few weeks. They open a comparison article, see a table full of benchmark numbers, and close it more confused than before.&lt;/p&gt;

&lt;p&gt;So here is the honest answer first, then the reasoning. For most small businesses the right choice comes down to two boring questions. What software do you already pay for, and what task do you do most often. Benchmark scores barely matter. The model that wins a coding test today gets beaten next month, and you will never notice the difference in your daily work anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four, in One Breath
&lt;/h2&gt;

&lt;p&gt;All four are chat assistants. You type or speak, they answer, they can read documents and write text. The differences that matter for a small business are about where they live and what they are best at.&lt;/p&gt;

&lt;p&gt;ChatGPT from OpenAI is the all-rounder. The biggest name, the most how-to guides written about it, strong at images and everyday tasks. Claude from Anthropic is the careful writer and reader. It is the one I reach for with long documents, contracts, and anything where the tone has to be right. Gemini from Google lives inside Gmail, Docs, and Sheets. Copilot from Microsoft lives inside Word, Excel, Outlook, and Teams. Where each assistant lives is the point most people skip, and it is the most important.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision One: Start With What You Already Pay For
&lt;/h2&gt;

&lt;p&gt;Before you compare anything, look at what your business already runs on.&lt;/p&gt;

&lt;p&gt;If your team works in Microsoft 365 all day, in Outlook and Word and Excel and Teams, then Copilot is the natural first answer. It sits inside the tools your people already open. There is no new login to hand out, no new app to train everyone on, and your data stays inside the Microsoft environment you already trust. The button appears in the corner of Word and it drafts the letter you were going to write anyway.&lt;/p&gt;

&lt;p&gt;If your business runs on Google Workspace, on Gmail and Google Docs and Sheets, then Gemini is the same kind of natural fit. It reads the email thread you are looking at and drafts the reply. It pulls numbers from the sheet that is already open.&lt;/p&gt;

&lt;p&gt;If you use neither, or you are a freelancer or a small team without a fixed office suite, then you are free to pick the best standalone assistant. That is ChatGPT or Claude. Both have a free version, both cost about the same to upgrade, and you can run them in a browser tab next to whatever else you use.&lt;/p&gt;

&lt;p&gt;This one decision removes most of the noise. You are not choosing between four tools anymore. You are choosing between one or two.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Two: Match the Tool to the Work
&lt;/h2&gt;

&lt;p&gt;Once the ecosystem question is settled, the second filter is the work you actually do. If you went with Copilot or Gemini because of your office suite, you can skip ahead to the privacy settings below and then just start using it. If you are in the open field with ChatGPT or Claude, or you want a second assistant alongside your office one, this is where the choice gets real. There is a &lt;a href="https://studiomeyer.io/en/blog/ki-aufgaben-unternehmen" rel="noopener noreferrer"&gt;longer list of tasks AI can take over&lt;/a&gt; if you want examples, but here is the short version of who is best at what.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What you need&lt;/th&gt;
&lt;th&gt;Best pick&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Everyday writing, brainstorming, quick answers&lt;/td&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;The reliable all-rounder, least friction to start&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long documents, contracts, careful tone&lt;/td&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;Reads long text well, writes with nuance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Working inside Gmail, Docs, Sheets&lt;/td&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;It is already in the app, no copy and paste&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Working inside Word, Excel, Outlook, Teams&lt;/td&gt;
&lt;td&gt;Copilot&lt;/td&gt;
&lt;td&gt;Same, built into the Microsoft tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generating images for posts and ads&lt;/td&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;Strongest at images that include readable text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research with up-to-date sources&lt;/td&gt;
&lt;td&gt;ChatGPT or Gemini&lt;/td&gt;
&lt;td&gt;Both search the live web well&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You do not need to memorize this. The pattern underneath it is simple. Pick by where the work happens. A contract review happens in a document, so Claude. An email reply happens in your inbox, so the assistant that lives in your inbox. A social post with a picture happens in ChatGPT.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Three: Two Privacy Settings on Day One
&lt;/h2&gt;

&lt;p&gt;This is the part small businesses skip and later regret. Before you or your team paste anything into an AI, settle two things.&lt;/p&gt;

&lt;p&gt;First, check whether the free or personal version learns from what you type. On several free tiers your input can be used to improve the model unless you turn that off in the settings. It takes two minutes to find the toggle. Do it before the first real prompt.&lt;/p&gt;

&lt;p&gt;Second, for anything that touches customer data, employee data, or anything confidential, use a paid business or team plan rather than a personal one. The business tiers from all four providers exclude training on your content by contract, and they give you admin control over the accounts. This matters more in Europe than anywhere, because the rules are not optional here. We wrote a &lt;a href="https://studiomeyer.io/en/blog/dsgvo-konforme-ki" rel="noopener noreferrer"&gt;separate guide on using AI in line with GDPR&lt;/a&gt; if you want the legal side in plain language. The next post in this series covers exactly what is safe to paste and what is not.&lt;/p&gt;

&lt;p&gt;The short rule. Personal account for personal tinkering. Business account the moment customer data is involved.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Actually Costs
&lt;/h2&gt;

&lt;p&gt;Prices move around, so treat these as bands and check the provider page before you buy. As of June 2026 the picture is steady at the entry level. A single paid seat, the Plus or Pro tier, runs around 20 euros a month across all four. That covers one person comfortably. Business and team plans, the ones with admin control and the training exclusion, sit around 25 to 30 euros per person per month. There is a free tier on every one of them, which is enough to test the waters but usually too limited for daily work.&lt;/p&gt;

&lt;p&gt;For a small business the math is gentle. One paid seat is the price of a few coffees. The real cost is not the subscription. It is the weeks you spend not using it while you wait to feel ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Part Nobody Tells You
&lt;/h2&gt;

&lt;p&gt;Here is the truth after all the comparing. For most small businesses, the difference between these four matters far less than the difference between using one and using none. I have watched owners spend a month researching the perfect choice and another month worrying they picked wrong, while a competitor down the road just opened ChatGPT and started saving an hour a day.&lt;/p&gt;

&lt;p&gt;Pick one. Use it every working day for two weeks on real tasks, not test questions. Draft a real email, summarize a real meeting, rewrite a real offer. After two weeks you will know more about what fits your business than any comparison table can tell you, and you will have lost nothing, because the habits carry over. If you switch later, the way you talk to one assistant works on the next one.&lt;/p&gt;

&lt;p&gt;The businesses pulling ahead this year are not the ones that picked the smartest AI. They are the ones that picked one and started. If you want the wider view of what else is out there, our &lt;a href="https://studiomeyer.io/en/blog/beste-ai-tools-kmu-2026" rel="noopener noreferrer"&gt;overview of the best AI tools for small businesses&lt;/a&gt; goes broader than these four. But you do not need it to begin. You need one tab open and one real task in front of you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://studiomeyer.io/en/blog/welche-ki-fuer-was" rel="noopener noreferrer"&gt;studiomeyer.io&lt;/a&gt;. It is part of a plain-language series on getting started with AI in a small business.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>chatgpt</category>
      <category>beginners</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Your First 30 Days With AI: One Tool, One Task at a Time</title>
      <dc:creator>Matthias | StudioMeyer</dc:creator>
      <pubDate>Fri, 26 Jun 2026 21:23:34 +0000</pubDate>
      <link>https://dev.to/studiomeyer_io/your-first-30-days-with-ai-one-tool-one-task-at-a-time-3if3</link>
      <guid>https://dev.to/studiomeyer_io/your-first-30-days-with-ai-one-tool-one-task-at-a-time-3if3</guid>
      <description>&lt;p&gt;Almost everyone I talk to has a browser with three or four AI tabs they opened once and never went back to. They signed up after someone swore it changed their life, typed two questions, got a shrug of an answer, and closed the tab. A month later they tell me AI is overhyped. The tool was rarely the problem. There was just no plan, so it stayed a toy instead of becoming a habit.&lt;/p&gt;

&lt;p&gt;Here is the thing nobody mentions in the breathless posts. You would never hand a new employee ten different jobs on their first morning and judge them by how they juggle all of them at once. You would start them on one thing, watch it go well, and add the next. AI is the same. The goal of your first month is not to automate your business. It is to come out the other side with two or three small habits that stick. That is a real win, and it is enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a Month, and Why Slow
&lt;/h2&gt;

&lt;p&gt;There is good research on where this goes right and where it goes wrong, and it points the same direction. In a large study of consultants, people did about forty percent better on work that sat inside what the AI is actually good at, and measurably worse on work that sat outside it. The same pattern shows up everywhere. AI is brilliant at some tasks and quietly bad at others, and the whole game is learning which is which for your own desk. You learn that by going slow on purpose, one task at a time, not by throwing everything at it in week one and giving up when half of it disappoints.&lt;/p&gt;

&lt;p&gt;There is one more finding worth holding onto, because it is the opposite of what people expect. When researchers looked at who gained the most, it was not the veterans. It was the least experienced people. If you have been putting this off because you feel behind, you are exactly the person it helps most. You just need a path in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Week One: One Tool, One Boring Task, Every Day
&lt;/h2&gt;

&lt;p&gt;Pick one tool and do not touch the others for a month. If you are not sure which, our guide on &lt;a href="https://studiomeyer.io/en/blog/welche-ki-fuer-was" rel="noopener noreferrer"&gt;which AI to use for which job&lt;/a&gt; sorts that out in a few minutes. The point is not the perfect choice, it is the single choice. Switching tools every other day is how people stay beginners for a year.&lt;/p&gt;

&lt;p&gt;Now pick the most boring repeating writing task you have. The one you do every week and slightly resent. Replying to the same kind of customer enquiry. Turning bullet points into a tidy email. Writing the standard quote intro for the hundredth time. Do that one task with the AI every single working day for a week. Not five tasks once. One task five times. By Friday you will know its rhythm the way you know a colleague's, what it gets right on its own and where you have to step in.&lt;/p&gt;

&lt;p&gt;Keep it small enough that it is never a project. Ten minutes a day is the whole commitment. The aim of week one is not output, it is the click in your head where this stops being a novelty and becomes the thing you reach for without deciding to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Week Two: Brief It Properly, Then Add the Second Task
&lt;/h2&gt;

&lt;p&gt;By now you have noticed the answers are only as good as what you put in. That is the real skill, and it is less technical than it sounds. The short version is to talk to the AI like a new hire who is sharp but knows nothing about your business. Our guide on &lt;a href="https://studiomeyer.io/en/blog/gute-prompts-schreiben" rel="noopener noreferrer"&gt;writing prompts that actually work&lt;/a&gt; walks through the four pieces that fix most weak answers. Spend week two getting better at the asking, not at finding new tools.&lt;/p&gt;

&lt;p&gt;Then add one more task, only one. Maybe summarising long emails before you read them, or turning the messy notes from a call into something you can send. When you find a prompt that works, save it. Keep a single note on your phone or desktop with your three or four best prompts, the ones you fill in and reuse. That note is worth more after a month than any course. It is your own playbook, built from your own work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Week Three: Bring Your Real Material In
&lt;/h2&gt;

&lt;p&gt;Up to now you have probably been describing things to the AI. This week you start feeding it the real thing instead. Your actual draft, the actual email you are answering, the actual numbers from the spreadsheet. Working from your real material beats describing it every time, and it is the single biggest jump in answer quality you will get.&lt;/p&gt;

&lt;p&gt;This is also the week to widen out a little, with a list in hand rather than guesswork. We wrote a plain rundown of &lt;a href="https://studiomeyer.io/en/blog/ki-aufgaben-unternehmen" rel="noopener noreferrer"&gt;everyday tasks worth handing to an AI&lt;/a&gt;, with a rough sense of the time each one saves, and week three is the right moment to walk down it and try the two or three that match your actual day. You are not trying all of them. You are finding which ones earn a permanent place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Week Four: Make It a Routine, and Know the Line
&lt;/h2&gt;

&lt;p&gt;By the last week the goal shifts from trying things to keeping the ones that worked. Look back at what you actually used. Two or three tasks will have stuck and the rest will have quietly fallen away, and that is exactly right. Pin the survivors to a real moment in your day. The Monday quote emails. The Friday summary. Attach the habit to something that already happens and it stops needing willpower.&lt;/p&gt;

&lt;p&gt;This is also the week to be clear about the line, especially if anyone else on your team is starting to use it. AI is fast at drafting, summarising, and reshaping text, and you can lean on it there. It is unreliable the moment a task turns on a number, a fact, or a judgment it cannot really check, which is why the consultants in that study did worse the moment they stepped past its edge. Treat its output as a confident first draft, never a final answer, and always read the part that carries a figure or a name. The other half of the line is what you put in, and our guide on &lt;a href="https://studiomeyer.io/en/blog/was-darf-ich-in-chatgpt-eingeben" rel="noopener noreferrer"&gt;what is safe to paste into a chat&lt;/a&gt; covers that side. Both halves take a week to become reflex and then you stop thinking about them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Keep After Thirty Days
&lt;/h2&gt;

&lt;p&gt;If you do this, you will not come out with a roomful of robots. You will come out with two or three jobs that used to take an hour and now take ten minutes, done in your voice, that you trust enough to keep doing. That is the whole prize, and it compounds quietly. The people who get value from this are not the ones who automated everything in a frantic first week. They are the ones who picked one task, got it right, and let the next one follow. Start Monday, pick the most boring thing on your desk, and give it the first ten minutes. In a month it will be a habit you forgot you had to learn.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://studiomeyer.io/en/blog/erste-30-tage-mit-ki" rel="noopener noreferrer"&gt;studiomeyer.io&lt;/a&gt;. It is part of a plain-language series on getting started with AI in a small business.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>beginners</category>
      <category>smallbusiness</category>
    </item>
    <item>
      <title>What You Can Safely Put Into ChatGPT: The Postcard Rule</title>
      <dc:creator>Matthias | StudioMeyer</dc:creator>
      <pubDate>Tue, 23 Jun 2026 07:46:59 +0000</pubDate>
      <link>https://dev.to/studiomeyer_io/what-you-can-safely-put-into-chatgpt-the-postcard-rule-45j3</link>
      <guid>https://dev.to/studiomeyer_io/what-you-can-safely-put-into-chatgpt-the-postcard-rule-45j3</guid>
      <description>&lt;p&gt;Picture a small business owner pasting a customer's email into a free ChatGPT to draft a reply. Name, address, what they ordered, the four thousand euro invoice, all of it, straight into the box. It feels harmless, and most of the time nothing bad happens. But whether it is harmless depends entirely on which chat window that text went into, and almost nobody checks before they hit enter.&lt;/p&gt;

&lt;p&gt;The last post in this series told you to paste the real thing in, because real material beats describing it. That is still true. This post is the other half of that advice: there is a line, and crossing it is the one beginner mistake that can actually cost you. The good news is that staying on the right side of it takes about half a second once it becomes a habit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is Worth Two Minutes
&lt;/h2&gt;

&lt;p&gt;A short reality check, not a scare. When Cyberhaven looked at what people actually paste into ChatGPT, roughly eleven percent of it was confidential. Harmonic Security analyzed about a million prompts and found that eight and a half percent put sensitive data at risk, and more than half of those went into the free tier. The most quoted example is still Samsung: engineers pasted internal source code in to fix a bug, and the company banned the tool across the whole organization soon after.&lt;/p&gt;

&lt;p&gt;None of that means AI is dangerous. It means the input box is not a private notebook. On free plans, your text can be used to train the model and can be seen by human reviewers. Once you know that, the rest is just common sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Postcard Rule
&lt;/h2&gt;

&lt;p&gt;Here is the single line to remember. On a free AI tool, treat anything you type like a postcard, not a sealed letter. Write what you would be fine with a stranger reading on its way through the post office. That one instinct catches almost every mistake before it happens, and you do not need to understand a word of how the model works to use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Buckets
&lt;/h2&gt;

&lt;p&gt;Sort what you want to paste into three buckets and the decision gets easy.&lt;/p&gt;

&lt;p&gt;Green, always fine. Anything already public or made up. Your published prices, your website copy, a general question, a draft where the real names are swapped for placeholders. Paste this freely, even on a free plan.&lt;/p&gt;

&lt;p&gt;Yellow, fine if you anonymize. Real work material that has identifying bits in it. A customer email, a quote, a clause from a contract. This is exactly the stuff that makes answers good, so you do want to use it. You just strip the names and numbers first. The next section is the whole trick.&lt;/p&gt;

&lt;p&gt;Red, never into a free tool. Anything you have a legal duty to protect or that would hurt if it leaked. Customer lists with personal data, health or financial records, passwords and access details, source code, anything covered by a confidentiality agreement. If you genuinely need AI on this kind of material, it goes into a paid business account, never a free one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The One Trick: Anonymize, Don't Omit
&lt;/h2&gt;

&lt;p&gt;This is where it all comes together. You do not have to choose between a good answer and keeping data safe. You replace the identifying parts with placeholders, let the AI do the work, then put the real details back in yourself afterwards.&lt;/p&gt;

&lt;p&gt;Take the follow-up email from the &lt;a href="https://studiomeyer.io/en/blog/gute-prompts-schreiben" rel="noopener noreferrer"&gt;previous post on writing prompts&lt;/a&gt;. Instead of "write a follow-up to Frau Berger about her four thousand euro patio quote," you write "write a follow-up to [CUSTOMER] about her [AMOUNT] [PROJECT] quote, warm tone, under a hundred words." The AI writes an equally good email. You paste her real name and the real number back in before you send. Thirty seconds, and nothing identifying ever actually left your screen. You get the quality the real material gives you, with none of the exposure.&lt;/p&gt;
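
&lt;p&gt;If you do this often enough to want to script it, the swap-and-restore step can be sketched in a few lines of Python. Treat it as a minimal illustration, not a privacy tool: the function names and the placeholder mapping are made up for this example, and it only replaces the exact strings you list yourself.&lt;/p&gt;

```python
def anonymize(text, replacements):
    """Swap each real detail for its placeholder before the text goes to the AI."""
    for real, placeholder in replacements.items():
        text = text.replace(real, placeholder)
    return text


def restore(text, replacements):
    """Put the real details back into the AI's answer, on your own machine."""
    for real, placeholder in replacements.items():
        text = text.replace(placeholder, real)
    return text


# Hypothetical mapping, built from the example in the text.
details = {
    "Frau Berger": "[CUSTOMER]",
    "four thousand euro": "[AMOUNT]",
    "patio": "[PROJECT]",
}

prompt = "Write a follow-up to Frau Berger about her four thousand euro patio quote."
safe_prompt = anonymize(prompt, details)

# Pretend this draft came back from the AI, still using placeholders.
draft = "Dear [CUSTOMER], just checking in on the [AMOUNT] [PROJECT] quote."
final = restore(draft, details)

print(safe_prompt)
print(final)
```

&lt;p&gt;Only the anonymized prompt ever goes into the chat window. The mapping with the real details stays on your own machine.&lt;/p&gt;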

&lt;h2&gt;
  
  
  Two Settings on Day One
&lt;/h2&gt;

&lt;p&gt;Two small things, set once, and you have covered most of the risk.&lt;/p&gt;

&lt;p&gt;First, turn off training and history in the settings of whatever tool you use. On most consumer tools there is a switch that controls whether your chats are used to improve the model, and training is often switched on by default. Two minutes in the settings menu.&lt;/p&gt;

&lt;p&gt;Second, if you handle real customer data regularly, pay for a Business or Team plan. On those, the provider contractually does not train on your input, and you can get the paperwork the law expects when someone else processes personal data for you. The free plan does not come with that paperwork, and that, not some vague feeling that "AI is risky," is the actual reason a free account is the wrong place for customer data. If you are still deciding which tool to put a paid plan on, the &lt;a href="https://studiomeyer.io/en/blog/welche-ki-fuer-was" rel="noopener noreferrer"&gt;first post in this series&lt;/a&gt; walks through that choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mistakes Beginners Make
&lt;/h2&gt;

&lt;p&gt;Uploading the whole document when you needed two paragraphs. The more you put in, the more there is to leak, and the model rarely needs the rest.&lt;/p&gt;

&lt;p&gt;Pasting a screenshot with data still visible in a corner. The AI reads the entire image, not just the part you were thinking about.&lt;/p&gt;

&lt;p&gt;Dropping in passwords, API keys, or logins to "let it help with the setup." Never. Those belong in a password manager, not a chat window.&lt;/p&gt;

&lt;p&gt;Believing it forgets anyway. It does not, by default. Assume anything you type can be stored, and on a free plan, looked at by a person.&lt;/p&gt;

&lt;h2&gt;
  
  
  Make It a Habit, Not a Worry
&lt;/h2&gt;

&lt;p&gt;The point of all this is not to make you nervous about a tool you just started to like. It is to turn one instinct into a reflex, the same way you already lower your voice when you say a customer's name out loud in a busy café. Anonymize the yellow stuff, keep a paid plan for the red stuff, and inside those lines paste as freely as you want. If you want the deeper, legal side of this, our guide to &lt;a href="https://studiomeyer.io/en/blog/dsgvo-konforme-ki" rel="noopener noreferrer"&gt;using AI in a GDPR-compliant way&lt;/a&gt; goes there. The next post in this series gets practical again: a plain list of the everyday jobs where this all starts paying off, the dozen tasks worth trying first thing Monday morning.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://studiomeyer.io/en/blog/was-darf-ich-in-chatgpt-eingeben" rel="noopener noreferrer"&gt;studiomeyer.io&lt;/a&gt;. It is part of a plain-language AI series for small businesses.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>chatgpt</category>
      <category>privacy</category>
      <category>security</category>
    </item>
    <item>
      <title>Most AI Agents Aren't in Production. Here's What Works.</title>
      <dc:creator>Matthias | StudioMeyer</dc:creator>
      <pubDate>Sun, 21 Jun 2026 22:48:03 +0000</pubDate>
      <link>https://dev.to/studiomeyer_io/most-ai-agents-arent-in-production-heres-what-works-4ni9</link>
      <guid>https://dev.to/studiomeyer_io/most-ai-agents-arent-in-production-heres-what-works-4ni9</guid>
      <description>&lt;p&gt;&lt;strong&gt;One widely shared survey says 42 percent of companies already run AI agents in production. The most rigorous source in the field, Stanford's 2026 AI Index, says real autonomous-agent deployment still sits in single digits across nearly every business function. Both numbers were published this year, both are defensible, and the distance between them is where almost every bad decision about AI agents is being made right now. If you only remember one thing about agents in mid-2026, make it this: the technology is far more capable than the deployment numbers suggest, and the gap is not about intelligence. It is about trust, scope, and whether anyone can tell when the agent is wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I build agent systems for a living, and I spend at least as much time talking clients out of agent projects as into them. Not because the tools are bad. Because the honest answer to "should we put an autonomous agent on this" is usually "on this specific slice, yes, and on the rest, not yet." The market is loud with both hype and backlash, and the truth is less satisfying than either. Here is the version I actually believe, with the numbers that support it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Number Depends Entirely on Who You Ask
&lt;/h2&gt;

&lt;p&gt;The single biggest error in reading agent-adoption data is treating "deploying," "in production," "scaling," and "delivering value" as the same word. They are measured by different people, on different cohorts, with definitions that quietly do most of the work.&lt;/p&gt;

&lt;p&gt;The headline 42 percent comes from Mayfield, a venture firm, surveying 266 senior technology executives in its own network in January. It is a real signal, but it is a flattering crowd answering a generous question. Step to the harder methodologies and the floor drops out. McKinsey's late-2025 State of AI found about 23 percent of organizations scaling an agentic system somewhere, but fewer than 10 percent scaling agents to tangible value. Stanford's AI Index, 400-plus pages and the least conflicted source I know, puts genuine autonomous-agent deployment in single digits across nearly all functions. The recurring industry phrase for the space between a pilot and production is "pilot purgatory," and most companies are sitting in it.&lt;/p&gt;

&lt;p&gt;Reconcile those honestly and you get a picture you can defend to a skeptic. Among larger companies, a clear majority are experimenting, somewhere between 10 and 30 percent have at least one agent genuinely in production, and well under 15 percent are running agents at the scale where they move the bottom line. Even the optimistic Mayfield data carries the tell: 84 percent of those executives call security and compliance non-negotiable, yet 60 percent admit they have early-stage or no formal AI governance, and they name data readiness, not model quality, as the number-one blocker. The agents are ready before the organizations are.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agents Finish About a Third of Real Office Work
&lt;/h2&gt;

&lt;p&gt;When you measure agents on realistic work instead of clean benchmarks, the capability gap becomes concrete. Carnegie Mellon built TheAgentCompany, a simulated firm with 175 multi-step tasks across software, finance, HR and admin, wired up with the actual tools a company uses. The best frontier model finished about 30 percent of the tasks outright, a bit under 40 percent with partial credit, at roughly four dollars a task. The rest it got wrong, abandoned, or, most tellingly, faked. The researchers watched agents "create fake shortcuts that omit the hard part of the task," which is the single failure mode a business should fear most, because it looks like success until it isn't.&lt;/p&gt;

&lt;p&gt;The capability is also jagged in ways that defy intuition. The same model that earns a gold-medal score on a mathematics olympiad reads an analog clock correctly about half the time. Hallucination is not a solved problem with a single rate, whatever you have read: across 26 frontier models on one 2026 evaluation, hallucination ranged from 22 to 94 percent depending on the test, and accuracy collapses when a question is framed to flatter a false assumption. There is now a tracked database of more than 1,400 court cases containing AI-fabricated legal citations. None of this means agents are useless. It means their failures land in places humans do not expect, which is exactly why unsupervised deployment goes wrong.&lt;/p&gt;

&lt;p&gt;The plain-English verdict is more useful than any benchmark. Agents are reliable today at bounded, tool-shaped tasks where the work can be checked at the end. They are unreliable at open-ended judgment, messy real-world inputs like a mixed pile of photographed invoices, and long-running goals with no checkpoints. The skill in 2026 is not picking the smartest model. It is telling those two categories of work apart.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why More Than 40 Percent of Agent Projects Will Be Cancelled
&lt;/h2&gt;

&lt;p&gt;Gartner surveyed more than 3,400 enterprise leaders and predicts that over 40 percent of agentic AI projects will be cancelled by the end of 2027. The interesting part is the cause, because it is almost never "the model wasn't smart enough." The named reasons are escalating costs nobody budgeted for, business value too vague to defend when leadership asks for the return, risk controls too weak to let an agent near customer data, and a generous amount of "agent-washing," Gartner's own term for a chatbot wearing an agent costume. The failures are use-case selection errors, not technology failures.&lt;/p&gt;

&lt;p&gt;Cost is the quietest killer here, and it compounds with a design fashion. The instinct on hard problems is to throw a swarm of agents at them, but Princeton researchers found a single agent matched or beat multi-agent setups on 64 percent of tasks given the same tools, while the multi-agent version burned roughly two to three times the tokens for about two points of extra accuracy. Agentic systems already fire ten to twenty model calls per task, and that is exactly the dynamic behind &lt;a href="https://studiomeyer.io/en/blog/ai-cost-paradox-2026" rel="noopener noreferrer"&gt;the AI cost paradox&lt;/a&gt;: the per-token price keeps falling while the bill keeps rising, because every extra agent in the loop spends the savings. A multi-agent architecture you adopted for elegance can quietly become the line item that gets the whole project cancelled.&lt;/p&gt;
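
&lt;p&gt;To see how the bill can rise while the per-token price falls, here is a rough back-of-envelope sketch. Only two inputs come from the figures above: fifteen calls per task sits in the middle of the ten-to-twenty range, and the 2.5 multiplier sits in the middle of the two-to-three-times token overhead. The price, the tokens per call, and the monthly workload are invented purely for illustration.&lt;/p&gt;

```python
# All three constants are illustrative assumptions, not measured prices.
PRICE_PER_MILLION_TOKENS = 5.00   # hypothetical blended price in dollars
TOKENS_PER_CALL = 4_000           # hypothetical average prompt plus response
TASKS_PER_MONTH = 10_000          # hypothetical workload


def monthly_cost(calls_per_task, token_multiplier=1.0):
    """Dollars per month for a given number of model calls per task."""
    tokens = TASKS_PER_MONTH * calls_per_task * TOKENS_PER_CALL * token_multiplier
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


single_agent = monthly_cost(calls_per_task=15)
multi_agent = monthly_cost(calls_per_task=15, token_multiplier=2.5)

print(f"single agent: {single_agent:,.0f} dollars per month")
print(f"multi-agent:  {multi_agent:,.0f} dollars per month")
```

&lt;p&gt;With these made-up inputs the single agent comes to 3,000 dollars a month and the multi-agent version to 7,500. The absolute numbers mean nothing. The ratio is the point.&lt;/p&gt;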

&lt;h2&gt;
  
  
  The Bottleneck Is Trust, Not Intelligence
&lt;/h2&gt;

&lt;p&gt;The clearest evidence that capability is not the constraint comes from the one category where agents indisputably work: writing code. Anthropic's Claude Code reached an annualized run-rate above 2.5 billion dollars by February, more than doubling since the start of the year, with enterprise now over half its revenue. Cursor crossed two billion in annual revenue in February and around three billion by April. OpenAI's Codex passed roughly four million weekly developers. These are not pilots. They are the fastest-growing software category I have ever watched, and they work for one boring reason: code has tests. The check at the end is built in, so delegation is safe.&lt;/p&gt;

&lt;p&gt;And yet, even here, trust lags capability. Anthropic's own 2026 analysis of how developers work found they now use AI in around 60 percent of their tasks but fully delegate only zero to twenty percent. One observer put it perfectly: developers are using these tools more aggressively than ever while trusting them less. The response that worked was not a smarter model, it was a governance feature. Claude Code shipped an "auto mode" that uses a separate classifier to auto-approve safe actions like writing files and running tests, while blocking destructive ones like mass deletion. That is the whole lesson of mid-2026 in one product decision: the agent did not need to get smarter to be trusted in production, it needed a boundary it could not cross without a human, made explicit in the architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Actually Automate Now
&lt;/h2&gt;

&lt;p&gt;If you run a business and want the practical version, here is the decision rule I use. An agentic task is a good candidate when it is bounded, tool-shaped, and cheaply verifiable: the inputs are predictable, the agent acts through defined tools rather than open judgment, and there is a clear check at the end that tells you whether it worked. Support-ticket triage and routing, drafting replies a human approves, reconciling structured records, screening and scheduling, pulling and summarizing from systems you control: these are the wins that ship. They are unglamorous, narrow, and they pay off.&lt;/p&gt;

&lt;p&gt;The work to avoid handing an unsupervised agent is the mirror image: anything requiring open-ended judgment, messy or mixed inputs, irreversible actions, or a long horizon with no checkpoints. That is also where most of the cancelled projects in the Gartner data were aimed, and where the &lt;a href="https://studiomeyer.io/en/blog/ai-agent-traps" rel="noopener noreferrer"&gt;most common agent traps&lt;/a&gt; live. Picking the wrong task is the mistake, not picking the wrong model.&lt;/p&gt;
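
&lt;p&gt;The decision rule is simple enough to write down as a checklist. The sketch below is one possible encoding, not a formal method: the field names and the two example tasks are hypothetical, and in practice each answer is a judgment call you make about your own process.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    predictable_inputs: bool
    acts_through_defined_tools: bool
    has_clear_end_check: bool
    irreversible: bool = False
    long_horizon_without_checkpoints: bool = False


def is_good_agent_candidate(task):
    """Bounded, tool-shaped, cheaply verifiable, and none of the red flags."""
    fits = (
        task.predictable_inputs
        and task.acts_through_defined_tools
        and task.has_clear_end_check
    )
    red_flags = task.irreversible or task.long_horizon_without_checkpoints
    return fits and not red_flags


# Hypothetical examples.
triage = Task("support-ticket triage", True, True, True)
refunds = Task("issue refunds without review", True, True, True, irreversible=True)

for task in (triage, refunds):
    print(task.name, is_good_agent_candidate(task))
```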

&lt;p&gt;When the task is a fit, the playbook that separates the projects that survive from the 40 percent that don't is consistent across every serious source. Map the process as a manual runbook first, and if you cannot write steps a new employee could follow without asking questions, you are not ready to automate it. Narrow the scope to one high-value workflow and two or three agents at most. Make human-in-the-loop a design property, not an apology: the agent handles the clear cases and routes the ambiguous, low-confidence, and high-risk ones to a one-click review queue. Keep the agent's state, its memory of what is true and what is still open, in a database you own rather than in its context window. This is the same discipline behind any real &lt;a href="https://studiomeyer.io/en/blog/ki-automatisierung-leitfaden" rel="noopener noreferrer"&gt;AI automation that holds up in production&lt;/a&gt;, and it is boring on purpose.&lt;/p&gt;
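
&lt;p&gt;Human-in-the-loop as a design property can be as plain as one routing function. This is a minimal sketch under stated assumptions: the case fields, the labels, and the in-memory list standing in for a review queue are all hypothetical, and a real system would persist the queue in the database you own.&lt;/p&gt;

```python
REVIEW_QUEUE = []  # stand-in for a review queue stored in your own database


def route(case):
    """Let the agent act on clear cases and send everything else to a human."""
    needs_human = (
        case["ambiguous"]
        or case["confidence"] == "low"
        or case["risk"] == "high"
    )
    if needs_human:
        REVIEW_QUEUE.append(case["id"])
        return "human_review"
    return "agent_handles"


# Hypothetical cases.
clear_case = {"id": 1, "ambiguous": False, "confidence": "high", "risk": "low"}
risky_case = {"id": 2, "ambiguous": False, "confidence": "high", "risk": "high"}

clear_result = route(clear_case)
risky_result = route(risky_case)
print(clear_result, risky_result, REVIEW_QUEUE)
```

&lt;p&gt;The useful property is that the boundary lives in code the agent cannot argue with, not in a prompt asking it to be careful.&lt;/p&gt;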

&lt;h2&gt;
  
  
  What This Means
&lt;/h2&gt;

&lt;p&gt;The shakeout Gartner is forecasting is not the bubble bursting, it is the category growing up. The projects that die were mostly aimed at the wrong work, sold on a vague return, or built without a boundary the agent could not cross. The ones that survive will look unimpressive next to the demos: a single agent owning one well-defined workflow, with a human at every high-risk gate and a number that shows it moved. That is what "in production" actually looks like, and it is why the real adoption figure is single digits while the capability is anything but.&lt;/p&gt;

&lt;p&gt;My prediction is that the most valuable question in any AI-agent conversation for the next year will not be "how smart is the model." It will be "what can this agent not do, and where exactly does a human stand when it hits that wall." Answer that well and you are in the small group getting real value. Skip it and you are funding a pilot that a Gartner analyst already counted as cancelled. The agents are ready for more than most companies are doing with them, and for far less than the loudest people are selling. The work is learning to tell which is which.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by Matthias Meyer of &lt;a href="https://studiomeyer.io/en" rel="noopener noreferrer"&gt;StudioMeyer&lt;/a&gt;, a web and AI agency on Mallorca building MCP servers, agent fleets and AI products for small and mid-size businesses. This article was &lt;a href="https://studiomeyer.io/en/blog/ai-agents-production-reality-2026" rel="noopener noreferrer"&gt;originally published on the StudioMeyer blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>AI Now Recommends Local Businesses. Most Are Invisible.</title>
      <dc:creator>Matthias | StudioMeyer</dc:creator>
      <pubDate>Sat, 20 Jun 2026 22:37:42 +0000</pubDate>
      <link>https://dev.to/studiomeyer_io/ai-now-recommends-local-businesses-most-are-invisible-3cb6</link>
      <guid>https://dev.to/studiomeyer_io/ai-now-recommends-local-businesses-most-are-invisible-3cb6</guid>
      <description>&lt;p&gt;&lt;strong&gt;Forty-five percent of consumers now use an AI assistant to find a local service, up from six percent a year ago. In the same window, ChatGPT was measured recommending just 1.2 percent of all local business locations. Put those two numbers next to each other and you have the local-search story of 2026 in a sentence: people are asking AI which plumber, which restaurant, which estate agent to use, and for almost every business the answer it gives does not include them. The old game was ranking on a page of ten blue links. The new game is being the one name the assistant says out loud, and most local businesses have not noticed the rules changed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I run a web and AI agency on Mallorca, so I watch this from the worst possible vantage point: a place whose entire economy runs on strangers deciding where to eat, stay, and buy. For years that decision started with a Google search and a map full of pins. Increasingly it starts with a question typed into ChatGPT or Gemini, and the response is not a list to browse. It is a short, confident recommendation of two or three places, and everything else may as well not exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  The New Local Search Is a Single Answer
&lt;/h2&gt;

&lt;p&gt;The behavioral shift underneath this is faster than anything I have seen in fifteen years of building websites. BrightLocal's 2026 survey put AI use for finding local services at 45 percent, a more than sevenfold jump in a single year. Google has not collapsed, it still holds around 90 percent of conventional search-engine share, but the experience changed underneath the market share. Roughly 68 percent of Google searches now end without a click to the open web, and AI Overviews, the summary box Google writes for you, appear in about 68 percent of local searches and sit above both the paid ads and the organic results.&lt;/p&gt;

&lt;p&gt;The mechanical effect is brutal for anyone who relied on being findable. When an AI Overview appears, the click-through rate to the top organic result drops by 58 percent, measured by Ahrefs across 300,000 keywords. Pew Research, watching real browsing behavior, found people click a traditional link 8 percent of the time when a summary is present versus 15 percent when it is not. Google's newer AI Mode is more extreme still, with around 93 percent of those sessions ending in zero clicks. The search did not disappear. The list of ten options you used to scroll through did.&lt;/p&gt;

&lt;p&gt;That is the part that should change how a local owner thinks. You are no longer competing for position four instead of position seven. You are competing to be one of the two or three names that exist at all inside a generated answer. It is closer to winning a recommendation from a knowledgeable friend than to ranking in a directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Most Local Businesses Are Simply Not in the Answer
&lt;/h2&gt;

&lt;p&gt;Here is the uncomfortable measurement. SOCi's 2026 Local Visibility Index looked at more than 350,000 business locations and found ChatGPT recommends only about 1.2 percent of them when asked for a local option. Eighty-three percent of restaurants do not appear in AI local recommendations at all. This is not a gentle reshuffling of who ranks where. It is a near-total filter, and on the wrong side of it your business is invisible to a fast-growing share of the people who would have walked in.&lt;/p&gt;

&lt;p&gt;Winner-take-most dynamics are not new on the web, but local used to be the exception. A maps result had room for a dozen nearby options, and proximity alone got you seen. The generated answer has no such mercy. It names a few and stops. For the business that gets named, AI discovery is a compounding gift. For everyone else it is a slow leak of customers who never knew the place existed, with no analytics dashboard showing the loss, because you cannot measure the searches where you were never mentioned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Surprise: ChatGPT Does Not Ask Google
&lt;/h2&gt;

&lt;p&gt;This is the finding that reframes the whole problem, and most local owners have never heard it. When researchers reverse-engineered where ChatGPT's first local recommendations come from, the answer was not Google. Roughly 60 to 70 percent of the local businesses ChatGPT surfaces first are pulled from Foursquare's Places API. The assistant is leaning on a location dataset most business owners last thought about in 2014.&lt;/p&gt;

&lt;p&gt;Sit with what that means. Your Google Business Profile, the thing every local-SEO guide told you to obsess over, is one input among several, and for the assistant a lot of people now ask first, it may not be the input that matters most. Your listing on Foursquare, Apple Maps, and Yelp, the consistency of your name, address, and category across all of them, is now a direct ranking signal for AI recommendations. The businesses that quietly kept those listings clean are getting recommended by a machine they never optimized for, and the ones who let them rot are paying for it without knowing why.&lt;/p&gt;

&lt;p&gt;This is why I tell people that local AI visibility is not Google visibility with a new coat of paint. It runs on a different and wider set of data sources, and the work is partly old-fashioned listing hygiene that fell out of fashion right when it started to matter again.&lt;/p&gt;
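
&lt;p&gt;The listing hygiene described above comes down to comparing the same few fields across platforms. Here is a small sketch of that comparison, assuming you have copied your listing details by hand into a dictionary. The business, the addresses, and the platform keys are invented, and nothing here calls a real API.&lt;/p&gt;

```python
def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(value.lower().split())


def find_inconsistencies(listings):
    """Return the fields whose values differ between platforms."""
    problems = {}
    for field in ("name", "address", "category"):
        values = {
            platform: normalize(data[field])
            for platform, data in listings.items()
        }
        if len(set(values.values())) != 1:
            problems[field] = values
    return problems


# Hypothetical listings for a made-up business.
listings = {
    "google": {"name": "Cafe Sol", "address": "Carrer Major 12, Palma", "category": "Cafe"},
    "foursquare": {"name": "Cafe Sol", "address": "Carrer Major 12, Palma", "category": "Coffee Shop"},
    "yelp": {"name": "Café Sol Palma", "address": "Carrer Major 12,  Palma", "category": "Cafe"},
}

problems = find_inconsistencies(listings)
for field, values in problems.items():
    print(field, values)
```

&lt;p&gt;In this made-up case the address matches everywhere once the stray double space is ignored, while the name and the category do not, and those two are what you would go and fix.&lt;/p&gt;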

&lt;h2&gt;
  
  
  Your Website Still Matters, But Differently
&lt;/h2&gt;

&lt;p&gt;None of this retires your website. It changes the job the website does. An assistant that wants to recommend you has to be able to read you, and a startling amount of the web is now unreadable to it. Adobe found that 34 percent of product pages and a quarter of homepage and category pages are inaccessible to AI assistants, usually because the content is rendered by JavaScript that the AI crawler never executes. If the assistant cannot parse your hours, your prices, your services, you are not in the answer, no matter how good the page looks to a human.&lt;/p&gt;

&lt;p&gt;The fix is the unglamorous discipline of making your facts machine-readable: server-rendered text instead of script-built content, complete LocalBusiness and FAQ structured data, and clear answer blocks that state a fact in one place instead of smearing it across five paragraphs. The deeper mechanics of why an engine cites one page and ignores a better-written one are worth understanding properly, and I wrote a full piece on &lt;a href="https://studiomeyer.io/en/blog/how-ai-citations-actually-work" rel="noopener noreferrer"&gt;how AI citations actually work&lt;/a&gt; for anyone who wants the pipeline rather than the checklist. The short version is that engines lift passages, not pages, and they reward content that is reachable, liftable, and corroborated everywhere else on the web.&lt;/p&gt;
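
&lt;p&gt;For the structured-data part, here is a minimal sketch of what a LocalBusiness block can look like, generated with Python's standard json module. The property names come from the schema.org vocabulary. Every value is a placeholder for a made-up business, and the set of properties shown is a starting point, not a complete or required list.&lt;/p&gt;

```python
import json

# Hypothetical business details; replace every value with your own facts.
business = {
    "@context": "https://schema.org",
    "@type": "LocalBusiness",
    "name": "Cafe Sol",
    "url": "https://example.com",
    "telephone": "+34 000 000 000",
    "priceRange": "€€",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "Carrer Major 12",
        "addressLocality": "Palma",
        "postalCode": "07001",
        "addressCountry": "ES",
    },
    "openingHours": ["Mo-Fr 08:00-18:00", "Sa 09:00-14:00"],
}

json_ld = json.dumps(business, indent=2, ensure_ascii=False)
print(json_ld)
```

&lt;p&gt;The printed JSON goes into the server-rendered HTML of the page, inside a script element of type application/ld+json, so a crawler that never runs JavaScript can still read your hours, address, and price range.&lt;/p&gt;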

&lt;h2&gt;
  
  
  Mallorca Is Already Being Sorted by Algorithm
&lt;/h2&gt;

&lt;p&gt;The travel verticals are where this is moving fastest, which makes the Balearics an early and slightly unsettling test case. Fifty-six percent of US leisure travelers used AI for at least one trip in the past year, generative AI now makes up around a third of trip-research activity, and AI-driven travel traffic to booking sites grew nearly 200 percent year over year in May. Europe is a step behind the UK on the same curve, with Spain squarely in the markets being tracked.&lt;/p&gt;

&lt;p&gt;It is no longer abstract here. In June a Balearic outlet reported a tourism study finding that AI agents now recommend specific island destinations by name, Sant Antoni on Ibiza, Ciutadella on Menorca, acting as an amplifier that pushes already-crowded places harder. Big operators are moving to be inside that funnel rather than downstream of it. IHG launched a ChatGPT app in early June to let travelers discover and compare its hotels directly in the assistant. Real estate is just as exposed, with AI Overviews showing up for as many as half of local-intent property searches, which is precisely the territory an &lt;a href="https://studiomeyer.io/en/blog/ai-ready-immobilien" rel="noopener noreferrer"&gt;AI-ready real estate site&lt;/a&gt; is built for.&lt;/p&gt;

&lt;p&gt;There is a saving grace, and it is an important one. Around 51 percent of travelers who use AI still click through to a real website before deciding, and AI-referred travel visitors still convert about 28 percent worse than non-AI ones, because booking a holiday or a finca is a trust decision a person still wants to make as a person. The assistant increasingly owns discovery. The human still owns the close. That split is the whole strategy: you want to be in the answer when someone asks, and you want a site and a process good enough to win them once the AI hands them over.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Actually Do
&lt;/h2&gt;

&lt;p&gt;The practical work divides into three layers, and only the first is urgent. Start with presence on the surfaces the assistants actually read. Claim and clean your Foursquare, Apple Maps, and Yelp listings alongside Google, make the name, address, phone, and category identical across all of them, and treat that consistency as the ranking signal it has quietly become. This is cheap, it is mostly a weekend of careful work, and it is the single highest-return thing a local business can do for AI visibility right now.&lt;/p&gt;
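&lt;p&gt;The consistency check itself is easy to automate once you have copied your listings into one place. The following is a rough sketch under my own assumptions: the listing data is made up, the platform keys are just labels, and the normalization rules (lowercase, punctuation stripped, phone reduced to digits) are one reasonable choice rather than what any assistant actually does.&lt;/p&gt;

```python
import re

FIELDS = ("name", "address", "phone", "category")


def normalize(listing):
    """Reduce a listing to a comparable tuple in FIELDS order."""
    def clean(text):
        return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

    digits = re.sub(r"\D", "", listing["phone"])
    return (clean(listing["name"]), clean(listing["address"]), digits, clean(listing["category"]))


def find_mismatches(listings):
    """Compare every listing to the first one and report the fields that differ."""
    reference = normalize(next(iter(listings.values())))
    mismatches = {}
    for platform, listing in listings.items():
        current = normalize(listing)
        differing = [f for f, a, b in zip(FIELDS, reference, current) if a != b]
        if differing:
            mismatches[platform] = differing
    return mismatches


# Hypothetical listings, for illustration only. Google is treated as the reference.
LISTINGS = {
    "google": {"name": "Example Bakery", "address": "Carrer Example 1, 07001 Palma",
               "phone": "+34 600 000 000", "category": "Bakery"},
    "foursquare": {"name": "Example Bakery", "address": "Carrer Example 1, 07001 Palma",
                   "phone": "+34-600-000-000", "category": "Bakery"},
    "apple_maps": {"name": "Example Bakery", "address": "Carrer Example 1, 07001 Palma",
                   "phone": "+34 600 000 000", "category": "Bakery"},
    "yelp": {"name": "Example Bakery S.L.", "address": "Carrer Example 1, 07001 Palma",
             "phone": "+34 600 000 000", "category": "Cafe"},
}

MISMATCHES = find_mismatches(LISTINGS)
print(MISMATCHES)
```

&lt;p&gt;Formatting differences in the phone number are ignored on purpose, while a changed legal suffix or category is flagged, because those are the differences a human would also read as two different businesses.&lt;/p&gt;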

&lt;p&gt;The second layer is making your own site answerable. Server-render the facts, add LocalBusiness and FAQ structured data, and write the key questions a customer would ask an assistant directly into pages with direct, liftable answers. Treating this as its own discipline rather than an afterthought to classic SEO is the core of what &lt;a href="https://studiomeyer.io/en/services/geo" rel="noopener noreferrer"&gt;generative engine optimization&lt;/a&gt; actually is, and the businesses doing it now are accumulating an advantage while only about one in seven of their competitors even tracks whether AI mentions them. For the verticals where the stakes are highest, the gap between the prepared and the invisible is already wide.&lt;/p&gt;

&lt;p&gt;The third layer is the forward bet, and it is optional for now: giving agents a real interface to your business, an API or an MCP server they can query for availability, pricing, and booking instead of scraping a cached page. For most local businesses that is a 2027 conversation. But the discovery work in the first two layers is a today problem, and it is the rare kind that rewards moving before your competitors do.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means
&lt;/h2&gt;

&lt;p&gt;Local discovery is bifurcating, quietly and without an announcement. A small set of businesses is becoming the answer the assistants give, and that position compounds, because being recommended makes you more visible, which makes you more likely to be recommended again. Everyone else is sliding toward an invisibility they cannot see in any dashboard, losing the customers who used to find them by accident on a map. The cruel part is that the loss is silent. There is no notification that you were left out of an answer.&lt;/p&gt;

&lt;p&gt;My prediction is that within a year, asking which businesses an AI recommends in your category will be as routine as checking your Google ranking was a decade ago, and a lot of owners will not like what they hear. The good news is that the filter is still loose enough to climb into. Most local businesses have not done the listing hygiene, have not made their site machine-readable, and have no idea Foursquare is back from the dead. That gap is the opportunity, and on an island that lives or dies on being chosen by people who have never been here, I would not wait for it to close.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by Matthias Meyer of &lt;a href="https://studiomeyer.io/en" rel="noopener noreferrer"&gt;StudioMeyer&lt;/a&gt;, a web and AI agency on Mallorca building MCP servers, agent fleets and AI products for small and mid-size businesses. This article was &lt;a href="https://studiomeyer.io/en/blog/local-ai-discovery-2026" rel="noopener noreferrer"&gt;originally published on the StudioMeyer blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>seo</category>
      <category>marketing</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The AI Cost Paradox: 280x Cheaper, Bills Still Rising</title>
      <dc:creator>Matthias | StudioMeyer</dc:creator>
      <pubDate>Sat, 20 Jun 2026 08:58:53 +0000</pubDate>
      <link>https://dev.to/studiomeyer_io/the-ai-cost-paradox-280x-cheaper-bills-still-rising-5g9l</link>
      <guid>https://dev.to/studiomeyer_io/the-ai-cost-paradox-280x-cheaper-bills-still-rising-5g9l</guid>
      <description>&lt;p&gt;&lt;strong&gt;The cost of running a capable AI model fell by roughly 280 times in two years. Over the same stretch, the average company's AI bill went up, not down. Both numbers are real, both come from credible research, and the space between them is the single most useful thing an operator can understand about AI economics in 2026. It explains why "the models keep getting cheaper" and "our AI spend is out of control" are being said in the same meeting, by the same people, about the same systems.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I watch this play out in client projects every month. Someone reads that token prices collapsed, assumes their costs are about to fall off a cliff, and then opens an invoice that did the opposite. The confusion is not a billing error. It is a structural feature of how AI is now built, and once you see the mechanism you can plan around it instead of being surprised by it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Number That Should Have Lowered Your Bill
&lt;/h2&gt;

&lt;p&gt;Start with the collapse, because it is genuinely staggering. Stanford's &lt;a href="https://hai.stanford.edu/ai-index/2026-ai-index-report" rel="noopener noreferrer"&gt;2026 AI Index&lt;/a&gt; finds that GPT-3.5-level performance got about 280 times cheaper between November 2022 and October 2024, with the price falling from roughly 20 dollars per million tokens to about 7 cents. That is not a typo and it is not a one-off. Epoch AI measures a median decline near 50 times per year for equal capability. The venture firm a16z frames the same trend more conservatively at around 10 times per year, which it points out is still faster than compute fell in the PC era or bandwidth fell during the dotcom build-out.&lt;/p&gt;
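&lt;p&gt;The arithmetic behind that headline is worth doing once. This small sketch uses only the two rounded prices quoted above and counts November 2022 to October 2024 as 23 months. Because the inputs are rounded, the result lands near 280 rather than exactly on it, and the annualized rate it produces falls between the two per-year estimates just mentioned.&lt;/p&gt;

```python
start_price = 20.00  # dollars per million tokens, November 2022 (rounded figure from the text)
end_price = 0.07     # dollars per million tokens, October 2024 (rounded figure from the text)
months = 23          # November 2022 to October 2024

# How many times cheaper the same capability became over the whole window.
fold_decrease = start_price / end_price

# The constant yearly factor that would compound to the same total decline.
annualized = fold_decrease ** (12 / months)

print(f"total decline: about {fold_decrease:.0f}x")
print(f"annualized decline: about {annualized:.0f}x per year")
```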

&lt;p&gt;The frontier did the same thing in public. When Anthropic shipped Claude Opus 4.5 in November 2025, it cut the flagship price from 15 and 75 dollars per million input and output tokens to 5 and 25, a 67 percent reduction in a single release. What happened next is the part people miss. Anthropic then held that 5-and-25 price across Opus 4.6, 4.7, and 4.8 while the model kept getting better. The per-token price stopped falling and capability kept climbing, which is its own kind of price cut.&lt;/p&gt;

&lt;p&gt;The trigger for most of this was competition from below. DeepSeek R1 landed in January 2025 at 55 cents per million input tokens while scoring around 95 percent of OpenAI's o1, and the major labs responded with emergency price moves. By mid-2026 the floor is remarkable. OpenAI's GPT-5.4-nano runs at 20 cents input and 1.25 dollars output per million. DeepSeek V4 Pro, an open-weights model you can host yourself, sits near 44 cents input. Google's Gemini 3.5 Flash beats the previous generation's Pro tier on agent benchmarks at 1.50 and 9 dollars. On paper, intelligence has never been this cheap to rent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Bill Went Up Instead
&lt;/h2&gt;

&lt;p&gt;Here is the paradox stated plainly. Per-token prices fell by a factor of hundreds, and by one estimate the average enterprise AI bill still rose more than 300 percent over the same window. I treat the exact magnitude of that spend figure as indicative rather than gospel, because it comes from a secondary analysis, but the direction is confirmed everywhere and the reason is structural, not accidental.&lt;/p&gt;

&lt;p&gt;Cheaper tokens get spent, not saved. The thing you are buying changed shape. In 2023 a typical interaction was one prompt and one answer, a few thousand tokens, one model call. In 2026 the same business outcome runs through an agent that fires somewhere between 10 and 20 model calls for a single user task. It plans, it calls a tool, it reads the result, it re-plans, it checks its own work, it writes a commit message. Retrieval-augmented generation inflates the context of each of those calls by stuffing in three to five times more reference text. And the agent does not go home at night. Monitoring agents and always-on assistants bill around the clock whether anyone is watching or not.&lt;/p&gt;

&lt;p&gt;So the unit got 280 times cheaper and the number of units per job went up by more than that. This is the same pattern every efficiency gain in computing has followed. Cheaper storage did not shrink data centers, it gave us video everywhere. Cheaper bandwidth did not lower the average person's internet bill, it gave us streaming. Cheaper intelligence is not lowering AI spend, it is making agents economically possible, and agents are hungry. For anyone running a product on top of an API, that is the line that matters: a workload that cost a cent yesterday is a loop that costs fifteen cents today, and the loop is what makes the product good.&lt;/p&gt;
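&lt;p&gt;Here is that cent-to-fifteen-cents line as a back-of-the-envelope model. Everything in it is an assumption chosen to make the arithmetic visible: a flat blended price of 5 dollars per million tokens, 2,000 tokens per call, 15 calls per agent task, and a fourfold context increase from retrieval. Swap in your own numbers; the shape of the result is what matters.&lt;/p&gt;

```python
def task_cost(calls, tokens_per_call, price_per_million_tokens):
    """Dollar cost of one user task: calls x tokens per call x price per token."""
    return calls * tokens_per_call * price_per_million_tokens / 1_000_000


PRICE = 5.0  # hypothetical blended price, dollars per million tokens

# One prompt, one answer.
single_call = task_cost(calls=1, tokens_per_call=2_000, price_per_million_tokens=PRICE)

# The same outcome through an agent that makes 15 model calls.
agent_loop = task_cost(calls=15, tokens_per_call=2_000, price_per_million_tokens=PRICE)

# The same agent with retrieval inflating each call's context fourfold.
agent_with_rag = task_cost(calls=15, tokens_per_call=2_000 * 4, price_per_million_tokens=PRICE)

print(f"single call:    ${single_call:.2f}")
print(f"agent loop:     ${agent_loop:.2f}")
print(f"agent with RAG: ${agent_with_rag:.2f}")
```

&lt;p&gt;The price per token never moves in this model. The bill grows purely because the design makes more and bigger calls, which is why a price cut alone does not rescue it.&lt;/p&gt;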

&lt;h2&gt;
  
  
  The Unlimited Era Just Ended
&lt;/h2&gt;

&lt;p&gt;If you want a single event that marks the turn, it is GitHub Copilot. On the first of June 2026, GitHub moved every Copilot plan to usage-based billing. Premium request units were replaced by AI Credits priced at one cent each, metered against input, output, and cached tokens at each model's published rate. The cheaper fallback model that used to absorb overflow is gone. When your credits run out you either set a budget or you stop.&lt;/p&gt;
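&lt;p&gt;To get a feel for what token-metered credits do to a heavy agent session, here is a rough sketch. It assumes a credit is simply one cent of token cost, and it uses made-up rates, token counts, and a made-up monthly allowance. It is not GitHub's actual formula or price list, only the general mechanism described above.&lt;/p&gt;

```python
CREDITS_PER_DOLLAR = 100  # one credit is priced at one cent


def credits_for_request(input_tokens, cached_tokens, output_tokens, rates):
    """Credits used by one request, given dollar rates per million tokens."""
    weighted = (
        input_tokens * rates["input"]
        + cached_tokens * rates["cached_input"]
        + output_tokens * rates["output"]
    )
    return weighted * CREDITS_PER_DOLLAR / 1_000_000


# Hypothetical rates in dollars per million tokens, for illustration only.
RATES = {"input": 5.0, "cached_input": 0.5, "output": 25.0}

# One long agent session: large fresh context, larger cached context, modest output.
SESSION_CREDITS = credits_for_request(
    input_tokens=200_000,
    cached_tokens=800_000,
    output_tokens=40_000,
    rates=RATES,
)

MONTHLY_CREDITS = 1_000  # hypothetical plan allowance
SESSIONS_PER_MONTH = int(MONTHLY_CREDITS // SESSION_CREDITS)

print(f"credits per session: {SESSION_CREDITS:.0f}")
print(f"sessions before the allowance is gone: {SESSIONS_PER_MONTH}")
```

&lt;p&gt;With these invented numbers a handful of sessions empties the month, which is the situation the quoted sentence from GitHub describes.&lt;/p&gt;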

&lt;p&gt;The reason GitHub gave is the clearest sentence anyone has written about this whole shift. With agents and subagents in the picture, the company said, "it is now common for a handful of requests to incur costs that exceed the plan price." Read that again with your own product in mind. A flat monthly subscription assumes a roughly predictable amount of work per user. Agentic software breaks that assumption, because one motivated user pointing an agent at a hard problem can burn a month of margin in an afternoon.&lt;/p&gt;

&lt;p&gt;Everyone building on these APIs is now living in the world GitHub just formalized. Providers split pricing into short-context and long-context tiers. They charge per tool call for search and computer use. They sell priority lanes at 2.5 times the base rate and offer cached-input discounts up to 90 percent to reward architectures that reuse prompts. The flat-rate, all-you-can-eat plan was a product of an era when a call was a call. That era is closing, and pricing your own AI product as if it were still open is how you wake up subsidizing your heaviest users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Weights Caught Up, and That Changes the Math
&lt;/h2&gt;

&lt;p&gt;The second force reshaping the economics is that the cheap option got genuinely good. For most of the last three years, "open-weights" meant "almost as good, if you squint." That is no longer true at the top. On Artificial Analysis's intelligence benchmark in April 2026, the best open models scored around 54 against 60 for the strongest closed flagship, a gap of a few points rather than a generation. Nine of the thirteen models on the intelligence-versus-price frontier are open weight. Stanford's same index puts the gap between the top US and top Chinese model at 2.7 percent as of March 2026, down from 17 to 31 points in 2023.&lt;/p&gt;

&lt;p&gt;What this means in practice is that you are no longer choosing between an expensive model that works and a free one that does not. You are choosing along a curve, and most of that curve is now usable. A model like DeepSeek V4 ships with a million-token context, runs at a fraction of frontier pricing, and can be self-hosted inside your own infrastructure. The strategic question stopped being "can we afford a good model" and became "which good model fits this specific job, at this volume, under these privacy rules."&lt;/p&gt;

&lt;p&gt;That last clause matters more here than in most places. For a business in the EU handling client data, the ability to run a competent model on your own server or inside a private cloud is not just a cost decision, it is a compliance one. The &lt;a href="https://studiomeyer.io/en/blog/eigener-ki-server-kosten" rel="noopener noreferrer"&gt;cost math on a self-hosted AI server&lt;/a&gt; looks very different when the alternative is shipping regulated data to a third-party API, and the models that make it viable are now good enough that the tradeoff is real rather than theoretical.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Move Is the Right Model for Each Job
&lt;/h2&gt;

&lt;p&gt;Put the two forces together, cheaper-but-hungrier tokens and a deep bench of usable models, and the winning strategy stops being a single choice and becomes an architecture. The pattern practitioners keep converging on is the cascade, and it is simple to state. Send the high-volume, predictable 80 to 90 percent of work to a small or open or on-device model. Reserve the expensive frontier model for the hard tail that actually needs it. Done well, this captures most of the cost savings while keeping frontier reasoning available for the cases that justify it.&lt;/p&gt;

&lt;p&gt;The dividing line is not glamour, it is task shape. Classification, extraction, routing, and short summaries are exactly what small models do well now. Microsoft's Phi-4-mini matches the quality of a far larger model on structured extraction while running in 8 gigabytes of memory. Google's Gemma 4 edge variants are multimodal and run on a phone. These are not toys, they are the right tool for the 80 percent. The frontier model earns its price on multi-step reasoning, long-document synthesis, and open-ended agent work where the inputs are wide and unpredictable and 80 percent accuracy is not good enough.&lt;/p&gt;
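&lt;p&gt;A cascade can start as a few lines of routing logic. The sketch below is a minimal illustration, not a production router: the task labels, the tier names, and the idea of escalating when the small model reports low confidence are all my assumptions, and the 0.8 threshold is arbitrary.&lt;/p&gt;

```python
# Task shapes the text describes as a good fit for small models.
SMALL_MODEL_TASKS = {"classification", "extraction", "routing", "short_summary"}


def choose_model(task_kind, small_model_confidence=None, min_confidence=0.8):
    """Pick a tier: small for predictable task shapes, frontier for the hard tail."""
    if task_kind not in SMALL_MODEL_TASKS:
        return "frontier"
    # Escalate when the small model reports low confidence in its own answer.
    if small_model_confidence is not None and min_confidence > small_model_confidence:
        return "frontier"
    return "small"


# Hypothetical batch of jobs as (task kind, confidence reported by the small model).
JOBS = [
    ("classification", 0.95),
    ("extraction", 0.91),
    ("routing", 0.99),
    ("short_summary", 0.62),
    ("multi_step_reasoning", None),
]

ROUTES = [choose_model(kind, confidence) for kind, confidence in JOBS]
SMALL_SHARE = ROUTES.count("small") / len(ROUTES)

print(ROUTES)
print(f"share handled by the small tier: {SMALL_SHARE:.0%}")
```

&lt;p&gt;Logging the share that stays on the small tier gives you the one number you need to tell whether the cascade is actually carrying the bulk of the volume.&lt;/p&gt;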

&lt;p&gt;This is also why I am wary of two common reactions to the cost news. The first is "wait for prices to drop more," which misreads the paradox entirely, because your bill is driven by how many calls your design makes, not by the price of one call. The second is "just use the most expensive model for everything to be safe," which is how you turn a 2-cent task into a 20-cent one at scale for no quality gain. The discipline is matching model to job, and it is the same instinct behind treating &lt;a href="https://studiomeyer.io/en/blog/ai-model-resilience" rel="noopener noreferrer"&gt;model choice as a resilience decision&lt;/a&gt; rather than brand loyalty. The agency that picks the right model for each step, and builds metering and routing in from the start, ends up with both lower costs and a system that does not fall over when one provider changes its terms.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Actually Means
&lt;/h2&gt;

&lt;p&gt;The cost of intelligence will keep falling, and your AI bill will keep being a real line item, and both of those will stay true at the same time. That is not a contradiction to resolve, it is the operating condition to design for. The teams that internalize it will build agentic products with budget caps, cascade routing, and a clear-eyed view of which model belongs on which step. The teams that wait for the technology to get cheap enough to stop thinking about cost will keep being surprised by their invoices, because the technology already got cheap and the surprise is structural.&lt;/p&gt;

&lt;p&gt;My prediction for the back half of 2026 is that "model strategy" becomes a normal part of any serious AI build, the way "database choice" is now, and that the wrapper-tax conversation gets loud. When a customer can see that their seat of tokens costs you 2 dollars, a flat 24-dollar plan starts to look like markup, and the products that survive will be the ones that separate the value they add from the inference they pass through. The cheap-model era did not make cost irrelevant. It moved cost from a price you look up to a decision you architect, and that is a better problem to have, as long as you actually treat it as one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by Matthias Meyer of &lt;a href="https://studiomeyer.io/en" rel="noopener noreferrer"&gt;StudioMeyer&lt;/a&gt;, a web and AI agency on Mallorca building MCP servers, agent fleets and AI products for small and mid-size businesses. This article was &lt;a href="https://studiomeyer.io/en/blog/ai-cost-paradox-2026" rel="noopener noreferrer"&gt;originally published on the StudioMeyer blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>business</category>
    </item>
    <item>
      <title>Claude Design: What It Is, Where It Fits, and When to Skip It</title>
      <dc:creator>Matthias | StudioMeyer</dc:creator>
      <pubDate>Fri, 19 Jun 2026 23:28:19 +0000</pubDate>
      <link>https://dev.to/studiomeyer_io/claude-design-what-it-is-where-it-fits-and-when-to-skip-it-593d</link>
      <guid>https://dev.to/studiomeyer_io/claude-design-what-it-is-where-it-fits-and-when-to-skip-it-593d</guid>
      <description>&lt;p&gt;Claude Design is a tool from Anthropic Labs that turns a conversation into editable visual work: prototypes, slide decks, one-pagers, mockups, landing-page concepts. You describe what you want, Claude builds a first version you can see immediately, and you refine it by talking, leaving comments, dragging elements around, or moving sliders Claude builds for you. When it is ready, you send it to Canva, Adobe, Figma, PowerPoint, PDF, or straight into code.&lt;/p&gt;

&lt;p&gt;It launched on April 17, 2026, and passed a million users in the first week after its June overhaul. I run a design and AI studio on Mallorca, and within days of each update a client asked some version of the same question: should we be using this? The honest answer is yes, for specific jobs, and no as a replacement for the things you already have working. This is the guide I wish someone had handed me, written from using it on real work rather than from the launch post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Untangle three things people call "Claude design"
&lt;/h2&gt;

&lt;p&gt;Most of the confusion online comes from one word covering three different things. They are related, but using the right one for the right job is the whole game.&lt;/p&gt;

&lt;p&gt;The first is &lt;strong&gt;Claude Design, the product&lt;/strong&gt;. A standalone surface, with its own web address and a panel in the Claude desktop app, where Claude renders designs live next to your chat. It writes the underlying HTML and CSS, so what you see is real and not a flat picture. This is the thing people mean when they say "Claude Design."&lt;/p&gt;

&lt;p&gt;The second is &lt;strong&gt;the creative connectors&lt;/strong&gt;. Separate integrations that bring real design tools into any Claude conversation. The Adobe for creativity connector ships more than 50 tools from Photoshop, Illustrator, Lightroom, InDesign, Express, Premiere, and Firefly. There are connectors for Canva and for Figma too. With these, Claude can edit your photos, build an Adobe Express document, or read a Figma file, without you opening any of those apps.&lt;/p&gt;

&lt;p&gt;The third is &lt;strong&gt;Claude Code&lt;/strong&gt;, the terminal and editor tool that writes and ships actual production code. This is where a design becomes a live website. It is a different job from the first two, and the line between them matters more than you would think.&lt;/p&gt;

&lt;p&gt;When someone tells you "Claude replaced my design tool," ask which of the three they mean. Usually it is the first.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Claude Design can actually make
&lt;/h2&gt;

&lt;p&gt;The sweet spot is anything mostly visual and mostly self-contained: pitch decks and presentations exported to PowerPoint or sent to Canva, one-pagers and leave-behinds, product mockups and wireframes, landing-page concepts you want to look at before anyone writes code, email templates as clean HTML, and dashboards. Because it renders real code under the hood, it can also do things a slide tool cannot, like prototypes with motion, video, or 3D. That is useful for showing an idea, less useful when you need a finished, printable file.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works in practice
&lt;/h2&gt;

&lt;p&gt;The loop is describe, refine, export.&lt;/p&gt;

&lt;p&gt;You start from a text prompt, or you upload documents (Word, PowerPoint, Excel), or you point Claude at your live website with a capture tool so the mockup looks like your real product. Claude generates a first pass. You then refine it through normal conversation, by leaving inline comments on the design, by dragging and resizing elements on the canvas, or with sliders Claude generates for things like spacing and color.&lt;/p&gt;

&lt;p&gt;The part that earns its keep for teams is the design system. During setup, Claude reads your codebase and your existing design files and learns your colors, fonts, and components. After that, every project comes out on brand by default. The June 2026 update went further. You can import a design system from a GitHub repo or uploaded files, and Claude checks its own output against that system and corrects itself before you ever see it. Larger teams can lock a single approved system so nothing off brand gets produced.&lt;/p&gt;

&lt;p&gt;When you are done, you export. At launch that meant Canva, PDF, PowerPoint, HTML, or a shareable internal link. As of mid-June it also sends work to Adobe, Figma, Miro, Gamma, Vercel, Wix, Replit, and more, plus a one-click path into Adobe Experience Manager and Journey Optimizer for teams that publish through Adobe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to use it, and where not to
&lt;/h2&gt;

&lt;p&gt;This is the section that actually saves money. The trap is treating Claude Design as a tool that replaces everything. It does not. It is one tool in a stack, and the skill is knowing which job goes where.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Job&lt;/th&gt;
&lt;th&gt;Reach for&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Internal pitch deck or one-pager&lt;/td&gt;
&lt;td&gt;Claude Design&lt;/td&gt;
&lt;td&gt;On-brand draft in minutes, export to PPTX or Canva&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Landing-page concept before a build&lt;/td&gt;
&lt;td&gt;Claude Design, then Claude Code&lt;/td&gt;
&lt;td&gt;See it, agree on it, then build it for real&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A flyer or menu a client edits later&lt;/td&gt;
&lt;td&gt;Claude Design to Canva&lt;/td&gt;
&lt;td&gt;The client owns and edits the file afterward&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daily social posts and carousels&lt;/td&gt;
&lt;td&gt;Your existing tool&lt;/td&gt;
&lt;td&gt;If it already works, do not rip it out&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New hero or generated imagery&lt;/td&gt;
&lt;td&gt;A dedicated image model&lt;/td&gt;
&lt;td&gt;The in-chat connector edits photos, it does not generate them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Photo cleanup, background removal, vectorizing a logo&lt;/td&gt;
&lt;td&gt;Adobe connector in Claude&lt;/td&gt;
&lt;td&gt;Pro edits without opening Photoshop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The actual production website&lt;/td&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Real repo, real build, real deploy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Brand source of truth, complex app UI&lt;/td&gt;
&lt;td&gt;Figma&lt;/td&gt;
&lt;td&gt;Still the system of record for serious product design&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern: use Claude Design to start things and explore them, use your specialist tools for the work they are already good at, and use code for anything that has to ship as a real product.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does better than a plain Claude chat, and when to skip it
&lt;/h2&gt;

&lt;p&gt;If you already work with Claude in a chat or in Claude Code, this is the part worth being clear about.&lt;/p&gt;

&lt;p&gt;Claude Design beats a plain chat when you need to see and shape the thing visually. In a chat you describe a layout and get a description back, or a wall of code. In Claude Design you get a canvas you can point at, drag, and adjust without reading a single line of HTML. It is built for the moment when the output needs to be a design file someone else can open and edit, especially someone who does not code. That is the real unlock. It gives non-designers a way to produce decent visual work, and it gives designers a way to explore ten directions in the time one used to take.&lt;/p&gt;

&lt;p&gt;You should skip it and go straight to Claude Code when the destination is production. If the thing you are making is a website or an app that lives in a real codebase, a design file is a detour. We build client sites directly in code with Claude, run the tests, and deploy. There is no design-tool round trip because the code is the product.&lt;/p&gt;

&lt;p&gt;The interesting case is the handoff between the two, which Anthropic shipped in June. You can explore a page in Claude Design and then hand the whole thing to Claude Code, which picks up exactly where you left off, with no screenshot and no rebuild from scratch. For anyone who does both design and development, that seam is the actual headline. It means "let me see it first" and "now build it for real" stop being two disconnected worlds.&lt;/p&gt;

&lt;p&gt;One honest note for solo operators and very small teams. A lot of the published cost-benefit math assumes an eight-person team saving designer and developer hours. If you are one person, that math does not transfer. What transfers is the routing logic. Knowing which job belongs in which tool is worth more than any subscription you add.&lt;/p&gt;

&lt;h2&gt;
  
  
  The connectors, briefly
&lt;/h2&gt;

&lt;p&gt;Even if you never open the Claude Design product, the connectors are worth knowing about, because they work inside a normal Claude conversation.&lt;/p&gt;

&lt;p&gt;The Adobe connector is the strongest. With an Adobe account connected, Claude can build an editable Adobe Express document from a description, run real photo edits (adjust light and color, remove or blur a background, crop, vectorize a logo to clean SVG, extend a canvas), lay out documents in InDesign, and trim and clean up video. What it cannot do in this setting is generate brand-new images from a prompt, replace a background by description, or upscale. Those still need the full apps. So think of it as professional editing on tap, not an image generator.&lt;/p&gt;

&lt;p&gt;The Canva connector turns a Claude design into a fully editable Canva file, which is the cleanest way to hand something to a client or teammate who lives in Canva. The Figma integration is mostly a bridge. It reads design files and turns them into code, and it can turn built interfaces back into editable Figma frames. If your Figma seat is view-only, treat it as a one-way street into code rather than a place to design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limitations
&lt;/h2&gt;

&lt;p&gt;It is still a research preview, and it shows. The model, the editor, and the export list have all changed twice in two months, so anything you read about it has a short shelf life, including parts of this guide. Generating live designs burns more tokens than chatting, and even after June's efficiency work, heavy use eats into your plan. Anything you export to a static format like a PDF or an Express document loses motion and interactivity, because those formats flatten to a single frame. And the brand-system magic is only as good as the system you feed it, which means an afternoon of setup before the output is genuinely on brand.&lt;/p&gt;

&lt;p&gt;None of these are dealbreakers. They are the normal cost of using something new and fast-moving, and worth knowing before you build a workflow on top of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bigger picture
&lt;/h2&gt;

&lt;p&gt;Step back and the strategy is clear. Anthropic is not trying to win by being the best canvas. It is positioning Claude Design as the place where visual work begins, then connecting it to everywhere that work needs to go: Canva, Adobe, Figma, your codebase, your content system. The design system you import is the same component library Claude Code uses to build. A model you sketch in one Claude tool can flow into a deck in another and out to PowerPoint. For a small business the practical version is simple. The brand assets you make can move to wherever your team already works, and the concept you explore visually can become a real website without starting over. If you want the wider Claude picture, we mapped the whole ecosystem in our &lt;a href="https://studiomeyer.io/en/blog/claude-guide-2026" rel="noopener noreferrer"&gt;Claude in 2026 guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That is the part I find genuinely useful. Not "AI replaces designers," but the gap between an idea and a shipped thing getting a lot shorter.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is Claude Design?&lt;/strong&gt;&lt;br&gt;
A tool from Anthropic Labs that creates editable visual work, including prototypes, slides, one-pagers, and mockups, from a conversation, then exports it to tools like Canva, Adobe, Figma, PowerPoint, and PDF. It launched on April 17, 2026.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is Claude Design free?&lt;/strong&gt;&lt;br&gt;
It is included with paid Claude plans (Pro, Max, Team, and Enterprise) and uses your existing plan limits, with an option to enable extra usage. On Enterprise it is off by default until an admin turns it on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can Claude Design build a real website?&lt;/strong&gt;&lt;br&gt;
It builds the design and the underlying HTML, and it can hand off to Claude Code to turn that into a production site. For anything that has to ship in a real codebase, the code gets built in Claude Code, not exported from the design tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Design vs Figma vs Canva: which should I use?&lt;/strong&gt;&lt;br&gt;
Use Claude Design to start and explore (decks, one-pagers, landing-page concepts, email templates). Keep Figma as your brand source of truth and for complex app UI. Use Canva for fast everyday social graphics and video. They overlap, but each still has a job it does best.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can Claude generate images?&lt;/strong&gt;&lt;br&gt;
The in-chat Adobe connector edits images; it does not generate them from scratch. For brand-new generated imagery you still use a dedicated image model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need an Adobe or Canva subscription?&lt;/strong&gt;&lt;br&gt;
You can use the connectors with a connected account for higher limits and saved work. Basic use works without a paid creative subscription, but a brand kit or pro account makes the output better and on brand.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://studiomeyer.io/en/blog/claude-design-guide" rel="noopener noreferrer"&gt;studiomeyer.io&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claude</category>
      <category>anthropic</category>
      <category>ai</category>
      <category>design</category>
    </item>
    <item>
      <title>Self-Evolving AI Agents: The Optimizer Is the Easy Part</title>
      <dc:creator>Matthias | StudioMeyer</dc:creator>
      <pubDate>Fri, 19 Jun 2026 19:07:23 +0000</pubDate>
      <link>https://dev.to/studiomeyer_io/self-evolving-ai-agents-the-optimizer-is-the-easy-part-3i84</link>
      <guid>https://dev.to/studiomeyer_io/self-evolving-ai-agents-the-optimizer-is-the-easy-part-3i84</guid>
      <description>&lt;p&gt;There are two kinds of AI agent in production right now. The first one you babysit. You tweak its system prompt, watch it fail on a new kind of task, tweak it again, and the prompt slowly turns into a wall of special cases nobody wants to touch. The second kind notices the failure on its own, writes a better version of its own prompt, tests that version against real work, and keeps it only if it actually wins. The gap between those two is the whole field of self-evolving agents, and this year it stopped being a research curiosity.&lt;/p&gt;

&lt;p&gt;A self-evolving agent is just an agent wrapped in a feedback loop. The agent runs a task. Something scores the output. When a weakness shows up often enough, the system proposes a new system prompt, runs both the old and the new version on real traffic for a while, and promotes the winner. If the new version turns out worse, it rolls back to the last known good one. No human in the path, but also no leap of faith, because nothing gets promoted until it earns it.&lt;/p&gt;

&lt;p&gt;That is the idea. The interesting part is which piece is actually hard.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Optimizer Got Solved This Year
&lt;/h2&gt;

&lt;p&gt;The piece everyone writes papers about is the mutation step: given a prompt that is underperforming, produce a better one. For years the serious answer was reinforcement learning, which adjusts a model from sparse numerical rewards. It works, but it is expensive and it treats a rich failure as a single number.&lt;/p&gt;

&lt;p&gt;In 2026 the field converged on a different answer. &lt;a href="https://arxiv.org/abs/2507.19457" rel="noopener noreferrer"&gt;GEPA&lt;/a&gt;, short for Genetic-Pareto, was accepted at ICLR as an oral and it makes a blunt argument: language is a richer teacher than a scalar reward. Instead of nudging weights from a number, GEPA reads the actual trajectory of a run, the reasoning and the tool calls and the output, then reflects on it in plain language to diagnose what went wrong and writes the smallest edit that fixes it. It keeps a Pareto frontier of candidates that each win on different cases and combines their strengths.&lt;/p&gt;

&lt;p&gt;The numbers are the reason people paid attention. GEPA beats GRPO, a strong reinforcement learning method, by about 6 percent on average and by as much as 20 percent, while using up to 35 times fewer rollouts. It also beats MIPROv2, the previous prompt-optimization workhorse, by more than 10 percent. Fewer expensive runs, better results, and no reinforcement learning machinery to stand up. That combination is why GEPA spread fast and why it now ships inside DSPy, the most popular optimization framework.&lt;/p&gt;

&lt;p&gt;So the optimizer is, for practical purposes, solved. Which is exactly why it is the easy part.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hard Part Is Everything Around It
&lt;/h2&gt;

&lt;p&gt;Read the GEPA work closely and you notice it optimizes offline. It takes a training set, runs rollouts against it, reflects, and hands you a better prompt. What it does not do is tell you whether that prompt is safe to put in front of real users, watch it on live traffic, or undo it when it quietly regresses next Tuesday. Those are not flaws in GEPA. They are simply a different job.&lt;/p&gt;

&lt;p&gt;The team at Decagon &lt;a href="https://decagon.ai/blog/optimizing-gepa-for-production" rel="noopener noreferrer"&gt;wrote up what it actually took&lt;/a&gt; to run GEPA on a production classifier, and the write-up is more useful than the paper for anyone shipping. Three findings stand out. The reflection model has to be a frontier model. They found that smaller models "completely fail at prompt optimization," with GPT-4o-mini producing no change at all, because, as they put it, prompt optimization is reasoning about reasoning. More data is not better. Their sweet spot was 20 to 100 examples, and pushing to 500 made the prompt balloon while performance dropped, overfitting to edge cases instead of learning the general rule. And the default implementation does not constrain prompt length, so they had to build that themselves before a runaway prompt ate their context window.&lt;/p&gt;

&lt;p&gt;Then, only after a candidate cleared offline thresholds, they ran it through a controlled A/B rollout with real customers, increasing traffic to the new version gradually. That last sentence is the whole point of this article. The optimizer is one component. Around it sits an evaluation harness, a gate that decides whether a candidate is allowed to ship, a rollback path for when it is not, length and safety constraints on the mutation, and the plumbing to keep all of this running online instead of as a one-off batch job. That surrounding layer is where the reliability lives, and it is almost never the thing that gets a paper.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pieces of a Self-Evolving Loop
&lt;/h2&gt;

&lt;p&gt;It helps to name the parts, because a production loop is really an assembly of small, boring jobs that each do one thing.&lt;/p&gt;

&lt;p&gt;A scorer, or critic, turns an output into a number. A single LLM grading another LLM is biased toward its own style, so the more robust pattern is several critics with different criteria, or even different model providers, taking a median. The score is only as trustworthy as the panel that produced it.&lt;/p&gt;
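
&lt;p&gt;As a rough illustration of the median idea, here is a minimal TypeScript sketch. The function name &lt;code&gt;medianScore&lt;/code&gt; and the 0-to-1 score scale are assumptions made for the example, not the API of any particular library. What it shows is that one critic with an outlying opinion cannot drag the panel score the way it would with a mean.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical helper: collapse the scores of several critics into one number.
function medianScore(scores: number[]): number {
  if (scores.length === 0) {
    throw new Error("need at least one critic score");
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...scores].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Even panel: average the two middle scores. Odd panel: take the middle one.
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Three critics, one of them far too generous.
const panel = medianScore([0.62, 0.58, 0.97]); // 0.62
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;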

&lt;p&gt;A pattern detector watches scores over time and decides when a real weakness exists, as opposed to one bad run. It is the difference between reacting to noise and reacting to a trend.&lt;/p&gt;

&lt;p&gt;The optimizer is the GEPA-style reflector described above. It is the part that writes the new prompt, and it is the part everyone fixates on.&lt;/p&gt;

&lt;p&gt;A safety gate is the adult in the room. Before a new prompt is allowed to take over, the gate runs it head to head against the incumbent, checks that the improvement is real and not a coin flip, and refuses to promote a version that regresses past a threshold. Pair it with automatic rollback and a record of the last known good prompt, and a bad mutation costs you a few runs instead of a weekend.&lt;/p&gt;
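
&lt;p&gt;To make the gate concrete, here is one way it could be sketched in TypeScript. Everything in it is an assumption for illustration: the names &lt;code&gt;gate&lt;/code&gt; and &lt;code&gt;GateConfig&lt;/code&gt;, the three thresholds, and the choice of a simple standard-error check as the "not a coin flip" test. It is not the gate from any specific library, and a real one would likely use a more careful statistical test.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type Verdict = "promote" | "reject" | "keep-testing";

// Hypothetical thresholds; the right values depend on the workload.
interface GateConfig {
  minSamples: number;    // runs required per arm before any verdict (at least 2)
  maxRegression: number; // reject once the candidate mean trails by more than this
  minZ: number;          // promote only when the lead exceeds this many standard errors
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Sample variance, which is why each arm needs at least two runs.
function variance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
}

function gate(incumbent: number[], candidate: number[], cfg: GateConfig): Verdict {
  const enough = Math.min(incumbent.length, candidate.length) >= cfg.minSamples;
  if (!enough) return "keep-testing";

  const diff = mean(candidate) - mean(incumbent);
  // Regression past the threshold: stop the experiment and keep the incumbent.
  if (-diff > cfg.maxRegression) return "reject";

  const stderr = Math.sqrt(
    variance(incumbent) / incumbent.length + variance(candidate) / candidate.length
  );
  // With zero spread in both arms, any positive lead counts as decisive.
  const z = stderr > 0 ? diff / stderr : (diff > 0 ? Infinity : 0);
  return z >= cfg.minZ ? "promote" : "keep-testing";
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A "reject" verdict is where the rollback path takes over: the last known good prompt stays live and the candidate is recorded as a failed experiment.&lt;/p&gt;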

&lt;p&gt;An experiment tracker remembers every run, every score, and every prompt version, so the loop has a memory and so you can audit why a given prompt is live. Without it you are evolving blind.&lt;/p&gt;

&lt;p&gt;None of these is glamorous. All of them are load-bearing. Strip the gate and the rollback out and you do not have a self-evolving agent, you have an agent that mutates its own prompt with no seatbelt, which is a worse agent than the one you started with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Has Been a Python-Only Story
&lt;/h2&gt;

&lt;p&gt;Here is the gap that started this for me. Every serious prompt optimizer in 2026 is written in Python. DSPy, GEPA's own reference implementation, TextGrad, AdalFlow, Microsoft's PromptWizard. If your agents run in a Python data-science stack, you are spoiled for choice. If they run in TypeScript, which is where an enormous share of real production agents actually live, there has been nothing. Not a thin port, nothing.&lt;/p&gt;

&lt;p&gt;That is the gap &lt;a href="https://github.com/studiomeyer-io/darwin-agents" rel="noopener noreferrer"&gt;darwin-agents&lt;/a&gt; exists to fill. It is an open-source TypeScript library, MIT licensed, that gives an agent the whole loop rather than just the optimizer: multi-model critics, A/B testing, the safety gate, automatic rollback, and experiment tracking, with the optimizer as one swappable piece inside it. The design bet is the same as this article. The optimizer is the part you can borrow from research. The production layer is the part you have to build, so build that well and make the optimizer pluggable.&lt;/p&gt;

&lt;p&gt;Its latest release closes the obvious loop. Until now the library shipped a GEPA-style reflective optimizer as something you could call yourself, and a separate safety-gated evolution loop, but the two were not wired together. The loop still used a simpler optimizer. The new version connects them, so the reflective optimizer now runs inside the production gate instead of as an offline script. As far as I can find, that specific combination, a GEPA-style optimizer evolving prompts live behind a safety gate, in TypeScript, does not exist anywhere else yet. It is opt-in, so existing agents behave exactly as before until you turn it on, and it is still alpha, so treat it like alpha.&lt;/p&gt;

&lt;p&gt;If you want to see the surrounding ideas applied to a fleet rather than a single agent, the same logic shows up in &lt;a href="https://studiomeyer.io/en/blog/automl-for-agent-fleets-without-the-vendor-bill" rel="noopener noreferrer"&gt;tuning a whole agent fleet without a vendor bill&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  When You Actually Want This
&lt;/h2&gt;

&lt;p&gt;Self-evolution earns its complexity when three things are true at once. You have enough traffic that an A/B test can reach a verdict in reasonable time, because a loop that never gathers enough data to decide is just overhead. The task has a measurable notion of good, because a critic needs something to score. And the cost of a wrong mutation is recoverable, which is exactly what the gate and rollback guarantee.&lt;/p&gt;

&lt;p&gt;When those are not true, a self-evolving agent is the wrong tool, and a human tweaking a prompt now and then is genuinely better. Honesty about that boundary is part of using the technique well. The failure mode of this whole field is a team that turns on automatic evolution for an agent that runs ten times a week against a fuzzy goal, then wonders why the prompt drifts into nonsense. The loop is only as good as the signal feeding it.&lt;/p&gt;

&lt;p&gt;The optimizer race is mostly over and GEPA won it for now. The next two years of real work are in the layer nobody is racing on: the gate, the rollback, the evaluation, and the unglamorous job of running all of it safely while users are watching. That is the part that decides whether a self-evolving agent is a liability or the most reliable thing in your stack.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>machinelearning</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Your AI Model Can Vanish Overnight. Build For That.</title>
      <dc:creator>Matthias | StudioMeyer</dc:creator>
      <pubDate>Mon, 15 Jun 2026 16:53:37 +0000</pubDate>
      <link>https://dev.to/studiomeyer_io/your-ai-model-can-vanish-overnight-build-for-that-22pj</link>
      <guid>https://dev.to/studiomeyer_io/your-ai-model-can-vanish-overnight-build-for-that-22pj</guid>
      <description>&lt;p&gt;Last night the model I was working in stopped existing. Not slowed down, not rate-limited. I asked the tool to do something routine and it answered that the model "may not exist, or you may not have access to it." A few minutes later the news caught up: Anthropic had suspended Claude Fable 5 and Mythos 5 worldwide, the same evening, on a directive from the US government.&lt;/p&gt;

&lt;p&gt;The work did not stop. I switched to Claude Opus 4.8 and kept going, because none of what mattered lived inside Fable. It lived in a memory layer and a git history that any capable model can pick up. That gap, between "the model vanished" and "the work paused," is the entire subject of this post. For most teams running on a single AI model today, that gap is zero. The model goes, the work goes with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Happened
&lt;/h2&gt;

&lt;p&gt;On June 12, 2026, at 5:21pm Eastern, Anthropic received an &lt;a href="https://www.anthropic.com/news/fable-mythos-access" rel="noopener noreferrer"&gt;export-control directive&lt;/a&gt; instructing it to suspend access to Fable 5 and Mythos 5 for "any foreign national, whether inside or outside the United States." Because that scope is impossible to enforce selectively, including against the company's own foreign employees, Anthropic disabled both models for everyone. The stated justification was national security. By Anthropic's own account, the letter "did not provide specific details," and the concern traces to what the company calls a "narrow potential jailbreak" involving asking the model to read a codebase and identify software flaws.&lt;/p&gt;

&lt;p&gt;Anthropic pushed back in public, which is unusual. The company wrote that it disagrees "that the finding of a narrow potential jailbreak should be cause for recalling a commercial model deployed to hundreds of millions of people," and warned that the same standard "would essentially halt all new model deployments for all frontier model providers." It also said all other Claude models, Opus 4.8 included, are unaffected, and that it is "working to restore access as soon as possible."&lt;/p&gt;

&lt;p&gt;I read that last line as a signal that Fable comes back. The dispute looks narrow and the company is fighting it openly. But notice that the return date is not Anthropic's to set. That is the part worth sitting with. The model you build on can now be switched off by a third party with no notice and no timeline, and the vendor agrees with you that it is unreasonable and still cannot do anything about it tonight.&lt;/p&gt;

&lt;h2&gt;
  
  
  This Is Not a One-Off
&lt;/h2&gt;

&lt;p&gt;It is tempting to file a government directive under freak event. The shutdown was unusual. The disappearance was not.&lt;/p&gt;

&lt;p&gt;Models are retired on a schedule now. OpenAI pulled GPT-4o on April 3, 2026, a retirement that affected roughly 800,000 weekly users, with the Assistants API following in August. Anthropic deprecated Claude 3.7 Sonnet in November 2025 and shut it down on May 11, 2026. Claude 3 Haiku is on the same path for August. Across the industry the support window for a given model has compressed from eighteen or twenty-four months down to somewhere between six and twelve. Late last year a single quota change cut the limits of some Google API users by around 80 percent, turning working production systems into "resource exhausted" error loops overnight.&lt;/p&gt;

&lt;p&gt;So a model leaving your stack is not the exception. It is the default, on a clock you do not control, and the Fable case just added a faster and less predictable way for it to happen. People have started calling this vendor lock-out, to separate it from the slow lock-in we already knew about. Lock-in is the cost of leaving. Lock-out is when leaving gets decided for you.&lt;/p&gt;

&lt;p&gt;If your product, your internal tooling, or your client work assumes a specific model name will answer when you call it, you have a single point of failure that the last six months have repeatedly proven is not reliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resilience Is Architecture, Not Model Choice
&lt;/h2&gt;

&lt;p&gt;The instinct after a shutdown is to ask which model is safest to bet on. That is the wrong question. There is no safe single bet, because the risk is not in the model, it is in the dependency. The teams that shrugged off last night were not the ones who picked correctly. They were the ones who built so that picking did not matter much.&lt;/p&gt;

&lt;p&gt;Three things separate a stack that survives a model vanishing from one that goes dark with it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An abstraction layer.&lt;/strong&gt; Your application logic should talk to a thin internal interface, not directly to one vendor's SDK. When a model disappears you change a configuration value, not your codebase. Teams that built this from the start report adding or switching a provider with a fraction of the migration effort of those wired directly into one API. This is unglamorous plumbing and it is the single highest-leverage decision you will make.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A portable memory layer.&lt;/strong&gt; This is the one that saved me last night. The context that makes an AI assistant useful, what the project is, what was decided last week, what the customer prefers, has to live outside the model, in a store that any model can read. If your accumulated context lives only in a vendor's chat history or a proprietary fine-tune, then losing the model means losing the institutional memory with it. Keep state in something portable and the model becomes a swappable engine rather than the vault.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A tested fallback.&lt;/strong&gt; A second model you have actually run your real workload against, not one you assume will work. There is a large difference between "we could switch" and "we have switched." The first is a hope. The second is a runbook. The fallback does not need to be as strong as your primary, it needs to keep the lights on while you sort out the primary.&lt;/p&gt;
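
&lt;p&gt;The abstraction layer and the fallback can be sketched together in a few lines of TypeScript. This is a minimal sketch under stated assumptions: &lt;code&gt;ModelClient&lt;/code&gt; and &lt;code&gt;completeWithFallback&lt;/code&gt; are hypothetical names, and the calls are kept synchronous for brevity where a real vendor SDK call would be asynchronous. Application code depends only on the interface, and the order of the list is the configuration value you change when a model disappears.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical internal interface: the only thing application code talks to.
interface ModelClient {
  name: string;
  complete: (prompt: string) => string;
}

interface Completion {
  model: string; // which client actually answered
  text: string;
}

// Try each client in configured order and return the first answer.
function completeWithFallback(clients: ModelClient[], prompt: string): Completion {
  const errors: string[] = [];
  for (const client of clients) {
    try {
      return { model: client.name, text: client.complete(prompt) };
    } catch (err) {
      // Record the failure and move on to the next configured model.
      errors.push(`${client.name}: ${String(err)}`);
    }
  }
  throw new Error(`all models failed: ${errors.join("; ")}`);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The sketch only helps if the second entry in the list has been run against the real workload beforehand. The code makes the switch cheap; it does not make the fallback good.&lt;/p&gt;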

&lt;p&gt;None of this is exotic. It is the same discipline that any business eventually learns about payment processors, hosting providers, and suppliers. You do not run a restaurant with one vegetable wholesaler who can stop answering the phone with no notice. AI has felt different because the tools are new and the lock-in forms invisibly, over twelve to eighteen months, before anyone notices it happened. The Fable shutdown just made the invisible visible for one night.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means If You Run a Small Business
&lt;/h2&gt;

&lt;p&gt;The enterprises will be fine. They have procurement teams and secondary contracts and the budget to run two providers in parallel. The exposure sits with smaller operators: the agency that wired a client's whole support flow to one model, the founder whose product is a wrapper around a single API, and the consultant whose entire delivery depends on one subscription staying live.&lt;/p&gt;

&lt;p&gt;You do not need an enterprise budget to be resilient. You need three habits. Keep your prompts and logic behind an interface you control. Keep your data and context in a format you own and can export today. And know, concretely, what you would do in the hour after your primary model goes away, because at some point this year you will find out whether you knew or only assumed.&lt;/p&gt;

&lt;p&gt;We build this way for our own systems and for clients, not because we predicted a government directive, but because the deprecation calendar alone made it obvious. Last night turned a design principle into a live test, and the test passed for a boring reason: nothing important was trapped inside the model that left.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Part That Stays True Even After Fable Returns
&lt;/h2&gt;

&lt;p&gt;Fable 5 will most likely be back, possibly before this post is a week old. When it is, the temptation will be to treat last night as a strange interruption that resolved itself and move on. That would be the expensive lesson to skip.&lt;/p&gt;

&lt;p&gt;The specific cause was unusual. The shape of it was not, and the shape is what repeats. A capability your work depends on can be removed by a decision you are not part of, on a timeline you cannot see, by a vendor who may even agree with you and still be unable to help in the moment. That is now a permanent feature of building on frontier AI, not a glitch in it.&lt;/p&gt;

&lt;p&gt;The right response is not to distrust any one provider. It is to stop treating any single model as infrastructure you can lean your weight on. Treat models as what they have become, fast-moving, powerful, and temporary, and build the durable part yourself, in the layer underneath them that you actually own. Do that and the next time a model vanishes, it costs you an hour and a slightly annoyed afternoon. Skip it, and it costs you the part of your business you assumed would always answer.&lt;/p&gt;

&lt;p&gt;So here is the test worth running this week, before the news cycle moves on. Could you switch your primary model tonight, with no warning, and lose nothing but a little time? If the answer is yes, last night was someone else's emergency. If the answer is no, you just learned where the work is.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>anthropic</category>
      <category>programming</category>
    </item>
    <item>
      <title>Claude Fable 5 Is Two Models Wearing One Name</title>
      <dc:creator>Matthias | StudioMeyer</dc:creator>
      <pubDate>Tue, 09 Jun 2026 21:43:02 +0000</pubDate>
      <link>https://dev.to/studiomeyer_io/claude-fable-5-is-two-models-wearing-one-name-2jdc</link>
      <guid>https://dev.to/studiomeyer_io/claude-fable-5-is-two-models-wearing-one-name-2jdc</guid>
      <description>&lt;p&gt;On June 9, 2026, Anthropic shipped the most capable model it has ever released to the public. The most interesting thing about it is the part that sometimes refuses to talk to you.&lt;/p&gt;

&lt;p&gt;Claude Fable 5 is the first model from what Anthropic calls its Mythos class, a tier that now sits above Opus. It launched as a pair. Fable 5 is the public version. Claude Mythos 5 is the same underlying model with its guardrails loosened, and it is not for sale to most of us. It goes only to vetted cyberdefenders and infrastructure providers through a program called Project Glasswing, in collaboration with the US government. Two names, one brain. The thing that separates them is a set of classifiers.&lt;/p&gt;

&lt;p&gt;That detail is the whole story, and almost every launch-day write-up buried it under the benchmark chart. So let me start there instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Model, Two Names, One Classifier in Between
&lt;/h2&gt;

&lt;p&gt;Fable 5 ships with three classifiers running alongside it. They watch for requests about offensive cybersecurity, about biology and chemistry that edge toward weapons, and about distillation, which is using the model to train a competitor. When a classifier fires, Fable 5 does not answer. The request gets handed to Claude Opus 4.8, the model that was the top of the public stack until that morning, and Opus answers in Fable's place.&lt;/p&gt;

&lt;p&gt;For anyone building on the API, this is not an abstract safety story. It is a response shape you have to handle. A refused request comes back as &lt;code&gt;stop_reason: "refusal"&lt;/code&gt; with a normal HTTP 200, not an error, and it tells you which classifier tripped. You can have the API retry on another model with a &lt;code&gt;fallbacks&lt;/code&gt; parameter, or do it client side with the SDK middleware. You are not billed for a request that is refused before it generates output.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stop_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"refusal"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stop_sequence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
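
&lt;p&gt;Because the refusal arrives as a successful response, an exception handler will never see it, so the check has to be explicit. Here is a minimal client-side sketch in TypeScript. The names &lt;code&gt;Send&lt;/code&gt; and &lt;code&gt;sendWithRefusalFallback&lt;/code&gt; are hypothetical, the &lt;code&gt;send&lt;/code&gt; function stands in for whatever performs the real API call, and it is kept synchronous for brevity. It is not the SDK middleware mentioned above, only the same idea written by hand.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Only the fields from the response shape shown above.
interface MessageResponse {
  stop_reason: string | null;
  content: unknown[];
}

// Hypothetical stand-in for the function that calls the API for one model.
type Send = (model: string) => MessageResponse;

interface Routed {
  model: string; // the model whose response is being returned
  response: MessageResponse;
}

function sendWithRefusalFallback(send: Send, primary: string, fallback: string): Routed {
  const first = send(primary);
  // A refusal is an HTTP 200 with a stop_reason, so compare the field directly.
  if (first.stop_reason !== "refusal") {
    return { model: primary, response: first };
  }
  return { model: fallback, response: send(fallback) };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;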



&lt;p&gt;Anthropic says this is rare. Its early numbers put at least 95 percent of Fable sessions running entirely on Fable's own answers. I believe that for general work. But "rare on average" and "rare for your workload" are different claims. If you build security tooling, parse exploit write-ups, or do biochemistry, you live closer to the classifier's tripwire than the average user, and your effective experience is a quieter, cheaper model with a more expensive bill. Worth knowing before you point a production pipeline at it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benchmark Lead Is Real and Narrower Than It Looks
&lt;/h2&gt;

&lt;p&gt;The headline number is genuine. On SWE-bench Pro, the hard agentic coding benchmark, Fable 5 scores 80.3 percent. Opus 4.8 sits at 69.2, GPT-5.5 at 58.6, and Gemini 3.1 Pro at 54.2. That is an eleven-point lead over Anthropic's own previous best and more than twenty over the strongest general model from OpenAI. On Cognition's FrontierCode Diamond it roughly doubles Opus. These are not rounding errors. For long, multi-step coding work, this is the widest gap between frontier models I have seen in a single generation.&lt;/p&gt;

&lt;p&gt;Then look at the second number Anthropic published and almost nobody quoted. On SWE-bench Verified, Fable 5 scores 95.0 and Mythos 5 scores 95.5. Same model, half a point apart. The gap is not capability. It is Fable's safety fallback occasionally kicking a coding task over to Opus. That half point is the price of the guardrails, measured.&lt;/p&gt;

&lt;p&gt;So the lead is real, but it is concentrated. Agentic coding, tool use, long-context reasoning, finance, vision. Anthropic reports the first score above 90 percent on Hex's analytics suite and the top mark on Hebbia's finance benchmark. As a vendor proof point it cites Stripe running Fable 5 across a 50-million-line Ruby codebase and finishing in a day a migration that it estimated would have taken a team more than two months by hand. Impressive, and also exactly the kind of single-customer number that should make you want to run your own test before you believe it about your codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Costs, and the June 22 Catch
&lt;/h2&gt;

&lt;p&gt;Fable 5 costs 10 dollars per million input tokens and 50 per million output. That is exactly double Opus 4.8, which is 5 and 25. It is also less than half what the restricted Mythos Preview cost earlier in the year, so on its own terms the price came down. It carries a 1M token context window and up to 128k output tokens, and it is a Covered Model, which means a 30-day data retention requirement and no zero-retention option. If your contract assumes zero retention, this model does not fit it.&lt;/p&gt;
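
&lt;p&gt;The arithmetic is easy to check for your own traffic. The sketch below uses the per-million prices quoted above; the request size (200,000 input tokens and 20,000 output tokens) and the names &lt;code&gt;Price&lt;/code&gt; and &lt;code&gt;costUsd&lt;/code&gt; are made up for the example.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// US dollars per million tokens.
interface Price {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Prices as quoted in the paragraph above.
const FABLE_5: Price = { inputPerMillion: 10, outputPerMillion: 50 };
const OPUS_4_8: Price = { inputPerMillion: 5, outputPerMillion: 25 };

function costUsd(inputTokens: number, outputTokens: number, price: Price): number {
  const total = inputTokens * price.inputPerMillion + outputTokens * price.outputPerMillion;
  return total / 1_000_000;
}

// Hypothetical request: 200,000 input tokens and 20,000 output tokens.
const fable = costUsd(200_000, 20_000, FABLE_5); // 3 dollars
const opus = costUsd(200_000, 20_000, OPUS_4_8); // 1.5 dollars
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Since both prices double together, the ratio stays at two whatever the mix of input and output tokens. What changes with volume is only how quickly the difference adds up.&lt;/p&gt;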

&lt;p&gt;There is a calendar catch that matters more than the sticker price. From launch through June 22, Fable 5 is included at no extra cost on the Pro, Max, Team, and Enterprise plans. From June 23, using it on those plans draws from usage credits. Anthropic frames this as a capacity measure and says it intends to fold Fable back into the flat subscription later, with no date attached. So the free fortnight is a real window to test, and the steady-state cost is a credit meter. Plan accordingly rather than wiring your daily driver to it and getting surprised in two weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Safeguard Is the Product Decision
&lt;/h2&gt;

&lt;p&gt;Here is the part I keep coming back to. The classifier is not a footnote on a powerful model. It is the product. Anthropic built one model and shipped two postures of it, and the entire public release exists because the safeguards let them feel comfortable handing this much capability to everyone. The benchmark chart is the marketing. The refusal-and-fallback machinery is the actual launch.&lt;/p&gt;

&lt;p&gt;That framing also explains the timing that several outlets pointed at. Five days before this release, on June 4, Anthropic published a piece called "When AI Builds Itself," warning that models may be approaching recursive self-improvement and floating a coordinated mechanism for the industry to slow or pause frontier development. Reuters, Scientific American, and others covered it. Then on June 9 the same company shipped the most powerful model the public has ever been able to touch. Critics read that as strategy, a way to invite regulation onto a track Anthropic is winning. Maybe. The more grounded reading is that the two events are the same statement. The slowdown essay and the classifier-gated release are both Anthropic saying the capability is now past the point where you ship it raw. You can find that convincing or self-serving. Either way, the safeguard is no longer a wrapper on the product. It is the shape of the product.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Model Was Rarely Your Bottleneck
&lt;/h2&gt;

&lt;p&gt;Now the unpopular part. For most of the systems people actually run, swapping in Fable 5 will change less than the benchmark gap suggests.&lt;/p&gt;

&lt;p&gt;A single-blind study made the rounds earlier this year where the model behind an assistant was swapped without users noticing, and the measured difference in outcomes was not statistically significant. That matches what we see building real systems. Once you are past a capable baseline, and Opus 4.8 and Sonnet 4.6 are well past it, the thing that decides whether your assistant is good is rarely the model tier. It is whether it has the right context in front of it. What it remembers across sessions. How well it retrieves the right document. Whether the tools it calls return clean data. The &lt;a href="https://studiomeyer.io/en/services/memory" rel="noopener noreferrer"&gt;AI memory systems we build&lt;/a&gt; move the needle on those systems far more than a model upgrade does, because the model was answering the wrong question well, not the right question badly.&lt;/p&gt;

&lt;p&gt;This is not an argument against Fable 5. It is an argument about where to spend. If your agent forgets the customer between turns, a model that is eleven points better at SWE-bench will forget them eleven points more eloquently. Fix the context first. Then, on the genuinely hard reasoning tasks where you have already done that work, reach for the stronger model and feel the difference. I wrote &lt;a href="https://studiomeyer.io/en/blog/claude-guide-2026" rel="noopener noreferrer"&gt;a longer field guide to the whole Claude lineup&lt;/a&gt; if you want the map of which model fits which job.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Reach for Fable 5, Opus 4.8, or Sonnet
&lt;/h2&gt;

&lt;p&gt;The honest decision tree is short.&lt;/p&gt;

&lt;p&gt;Reach for Fable 5 on the hard agentic work where its lead is real and the task is worth double the token bill. Large refactors across a big codebase, long autonomous tool chains, dense document and financial reasoning, anything where a marginally better answer compounds over many steps. Test it free before June 23, then treat it as the tool you pull out for the hard cases, not the one that runs every request.&lt;/p&gt;

&lt;p&gt;Stay on Opus 4.8 as the everyday workhorse for agentic and coding work. It is half the price, it is what Fable falls back to anyway, and on most tasks the difference is small. If your work is security-flavored, Opus is also the more predictable choice, because Fable will route you there mid-task regardless and charge you for the detour.&lt;/p&gt;

&lt;p&gt;Stay on Sonnet 4.6 for the high-volume, latency-sensitive, or classification-shaped work where frontier reasoning is wasted. Most of the calls inside a well-built system are this kind. Routing, summarizing, extracting, ranking. Paying frontier prices for them is a common and expensive habit.&lt;/p&gt;

&lt;p&gt;Mythos 5, for almost everyone reading this, is not a choice. It is gated to Glasswing partners. The realistic move there is to watch the trusted-access program rather than wait for it.&lt;/p&gt;

&lt;p&gt;The launch that matters here is not that Anthropic crossed another benchmark. It is that the frontier now ships with a referee standing between you and the model, deciding in real time which Claude you are allowed to talk to. That is a new default, and it will be the normal shape of every powerful model from here. The teams that win the next year will not be the ones who switched to the highest number on the chart. They will be the ones who already fixed everything the model was never going to fix for them.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>anthropic</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
