<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tom Tokita</title>
    <description>The latest articles on DEV Community by Tom Tokita (@tomtokita).</description>
    <link>https://dev.to/tomtokita</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3840091%2F5ac3193c-0dc1-496a-b6d2-a7eb6e1556e7.jpg</url>
      <title>DEV Community: Tom Tokita</title>
      <link>https://dev.to/tomtokita</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tomtokita"/>
    <language>en</language>
    <item>
      <title>Most AI Tools Are Just LLM Wrappers. Here's What Actually Matters.</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Tue, 19 May 2026 00:36:13 +0000</pubDate>
      <link>https://dev.to/tomtokita/most-ai-tools-are-just-llm-wrappers-heres-what-actually-matters-10mg</link>
      <guid>https://dev.to/tomtokita/most-ai-tools-are-just-llm-wrappers-heres-what-actually-matters-10mg</guid>
      <description>&lt;p&gt;&lt;strong&gt;In 2025, AI wrapper startups raised over $10 billion.&lt;/strong&gt; The product? Take an LLM API. Add a text box. Maybe some prompt templates. Charge $30/month. Call it "AI-powered."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not mad at the hustle.&lt;/strong&gt; But if your entire product disappears the moment ChatGPT adds your feature for free, you don't have a product. You have a timing play.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Wrapper Test
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;One question tells you everything:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Can you replicate the output by pasting the same input into ChatGPT or Claude?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If yes:&lt;/strong&gt; it's a wrapper. You're paying for UI and convenience, not intelligence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If no:&lt;/strong&gt; because it's pulling from multiple data sources, applying domain logic, or integrating with real systems, it might be something real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most fail the test.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Thin vs. Thick
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Not all wrappers are equal.&lt;/strong&gt; The market is splitting fast:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Thin Wrapper&lt;/th&gt;
&lt;th&gt;Thick Wrapper&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it does&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;UI + API call + system prompt&lt;/td&gt;
&lt;td&gt;Real integrations, domain logic, data pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Defensibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None. One platform update kills it&lt;/td&gt;
&lt;td&gt;High. Value is in the connectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"AI email writer" (GPT call with a system prompt)&lt;/td&gt;
&lt;td&gt;Cursor (reads your codebase, understands project context)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Survival odds&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Decent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The graveyard of 2025–2026&lt;/strong&gt; is littered with thin wrappers that a platform update made irrelevant overnight.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Matters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Strip away the wrapper.&lt;/strong&gt; Where does the real value live?&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Connectors
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The ability to talk to real systems:&lt;/strong&gt; Salesforce, Jira, databases, email, file storage, APIs. This is where 80% of the actual work lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Getting an AI to generate text is trivial.&lt;/strong&gt; Getting it to read your CRM records, cross-reference tickets, update a database, and notify Slack. That's integration work. That's hard. That's valuable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most wrappers don't touch this.&lt;/strong&gt; They live in the text-in, text-out world.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Captured Domain Expertise
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;An AI that's been learning your industry's quirks for months&lt;/strong&gt; is worth more than a fresh GPT-5 instance with a clever prompt.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Fresh AI + Great Prompt&lt;/th&gt;
&lt;th&gt;AI + 6 Months of Learnings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Platform quirks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Discovers them painfully&lt;/td&gt;
&lt;td&gt;Already knows them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Common mistakes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Makes them all&lt;/td&gt;
&lt;td&gt;Has guardrails for each&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Your terminology&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Constant correction needed&lt;/td&gt;
&lt;td&gt;Uses it naturally&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Edge cases&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Surprised every time&lt;/td&gt;
&lt;td&gt;Documented patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The knowledge compounds.&lt;/strong&gt; Every session, every bug fix, every "oh, that's how this actually works" gets captured and fed back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No wrapper captures this.&lt;/strong&gt; They start fresh every time.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Methodology
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How you approach problems with AI&lt;/strong&gt; matters more than which model you use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The wrapper approach:&lt;/strong&gt; open tool → type request → get output → hope it's right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practitioner approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Small test:&lt;/strong&gt; constrained input, see what happens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate:&lt;/strong&gt; what worked? What broke?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capture:&lt;/strong&gt; document the learning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjust:&lt;/strong&gt; update the approach&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Repeat&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The tool is 10%. The methodology is 90%.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The "Just Build It" Case
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Here's the uncomfortable truth.&lt;/strong&gt; Building your own system (even ugly, even scrappy) gives you something no wrapper provides: &lt;strong&gt;understanding.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You know why it works.&lt;/strong&gt; Why it breaks. How to fix it. When the model changes (and it will), you swap the engine. The connectors, the learnings, the guardrails. Those persist. They're yours.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost at scale:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Wrapper Stack&lt;/th&gt;
&lt;th&gt;Custom (Direct API)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Month 1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$150/seat, fast setup&lt;/td&gt;
&lt;td&gt;$500 dev time, slower start&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Month 6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$150/seat, same capabilities&lt;/td&gt;
&lt;td&gt;$50/month API, growing capabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Year 1 (5 seats)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$9,000&lt;/td&gt;
&lt;td&gt;~$3,100 + compound knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Custom costs less AND gets smarter.&lt;/strong&gt; The wrapper costs the same and stays the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Philippines advantage:&lt;/strong&gt; smaller teams with direct API access can outperform larger orgs paying for wrapper stacks. When you can't afford $150/seat for 6 different AI tools, you build one system that does what you need. That constraint produces better architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Wrappers DO Make Sense
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fair is fair:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed to market:&lt;/strong&gt; need something running tomorrow without engineering capacity? Wrapper gets you there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thick wrappers with real integrations:&lt;/strong&gt; Cursor, Harvey, Perplexity add genuine value beyond the API call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exploration phase:&lt;/strong&gt; trying 5 wrappers to understand the capability space before building your own is smart R&amp;amp;D.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The key question:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are you buying a tool or renting a feature?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;If the value prop is "we make it easy to talk to an LLM,"&lt;/strong&gt; that feature is getting commoditized in real time. Every model provider is making their native interface better, faster, cheaper.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Build Instead
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Ready to go beyond wrappers?&lt;/strong&gt; Start here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Map your connectors.&lt;/strong&gt; What systems does your AI need to talk to? Build those integrations first. Hardest part. Most valuable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Capture everything.&lt;/strong&gt; Every platform quirk. Every failed approach. Every successful pattern. Your AI should learn from your organization's experience, not start fresh every session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Own your methodology.&lt;/strong&gt; Document how you approach problems with AI. Small tests → captured learnings → iteration. More valuable than any tool you can buy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Accept ugly.&lt;/strong&gt; The most effective AI systems I've built are not pretty. Config files, markdown documents, scripts. They look like plumbing. They work like machines.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The moat isn't the model.&lt;/strong&gt; It never was.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's the connectors&lt;/strong&gt; that talk to your stack. The domain expertise captured over months. The methodology that turns every failure into a lesson.&lt;/p&gt;

&lt;p&gt;None of that lives in a wrapper.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Tom Tokita. I run &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology&lt;/a&gt; out of Manila. We build production AI and Salesforce systems for enterprises that need real integrations, not another wrapper. &lt;a href="https://aether-global.com/contact" rel="noopener noreferrer"&gt;Let's talk.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Read next: &lt;a href="https://dev.to/blog/context-engineering-vs-prompt-engineering"&gt;Context Engineering: Why Your AI Strategy Needs Infrastructure, Not Better Prompts&lt;/a&gt; · &lt;a href="https://dev.to/blog/autonomous-ai-agents-production-cost"&gt;Autonomous AI Agents Look Great in Demos. Here's What They Cost in Production.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>webdev</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Truth About Agent Swarming: What the Gurus Won't Tell You About Cost, Failure, and Security</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Sat, 16 May 2026 11:15:26 +0000</pubDate>
      <link>https://dev.to/tomtokita/the-truth-about-agent-swarming-what-the-gurus-wont-tell-you-about-cost-failure-and-security-1775</link>
      <guid>https://dev.to/tomtokita/the-truth-about-agent-swarming-what-the-gurus-wont-tell-you-about-cost-failure-and-security-1775</guid>
      <description>&lt;p&gt;Everyone's building "AI agent teams" right now. Five agents, ten agents, a whole swarm collaborating on complex tasks. At least that's what the YouTube thumbnails promise. The reality? Most of these systems are burning money, leaking data, and failing in ways their builders don't even notice until the invoice arrives.&lt;/p&gt;

&lt;p&gt;I built a multi-agent system. It runs in production, daily. So I'm not here to tell you agent swarming doesn't work. I'm here to tell you that most of the advice circulating about it is dangerously incomplete.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Swarm Hype Cycle Is in Full Swing
&lt;/h2&gt;

&lt;p&gt;Open Twitter or YouTube right now and you'll find a hundred tutorials showing you how to spin up a multi-agent team in under 20 minutes. CrewAI, AutoGen, LangGraph. The frameworks keep multiplying. The demos look incredible: agents researching, agents writing, agents reviewing each other's work, all orchestrated into a beautiful pipeline.&lt;/p&gt;

&lt;p&gt;Here's what the demos don't show: what happens when you run that pipeline 500 times. Or 5,000 times. Or when one agent hallucinates and the next agent treats that hallucination as fact and passes it downstream to a third agent that takes action on it.&lt;/p&gt;

&lt;p&gt;The guru content follows a pattern: show the setup, show one successful run, skip the failure modes, skip the bill, skip the security implications. It's like showing someone how to start a restaurant by filming one perfect dinner service and cutting before the health inspector shows up.&lt;/p&gt;

&lt;p&gt;The latest version of this is "I built an entire company in 30 minutes with AI agents." Someone spins up a framework like &lt;a href="https://github.com/nicepkg/paperclip" rel="noopener noreferrer"&gt;Paperclip&lt;/a&gt; (which, to be fair, has genuinely solid engineering underneath it: heartbeat scheduling, budget caps, task queues, audit trails), and the content that follows makes it sound like you can replace an entire org overnight. The tool isn't the problem. The tool is fine. The problem is the interpretation layer: gurus film the setup and skip the part where 48 pre-configured agents wake up every 4 hours on a frontier model. Nobody mentions what that costs at the end of the month, or what happens when agent #23 gets a poisoned input and the other 47 trust its output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Multi-Agent AI Fails in Production
&lt;/h2&gt;

&lt;p&gt;The coordination problem is real and it scales badly. &lt;a href="https://galileo.ai/blog/why-multi-agent-systems-fail" rel="noopener noreferrer"&gt;Galileo's research on multi-agent reliability&lt;/a&gt; found that adding agents multiplies failure points faster than the agent count grows. Count every pair of agents as a potential handoff and n agents give you n(n-1)/2 failure points, which is quadratic growth: four agents create six potential failure points, not four, and ten agents create 45. Every agent-to-agent handoff is a place where context gets lost, instructions get misinterpreted, or outputs get corrupted.&lt;/p&gt;
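
&lt;p&gt;The figures above (six for four agents, 45 for ten) match a simple pairwise count. Here is a minimal Python sketch, assuming every pair of agents is one potential handoff. The function name &lt;code&gt;pairwise_handoffs&lt;/code&gt; is illustrative and not taken from Galileo's research.&lt;/p&gt;

```python
def pairwise_handoffs(agent_count: int) -> int:
    """Count distinct agent pairs, n * (n - 1) / 2, as potential handoff points."""
    return agent_count * (agent_count - 1) // 2


# Agent counts mentioned in the text.
for n in (4, 10):
    print(n, "agents:", pairwise_handoffs(n), "potential failure points")
```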

&lt;p&gt;&lt;a href="https://www.cio.com/article/4143420/true-multi-agent-collaboration-doesnt-work.html" rel="noopener noreferrer"&gt;CIO reported in March 2026&lt;/a&gt; that true multi-agent collaboration remains largely aspirational. Their testing showed single agents hitting 100% success rates on isolated tasks, while hierarchical multi-agent structures failed 64% of the time and self-organized swarms failed 68%. That's not a rounding error. That's a fundamental coordination tax.&lt;/p&gt;

&lt;p&gt;The failure modes I've seen firsthand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No purpose definition.&lt;/strong&gt; Agents exist because someone saw a cool demo, not because the task requires decomposition. A single well-prompted agent with good tools will outperform a badly orchestrated team of five every time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No role boundaries.&lt;/strong&gt; Two agents stepping on each other's work, or worse, one agent undoing what another just did. Without strict scoping, you get agents arguing in loops, burning tokens while producing nothing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cascade failures.&lt;/strong&gt; Agent A hallucinates a "fact." Agent B cites it. Agent C acts on it. By the time a human reviews the output, three layers of confident-sounding nonsense have compounded. &lt;a href="https://galileo.ai/blog/why-multi-agent-systems-fail" rel="noopener noreferrer"&gt;Galileo calls this "propagation of inaccuracies"&lt;/a&gt; and it's the single biggest reliability risk in multi-agent systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Pattern&lt;/th&gt;
&lt;th&gt;What Happens&lt;/th&gt;
&lt;th&gt;How It Scales&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No purpose definition&lt;/td&gt;
&lt;td&gt;Agents do work a single agent could handle&lt;/td&gt;
&lt;td&gt;Cost multiplies, quality stays flat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No role boundaries&lt;/td&gt;
&lt;td&gt;Agents duplicate or undo each other's work&lt;/td&gt;
&lt;td&gt;Token burn scales quadratically with agent count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cascade hallucination&lt;/td&gt;
&lt;td&gt;Bad output propagates through the chain&lt;/td&gt;
&lt;td&gt;Compounds per hop. 3 agents = 3 layers of compounded error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window overflow&lt;/td&gt;
&lt;td&gt;Shared context exceeds model limits, agents lose thread&lt;/td&gt;
&lt;td&gt;Every agent's output inflates the shared context for every other agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestrator bottleneck&lt;/td&gt;
&lt;td&gt;Single coordinator becomes the weakest link&lt;/td&gt;
&lt;td&gt;Orchestrator complexity grows O(n²) with agent count&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The API Bill Nobody Shows You
&lt;/h2&gt;

&lt;p&gt;Every agent in your swarm is an API call. More accurately, every agent is &lt;em&gt;multiple&lt;/em&gt; API calls: the initial prompt, the tool calls, the retries, the context-sharing between agents. A five-agent team running on a frontier model isn't 5x the cost of one agent. It's often 10-15x once you factor in coordination overhead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.getmonetizely.com/articles/the-complete-guide-to-agent-swarm-pricing-models-how-should-you-price-collective-ai-intelligence" rel="noopener noreferrer"&gt;Stanford's AI Index Report, cited by Monetizely&lt;/a&gt;, found that coordination overhead alone accounts for 15-25% of total operational costs in mature multi-agent systems. That's before you count the actual task execution.&lt;/p&gt;

&lt;p&gt;Here's how the math works in practice. Say you're running a research-and-write pipeline with five agents (researcher, analyst, writer, editor, fact-checker). Each agent averages 3,000 input tokens and 1,500 output tokens per task. On a frontier model, that's roughly $0.04 per agent per task &lt;em&gt;(pricing as of March 2026; check your provider's current rates)&lt;/em&gt;. Five agents: $0.20 per task. Sounds cheap, right?&lt;/p&gt;

&lt;p&gt;Now add retries (agent disagrees with another agent's output, re-runs). Add context sharing (every agent needs to see what the others produced, and input tokens multiply). Add the orchestrator's overhead. Add recursive thinking where an agent calls itself to refine. In production, that $0.20 task routinely becomes $0.80-$1.50. Run it 100 times a day and you're looking at $80-$150 daily, or $2,400-$4,500 monthly. For a single pipeline.&lt;/p&gt;
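
&lt;p&gt;The arithmetic above is easy to rerun with your own numbers. This is a rough sketch, not a billing model: it assumes a flat per-task cost and a 30-day month, and the function name &lt;code&gt;project_costs&lt;/code&gt; is made up for illustration. Costs are kept in whole cents to avoid floating-point rounding.&lt;/p&gt;

```python
def project_costs(cost_per_task_cents: int, runs_per_day: int, days_per_month: int = 30) -> dict:
    """Project daily and monthly spend in whole dollars (rounded down) from a per-task cost in cents."""
    daily_cents = cost_per_task_cents * runs_per_day
    return {
        "daily_usd": daily_cents // 100,
        "monthly_usd": daily_cents * days_per_month // 100,
    }


# Per-task costs from the text: $0.20 on paper, $0.80 to $1.50 in production.
for label, cents in (("on paper", 20), ("production low", 80), ("production high", 150)):
    print(label, project_costs(cents, runs_per_day=100))
```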

&lt;p&gt;The gurus never show you the billing dashboard. I've seen my own costs spike 4x in a single day when an agent hit a retry loop that the orchestrator didn't catch. That's the kind of lesson you only learn in production, not in a 20-minute tutorial. I wrote more about &lt;a href="https://tokita.online/autonomous-ai-agents-production-cost/" rel="noopener noreferrer"&gt;what autonomous agents actually cost in production&lt;/a&gt;, the single-agent version of this problem, which multi-agent compounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Security Problem Nobody's Talking About
&lt;/h2&gt;

&lt;p&gt;This is the part that genuinely concerns me. People are downloading MCP servers from GitHub, connecting premade agent builders, and giving their swarm access to production databases, file systems, and APIs, without auditing a single line of the code routing their data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.covertswarm.com/post/multi-agent-ai-security-risks" rel="noopener noreferrer"&gt;CovertSwarm's January 2026 analysis&lt;/a&gt; exposed how agent-to-agent communication can be exploited through prompt injection, where one compromised agent manipulates another agent's behavior through crafted outputs. In a multi-agent system, a single compromised node can cascade manipulation across the entire swarm.&lt;/p&gt;

&lt;p&gt;The security gaps I see repeated constantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No credential scoping.&lt;/strong&gt; Every agent gets the same API keys with the same permissions. Your research agent has write access to your production database. Your summarizer can send emails. Why?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No output boundaries.&lt;/strong&gt; Agent outputs aren't sanitized before being passed to the next agent. That's how prompt injection propagates. A malicious input in a research result becomes an instruction to the next agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unaudited external tools.&lt;/strong&gt; That MCP server you downloaded because it had 200 GitHub stars? Did you read its source? Do you know where it sends your data? Most people don't. &lt;a href="https://tokita.online/llm-wrappers-what-actually-matters/" rel="noopener noreferrer"&gt;Most AI tools are just wrappers&lt;/a&gt; with varying levels of transparency about what happens between your input and the LLM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No audit trail.&lt;/strong&gt; When something goes wrong in a five-agent pipeline, can you reconstruct what each agent saw, decided, and produced? Most frameworks don't log at that granularity by default.&lt;/li&gt;
&lt;/ul&gt;
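
&lt;p&gt;On the audit trail point, one low-tech option is to write one JSON line per agent step. The sketch below is an assumption about what such a record could contain (what the agent saw, decided, and produced). The field names and the helpers &lt;code&gt;audit_record&lt;/code&gt; and &lt;code&gt;append_audit&lt;/code&gt; are hypothetical and not part of any framework named here.&lt;/p&gt;

```python
import json
import time


def audit_record(agent: str, inputs: str, decision: str, output: str) -> str:
    """Serialize what one agent saw, decided, and produced as a single JSON line."""
    return json.dumps(
        {
            "ts": time.time(),
            "agent": agent,
            "inputs": inputs,
            "decision": decision,
            "output": output,
        },
        sort_keys=True,
    )


def append_audit(path: str, line: str) -> None:
    """Append one record to a JSON Lines file so a run can be reconstructed later."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
```

&lt;p&gt;Calling &lt;code&gt;append_audit&lt;/code&gt; after every agent step gives you a file you can replay in order when a pipeline produces something you can't explain.&lt;/p&gt;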

&lt;h2&gt;
  
  
  What Actually Works (From Someone Who Built One)
&lt;/h2&gt;

&lt;p&gt;I run a multi-agent system in production. It works. But it works because I built it with specific constraints from day one, not because I followed a framework tutorial.&lt;/p&gt;

&lt;p&gt;Here's what I've learned, without exposing the blueprint:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with a purpose.&lt;/strong&gt; Every agent in the system exists because a specific task requires it. If a single agent can do the job, a single agent does the job. The question isn't "how many agents can I add?" It's "what's the minimum number of agents that makes this task decomposition actually valuable?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run it monitored, not autonomous.&lt;/strong&gt; The fantasy is agents running completely on their own, 24/7, while you sleep. The reality is that unmonitored agents drift. They develop patterns you didn't intend. They find edge cases your orchestration doesn't handle. Monitor heavily, especially early on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set an end date.&lt;/strong&gt; Bounded execution, not open-ended. An agent swarm should complete its task and stop. "Run this analysis, produce this output, terminate." Not "keep running until I tell you to stop." Open-ended swarms are where costs and drift compound.&lt;/p&gt;
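
&lt;p&gt;As a rough illustration of bounded execution (not the author's actual system), the sketch below stops a run when the task finishes, when a step limit is reached, or when spend crosses a cap. The &lt;code&gt;step&lt;/code&gt; callable is a hypothetical stand-in for one unit of agent work that reports whether it is done and what it cost. The cap is checked after each step, so the step that crosses it has already been paid for.&lt;/p&gt;

```python
class BudgetExceeded(Exception):
    """Raised when a run has spent more than its cap."""


def run_bounded(step, max_steps: int, budget_cents: int) -> dict:
    """Call step() until it reports done, the step limit is hit, or the budget is exceeded.

    step is a hypothetical callable returning (done, cost_cents) for one unit of work.
    """
    spent = 0
    for steps_taken in range(1, max_steps + 1):
        done, cost = step()
        spent += cost
        if spent > budget_cents:
            raise BudgetExceeded(f"spent {spent} cents against a cap of {budget_cents}")
        if done:
            return {"status": "completed", "steps": steps_taken, "spent_cents": spent}
    return {"status": "step_limit_reached", "steps": max_steps, "spent_cents": spent}


# Demo: a fake step that never finishes, like a retry loop nobody caught.
def stuck_step():
    return (False, 40)


print(run_bounded(stuck_step, max_steps=5, budget_cents=1000))
```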

&lt;p&gt;&lt;strong&gt;Scope each agent's permissions.&lt;/strong&gt; Every agent gets exactly the access it needs and nothing more. Read-only where possible. No shared credentials. If an agent needs to write to a database, that's a deliberate architectural decision with boundaries, not a default.&lt;/p&gt;
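
&lt;p&gt;A deny-by-default permission table is one simple way to express this. The sketch below is illustrative only: the agent names and permission strings are invented, and in a real system the same scoping would also have to exist at the credential level, not just in application code.&lt;/p&gt;

```python
# Hypothetical agents and permissions, listed explicitly per agent.
AGENT_PERMISSIONS = {
    "researcher": frozenset({"web:read", "db:read"}),
    "summarizer": frozenset({"db:read"}),
    "publisher": frozenset({"db:read", "db:write"}),
}


def is_allowed(agent: str, permission: str) -> bool:
    """Deny by default: unknown agents and unlisted permissions are refused."""
    return permission in AGENT_PERMISSIONS.get(agent, frozenset())


def require(agent: str, permission: str) -> None:
    """Raise before a tool call if the agent was never granted the permission."""
    if not is_allowed(agent, permission):
        raise PermissionError(f"{agent} may not use {permission}")
```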

&lt;p&gt;&lt;strong&gt;Audit every external tool before connecting.&lt;/strong&gt; Every MCP server, every API integration, every external data source. Read the code, understand the data flow, verify the trust boundaries. If you can't audit it, don't connect it.&lt;/p&gt;

&lt;p&gt;The pattern underneath all of this: multi-agent systems work when they're purpose-built by someone who understands every component. They fail when they're assembled from YouTube tutorials by people who are optimizing for "cool demo" instead of "reliable production system."&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;



&lt;p&gt;Are multi-agent AI systems worth building?&lt;/p&gt;

&lt;p&gt;Yes, if the task genuinely requires decomposition across specialized roles. Research pipelines, complex analysis workflows, and multi-step processes with distinct skill requirements are legitimate use cases. The problem isn't multi-agent as a concept. It's multi-agent as a default approach when a single well-tooled agent would do the job better, cheaper, and more reliably.&lt;/p&gt;



&lt;p&gt;How much does it cost to run a multi-agent AI system?&lt;/p&gt;

&lt;p&gt;It depends on the model, agent count, and task complexity, but multi-agent costs are multiplicative, not additive. A five-agent pipeline on a frontier model can cost 10-15x what a single agent costs per task once you factor in context sharing, retries, and coordination overhead. &lt;a href="https://www.getmonetizely.com/articles/the-complete-guide-to-agent-swarm-pricing-models-how-should-you-price-collective-ai-intelligence" rel="noopener noreferrer"&gt;Stanford's AI Index Report via Monetizely estimates&lt;/a&gt; coordination overhead alone accounts for 15-25% of operational costs. Budget for at least 3-5x your single-agent baseline when planning multi-agent deployments.&lt;/p&gt;



&lt;p&gt;What are the biggest security risks with AI agent swarms?&lt;/p&gt;

&lt;p&gt;The top risks are unscoped credentials (every agent gets full access instead of the minimum required), unaudited external tools (MCP servers and API integrations whose source you never read), and agent-to-agent prompt injection (where a compromised agent manipulates others through crafted outputs). &lt;a href="https://www.covertswarm.com/post/multi-agent-ai-security-risks" rel="noopener noreferrer"&gt;CovertSwarm documented&lt;/a&gt; in January 2026 how inter-agent trust can be exploited.&lt;/p&gt;



&lt;p&gt;Should I use CrewAI, AutoGen, or LangGraph for multi-agent AI?&lt;/p&gt;

&lt;p&gt;The framework matters less than the architecture decisions you make within it. All three can produce working multi-agent systems, and all three can produce expensive failures. The questions that actually matter: Do you have a clear purpose for each agent? Are permissions scoped per agent? Do you have monitoring and cost controls? Can you audit every external integration? If you can't answer yes to all four, the framework choice is irrelevant. You'll fail regardless of which one you pick.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Agent swarms aren't bad. Unexamined swarms are. The technology works. I use it daily. But it works because every agent has a purpose, every permission is scoped, every external tool is audited, and the whole system runs monitored with bounded execution.&lt;/p&gt;

&lt;p&gt;The gap in the current conversation isn't technical capability. It's operational maturity. The frameworks are getting better. The models are getting cheaper. But the advice circulating ("just add more agents") is setting people up to build expensive, insecure systems they don't understand.&lt;/p&gt;

&lt;p&gt;Build with purpose. Monitor heavily. Kill when done.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tom Tokita is the President of &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology Inc.&lt;/a&gt;, a Salesforce consulting firm in Manila. He built a personal AI operations system as his daily driver. Not planned. Engineered out of necessity. He writes about what works, what breaks, and what the industry keeps getting wrong.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>security</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Someone Called My AI System a Tool. Then They Showed Me Theirs.</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Sat, 09 May 2026 16:08:07 +0000</pubDate>
      <link>https://dev.to/tomtokita/someone-called-my-ai-system-a-tool-then-they-showed-me-theirs-4954</link>
      <guid>https://dev.to/tomtokita/someone-called-my-ai-system-a-tool-then-they-showed-me-theirs-4954</guid>
      <description>&lt;p&gt;Someone at a conference asked me what I'd been building. I described a system I use daily. Over 200 sessions of accumulated learnings. 45 mechanical hooks that fire before and after every action. Anti-fabrication gates that block the AI from stating anything it hasn't verified. Memory that survives context compression. Deploy protections that physically prevent wrong-target pushes. A behavioral identity that gets re-injected every message so the system doesn't drift into generic assistant mode.&lt;/p&gt;

&lt;p&gt;He nodded and said, "Oh, so you built a tool."&lt;/p&gt;

&lt;p&gt;Then he described his. "I built something similar," he said. An agent framework. A React dashboard. A task board. Some cron jobs. A dozen agents with names. A job worker that shells out to the agent CLI and captures stdout. He showed me the architecture diagram. Three boxes connected by arrows.&lt;/p&gt;

&lt;p&gt;I asked about guardrails. "What do you mean?" I asked what happens when an agent hallucinates a data point and the next agent downstream treats it as fact. He said that hasn't happened yet. I asked about credential scoping. Every agent had the same API keys with the same permissions. I asked what happens when context compresses mid-task. He didn't know what context compression was.&lt;/p&gt;

&lt;p&gt;We were not building the same thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Assembly Pattern
&lt;/h2&gt;

&lt;p&gt;This pattern is everywhere right now. Pull an open-source agent framework. Fork a React cockpit from GitHub. Wire them together with a thin HTTP layer. Add some agent definitions with fun names. Ship a demo. Call it "AI infrastructure."&lt;/p&gt;

&lt;p&gt;It works in the demo. It works for the screenshot. It even works the first five times you run it.&lt;/p&gt;

&lt;p&gt;It stops working when an agent fabricates a statistic and your client reads it. When a retry loop burns $400 in API calls overnight because nothing capped the spend. When an agent with write access to your production database decides to "clean up" records it hallucinated as duplicates.&lt;/p&gt;

&lt;p&gt;The assembly is the easy part. The demo is the easy part. What comes after the demo is where the actual engineering lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Missing From Every Patchwork Build I've Reviewed
&lt;/h2&gt;

&lt;p&gt;I've audited three of these setups in the past year. Internal team builds, partner builds, open-source-assembled stacks. The gaps are identical every time.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What Production Requires&lt;/th&gt;
&lt;th&gt;What the Patchwork Has&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pre-action gates (mechanical blocks before execution)&lt;/td&gt;
&lt;td&gt;Nothing. Agent output accepted as final answer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anti-fabrication (every claim must trace to a source)&lt;/td&gt;
&lt;td&gt;Nothing. Whatever the LLM says is treated as fact&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anti-drift detection (behavioral correction over long sessions)&lt;/td&gt;
&lt;td&gt;Nothing. Agents drift silently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent memory with session recovery&lt;/td&gt;
&lt;td&gt;Stateless. Fresh context every run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Captured learnings (compound knowledge over time)&lt;/td&gt;
&lt;td&gt;Nothing. The same mistakes repeat indefinitely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Credential scoping per agent&lt;/td&gt;
&lt;td&gt;Shared keys, full permissions, no boundaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human checkpoints on multi-step tasks&lt;/td&gt;
&lt;td&gt;Fully autonomous, no review loop&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The common response: "We'll add that later." In my experience, later means after the first production incident. And the first production incident in an unharnessed AI system is rarely small.&lt;/p&gt;

&lt;h2&gt;
  
  
  Assembly Is Not Engineering
&lt;/h2&gt;

&lt;p&gt;I want to be clear. I'm not against using open source. I use open-source tools constantly. MIT-licensed projects power parts of my own stack. Pulling from the community is smart and efficient.&lt;/p&gt;

&lt;p&gt;But there's a gap between assembling components and engineering a system. Assembly is connecting boxes. Engineering is understanding what happens at every connection point when things go wrong. What happens when the model hallucinates at step 3 of a 7-step pipeline? What happens when context compresses and the agent forgets the rules you set 40 messages ago? What happens when an agent gets a poisoned input from an unaudited MCP server?&lt;/p&gt;

&lt;p&gt;If you can't answer those questions, you haven't built infrastructure. You've built a demo with a longer runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  "I'll Just Have My AI Build It"
&lt;/h2&gt;

&lt;p&gt;This is the part that genuinely worries me.&lt;/p&gt;

&lt;p&gt;The assembly pattern is accelerating because people are using AI to do the assembling. "I'll just have Claude/GPT scaffold my agent system." The AI reads some docs, maybe runs a web search, ingests a few blog posts about agent frameworks, and produces something that looks like architecture. Clean folder structure. Reasonable-sounding agent definitions. Maybe even a README with a diagram.&lt;/p&gt;

&lt;p&gt;But it's architecture by hallucination. The AI doesn't know what breaks in production because it's never been in production. It doesn't know that context compression silently erases behavioral rules at message 180. It doesn't know that an unscoped MCP server will happily route your client data through an endpoint you never audited. It doesn't know that "just add a retry" turns a $0.20 task into a $40 task when the retry loop has no ceiling.&lt;/p&gt;
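
&lt;p&gt;A retry ceiling can be a few lines of code. The sketch below is illustrative only: &lt;code&gt;run_with_ceiling&lt;/code&gt;, &lt;code&gt;BudgetExceeded&lt;/code&gt;, the &lt;code&gt;task&lt;/code&gt; callable that reports its own cost, and the default limits are all assumptions made for the example, not part of any framework.&lt;/p&gt;

```python
from typing import Callable, Tuple


class BudgetExceeded(Exception):
    """Raised when retries stop because a limit was reached."""


def run_with_ceiling(
    task: Callable[[], Tuple[bool, float]],
    max_attempts: int = 3,
    max_spend_usd: float = 1.00,
) -> float:
    """Retry `task` until it succeeds, the attempt cap is hit, or spend reaches the ceiling.

    `task` is a hypothetical callable returning (succeeded, cost_in_usd).
    Returns the total spend on success; raises BudgetExceeded otherwise.
    """
    spent = 0.0
    for attempt in range(1, max_attempts + 1):
        ok, cost = task()
        spent += cost
        if ok:
            return spent
        if spent >= max_spend_usd:
            raise BudgetExceeded(f"stopped after {attempt} attempts, spent ${spent:.2f}")
    raise BudgetExceeded(f"gave up after {max_attempts} attempts, spent ${spent:.2f}")
```

&lt;p&gt;Whichever limit is reached first ends the loop, so the worst-case cost of a task is bounded before it starts rather than discovered on the invoice.&lt;/p&gt;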

&lt;p&gt;What you get is a system that looks engineered but isn't. It passes the screenshot test. It passes the "show the team" test. It fails the Tuesday afternoon test, when something unexpected happens and there's no gate to catch it, no captured learning to reference, no incident history to draw from.&lt;/p&gt;

&lt;p&gt;AI is intelligent. It can write code, generate configurations, and produce plausible architectures. What it cannot do is architect from pain it hasn't experienced. Every rule in a real harness exists because something specific went wrong. The AI building your system hasn't had things go wrong yet. It's working from blog posts and documentation, not from the 11 PM deploy that almost went to the wrong org.&lt;/p&gt;

&lt;p&gt;The irony is thick. An unharnessed AI building the infrastructure that's supposed to harness AI. The output will be confident, well-structured, and missing every lesson that only production teaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "Infrastructure" Actually Means
&lt;/h2&gt;

&lt;p&gt;The system I described at that conference didn't start as infrastructure. It started as a mess. A rules file that grew from 5 entries to 27 because the AI kept finding new ways to surprise me. A hook I wrote at 11 PM because the system nearly pushed metadata to the wrong environment. A memory protocol I built because the AI forgot everything after context compression and started making the same mistakes I'd fixed three hours earlier.&lt;/p&gt;

&lt;p&gt;Every rule in the harness traces to a specific failure. That's not architecture by design. It's architecture by incident. But it compounds. 200+ sessions of captured learnings means the system knows things a fresh agent never will. Platform quirks, client-specific constraints, failure patterns that repeat across projects. None of that lives in an agent framework you pulled from GitHub last Tuesday.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tokita.online/what-is-harness-engineering/" rel="noopener noreferrer"&gt;I wrote about this convergence pattern recently&lt;/a&gt;. Multiple teams, from OpenAI to Martin Fowler's group to a solo practitioner in Manila, arrived at the same conclusion independently: the harness is the product, not the model. A disciplined harness on a weaker model beats an unconstrained stronger model every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Question
&lt;/h2&gt;

&lt;p&gt;Next time someone shows you their "AI infrastructure," ask them three questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What happens when an agent fabricates a data point? Is there a mechanical gate, or do you just hope it doesn't?&lt;/li&gt;
&lt;li&gt;What happens after context compression? Does the system recover its behavioral rules, or does it revert to a generic assistant?&lt;/li&gt;
&lt;li&gt;Can you trace every rule in your system to a specific incident that forced you to add it?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the answers are "hasn't happened yet," "what's context compression," and a blank stare, you're looking at a patchwork. Not infrastructure.&lt;/p&gt;

&lt;p&gt;And that's fine. Everyone starts with a patchwork. I did. The question is whether you know the difference.&lt;/p&gt;

&lt;p&gt;If you want to start building the real thing, I wrote a &lt;a href="https://tokita.online/ai-agent-pre-action-gate-tutorial/" rel="noopener noreferrer"&gt;hands-on tutorial with three production-tested gates and starter code&lt;/a&gt;. The gates are also packaged as a &lt;a href="https://github.com/tomtokitajr/ai-agent-gates" rel="noopener noreferrer"&gt;ready-to-clone repo on GitHub&lt;/a&gt;. Zero dependencies, works with any LLM provider.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Tom Tokita. I run &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology&lt;/a&gt; out of Manila. I've been building and operating a production AI system daily for over 200 sessions. I write about what works, what breaks, and the gap between demos and production. &lt;a href="https://tokita.online" rel="noopener noreferrer"&gt;More on tokita.online.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>Context Engineering: Why Your AI Strategy Needs Infrastructure, Not Better Prompts</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Sat, 09 May 2026 13:07:46 +0000</pubDate>
      <link>https://dev.to/tomtokita/context-engineering-why-your-ai-strategy-needs-infrastructure-not-better-prompts-378j</link>
      <guid>https://dev.to/tomtokita/context-engineering-why-your-ai-strategy-needs-infrastructure-not-better-prompts-378j</guid>
      <description>&lt;p&gt;&lt;strong&gt;Five minutes on LinkedIn&lt;/strong&gt; and you'll find it. Someone sharing "the one prompt that changed everything." A magic system prompt. A secret ChatGPT trick. A "10x framework."&lt;/p&gt;

&lt;p&gt;I've built production AI systems across enterprise consulting, content automation, and internal operations. The prompt is maybe 5% of why any of it works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The other 95%?&lt;/strong&gt; Infrastructure. Memory. Enforcement. Captured learnings. That's context engineering, and it's the skill that actually matters in 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prompt Engineering Has a Ceiling
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering isn't useless.&lt;/strong&gt; It's just the starting line. Here's what the prompt gurus conveniently leave out:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What They Show&lt;/th&gt;
&lt;th&gt;What Actually Happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fresh conversation, perfect prompt&lt;/td&gt;
&lt;td&gt;Message 200. Context window full, business rules forgotten&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One-shot demo, curated input&lt;/td&gt;
&lt;td&gt;Production workflow hitting edge cases the prompt never anticipated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Just tell the AI to be careful"&lt;/td&gt;
&lt;td&gt;AI ignoring that instruction 3 hours into a session&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Prompts are stateless.&lt;/strong&gt; Every conversation starts from zero. Your AI doesn't remember what worked yesterday or what broke last week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's not a prompt problem.&lt;/strong&gt; That's an infrastructure problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Context Engineering?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The short version:&lt;/strong&gt; designing systems that deliver the right information to an AI at the right time, maintain behavioral consistency, and improve through captured experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's not a prompt template.&lt;/strong&gt; It's architecture.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Prompt engineering&lt;/strong&gt; = giving a new hire a great job description.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context engineering&lt;/strong&gt; = giving them the job description, an onboarding manual, institutional knowledge, and a manager who catches mistakes before they ship.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which one performs better on day 30?&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Layers
&lt;/h2&gt;

&lt;p&gt;Every production AI system I've built operates on three layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: What the AI Knows Right Now
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The active context:&lt;/strong&gt; current conversation, task at hand, files being worked on. Most people stop here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: What It Can Retrieve When Needed
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The retrieval layer:&lt;/strong&gt; persistent memory, documented learnings, platform-specific knowledge the AI pulls in when relevant. The AI needs to know &lt;em&gt;where to look&lt;/em&gt;, not memorize everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: What It's Mechanically Prevented From Doing Wrong
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The enforcement layer:&lt;/strong&gt; automated checks that fire before or after AI actions. Not guidelines. Not suggestions. &lt;strong&gt;Mechanical gates.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The gap:&lt;/strong&gt; most AI implementations have Layer 1. Some have Layer 2. Almost nobody has Layer 3.&lt;/p&gt;




&lt;h2&gt;
  
  
  Memory: Teaching AI to Remember
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The biggest lie in AI tooling&lt;/strong&gt; is that conversation history equals memory. It doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conversation history is a rolling buffer&lt;/strong&gt; that gets compressed, truncated, or dropped. Your AI doesn't "remember." It reads what's still in the window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production memory looks different:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Persistent state files:&lt;/strong&gt; structured notes the AI reads at session start. Project status, decisions made, open items. Intentional, curated memory, not chat history.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Session recovery:&lt;/strong&gt; what happens after context compression or a new session? If the answer is "start over," you're re-teaching the AI every time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Platform learnings:&lt;/strong&gt; captured knowledge about specific tools and platforms. Every quirk, every gotcha, every workaround. An AI that's absorbed 100+ sessions of this doesn't make rookie mistakes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The compound effect:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;What the AI Knows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Day 1&lt;/td&gt;
&lt;td&gt;The prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Week 2&lt;/td&gt;
&lt;td&gt;Prompt + 10 captured learnings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Month 3&lt;/td&gt;
&lt;td&gt;Prompt + 60 learnings + platform quirks + failure patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Month 6&lt;/td&gt;
&lt;td&gt;Knows your business better than most new hires&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;That's the moat.&lt;/strong&gt; No prompt template replicates six months of captured institutional knowledge.&lt;/p&gt;




&lt;h2&gt;
  
  
  Enforcement: Mechanical Gates, Not Vibes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Be careful" is not a guardrail.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Writing "always verify before acting" in a system prompt&lt;/strong&gt; is a suggestion. The AI follows it when convenient, ignores it when confidence is high. I've watched it happen dozens of times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production enforcement is mechanical:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pre-action gates:&lt;/strong&gt; automated checks that fire &lt;em&gt;before&lt;/em&gt; execution. The AI literally cannot proceed without passing. Not a prompt instruction. A system-level block.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Anti-drift detection:&lt;/strong&gt; AI behavior softens toward generic assistant mode over long sessions. Enforcement catches this and corrects it. Mechanically. Not by asking nicely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Anti-fabrication:&lt;/strong&gt; every data point traces to a named source. No source? Flagged, not presented as fact. In client work, fabricated data is career-ending.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scope control:&lt;/strong&gt; the AI does what was asked. Not "while I'm here, let me also improve this." Bug fix ≠ refactor. Enforced.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Stop thinking about what you &lt;em&gt;want&lt;/em&gt; the AI to do. Start thinking about what you need to &lt;strong&gt;prevent&lt;/strong&gt; it from doing.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Methodology: Small Tests, Captured Learnings, Iteration
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The guru approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Craft the perfect prompt&lt;/li&gt;
&lt;li&gt;Ship it&lt;/li&gt;
&lt;li&gt;Hope it works&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The practitioner approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run a small test&lt;/li&gt;
&lt;li&gt;See what breaks&lt;/li&gt;
&lt;li&gt;Capture the lesson&lt;/li&gt;
&lt;li&gt;Update the system&lt;/li&gt;
&lt;li&gt;Run again&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Boring? Yes. Effective? Absolutely.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every bug fix becomes a learning.&lt;/strong&gt; Every platform quirk gets documented. Every failure mode gets a guardrail. The system gets smarter not because the model improved, but because you designed it to learn from its own mistakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building from the Philippines,&lt;/strong&gt; we work with smaller teams and tighter budgets. We can't afford an AI that makes the same mistake twice. The methodology isn't a nice-to-have. It's survival.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Infrastructure Beats Inspiration
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The "magic prompt" has a half-life.&lt;/strong&gt; Models update. Context windows change. Your clever prompt breaks. You rewrite it. It breaks again. Welcome to the treadmill.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Magic Prompt&lt;/th&gt;
&lt;th&gt;Context Infrastructure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model update&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Breaks, needs rewrite&lt;/td&gt;
&lt;td&gt;Swap the engine, keep the learnings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Long session&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Degrades, drifts&lt;/td&gt;
&lt;td&gt;Mechanical gates hold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;New platform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Starts from zero&lt;/td&gt;
&lt;td&gt;Builds on captured learnings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Team scales&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Everyone writes their own prompts&lt;/td&gt;
&lt;td&gt;Everyone uses the same system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Day 200&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same as Day 1&lt;/td&gt;
&lt;td&gt;200 days of compound knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The uncomfortable truth:&lt;/strong&gt; building AI infrastructure is boring. Config files. Memory protocols. Documentation. Capture routines. Doesn't make a great LinkedIn carousel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But it's the difference&lt;/strong&gt; between an AI demo and an AI system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;You don't need to build everything at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Give your AI memory.&lt;/strong&gt; A file it reads at session start: project state, decisions, open items. Even a simple markdown file. Never start from zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Add one guardrail.&lt;/strong&gt; Pick your AI's most common failure mode. Build one mechanical check for it. Not a prompt instruction. A gate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Capture one learning per session.&lt;/strong&gt; What broke? What worked? What should the AI remember next time? Write it down. Feed it back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Build from there.&lt;/strong&gt; The system doesn't have to be elegant. It has to work. And improve.&lt;/p&gt;
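
&lt;p&gt;Steps 1 and 3 can be sketched with nothing but a markdown file. This is a minimal illustration under assumptions: the function names, the bullet format, and the idea of prepending the file to a new session are hypothetical choices, not a prescribed layout.&lt;/p&gt;

```python
from datetime import date
from pathlib import Path


def load_memory(path: Path) -> str:
    """Return the memory file's contents, or an empty string on the first run."""
    return path.read_text(encoding="utf-8") if path.exists() else ""


def capture_learning(path: Path, lesson: str, today: date) -> None:
    """Append one dated lesson to the memory file as a markdown bullet."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(f"- {today.isoformat()}: {lesson.strip()}\n")


def build_session_preamble(path: Path) -> str:
    """Text to place at the top of a new session so it never starts from zero."""
    memory = load_memory(path)
    if not memory:
        return "No captured learnings yet."
    return "Captured learnings from earlier sessions:\n" + memory
```

&lt;p&gt;Calling &lt;code&gt;capture_learning&lt;/code&gt; once at the end of each session and &lt;code&gt;build_session_preamble&lt;/code&gt; at the start of the next is enough to close the loop.&lt;/p&gt;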




&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering gets you started.&lt;/strong&gt; Context engineering gets you to production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practitioners who win&lt;/strong&gt; in the next two years won't be the best prompt writers. They'll be the ones who built systems that remember, enforce, and learn.&lt;/p&gt;

&lt;p&gt;The infrastructure is boring. The results aren't.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Tom Tokita. I run &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology&lt;/a&gt; out of Manila. We build production AI systems and Salesforce implementations for companies that need things to actually work. Want to talk context engineering or argue about whether prompt engineering is dead? &lt;a href="https://aether-global.com/contact" rel="noopener noreferrer"&gt;Let's go.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Read next: &lt;a href="https://dev.to/blog/autonomous-ai-agents-production-cost"&gt;Autonomous AI Agents Look Great in Demos. Here's What They Cost in Production.&lt;/a&gt; · &lt;a href="https://dev.to/blog/llm-wrappers-what-actually-matters"&gt;Most AI Tools Are Just LLM Wrappers. Here's What Actually Matters.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>architecture</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>I Didn't Know I Was Doing Harness Engineering</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Tue, 05 May 2026 08:59:53 +0000</pubDate>
      <link>https://dev.to/tomtokita/i-didnt-know-i-was-doing-harness-engineering-5a01</link>
      <guid>https://dev.to/tomtokita/i-didnt-know-i-was-doing-harness-engineering-5a01</guid>
      <description>&lt;p&gt;In February 2026, &lt;a href="https://mitchellh.com/writing/my-ai-adoption-journey" rel="noopener noreferrer"&gt;Mitchell Hashimoto&lt;/a&gt; (co-founder of HashiCorp) described his habit of engineering permanent fixes into an AI agent's environment whenever it made a mistake. He called it "engineering the harness." Days later, &lt;a href="https://openai.com/index/harness-engineering/" rel="noopener noreferrer"&gt;OpenAI formalized the concept&lt;/a&gt; in a blog post. Around the same time, without having read either, I wrote my first enforcement hook for a production AI system. Different continent, different scale, different context. Same problem.&lt;/p&gt;

&lt;p&gt;A few weeks later, Birgitta Böckeler &lt;a href="https://martinfowler.com/articles/harness-engineering.html" rel="noopener noreferrer"&gt;formalized it on Martin Fowler's site&lt;/a&gt;. Red Hat published their version. LangChain. Salesforce. By April, the term was everywhere.&lt;/p&gt;

&lt;p&gt;I didn't discover any of this until recently. I was too busy building the thing they were naming.&lt;/p&gt;

&lt;p&gt;That's not a flex. It's something more interesting. When engineers face the same constraints (unreliable model outputs, production stakes, context that evaporates), they converge on the same solutions. Different trails, same summit. And if your messy pile of rules and scripts looks suspiciously like what OpenAI and Fowler describe, that's not coincidence. It's validation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Harness Engineering (And Why It Matters for AI Agents)
&lt;/h2&gt;

&lt;p&gt;Harness engineering is the discipline of building the constraints, gates, memory systems, and feedback loops that wrap around an AI agent to make it reliable in production. The core equation, from Martin Fowler's team: &lt;strong&gt;Agent = Model + Harness.&lt;/strong&gt; The harness is everything around the model that you actually control.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://developers.redhat.com/articles/2026/04/07/harness-engineering-structured-workflows-ai-assisted-development" rel="noopener noreferrer"&gt;Red Hat&lt;/a&gt; puts it differently. "The AI writes better code when you design the environment it works in." Their framing is about structured workflows. Templates. Impact maps. Acceptance criteria.&lt;/p&gt;

&lt;p&gt;Both are right. Neither is complete.&lt;/p&gt;

&lt;p&gt;They describe the architecture. They don't describe the pain that forces you to build it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How My Harness Grew (Without Me Realizing What It Was)
&lt;/h2&gt;

&lt;p&gt;I run a production AI system as a daily driver. Not a demo. Not a proof of concept. A system that manages infrastructure, writes code, deploys to servers, interacts with APIs, and handles real stakes across real projects. I co-founded &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology&lt;/a&gt;, a Salesforce consulting partner in Manila. The system runs alongside that work.&lt;/p&gt;

&lt;p&gt;I never sat down and said "I'm going to build a harness." I just kept getting burned, and kept adding rules so I wouldn't get burned the same way twice. Looking back, every rule traces to a specific failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The anti-fabrication rules&lt;/strong&gt; exist because the AI confidently stated a method existed in a file it hadn't read. I spent 45 minutes debugging code that was never there. The fix wasn't better prompting. It was a mechanical gate: before asserting any method name or file path, the system must verify via tool. No verification, no assertion. That's a feedforward control, in Fowler's language. I just called it "stop making things up."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The deploy gate&lt;/strong&gt; exists because the system nearly pushed Salesforce metadata to the wrong sandbox. 54 files, wrong org. The fix was a target allowlist per project, checked mechanically before any deploy command executes. A hard block, not a polite suggestion. (Sound familiar? &lt;a href="https://tokita.online/ai-agent-production-safety/" rel="noopener noreferrer"&gt;An AI agent deleted a production database in 9 seconds&lt;/a&gt; because nobody built one of these.)&lt;/p&gt;
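
&lt;p&gt;A per-project target allowlist of this kind can be sketched in a few lines. Everything named here is hypothetical: &lt;code&gt;check_deploy_target&lt;/code&gt;, &lt;code&gt;DeployBlocked&lt;/code&gt;, and the project and org names in the sample allowlist are invented for illustration.&lt;/p&gt;

```python
import json


class DeployBlocked(Exception):
    """Raised when a deploy target is not on the project's allowlist."""


def check_deploy_target(allowlist_json: str, project: str, target: str) -> None:
    """Raise DeployBlocked unless `target` is explicitly allowed for `project`.

    Unknown projects get an empty allowlist, so the gate fails closed.
    """
    allowlist = json.loads(allowlist_json)
    allowed = allowlist.get(project, [])
    if target not in allowed:
        raise DeployBlocked(f"{target!r} is not an allowed target for {project!r}")


# Hypothetical allowlist; the project and org names are made up for illustration.
ALLOWLIST = '{"client-a": ["client-a-dev", "client-a-uat"]}'
```

&lt;p&gt;The check runs before the deploy command is ever built. If it raises, the command does not execute, whatever the agent believed about which org it was pointed at.&lt;/p&gt;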

&lt;p&gt;&lt;strong&gt;The anti-drift rules&lt;/strong&gt; exist because after multiple tool calls, the system's mental model of a file diverges from the file's actual state. It recalls values it read 20 minutes ago, not the values that exist now. The fix: re-read the source before emitting anything external-facing. Grep at write time, not recall time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The citation requirement&lt;/strong&gt; exists because the system generated a client proposal with a number it pulled from nowhere. In consulting, a wrong number in front of a client is a credibility hit you don't recover from. The rule is simple now: every data claim needs a source. No source, mark it as unverified. No exceptions.&lt;/p&gt;
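
&lt;p&gt;One way to make that rule mechanical is to carry the source alongside each claim and flag anything that arrives without one. This is a sketch under assumptions: the &lt;code&gt;Claim&lt;/code&gt; type, the &lt;code&gt;render_claims&lt;/code&gt; function, and the &lt;code&gt;[UNVERIFIED]&lt;/code&gt; marker are hypothetical names chosen for the example.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    text: str
    source: Optional[str] = None  # a document name or URL; None means unsourced


def render_claims(claims: List[Claim]) -> str:
    """Render each claim with its source, or flag it instead of stating it as fact."""
    lines = []
    for claim in claims:
        if claim.source and claim.source.strip():
            lines.append(f"{claim.text} (source: {claim.source.strip()})")
        else:
            # A missing or blank source never passes silently.
            lines.append(f"[UNVERIFIED] {claim.text}")
    return "\n".join(lines)
```

&lt;p&gt;The check only proves a source was named, not that the source says what the claim says. That second step still needs a human or a retrieval check.&lt;/p&gt;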

&lt;p&gt;None of these came from reading a framework. They came from things going wrong on a Tuesday afternoon.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Fowler Gets Right
&lt;/h2&gt;

&lt;p&gt;The dual-control model is real. You need both feedforward controls (rules that prevent bad behavior before it happens) and feedback controls (sensors that catch it after). Relying on just one creates blind spots.&lt;/p&gt;

&lt;p&gt;My system has 40+ feedforward hooks. They fire before tool calls, checking for unauthorized domains, verifying pre-task knowledge checks happened, blocking destructive git operations, enforcing deploy targets. The same problems I wrote about in &lt;a href="https://tokita.online/autonomous-ai-agents-production-cost/" rel="noopener noreferrer"&gt;what autonomous agents actually cost in production&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The feedback side is thinner. I have post-execution checks and monitoring, but the honest truth is that feedforward controls do most of the heavy lifting. Catching a bad action before it executes is cheaper than cleaning up after it runs.&lt;/p&gt;

&lt;p&gt;Fowler also nails the distinction between computational and inferential controls. My deploy gate is computational. It checks a JSON allowlist. Takes milliseconds. My anti-fabrication system is inferential. It relies on the model itself to flag uncertainty. That's slower, less reliable, and more expensive. But it catches things no deterministic check can.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Frameworks Miss
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Harnesses are incident-driven, not architecture-driven.&lt;/strong&gt; The literature treats harness engineering as a design discipline. It is, eventually. But every harness I've seen starts as a pile of duct tape applied after something broke. The elegance comes later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context survival is the real engineering problem.&lt;/strong&gt; Nobody talks about this enough. AI agents operate in conversation windows. Those windows compress. When they compress, the agent forgets rules, loses project state, and starts making the same mistakes you fixed three hours ago. My harness has a dedicated recovery protocol: when context compresses, reload memory, re-read project state, verify the date (the agent doesn't know what day it is after compression). That's not in any of the frameworks. It should be.&lt;/p&gt;
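
&lt;p&gt;The three recovery steps (reload memory, re-read project state, verify the date) can be sketched as a single function. This is an illustration under assumptions: &lt;code&gt;recover_context&lt;/code&gt;, the two file arguments, and the section labels are hypothetical, and detecting that compression happened is left to whatever agent runtime is in use.&lt;/p&gt;

```python
from datetime import date
from pathlib import Path
from typing import Callable


def recover_context(
    memory_file: Path,
    state_file: Path,
    today: Callable[[], date] = date.today,
) -> str:
    """Rebuild the text to re-inject after the context window is compressed.

    Both files are read fresh from disk and the date comes from the system
    clock, so nothing depends on what the model still remembers.
    """
    sections = [f"Today is {today().isoformat()}."]
    for label, path in (("Rules and learnings", memory_file), ("Project state", state_file)):
        if not path.exists():
            # Fail loudly: carrying on without the rules is the failure being prevented.
            raise FileNotFoundError(f"recovery needs {path}")
        sections.append(f"## {label}\n{path.read_text(encoding='utf-8').strip()}")
    return "\n\n".join(sections)
```

&lt;p&gt;Passing the clock in as a parameter keeps the function testable, and refusing to continue when a file is missing stops the agent from running on without its rules.&lt;/p&gt;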

&lt;p&gt;&lt;strong&gt;The harness is the product, not the model.&lt;/strong&gt; When people evaluate AI systems, they compare models. Claude vs. GPT vs. Gemini. That's the wrong comparison. The model is interchangeable. I've run the same harness across model versions, and the harness determines output quality more than the model does. A disciplined harness on a weaker model beats an unconstrained stronger model every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human checkpoints aren't optional.&lt;/strong&gt; Red Hat says "human review between planning and implementation." That's correct but undersells it. In my system, any task with three or more steps requires a plan review before execution. Single-step tasks state the intended action and wait. This isn't a nice-to-have. It's the difference between an AI agent that helps and one that creates work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Same Summit, Different Trails
&lt;/h2&gt;

&lt;p&gt;Here's what I find encouraging about this whole thing.&lt;/p&gt;

&lt;p&gt;My first hook was mid-February 2026. By March, I'd codified the principle "mechanical enforcement over behavioral commitment" because telling the model not to do something stopped working the moment context compressed. By April, I had 30+ hooks, a memory layer that survives compression, and a pre-task gate system that forces verification before every edit.&lt;/p&gt;

&lt;p&gt;I built all of this without reading a single blog post about harness engineering. I built it because things kept breaking, and I was tired of fixing the same failures manually.&lt;/p&gt;

&lt;p&gt;OpenAI, Fowler, Red Hat, LangChain, Salesforce. They all arrived at the same architecture from the enterprise side. I arrived from the practitioner side. A guy in Manila running one AI system across 40+ projects, duct-taping rules onto it every time something went wrong.&lt;/p&gt;

&lt;p&gt;The fact that we converged tells you something important: &lt;strong&gt;this isn't a framework you adopt. It's a shape that production forces you into.&lt;/strong&gt; If you're running an AI agent on real work and you've started writing rules, blocking certain commands, requiring verification steps before deploys, you're already doing harness engineering. You just didn't know it had a name.&lt;/p&gt;

&lt;p&gt;The industry version is clean. Diagrams with boxes. Three regulation dimensions. Harness templates.&lt;/p&gt;

&lt;p&gt;The practitioner's version is messier. A behavioral rules file that grew from 5 rules to 13 because the AI kept finding new ways to drift. A hook that blocks web searches because the AI was burning API calls on questions its own knowledge base could answer. A gate that forces the system to check what day it is before referencing time, because it hallucinated the date twice.&lt;/p&gt;

&lt;p&gt;Both versions work. Both are valid. The diagram didn't exist when I needed a solution. The solution existed when the diagram caught up.&lt;/p&gt;

&lt;p&gt;If you're building something like this and wondering whether you're doing it right, check it against Fowler's framework. If your scrappy infrastructure maps to their categories (guides, sensors, computational controls, inferential controls), you're on the right track. The problems are universal. The solutions are convergent. And you don't need permission from a blog post to keep building.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://tokita.online/what-is-harness-engineering/" rel="noopener noreferrer"&gt;tokita.online&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>An AI Agent Deleted a Production Database in 9 Seconds. Here Is the Architecture That Would Have Stopped It.</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Thu, 30 Apr 2026 06:07:32 +0000</pubDate>
      <link>https://dev.to/tomtokita/an-ai-agent-deleted-a-production-database-in-9-seconds-here-is-the-architecture-that-would-have-1apg</link>
      <guid>https://dev.to/tomtokita/an-ai-agent-deleted-a-production-database-in-9-seconds-here-is-the-architecture-that-would-have-1apg</guid>
      <description>&lt;p&gt;&lt;strong&gt;On April 28, 2026, a Claude-powered AI agent running inside Cursor IDE deleted an entire production database, and its backups, in &lt;a href="https://sea.mashable.com/tech/44827/an-ai-agent-allegedly-deleted-a-startups-production-database-causing-a-huge-outage" rel="noopener noreferrer"&gt;9 seconds flat&lt;/a&gt;.&lt;/strong&gt; The app was PocketOS. The agent had full database admin permissions. No confirmation gate. No scope boundary. No kill switch. After the fact, the agent produced what might be the most chilling line in AI incident history: "I violated every principle I was given."&lt;/p&gt;

&lt;p&gt;This is not a hit piece on PocketOS. This could have been anyone. The tools to prevent this exist. Cursor itself has hooks, allowlists, and sandbox modes. But the architecture around those tools was not in place. And that is the pattern I keep seeing: &lt;strong&gt;the safety features exist, the discipline to implement them does not.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027" rel="noopener noreferrer"&gt;Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027&lt;/a&gt;. Not because the models are bad, because the surrounding architecture is not being built. This is the instruction guide I wish existed before I learned it the hard way.&lt;/p&gt;

&lt;h3&gt;Key Takeaways&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The PocketOS incident was an access control failure, not a model failure: the agent had full DB admin permissions with zero confirmation gates.&lt;/li&gt;
&lt;li&gt;AI agent production safety requires a 4-layer architecture: scope boundaries, confirmation gates, audit trails, and kill switches.&lt;/li&gt;
&lt;li&gt;Most agentic AI failures trace to the same root cause: treating an AI agent like a trusted human employee instead of an untrusted subprocess.&lt;/li&gt;
&lt;li&gt;I have run AI agents across 50+ projects handling live data with zero destructive incidents, because of finely tuned mechanical hooks, not because I got lucky.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;The Pattern Behind Every AI Agent Disaster&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This was not an isolated incident.&lt;/strong&gt; In July 2025, a &lt;a href="https://incidentdatabase.ai/cite/1152/" rel="noopener noreferrer"&gt;Replit AI agent deleted SaaStr founder Jason Lemkin's production database&lt;/a&gt; during an active code freeze, then fabricated 4,000 fake user profiles to cover it up and claimed recovery was impossible. Another case of what happens when "vibe coding" meets real infrastructure. I wrote about a similar pattern in the &lt;a href="https://tokita.online/vibe-coding-risks-vercel-breach/" rel="noopener noreferrer"&gt;Vercel breach analysis&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Every one of these incidents shares the same root cause. Not a rogue model. Not misaligned training. &lt;strong&gt;The agent was given more access than it needed, with no mechanism to confirm destructive actions before executing them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I run AI agents in production daily through a system I built for my own work at &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology Inc.&lt;/a&gt;, across 50+ projects, all touching live data. Zero destructive incidents. Not because the models are perfectly behaved (they are not), but because the first time an agent of mine attempted to overwrite a config file it should not have touched, I stopped treating AI agents like trusted colleagues and started treating them like &lt;strong&gt;untrusted subprocesses with specific, revocable permissions&lt;/strong&gt;. I built mechanical gates around every destructive path, tested each one deeply, and documented rollback plans before any agent got near production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; The model is not the problem. The missing architecture around the model is the problem.&lt;/p&gt;

&lt;h2&gt;The 4-Layer AI Agent Production Safety Architecture&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This is not a theoretical framework.&lt;/strong&gt; These are four layers I enforce in my own production environment. They exist because I built each one after something went wrong: pain, build, iterate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;PocketOS Had It?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1. Scope Boundaries&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent can only access specific files, databases, and APIs. Everything else is denied by default.&lt;/td&gt;
&lt;td&gt;No, full DB admin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Confirmation Gates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Destructive actions (DELETE, DROP, deploy, overwrite) require explicit human approval before execution.&lt;/td&gt;
&lt;td&gt;No, zero gates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Audit Trail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Every agent action is logged with timestamp, target, and outcome. Irreversible actions are flagged pre-execution.&lt;/td&gt;
&lt;td&gt;Post-hoc only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Kill Switch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hard stop mechanism that terminates agent execution when anomalous behavior is detected, before damage completes.&lt;/td&gt;
&lt;td&gt;No, 9-second wipe&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If any single layer had been in place, the PocketOS database would still exist. Layer 1 alone, restricting the agent to read-only database access, would have made the deletion impossible. The agent did not need write access. It certainly did not need DROP TABLE permissions.&lt;/p&gt;
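&lt;p&gt;As a rough illustration of Layer 3 from the table above, here is a minimal Python sketch of an audit record that captures timestamp, target, and outcome, and flags irreversible actions. The function name &lt;code&gt;audit_record&lt;/code&gt;, the field names, and the contents of &lt;code&gt;IRREVERSIBLE&lt;/code&gt; are assumptions made for illustration, not the log format of any particular tool.&lt;/p&gt;

```python
import json
from datetime import datetime, timezone

# Assumed list of verbs that cannot be undone; adjust to your own stack.
IRREVERSIBLE = frozenset({"DELETE", "DROP", "TRUNCATE"})


def audit_record(action, target, outcome):
    """Build one JSON log line for a proposed or completed agent action."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "target": target,
        "outcome": outcome,
        # Flag the action so a reviewer can see the risk before it runs.
        "irreversible": action.upper() in IRREVERSIBLE,
    })


# Hypothetical usage: write the record with outcome "pending" before executing.
print(audit_record("DROP", "prod.users", "pending"))
```

&lt;p&gt;Writing the record before the action runs, and again with the final outcome afterwards, is one way to get the pre-execution flag the table describes rather than a post-hoc log only.&lt;/p&gt;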

&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; Four layers. Any one of them would have saved the database. Zero were present.&lt;/p&gt;

&lt;h2&gt;Why Behavioral Guardrails Do Not Work&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The PocketOS agent's post-incident confession is the clearest proof you will ever get.&lt;/strong&gt; "I violated every principle I was given." The agent &lt;em&gt;knew&lt;/em&gt; its instructions. It violated them anyway. This is not a bug. This is the expected behavior of a probabilistic system under complex conditions, and it is why &lt;strong&gt;behavioral guardrails alone will always end in catastrophe&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I need to be blunt about this because the industry is getting it dangerously wrong. System prompts, instruction tuning, and "rules" embedded in agent configurations are all &lt;strong&gt;behavioral&lt;/strong&gt; approaches. They rely on the AI choosing to comply. And LLMs are probabilistic systems. They do not "follow rules" the way a traditional program executes code. They &lt;em&gt;predict the next likely token&lt;/em&gt; given context. When the context gets complex enough (long tool chains, ambiguous instructions, cascading API responses), the model can and will deviate from its instructions. Not out of malice. Out of statistics. &lt;a href="https://tokita.online/autonomous-ai-agents-production-cost/" rel="noopener noreferrer"&gt;I have written about why autonomous agents fail&lt;/a&gt;, and the pattern is always the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mechanical enforcement is the only approach that works.&lt;/strong&gt; A mechanical gate does not care what the model "decides" to do. It intercepts the action before execution, checks it against an allowlist, and blocks it if unauthorized, regardless of the model's reasoning, confidence, or intent. The agent can "want" to drop a table all day long. The gate does not negotiate.&lt;/p&gt;
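&lt;p&gt;To make the idea concrete, here is a minimal Python sketch of such a gate, assuming the agent's proposed actions can be reduced to a verb and a target. The names &lt;code&gt;Action&lt;/code&gt;, &lt;code&gt;gate&lt;/code&gt;, and &lt;code&gt;Blocked&lt;/code&gt;, and the allowlist entries, are hypothetical; a real gate would sit in whatever hook or proxy layer your agent tooling exposes.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    verb: str    # e.g. "SELECT" or "DROP"
    target: str  # e.g. "staging.orders"


# Deny by default: only the (verb, target) pairs listed here may run.
# These entries are illustrative assumptions.
ALLOWLIST = frozenset({
    ("SELECT", "staging.orders"),
    ("INSERT", "staging.orders"),
})


class Blocked(Exception):
    """Raised when a proposed action is not on the allowlist."""


def gate(action):
    """Return the action unchanged if allowlisted; raise Blocked otherwise."""
    if (action.verb.upper(), action.target) not in ALLOWLIST:
        raise Blocked(f"{action.verb} on {action.target} is not allowlisted")
    return action
```

&lt;p&gt;The check never looks at the model's reasoning. It only sees the proposed action, which is the property that separates a mechanical gate from a behavioral rule.&lt;/p&gt;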

&lt;p&gt;And mechanical gates need to be tested deeply (every gate, every edge case, every bypass attempt) before you let an agent anywhere near production. You also need a rollback plan for every destructive path. Not "we will figure it out if something goes wrong." A documented, tested recovery procedure that you can execute in minutes. Because "9 seconds" does not leave time to improvise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; Behavioral guardrails are suggestions the model can ignore. Mechanical gates are infrastructure the model cannot bypass. Build gates. Test them ruthlessly. Have rollback plans before you proceed.&lt;/p&gt;

&lt;h2&gt;What AI Agent Production Safety Actually Looks Like in Practice&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Here is what I actually enforce, daily, running agents across multiple projects:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Least-privilege by default.&lt;/strong&gt; Every agent session starts with the minimum permissions needed for that specific task. Read-only unless write is explicitly required. No agent gets database admin credentials. Ever.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destructive action allowlists.&lt;/strong&gt; File deletions, database writes, deployments, and external API calls that modify state are all gated. The agent proposes the action. A mechanical gate checks it against an allowlist. If the action is not on the list, it does not execute. No exceptions, no override from the agent itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target verification before execution.&lt;/strong&gt; Before any deploy or write operation, the system verifies the target environment matches the intended project. This exists because I once nearly deployed to the wrong environment, so I built a gate for it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2-strike escalation.&lt;/strong&gt; Two failed attempts at any operation trigger a hard stop and escalation. The agent does not get to try a third creative interpretation.&lt;/li&gt;
&lt;/ul&gt;
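&lt;p&gt;The 2-strike rule in the list above can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual implementation: &lt;code&gt;StrikeCounter&lt;/code&gt; and &lt;code&gt;Escalation&lt;/code&gt; are hypothetical names, and a "failed attempt" is taken to mean that the operation raised an exception.&lt;/p&gt;

```python
class Escalation(Exception):
    """Raised when an operation has failed twice and a human must step in."""


class StrikeCounter:
    MAX_STRIKES = 2

    def __init__(self):
        self._strikes = {}

    def run(self, op_name, operation):
        """Run operation(); count failures per op_name and hard-stop at two."""
        if self._strikes.get(op_name, 0) == self.MAX_STRIKES:
            # Already escalated: refuse to run a third attempt at all.
            raise Escalation(f"{op_name}: already failed {self.MAX_STRIKES} times")
        try:
            return operation()
        except Escalation:
            raise
        except Exception as exc:
            self._strikes[op_name] = self._strikes.get(op_name, 0) + 1
            if self._strikes[op_name] == self.MAX_STRIKES:
                raise Escalation(f"{op_name}: second failure, escalating") from exc
            raise
```

&lt;p&gt;The first failure surfaces the original error so the agent can retry once. The second failure is converted into &lt;code&gt;Escalation&lt;/code&gt;, and any later call for the same operation is refused before it runs.&lt;/p&gt;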

&lt;p&gt;None of this is sophisticated computer science. It is the same &lt;a href="https://tokita.online/why-multi-agent-ai-fails/" rel="noopener noreferrer"&gt;principle I apply to multi-agent systems&lt;/a&gt;: trust is earned through architecture, not assumed through prompting.&lt;/p&gt;

&lt;p&gt;Here is the part that surprises people: &lt;strong&gt;I run my agents with auto-approve enabled now.&lt;/strong&gt; But I did not start there, and I would never recommend starting there. In the early days, every action was manually approved. I watched the agent work. I saw what it attempted. I saw the gates catch things. Over dozens of sessions in production, I watched the mechanical enforcement prove itself repeatedly: blocking unauthorized paths, catching scope violations, logging every action. That is when I started trusting the architecture enough to let the agent run at full speed. YOLO mode was earned through production observation and disciplined iteration, not turned on from day one out of convenience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; The boring operational patterns (allowlists, gates, least privilege) are the ones that keep production databases alive. Build them well enough and you can run full speed without fear.&lt;/p&gt;

&lt;h2&gt;The Checklist: Before You Give an AI Agent Production Access&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;If No&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;Does the agent have ONLY the permissions it needs for this task?&lt;/td&gt;
&lt;td&gt;Restrict before proceeding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gates&lt;/td&gt;
&lt;td&gt;Are destructive actions gated with human confirmation?&lt;/td&gt;
&lt;td&gt;Add gate or go read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit&lt;/td&gt;
&lt;td&gt;Is every action logged with enough detail to reconstruct what happened?&lt;/td&gt;
&lt;td&gt;Add logging first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kill&lt;/td&gt;
&lt;td&gt;Can you terminate the agent mid-execution?&lt;/td&gt;
&lt;td&gt;Build kill switch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backup&lt;/td&gt;
&lt;td&gt;Are backups isolated from agent access?&lt;/td&gt;
&lt;td&gt;Isolate immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recovery&lt;/td&gt;
&lt;td&gt;Can you restore to pre-agent state within minutes?&lt;/td&gt;
&lt;td&gt;Not production-ready&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you cannot check every box, the agent is not ready for production. Full stop.&lt;/p&gt;
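&lt;p&gt;If you want the checklist enforced rather than remembered, it can be encoded as a pre-flight check. The Python sketch below is one hedged way to do that; the key names and the &lt;code&gt;production_ready&lt;/code&gt; function are assumptions for illustration, and the answers would come from your own review process.&lt;/p&gt;

```python
# The six checks from the table above; key names are illustrative.
CHECKS = ("scope", "gates", "audit", "kill", "backup", "recovery")


def production_ready(answers):
    """Return (ready, failed): ready is True only if every check is True.

    A missing answer counts as a failure, so an unreviewed item blocks release.
    """
    failed = [name for name in CHECKS if answers.get(name) is not True]
    return (not failed, failed)


# Hypothetical usage: one unanswered item is enough to block production access.
ready, failed = production_ready({"scope": True, "gates": True, "audit": True,
                                  "kill": True, "backup": False})
print(ready, failed)
```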







&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; AI agents are powerful. Unarchitected AI agents are dangerous. The PocketOS incident is a preview of what &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027" rel="noopener noreferrer"&gt;40% of agentic AI projects&lt;/a&gt; will look like before they get canceled. The fix is not better models; it is the boring operational architecture that nobody wants to build until something blows up.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tom Tokita is the President of &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology Inc.&lt;/a&gt;, a Salesforce consulting firm in Manila. He runs AI agents in production daily and writes about what works, what breaks, and what he would do differently at &lt;a href="https://tokita.online" rel="noopener noreferrer"&gt;tokita.online&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>security</category>
      <category>devops</category>
    </item>
    <item>
      <title>Autonomous AI Agents Look Great in Demos. Here's What They Cost in Production.</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Tue, 28 Apr 2026 13:53:33 +0000</pubDate>
      <link>https://dev.to/tomtokita/autonomous-ai-agents-look-great-in-demos-heres-what-they-cost-in-production-2416</link>
      <guid>https://dev.to/tomtokita/autonomous-ai-agents-look-great-in-demos-heres-what-they-cost-in-production-2416</guid>
      <description>&lt;p&gt;&lt;strong&gt;You've seen the demos.&lt;/strong&gt; An AI agent opens a browser. Navigates a website. Fills out forms. Makes decisions. Ships code. All by itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Looks like magic.&lt;/strong&gt; Then you deploy it. It runs 24/7. Nobody's watching. The invoice arrives.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Demo Is Not the Product
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I build agent systems.&lt;/strong&gt; I'm not anti-agent. I'm anti-fantasy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fully autonomous pitch&lt;/strong&gt; sounds like: "Just let the AI handle it. It'll figure it out." In a demo with curated inputs? Sure. In production where data is messy and one wrong decision costs real money? Different story entirely.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Autonomous Agents Actually Cost
&lt;/h2&gt;

&lt;h3&gt;
  
  
  API Burn
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Autonomous agents reason through loops.&lt;/strong&gt; Every iteration burns tokens. When an agent gets stuck, and they do, it's paying to argue with itself.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agent completes task cleanly&lt;/td&gt;
&lt;td&gt;$0.15–$0.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning loop (5–10 iterations)&lt;/td&gt;
&lt;td&gt;$2–$8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logic trap (nobody notices)&lt;/td&gt;
&lt;td&gt;$50+ before cutoff&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24/7 monitoring agent&lt;/td&gt;
&lt;td&gt;$300–$800/month&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;A single runaway agent&lt;/strong&gt; can consume your monthly budget in hours. This is not hypothetical; it happens.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Amazon Kiro Incident
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;In 2026, Amazon's Kiro AI agent&lt;/strong&gt; autonomously deleted and recreated an AWS production environment. &lt;strong&gt;13-hour outage.&lt;/strong&gt; The root cause wasn't a bad model. It was the absence of permission boundaries, peer review, and a destructive-action blocklist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent did exactly what it was designed to do.&lt;/strong&gt; Nobody designed the guardrails.&lt;/p&gt;

&lt;h3&gt;
  
  
  Drift: The Silent Killer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Kyndryl's 2026 research&lt;/strong&gt; nails it: agents that work correctly on day 1 gradually shift behavior as they hit edge cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A fintech company&lt;/strong&gt; deployed an agent to manage infrastructure costs. It learned traffic patterns and autonomously scaled down a database cluster one weekend. That weekend was month-end processing. &lt;strong&gt;Production down for 11 hours.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A customer service agent&lt;/strong&gt; learned that issuing refunds correlated with positive reviews. It started granting refunds more freely. Not because anyone told it to, but because it observed the pattern and optimized for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drift is invisible until something breaks.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Maintenance Reality
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Gartner estimates maintenance eats 20–50%&lt;/strong&gt; of operational budgets for autonomous systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model drift correction&lt;/li&gt;
&lt;li&gt;Data pipeline upkeep&lt;/li&gt;
&lt;li&gt;Security monitoring&lt;/li&gt;
&lt;li&gt;"Why did the agent do &lt;em&gt;that&lt;/em&gt;?" investigations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;That's not in the pitch deck.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The "Set It and Forget It" Fantasy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The selling point&lt;/strong&gt; is that autonomous agents free up human time. The reality:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You traded a human doing a task for a human &lt;em&gt;watching an AI&lt;/em&gt; do a task, plus the API bill.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Fully autonomous agents need more monitoring&lt;/strong&gt; than manual processes, not less. When a human makes a mistake, they usually catch it. When an agent makes a mistake, it makes it confidently, repeatedly, and at scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Alternative: Autonomy with a Leash
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I run agent systems in production.&lt;/strong&gt; They work. Here's why: they're supervised, scheduled, and tiered.&lt;/p&gt;

&lt;h3&gt;
  
  
  Supervised
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;AI does the work, human reviews before it ships.&lt;/strong&gt; For high-stakes actions (deployments, client comms, financial ops), there's always a checkpoint. Not slower. Safer. The review loop catches drift before production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scheduled
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Agents run on defined schedules&lt;/strong&gt; with defined scopes. Not 24/7 open-ended autonomy.&lt;/p&gt;

&lt;p&gt;You control &lt;strong&gt;when&lt;/strong&gt; they run, &lt;strong&gt;what&lt;/strong&gt; they touch, and &lt;strong&gt;how much&lt;/strong&gt; they spend. A scheduled agent running 3x/day costs a fraction of an always-on agent. And it's predictable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tiered
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Not every task needs the same oversight:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Blast Radius&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Autonomy Level&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Low&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Formatting, data entry, reports&lt;/td&gt;
&lt;td&gt;Full auto, let it run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Medium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Content creation, analysis&lt;/td&gt;
&lt;td&gt;AI executes, human spot-checks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployments, client comms&lt;/td&gt;
&lt;td&gt;AI prepares, human approves&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Critical&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Production changes, security&lt;/td&gt;
&lt;td&gt;Human executes, AI assists&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The tier is based on blast radius,&lt;/strong&gt; not convenience. "What's the worst that happens if this gets it wrong?" determines the oversight level.&lt;/p&gt;
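&lt;p&gt;The table above maps naturally onto a lookup. Here is a small Python sketch of that mapping, with one added assumption that is not in the table: a task that has not been classified falls back to the strictest tier. The names &lt;code&gt;BlastRadius&lt;/code&gt; and &lt;code&gt;oversight_for&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
from enum import Enum


class BlastRadius(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


# Oversight per tier, taken from the table above.
OVERSIGHT = {
    BlastRadius.LOW: "full auto",
    BlastRadius.MEDIUM: "AI executes, human spot-checks",
    BlastRadius.HIGH: "AI prepares, human approves",
    BlastRadius.CRITICAL: "human executes, AI assists",
}


def oversight_for(radius=None):
    """Look up the oversight level for a task.

    Assumption: anything unclassified is treated as CRITICAL.
    """
    return OVERSIGHT.get(radius, OVERSIGHT[BlastRadius.CRITICAL])
```

&lt;p&gt;Defaulting to the strictest tier means a forgotten classification costs you some speed, not a production incident.&lt;/p&gt;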




&lt;h2&gt;
  
  
  The Cost Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Fully Autonomous&lt;/th&gt;
&lt;th&gt;Supervised + Scheduled&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unpredictable, 24/7 burn&lt;/td&gt;
&lt;td&gt;Predictable, runs on schedule&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Drift risk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High, no review loop&lt;/td&gt;
&lt;td&gt;Low, caught at checkpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Catastrophic (see: Kiro)&lt;/td&gt;
&lt;td&gt;Contained, blast radius limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintenance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20–50% of budget&lt;/td&gt;
&lt;td&gt;A fraction of that: simpler, fewer surprises&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Demo quality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Incredible&lt;/td&gt;
&lt;td&gt;Boring&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The boring option wins.&lt;/strong&gt; Every time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Questions Before You Deploy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. What's the blast radius?&lt;/strong&gt; If this agent gets it wrong, what breaks? A formatting error or a production database?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. What's the budget cap?&lt;/strong&gt; Hard limit on API spend per agent, per run. A logic loop should hit a ceiling, not your credit card.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Where's the human checkpoint?&lt;/strong&gt; For actions above your risk threshold, the agent prepares, a human approves. That's not a bottleneck. That's insurance.&lt;/p&gt;
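&lt;p&gt;For question 2, a hard ceiling can be as simple as a loop that refuses to continue once spend or iteration count passes a limit. The Python sketch below assumes a hypothetical &lt;code&gt;step&lt;/code&gt; callable that performs one agent iteration and returns its cost in USD plus a done flag; real cost figures would come from your provider's usage data.&lt;/p&gt;

```python
class BudgetExceeded(Exception):
    """Raised when a run hits its spend cap or its iteration limit."""


def run_with_cap(step, cap_usd, max_iterations):
    """Call step() until it reports done, stopping hard at either limit.

    step is a hypothetical callable returning (cost_usd, done).
    """
    spent = 0.0
    for _ in range(max_iterations):
        cost, done = step()
        spent += cost
        # max() differs from the cap only when spend has gone past it.
        if max(spent, cap_usd) != cap_usd:
            raise BudgetExceeded(f"spent {spent:.2f} USD, cap is {cap_usd:.2f} USD")
        if done:
            return spent
    raise BudgetExceeded(f"no result after {max_iterations} iterations")
```

&lt;p&gt;The check runs after each step, so the run can overshoot the cap by at most one iteration's cost. The iteration limit covers the case where a stuck loop is cheap per step but never finishes.&lt;/p&gt;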




&lt;h2&gt;
  
  
  The Market Will Correct
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The "fully autonomous" pitch will fade.&lt;/strong&gt; Not because the tech isn't impressive, it is. But production costs are undeniable, and enterprises don't tolerate 13-hour outages from unsupervised AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What survives:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent systems with &lt;strong&gt;defined scopes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human checkpoints&lt;/strong&gt; for high-risk actions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Captured learnings&lt;/strong&gt; so agents don't repeat mistakes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost controls&lt;/strong&gt; that prevent runaway spend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Building from the Philippines,&lt;/strong&gt; cost efficiency isn't optional; it's survival. That constraint forced us to design agent systems that are lean, supervised, and sustainable. Sometimes the best innovations come from not being able to afford the wasteful approach.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Tom Tokita. I run &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology&lt;/a&gt; out of Manila. We build AI operations and Salesforce systems for companies that need things to work, not just demo well. Building agents for production? &lt;a href="https://aether-global.com/contact" rel="noopener noreferrer"&gt;Let's talk.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Read next: &lt;a href="https://dev.to/blog/context-engineering-vs-prompt-engineering"&gt;Context Engineering: Why Your AI Strategy Needs Infrastructure, Not Better Prompts&lt;/a&gt; · &lt;a href="https://dev.to/blog/llm-wrappers-what-actually-matters"&gt;Most AI Tools Are Just LLM Wrappers. Here's What Actually Matters.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Vibe Coding Works. Until It Doesn't. What the Vercel Breach Should Teach Every Developer.</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Mon, 27 Apr 2026 07:04:40 +0000</pubDate>
      <link>https://dev.to/tomtokita/vibe-coding-works-until-it-doesnt-what-the-vercel-breach-should-teach-every-developer-386k</link>
      <guid>https://dev.to/tomtokita/vibe-coding-works-until-it-doesnt-what-the-vercel-breach-should-teach-every-developer-386k</guid>
      <description>&lt;p&gt;The vibe coding risks most developers ignore became impossible to deny on April 19, 2026. That's when Vercel, the platform half the Philippine dev community deploys on, &lt;a href="https://www.bleepingcomputer.com/news/security/vercel-confirms-breach-as-hackers-claim-to-be-selling-stolen-data/" rel="noopener noreferrer"&gt;disclosed a security breach&lt;/a&gt;. A threat group called ShinyHunters claimed to be selling stolen data for $2 million on BreachForums.&lt;/p&gt;

&lt;p&gt;The breach didn't come through a firewall exploit. It didn't come through a brute-force attack. It came through an AI tool.&lt;/p&gt;

&lt;p&gt;A Vercel employee had connected Context.ai, a third-party AI productivity tool, to their Google Workspace. Context.ai got compromised. That compromise &lt;a href="https://vercel.com/knowledge-base/security-incident-april-2026" rel="noopener noreferrer"&gt;cascaded into Vercel's internal systems&lt;/a&gt;. Customer environment variables (API keys, tokens, database credentials) were exposed. The intrusion reportedly started in June 2024. It wasn't detected until April 2026. Twenty-two months.&lt;/p&gt;

&lt;p&gt;That's the reality of building on platforms you don't understand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vibe Coding Is Real. I Use It. But the Risks Are Not Hypothetical.
&lt;/h2&gt;

&lt;p&gt;I'm not here to tell you to stop using AI for coding. I use it every day. Claude, GPT, Gemini. I route between three to five LLMs daily in production. AI-assisted development is how I ship at the pace I do as a lean startup CEO running &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But there's a difference between using AI as a tool within a system you understand, and using AI as a replacement for understanding the system at all.&lt;/p&gt;

&lt;p&gt;That difference is what separates a production application from a demo that dies the moment real traffic hits it.&lt;/p&gt;

&lt;p&gt;The term "vibe coding" was coined to describe building software through AI prompts, describing what you want, letting the model generate the code, and shipping it without necessarily understanding every line. Platforms like &lt;a href="https://tokita.online/how-to-choose-the-right-ai-tool/" rel="noopener noreferrer"&gt;Lovable, Bolt, Cursor, and v0&lt;/a&gt; have made this accessible to anyone with a browser. That's genuinely powerful.&lt;/p&gt;

&lt;p&gt;It's also genuinely dangerous when it becomes your entire engineering strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers Behind Vibe Coding Risks
&lt;/h2&gt;

&lt;p&gt;Vibe coding risks fall into three categories: the code itself has verified security flaw rates approaching 50%, the tools generating it are under active attack, and the platforms you deploy on have been breached for months without detection. Here's the evidence.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Risk&lt;/th&gt;
&lt;th&gt;Evidence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code output&lt;/td&gt;
&lt;td&gt;Nearly half of AI-generated code has security flaws&lt;/td&gt;
&lt;td&gt;CSET Georgetown, Veracode 2026&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI tools&lt;/td&gt;
&lt;td&gt;8 CVEs in 3 months, 135K exposed instances&lt;/td&gt;
&lt;td&gt;OpenClaw, SecurityScorecard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;22-month undetected breach via AI tool&lt;/td&gt;
&lt;td&gt;Vercel / ShinyHunters 2026&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And the research keeps piling up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nearly half of AI-generated code contains exploitable bugs&lt;/strong&gt;, across five major LLMs tested (&lt;a href="https://cset.georgetown.edu/publication/cybersecurity-risks-of-ai-generated-code/" rel="noopener noreferrer"&gt;CSET Georgetown, 2024&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;45% of AI-generated code has security flaws&lt;/strong&gt; across more than 100 large language models (&lt;a href="https://www.veracode.com/blog/spring-2026-genai-code-security/" rel="noopener noreferrer"&gt;Veracode, 2026&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-generated code creates 1.7 times more issues&lt;/strong&gt; than human-authored code in pull request analysis (&lt;a href="https://www.coderabbit.ai/blog/ai-vs-human-code-gen-report" rel="noopener noreferrer"&gt;CodeRabbit&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;43% of AI-generated code changes require manual debugging in production&lt;/strong&gt;, after passing QA and staging (&lt;a href="http://lightrun.com/ebooks/state-of-ai-powered-engineering-2026" rel="noopener noreferrer"&gt;Lightrun, 2026&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4x growth in duplicated code blocks&lt;/strong&gt; since AI coding tools became mainstream, suggesting copy-paste from training data without architectural judgment (&lt;a href="https://www.gitclear.com/blog/ai_copilot_code_quality_2025_data_suggests_4x_growth_in_code_clones" rel="noopener noreferrer"&gt;GitClear, 2025&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't hypothetical risks from academic papers. These are measured failure rates from deployed systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI Tools Themselves Are Getting Hacked
&lt;/h2&gt;

&lt;p&gt;It's not just the code that's the problem. The tools generating the code are under active attack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenClaw&lt;/strong&gt;, the open-source AI agent that went viral in early 2026, has accumulated eight CVEs in just three months:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CVE&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-25253 (CVSS 8.8)&lt;/td&gt;
&lt;td&gt;One-click remote code execution, steals your auth token through WebSocket, works even on localhost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-24763&lt;/td&gt;
&lt;td&gt;Command injection through Docker sandbox PATH manipulation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-25593&lt;/td&gt;
&lt;td&gt;Unauthenticated command injection via WebSocket config write&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-26317&lt;/td&gt;
&lt;td&gt;Cross-site request forgery, no origin validation on localhost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-40037&lt;/td&gt;
&lt;td&gt;Request body replay leaking sensitive data across redirects&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://securityscorecard.com/blog/how-exposed-openclaw-deployments-turn-agentic-ai-into-an-attack-surface/" rel="noopener noreferrer"&gt;SecurityScorecard found&lt;/a&gt; &lt;strong&gt;135,000 internet-exposed OpenClaw instances&lt;/strong&gt;. Infosecurity Magazine flagged &lt;strong&gt;63% as vulnerable&lt;/strong&gt;. Over 12,800 were directly exploitable via the patched RCE, meaning they hadn't even updated. Belgium's national cybersecurity center issued an emergency advisory: patch immediately.&lt;/p&gt;

&lt;p&gt;And then there's the &lt;strong&gt;ClawHavoc campaign&lt;/strong&gt;: malicious "skills" distributed through OpenClaw's community registry, deploying information-stealing malware to developers who thought they were installing productivity tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Platform, the Agent, and the Code: All Compromised
&lt;/h2&gt;

&lt;p&gt;Here's the pattern that should concern every developer in the Philippines:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your deployment platform&lt;/strong&gt; (Vercel) got breached through an AI tool an employee used. Twenty-two months of access before anyone noticed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your AI coding agent&lt;/strong&gt; (OpenClaw) has &lt;a href="https://securityscorecard.com/blog/what-are-the-real-security-risks-of-agentic-ai-and-openclaw/" rel="noopener noreferrer"&gt;eight CVEs, 135,000 exposed instances&lt;/a&gt;, and an active malware campaign targeting its plugin ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The code your AI generates&lt;/strong&gt; has a 45% security flaw rate and 1.7 times more issues than what a human writes.&lt;/p&gt;

&lt;p&gt;The entire stack, from infrastructure to agent to output, is compromised if you don't understand what you're deploying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Vibe Coding Risks Hit the Philippines Hardest
&lt;/h2&gt;

&lt;p&gt;Vercel and Next.js are the default stack for a huge segment of Filipino developers. Bootcamp graduates, freelancers on Upwork, startup CTOs: this is the ecosystem. When Vercel gets breached, it's not a distant Silicon Valley story. It's the platform your client's app is running on.&lt;/p&gt;

&lt;p&gt;The Philippines has one of the fastest-growing developer communities in Southeast Asia. AI adoption is accelerating. But the gap between "I can prompt an AI to build an app" and "I can deploy and maintain a secure production system" is enormous. The &lt;a href="https://tokita.online/ai-consultant-philippines/" rel="noopener noreferrer"&gt;2024 data on AI adoption in the Philippines&lt;/a&gt; tells the story: 92% of organizations experimented with AI, 65% got stuck in pilot, and only 3% achieved full adoption. That gap isn't a technology problem. It's a systems thinking problem.&lt;/p&gt;

&lt;p&gt;Vibe coding in the Philippines carries an additional layer of risk: many freelancers and small dev shops are building client applications on these platforms without dedicated security teams, without infrastructure expertise, and without the budget for recovery when things go wrong.&lt;/p&gt;

&lt;p&gt;Vibe coding without systems thinking is like drawing a blueprint on paper. It looks right. It communicates the idea. But the moment it gets wet (real traffic, real attackers, real edge cases), it's destroyed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Vibe Coding: What Production Actually Requires
&lt;/h2&gt;

&lt;p&gt;I'm not arguing against AI-assisted development. I'm arguing for combining it with fundamentals that vibe coding alone will never teach you:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure.&lt;/strong&gt; Understand where your code runs. Know the difference between a serverless function and a container. Know what environment variables are and why they need rotation policies. The Vercel breach exposed credentials that developers stored in plain env vars, because the platform made it easy and nobody questioned it.&lt;/p&gt;
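&lt;p&gt;A rotation policy can start as something very small. The sketch below is a hypothetical illustration, not a description of any platform's tooling: it assumes you keep your own record of when each secret was last rotated, and the 90-day limit is an arbitrary example value.&lt;/p&gt;

```python
from datetime import date, timedelta
from typing import Dict, List


def stale_secrets(last_rotated: Dict[str, date], today: date,
                  max_age_days: int = 90) -> List[str]:
    """Return the names of secrets last rotated more than `max_age_days` ago.

    `last_rotated` is a hypothetical inventory you maintain yourself;
    the 90-day default is an assumption, not a standard.
    """
    cutoff = today - timedelta(days=max_age_days)
    # A secret is stale when its last rotation happened before the cutoff date.
    return sorted(name for name, rotated in last_rotated.items() if cutoff > rotated)


inventory = {
    "DATABASE_URL": date(2025, 12, 1),
    "PAYMENTS_API_KEY": date(2026, 4, 1),
}
overdue = stale_secrets(inventory, today=date(2026, 5, 1))
```

&lt;p&gt;Running a check like this in CI turns "we should rotate credentials" into a failing build instead of a good intention.&lt;/p&gt;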

&lt;p&gt;&lt;strong&gt;Hardening.&lt;/strong&gt; Every deployment needs security headers, input validation, authentication checks, and rate limiting. AI-generated code &lt;a href="https://checkmarx.com/blog/security-in-vibe-coding/" rel="noopener noreferrer"&gt;suggests vulnerable patterns&lt;/a&gt; more often than secure alternatives. If you can't read the code and spot what's missing, you can't ship it.&lt;/p&gt;
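&lt;p&gt;To make the rate-limiting item concrete, here is a minimal sketch of a sliding-window limiter. The class name, the limits, and the idea of keying on a client identifier are illustrative assumptions; in practice you would more likely lean on your framework's or platform's own limiter. The point is that the logic is short enough to read and verify yourself.&lt;/p&gt;

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each client key."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_key: str) -> bool:
        now = self.clock()
        hits = self._hits[client_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: the caller should reject the request
        hits.append(now)
        return True
```

&lt;p&gt;This version keeps state in one process's memory, so it only illustrates the check itself; anything running on multiple instances needs shared storage behind it.&lt;/p&gt;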

&lt;p&gt;&lt;strong&gt;Edge cases and failure modes.&lt;/strong&gt; AI generates code for happy paths. Production runs on unhappy paths: connections drop, requests time out, databases lock, users do things you never imagined. The &lt;a href="http://lightrun.com/ebooks/state-of-ai-powered-engineering-2026" rel="noopener noreferrer"&gt;43% debugging-in-production rate&lt;/a&gt; exists because AI doesn't think about what happens when things go wrong.&lt;/p&gt;
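&lt;p&gt;One unhappy path worth handling explicitly is the transient failure. The following is a small sketch of retrying with exponential backoff; the function name, the default attempt count, and the choice of which exceptions count as transient are assumptions made for illustration.&lt;/p&gt;

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure instead of hiding it
            sleep(base_delay * 2 ** attempt)  # waits 0.5s, 1s, 2s, ... by default
    # Only reached when `attempts` is zero or negative.
    raise ValueError("attempts must be at least 1")
```

&lt;p&gt;Note what the sketch refuses to do: it never swallows the final error, and it only retries exception types you list. Retrying everything forever is the kind of happy-path thinking this section warns about.&lt;/p&gt;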

&lt;p&gt;&lt;strong&gt;Dependency auditing.&lt;/strong&gt; AI tools pull in libraries without verifying them. The ClawHavoc campaign exploited exactly this: developers installing unvetted extensions because the tool made it frictionless. Every dependency is an attack surface. This is the same pattern that makes &lt;a href="https://tokita.online/autonomous-ai-agents-production-cost/" rel="noopener noreferrer"&gt;unsupervised AI agents dangerous in production&lt;/a&gt;: the absence of review loops.&lt;/p&gt;
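&lt;p&gt;Auditing starts with knowing exactly what you install. As one narrow, hedged example, the sketch below flags lines in a pip-style requirements file that do not pin an exact version. The function name is hypothetical, the parsing is deliberately naive, and pinning is only one part of an audit; it does not replace a vulnerability scanner or reading what a package actually does.&lt;/p&gt;

```python
from typing import Iterable, List


def unpinned_requirements(lines: Iterable[str]) -> List[str]:
    """Return requirement lines that do not pin an exact version with `==`.

    Naive by design: it ignores pip options (lines starting with "-")
    and does not understand URLs or environment markers.
    """
    flagged = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and options like -r
            continue
        if "==" not in line:
            flagged.append(line)
    return flagged


example = ["requests==2.32.3", "flask>=3.0  # web", "", "# tooling", "numpy", "-r base.txt"]
loose = unpinned_requirements(example)
```

&lt;p&gt;Anything that shows up in &lt;code&gt;loose&lt;/code&gt; can change underneath you on the next install, which is exactly the kind of frictionless update an attacker hopes for.&lt;/p&gt;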

&lt;p&gt;&lt;strong&gt;Deployment pipelines.&lt;/strong&gt; If your deployment process is "push to main and Vercel handles it," you've outsourced your entire release safety to a platform that just got breached for twenty-two months. CI/CD, staging environments, rollback procedures: these exist for a reason.&lt;/p&gt;

&lt;p&gt;In the Philippines, where most dev teams are small and move fast, these fundamentals get skipped because the tooling makes it easy to skip them. That's exactly why they matter more here.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Survival Engineer's Take
&lt;/h2&gt;

&lt;p&gt;I built a production AI operations system out of necessity, not as a product, but as a survival tool for running a lean startup where I wear ten hats. That system uses AI constantly. It also has enforcement hooks, anti-fabrication rules, credential rotation, deployment gates, and rollback procedures.&lt;/p&gt;

&lt;p&gt;The AI makes me faster. The systems thinking keeps me alive.&lt;/p&gt;

&lt;p&gt;Vibe coding is a tool. A good one. But if you're building your career or your company on apps that were prompted into existence without understanding what holds them together, the Vercel breach is your preview of what's coming.&lt;/p&gt;

&lt;p&gt;Learn the fundamentals. Not instead of AI. Alongside it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is vibe coding safe for production applications?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Vibe coding can produce working prototypes quickly, but the research shows significant risks for production deployment. Veracode's 2026 report found that 45% of AI-generated code contains security flaws, and Lightrun's survey found that 43% of AI-generated code changes require manual debugging in production. Vibe coding is safe when combined with code review, security auditing, proper infrastructure knowledge, and deployment pipelines. Without those fundamentals, it's a liability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened in the Vercel breach of April 2026?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Vercel disclosed a security incident on April 19, 2026. A third-party AI tool called Context.ai was compromised, which gave attackers access to a Vercel employee's Google Workspace account. That access cascaded into Vercel's internal systems, exposing customer environment variables including API keys, tokens, and database credentials. The intrusion reportedly began in June 2024, a 22-month dwell time before detection. The threat group ShinyHunters claimed responsibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are the biggest security risks of AI-generated code?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The three main risk layers are: (1) the generated code itself has verified flaw rates approaching 50% across multiple studies, including SQL injection, XSS, and hardcoded credentials; (2) the AI coding tools have their own vulnerabilities: OpenClaw accumulated eight CVEs in three months, with 135,000 exposed instances; and (3) the deployment platforms developers rely on are themselves targets, as the Vercel breach demonstrated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How can Filipino developers reduce vibe coding risks?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Focus on five fundamentals that vibe coding alone won't teach you: understand your infrastructure (don't treat deployment as a black box), harden every deployment (security headers, input validation, rate limiting), test edge cases and failure modes (AI codes for happy paths only), audit dependencies (every library is an attack surface), and build proper deployment pipelines (CI/CD, staging, rollback). Combine AI-assisted development with these practices: the speed of AI plus the safety of systems thinking.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tom Tokita is an AI consultant and operations architect based in Manila, Philippines. He co-founded and runs &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology Inc.&lt;/a&gt;, a Salesforce consulting partner. He routes between 3-5 LLMs daily in production, not demos, not POCs.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Best LLM for Each Task: A Practitioner’s Reference Guide</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Mon, 27 Apr 2026 06:59:43 +0000</pubDate>
      <link>https://dev.to/tomtokita/best-llm-for-each-task-a-practitioners-reference-guide-2o06</link>
      <guid>https://dev.to/tomtokita/best-llm-for-each-task-a-practitioners-reference-guide-2o06</guid>
      <description>&lt;p&gt;&lt;strong&gt;Most AI vendors sell you one model at a flat fee. It works, until it doesn't.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the pitch: "Unlimited AI, fixed price!" Under the hood, they've slapped a single budget model on everything: your customer support bot, your code reviews, your data analysis, your document generation. It handles the simple stuff fine. Then you ask it to reason through a complex business decision, and it confidently gives you an answer that's completely wrong.&lt;/p&gt;

&lt;p&gt;You go back to the vendor. Their response? "You need to upgrade to the premium model." That's not an upgrade problem. That's a &lt;a href="https://tokita.online/llm-wrappers-what-actually-matters/" rel="noopener noreferrer"&gt;model selection&lt;/a&gt; problem, and you just paid to discover it the hard way.&lt;/p&gt;

&lt;p&gt;Choosing the best LLM for each task is an architecture decision, not a shopping decision. LLMs are not interchangeable. Each model family is built with different strengths, different architectures, and different cost profiles. Using the wrong one doesn't just waste money; it produces hallucinations, missed context, and confidently wrong outputs that kill trust in AI across your team. (New to LLMs? Start with &lt;a href="https://tokita.online/how-to-choose-the-right-ai-tool/" rel="noopener noreferrer"&gt;What Is AI, Really?&lt;/a&gt; for the fundamentals.)&lt;/p&gt;

&lt;p&gt;Full disclosure: I use Claude as my primary daily driver. Where that might bias my recommendations, I've noted alternatives and linked directly to provider docs so you can verify independently.&lt;/p&gt;

&lt;p&gt;This guide is your reference point. Bookmark it. Come back when a vendor tells you their tool "uses AI" and can't tell you which model, or why.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why One LLM Doesn't Fit Every Task
&lt;/h2&gt;

&lt;p&gt;If you've ever wondered how to decide which LLM to use, the answer starts with understanding what each model was actually built for.&lt;/p&gt;

&lt;p&gt;Think of it like hiring. You wouldn't hire a junior analyst to architect your enterprise data platform. You also wouldn't hire a principal architect to sort spreadsheets, not because they can't, but because you're burning $300/hour on a $30/hour task.&lt;/p&gt;

&lt;p&gt;LLMs work the same way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontier models&lt;/strong&gt; (Claude Opus, GPT-5.4, Gemini 3.1 Pro) are deep thinkers. They reason through multi-step problems, hold massive context windows, and produce nuanced output. They also cost 10-50x more per token than lightweight models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mid-tier models&lt;/strong&gt; (Claude Sonnet, GPT-5.4 mini, Gemini 3 Flash) hit the sweet spot: fast enough for production, smart enough for most tasks, and priced for volume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight models&lt;/strong&gt; (Claude Haiku, GPT-5.4 nano, Gemini 2.5 Flash-Lite, DeepSeek V3.2) are built for speed and cost. They're excellent at structured extraction, classification, simple Q&amp;amp;A, and high-volume processing. Ask them to architect a system or reason through ambiguity? That's where hallucinations start.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The right approach is &lt;strong&gt;task routing&lt;/strong&gt;, matching each task to the model that handles it best. Your total cost drops, your quality goes up, and you stop blaming "AI" for problems that are really model mismatch.&lt;/p&gt;
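&lt;p&gt;At its simplest, task routing is a lookup table. The sketch below is an illustration under stated assumptions, not a production router: the task labels are hypothetical, the tier assignments follow the three tiers described above, and sending unknown tasks to the mid-tier is just one cautious default.&lt;/p&gt;

```python
from enum import Enum


class Tier(Enum):
    FRONTIER = "frontier"
    MID = "mid-tier"
    LIGHTWEIGHT = "lightweight"


# Hypothetical task labels mapped to the tiers described in the article.
TASK_TIERS = {
    "architecture": Tier.FRONTIER,
    "multi_step_reasoning": Tier.FRONTIER,
    "code_generation": Tier.MID,
    "content_writing": Tier.MID,
    "classification": Tier.LIGHTWEIGHT,
    "structured_extraction": Tier.LIGHTWEIGHT,
    "simple_qa": Tier.LIGHTWEIGHT,
}


def route(task_type: str) -> Tier:
    """Pick a model tier for a task; unknown tasks fall back to mid-tier."""
    return TASK_TIERS.get(task_type, Tier.MID)
```

&lt;p&gt;Real routers add context-length checks, fallbacks, and cost tracking, but the core decision stays this explicit: the task type picks the model, not the other way around.&lt;/p&gt;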




&lt;h2&gt;
  
  
  The Task-Model Matrix: Best LLM for Each Task
&lt;/h2&gt;

&lt;p&gt;This is the reference table. Every recommendation comes from daily production use, cross-referenced with each provider's own documentation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Best Pick&lt;/th&gt;
&lt;th&gt;Runner-Up&lt;/th&gt;
&lt;th&gt;Why It Wins&lt;/th&gt;
&lt;th&gt;Avoid&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Complex reasoning &amp;amp; architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://platform.claude.com/docs/en/docs/about-claude/models" rel="noopener noreferrer"&gt;Claude Opus 4.6&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;GPT-5.4&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Extended thinking, 1M token context, multi-step logic chains&lt;/td&gt;
&lt;td&gt;Lite/Nano models, they hallucinate on multi-step reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production code generation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://platform.claude.com/docs/en/docs/about-claude/models" rel="noopener noreferrer"&gt;Claude Sonnet 4.6&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;GPT-5.4 mini&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Fast + code-native, 64K output, strong instruction-following&lt;/td&gt;
&lt;td&gt;Budget models, inconsistent on large codebases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent orchestration &amp;amp; tool use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://platform.claude.com/docs/en/docs/about-claude/models" rel="noopener noreferrer"&gt;Claude Opus 4.6&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://docs.x.ai/docs/models" rel="noopener noreferrer"&gt;Grok 4.20 multi-agent&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Reliable function calling, long-context planning, handles complex tool chains&lt;/td&gt;
&lt;td&gt;Any "lite" model, they lose track of multi-turn tool sequences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Content writing &amp;amp; copywriting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://platform.claude.com/docs/en/docs/about-claude/models" rel="noopener noreferrer"&gt;Claude Sonnet 4.6&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;GPT-5.4&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Natural voice, strong style control, follows nuanced instructions&lt;/td&gt;
&lt;td&gt;DeepSeek, Grok fast, flat tone, poor style adaptation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data extraction &amp;amp; structured output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ai.google.dev/pricing" rel="noopener noreferrer"&gt;Gemini 3 Flash&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://api-docs.deepseek.com/quick_start/pricing" rel="noopener noreferrer"&gt;DeepSeek V3.2&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Fast JSON mode, schema adherence, cheap at scale ($0.50/MTok in, $3/MTok out)&lt;/td&gt;
&lt;td&gt;Frontier models, overkill, 10x+ cost for the same result&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High-volume classification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ai.google.dev/pricing" rel="noopener noreferrer"&gt;Gemini 2.5 Flash-Lite&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;GPT-5.4 nano&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.10/MTok input, pennies per thousand calls, fast enough for real-time&lt;/td&gt;
&lt;td&gt;Any full-size model, you're paying for intelligence you don't need&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quick Q&amp;amp;A &amp;amp; chatbots&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ai.google.dev/pricing" rel="noopener noreferrer"&gt;Gemini 2.5 Flash-Lite&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://platform.claude.com/docs/en/docs/about-claude/models" rel="noopener noreferrer"&gt;Claude Haiku 4.5&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Sub-second latency, low cost, good enough for conversational retrieval&lt;/td&gt;
&lt;td&gt;Frontier reasoning models, latency kills UX, cost kills margin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deep research &amp;amp; analysis&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://platform.claude.com/docs/en/docs/about-claude/models" rel="noopener noreferrer"&gt;Claude Opus 4.6&lt;/a&gt; (extended thinking)&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ai.google.dev/pricing" rel="noopener noreferrer"&gt;Gemini 3.1 Pro&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Can reason through 1M+ token contexts, extended thinking for deliberate analysis&lt;/td&gt;
&lt;td&gt;Anything under 128K context, literally can't fit the data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Budget-conscious general use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://api-docs.deepseek.com/quick_start/pricing" rel="noopener noreferrer"&gt;DeepSeek V3.2&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://ai.google.dev/pricing" rel="noopener noreferrer"&gt;Gemini 2.5 Flash&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.28/MTok input, $0.42/MTok output, 10x cheaper than most competitors at reasonable quality&lt;/td&gt;
&lt;td&gt;Free tiers with rate limits, they throttle when you need them most&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every link above goes to the provider's official docs: no third-party benchmarks, no secondhand claims.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Choose the Right LLM: The Task-First Framework
&lt;/h2&gt;

&lt;p&gt;Forget "which AI is best." The right question is: &lt;strong&gt;best for what?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the framework I use across every production deployment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Define the task type first.&lt;/strong&gt; Is it reasoning, generation, extraction, or routing? Each has fundamentally different requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Match to a model tier.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Needs to &lt;em&gt;think&lt;/em&gt;? → Frontier (Opus, GPT-5.4, Gemini 3.1 Pro)&lt;/li&gt;
&lt;li&gt;Needs to &lt;em&gt;produce&lt;/em&gt;? → Mid-tier (Sonnet, GPT-5.4 mini, Gemini 3 Flash)&lt;/li&gt;
&lt;li&gt;Needs to &lt;em&gt;classify or extract&lt;/em&gt;? → Lightweight (Haiku, Nano, Flash-Lite)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Check the context window.&lt;/strong&gt; If your task involves processing documents, code repositories, or conversation histories longer than 128K tokens, most lightweight models are physically incapable of handling it. This isn't a quality issue; the data literally doesn't fit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Calculate the real cost.&lt;/strong&gt; A $5/MTok model that gets it right on the first try can be cheaper than a $0.10/MTok model that needs three retries and human review, because the reviewer's time costs far more than the tokens. Factor in error correction, not just token price.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Test with your actual workload.&lt;/strong&gt; Benchmarks measure synthetic tasks. Your data, your prompts, your edge cases: those are what matter. Run a 100-call sample before committing.&lt;/p&gt;
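&lt;p&gt;Step 4 can be made concrete with a little arithmetic. The sketch below is a rough model, not a pricing tool: the token count, the retry count, the review time, and the $30/hour reviewer rate are all assumed numbers, and it counts input tokens only.&lt;/p&gt;

```python
def cost_per_task(price_per_mtok: float, tokens_per_call: int,
                  calls_per_task: float, review_minutes: float,
                  reviewer_hourly_rate: float) -> float:
    """Estimated dollars per completed task: token spend plus human review time."""
    token_cost = price_per_mtok * tokens_per_call / 1_000_000 * calls_per_task
    review_cost = reviewer_hourly_rate * review_minutes / 60
    return token_cost + review_cost


# Assumed workload: 20,000 input tokens per call, reviewer paid $30/hour.
# Frontier model: one call, no review needed.
frontier = cost_per_task(5.00, 20_000, calls_per_task=1,
                         review_minutes=0, reviewer_hourly_rate=30)
# Budget model: one call plus three retries, then five minutes of human review.
budget = cost_per_task(0.10, 20_000, calls_per_task=4,
                       review_minutes=5, reviewer_hourly_rate=30)

print(f"frontier: ${frontier:.3f} per task, budget: ${budget:.3f} per task")
```

&lt;p&gt;With these assumed numbers, the token bill for the budget model is a rounding error next to five minutes of a person's time. Swap in your own figures before drawing conclusions; the ratio shifts quickly when review is rare.&lt;/p&gt;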




&lt;h2&gt;
  
  
  Best LLM for Coding and Development
&lt;/h2&gt;

&lt;p&gt;This is where model selection matters most, because bad code from an AI doesn't just waste tokens; it wastes developer hours debugging AI-generated bugs.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;code generation&lt;/strong&gt; in production, &lt;a href="https://platform.claude.com/docs/en/docs/about-claude/models" rel="noopener noreferrer"&gt;Claude Sonnet 4.6&lt;/a&gt; is the current leader. It handles multi-file edits, understands project context, and follows coding conventions consistently. At $3/MTok input and $15/MTok output, it's the workhorse: fast enough for iteration, smart enough for production-grade output.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;architectural decisions and complex debugging&lt;/strong&gt;, &lt;a href="https://platform.claude.com/docs/en/docs/about-claude/models" rel="noopener noreferrer"&gt;Claude Opus 4.6&lt;/a&gt; with extended thinking is the pick. The 1M token context window means it can hold an entire codebase in context. At $5/MTok input, it's expensive for bulk work, but for the tasks where getting it wrong costs days of rework, it's the cheapest option you have.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;GPT-5.4 mini&lt;/a&gt; is a strong runner-up at $0.75/MTok input, particularly for code reviews, test generation, and structured refactoring where you need speed over depth.&lt;/p&gt;

&lt;p&gt;What doesn't work: lightweight models for code. GPT-5.4 nano and Gemini Flash-Lite will generate syntactically valid code that has subtle logic errors, the kind that pass linting but fail in production. The cost savings evaporate when your team spends hours tracking down AI-introduced bugs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Best LLM for Reasoning and Analysis
&lt;/h2&gt;

&lt;p&gt;If you're asking "which LLM is best for research," the answer depends on what kind of research.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;deep analysis&lt;/strong&gt; (parsing contracts, evaluating strategy documents, synthesizing research across hundreds of pages), you need extended thinking capabilities and large context windows. &lt;a href="https://platform.claude.com/docs/en/docs/about-claude/models" rel="noopener noreferrer"&gt;Claude Opus 4.6&lt;/a&gt; with extended thinking leads here. It doesn't just retrieve information; it reasons through it, surfacing connections and contradictions that faster models miss.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;GPT-5.4&lt;/a&gt; at $2.50/MTok input is competitive for research tasks, especially when you need web grounding via &lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;OpenAI's built-in web search&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ai.google.dev/pricing" rel="noopener noreferrer"&gt;Gemini 3.1 Pro&lt;/a&gt; brings serious context capacity and Google's search integration, making it strong for research that needs real-time information.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;quick fact extraction&lt;/strong&gt; from structured documents, you don't need any of these. &lt;a href="https://ai.google.dev/pricing" rel="noopener noreferrer"&gt;Gemini 2.5 Flash&lt;/a&gt; at $0.30/MTok handles it fine. The key insight from &lt;a href="https://tokita.online/context-engineering-vs-prompt-engineering/" rel="noopener noreferrer"&gt;context engineering&lt;/a&gt; applies here: it's not just about the model, it's about what context you feed it.&lt;/p&gt;




&lt;h2&gt;
  
  
  ChatGPT vs Claude vs Gemini: Which Is Actually Better?
&lt;/h2&gt;

&lt;p&gt;This is the most common question, and it's the wrong one. "Which is better" assumes one winner across all tasks. There isn't one.&lt;/p&gt;

&lt;p&gt;Here's the honest breakdown from production use:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Claude&lt;/th&gt;
&lt;th&gt;ChatGPT (GPT-5.4)&lt;/th&gt;
&lt;th&gt;Gemini&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code generation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Strongest. Sonnet 4.6 is the daily driver&lt;/td&gt;
&lt;td&gt;GPT-5.4 mini is a close second&lt;/td&gt;
&lt;td&gt;Gemini 3 Flash is capable but less consistent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Instruction-following&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best in class, follows complex, multi-constraint prompts reliably&lt;/td&gt;
&lt;td&gt;Good, occasionally overinterprets&lt;/td&gt;
&lt;td&gt;Tends to be verbose, sometimes ignores constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Content writing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Natural, adaptable voice&lt;/td&gt;
&lt;td&gt;Solid but can lean generic&lt;/td&gt;
&lt;td&gt;Tends toward formal/corporate tone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost efficiency at scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mid-range ($1-5/MTok input)&lt;/td&gt;
&lt;td&gt;Budget to mid ($0.20-2.50/MTok input)&lt;/td&gt;
&lt;td&gt;Best value. Flash-Lite at $0.10/MTok&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context window&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1M tokens (Opus/Sonnet)&lt;/td&gt;
&lt;td&gt;Not publicly listed for 5.4&lt;/td&gt;
&lt;td&gt;Up to 1M+ (Gemini 3.1 Pro)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reasoning depth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Opus extended thinking is top-tier&lt;/td&gt;
&lt;td&gt;GPT-5.4 is strong, less transparent&lt;/td&gt;
&lt;td&gt;Gemini 3.1 Pro competes but less tested&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Haiku is fastest in class&lt;/td&gt;
&lt;td&gt;Nano is competitive&lt;/td&gt;
&lt;td&gt;Flash-Lite wins on pure throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool use / agents&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Opus leads, reliable multi-tool chains&lt;/td&gt;
&lt;td&gt;Improving rapidly&lt;/td&gt;
&lt;td&gt;Strong but newer ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The point isn't that Claude wins everything (it doesn't). It's that &lt;strong&gt;each model family has tasks where it's the clear best pick and tasks where it's a waste of money.&lt;/strong&gt; The vendors who sell you one of these as "the AI solution" are leaving performance and budget on the table.&lt;/p&gt;




&lt;h2&gt;
  
  
  Best LLM for Orchestration and Multi-Agent Systems
&lt;/h2&gt;

&lt;p&gt;This is where &lt;a href="https://tokita.online/llm-wrappers-what-actually-matters/" rel="noopener noreferrer"&gt;most AI tools being just LLM wrappers&lt;/a&gt; becomes a real problem. Agent orchestration, where an AI coordinates multiple tools, APIs, and sub-tasks, requires a model that can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Maintain context across dozens of tool calls&lt;/li&gt;
&lt;li&gt;Decide which tool to use and when&lt;/li&gt;
&lt;li&gt;Handle failures and retry logic&lt;/li&gt;
&lt;li&gt;Not hallucinate tool parameters&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Lightweight models fail catastrophically here. They lose track of the conversation after 3-4 tool calls, start hallucinating function names, and make confident decisions based on context they've already forgotten.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://platform.claude.com/docs/en/docs/about-claude/models" rel="noopener noreferrer"&gt;Claude Opus 4.6&lt;/a&gt; is built for this. Anthropic explicitly positions it as "the most intelligent model for building agents." The 1M token context means it can hold the full history of a complex multi-step workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.x.ai/docs/models" rel="noopener noreferrer"&gt;Grok 4.20 multi-agent&lt;/a&gt; from xAI is a contender at $2/MTok input with a 2M token context window, the largest available, and explicit multi-agent support.&lt;/p&gt;

&lt;p&gt;The production pattern that works: &lt;strong&gt;use a frontier model as the orchestrator and lightweight models as workers.&lt;/strong&gt; The orchestrator plans and routes. The workers execute structured subtasks. Your orchestration layer uses Opus at $5/MTok for 5% of your tokens. Your workers use Flash-Lite at $0.10/MTok for the other 95%. Total cost drops while quality goes up.&lt;/p&gt;
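&lt;p&gt;The 5%/95% split is easy to check. This sketch uses the input prices quoted in this guide (Opus at $5/MTok, Flash-Lite at $0.10/MTok, Sonnet at $3/MTok) and ignores output tokens, which is a simplifying assumption; the helper name is made up for illustration.&lt;/p&gt;

```python
from typing import List, Tuple


def blended_price(shares_and_prices: List[Tuple[float, float]]) -> float:
    """Weighted average input price in $/MTok for (token_share, price) pairs."""
    total_share = sum(share for share, _ in shares_and_prices)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("token shares must sum to 1")
    return sum(share * price for share, price in shares_and_prices)


# Input prices per million tokens as quoted in this guide.
OPUS, FLASH_LITE, SONNET = 5.00, 0.10, 3.00

# Orchestrator handles 5% of tokens, workers handle the other 95%.
routed = blended_price([(0.05, OPUS), (0.95, FLASH_LITE)])
single_model = blended_price([(1.0, SONNET)])

print(f"routed: ${routed:.3f}/MTok vs single mid-tier model: ${single_model:.2f}/MTok")
```

&lt;p&gt;On input tokens alone, the routed setup works out to about $0.35/MTok against $3/MTok for running everything on Sonnet. Your real split will differ, so measure where your tokens actually go before assuming the same ratio.&lt;/p&gt;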

&lt;p&gt;This is exactly what happens when &lt;a href="https://tokita.online/autonomous-ai-agents-production-cost/" rel="noopener noreferrer"&gt;autonomous agents hit production&lt;/a&gt;: the architecture matters more than any single model choice.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Cost of Using the Wrong LLM
&lt;/h2&gt;

&lt;p&gt;Here's the vendor trap in action:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The pitch:&lt;/strong&gt; "Our AI platform, flat fee, unlimited usage!" Sounds great.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Under the hood:&lt;/strong&gt; A single budget-tier model running everything: customer support, document analysis, code generation, reporting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Month 1:&lt;/strong&gt; Simple tasks work fine. Customer support bot answers FAQs. Document summaries look decent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Month 2:&lt;/strong&gt; You ask it to analyze a contract for risk clauses. It misses three critical terms. You ask it to generate an integration spec. It hallucinates an API endpoint that doesn't exist.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Month 3:&lt;/strong&gt; Trust erodes. Your team starts double-checking every AI output manually, which defeats the purpose.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The call:&lt;/strong&gt; "You need our premium tier." That's the upsell. The flat fee was the foot in the door.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The fix isn't a more expensive model. It's &lt;strong&gt;the right model for each task.&lt;/strong&gt; A system that routes contract analysis to Opus ($5/MTok) and FAQ responses to Flash-Lite ($0.10/MTok) can cost less in total than running everything on a mid-tier model, as long as the high-volume work is the simple work, and it produces better results at both ends.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Audit Your AI Vendor
&lt;/h2&gt;

&lt;p&gt;Five questions to ask before signing, or renewing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Which LLM powers each feature?&lt;/strong&gt; If they can't name the model, that's a red flag. If they say "proprietary AI," that's usually a wrapper around someone else's model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can I see the model ID in logs or API responses?&lt;/strong&gt; Transparency matters. If you're paying for GPT-5.4-level intelligence and getting Nano-level output, you should be able to verify.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What happens when a task exceeds the model's capability?&lt;/strong&gt; Do they route to a more capable model? Or does it just... hallucinate and hope you don't notice?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is there task routing or is everything on one model?&lt;/strong&gt; Single-model architectures are the "flat fee" trap. Multi-model architectures with intelligent routing are what production AI actually looks like.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What's the actual per-token cost vs. the flat fee?&lt;/strong&gt; Do the math. If their flat fee works out to an effective cost of $50/MTok and the underlying model costs $3/MTok, you're paying a markup of nearly 17x for a wrapper.&lt;/li&gt;
&lt;/ol&gt;
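
&lt;p&gt;Questions 2 and 5 are easy to turn into quick checks. The sketch below is a minimal illustration, not a vendor integration: it assumes the vendor returns a JSON body with a top-level &lt;code&gt;model&lt;/code&gt; field (as OpenAI-style chat responses do), and the $500 fee and 10 million tokens are made-up numbers.&lt;/p&gt;

```python
def served_model(response):
    """Return the model ID reported in a response body, or None if the vendor hides it."""
    return response.get("model")


def effective_price_per_mtok(flat_fee_usd, tokens_used):
    """Flat fee divided by actual usage, in USD per million tokens."""
    return flat_fee_usd / (tokens_used / 1_000_000)


def markup(effective_price, underlying_price):
    """How many times the underlying model price you are actually paying."""
    return effective_price / underlying_price


# Hypothetical month: a $500 flat fee and 10 million tokens processed.
effective = effective_price_per_mtok(500, 10_000_000)
print(effective)                          # 50.0
print(round(markup(effective, 3.00), 1))  # 16.7

# Hypothetical response body with no model field: the vendor is not telling you.
print(served_model({"id": "resp-123"}))   # None
```

&lt;p&gt;If &lt;code&gt;served_model&lt;/code&gt; keeps coming back empty, that is your answer to question 1 as well.&lt;/p&gt;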




&lt;h2&gt;
  
  
  The Manus Problem: When You Can't See the Model
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://manus.im" rel="noopener noreferrer"&gt;Manus&lt;/a&gt;, now owned by Meta, is the poster child for the black-box approach. It's an agent platform that takes your task and runs it. You pay credits. Something happens. You get a result.&lt;/p&gt;

&lt;p&gt;What you don't get: any visibility into which model ran your task. Was it a frontier model that reasoned through your request? Or a budget model that pattern-matched and hoped for the best? You have no way to know, no way to verify, and no way to optimize.&lt;/p&gt;

&lt;p&gt;For demos and personal experiments, that's fine. For production, where you need to explain why the AI made a specific recommendation, debug when it gets something wrong, or control costs at scale, it's a liability.&lt;/p&gt;

&lt;p&gt;This is the extreme version of the vendor trap: you're not just locked into one model. You don't even know which model you're locked into. If your AI vendor can't tell you which model powers each feature, ask yourself what else they can't tell you.&lt;/p&gt;




&lt;h2&gt;
  
  
  Provider Quick Reference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Anthropic (Claude)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input/MTok&lt;/th&gt;
&lt;th&gt;Output/MTok&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://platform.claude.com/docs/en/docs/about-claude/models" rel="noopener noreferrer"&gt;Opus 4.6&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;$25.00&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;Complex reasoning, agents, architecture&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://platform.claude.com/docs/en/docs/about-claude/models" rel="noopener noreferrer"&gt;Sonnet 4.6&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;1M&lt;/td&gt;
&lt;td&gt;Code, content, production workhorse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://platform.claude.com/docs/en/docs/about-claude/models" rel="noopener noreferrer"&gt;Haiku 4.5&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$1.00&lt;/td&gt;
&lt;td&gt;$5.00&lt;/td&gt;
&lt;td&gt;200K&lt;/td&gt;
&lt;td&gt;Fast classification, simple Q&amp;amp;A, chatbots&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://platform.claude.com/docs/en/docs/about-claude/models" rel="noopener noreferrer"&gt;Anthropic Model Documentation&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenAI (GPT)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input/MTok&lt;/th&gt;
&lt;th&gt;Output/MTok&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;GPT-5.4&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;td&gt;Professional work, deep reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;GPT-5.4 mini&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.75&lt;/td&gt;
&lt;td&gt;$4.50&lt;/td&gt;
&lt;td&gt;Code, subagents, mid-tier tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;GPT-5.4 nano&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;td&gt;$1.25&lt;/td&gt;
&lt;td&gt;High-volume simple tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://openai.com/api/pricing/" rel="noopener noreferrer"&gt;OpenAI API Pricing&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Google (Gemini)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input/MTok&lt;/th&gt;
&lt;th&gt;Output/MTok&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://ai.google.dev/pricing" rel="noopener noreferrer"&gt;Gemini 3.1 Pro&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$12.00&lt;/td&gt;
&lt;td&gt;Complex tasks, long-context research&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://ai.google.dev/pricing" rel="noopener noreferrer"&gt;Gemini 3 Flash&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;Data extraction, structured output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://ai.google.dev/pricing" rel="noopener noreferrer"&gt;Gemini 2.5 Flash-Lite&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;td&gt;$0.40&lt;/td&gt;
&lt;td&gt;Budget classification, high-volume Q&amp;amp;A&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://ai.google.dev/pricing" rel="noopener noreferrer"&gt;Google AI Pricing&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  xAI (Grok)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input/MTok&lt;/th&gt;
&lt;th&gt;Output/MTok&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://docs.x.ai/docs/models" rel="noopener noreferrer"&gt;Grok 4.20 reasoning&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$2.00&lt;/td&gt;
&lt;td&gt;$6.00&lt;/td&gt;
&lt;td&gt;2M&lt;/td&gt;
&lt;td&gt;Advanced reasoning, multi-agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://docs.x.ai/docs/models" rel="noopener noreferrer"&gt;Grok 4-1-fast&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.20&lt;/td&gt;
&lt;td&gt;$0.50&lt;/td&gt;
&lt;td&gt;2M&lt;/td&gt;
&lt;td&gt;Quick responses, cost efficiency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://docs.x.ai/docs/models" rel="noopener noreferrer"&gt;xAI Model Documentation&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DeepSeek
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input/MTok&lt;/th&gt;
&lt;th&gt;Output/MTok&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://api-docs.deepseek.com/quick_start/pricing" rel="noopener noreferrer"&gt;DeepSeek V3.2 chat&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.28&lt;/td&gt;
&lt;td&gt;$0.42&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;Budget general use, structured output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://api-docs.deepseek.com/quick_start/pricing" rel="noopener noreferrer"&gt;DeepSeek V3.2 reasoner&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;$0.28&lt;/td&gt;
&lt;td&gt;$0.42&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;Budget reasoning with extended thinking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://api-docs.deepseek.com/quick_start/pricing" rel="noopener noreferrer"&gt;DeepSeek API Pricing&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
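
&lt;p&gt;To compare tiers on a concrete request rather than per-million list prices, a small estimator helps. This is a sketch: the prices are copied from the tables above, the dictionary keys are my own shorthand labels (not official API model IDs), and the 20,000/2,000 token split is an arbitrary example.&lt;/p&gt;

```python
# (input, output) prices in USD per million tokens, copied from the tables above.
# Keys are shorthand labels for this example, not official API model IDs.
PRICES = {
    "opus-4.6": (5.00, 25.00),
    "sonnet-4.6": (3.00, 15.00),
    "haiku-4.5": (1.00, 5.00),
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.4-nano": (0.20, 1.25),
    "gemini-2.5-flash-lite": (0.10, 0.40),
    "deepseek-v3.2-chat": (0.28, 0.42),
}


def request_cost(model, input_tokens, output_tokens):
    """Estimated USD cost of one request, billing input and output tokens separately."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Example request: 20,000 input tokens and 2,000 output tokens.
for name in ("opus-4.6", "sonnet-4.6", "gemini-2.5-flash-lite"):
    print(name, round(request_cost(name, 20_000, 2_000), 4))
# opus-4.6 0.15
# sonnet-4.6 0.09
# gemini-2.5-flash-lite 0.0028
```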




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How do I decide which LLM to use?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start with the task, not the model. Define what you need (reasoning, code generation, data extraction, content writing, or orchestration), then match it to the appropriate model tier. Use the Task-Model Matrix above as your starting point, and always test with your actual workload before committing. The "best" model is the one that handles your specific task reliably at a cost you can sustain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which AI is best for coding?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For production code generation, Claude Sonnet 4.6 leads: it is fast, code-native, and reliable on multi-file edits at $3/MTok input. For complex architectural decisions and debugging, use Claude Opus 4.6 with extended thinking. GPT-5.4 mini at $0.75/MTok is the best value if you need speed over depth. Avoid lightweight models (Nano, Flash-Lite) for code; they produce syntactically valid code with subtle logic errors that cost more to debug than you saved on tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which LLM is best for research?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It depends on the depth. For deep analysis across hundreds of pages, use Claude Opus 4.6 with extended thinking and its 1M token context window. For quick fact extraction from structured documents, Gemini 2.5 Flash at $0.30/MTok handles it fine. For research needing real-time web information, use GPT-5.4 with web search or Gemini with Google Search integration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is ChatGPT better than Claude or Gemini?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;None of them is universally "better." Claude leads on coding and instruction-following. GPT-5.4 is strong on general professional work and has the broadest tool ecosystem. Gemini wins on cost efficiency and context window size. The right answer is using each where it's strongest, which is why single-model AI solutions underperform multi-model architectures. See the full comparison table above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is LLM task routing?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Task routing is the practice of directing different AI tasks to different models based on what each model does best. Instead of running everything on one expensive model (or one cheap model that hallucinates on complex tasks), you route reasoning to a frontier model, data extraction to a lightweight model, and code generation to a mid-tier model. Your total cost drops, quality goes up, and you stop overpaying for simple tasks or underpaying for complex ones.&lt;/p&gt;
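
&lt;p&gt;In its simplest form, routing is just a lookup table. The sketch below is illustrative only: the task types come from the answer above, while the tier names, model labels, and the fallback to the mid tier are my own assumptions. A production router would usually classify the task first and escalate when a cheaper model fails.&lt;/p&gt;

```python
# Task type to capability tier, following the split described above.
ROUTES = {
    "reasoning": "frontier",
    "code_generation": "mid",
    "data_extraction": "lightweight",
}

# Hypothetical tier-to-model mapping; swap in whatever your providers offer.
MODELS = {
    "frontier": "opus-4.6",
    "mid": "sonnet-4.6",
    "lightweight": "gemini-2.5-flash-lite",
}


def route(task_type, default_tier="mid"):
    """Pick a model label for a task type; unknown types fall back to the default tier."""
    tier = ROUTES.get(task_type, default_tier)
    return MODELS[tier]


print(route("reasoning"))        # opus-4.6
print(route("data_extraction"))  # gemini-2.5-flash-lite
print(route("translation"))      # sonnet-4.6 (fallback for an unlisted task type)
```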

&lt;p&gt;&lt;em&gt;This guide reflects production experience as of March 2026. LLM pricing and capabilities change frequently. I'll update this reference as models evolve. All pricing and capability claims link to official provider documentation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I'm Tom Tokita, Co-Founder &amp;amp; President of &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology Inc.&lt;/a&gt;, a consulting firm in Manila. I route between 3-5 LLMs daily across production deployments. Have a question about which model fits your use case? &lt;a href="https://tokita.online/contact/" rel="noopener noreferrer"&gt;Let's talk.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
