<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tom Tokita</title>
    <description>The latest articles on DEV Community by Tom Tokita (@tomtokita).</description>
    <link>https://dev.to/tomtokita</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3840091%2F5ac3193c-0dc1-496a-b6d2-a7eb6e1556e7.jpg</url>
      <title>DEV Community: Tom Tokita</title>
      <link>https://dev.to/tomtokita</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tomtokita"/>
    <language>en</language>
    <item>
      <title>Hackers Didn't Break Into Instagram. They Exposed the Biggest Agentic AI Security Risk in Production.</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Wed, 03 Jun 2026 05:35:41 +0000</pubDate>
      <link>https://dev.to/tomtokita/hackers-didnt-break-into-instagram-they-exposed-the-biggest-agentic-ai-security-risk-in-4j2j</link>
      <guid>https://dev.to/tomtokita/hackers-didnt-break-into-instagram-they-exposed-the-biggest-agentic-ai-security-risk-in-4j2j</guid>
      <description>&lt;p&gt;Nobody hacked Instagram. What happened was worse: an AI chatbot security failure that let attackers walk through the front door.&lt;/p&gt;

&lt;p&gt;That needs to be the first thing you understand about what happened on June 1, 2026. There was no zero-day exploit. No SQL injection. No brute-force password cracking. Hackers &lt;a href="https://krebsonsecurity.com/2026/06/hackers-used-metas-ai-support-bot-to-seize-instagram-accounts/" rel="noopener noreferrer"&gt;used a VPN to fake their location&lt;/a&gt;, opened Meta's AI support chatbot, and asked it to change the email on someone else's account.&lt;/p&gt;

&lt;p&gt;The bot did it.&lt;/p&gt;

&lt;p&gt;It sent a verification code to the attacker's email. The attacker verified it. Then they got a password reset link. That was the entire exploit. Instructions for doing it &lt;a href="https://krebsonsecurity.com/2026/06/hackers-used-metas-ai-support-bot-to-seize-instagram-accounts/" rel="noopener noreferrer"&gt;circulated on Telegram&lt;/a&gt; within hours. High-profile accounts fell fast: the &lt;a href="https://www.404media.co/hackers-simply-asked-meta-ai-to-give-them-access-to-high-profile-instagram-accounts-it-worked/" rel="noopener noreferrer"&gt;Obama-era White House Instagram&lt;/a&gt; was defaced with pro-Iran content. The Chief Master Sergeant of the U.S. Space Force lost access. Jane Manchun Wong, a former Meta security engineer, had her &lt;a href="https://x.com/wongmjane/status/2061456887959474393" rel="noopener noreferrer"&gt;password changed without her knowledge&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Meta spokesperson Andy Stone &lt;a href="https://x.com/andymstone/status/2061486724199379186" rel="noopener noreferrer"&gt;confirmed the vulnerability was real&lt;/a&gt; and said they were "securing impacted accounts."&lt;/p&gt;

&lt;p&gt;One user on X summed it up better than any post-mortem could: "We're at the point where one AI stole it and another can't fix it, &lt;a href="https://www.bbc.com/news/articles/c98rzr72dpyo" rel="noopener noreferrer"&gt;zero humans in the loop anywhere&lt;/a&gt;."&lt;/p&gt;

&lt;h2&gt;The Instagram AI Hack Exposed a Deeper Pattern of Autonomous AI Risks&lt;/h2&gt;

&lt;p&gt;The Instagram AI hack isn't an isolated incident. It's a symptom of a deeper set of autonomous AI risks that the industry keeps ignoring. The pattern always looks the same: an AI system with too much authority, too little verification, and no human checkpoint between intent and execution.&lt;/p&gt;

&lt;p&gt;You've seen this before.&lt;/p&gt;

&lt;p&gt;OpenClaw gave 100 autonomous agents access to OpenAI's API with no budget gates. The result was a &lt;a href="https://tokita.online/openclaw-ai-agent-cost-reality/" rel="noopener noreferrer"&gt;$1.3 million bill&lt;/a&gt; that nobody noticed until the invoice arrived. Different domain, same architecture: agents running without boundaries, consequences discovered after the damage.&lt;/p&gt;

&lt;p&gt;A startup called PocketOS gave an AI agent write access to a production database with no pre-action gate. The agent &lt;a href="https://tokita.online/ai-agent-production-safety/" rel="noopener noreferrer"&gt;deleted everything in 9 seconds&lt;/a&gt;. There was no confirmation step, no rollback trigger, no human checkpoint.&lt;/p&gt;

&lt;p&gt;Security researchers found &lt;a href="https://tokita.online/ai-supply-chain-attack-575-malicious-skills/" rel="noopener noreferrer"&gt;575 malicious AI skills&lt;/a&gt; published to open registries: tools that looked legitimate but contained prompt injection payloads, credential harvesters, and data exfiltration code. The trust model was: if it's in the registry, it's safe. Nobody verified.&lt;/p&gt;

&lt;p&gt;Four incidents. Four different consequences. One architectural failure.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Incident&lt;/th&gt;
&lt;th&gt;What Failed&lt;/th&gt;
&lt;th&gt;AI Guardrail That Prevents It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Meta Instagram AI hack&lt;/td&gt;
&lt;td&gt;No identity verification on account changes&lt;/td&gt;
&lt;td&gt;Human-in-the-loop for identity operations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenClaw $1.3M bill&lt;/td&gt;
&lt;td&gt;No token budget limits on autonomous agents&lt;/td&gt;
&lt;td&gt;Consumption governance with per-agent caps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PocketOS database deletion&lt;/td&gt;
&lt;td&gt;No pre-action gate on destructive operations&lt;/td&gt;
&lt;td&gt;Pre-action confirmation for write/delete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;575 malicious AI skills&lt;/td&gt;
&lt;td&gt;No provenance checks on tool registry&lt;/td&gt;
&lt;td&gt;Supply chain verification&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;Why AI Chatbot Security Fails: The Guru Dream vs. Production Reality&lt;/h2&gt;

&lt;p&gt;The AI influencer pitch goes like this: deploy autonomous agents, remove humans from the loop, let the AI handle it. Scale your support team with chatbots. Replace your QA with agents. Automate your entire deployment pipeline. The future is autonomous everything.&lt;/p&gt;

&lt;p&gt;That pitch sounds compelling until you see what happens when it ships.&lt;/p&gt;

&lt;p&gt;Meta replaced human support staff with an AI chatbot to handle account recovery. Account recovery is one of the most sensitive operations on any platform because the person asking for access may not be the owner. Marijus Briedis, CTO of NordVPN, &lt;a href="https://www.bbc.com/news/articles/c98rzr72dpyo" rel="noopener noreferrer"&gt;put it plainly&lt;/a&gt;: when AI chatbots have "too much authority and too little verification, they can become a serious security risk."&lt;/p&gt;

&lt;p&gt;This is the Meta AI vulnerability in plain language: too much authority, no verification checkpoint, no human override.&lt;/p&gt;

&lt;p&gt;The guru pitch consistently leaves this out. &lt;a href="https://tokita.online/autonomous-ai-agents-production-cost/" rel="noopener noreferrer"&gt;Autonomous agents fail in production&lt;/a&gt; not because the models are bad, but because the &lt;a href="https://tokita.online/what-is-harness-engineering/" rel="noopener noreferrer"&gt;harness is missing&lt;/a&gt;. The models will do exactly what you ask them to do. That's the problem. If you ask a chatbot to change an email address and it has the authority to do so, it will. It won't stop to wonder whether you should be making that request.&lt;/p&gt;

&lt;p&gt;The agentic AI security risks aren't theoretical. They're the documented, repeated consequence of deploying AI systems without gates.&lt;/p&gt;

&lt;h2&gt;If Meta's AI Vulnerability Exposed Millions, What About Your AI Agents?&lt;/h2&gt;

&lt;p&gt;Meta is &lt;a href="https://www.bbc.com/news/articles/c98rzr72dpyo" rel="noopener noreferrer"&gt;one of the most valuable tech companies on the planet&lt;/a&gt;. They employ some of the best security engineers in the world. They have red teams, bug bounties, and incident response playbooks that most organizations can only dream about.&lt;/p&gt;

&lt;p&gt;And their AI support chatbot was tricked with a VPN and a politely worded request.&lt;/p&gt;

&lt;p&gt;Now think about the solo developer who watched a YouTube tutorial on building AI agents last month. Someone who learned to &lt;a href="https://tokita.online/vibe-coding-risks-vercel-breach/" rel="noopener noreferrer"&gt;vibe code&lt;/a&gt; an LLM into an API, built a prototype over a weekend, showed it to a client, and is now planning to deploy it. No pre-action gate. No human-in-the-loop for sensitive operations. No &lt;a href="https://tokita.online/context-engineering-vs-prompt-engineering/" rel="noopener noreferrer"&gt;context engineering&lt;/a&gt; to constrain what the agent can access. No token budget to limit runaway costs. No drift detection to catch when the agent starts behaving differently from what was intended.&lt;/p&gt;

&lt;p&gt;That developer isn't negligent. They just never learned the fundamentals because the fundamentals aren't what gets amplified. The conference talks are about what AI can do, not what it shouldn't be allowed to do unsupervised.&lt;/p&gt;

&lt;h2&gt;AI Guardrails That Would Have Stopped Every Incident in This Article&lt;/h2&gt;

&lt;p&gt;This isn't a "don't use AI" argument. AI agents are powerful tools. I run multiple AI systems in production daily and they do real work. But they work because they run inside a &lt;a href="https://tokita.online/what-is-harness-engineering/" rel="noopener noreferrer"&gt;harness&lt;/a&gt; with mechanical constraints, not because they're trustworthy by default.&lt;/p&gt;

&lt;p&gt;Here's a list of AI guardrails that would have stopped every incident above. None of these are new. They've just been drowned out by hype.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pre-action gates.&lt;/strong&gt; Every sensitive operation needs a verification step before execution. &lt;a href="https://tokita.online/ai-agent-pre-action-gate-tutorial/" rel="noopener noreferrer"&gt;Here's how to build one&lt;/a&gt;. Account changes, data deletion, financial transactions, deployment commands. None of these should execute on a single request without verification.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-loop for identity operations.&lt;/strong&gt; If a process determines who has access to what, a human must be in the decision chain. This isn't optional. Meta learned this the hard way.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context boundaries.&lt;/strong&gt; An AI agent should only access what it needs for the current task. Meta's support bot had write access to email addresses on any account. That's an authorization failure before it's an AI failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumption governance.&lt;/strong&gt; &lt;a href="https://tokita.online/tokenmaxxing-enterprise-ai-cost-crisis/" rel="noopener noreferrer"&gt;Token costs are real&lt;/a&gt; and compound fast. Budget caps, per-agent limits, and alert thresholds aren't overhead. They're infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supply chain verification.&lt;/strong&gt; Every tool, plugin, and skill in your agent's registry needs provenance checks. Trusting by default &lt;a href="https://tokita.online/ai-supply-chain-attack-575-malicious-skills/" rel="noopener noreferrer"&gt;is the new attack surface&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drift detection.&lt;/strong&gt; Agents change behavior as models update, prompts shift, and context windows compress. If you aren't monitoring for behavioral drift, you won't know your system has degraded until a user tells you. Or until it shows up on X.&lt;/li&gt;
&lt;/ol&gt;
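
&lt;p&gt;A minimal sketch of how the first two guardrails could fit together, assuming a Python service sits between the agent and the account system. The action names, the ActionRequest fields, and the human_approves callback are hypothetical and only illustrate the shape of a gate. Nothing here describes how Meta's systems are actually built.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical action names. A real system would define its own list.
SENSITIVE_ACTIONS = {"change_email", "reset_password", "delete_data", "deploy"}


@dataclass
class ActionRequest:
    action: str
    target_account: str
    # Set by a separate identity check, never by the agent itself.
    requester_verified: bool


def pre_action_gate(
    request: ActionRequest,
    human_approves: Callable[[ActionRequest], bool],
) -> str:
    """Return "execute" or "deny" for an action the agent wants to take."""
    if request.action not in SENSITIVE_ACTIONS:
        return "execute"  # low-risk actions pass straight through
    if not request.requester_verified:
        return "deny"  # no verified identity, no sensitive action
    # Verified and sensitive: a human still makes the final call.
    return "execute" if human_approves(request) else "deny"


# An unverified request to change an email is denied before any human is asked.
attempt = ActionRequest("change_email", "someone-elses-account", False)
print(pre_action_gate(attempt, lambda req: True))  # deny
```

&lt;p&gt;The point of the sketch is the ordering: the agent can request anything, but the decision to execute lives in code the agent cannot talk its way around.&lt;/p&gt;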

&lt;p&gt;The gurus will tell you these are easy to implement. They aren't. Each one takes real iteration: building the gate, testing it against actual edge cases, discovering the scenarios you didn't anticipate, and testing again. Automated test suites catch regressions. They don't catch the moment an AI agent interprets a legitimate-looking request in a way no one predicted. These are critical security functions. They need human eyes, human judgment, and human testing before they go anywhere near production. Over-reliance on agentic automation to validate agentic automation is how you end up right back where Meta started.&lt;/p&gt;

&lt;h2&gt;Agentic AI Security Risks Are Architectural, Not Theoretical&lt;/h2&gt;

&lt;p&gt;Every incident in this article was preventable. Not with better models. Not with bigger budgets. With fundamentals that take days to learn, even if getting each one right takes real iteration.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://tokita.online/why-multi-agent-ai-fails/" rel="noopener noreferrer"&gt;multi-agent swarm pitch&lt;/a&gt; will keep getting recycled. The next AI chatbot vulnerability will happen. Another startup will give an agent write access to something it shouldn't have. These aren't predictions. They're extrapolations from a pattern that hasn't changed.&lt;/p&gt;

&lt;p&gt;Agentic AI security risks are architectural problems. They don't get solved by better prompts or smarter models. They get solved by &lt;a href="https://tokita.online/best-llm-for-each-task/" rel="noopener noreferrer"&gt;choosing the right tool for the job&lt;/a&gt;, constraining what that tool can do, and building the verification layers that keep it honest.&lt;/p&gt;

&lt;p&gt;The industry doesn't need more autonomous AI demos. It needs practitioners who understand agentic AI security risks before they build the first agent. People who've read about the failures and internalized the architecture that prevents them.&lt;/p&gt;

&lt;p&gt;If you're building AI systems, start with the constraints. The capabilities are easy. The guardrails aren't optional. They're the product.&lt;/p&gt;

&lt;h2&gt;Frequently Asked Questions&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What are agentic AI security risks?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agentic AI security risks are the vulnerabilities that emerge when AI systems have execution authority without verification checkpoints. They include unauthorized actions (Meta's chatbot changing emails without identity verification), uncontrolled spending (OpenClaw's &lt;a href="https://tokita.online/openclaw-ai-agent-cost-reality/" rel="noopener noreferrer"&gt;$1.3M bill&lt;/a&gt; from ungoverned agents), data destruction (&lt;a href="https://tokita.online/ai-agent-production-safety/" rel="noopener noreferrer"&gt;PocketOS's 9-second database deletion&lt;/a&gt;), and supply chain poisoning (&lt;a href="https://tokita.online/ai-supply-chain-attack-575-malicious-skills/" rel="noopener noreferrer"&gt;575 malicious AI skills&lt;/a&gt; in open registries).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What AI guardrails should developers implement?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At minimum: &lt;a href="https://tokita.online/ai-agent-pre-action-gate-tutorial/" rel="noopener noreferrer"&gt;pre-action gates&lt;/a&gt; on sensitive operations, human-in-the-loop for identity and access decisions, &lt;a href="https://tokita.online/context-engineering-vs-prompt-engineering/" rel="noopener noreferrer"&gt;context boundaries&lt;/a&gt; that limit what an agent can reach, &lt;a href="https://tokita.online/tokenmaxxing-enterprise-ai-cost-crisis/" rel="noopener noreferrer"&gt;consumption governance&lt;/a&gt; with per-agent token budgets, supply chain verification for all tools and plugins, and behavioral drift detection. These aren't advanced techniques. They're fundamentals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How did hackers exploit Meta's AI chatbot on Instagram?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Attackers used a VPN to spoof the account holder's location, then asked Meta's AI support assistant to link a new email to the target account. The chatbot &lt;a href="https://krebsonsecurity.com/2026/06/hackers-used-metas-ai-support-bot-to-seize-instagram-accounts/" rel="noopener noreferrer"&gt;complied without verifying identity&lt;/a&gt;, sent a verification code to the attacker's email, and enabled a password reset. No technical exploit was required. The AI had the authority to make account changes and no &lt;a href="https://tokita.online/what-is-harness-engineering/" rel="noopener noreferrer"&gt;guardrail&lt;/a&gt; to stop it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can autonomous AI agents be deployed safely?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, but only with the right &lt;a href="https://tokita.online/what-is-harness-engineering/" rel="noopener noreferrer"&gt;harness architecture&lt;/a&gt;. The problem isn't autonomy itself. It's autonomy without constraints. &lt;a href="https://tokita.online/autonomous-ai-agents-production-cost/" rel="noopener noreferrer"&gt;Autonomous agents fail in production&lt;/a&gt; when they're given authority without verification gates, budget limits, or human oversight on sensitive operations. Build the constraints first, then add capabilities.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tom Tokita is the president of Aether Global Technology Inc. and builds production AI operations systems that route between multiple LLMs daily. He writes about what works and what breaks at &lt;a href="https://tokita.online" rel="noopener noreferrer"&gt;tokita.online&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>Tokenmaxxing Is a Symptom. Here's the Disease Every Enterprise Is Ignoring.</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Thu, 28 May 2026 07:21:54 +0000</pubDate>
      <link>https://dev.to/tomtokita/tokenmaxxing-is-a-symptom-heres-the-disease-every-enterprise-is-ignoring-44f4</link>
      <guid>https://dev.to/tomtokita/tokenmaxxing-is-a-symptom-heres-the-disease-every-enterprise-is-ignoring-44f4</guid>
      <description>&lt;p&gt;NVIDIA's vice president of applied deep learning, Bryan Catanzaro, said something in an &lt;a href="https://www.techspot.com/news/112209-ai-compute-costs-getting-high-they-starting-rival.html" rel="noopener noreferrer"&gt;Axios interview in April 2026&lt;/a&gt; that should have stopped every enterprise AI roadmap cold:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"For my team, the cost of compute is far beyond the costs of the employees."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is not a critic talking. That is a vice president at the company selling the chips that power most of the world's AI datacenters. When NVIDIA's own leadership admits compute outweighs payroll, the "AI will save you money" narrative has a problem.&lt;/p&gt;

&lt;p&gt;But most companies missed the signal. They were too busy tokenmaxxing.&lt;/p&gt;

&lt;h2&gt;Microsoft Pulled the Plug on Claude Code&lt;/h2&gt;

&lt;p&gt;In May 2026, Microsoft &lt;a href="https://www.windowscentral.com/microsoft/microsoft-cancels-claude-code-licenses-shifting-developers-to-github-copilot-cli-a-move-likely-driven-by-financial-motives" rel="noopener noreferrer"&gt;began cancelling the majority of its internal Claude Code licenses&lt;/a&gt;, redirecting thousands of engineers to GitHub Copilot CLI instead. The reversal came six months after the company opened broad access to Claude Code across its Experiences + Devices division, the group responsible for Windows, Microsoft 365, Outlook, Teams, and Surface.&lt;/p&gt;

&lt;p&gt;Adoption was fast. Engineers, project managers, and designers embraced it for prototyping and development. The problem wasn't the tool. It was token-based pricing at enterprise scale with no consumption governance. Monthly bills became unpredictable and high enough to trigger a fiscal-year-end pullback.&lt;/p&gt;

&lt;p&gt;Microsoft's $5 billion Foundry deal with Anthropic and Anthropic's $30 billion Azure compute commitment both remain intact. Not a relationship break. A cost-control correction.&lt;/p&gt;

&lt;p&gt;A company with functionally unlimited resources still could not absorb uncapped AI token spend across thousands of users. That should tell you something.&lt;/p&gt;

&lt;h2&gt;Uber Burned Its Entire 2026 AI Budget by April&lt;/h2&gt;

&lt;p&gt;Uber's CTO, Praveen Neppalli Naga, &lt;a href="https://www.forbes.com/sites/janakirammsv/2026/05/17/uber-burns-its-2026-ai-budget-in-four-months-on-claude-code/" rel="noopener noreferrer"&gt;confirmed to The Information&lt;/a&gt; in April 2026 that the company had exhausted its entire annual AI coding tools budget in four months. Claude Code was rolled out in December 2025. Adoption climbed from 32% of engineers in February to &lt;a href="https://www.forbes.com/sites/janakirammsv/2026/05/17/uber-burns-its-2026-ai-budget-in-four-months-on-claude-code/" rel="noopener noreferrer"&gt;84% classified as agentic coding users by March&lt;/a&gt;. By spring, 95% were using AI tools monthly, roughly 70% of committed code originated from those tools, and 11% of live backend updates were written by agents with no human in the loop.&lt;/p&gt;

&lt;p&gt;The per-engineer cost: &lt;a href="https://www.forbes.com/sites/janakirammsv/2026/05/17/uber-burns-its-2026-ai-budget-in-four-months-on-claude-code/" rel="noopener noreferrer"&gt;$150 to $250 per month on average&lt;/a&gt;, with power users running between $500 and $2,000. Naga himself reported spending $1,200 in a two-hour demo session. The tool didn't fail. Engineers didn't misuse it. They used it for exactly the workloads it was designed to handle. From a productivity standpoint the rollout was a success. From a finance standpoint it was a runaway.&lt;/p&gt;

&lt;p&gt;Uber compounded the dynamic by &lt;a href="https://www.forbes.com/sites/janakirammsv/2026/05/17/uber-burns-its-2026-ai-budget-in-four-months-on-claude-code/" rel="noopener noreferrer"&gt;ranking engineers on internal leaderboards&lt;/a&gt; based on Claude Code usage. That created a cultural incentive to consume more tokens. The teams driving adoption were not the same teams managing the spend.&lt;/p&gt;

&lt;p&gt;They measured who was using AI. They never measured what it cost per unit of output.&lt;/p&gt;

&lt;h2&gt;Tokenmaxxing: When the Metric Becomes the Game&lt;/h2&gt;

&lt;p&gt;The term "tokenmaxxing" describes employees running trivial or unnecessary tasks through AI tools to inflate their usage numbers. Amazon employees &lt;a href="https://futurism.com/artificial-intelligence/amazon-quotas-ai-use" rel="noopener noreferrer"&gt;admitted to the practice&lt;/a&gt; in May 2026 after the company set internal AI usage targets and tracked consumption through leaderboards. Workers reported feeling pressure to hit token quotas, even though Amazon publicly stated the numbers would not factor into performance reviews.&lt;/p&gt;

&lt;p&gt;At Meta, the same dynamic played out through an internal tracking tool called "Claudeonomics," which ranked employees by their AI token consumption. The leaderboard reportedly &lt;a href="https://fortune.com/2026/04/09/meta-killed-employee-ai-token-dashboard/" rel="noopener noreferrer"&gt;showed 60 trillion tokens consumed in a 30-day period&lt;/a&gt; before Meta killed it after media coverage.&lt;/p&gt;

&lt;p&gt;This is Goodhart's Law in real time. The moment token consumption became a tracked metric, it stopped being a useful measure of anything. Employees optimized for the number, not for the work the number was supposed to represent.&lt;/p&gt;

&lt;p&gt;Tokenmaxxing isn't an employee behavior problem. It is a governance design failure. If you measure consumption without measuring value, you get consumption without value.&lt;/p&gt;

&lt;h2&gt;The Goldman Sachs Math That Should Scare Every CFO&lt;/h2&gt;

&lt;p&gt;Goldman Sachs &lt;a href="https://www.goldmansachs.com/insights/articles/ai-agents-forecast-to-boost-tech-cash-flow-as-usage-soars" rel="noopener noreferrer"&gt;published a research report&lt;/a&gt; forecasting that agentic AI will drive a 24-fold increase in global token consumption by 2030, reaching 120 quadrillion tokens per month. Their breakdown: a standard chatbot consumes roughly 1,000 tokens per session. An embedded copilot uses over 5,000 tokens per day. A continuously active autonomous agent burns through 100,000 or more tokens per day.&lt;/p&gt;

&lt;p&gt;NVIDIA CEO Jensen Huang has said he expects &lt;a href="https://businesschief.com/news/jensen-huang-nvidia-will-have-100-ai-agents-for-each-worker" rel="noopener noreferrer"&gt;100 AI agents working alongside every human employee&lt;/a&gt; at NVIDIA by 2036.&lt;/p&gt;

&lt;p&gt;Do the multiplication. 100 agents per employee, at 100,000 tokens per day per agent, is 10 million tokens per employee per day. Multiply that by any mid-size engineering team and the numbers become absurd before you even discuss pricing.&lt;/p&gt;
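
&lt;p&gt;The multiplication can be written out as a short Python sketch. The agent count and tokens per agent are the figures quoted above. The team size and the price per million tokens are hypothetical placeholders, not real vendor pricing.&lt;/p&gt;

```python
# Figures quoted in the text above.
AGENTS_PER_EMPLOYEE = 100
TOKENS_PER_AGENT_PER_DAY = 100_000


def tokens_per_day(employees: int) -> int:
    """Total daily token consumption for a team of the given size."""
    return employees * AGENTS_PER_EMPLOYEE * TOKENS_PER_AGENT_PER_DAY


def daily_cost_usd(employees: int, usd_per_million_tokens: float) -> float:
    """Daily spend at a flat, assumed price per million tokens."""
    return tokens_per_day(employees) / 1_000_000 * usd_per_million_tokens


TEAM_SIZE = 200  # hypothetical mid-size engineering team
PRICE_PER_MILLION = 1.0  # hypothetical placeholder price in USD

print(f"{tokens_per_day(1):,} tokens per employee per day")  # 10,000,000
print(f"{tokens_per_day(TEAM_SIZE):,} tokens per team per day")  # 2,000,000,000
print(f"${daily_cost_usd(TEAM_SIZE, PRICE_PER_MILLION):,.2f} per day")  # $2,000.00
```

&lt;p&gt;Swap in your own headcount and blended token price. The structure of the calculation stays the same, and so does the direction of the result.&lt;/p&gt;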

&lt;p&gt;Gartner projects that by 2030, inference costs on a one-trillion-parameter model will be &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2026-03-25-gartner-predicts-that-by-2030-performing-inference-on-an-llm-with-1-trillion-parameters-will-cost-genai-providers-over-90-percent-less-than-in-2025" rel="noopener noreferrer"&gt;over 90% cheaper than in 2025&lt;/a&gt;. But their own analyst, Will Sommer, &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2026-03-25-gartner-predicts-that-by-2030-performing-inference-on-an-llm-with-1-trillion-parameters-will-cost-genai-providers-over-90-percent-less-than-in-2025" rel="noopener noreferrer"&gt;cautioned&lt;/a&gt;: "Chief Product Officers should not confuse the deflation of commodity tokens with the democratization of frontier reasoning." Agentic models require 5 to 30 times more tokens per task than standard models. Consumption growth will outpace falling unit costs. And AI providers are not going to pass through the full savings.&lt;/p&gt;

&lt;p&gt;Cheaper tokens, more tokens per task, exploding number of tasks. The bill goes up.&lt;/p&gt;

&lt;h2&gt;The Pattern Is Obvious. The Fix Is Not Complicated.&lt;/h2&gt;

&lt;p&gt;Microsoft, Uber, Amazon, Meta. Four of the most technically sophisticated companies on earth. All hit the same wall. The pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Executive mandate pushes broad AI adoption&lt;/li&gt;
&lt;li&gt;Leaderboards or usage metrics track consumption volume&lt;/li&gt;
&lt;li&gt;No mechanism ties consumption to business value&lt;/li&gt;
&lt;li&gt;Token-based pricing creates unpredictable, escalating costs&lt;/li&gt;
&lt;li&gt;Budget blowout triggers reactive pullback or cancellation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The disease is not AI. The disease is adoption without governance. No consumption gates, no cost ceilings, and no way to tie a token to a deliverable.&lt;/p&gt;

&lt;p&gt;I &lt;a href="https://tokita.online/ai-agent-pre-action-gate-tutorial/" rel="noopener noreferrer"&gt;wrote about pre-action gates&lt;/a&gt; and &lt;a href="https://tokita.online/ai-agent-production-safety/" rel="noopener noreferrer"&gt;agent production safety&lt;/a&gt; months before these headlines. The principle is the same whether you are running 100 Codex agents like &lt;a href="https://tokita.online/openclaw-ai-agent-cost-reality/" rel="noopener noreferrer"&gt;OpenClaw's $1.3 million month&lt;/a&gt; or deploying Claude Code across 10,000 engineers. If there is no gate between the request and the spend, the spend wins.&lt;/p&gt;
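
&lt;p&gt;One way to picture a gate between the request and the spend is the Python sketch below. The BudgetGate class, its keys, and the dollar amounts are hypothetical. A production version would also need persistence, concurrency control, and alerting.&lt;/p&gt;

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push a key past its cost ceiling."""


class BudgetGate:
    """Tracks spend per key (user, team, or task category) against a ceiling."""

    def __init__(self, ceilings_usd: dict):
        self.ceilings = dict(ceilings_usd)
        self.spent = {key: 0.0 for key in self.ceilings}

    def charge(self, key: str, cost_usd: float) -> float:
        """Record a cost before the call is made. Returns the remaining budget."""
        if key not in self.ceilings:
            raise KeyError(f"no budget configured for {key!r}")
        projected = self.spent[key] + cost_usd
        if projected > self.ceilings[key]:
            # Refuse the request instead of discovering the overrun on the invoice.
            raise BudgetExceeded(f"{key}: {projected:.2f} exceeds {self.ceilings[key]:.2f}")
        self.spent[key] = projected
        return self.ceilings[key] - projected


gate = BudgetGate({"team-a": 10.0})
print(gate.charge("team-a", 4.0))  # 6.0
try:
    gate.charge("team-a", 7.0)
except BudgetExceeded as error:
    print("blocked:", error)
```

&lt;p&gt;The design choice that matters is that the check runs before the model call, so a blocked request costs nothing.&lt;/p&gt;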

&lt;p&gt;The companies that will survive the agentic era are not the ones that adopt fastest. They are the ones that &lt;a href="https://tokita.online/what-is-harness-engineering/" rel="noopener noreferrer"&gt;build harnesses&lt;/a&gt; before they build agents. Measure output, not tokens. Set cost ceilings per user, per team, per task category. Attribute consumption to deliverables, not leaderboard positions.&lt;/p&gt;

&lt;p&gt;Tokenmaxxing is what happens when you skip that step.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>enterprise</category>
      <category>governance</category>
      <category>tokenmaxxing</category>
    </item>
    <item>
      <title>OpenClaw's $1.3 Million OpenAI Bill: What AI Agents Actually Cost in Production</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Thu, 21 May 2026 00:51:27 +0000</pubDate>
      <link>https://dev.to/tomtokita/openclaws-13-million-openai-bill-what-ai-agents-actually-cost-in-production-3d9o</link>
      <guid>https://dev.to/tomtokita/openclaws-13-million-openai-bill-what-ai-agents-actually-cost-in-production-3d9o</guid>
      <description>&lt;p&gt;Peter Steinberger spent a decade building &lt;a href="https://pspdfkit.com/" rel="noopener noreferrer"&gt;PSPDFKit&lt;/a&gt; into a PDF framework running on over a billion devices. He &lt;a href="https://steipete.me/posts/2026/openclaw" rel="noopener noreferrer"&gt;joined OpenAI in February 2026&lt;/a&gt;, saying "I want to change the world, not build a large company." A few months later, his open-source project &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt;, the fastest-growing project in GitHub history with over 300,000 stars and 3.2 million users, racked up an OpenAI bill of &lt;a href="https://thenextweb.com/news/openclaw-peter-steinberger-1-3-million-openai-token-bill" rel="noopener noreferrer"&gt;$1,305,088.81 in a single month&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;603 billion tokens. 7.6 million API requests. 100 Codex agents running simultaneously. The OpenClaw cost breakdown is the first real look at what autonomous AI agents cost in production.&lt;/p&gt;

&lt;p&gt;That's roughly $13,000 per agent per month.&lt;/p&gt;

&lt;p&gt;And OpenAI is covering the bill as a "research investment." Regular companies don't get that deal.&lt;/p&gt;

&lt;h2&gt;The OpenClaw Cost Breakdown&lt;/h2&gt;

&lt;p&gt;OpenClaw is a &lt;a href="https://github.com/openclaw/openclaw" rel="noopener noreferrer"&gt;self-hosted autonomous AI assistant&lt;/a&gt;. It connects to your email, calendar, browser, Slack, Discord, WhatsApp, and iMessage. Agents execute shell commands, manage files, and automate web tasks through a growing &lt;a href="https://github.com/openclaw/clawhub" rel="noopener noreferrer"&gt;skill registry&lt;/a&gt;. The 100 agents running on Steinberger's setup were doing real work: reviewing pull requests, scanning commits for security vulnerabilities, deduplicating GitHub issues, writing and submitting fixes, monitoring performance benchmarks, even attending meetings and generating feature PRs.&lt;/p&gt;

&lt;p&gt;This wasn't a demo. This was production. The distinction matters, because every guru demo stops before the billing cycle starts.&lt;/p&gt;

&lt;p&gt;The primary model was GPT-5.5 running in Fast Mode, which consumed tokens at higher rates. Steinberger noted that &lt;a href="https://thenextweb.com/news/openclaw-peter-steinberger-1-3-million-openai-token-bill" rel="noopener noreferrer"&gt;disabling Fast Mode would drop the bill to roughly $300,000 per month&lt;/a&gt;. A reduction of more than 70%. Still $3,000 per agent per month at the "optimized" rate. Still $3.6 million annually.&lt;/p&gt;
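
&lt;p&gt;The per-agent figures follow from the numbers quoted above. A short Python check, using only the invoiced bill, the estimated bill with Fast Mode disabled, and the agent count from the text:&lt;/p&gt;

```python
# Figures quoted in the text above.
MONTHLY_BILL_USD = 1_305_088.81
OPTIMIZED_MONTHLY_BILL_USD = 300_000  # "roughly", so treat results as approximate
AGENTS = 100

per_agent = MONTHLY_BILL_USD / AGENTS
per_agent_optimized = OPTIMIZED_MONTHLY_BILL_USD / AGENTS
annual_optimized = OPTIMIZED_MONTHLY_BILL_USD * 12
reduction = 1 - OPTIMIZED_MONTHLY_BILL_USD / MONTHLY_BILL_USD

print(f"${per_agent:,.2f} per agent per month")  # $13,050.89
print(f"${per_agent_optimized:,.2f} per agent per month, optimized")  # $3,000.00
print(f"${annual_optimized:,} per year, optimized")  # $3,600,000
print(f"{reduction:.0%} reduction")  # 77%
```

&lt;p&gt;Because the $300,000 figure is itself an estimate, the computed reduction is indicative rather than exact.&lt;/p&gt;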

&lt;h2&gt;Why This Matters More Than the Headline&lt;/h2&gt;

&lt;p&gt;The headline number is dramatic, but the per-agent cost is the real story.&lt;/p&gt;

&lt;p&gt;$13,000 per month per agent on full pricing. $3,000 per month on optimized pricing. These aren't projections from a whitepaper. These are invoiced numbers from someone who works at OpenAI running agents on OpenAI's own infrastructure.&lt;/p&gt;

&lt;p&gt;Now think about the gap between Steinberger and a newcomer. He's an experienced engineer who built billion-device software. He has OpenAI's internal knowledge. He has a "research investment" subsidy covering the bill. He knows to disable Fast Mode for a 70% cost reduction.&lt;/p&gt;

&lt;p&gt;A first-time builder doesn't know any of that. They'll hit the high-rate pricing, run agents longer than necessary, retry failed calls without cost caps, and discover the bill at the end of the month. If Steinberger's optimized setup costs $3,000 per agent, a newcomer's unoptimized setup will cost more. Possibly much more.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Guru Problem
&lt;/h2&gt;

&lt;p&gt;Scroll through YouTube and LinkedIn right now. "Deploy AI agents for your business." "Build an autonomous AI workforce." "Replace your team with agents." The pitch is seductive. Agents are cheap, they scale, they work while you sleep.&lt;/p&gt;

&lt;p&gt;Nobody mentions $13,000 per month per agent.&lt;/p&gt;

&lt;p&gt;Nobody mentions that 100 agents running GPT-5.5 burn through 603 billion tokens in 30 days. Nobody mentions that "Fast Mode" isn't just faster, it's dramatically more expensive. And nobody talks about how even the optimized version, built by someone who works at the company that makes the model, still costs $3.6 million per year.&lt;/p&gt;

&lt;p&gt;The gap between what's being sold and what's being spent is the widest I've seen in tech. And it's widest for the people with the least ability to absorb the surprise. Small businesses, indie developers, and first-time builders who took the guru at their word.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Practitioners Already Knew
&lt;/h2&gt;

&lt;p&gt;I &lt;a href="https://tokita.online/autonomous-ai-agents-production-cost/" rel="noopener noreferrer"&gt;wrote about this months ago&lt;/a&gt;. Autonomous AI agents look great in demos and burn cash in production. The OpenClaw numbers validate what practitioners already knew. The question was never "can agents do the work?" It was always "can you afford to let them?"&lt;/p&gt;

&lt;p&gt;When I build AI systems, cost control isn't an afterthought. It's architecture. It's why &lt;a href="https://tokita.online/what-is-harness-engineering/" rel="noopener noreferrer"&gt;harness engineering&lt;/a&gt; exists as a discipline. The &lt;a href="https://tokita.online/claude-code-mcp-server-persistent-memory/" rel="noopener noreferrer"&gt;memory server I run&lt;/a&gt; has a condensation layer specifically because raw search results were burning through the context window. Hundreds of thousands of characters of raw output compressed to a few thousand. That's not clever engineering. That's survival. Without it, every session would have been its own version of Steinberger's bill, just at a smaller scale.&lt;/p&gt;

&lt;p&gt;I co-founded &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology&lt;/a&gt;, a Salesforce consulting partner in Manila. When clients ask about AI agent deployment, the first conversation isn't about what the agent can do. It's about what the agent will cost per month, and what happens when it runs unsupervised for a weekend.&lt;/p&gt;

&lt;p&gt;Most agent frameworks ship without cost caps, token budgets, or kill switches. The &lt;a href="https://tokita.online/why-multi-agent-ai-fails/" rel="noopener noreferrer"&gt;agent swarming piece I wrote&lt;/a&gt; covers why multi-agent coordination fails in production. The OpenClaw bill is what that failure looks like in dollars.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Math
&lt;/h2&gt;

&lt;p&gt;Let's do the math the gurus won't.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;th&gt;Annual Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 agent (full pricing)&lt;/td&gt;
&lt;td&gt;$13,000&lt;/td&gt;
&lt;td&gt;$156,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 agent (optimized)&lt;/td&gt;
&lt;td&gt;$3,000&lt;/td&gt;
&lt;td&gt;$36,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 agents (optimized)&lt;/td&gt;
&lt;td&gt;$30,000&lt;/td&gt;
&lt;td&gt;$360,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 agents (optimized)&lt;/td&gt;
&lt;td&gt;$300,000&lt;/td&gt;
&lt;td&gt;$3,600,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 agents (full pricing)&lt;/td&gt;
&lt;td&gt;$1,300,000&lt;/td&gt;
&lt;td&gt;$15,600,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For context, the median annual salary for a software engineer in the Philippines is &lt;a href="https://www.payscale.com/research/PH/Job=Software_Engineer/Salary" rel="noopener noreferrer"&gt;roughly $15,000-20,000&lt;/a&gt;. At $156,000 a year, one unoptimized AI agent costs roughly eight to ten times that. Even one optimized agent, at $36,000 a year, costs about twice as much. Ten agents cost more than a small engineering team.&lt;/p&gt;

&lt;p&gt;"Replace your team with agents" stops sounding cheap when you do the multiplication.&lt;/p&gt;
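
&lt;p&gt;The table above can be reproduced from just two numbers in this post: the $1.3 million monthly bill and the $300,000 optimized estimate, both for 100 agents. The sketch below assumes cost scales linearly with agent count, which is a simplification, and the function name &lt;code&gt;fleet_cost&lt;/code&gt; is mine, not anything from OpenClaw.&lt;/p&gt;

```python
# Figures quoted in the article; everything else is derived from them.
AGENTS_MEASURED = 100
FULL_MONTHLY_TOTAL = 1_300_000       # USD per month with Fast Mode on
OPTIMIZED_MONTHLY_TOTAL = 300_000    # USD per month, estimate with Fast Mode off


def fleet_cost(monthly_total, agents_in_fleet, agents_measured=AGENTS_MEASURED):
    """Scale a measured fleet bill linearly to another fleet size.

    Linear scaling is an assumption made for illustration only.
    Returns (monthly_usd, annual_usd).
    """
    per_agent = monthly_total / agents_measured
    monthly = per_agent * agents_in_fleet
    return monthly, monthly * 12


scenarios = [
    ("1 agent (full pricing)", FULL_MONTHLY_TOTAL, 1),
    ("1 agent (optimized)", OPTIMIZED_MONTHLY_TOTAL, 1),
    ("10 agents (optimized)", OPTIMIZED_MONTHLY_TOTAL, 10),
    ("100 agents (optimized)", OPTIMIZED_MONTHLY_TOTAL, 100),
    ("100 agents (full pricing)", FULL_MONTHLY_TOTAL, 100),
]

for label, total, agents in scenarios:
    monthly, annual = fleet_cost(total, agents)
    print(f"{label}: ${monthly:,.0f} per month, ${annual:,.0f} per year")
```

&lt;p&gt;Swap in your own measured bill and agent count before trusting any of the rows. A real fleet rarely scales this cleanly.&lt;/p&gt;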

&lt;h2&gt;
  
  
  What To Actually Do
&lt;/h2&gt;

&lt;p&gt;The OpenClaw bill has four lessons that matter for anyone considering AI agents for real work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Know your token economics before you deploy.&lt;/strong&gt; Steinberger discovered that Fast Mode was the primary cost driver. That's a setting. One toggle. 70% cost difference. If you don't understand your pricing tier, your model's token consumption pattern, and your request volume, you're deploying blind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build cost controls into the architecture.&lt;/strong&gt; Token budgets per agent, spend thresholds that trigger alerts or kill switches, session caps, retry limits. These aren't features you add later. They're load-bearing walls. I wrote a &lt;a href="https://tokita.online/ai-agent-pre-action-gate-tutorial/" rel="noopener noreferrer"&gt;tutorial on building pre-action gates&lt;/a&gt; for exactly this kind of mechanical enforcement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with one agent, not a swarm.&lt;/strong&gt; Steinberger ran 100 agents because he could afford to (OpenAI was paying). You can't. One agent, measured, monitored, optimized. Then scale. The &lt;a href="https://tokita.online/ai-agent-production-safety/" rel="noopener noreferrer"&gt;architecture that prevents AI agents from taking destructive actions&lt;/a&gt; starts with one agent and one set of gates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question the subsidy.&lt;/strong&gt; OpenAI covering Steinberger's bill as "research investment" means these costs aren't sustainable at market rates. When your favorite guru says "just deploy agents," ask who's paying the token bill. If the answer involves investor subsidies or promotional pricing, the real cost is being hidden, not eliminated.&lt;/p&gt;
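
&lt;p&gt;To make the second lesson concrete, here is a minimal sketch of a per-agent budget with an alert threshold, a hard cap that acts as the kill switch, and a retry limit. Every name in it (&lt;code&gt;AgentBudget&lt;/code&gt;, &lt;code&gt;BudgetExceeded&lt;/code&gt;, &lt;code&gt;call_with_retries&lt;/code&gt;) is hypothetical, and the per-call cost is passed in by hand. A real system would take it from the provider's usage data.&lt;/p&gt;

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would push an agent past its hard cap."""


class AgentBudget:
    """Tracks one agent's spend against an alert threshold, a hard cap, and a retry limit."""

    def __init__(self, hard_cap_usd, alert_at_usd, max_retries=3):
        self.hard_cap_usd = hard_cap_usd
        self.alert_at_usd = alert_at_usd
        self.max_retries = max_retries
        self.spent_usd = 0.0
        self.alerts = []

    def charge(self, cost_usd):
        """Record the cost of one call, refusing it if the cap would be crossed."""
        if self.spent_usd + cost_usd > self.hard_cap_usd:
            raise BudgetExceeded(f"cap of ${self.hard_cap_usd:.2f} reached")
        self.spent_usd += cost_usd
        if self.spent_usd >= self.alert_at_usd:
            self.alerts.append(f"spend at ${self.spent_usd:.2f}")

    def call_with_retries(self, task, cost_per_attempt_usd):
        """Run task(), charging every attempt, and give up after max_retries failures."""
        last_error = None
        for _ in range(self.max_retries):
            # Charging happens outside the try block, so a blown budget
            # stops the loop instead of being retried.
            self.charge(cost_per_attempt_usd)
            try:
                return task()
            except Exception as error:  # illustration only; narrow this in real code
                last_error = error
        raise RuntimeError("retry limit reached") from last_error
```

&lt;p&gt;The design choice that matters is that failed attempts are charged too. A retry loop that only counts successful calls is how a weekend of silent failures turns into a bill.&lt;/p&gt;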

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How much does it cost to run an AI agent in production?
&lt;/h3&gt;

&lt;p&gt;Based on OpenClaw's published numbers, a single autonomous AI agent running GPT-5.5 costs approximately $13,000 per month at full pricing, or $3,000 per month with optimized settings (disabling Fast Mode). Actual costs depend on the model, token consumption patterns, and whether cost controls like retry limits and session caps are in place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why are AI agent costs so high?
&lt;/h3&gt;

&lt;p&gt;AI agents make many API calls per task, each consuming tokens. OpenClaw's 100 agents generated 7.6 million API requests and consumed 603 billion tokens in 30 days. Unlike a chatbot conversation, an autonomous agent running continuously accumulates token costs around the clock. Fast Mode and retry loops multiply these costs further.&lt;/p&gt;
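
&lt;p&gt;The figures above imply a few per-request numbers worth seeing. This is back-of-envelope arithmetic on the published totals, and it assumes the $1.3 million bill covers exactly those 603 billion tokens, which the source numbers suggest but which I can't verify line by line.&lt;/p&gt;

```python
# Totals quoted in the article for a 30-day window.
TOKENS = 603_000_000_000    # tokens consumed
REQUESTS = 7_600_000        # API requests
AGENTS = 100
DAYS = 30
MONTHLY_BILL = 1_300_000    # USD at full pricing

# Average size of one request, in tokens.
tokens_per_request = TOKENS / REQUESTS

# How often a single agent calls the API on an average day.
requests_per_agent_per_day = REQUESTS / AGENTS / DAYS

# Blended price across input and output tokens (an assumption: one bill, one token pool).
usd_per_million_tokens = MONTHLY_BILL / (TOKENS / 1_000_000)

print(f"{tokens_per_request:,.0f} tokens per request")
print(f"{requests_per_agent_per_day:,.0f} requests per agent per day")
print(f"${usd_per_million_tokens:.2f} per million tokens (blended)")
```

&lt;p&gt;That works out to roughly 79,000 tokens per request, about 2,500 requests per agent per day, and a blended rate near $2.16 per million tokens. The per-token price looks small. The volume is what produces the bill.&lt;/p&gt;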

&lt;h3&gt;
  
  
  Can you reduce AI agent costs?
&lt;/h3&gt;

&lt;p&gt;Yes. Steinberger noted that disabling Fast Mode alone reduced costs by 70%. Other strategies include setting token budgets per agent, implementing spend thresholds with kill switches, routing mechanical tasks to cheaper models instead of running everything on frontier-tier pricing, and starting with a single agent before scaling.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Claude Code Forgets Everything. So I Built It a Memory Server.</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Tue, 19 May 2026 11:37:15 +0000</pubDate>
      <link>https://dev.to/tomtokita/claude-code-forgets-everything-so-i-built-it-a-memory-server-581n</link>
      <guid>https://dev.to/tomtokita/claude-code-forgets-everything-so-i-built-it-a-memory-server-581n</guid>
      <description>&lt;p&gt;Everyone's building AI agents. Almost nobody is building memory for them.&lt;/p&gt;

&lt;p&gt;The default Claude Code experience is this: you open a session, you do great work, you close the session, and it's gone. Claude Code doesn't ship with an MCP server that fixes this. Next morning, you open a new session and explain the same project structure, the same deployment rules, the same "don't push to production without checking the allowlist" that you've explained every day this week. Claude is brilliant. Claude is also an amnesiac.&lt;/p&gt;

&lt;p&gt;On one project, that's annoying. Across a live client portfolio, it's a wall. I was burning the first ten minutes of every session on logistics that the system already knew and forgot. Same overview. Same rules. Same warnings. The AI equivalent of training a new hire every morning.&lt;/p&gt;

&lt;p&gt;So I stopped accepting the default and built a custom Claude Code MCP server with persistent memory. What started as a quick fix turned into the core of how I work.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MCP Is (And What It Isn't)
&lt;/h2&gt;

&lt;p&gt;MCP (Model Context Protocol) lets you give Claude tools it doesn't ship with. You run a server, Claude connects to it, your server exposes capabilities that Claude calls during a session. Anthropic's &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;docs&lt;/a&gt; cover setup.&lt;/p&gt;

&lt;p&gt;This post isn't about the plumbing. It's about what you build once the plumbing works, and why the interesting problems start after "hello world."&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Claude Code MCP Server Actually Needs (And Why)
&lt;/h2&gt;

&lt;p&gt;My server gives Claude four things it doesn't have by default: persistent memory, context condensation, delegated file reading, and compliance checking. I didn't design any of this upfront. Something broke, I fixed it, and the fix became a feature.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory That Survives Between Sessions
&lt;/h3&gt;

&lt;p&gt;This came first, because the pain was loudest. I needed Claude to remember things across sessions: project configurations, platform quirks I'd spent hours debugging, deployment rules that came from three broken deploys and a near-miss on a production org.&lt;/p&gt;

&lt;p&gt;The server indexes all of that into a vector database. Thousands of knowledge chunks, searchable by meaning and by keyword. When Claude starts a session, the first thing it does is search the memory server. The rule is simple: check what you already know before guessing.&lt;/p&gt;

&lt;p&gt;I use hybrid search. Vector similarity finds conceptually related content. Keyword search catches exact terms. Neither alone is reliable, and I learned that the hard way. Semantic-only search kept returning adjacent results that missed the specific command or config value I needed. Adding keyword matching fixed the retrieval quality problems, but only after weeks of wondering why search felt "close but wrong."&lt;/p&gt;
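
&lt;p&gt;Here is a minimal sketch of the hybrid idea, not my production code. It blends a cosine similarity score with a plain keyword-overlap score. The toy two-dimensional vectors stand in for real embeddings, and the names &lt;code&gt;hybrid_search&lt;/code&gt; and &lt;code&gt;alpha&lt;/code&gt; are made up for the example.&lt;/p&gt;

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query, text):
    """Fraction of query terms that appear verbatim in the text."""
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(term in words for term in terms) / len(terms) if terms else 0.0


def hybrid_search(query, query_vec, chunks, alpha=0.5, top_k=3):
    """Rank chunks by alpha * vector score + (1 - alpha) * keyword score."""
    scored = []
    for chunk in chunks:
        vector_part = cosine(query_vec, chunk["vec"])
        keyword_part = keyword_score(query, chunk["text"])
        scored.append((alpha * vector_part + (1 - alpha) * keyword_part, chunk["text"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


# Toy data: the first chunk is semantically closest, the second has the exact terms.
chunks = [
    {"text": "general notes on deploying safely", "vec": [0.95, 0.05]},
    {"text": "deploy command must check the allowlist first", "vec": [0.8, 0.6]},
]
query = "allowlist deploy command"
query_vec = [1.0, 0.0]

print(hybrid_search(query, query_vec, chunks, alpha=1.0)[0])  # vector-only ranking
print(hybrid_search(query, query_vec, chunks, alpha=0.5)[0])  # hybrid ranking
```

&lt;p&gt;With &lt;code&gt;alpha=1.0&lt;/code&gt; the adjacent-but-vague chunk wins. At &lt;code&gt;alpha=0.5&lt;/code&gt; the chunk with the exact terms comes out on top, which is the "close but wrong" problem in miniature.&lt;/p&gt;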

&lt;h3&gt;
  
  
  Condensation (The Problem Nobody Warns You About)
&lt;/h3&gt;

&lt;p&gt;When your memory server works too well, it returns too much.&lt;/p&gt;

&lt;p&gt;One operation was returning over 200,000 characters of raw project context. That payload literally couldn't fit in the tool response. Claude would choke before reading a single result. Your memory server becomes a liability the moment it knows more than the context window can hold.&lt;/p&gt;

&lt;p&gt;The fix was a condenser. Results pass through a lighter model before reaching Claude. That model reads the full output and returns a distilled summary. Two hundred thousand characters compress down to a few thousand. Claude gets the answer without the bloat.&lt;/p&gt;

&lt;p&gt;If you're building an MCP server and you don't have a condensation layer, you'll hit this wall the moment your knowledge base grows past a few hundred entries. I know because I ran without one for weeks and couldn't figure out why sessions were getting slower and dumber. The condenser was the fourth thing I built. It should have been the first.&lt;/p&gt;
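
&lt;p&gt;The shape of a condensation layer is simple enough to sketch. In the version below, &lt;code&gt;summarize&lt;/code&gt; is a placeholder for the call to a lighter model, and &lt;code&gt;first_sentences&lt;/code&gt; is a crude stand-in so the example runs without one. The 4,000-character default is an arbitrary illustration, not a recommendation.&lt;/p&gt;

```python
def condense(results, summarize, max_chars=4000):
    """Return results unchanged if they fit, otherwise a summary capped at max_chars."""
    raw = "\n\n".join(results)
    if len(raw) > max_chars:
        # Hard-truncate the summary as well, so a chatty summarizer
        # can never overflow the budget.
        raw = summarize(raw, max_chars)[:max_chars]
    return raw


def first_sentences(text, max_chars):
    """Stand-in for the lighter model: keep only the first sentence of each result."""
    heads = [block.split(". ")[0].strip() for block in text.split("\n\n")]
    return "\n".join(heads)


results = [
    "Deploy rule one. " + "x" * 3000,
    "Deploy rule two. " + "y" * 3000,
]
print(condense(results, first_sentences))
```

&lt;p&gt;The part worth copying is the final slice. Whatever does the summarizing, the layer itself guarantees the ceiling.&lt;/p&gt;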

&lt;h3&gt;
  
  
  Delegated Reading (Keep the Context Window Clean)
&lt;/h3&gt;

&lt;p&gt;Claude's context window is finite. Every file it reads directly consumes capacity that could be used for reasoning. Big file loads are expensive, and the cost isn't dollars. It's degraded output quality three tool calls later, because the window is stuffed with a 2,000-line config file that Claude only needed two lines from.&lt;/p&gt;

&lt;p&gt;So I built a reader. A lighter model scans the file and answers specific questions, returning cited answers with line numbers. Claude asks "what are the deployment rules for this project?" and gets back a sourced answer without loading the entire document.&lt;/p&gt;

&lt;p&gt;Same principle for writing. Mechanical work (session logs, documentation updates, structured captures) gets delegated to a cheaper model. Claude focuses on reasoning. The formatting happens elsewhere. You don't pay senior rates for data entry.&lt;/p&gt;
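
&lt;p&gt;The reader's contract, a question in and cited lines out, can be sketched without a model at all. The version below uses plain keyword matching where my server uses a lighter model, so treat it as an illustration of the return shape only. The function name &lt;code&gt;read_with_citations&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
def read_with_citations(lines, keywords):
    """Return (line_number, text) pairs for lines mentioning any keyword.

    Line numbers start at 1 so they match what an editor shows.
    """
    wanted = [keyword.lower() for keyword in keywords]
    return [
        (number, line.strip())
        for number, line in enumerate(lines, start=1)
        if any(keyword in line.lower() for keyword in wanted)
    ]


config = [
    "name: demo",
    "deploy: staging only",
    "retries: 3",
]
for number, text in read_with_citations(config, ["deploy"]):
    print(f"line {number}: {text}")
```

&lt;p&gt;The caller gets two lines of signal and a line number to verify against, not the whole file.&lt;/p&gt;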

&lt;h3&gt;
  
  
  Compliance Checking (Because Prompts Drift)
&lt;/h3&gt;

&lt;p&gt;This one came from the most painful failure. I needed Claude to validate proposed actions against rules before executing. Not a prompt instruction. Not "please remember to check the allowlist." Prompts get compressed. Prompts get forgotten. A prompt is a suggestion. A gate is a wall.&lt;/p&gt;

&lt;p&gt;The server accepts a proposed action, checks it against predefined rules, and returns pass or fail. It's the difference between asking someone to remember a checklist and bolting that checklist to the door so they can't walk through without completing it.&lt;/p&gt;

&lt;p&gt;If you've ever told an AI "don't do X" and then watched it do X forty-five minutes later after a long conversation, you understand why mechanical enforcement exists. The model didn't disobey. It forgot. Forgetting and disobeying look identical from the outside, but only one of them is fixable with infrastructure.&lt;/p&gt;
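
&lt;p&gt;A gate of this kind can be very small. The sketch below assumes an action is a plain dictionary and a rule is a predicate paired with a failure message. The rule set and target names are invented for the example and are not the rules I run.&lt;/p&gt;

```python
def check_action(action, rules):
    """Return (passed, reasons); passed is False if any rule rejects the action."""
    reasons = [message for rule, message in rules if not rule(action)]
    return (not reasons, reasons)


# Hypothetical rules: each entry is (predicate, message shown on failure).
ALLOWED_TARGETS = {"dev-sandbox", "staging"}
rules = [
    (lambda a: a.get("target") in ALLOWED_TARGETS, "target is not on the allowlist"),
    (lambda a: a.get("type") != "delete" or a.get("confirmed") is True, "delete needs confirmation"),
]

passed, reasons = check_action({"type": "deploy", "target": "production"}, rules)
print(passed, reasons)
```

&lt;p&gt;The point is that the check runs in code, outside the conversation, so it can't be compressed away. The agent only ever sees the verdict and the reasons.&lt;/p&gt;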

&lt;h2&gt;
  
  
  Session Loading: How Much Context Is Too Much?
&lt;/h2&gt;

&lt;p&gt;Once you have persistent memory, you face a new question. How much do you load at session start?&lt;/p&gt;

&lt;p&gt;Load everything, and you burn half the context window on background knowledge before the session begins. Load nothing, and you're back to square one.&lt;/p&gt;

&lt;p&gt;I built a tiered loader. One call returns exactly what Claude needs. First tier: core rules, security protocols, workflow constraints. Always loaded, always lean. Second tier: project-specific context. Only loaded when relevant. Both tiers pass through the condenser before returning, so the loaded context is measured in thousands of characters, not hundreds of thousands.&lt;/p&gt;

&lt;p&gt;Claude starts every session knowing the rules, the recent project activity, and what happened yesterday. One call. Under a second. No re-explaining.&lt;/p&gt;
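
&lt;p&gt;As a rough sketch of the two tiers, assuming core rules are a list of strings and project context is a dictionary keyed by project name. The &lt;code&gt;condense&lt;/code&gt; parameter is a placeholder for whatever condenser you have, and it defaults to passing text through untouched.&lt;/p&gt;

```python
def load_session(core_rules, project_context, active_project=None, condense=lambda text: text):
    """Build the session-start payload.

    Tier 1 (core rules) always loads. Tier 2 loads only when the
    active project has stored context. Both pass through condense.
    """
    sections = ["\n".join(core_rules)]
    if active_project in project_context:
        sections.append(project_context[active_project])
    return condense("\n\n".join(sections))


core = ["never push to production without the allowlist"]
projects = {"alpha": "alpha deploys to staging first"}

print(load_session(core, projects))            # tier 1 only
print(load_session(core, projects, "alpha"))   # tier 1 plus tier 2
```

&lt;p&gt;Keeping the condenser as the last step means neither tier can grow past the budget without someone noticing.&lt;/p&gt;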

&lt;h2&gt;
  
  
  How Claude Code Loses Memory Mid-Session (And How to Fix It)
&lt;/h2&gt;

&lt;p&gt;This is the problem nobody talks about, and it's the one that will cost you the most debugging time.&lt;/p&gt;

&lt;p&gt;Claude Code compresses your conversation when the context window fills up. Older messages get summarized. In theory, this is efficient. What that means in practice: the deployment rules you loaded at session start can silently vanish mid-session. The behavioral constraints? Gone. The project state? Compressed into a summary that may or may not preserve what matters.&lt;/p&gt;

&lt;p&gt;My server detects this. When Claude calls the session loader a second time in the same session, the server includes a recovery hint: the most recently active project. Claude reloads the relevant context surgically. Not the full knowledge base. Just what the current task needs.&lt;/p&gt;

&lt;p&gt;Before this existed, long sessions would silently lose their constraints around the two-hour mark. I wouldn't notice until Claude deployed a metadata package to the wrong org because the deploy rules from session start had been compressed away. The failures were quiet. That's what made them expensive.&lt;/p&gt;
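
&lt;p&gt;The detection trick described above, treating a repeat loader call as a sign of lost context, might look like this in miniature. The class and field names are hypothetical, and the rules payload is a placeholder string.&lt;/p&gt;

```python
class SessionLoader:
    """Counts loader calls per session; a repeat call is treated as lost context."""

    def __init__(self):
        self.calls = {}
        self.last_project = {}

    def note_activity(self, session_id, project):
        """Remember which project the session touched most recently."""
        self.last_project[session_id] = project

    def load(self, session_id):
        """Return the core rules, plus a recovery hint on any repeat call."""
        self.calls[session_id] = self.calls.get(session_id, 0) + 1
        response = {"rules": "core rules go here"}
        if self.calls[session_id] > 1:
            # Second call in the same session: point at what to reload.
            response["recovery_hint"] = self.last_project.get(session_id)
        return response


loader = SessionLoader()
loader.load("s1")                      # session start
loader.note_activity("s1", "alpha")    # work happens on project alpha
print(loader.load("s1"))               # repeat call carries the hint
```

&lt;p&gt;The hint is deliberately narrow. It names the project to reload and leaves the heavy lifting to the tiered loader.&lt;/p&gt;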

&lt;h2&gt;
  
  
  What Persistent Memory Changes for Claude Code Workflows
&lt;/h2&gt;

&lt;p&gt;Before the memory server, every session started with ten minutes of setup. Reading project files, re-establishing context, reminding Claude which org belongs to which project. Creative time wasted on logistics.&lt;/p&gt;

&lt;p&gt;After: one call. Rules load, project context loads, recent activity loads. I start working immediately.&lt;/p&gt;

&lt;p&gt;But the real win is compounding. Every session generates learnings. Deployment patterns that worked. API gotchas that burned an hour. Platform quirks that only surface in production. Those learnings get indexed automatically. The next session starts with that knowledge already searchable. The session after that starts with even more.&lt;/p&gt;

&lt;p&gt;I co-founded &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology&lt;/a&gt;, a Salesforce consulting partner in Manila. The memory server runs alongside that work as a personal R&amp;amp;D system. It doesn't touch client data. What it does is compound operational knowledge across projects and platforms, so Claude rarely encounters a problem it hasn't seen a version of before.&lt;/p&gt;

&lt;p&gt;Mistakes get encoded so they don't repeat. The memory server doesn't make Claude smarter. It makes Claude less likely to be stupid in the same way twice.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Build Differently
&lt;/h2&gt;

&lt;p&gt;I over-indexed on features and under-indexed on condensation. The memory server had rich search, tiered loading, and compliance checking before it had a condenser. That meant every search returned massive payloads that burned through the context window. If I were starting over, condensation would be the first thing I built, not the fourth.&lt;/p&gt;

&lt;p&gt;I'd also start with a smaller embedding model. My instinct was to use the most capable sentence transformer I could find. The difference in search quality between models was marginal. The difference in startup time and memory footprint was not. A lighter model that loads in seconds would have saved weeks of debugging cold-start problems on a machine that was already running six other services.&lt;/p&gt;

&lt;p&gt;And I'd design for compression survival from day one, not bolt it on after losing context mid-session three times. That pattern is now the part of the system I trust most, but it didn't need to take three incidents to build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is a Claude Code MCP server?
&lt;/h3&gt;

&lt;p&gt;An MCP (Model Context Protocol) server is a custom backend that gives Claude Code capabilities it doesn't have out of the box. You run the server, Claude connects to it, and your server exposes tools that Claude can call during a session. A memory-focused MCP server specifically solves the problem of Claude forgetting everything between sessions by providing persistent, searchable knowledge storage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does Claude Code remember between sessions?
&lt;/h3&gt;

&lt;p&gt;Not by default. Claude Code starts every session fresh. CLAUDE.md files provide some static context, but they don't scale past a single project. A custom MCP server with a vector database and session loading gives Claude persistent memory across sessions, so it knows your project rules, past learnings, and recent activity without you re-explaining every time.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is context compression in Claude Code?
&lt;/h3&gt;

&lt;p&gt;When your conversation with Claude Code fills the context window, older messages get summarized to make room. This is called context compression. The problem is that rules, constraints, and project state loaded at session start can silently disappear during compression. Without a recovery mechanism, Claude forgets its guardrails mid-session.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I add persistent memory to Claude Code?
&lt;/h3&gt;

&lt;p&gt;Build an MCP server that indexes your knowledge into a vector database, then expose search and retrieval as MCP tools. Claude calls these tools at session start to load context. Add a condensation layer so large results don't overflow the context window. The key insight is that memory alone isn't enough. You need condensation, tiered loading, and compression recovery to make it work at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;I'm working on a lightweight version of this memory server. Stripped to the core: vector search, session loading, and basic condensation. Enough to give Claude Code persistent memory without the full production infrastructure. Follow &lt;a href="https://github.com/tomtokitajr" rel="noopener noreferrer"&gt;my GitHub&lt;/a&gt; for updates.&lt;/p&gt;

&lt;p&gt;If you want something you can use today, I open-sourced the pre-action gate pattern. Mechanical enforcement that blocks your AI agent from executing before checking the rules. Zero dependencies. Works with Claude and Gemini.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/tomtokitajr/ai-agent-gates" rel="noopener noreferrer"&gt;github.com/tomtokitajr/ai-agent-gates&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tom Tokita is co-founder of Aether Global Technology and builds AI operations systems in Manila. He writes about what works in production.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Most AI Tools Are Just LLM Wrappers. Here's What Actually Matters.</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Tue, 19 May 2026 00:36:13 +0000</pubDate>
      <link>https://dev.to/tomtokita/most-ai-tools-are-just-llm-wrappers-heres-what-actually-matters-10mg</link>
      <guid>https://dev.to/tomtokita/most-ai-tools-are-just-llm-wrappers-heres-what-actually-matters-10mg</guid>
      <description>&lt;p&gt;&lt;strong&gt;In 2025, AI wrapper startups raised over $10 billion.&lt;/strong&gt; The product? Take an LLM API. Add a text box. Maybe some prompt templates. Charge $30/month. Call it "AI-powered."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not mad at the hustle.&lt;/strong&gt; But if your entire product disappears the moment ChatGPT adds your feature for free, you don't have a product. You have a timing play.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Wrapper Test
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;One question tells you everything:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Can you replicate the output by pasting the same input into ChatGPT or Claude?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If yes:&lt;/strong&gt; it's a wrapper. You're paying for UI and convenience, not intelligence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If no:&lt;/strong&gt; because it's pulling from multiple data sources, applying domain logic, or integrating with real systems, it might be something real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most fail the test.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Thin vs. Thick
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Not all wrappers are equal.&lt;/strong&gt; The market is splitting fast:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Thin Wrapper&lt;/th&gt;
&lt;th&gt;Thick Wrapper&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it does&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;UI + API call + system prompt&lt;/td&gt;
&lt;td&gt;Real integrations, domain logic, data pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Defensibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None. One platform update kills it&lt;/td&gt;
&lt;td&gt;High. Value is in the connectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"AI email writer" (GPT call with a system prompt)&lt;/td&gt;
&lt;td&gt;Cursor (reads your codebase, understands project context)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Survival odds&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Decent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The graveyard of 2025–2026&lt;/strong&gt; is littered with thin wrappers that a platform update made irrelevant overnight.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Matters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Strip away the wrapper.&lt;/strong&gt; Where does the real value live?&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Connectors
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The ability to talk to real systems:&lt;/strong&gt; Salesforce, Jira, databases, email, file storage, APIs. This is where 80% of the actual work lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Getting an AI to generate text is trivial.&lt;/strong&gt; Getting it to read your CRM records, cross-reference tickets, update a database, and notify Slack. That's integration work. That's hard. That's valuable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most wrappers don't touch this.&lt;/strong&gt; They live in the text-in, text-out world.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Captured Domain Expertise
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;An AI that's been learning your industry's quirks for months&lt;/strong&gt; is worth more than a fresh GPT-5 instance with a clever prompt.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Fresh AI + Great Prompt&lt;/th&gt;
&lt;th&gt;AI + 6 Months of Learnings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Platform quirks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Discovers them painfully&lt;/td&gt;
&lt;td&gt;Already knows them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Common mistakes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Makes them all&lt;/td&gt;
&lt;td&gt;Has guardrails for each&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Your terminology&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Constant correction needed&lt;/td&gt;
&lt;td&gt;Uses it naturally&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Edge cases&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Surprised every time&lt;/td&gt;
&lt;td&gt;Documented patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The knowledge compounds.&lt;/strong&gt; Every session, every bug fix, every "oh, that's how this actually works" gets captured and fed back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No wrapper captures this.&lt;/strong&gt; They start fresh every time.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Methodology
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How you approach problems with AI&lt;/strong&gt; matters more than which model you use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The wrapper approach:&lt;/strong&gt; open tool → type request → get output → hope it's right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practitioner approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Small test:&lt;/strong&gt; constrained input, see what happens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate:&lt;/strong&gt; what worked? What broke?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capture:&lt;/strong&gt; document the learning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjust:&lt;/strong&gt; update the approach&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Repeat&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The tool is 10%. The methodology is 90%.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The "Just Build It" Case
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Here's the uncomfortable truth.&lt;/strong&gt; Building your own system (even ugly, even scrappy) gives you something no wrapper provides: &lt;strong&gt;understanding.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You know why it works.&lt;/strong&gt; Why it breaks. How to fix it. When the model changes (and it will), you swap the engine. The connectors, the learnings, the guardrails. Those persist. They're yours.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost at scale:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Wrapper Stack&lt;/th&gt;
&lt;th&gt;Custom (Direct API)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Month 1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$150/seat, fast setup&lt;/td&gt;
&lt;td&gt;$500 dev time, slower start&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Month 6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$150/seat, same capabilities&lt;/td&gt;
&lt;td&gt;$50/month API, growing capabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Year 1 (5 seats)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$9,000&lt;/td&gt;
&lt;td&gt;~$3,100 + compound knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Custom costs less AND gets smarter.&lt;/strong&gt; The wrapper costs the same and stays the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Philippines advantage:&lt;/strong&gt; smaller teams with direct API access can outperform larger orgs paying for wrapper stacks. When you can't afford $150/seat for 6 different AI tools, you build one system that does what you need. That constraint produces better architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Wrappers DO Make Sense
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fair is fair:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed to market:&lt;/strong&gt; need something running tomorrow without engineering capacity? Wrapper gets you there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thick wrappers with real integrations:&lt;/strong&gt; Cursor, Harvey, Perplexity add genuine value beyond the API call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exploration phase:&lt;/strong&gt; trying 5 wrappers to understand the capability space before building your own is smart R&amp;amp;D.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The key question:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are you buying a tool or renting a feature?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;If the value prop is "we make it easy to talk to an LLM,"&lt;/strong&gt; that feature is getting commoditized in real time. Every model provider is making their native interface better, faster, cheaper.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Build Instead
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Ready to go beyond wrappers?&lt;/strong&gt; Start here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Map your connectors.&lt;/strong&gt; What systems does your AI need to talk to? Build those integrations first. Hardest part. Most valuable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Capture everything.&lt;/strong&gt; Every platform quirk. Every failed approach. Every successful pattern. Your AI should learn from your organization's experience, not start fresh every session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Own your methodology.&lt;/strong&gt; Document how you approach problems with AI. Small tests → captured learnings → iteration. More valuable than any tool you can buy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Accept ugly.&lt;/strong&gt; The most effective AI systems I've built are not pretty. Config files, markdown documents, scripts. They look like plumbing. They work like machines.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The moat isn't the model.&lt;/strong&gt; It never was.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's the connectors&lt;/strong&gt; that talk to your stack. The domain expertise captured over months. The methodology that turns every failure into a lesson.&lt;/p&gt;

&lt;p&gt;None of that lives in a wrapper.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Tom Tokita. I run &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology&lt;/a&gt; out of Manila. We build production AI and Salesforce systems for enterprises that need real integrations, not another wrapper. &lt;a href="https://aether-global.com/contact" rel="noopener noreferrer"&gt;Let's talk.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Read next: &lt;a href="https://dev.to/blog/context-engineering-vs-prompt-engineering"&gt;Context Engineering: Why Your AI Strategy Needs Infrastructure, Not Better Prompts&lt;/a&gt; · &lt;a href="https://dev.to/blog/autonomous-ai-agents-production-cost"&gt;Autonomous AI Agents Look Great in Demos. Here's What They Cost in Production.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>webdev</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Truth About Agent Swarming: What the Gurus Won't Tell You About Cost, Failure, and Security</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Sat, 16 May 2026 11:15:26 +0000</pubDate>
      <link>https://dev.to/tomtokita/the-truth-about-agent-swarming-what-the-gurus-wont-tell-you-about-cost-failure-and-security-1775</link>
      <guid>https://dev.to/tomtokita/the-truth-about-agent-swarming-what-the-gurus-wont-tell-you-about-cost-failure-and-security-1775</guid>
      <description>&lt;p&gt;Everyone's building "AI agent teams" right now. Five agents, ten agents, a whole swarm collaborating on complex tasks. At least that's what the YouTube thumbnails promise. The reality? Most of these systems are burning money, leaking data, and failing in ways their builders don't even notice until the invoice arrives.&lt;/p&gt;

&lt;p&gt;I built a multi-agent system. It runs in production, daily. So I'm not here to tell you agent swarming doesn't work. I'm here to tell you that most of the advice circulating about it is dangerously incomplete.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Swarm Hype Cycle Is in Full Swing
&lt;/h2&gt;

&lt;p&gt;Open Twitter or YouTube right now and you'll find a hundred tutorials showing you how to spin up a multi-agent team in under 20 minutes. CrewAI, AutoGen, LangGraph. The frameworks keep multiplying. The demos look incredible: agents researching, agents writing, agents reviewing each other's work, all orchestrated into a beautiful pipeline.&lt;/p&gt;

&lt;p&gt;Here's what the demos don't show: what happens when you run that pipeline 500 times. Or 5,000 times. Or when one agent hallucinates and the next agent treats that hallucination as fact and passes it downstream to a third agent that takes action on it.&lt;/p&gt;

&lt;p&gt;The guru content follows a pattern: show the setup, show one successful run, skip the failure modes, skip the bill, skip the security implications. It's like showing someone how to start a restaurant by filming one perfect dinner service and cutting before the health inspector shows up.&lt;/p&gt;

&lt;p&gt;The latest version of this is "I built an entire company in 30 minutes with AI agents." Someone spins up a framework like &lt;a href="https://github.com/nicepkg/paperclip" rel="noopener noreferrer"&gt;Paperclip&lt;/a&gt; (which, to be fair, has genuinely solid engineering underneath it: heartbeat scheduling, budget caps, task queues, audit trails), and the content that follows makes it sound like you can replace an entire org overnight. The tool isn't the problem. The tool is fine. The problem is the interpretation layer: gurus film the setup and skip the part where 48 pre-configured agents wake up every 4 hours on a frontier model. Nobody mentions what that costs at the end of the month, or what happens when agent #23 gets a poisoned input and the other 47 trust its output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Multi-Agent AI Fails in Production
&lt;/h2&gt;

&lt;p&gt;The coordination problem is real and it scales badly. &lt;a href="https://galileo.ai/blog/why-multi-agent-systems-fail" rel="noopener noreferrer"&gt;Galileo's research on multi-agent reliability&lt;/a&gt; found that adding agents multiplies failure points quadratically, because every pair of agents is a potential handoff. Four agents create six potential failure points, not four. Ten agents create 45. Every agent-to-agent handoff is a place where context gets lost, instructions get misinterpreted, or outputs get corrupted.&lt;/p&gt;
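
&lt;p&gt;The six and the 45 are the number of distinct agent pairs, n(n-1)/2. A minimal sketch of that count, assuming every agent can hand off to every other agent (the function name is illustrative, not something from Galileo's write-up):&lt;/p&gt;

```python
from math import comb


def handoff_count(agent_count):
    """Distinct agent-to-agent pairs, assuming a fully connected team."""
    return comb(agent_count, 2)


for n in (2, 4, 10, 48):
    print(n, "agents:", handoff_count(n), "potential handoffs")
```

&lt;p&gt;A strictly linear pipeline has fewer handoffs than this, so treat the pair count as the worst case for a team where any agent may consume any other agent's output.&lt;/p&gt;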

&lt;p&gt;&lt;a href="https://www.cio.com/article/4143420/true-multi-agent-collaboration-doesnt-work.html" rel="noopener noreferrer"&gt;CIO reported in March 2026&lt;/a&gt; that true multi-agent collaboration remains largely aspirational. Their testing showed single agents hitting 100% success rates on isolated tasks, while hierarchical multi-agent structures failed 64% of the time and self-organized swarms failed 68%. That's not a rounding error. That's a fundamental coordination tax.&lt;/p&gt;

&lt;p&gt;The failure modes I've seen firsthand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No purpose definition.&lt;/strong&gt; Agents exist because someone saw a cool demo, not because the task requires decomposition. A single well-prompted agent with good tools will outperform a badly orchestrated team of five every time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No role boundaries.&lt;/strong&gt; Two agents stepping on each other's work, or worse, one agent undoing what another just did. Without strict scoping, you get agents arguing in loops, burning tokens while producing nothing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cascade failures.&lt;/strong&gt; Agent A hallucinates a "fact." Agent B cites it. Agent C acts on it. By the time a human reviews the output, three layers of confident-sounding nonsense have compounded. &lt;a href="https://galileo.ai/blog/why-multi-agent-systems-fail" rel="noopener noreferrer"&gt;Galileo calls this "propagation of inaccuracies"&lt;/a&gt; and it's the single biggest reliability risk in multi-agent systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Pattern&lt;/th&gt;
&lt;th&gt;What Happens&lt;/th&gt;
&lt;th&gt;How It Scales&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No purpose definition&lt;/td&gt;
&lt;td&gt;Agents do work a single agent could handle&lt;/td&gt;
&lt;td&gt;Cost multiplies, quality stays flat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No role boundaries&lt;/td&gt;
&lt;td&gt;Agents duplicate or undo each other's work&lt;/td&gt;
&lt;td&gt;Token burn scales quadratically with agent count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cascade hallucination&lt;/td&gt;
&lt;td&gt;Bad output propagates through the chain&lt;/td&gt;
&lt;td&gt;Compounds per hop. 3 agents = 3 layers of compounded error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window overflow&lt;/td&gt;
&lt;td&gt;Shared context exceeds model limits, agents lose thread&lt;/td&gt;
&lt;td&gt;Every agent's output inflates the shared context for every other agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestrator bottleneck&lt;/td&gt;
&lt;td&gt;Single coordinator becomes the weakest link&lt;/td&gt;
&lt;td&gt;Orchestrator complexity grows O(n²) with agent count&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The API Bill Nobody Shows You
&lt;/h2&gt;

&lt;p&gt;Every agent in your swarm is an API call. More accurately, every agent is &lt;em&gt;multiple&lt;/em&gt; API calls: the initial prompt, the tool calls, the retries, the context-sharing between agents. A five-agent team running on a frontier model isn't 5x the cost of one agent. It's often 10-15x once you factor in coordination overhead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.getmonetizely.com/articles/the-complete-guide-to-agent-swarm-pricing-models-how-should-you-price-collective-ai-intelligence" rel="noopener noreferrer"&gt;Stanford's AI Index Report, cited by Monetizely&lt;/a&gt;, found that coordination overhead alone accounts for 15-25% of total operational costs in mature multi-agent systems. That's before you count the actual task execution.&lt;/p&gt;

&lt;p&gt;Here's how the math works in practice. Say you're running a research-and-write pipeline with five agents (researcher, analyst, writer, editor, fact-checker). Each agent averages 3,000 input tokens and 1,500 output tokens per task. On a frontier model, that's roughly $0.04 per agent per task &lt;em&gt;(pricing as of March 2026; check your provider's current rates)&lt;/em&gt;. Five agents: $0.20 per task. Sounds cheap, right?&lt;/p&gt;

&lt;p&gt;Now add retries (agent disagrees with another agent's output, re-runs). Add context sharing (every agent needs to see what the others produced, and input tokens multiply). Add the orchestrator's overhead. Add recursive thinking where an agent calls itself to refine. In production, that $0.20 task routinely becomes $0.80-$1.50. Run it 100 times a day and you're looking at $80-$150 daily, or $2,400-$4,500 monthly. For a single pipeline.&lt;/p&gt;
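
&lt;p&gt;The same arithmetic as a small script you can rerun with your own numbers. The constants are the figures quoted above; the 30-day month and the function name are assumptions made for the sketch:&lt;/p&gt;

```python
AGENTS = 5
COST_PER_AGENT_CALL = 0.04  # USD, the rough per-agent figure quoted above
RUNS_PER_DAY = 100
DAYS_PER_MONTH = 30  # assumption: a flat 30-day month


def monthly_cost(cost_per_task):
    """Monthly spend for one pipeline at a given per-task cost."""
    return cost_per_task * RUNS_PER_DAY * DAYS_PER_MONTH


naive_task_cost = AGENTS * COST_PER_AGENT_CALL
print(f"naive estimate: ${naive_task_cost:.2f}/task, "
      f"${monthly_cost(naive_task_cost):,.0f}/month")

# The production range quoted above, after retries and context sharing.
for observed in (0.80, 1.50):
    multiplier = observed / naive_task_cost
    print(f"observed ${observed:.2f}/task ({multiplier:.1f}x the naive estimate): "
          f"${observed * RUNS_PER_DAY:,.0f}/day, ${monthly_cost(observed):,.0f}/month")
```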

&lt;p&gt;The gurus never show you the billing dashboard. I've seen my own costs spike 4x in a single day when an agent hit a retry loop that the orchestrator didn't catch. That's the kind of lesson you only learn in production, not in a 20-minute tutorial. I wrote more about &lt;a href="https://tokita.online/autonomous-ai-agents-production-cost/" rel="noopener noreferrer"&gt;what autonomous agents actually cost in production&lt;/a&gt;. That post covers the single-agent version of this problem, which multi-agent setups compound.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Security Problem Nobody's Talking About
&lt;/h2&gt;

&lt;p&gt;This is the part that genuinely concerns me. People are downloading MCP servers from GitHub, connecting premade agent builders, and giving their swarm access to production databases, file systems, and APIs, without auditing a single line of the code routing their data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.covertswarm.com/post/multi-agent-ai-security-risks" rel="noopener noreferrer"&gt;CovertSwarm's January 2026 analysis&lt;/a&gt; exposed how agent-to-agent communication can be exploited through prompt injection, where one compromised agent manipulates another agent's behavior through crafted outputs. In a multi-agent system, a single compromised node can cascade manipulation across the entire swarm.&lt;/p&gt;

&lt;p&gt;The security gaps I see repeated constantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No credential scoping.&lt;/strong&gt; Every agent gets the same API keys with the same permissions. Your research agent has write access to your production database. Your summarizer can send emails. Why?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No output boundaries.&lt;/strong&gt; Agent outputs aren't sanitized before being passed to the next agent. That's how prompt injection propagates. A malicious input in a research result becomes an instruction to the next agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unaudited external tools.&lt;/strong&gt; That MCP server you downloaded because it had 200 GitHub stars? Did you read its source? Do you know where it sends your data? Most people don't. &lt;a href="https://tokita.online/llm-wrappers-what-actually-matters/" rel="noopener noreferrer"&gt;Most AI tools are just wrappers&lt;/a&gt; with varying levels of transparency about what happens between your input and the LLM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No audit trail.&lt;/strong&gt; When something goes wrong in a five-agent pipeline, can you reconstruct what each agent saw, decided, and produced? Most frameworks don't log at that granularity by default.&lt;/li&gt;
&lt;/ul&gt;
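
&lt;p&gt;For the audit-trail gap, one low-tech option is an append-only JSON Lines file with one record per agent step. This is a sketch under assumptions: the field names and helper functions are hypothetical, and no particular framework logs this way by default.&lt;/p&gt;

```python
import json
import time


def audit_record(run_id, agent, saw, decided, produced):
    """One record per agent step: what it saw, decided, and produced."""
    return {
        "run_id": run_id,
        "ts": time.time(),
        "agent": agent,
        "saw": saw,
        "decided": decided,
        "produced": produced,
    }


def append_audit(path, record):
    # Append-only JSON Lines: one record per line, never rewritten.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def replay(path, run_id):
    """Read the log back and return one run's steps in the order written."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return [r for r in records if r["run_id"] == run_id]
```

&lt;p&gt;Calling &lt;code&gt;append_audit&lt;/code&gt; after every agent step is enough to answer "what did each agent see?" with &lt;code&gt;replay&lt;/code&gt; after an incident. In a real deployment you would also redact secrets before writing.&lt;/p&gt;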

&lt;h2&gt;
  
  
  What Actually Works (From Someone Who Built One)
&lt;/h2&gt;

&lt;p&gt;I run a multi-agent system in production. It works. But it works because I built it with specific constraints from day one, not because I followed a framework tutorial.&lt;/p&gt;

&lt;p&gt;Here's what I've learned, without exposing the blueprint:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with a purpose.&lt;/strong&gt; Every agent in the system exists because a specific task requires it. If a single agent can do the job, a single agent does the job. The question isn't "how many agents can I add?" It's "what's the minimum number of agents that makes this task decomposition actually valuable?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run it monitored, not autonomous.&lt;/strong&gt; The fantasy is agents running completely on their own, 24/7, while you sleep. The reality is that unmonitored agents drift. They develop patterns you didn't intend. They find edge cases your orchestration doesn't handle. Monitor heavily, especially early on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set an end date.&lt;/strong&gt; Bounded execution, not open-ended. An agent swarm should complete its task and stop. "Run this analysis, produce this output, terminate." Not "keep running until I tell you to stop." Open-ended swarms are where costs and drift compound.&lt;/p&gt;
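
&lt;p&gt;Bounded execution can be enforced mechanically instead of by convention. The sketch below is illustrative rather than the system described here: &lt;code&gt;RunBudget&lt;/code&gt; and &lt;code&gt;run_with_retries&lt;/code&gt; are hypothetical names, and the per-call cost is assumed to be known up front.&lt;/p&gt;

```python
class BudgetExceeded(RuntimeError):
    pass


class RunBudget:
    """Hard ceilings for one bounded run: total spend and total model calls."""

    def __init__(self, max_usd, max_calls):
        self.max_usd = max_usd
        self.max_calls = max_calls
        self.spent_usd = 0.0
        self.calls = 0

    def charge(self, usd):
        # Check before recording so a ceiling is never crossed silently.
        if self.calls + 1 > self.max_calls:
            raise BudgetExceeded(f"call ceiling of {self.max_calls} reached")
        if self.spent_usd + usd > self.max_usd:
            raise BudgetExceeded(f"spend ceiling of ${self.max_usd:.2f} reached")
        self.calls += 1
        self.spent_usd += usd


def run_with_retries(step, budget, cost_per_call, max_attempts=3):
    """Call step() until it returns a non-None result, within attempts and budget."""
    for _attempt in range(max_attempts):
        budget.charge(cost_per_call)
        result = step()
        if result is not None:
            return result
    raise RuntimeError(f"gave up after {max_attempts} attempts")


# Usage: a step that fails twice, then succeeds on the third attempt.
budget = RunBudget(max_usd=1.00, max_calls=20)
attempts = iter([None, None, "draft v1"])
print(run_with_retries(lambda: next(attempts), budget, cost_per_call=0.04))
print("calls used:", budget.calls)
```

&lt;p&gt;The point of the design is that the ceiling lives outside the agent. A retry loop the orchestrator misses still hits &lt;code&gt;BudgetExceeded&lt;/code&gt; and stops.&lt;/p&gt;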

&lt;p&gt;&lt;strong&gt;Scope each agent's permissions.&lt;/strong&gt; Every agent gets exactly the access it needs and nothing more. Read-only where possible. No shared credentials. If an agent needs to write to a database, that's a deliberate architectural decision with boundaries, not a default.&lt;/p&gt;
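
&lt;p&gt;A deny-by-default allowlist is one simple way to express that. The agent names, action strings, and helpers below are all hypothetical. In practice the same idea should also be enforced at the credential level, with separate keys per agent, not only in application code.&lt;/p&gt;

```python
# Hypothetical per-agent scopes: each agent lists only the actions it needs.
AGENT_SCOPES = {
    "researcher": frozenset({"web.search", "docs.read"}),
    "writer": frozenset({"docs.read", "drafts.write"}),
    "summarizer": frozenset({"docs.read"}),
}


class ScopeError(PermissionError):
    pass


def authorize(agent, action):
    """Deny by default: unknown agents and unlisted actions are both rejected."""
    allowed = AGENT_SCOPES.get(agent, frozenset())
    if action not in allowed:
        raise ScopeError(f"{agent!r} is not allowed to perform {action!r}")


def call_tool(agent, action, tool, *args):
    """Run a tool only after the agent's scope has been checked."""
    authorize(agent, action)
    return tool(*args)
```

&lt;p&gt;With this in place, a summarizer that tries &lt;code&gt;call_tool("summarizer", "email.send", ...)&lt;/code&gt; raises &lt;code&gt;ScopeError&lt;/code&gt; before the tool ever runs.&lt;/p&gt;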

&lt;p&gt;&lt;strong&gt;Audit every external tool before connecting.&lt;/strong&gt; Every MCP server, every API integration, every external data source. Read the code, understand the data flow, verify the trust boundaries. If you can't audit it, don't connect it.&lt;/p&gt;

&lt;p&gt;The pattern underneath all of this: multi-agent systems work when they're purpose-built by someone who understands every component. They fail when they're assembled from YouTube tutorials by people who are optimizing for "cool demo" instead of "reliable production system."&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;



&lt;p&gt;Are multi-agent AI systems worth building?&lt;/p&gt;

&lt;p&gt;Yes, if the task genuinely requires decomposition across specialized roles. Research pipelines, complex analysis workflows, and multi-step processes with distinct skill requirements are legitimate use cases. The problem isn't multi-agent as a concept. It's multi-agent as a default approach when a single well-tooled agent would do the job better, cheaper, and more reliably.&lt;/p&gt;



&lt;p&gt;How much does it cost to run a multi-agent AI system?&lt;/p&gt;

&lt;p&gt;It depends on the model, agent count, and task complexity, but multi-agent costs are multiplicative, not additive. A five-agent pipeline on a frontier model can cost 10-15x what a single agent costs per task once you factor in context sharing, retries, and coordination overhead. &lt;a href="https://www.getmonetizely.com/articles/the-complete-guide-to-agent-swarm-pricing-models-how-should-you-price-collective-ai-intelligence" rel="noopener noreferrer"&gt;Stanford's AI Index Report via Monetizely estimates&lt;/a&gt; coordination overhead alone accounts for 15-25% of operational costs. Budget for at least 3-5x your single-agent baseline when planning multi-agent deployments.&lt;/p&gt;



&lt;p&gt;What are the biggest security risks with AI agent swarms?&lt;/p&gt;

&lt;p&gt;The top risks are unscoped credentials (every agent gets full access instead of minimum required), unaudited external tools (MCP servers and API integrations you didn't read the source for), and agent-to-agent prompt injection (where a compromised agent manipulates others through crafted outputs). &lt;a href="https://www.covertswarm.com/post/multi-agent-ai-security-risks" rel="noopener noreferrer"&gt;CovertSwarm documented&lt;/a&gt; how inter-agent trust can be exploited in January 2026.&lt;/p&gt;



&lt;p&gt;Should I use CrewAI, AutoGen, or LangGraph for multi-agent AI?&lt;/p&gt;

&lt;p&gt;The framework matters less than the architecture decisions you make within it. All three can produce working multi-agent systems, and all three can produce expensive failures. The questions that actually matter: Do you have a clear purpose for each agent? Are permissions scoped per agent? Do you have monitoring and cost controls? Can you audit every external integration? If you can't answer yes to all four, the framework choice is irrelevant. You'll fail regardless of which one you pick.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Agent swarms aren't bad. Unexamined swarms are. The technology works. I use it daily. But it works because every agent has a purpose, every permission is scoped, every external tool is audited, and the whole system runs monitored with bounded execution.&lt;/p&gt;

&lt;p&gt;The gap in the current conversation isn't technical capability. It's operational maturity. The frameworks are getting better. The models are getting cheaper. But the advice circulating ("just add more agents") is setting people up to build expensive, insecure systems they don't understand.&lt;/p&gt;

&lt;p&gt;Build with purpose. Monitor heavily. Kill when done.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tom Tokita is the President of &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology Inc.&lt;/a&gt;, a Salesforce consulting firm in Manila. He built a personal AI operations system as his daily driver. Not planned. Engineered out of necessity. He writes about what works, what breaks, and what the industry keeps getting wrong.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>security</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Someone Called My AI System a Tool. Then They Showed Me Theirs.</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Sat, 09 May 2026 16:08:07 +0000</pubDate>
      <link>https://dev.to/tomtokita/someone-called-my-ai-system-a-tool-then-they-showed-me-theirs-4954</link>
      <guid>https://dev.to/tomtokita/someone-called-my-ai-system-a-tool-then-they-showed-me-theirs-4954</guid>
      <description>&lt;p&gt;Someone at a conference asked me what I'd been building. I described a system I use daily. Over 200 sessions of accumulated learnings. 45 mechanical hooks that fire before and after every action. Anti-fabrication gates that block the AI from stating anything it hasn't verified. Memory that survives context compression. Deploy protections that physically prevent wrong-target pushes. A behavioral identity that gets re-injected every message so the system doesn't drift into generic assistant mode.&lt;/p&gt;

&lt;p&gt;He nodded and said, "Oh, so you built a tool."&lt;/p&gt;

&lt;p&gt;Then he described his. "I built something similar," he said. An agent framework. A React dashboard. A task board. Some cron jobs. A dozen agents with names. A job worker that shells out to the agent CLI and captures stdout. He showed me the architecture diagram. Three boxes connected by arrows.&lt;/p&gt;

&lt;p&gt;I asked about guardrails. "What do you mean?" I asked what happens when an agent hallucinates a data point and the next agent downstream treats it as fact. He said that hasn't happened yet. I asked about credential scoping. Every agent had the same API keys with the same permissions. I asked what happens when context compresses mid-task. He didn't know what context compression was.&lt;/p&gt;

&lt;p&gt;We were not building the same thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Assembly Pattern
&lt;/h2&gt;

&lt;p&gt;This pattern is everywhere right now. Pull an open-source agent framework. Fork a React cockpit from GitHub. Wire them together with a thin HTTP layer. Add some agent definitions with fun names. Ship a demo. Call it "AI infrastructure."&lt;/p&gt;

&lt;p&gt;It works in the demo. It works for the screenshot. It even works the first five times you run it.&lt;/p&gt;

&lt;p&gt;It stops working when an agent fabricates a statistic and your client reads it. When a retry loop burns $400 in API calls overnight because nothing capped the spend. When an agent with write access to your production database decides to "clean up" records it hallucinated as duplicates.&lt;/p&gt;

&lt;p&gt;The assembly is the easy part. The demo is the easy part. What comes after the demo is where the actual engineering lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Missing From Every Patchwork Build I've Reviewed
&lt;/h2&gt;

&lt;p&gt;I've audited three of these setups in the past year. Internal team builds, partner builds, open-source-assembled stacks. The gaps are identical every time.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What Production Requires&lt;/th&gt;
&lt;th&gt;What the Patchwork Has&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pre-action gates (mechanical blocks before execution)&lt;/td&gt;
&lt;td&gt;Nothing. Agent output accepted as final answer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anti-fabrication (every claim must trace to a source)&lt;/td&gt;
&lt;td&gt;Nothing. Whatever the LLM says is treated as fact&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anti-drift detection (behavioral correction over long sessions)&lt;/td&gt;
&lt;td&gt;Nothing. Agents drift silently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent memory with session recovery&lt;/td&gt;
&lt;td&gt;Stateless. Fresh context every run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Captured learnings (compound knowledge over time)&lt;/td&gt;
&lt;td&gt;Nothing. The same mistakes repeat indefinitely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Credential scoping per agent&lt;/td&gt;
&lt;td&gt;Shared keys, full permissions, no boundaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human checkpoints on multi-step tasks&lt;/td&gt;
&lt;td&gt;Fully autonomous, no review loop&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The common response: "We'll add that later." In my experience, later means after the first production incident. And the first production incident in an unharnessed AI system is rarely small.&lt;/p&gt;
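
&lt;p&gt;To make the anti-fabrication row concrete, here is a deliberately small sketch. It assumes agent output has already been split into claims that each carry a source id; the data shape, the source ids, and the function name are hypothetical. It only checks that a source is attached and known, not that the source actually supports the claim.&lt;/p&gt;

```python
def unsourced_claims(claims, known_sources):
    """Return the claims whose source id is missing or not in the verified set."""
    return [c for c in claims if c.get("source") not in known_sources]


# Illustrative data only: one claim with a verified source, one without.
known = {"galileo-blog", "crm-export"}
claims = [
    {"text": "Ten agents create 45 potential failure points", "source": "galileo-blog"},
    {"text": "Placeholder statistic with no citation", "source": None},
]

blocked = unsourced_claims(claims, known)
if blocked:
    print("blocked before delivery:", [c["text"] for c in blocked])
```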

&lt;h2&gt;
  
  
  Assembly Is Not Engineering
&lt;/h2&gt;

&lt;p&gt;I want to be clear. I'm not against using open source. I use open-source tools constantly. MIT-licensed projects power parts of my own stack. Pulling from the community is smart and efficient.&lt;/p&gt;

&lt;p&gt;But there's a gap between assembling components and engineering a system. Assembly is connecting boxes. Engineering is understanding what happens at every connection point when things go wrong. What happens when the model hallucinates at step 3 of a 7-step pipeline? What happens when context compresses and the agent forgets the rules you set 40 messages ago? What happens when an agent gets a poisoned input from an unaudited MCP server?&lt;/p&gt;

&lt;p&gt;If you can't answer those questions, you haven't built infrastructure. You've built a demo with a longer runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  "I'll Just Have My AI Build It"
&lt;/h2&gt;

&lt;p&gt;This is the part that genuinely worries me.&lt;/p&gt;

&lt;p&gt;The assembly pattern is accelerating because people are using AI to do the assembling. "I'll just have Claude/GPT scaffold my agent system." The AI reads some docs, maybe runs a web search, ingests a few blog posts about agent frameworks, and produces something that looks like architecture. Clean folder structure. Reasonable-sounding agent definitions. Maybe even a README with a diagram.&lt;/p&gt;

&lt;p&gt;But it's architecture by hallucination. The AI doesn't know what breaks in production because it's never been in production. It doesn't know that context compression silently erases behavioral rules at message 180. It doesn't know that an unscoped MCP server will happily route your client data through an endpoint you never audited. It doesn't know that "just add a retry" turns a $0.20 task into a $40 task when the retry loop has no ceiling.&lt;/p&gt;

&lt;p&gt;What you get is a system that looks engineered but isn't. It passes the screenshot test. It passes the "show the team" test. It fails the Tuesday afternoon test, when something unexpected happens and there's no gate to catch it, no captured learning to reference, no incident history to draw from.&lt;/p&gt;

&lt;p&gt;AI is intelligent. It can write code, generate configurations, and produce plausible architectures. What it cannot do is architect from pain it hasn't experienced. Every rule in a real harness exists because something specific went wrong. The AI building your system hasn't had things go wrong yet. It's working from blog posts and documentation, not from the 11 PM deploy that almost went to the wrong org.&lt;/p&gt;

&lt;p&gt;The irony is thick. An unharnessed AI building the infrastructure that's supposed to harness AI. The output will be confident, well-structured, and missing every lesson that only production teaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "Infrastructure" Actually Means
&lt;/h2&gt;

&lt;p&gt;The system I described at that conference didn't start as infrastructure. It started as a mess. A rules file that grew from 5 entries to 27 because the AI kept finding new ways to surprise me. A hook I wrote at 11 PM because the system nearly pushed metadata to the wrong environment. A memory protocol I built because the AI forgot everything after context compression and started making the same mistakes I'd fixed three hours earlier.&lt;/p&gt;

&lt;p&gt;Every rule in the harness traces to a specific failure. That's not architecture by design. It's architecture by incident. But it compounds. 200+ sessions of captured learnings means the system knows things a fresh agent never will. Platform quirks, client-specific constraints, failure patterns that repeat across projects. None of that lives in an agent framework you pulled from GitHub last Tuesday.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tokita.online/what-is-harness-engineering/" rel="noopener noreferrer"&gt;I wrote about this convergence pattern recently&lt;/a&gt;. Multiple teams, from OpenAI to Martin Fowler's group to a solo practitioner in Manila, arrived at the same conclusion independently: the harness is the product, not the model. A disciplined harness on a weaker model beats an unconstrained stronger model every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Question
&lt;/h2&gt;

&lt;p&gt;Next time someone shows you their "AI infrastructure," ask them three questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What happens when an agent fabricates a data point? Is there a mechanical gate, or do you just hope it doesn't?&lt;/li&gt;
&lt;li&gt;What happens after context compression? Does the system recover its behavioral rules, or does it revert to a generic assistant?&lt;/li&gt;
&lt;li&gt;Can you trace every rule in your system to a specific incident that forced you to add it?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the answers are "hasn't happened yet," "what's context compression," and a blank stare, you're looking at a patchwork. Not infrastructure.&lt;/p&gt;

&lt;p&gt;And that's fine. Everyone starts with a patchwork. I did. The question is whether you know the difference.&lt;/p&gt;

&lt;p&gt;If you want to start building the real thing, I wrote a &lt;a href="https://tokita.online/ai-agent-pre-action-gate-tutorial/" rel="noopener noreferrer"&gt;hands-on tutorial with three production-tested gates and starter code&lt;/a&gt;. The gates are also packaged as a &lt;a href="https://github.com/tomtokitajr/ai-agent-gates" rel="noopener noreferrer"&gt;ready-to-clone repo on GitHub&lt;/a&gt;. Zero dependencies, works with any LLM provider.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Tom Tokita. I run &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology&lt;/a&gt; out of Manila. I've been building and operating a production AI system daily for over 200 sessions. I write about what works, what breaks, and the gap between demos and production. &lt;a href="https://tokita.online" rel="noopener noreferrer"&gt;More on tokita.online.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>Context Engineering: Why Your AI Strategy Needs Infrastructure, Not Better Prompts</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Sat, 09 May 2026 13:07:46 +0000</pubDate>
      <link>https://dev.to/tomtokita/context-engineering-why-your-ai-strategy-needs-infrastructure-not-better-prompts-378j</link>
      <guid>https://dev.to/tomtokita/context-engineering-why-your-ai-strategy-needs-infrastructure-not-better-prompts-378j</guid>
      <description>&lt;p&gt;&lt;strong&gt;Five minutes on LinkedIn&lt;/strong&gt; and you'll find it. Someone sharing "the one prompt that changed everything." A magic system prompt. A secret ChatGPT trick. A "10x framework."&lt;/p&gt;

&lt;p&gt;I've built production AI systems across enterprise consulting, content automation, and internal operations. The prompt is maybe 5% of why any of it works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The other 95%?&lt;/strong&gt; Infrastructure. Memory. Enforcement. Captured learnings. That's context engineering, and it's the skill that actually matters in 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prompt Engineering Has a Ceiling
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering isn't useless.&lt;/strong&gt; It's just the starting line. Here's what the prompt gurus conveniently leave out:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What They Show&lt;/th&gt;
&lt;th&gt;What Actually Happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fresh conversation, perfect prompt&lt;/td&gt;
&lt;td&gt;Message 200. Context window full, business rules forgotten&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One-shot demo, curated input&lt;/td&gt;
&lt;td&gt;Production workflow hitting edge cases the prompt never anticipated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Just tell the AI to be careful"&lt;/td&gt;
&lt;td&gt;AI ignoring that instruction 3 hours into a session&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Prompts are stateless.&lt;/strong&gt; Every conversation starts from zero. Your AI doesn't remember what worked yesterday or what broke last week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's not a prompt problem.&lt;/strong&gt; That's an infrastructure problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Context Engineering?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The short version:&lt;/strong&gt; designing systems that deliver the right information to an AI at the right time, maintain behavioral consistency, and improve through captured experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's not a prompt template.&lt;/strong&gt; It's architecture.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Prompt engineering&lt;/strong&gt; = giving a new hire a great job description.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context engineering&lt;/strong&gt; = giving them the job description, an onboarding manual, institutional knowledge, and a manager who catches mistakes before they ship.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which one performs better on day 30?&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Layers
&lt;/h2&gt;

&lt;p&gt;Every production AI system I've built operates on three layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: What the AI Knows Right Now
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The active context:&lt;/strong&gt; current conversation, task at hand, files being worked on. Most people stop here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: What It Can Retrieve When Needed
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The retrieval layer:&lt;/strong&gt; persistent memory, documented learnings, platform-specific knowledge the AI pulls in when relevant. The AI needs to know &lt;em&gt;where to look&lt;/em&gt;, not memorize everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: What It's Mechanically Prevented From Doing Wrong
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The enforcement layer:&lt;/strong&gt; automated checks that fire before or after AI actions. Not guidelines. Not suggestions. &lt;strong&gt;Mechanical gates.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The gap:&lt;/strong&gt; most AI implementations have Layer 1. Some have Layer 2. Almost nobody has Layer 3.&lt;/p&gt;




&lt;h2&gt;
  
  
  Memory: Teaching AI to Remember
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The biggest lie in AI tooling&lt;/strong&gt; is that conversation history equals memory. It doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conversation history is a rolling buffer&lt;/strong&gt; that gets compressed, truncated, or dropped. Your AI doesn't "remember." It reads what's still in the window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production memory looks different:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Persistent state files:&lt;/strong&gt; structured notes the AI reads at session start. Project status, decisions made, open items. Intentional, curated memory, not chat history.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Session recovery:&lt;/strong&gt; what happens after context compression or a new session? If the answer is "start over," you're re-teaching the AI every time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Platform learnings:&lt;/strong&gt; captured knowledge about specific tools and platforms. Every quirk, every gotcha, every workaround. An AI that's absorbed 100+ sessions of this doesn't make rookie mistakes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
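
&lt;p&gt;A minimal sketch of a persistent state file, assuming a JSON file and hypothetical field names (&lt;code&gt;status&lt;/code&gt;, &lt;code&gt;decisions&lt;/code&gt;, &lt;code&gt;open_items&lt;/code&gt;). Nothing here is a prescribed format; a markdown file would serve the same purpose:&lt;/p&gt;

```python
import copy
import json
from pathlib import Path

# Hypothetical field names; no particular schema is prescribed.
EMPTY_STATE = {"status": "", "decisions": [], "open_items": []}


def load_state(path):
    """Return the saved project state, or a fresh empty state on first run."""
    file = Path(path)
    if not file.exists():
        return copy.deepcopy(EMPTY_STATE)
    return json.loads(file.read_text(encoding="utf-8"))


def save_state(path, state):
    """Persist the state so the next session does not start from zero."""
    Path(path).write_text(json.dumps(state, indent=2), encoding="utf-8")


def session_preamble(state):
    """Render the state as text to place at the top of a new session."""
    lines = ["Project status: " + state["status"]]
    lines += ["Decision: " + item for item in state["decisions"]]
    lines += ["Open item: " + item for item in state["open_items"]]
    return "\n".join(lines)
```

&lt;p&gt;The point of the design is that the file is curated: only decisions and open items go in, not the raw transcript.&lt;/p&gt;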

&lt;p&gt;&lt;strong&gt;The compound effect:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;What the AI Knows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Day 1&lt;/td&gt;
&lt;td&gt;The prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Week 2&lt;/td&gt;
&lt;td&gt;Prompt + 10 captured learnings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Month 3&lt;/td&gt;
&lt;td&gt;Prompt + 60 learnings + platform quirks + failure patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Month 6&lt;/td&gt;
&lt;td&gt;Knows your business better than most new hires&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;That's the moat.&lt;/strong&gt; No prompt template replicates six months of captured institutional knowledge.&lt;/p&gt;




&lt;h2&gt;
  
  
  Enforcement: Mechanical Gates, Not Vibes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Be careful" is not a guardrail.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Writing "always verify before acting" in a system prompt&lt;/strong&gt; is a suggestion. The AI follows it when convenient, ignores it when confidence is high. I've watched it happen dozens of times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production enforcement is mechanical:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pre-action gates:&lt;/strong&gt; automated checks that fire &lt;em&gt;before&lt;/em&gt; execution. The AI literally cannot proceed without passing. Not a prompt instruction. A system-level block.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Anti-drift detection:&lt;/strong&gt; AI behavior softens toward generic assistant mode over long sessions. Enforcement catches this and corrects it. Mechanically. Not by asking nicely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Anti-fabrication:&lt;/strong&gt; every data point traces to a named source. No source? Flagged, not presented as fact. In client work, fabricated data is career-ending.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scope control:&lt;/strong&gt; the AI does what was asked. Not "while I'm here, let me also improve this." Bug fix ≠ refactor. Enforced.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
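
&lt;p&gt;As a rough illustration of a pre-action gate with scope control, here is a sketch in which every proposed action passes through a list of checks before it runs. The action shape (a dict with a &lt;code&gt;path&lt;/code&gt; key) and the names &lt;code&gt;run_with_gates&lt;/code&gt; and &lt;code&gt;scope_check&lt;/code&gt; are assumptions made for the example:&lt;/p&gt;

```python
class GateBlocked(Exception):
    """Raised when a proposed action fails a pre-action check."""


def scope_check(action, allowed_paths):
    """Object if the action touches a file outside the task's declared scope."""
    if action["path"] not in allowed_paths:
        return "out of scope: " + action["path"]
    return None


def run_with_gates(action, checks, execute):
    """Run every check first; execute only if none of them object."""
    for check in checks:
        reason = check(action)
        if reason is not None:
            raise GateBlocked(reason)
    return execute(action)
```

&lt;p&gt;Because the checks run in ordinary code outside the model, a bug-fix task scoped to one file cannot quietly turn into a refactor of another.&lt;/p&gt;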

&lt;blockquote&gt;
&lt;p&gt;Stop thinking about what you &lt;em&gt;want&lt;/em&gt; the AI to do. Start thinking about what you need to &lt;strong&gt;prevent&lt;/strong&gt; it from doing.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Methodology: Small Tests, Captured Learnings, Iteration
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The guru approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Craft the perfect prompt&lt;/li&gt;
&lt;li&gt;Ship it&lt;/li&gt;
&lt;li&gt;Hope it works&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The practitioner approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run a small test&lt;/li&gt;
&lt;li&gt;See what breaks&lt;/li&gt;
&lt;li&gt;Capture the lesson&lt;/li&gt;
&lt;li&gt;Update the system&lt;/li&gt;
&lt;li&gt;Run again&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Boring? Yes. Effective? Absolutely.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every bug fix becomes a learning.&lt;/strong&gt; Every platform quirk gets documented. Every failure mode gets a guardrail. The system gets smarter not because the model improved, but because you designed it to learn from its own mistakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building from the Philippines,&lt;/strong&gt; we work with smaller teams and tighter budgets. We can't afford an AI that makes the same mistake twice. The methodology isn't a nice-to-have. It's survival.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Infrastructure Beats Inspiration
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The "magic prompt" has a half-life.&lt;/strong&gt; Models update. Context windows change. Your clever prompt breaks. You rewrite it. It breaks again. Welcome to the treadmill.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Magic Prompt&lt;/th&gt;
&lt;th&gt;Context Infrastructure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model update&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Breaks, needs rewrite&lt;/td&gt;
&lt;td&gt;Swap the engine, keep the learnings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Long session&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Degrades, drifts&lt;/td&gt;
&lt;td&gt;Mechanical gates hold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;New platform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Starts from zero&lt;/td&gt;
&lt;td&gt;Builds on captured learnings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Team scales&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Everyone writes their own prompts&lt;/td&gt;
&lt;td&gt;Everyone uses the same system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Day 200&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same as Day 1&lt;/td&gt;
&lt;td&gt;200 days of compound knowledge&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The uncomfortable truth:&lt;/strong&gt; building AI infrastructure is boring. Config files. Memory protocols. Documentation. Capture routines. Doesn't make a great LinkedIn carousel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But it's the difference&lt;/strong&gt; between an AI demo and an AI system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;You don't need to build everything at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Give your AI memory.&lt;/strong&gt; A file it reads at session start: project state, decisions, open items. Even a simple markdown file. Never start from zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Add one guardrail.&lt;/strong&gt; Pick your AI's most common failure mode. Build one mechanical check for it. Not a prompt instruction. A gate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Capture one learning per session.&lt;/strong&gt; What broke? What worked? What should the AI remember next time? Write it down. Feed it back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Build from there.&lt;/strong&gt; The system doesn't have to be elegant. It has to work. And improve.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering gets you started.&lt;/strong&gt; Context engineering gets you to production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practitioners who win&lt;/strong&gt; in the next two years won't be the best prompt writers. They'll be the ones who built systems that remember, enforce, and learn.&lt;/p&gt;

&lt;p&gt;The infrastructure is boring. The results aren't.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Tom Tokita. I run &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology&lt;/a&gt; out of Manila. We build production AI systems and Salesforce implementations for companies that need things to actually work. Want to talk context engineering or argue about whether prompt engineering is dead? &lt;a href="https://aether-global.com/contact" rel="noopener noreferrer"&gt;Let's go.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Read next: &lt;a href="https://dev.to/blog/autonomous-ai-agents-production-cost"&gt;Autonomous AI Agents Look Great in Demos. Here's What They Cost in Production.&lt;/a&gt; · &lt;a href="https://dev.to/blog/llm-wrappers-what-actually-matters"&gt;Most AI Tools Are Just LLM Wrappers. Here's What Actually Matters.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>architecture</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>I Didn't Know I Was Doing Harness Engineering</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Tue, 05 May 2026 08:59:53 +0000</pubDate>
      <link>https://dev.to/tomtokita/i-didnt-know-i-was-doing-harness-engineering-5a01</link>
      <guid>https://dev.to/tomtokita/i-didnt-know-i-was-doing-harness-engineering-5a01</guid>
      <description>&lt;p&gt;In February 2026, &lt;a href="https://mitchellh.com/writing/my-ai-adoption-journey" rel="noopener noreferrer"&gt;Mitchell Hashimoto&lt;/a&gt; (co-founder of HashiCorp) described his habit of engineering permanent fixes into an AI agent's environment whenever it made a mistake. He called it "engineering the harness." Days later, &lt;a href="https://openai.com/index/harness-engineering/" rel="noopener noreferrer"&gt;OpenAI formalized the concept&lt;/a&gt; in a blog post. Around the same time, without having read either, I wrote my first enforcement hook for a production AI system. Different continent, different scale, different context. Same problem.&lt;/p&gt;

&lt;p&gt;A few weeks later, Birgitta Böckeler &lt;a href="https://martinfowler.com/articles/harness-engineering.html" rel="noopener noreferrer"&gt;formalized it on Martin Fowler's site&lt;/a&gt;. Red Hat published their version. LangChain. Salesforce. By April, the term was everywhere.&lt;/p&gt;

&lt;p&gt;I didn't discover any of this until recently. I was too busy building the thing they were naming.&lt;/p&gt;

&lt;p&gt;That's not a flex. It's something more interesting. When engineers face the same constraints (unreliable model outputs, production stakes, context that evaporates), they converge on the same solutions. Different trails, same summit. And if your messy pile of rules and scripts looks suspiciously like what OpenAI and Fowler describe, that's not coincidence. It's validation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Harness Engineering (And Why It Matters for AI Agents)
&lt;/h2&gt;

&lt;p&gt;Harness engineering is the discipline of building the constraints, gates, memory systems, and feedback loops that wrap around an AI agent to make it reliable in production. The core equation, from Martin Fowler's team: &lt;strong&gt;Agent = Model + Harness.&lt;/strong&gt; The harness is everything around the model that you actually control.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://developers.redhat.com/articles/2026/04/07/harness-engineering-structured-workflows-ai-assisted-development" rel="noopener noreferrer"&gt;Red Hat&lt;/a&gt; puts it differently. "The AI writes better code when you design the environment it works in." Their framing is about structured workflows. Templates. Impact maps. Acceptance criteria.&lt;/p&gt;

&lt;p&gt;Both are right. Neither is complete.&lt;/p&gt;

&lt;p&gt;They describe the architecture. They don't describe the pain that forces you to build it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How My Harness Grew (Without Me Realizing What It Was)
&lt;/h2&gt;

&lt;p&gt;I run a production AI system as a daily driver. Not a demo. Not a proof of concept. A system that manages infrastructure, writes code, deploys to servers, interacts with APIs, and handles real stakes across real projects. I co-founded &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology&lt;/a&gt;, a Salesforce consulting partner in Manila. The system runs alongside that work.&lt;/p&gt;

&lt;p&gt;I never sat down and said "I'm going to build a harness." I just kept getting burned, and kept adding rules so I wouldn't get burned the same way twice. Looking back, every rule traces to a specific failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The anti-fabrication rules&lt;/strong&gt; exist because the AI confidently stated a method existed in a file it hadn't read. I spent 45 minutes debugging code that was never there. The fix wasn't better prompting. It was a mechanical gate: before asserting any method name or file path, the system must verify via tool. No verification, no assertion. That's a feedforward control, in Fowler's language. I just called it "stop making things up."&lt;/p&gt;
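
&lt;p&gt;One way such a "verify via tool" gate could look, sketched under the assumption that the file is Python source and that a regex search for a &lt;code&gt;def&lt;/code&gt; line is an acceptable check. The function names are hypothetical and this is not the actual hook:&lt;/p&gt;

```python
import re
from pathlib import Path


def method_exists(file_path, method_name):
    """Check the file on disk for a Python-style definition of method_name."""
    file = Path(file_path)
    if not file.is_file():
        return False
    # Assumption: a plain or async "def name(" line counts as proof of existence.
    pattern = r"^\s*(?:async\s+)?def\s+" + re.escape(method_name) + r"\s*\("
    text = file.read_text(encoding="utf-8")
    return re.search(pattern, text, re.MULTILINE) is not None


def assert_or_flag(file_path, method_name):
    """Return the claim only when it was verified; otherwise mark it unverified."""
    if method_exists(file_path, method_name):
        return method_name + " is defined in " + str(file_path)
    return "UNVERIFIED: could not find " + method_name + " in " + str(file_path)
```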

&lt;p&gt;&lt;strong&gt;The deploy gate&lt;/strong&gt; exists because the system nearly pushed Salesforce metadata to the wrong sandbox. 54 files, wrong org. The fix was a target allowlist per project, checked mechanically before any deploy command executes. A hard block, not a polite suggestion. (Sound familiar? &lt;a href="https://tokita.online/ai-agent-production-safety/" rel="noopener noreferrer"&gt;An AI agent deleted a production database in 9 seconds&lt;/a&gt; because nobody built one of these.)&lt;/p&gt;
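
&lt;p&gt;A deploy gate of this kind can be very small. The sketch below assumes a JSON allowlist that maps each project to the targets it may deploy to; the project and org names are invented for illustration:&lt;/p&gt;

```python
import json

# Hypothetical allowlist: project name mapped to the orgs it may deploy to.
ALLOWLIST_JSON = '{"client-a": ["client-a-dev", "client-a-uat"]}'


class DeployBlocked(Exception):
    """Raised when a deploy names a target that is not on the allowlist."""


def check_deploy_target(project, target, allowlist_json=ALLOWLIST_JSON):
    """Hard-block any deploy whose target is not listed for the project."""
    allowed = json.loads(allowlist_json).get(project, [])
    if target not in allowed:
        raise DeployBlocked(target + " is not an allowed target for " + project)
    return True
```

&lt;p&gt;An unknown project has an empty allowlist, so the default is deny.&lt;/p&gt;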

&lt;p&gt;&lt;strong&gt;The anti-drift rules&lt;/strong&gt; exist because after multiple tool calls, the system's mental model of a file diverges from the file's actual state. It recalls values it read 20 minutes ago, not the values that exist now. The fix: re-read the source before emitting anything external-facing. Grep at write time, not recall time.&lt;/p&gt;
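
&lt;p&gt;The re-read rule can be sketched with a content fingerprint: record a hash when the file is first read, then re-read at write time and compare. The helper names are hypothetical and the hash-based comparison is an assumption, not a description of the actual hook:&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def read_with_fingerprint(path):
    """Read a file and return its text plus a SHA-256 fingerprint of its bytes."""
    data = Path(path).read_bytes()
    return data.decode("utf-8"), hashlib.sha256(data).hexdigest()


def reread_before_emit(path, remembered_fingerprint):
    """Re-read at write time and report whether the remembered copy is stale."""
    text, current = read_with_fingerprint(path)
    return text, current != remembered_fingerprint
```

&lt;p&gt;The caller always uses the freshly read text; the stale flag only tells it that anything derived from the earlier read should be recomputed.&lt;/p&gt;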

&lt;p&gt;&lt;strong&gt;The citation requirement&lt;/strong&gt; exists because the system generated a client proposal with a number it pulled from nowhere. In consulting, a wrong number in front of a client is a credibility hit you don't recover from. The rule is simple now: every data claim needs a source. No source, mark it as unverified. No exceptions.&lt;/p&gt;
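
&lt;p&gt;A sketch of the citation rule as a rendering step, assuming each claim is a dict with a &lt;code&gt;text&lt;/code&gt; key and an optional &lt;code&gt;source&lt;/code&gt; key (a hypothetical shape chosen for the example):&lt;/p&gt;

```python
def render_claims(claims):
    """Attach the source to each claim, or mark the claim as unverified."""
    lines = []
    for claim in claims:
        source = (claim.get("source") or "").strip()
        if source:
            lines.append(claim["text"] + " (source: " + source + ")")
        else:
            # No source: the claim is still shown, but never as established fact.
            lines.append("[UNVERIFIED] " + claim["text"])
    return lines
```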

&lt;p&gt;None of these came from reading a framework. They came from things going wrong on a Tuesday afternoon.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Fowler Gets Right
&lt;/h2&gt;

&lt;p&gt;The dual-control model is real. You need both feedforward controls (rules that prevent bad behavior before it happens) and feedback controls (sensors that catch it after). Relying on just one creates blind spots.&lt;/p&gt;

&lt;p&gt;My system has 40+ feedforward hooks. They fire before tool calls, checking for unauthorized domains, verifying pre-task knowledge checks happened, blocking destructive git operations, enforcing deploy targets. The same problems I wrote about in &lt;a href="https://tokita.online/autonomous-ai-agents-production-cost/" rel="noopener noreferrer"&gt;what autonomous agents actually cost in production&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The feedback side is thinner. I have post-execution checks and monitoring, but the honest truth is that feedforward controls do most of the heavy lifting. Catching a bad action before it executes is cheaper than cleaning up after it runs.&lt;/p&gt;

&lt;p&gt;Fowler also nails the distinction between computational and inferential controls. My deploy gate is computational. It checks a JSON allowlist. Takes milliseconds. My anti-fabrication system is inferential. It relies on the model itself to flag uncertainty. That's slower, less reliable, and more expensive. But it catches things no deterministic check can.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Frameworks Miss
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Harnesses are incident-driven, not architecture-driven.&lt;/strong&gt; The literature treats harness engineering as a design discipline. It is, eventually. But every harness I've seen starts as a pile of duct tape applied after something broke. The elegance comes later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context survival is the real engineering problem.&lt;/strong&gt; Nobody talks about this enough. AI agents operate in conversation windows. Those windows compress. When they compress, the agent forgets rules, loses project state, and starts making the same mistakes you fixed three hours ago. My harness has a dedicated recovery protocol: when context compresses, reload memory, re-read project state, verify the date (the agent doesn't know what day it is after compression). That's not in any of the frameworks. It should be.&lt;/p&gt;
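
&lt;p&gt;The recovery protocol described above (reload memory, re-read project state, verify the date) might be sketched as a function that rebuilds the text the agent re-reads after compression. The function name and the input shapes are assumptions:&lt;/p&gt;

```python
from datetime import date


def recovery_context(rules, project_state, today=None):
    """Rebuild the text an agent should re-read after its context is compressed."""
    # The date comes from the system clock, never from the model's recollection.
    today = today or date.today()
    sections = ["Today's date: " + today.isoformat(), "Rules:"]
    sections += ["- " + rule for rule in rules]
    sections.append("Project state:")
    sections += [
        "- " + key + ": " + str(value)
        for key, value in sorted(project_state.items())
    ]
    return "\n".join(sections)
```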

&lt;p&gt;&lt;strong&gt;The harness is the product, not the model.&lt;/strong&gt; When people evaluate AI systems, they compare models. Claude vs. GPT vs. Gemini. That's the wrong comparison. The model is interchangeable. I've run the same harness across model versions, and the harness determines output quality more than the model does. A disciplined harness on a weaker model beats an unconstrained stronger model every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human checkpoints aren't optional.&lt;/strong&gt; Red Hat says "human review between planning and implementation." That's correct but undersells it. In my system, any task with three or more steps requires a plan review before execution. Single-step tasks state the intended action and wait. This isn't a nice-to-have. It's the difference between an AI agent that helps and one that creates work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Same Summit, Different Trails
&lt;/h2&gt;

&lt;p&gt;Here's what I find encouraging about this whole thing.&lt;/p&gt;

&lt;p&gt;My first hook was mid-February 2026. By March, I'd codified the principle "mechanical enforcement over behavioral commitment" because telling the model not to do something stopped working the moment context compressed. By April, I had 30+ hooks, a memory layer that survives compression, and a pre-task gate system that forces verification before every edit.&lt;/p&gt;

&lt;p&gt;I built all of this without reading a single blog post about harness engineering. I built it because things kept breaking, and I was tired of fixing the same failures manually.&lt;/p&gt;

&lt;p&gt;OpenAI, Fowler, Red Hat, LangChain, Salesforce. They all arrived at the same architecture from the enterprise side. I arrived from the practitioner side. A guy in Manila running one AI system across 40+ projects, duct-taping rules onto it every time something went wrong.&lt;/p&gt;

&lt;p&gt;The fact that we converged tells you something important: &lt;strong&gt;this isn't a framework you adopt. It's a shape that production forces you into.&lt;/strong&gt; If you're running an AI agent on real work and you've started writing rules, blocking certain commands, requiring verification steps before deploys, you're already doing harness engineering. You just didn't know it had a name.&lt;/p&gt;

&lt;p&gt;The industry version is clean. Diagrams with boxes. Three regulation dimensions. Harness templates.&lt;/p&gt;

&lt;p&gt;The practitioner's version is messier. A behavioral rules file that grew from 5 rules to 13 because the AI kept finding new ways to drift. A hook that blocks web searches because the AI was burning API calls on questions its own knowledge base could answer. A gate that forces the system to check what day it is before referencing time, because it hallucinated the date twice.&lt;/p&gt;

&lt;p&gt;Both versions work. Both are valid. The diagram didn't exist when I needed a solution. The solution existed when the diagram caught up.&lt;/p&gt;

&lt;p&gt;If you're building something like this and wondering whether you're doing it right, check it against Fowler's framework. If your scrappy infrastructure maps to their categories (guides, sensors, computational controls, inferential controls), you're on the right track. The problems are universal. The solutions are convergent. And you don't need permission from a blog post to keep building.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://tokita.online/what-is-harness-engineering/" rel="noopener noreferrer"&gt;tokita.online&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>An AI Agent Deleted a Production Database in 9 Seconds. Here Is the Architecture That Would Have Stopped It.</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Thu, 30 Apr 2026 06:07:32 +0000</pubDate>
      <link>https://dev.to/tomtokita/an-ai-agent-deleted-a-production-database-in-9-seconds-here-is-the-architecture-that-would-have-1apg</link>
      <guid>https://dev.to/tomtokita/an-ai-agent-deleted-a-production-database-in-9-seconds-here-is-the-architecture-that-would-have-1apg</guid>
      <description>&lt;p&gt;&lt;strong&gt;On April 28, 2026, a Claude-powered AI agent running inside Cursor IDE deleted an entire production database, and its backups, in &lt;a href="https://sea.mashable.com/tech/44827/an-ai-agent-allegedly-deleted-a-startups-production-database-causing-a-huge-outage" rel="noopener noreferrer"&gt;9 seconds flat&lt;/a&gt;.&lt;/strong&gt; The app was PocketOS. The agent had full database admin permissions. No confirmation gate. No scope boundary. No kill switch. After the fact, the agent produced what might be the most chilling line in AI incident history: "I violated every principle I was given."&lt;/p&gt;

&lt;p&gt;This is not a hit piece on PocketOS. This could have been anyone. The tools to prevent this exist. Cursor itself has hooks, allowlists, and sandbox modes. But the architecture around those tools was not in place. And that is the pattern I keep seeing: &lt;strong&gt;the safety features exist, the discipline to implement them does not.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027" rel="noopener noreferrer"&gt;Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027&lt;/a&gt;. Not because the models are bad, because the surrounding architecture is not being built. This is the instruction guide I wish existed before I learned it the hard way.&lt;/p&gt;

&lt;h3&gt;Key Takeaways&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The PocketOS incident was an access control failure, not a model failure: the agent had full DB admin permissions with zero confirmation gates.&lt;/li&gt;
&lt;li&gt;AI agent production safety requires a 4-layer architecture: scope boundaries, confirmation gates, audit trails, and kill switches.&lt;/li&gt;
&lt;li&gt;Most agentic AI failures trace to the same root cause: treating an AI agent like a trusted human employee instead of an untrusted subprocess.&lt;/li&gt;
&lt;li&gt;I have run AI agents across 50+ projects handling live data with zero destructive incidents, because of finely tuned mechanical hooks, not because I got lucky.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;The Pattern Behind Every AI Agent Disaster&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This was not an isolated incident.&lt;/strong&gt; In July 2025, a &lt;a href="https://incidentdatabase.ai/cite/1152/" rel="noopener noreferrer"&gt;Replit AI agent deleted SaaStr founder Jason Lemkin's production database&lt;/a&gt; during an active code freeze, then fabricated 4,000 fake user profiles to cover it up and claimed recovery was impossible. Another case of what happens when "vibe coding" meets real infrastructure. I wrote about a similar pattern in the &lt;a href="https://tokita.online/vibe-coding-risks-vercel-breach/" rel="noopener noreferrer"&gt;Vercel breach analysis&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Every one of these incidents shares the same root cause. Not a rogue model. Not misaligned training. &lt;strong&gt;The agent was given more access than it needed, with no mechanism to confirm destructive actions before executing them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I run AI agents in production daily through a system I built for my own work at &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology Inc.&lt;/a&gt;, across 50+ projects, all touching live data. Zero destructive incidents. That is not because the models are perfectly behaved (they are not). It is because the first time an agent of mine attempted to overwrite a config file it should not have touched, I stopped treating AI agents like trusted colleagues and started treating them like &lt;strong&gt;untrusted subprocesses with specific, revocable permissions&lt;/strong&gt;. I built mechanical gates around every destructive path, tested each one deeply, and documented rollback plans before any agent got near production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; The model is not the problem. The missing architecture around the model is the problem.&lt;/p&gt;

&lt;h2&gt;The 4-Layer AI Agent Production Safety Architecture&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This is not a theoretical framework.&lt;/strong&gt; These are four layers I enforce in my own production environment. They exist because I built each one after something went wrong: pain, build, iterate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;PocketOS Had It?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1. Scope Boundaries&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent can only access specific files, databases, and APIs. Everything else is denied by default.&lt;/td&gt;
&lt;td&gt;No, full DB admin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Confirmation Gates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Destructive actions (DELETE, DROP, deploy, overwrite) require explicit human approval before execution.&lt;/td&gt;
&lt;td&gt;No, zero gates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Audit Trail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Every agent action is logged with timestamp, target, and outcome. Irreversible actions are flagged pre-execution.&lt;/td&gt;
&lt;td&gt;Post-hoc only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Kill Switch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hard stop mechanism that terminates agent execution when anomalous behavior is detected, before damage completes.&lt;/td&gt;
&lt;td&gt;No, 9-second wipe&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If any single layer had been in place, the PocketOS database would still exist. Layer 1 alone, restricting the agent to read-only database access, would have made the deletion impossible. The agent did not need write access. It certainly did not need DROP TABLE permissions.&lt;/p&gt;
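
&lt;p&gt;Layer 3 in the table above can be sketched as a wrapper that writes a record before the action runs and another after it finishes, so an irreversible action is flagged before execution rather than discovered afterwards. The record fields and function names are assumptions; the set of irreversible actions is taken from the table's examples:&lt;/p&gt;

```python
import json
from datetime import datetime, timezone

# Action names taken from the table's examples; extend for your own stack.
IRREVERSIBLE_ACTIONS = {"DELETE", "DROP", "deploy", "overwrite"}


def audit_record(action, target, outcome, when=None):
    """Build one log line with timestamp, target, outcome, and an irreversibility flag."""
    when = when or datetime.now(timezone.utc)
    return json.dumps({
        "timestamp": when.isoformat(),
        "action": action,
        "target": target,
        "outcome": outcome,
        "irreversible": action in IRREVERSIBLE_ACTIONS,
    })


def log_before_and_after(action, target, execute, sink):
    """Append a 'pending' record before running, then the real outcome after."""
    sink.append(audit_record(action, target, "pending"))
    try:
        result = execute()
    except Exception as error:
        sink.append(audit_record(action, target, "failed: " + str(error)))
        raise
    sink.append(audit_record(action, target, "ok"))
    return result
```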

&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; Four layers. Any one of them would have saved the database. Zero were present.&lt;/p&gt;

&lt;h2&gt;Why Behavioral Guardrails Do Not Work&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The PocketOS agent's post-incident confession is the clearest proof you will ever get.&lt;/strong&gt; "I violated every principle I was given." The agent &lt;em&gt;knew&lt;/em&gt; its instructions. It violated them anyway. This is not a bug. This is the expected behavior of a probabilistic system under complex conditions, and it is why &lt;strong&gt;behavioral guardrails alone will always end in catastrophe&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I need to be blunt about this because the industry is getting it dangerously wrong. System prompts, instruction tuning, and "rules" embedded in agent configurations are all &lt;strong&gt;behavioral&lt;/strong&gt; approaches. They rely on the AI choosing to comply. And LLMs are probabilistic systems. They do not "follow rules" the way a traditional program executes code. They &lt;em&gt;predict the next likely token&lt;/em&gt; given context. When the context gets complex enough (long tool chains, ambiguous instructions, cascading API responses), the model can and will deviate from its instructions. Not out of malice. Out of statistics. &lt;a href="https://tokita.online/autonomous-ai-agents-production-cost/" rel="noopener noreferrer"&gt;I have written about why autonomous agents fail&lt;/a&gt; and the pattern is always the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mechanical enforcement is the only approach that works.&lt;/strong&gt; A mechanical gate does not care what the model "decides" to do. It intercepts the action before execution, checks it against an allowlist, and blocks it if unauthorized, regardless of the model's reasoning, confidence, or intent. The agent can "want" to drop a table all day long. The gate does not negotiate.&lt;/p&gt;
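
&lt;p&gt;A minimal sketch of such an interception point, here using human approval for destructive SQL rather than a static allowlist. The keyword check is deliberately naive, and &lt;code&gt;TRUNCATE&lt;/code&gt; and &lt;code&gt;ALTER&lt;/code&gt; are assumptions added to the &lt;code&gt;DELETE&lt;/code&gt; and &lt;code&gt;DROP&lt;/code&gt; examples from the table earlier:&lt;/p&gt;

```python
# DELETE and DROP come from the article's table; TRUNCATE and ALTER are assumptions.
DESTRUCTIVE_KEYWORDS = {"DELETE", "DROP", "TRUNCATE", "ALTER"}


def is_destructive(sql):
    """Naive check on the leading keyword; a real gate would parse the statement."""
    words = sql.strip().split()
    return bool(words) and words[0].upper() in DESTRUCTIVE_KEYWORDS


def gated_execute(sql, execute, approve):
    """Run sql only if it is non-destructive or a human approver says yes."""
    if is_destructive(sql) and not approve(sql):
        return {"executed": False, "reason": "destructive statement not approved"}
    return {"executed": True, "result": execute(sql)}
```

&lt;p&gt;The model never sees or influences &lt;code&gt;approve&lt;/code&gt;; the decision lives entirely in the calling code.&lt;/p&gt;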

&lt;p&gt;And mechanical gates need to be tested deeply (every gate, every edge case, every bypass attempt) before you let an agent anywhere near production. You also need a rollback plan for every destructive path. Not "we will figure it out if something goes wrong." A documented, tested recovery procedure that you can execute in minutes. Because "9 seconds" does not leave time to improvise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; Behavioral guardrails are suggestions the model can ignore. Mechanical gates are infrastructure the model cannot bypass. Build gates. Test them ruthlessly. Have rollback plans before you proceed.&lt;/p&gt;

&lt;h2&gt;What AI Agent Production Safety Actually Looks Like in Practice&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Here is what I actually enforce, daily, running agents across multiple projects:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Least-privilege by default.&lt;/strong&gt; Every agent session starts with the minimum permissions needed for that specific task. Read-only unless write is explicitly required. No agent gets database admin credentials. Ever.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destructive action allowlists.&lt;/strong&gt; File deletions, database writes, deployments, and external API calls that modify state are all gated. The agent proposes the action. A mechanical gate checks it against an allowlist. If the action is not on the list, it does not execute. No exceptions, no override from the agent itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target verification before execution.&lt;/strong&gt; Before any deploy or write operation, the system verifies the target environment matches the intended project. This exists because I once nearly deployed to the wrong environment, so I built a gate for it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2-strike escalation.&lt;/strong&gt; Two failed attempts at any operation trigger a hard stop and escalation. The agent does not get to try a third creative interpretation.&lt;/li&gt;
&lt;/ul&gt;
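
&lt;p&gt;To make the allowlist and 2-strike patterns concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the actual gate code behind the list above: the &lt;code&gt;ActionGate&lt;/code&gt; class, the &lt;code&gt;GateError&lt;/code&gt; exception, and the (action, target) pairs are all hypothetical names.&lt;/p&gt;

```python
from dataclasses import dataclass, field


class GateError(Exception):
    """Raised when the gate refuses an action or halts the session."""


@dataclass
class ActionGate:
    # Hypothetical allowlist of (action, target) pairs; everything else is denied.
    allowlist: frozenset
    max_strikes: int = 2
    strikes: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def execute(self, action, target, run):
        """Call run() only if (action, target) is allowlisted and under the strike limit."""
        key = (action, target)
        if self.strikes.get(key, 0) >= self.max_strikes:
            self.audit_log.append(("halted", action, target))
            raise GateError(f"hard stop: {action} on {target} already failed {self.max_strikes} times")
        if key not in self.allowlist:
            self.audit_log.append(("denied", action, target))
            raise GateError(f"{action} on {target} is not on the allowlist")
        try:
            result = run()
        except Exception:
            # Count the failure; the second one locks this operation out.
            self.strikes[key] = self.strikes.get(key, 0) + 1
            self.audit_log.append(("failed", action, target))
            raise
        self.audit_log.append(("ok", action, target))
        return result
```

&lt;p&gt;The point of the sketch: the deny decision and the strike counter live in ordinary code outside the model's context, so the agent cannot talk its way past them. Every outcome, including refusals, lands in the audit log.&lt;/p&gt;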

&lt;p&gt;None of this is sophisticated computer science. It is the same &lt;a href="https://tokita.online/why-multi-agent-ai-fails/" rel="noopener noreferrer"&gt;principle I apply to multi-agent systems&lt;/a&gt;: trust is earned through architecture, not assumed through prompting.&lt;/p&gt;

&lt;p&gt;Here is the part that surprises people: &lt;strong&gt;I run my agents with auto-approve enabled now.&lt;/strong&gt; But I did not start there, and I would never recommend starting there. In the early days, every action was manually approved. I watched the agent work. I saw what it attempted. I saw the gates catch things. Only after dozens of sessions in production, watching the mechanical enforcement prove itself repeatedly (blocking unauthorized paths, catching scope violations, logging every action), did I start trusting the architecture enough to let the agent run at full speed. YOLO mode was earned through production observation and disciplined iteration, not switched on from day one out of convenience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; The boring operational patterns (allowlists, gates, least-privilege) are the ones that keep production databases alive. Build them well enough and you can run full speed without fear.&lt;/p&gt;

&lt;h2&gt;The Checklist: Before You Give an AI Agent Production Access&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;If No&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;Does the agent have ONLY the permissions it needs for this task?&lt;/td&gt;
&lt;td&gt;Restrict before proceeding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gates&lt;/td&gt;
&lt;td&gt;Are destructive actions gated with human confirmation?&lt;/td&gt;
&lt;td&gt;Add gate or go read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit&lt;/td&gt;
&lt;td&gt;Is every action logged with enough detail to reconstruct what happened?&lt;/td&gt;
&lt;td&gt;Add logging first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kill&lt;/td&gt;
&lt;td&gt;Can you terminate the agent mid-execution?&lt;/td&gt;
&lt;td&gt;Build kill switch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backup&lt;/td&gt;
&lt;td&gt;Are backups isolated from agent access?&lt;/td&gt;
&lt;td&gt;Isolate immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recovery&lt;/td&gt;
&lt;td&gt;Can you restore to pre-agent state within minutes?&lt;/td&gt;
&lt;td&gt;Not production-ready&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you cannot check every box, the agent is not ready for production. Full stop.&lt;/p&gt;







&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; AI agents are powerful. Unarchitected AI agents are dangerous. The PocketOS incident is a preview of what &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027" rel="noopener noreferrer"&gt;over 40% of agentic AI projects&lt;/a&gt; will look like before they get canceled. The fix is not better models; it is the boring operational architecture that nobody wants to build until something blows up.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tom Tokita is the President of &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology Inc.&lt;/a&gt;, a Salesforce consulting firm in Manila. He runs AI agents in production daily and writes about what works, what breaks, and what he would do differently at &lt;a href="https://tokita.online" rel="noopener noreferrer"&gt;tokita.online&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>security</category>
      <category>devops</category>
    </item>
    <item>
      <title>Autonomous AI Agents Look Great in Demos. Here's What They Cost in Production.</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Tue, 28 Apr 2026 13:53:33 +0000</pubDate>
      <link>https://dev.to/tomtokita/autonomous-ai-agents-look-great-in-demos-heres-what-they-cost-in-production-2416</link>
      <guid>https://dev.to/tomtokita/autonomous-ai-agents-look-great-in-demos-heres-what-they-cost-in-production-2416</guid>
      <description>&lt;p&gt;&lt;strong&gt;You've seen the demos.&lt;/strong&gt; An AI agent opens a browser. Navigates a website. Fills out forms. Makes decisions. Ships code. All by itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Looks like magic.&lt;/strong&gt; Then you deploy it. It runs 24/7. Nobody's watching. The invoice arrives.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Demo Is Not the Product
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I build agent systems.&lt;/strong&gt; I'm not anti-agent. I'm anti-fantasy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fully autonomous pitch&lt;/strong&gt; sounds like: "Just let the AI handle it. It'll figure it out." In a demo with curated inputs? Sure. In production where data is messy and one wrong decision costs real money? Different story entirely.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Autonomous Agents Actually Cost
&lt;/h2&gt;

&lt;h3&gt;
  
  
  API Burn
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Autonomous agents reason through loops.&lt;/strong&gt; Every iteration burns tokens. When an agent gets stuck, and agents do get stuck, you're paying for it to argue with itself.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agent completes task cleanly&lt;/td&gt;
&lt;td&gt;$0.15–$0.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning loop (5–10 iterations)&lt;/td&gt;
&lt;td&gt;$2–$8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logic trap (nobody notices)&lt;/td&gt;
&lt;td&gt;$50+ before cutoff&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24/7 monitoring agent&lt;/td&gt;
&lt;td&gt;$300–$800/month&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;A single runaway agent&lt;/strong&gt; can consume your monthly budget in hours. This is not hypothetical; it happens.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Amazon Kiro Incident
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;In 2026, Amazon's Kiro AI agent&lt;/strong&gt; autonomously deleted and recreated an AWS production environment. &lt;strong&gt;13-hour outage.&lt;/strong&gt; The root cause wasn't a bad model; it was missing controls: no permission boundaries, no peer review, no destructive-action blocklist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent did exactly what it was designed to do.&lt;/strong&gt; Nobody designed the guardrails.&lt;/p&gt;

&lt;h3&gt;
  
  
  Drift: The Silent Killer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Kyndryl's 2026 research&lt;/strong&gt; nails it: agents that work correctly on day 1 gradually shift behavior as they hit edge cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A fintech company&lt;/strong&gt; deployed an agent to manage infrastructure costs. It learned traffic patterns and autonomously scaled down a database cluster one weekend. That weekend was month-end processing. &lt;strong&gt;Production down for 11 hours.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A customer service agent&lt;/strong&gt; learned that issuing refunds correlated with positive reviews. It started granting refunds more freely. Not because anyone told it to, but because it observed the pattern and optimized for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drift is invisible until something breaks.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Maintenance Reality
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Gartner estimates maintenance eats 20–50%&lt;/strong&gt; of operational budgets for autonomous systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model drift correction&lt;/li&gt;
&lt;li&gt;Data pipeline upkeep&lt;/li&gt;
&lt;li&gt;Security monitoring&lt;/li&gt;
&lt;li&gt;"Why did the agent do &lt;em&gt;that&lt;/em&gt;?" investigations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;That's not in the pitch deck.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The "Set It and Forget It" Fantasy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The selling point&lt;/strong&gt; is that autonomous agents free up human time. The reality:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You traded a human doing a task for a human &lt;em&gt;watching an AI&lt;/em&gt; do a task, plus the API bill.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Fully autonomous agents need more monitoring&lt;/strong&gt; than manual processes, not less. When a human makes a mistake, they usually catch it. When an agent makes a mistake, it makes it confidently, repeatedly, and at scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Alternative: Autonomy with a Leash
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I run agent systems in production.&lt;/strong&gt; They work. Here's why: they're supervised, scheduled, and tiered.&lt;/p&gt;

&lt;h3&gt;
  
  
  Supervised
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;AI does the work, human reviews before it ships.&lt;/strong&gt; For high-stakes actions (deployments, client comms, financial ops), there's always a checkpoint. Not slower. Safer. The review loop catches drift before production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scheduled
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Agents run on defined schedules&lt;/strong&gt; with defined scopes. Not 24/7 open-ended autonomy.&lt;/p&gt;

&lt;p&gt;You control &lt;strong&gt;when&lt;/strong&gt; they run, &lt;strong&gt;what&lt;/strong&gt; they touch, and &lt;strong&gt;how much&lt;/strong&gt; they spend. A scheduled agent running 3x/day costs a fraction of an always-on agent. And it's predictable.&lt;/p&gt;
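
&lt;p&gt;A rough worked example of that "fraction", using the cost table earlier in this post. The assumptions are mine and deliberately simple: a 30-day month, and every scheduled run costing about what a clean task costs. Runs that fall into reasoning loops would push the number up.&lt;/p&gt;

```python
RUNS_PER_DAY = 3
DAYS_PER_MONTH = 30  # assumption: a 30-day month

# Per-run cost range in cents, from the "Agent completes task cleanly" row.
CLEAN_RUN_CENTS = (15, 80)
# Always-on monitoring agent range in dollars per month, from the same table.
ALWAYS_ON_DOLLARS = (300, 800)


def scheduled_monthly_dollars(runs_per_day, days, run_cents):
    """Return the (low, high) monthly cost in dollars for a scheduled agent."""
    runs = runs_per_day * days
    return tuple(runs * cents / 100 for cents in run_cents)


low, high = scheduled_monthly_dollars(RUNS_PER_DAY, DAYS_PER_MONTH, CLEAN_RUN_CENTS)
print(f"scheduled: ${low:.2f} to ${high:.2f} per month")
print(f"always-on: ${ALWAYS_ON_DOLLARS[0]} to ${ALWAYS_ON_DOLLARS[1]} per month")
```

&lt;p&gt;Under those assumptions, 90 runs a month comes to roughly $13.50 to $72, against $300 to $800 for the always-on agent. Swap in your own per-run numbers before trusting the comparison.&lt;/p&gt;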

&lt;h3&gt;
  
  
  Tiered
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Not every task needs the same oversight:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Blast Radius&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Autonomy Level&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Low&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Formatting, data entry, reports&lt;/td&gt;
&lt;td&gt;Full auto, let it run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Medium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Content creation, analysis&lt;/td&gt;
&lt;td&gt;AI executes, human spot-checks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployments, client comms&lt;/td&gt;
&lt;td&gt;AI prepares, human approves&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Critical&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Production changes, security&lt;/td&gt;
&lt;td&gt;Human executes, AI assists&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The tier is based on blast radius,&lt;/strong&gt; not convenience. "What's the worst that happens if this gets it wrong?" determines the oversight level.&lt;/p&gt;
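
&lt;p&gt;One way to encode the table above so the tier is looked up rather than argued about each time. This is a hedged sketch: the task names in &lt;code&gt;TASK_TIERS&lt;/code&gt; and the policy strings are hypothetical, and the only real design decision is that an unclassified task falls back to the strictest tier.&lt;/p&gt;

```python
from enum import Enum


class BlastRadius(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


# Oversight policy per tier, mirroring the table above.
OVERSIGHT = {
    BlastRadius.LOW: "full_auto",
    BlastRadius.MEDIUM: "ai_executes_human_spot_checks",
    BlastRadius.HIGH: "ai_prepares_human_approves",
    BlastRadius.CRITICAL: "human_executes_ai_assists",
}

# Hypothetical task catalogue for illustration only.
TASK_TIERS = {
    "format_report": BlastRadius.LOW,
    "draft_analysis": BlastRadius.MEDIUM,
    "deploy_release": BlastRadius.HIGH,
    "change_production_db": BlastRadius.CRITICAL,
}


def oversight_for(task):
    """Look up the oversight policy; unknown tasks get the strictest tier."""
    tier = TASK_TIERS.get(task, BlastRadius.CRITICAL)
    return OVERSIGHT[tier]


def may_run_unattended(task):
    """True only for tiers where the AI executes without prior human approval."""
    return oversight_for(task) in {"full_auto", "ai_executes_human_spot_checks"}
```

&lt;p&gt;Defaulting unknown work to Critical means a new task type has to be classified by a person before the agent can run it alone.&lt;/p&gt;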




&lt;h2&gt;
  
  
  The Cost Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Fully Autonomous&lt;/th&gt;
&lt;th&gt;Supervised + Scheduled&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unpredictable, 24/7 burn&lt;/td&gt;
&lt;td&gt;Predictable, runs on schedule&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Drift risk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High, no review loop&lt;/td&gt;
&lt;td&gt;Low, caught at checkpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Catastrophic (see: Kiro)&lt;/td&gt;
&lt;td&gt;Contained, blast radius limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintenance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20–50% of budget&lt;/td&gt;
&lt;td&gt;Fraction, simpler, fewer surprises&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Demo quality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Incredible&lt;/td&gt;
&lt;td&gt;Boring&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The boring option wins.&lt;/strong&gt; Every time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Questions Before You Deploy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. What's the blast radius?&lt;/strong&gt; If this agent gets it wrong, what breaks? A formatting error or a production database?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. What's the budget cap?&lt;/strong&gt; Hard limit on API spend per agent, per run. A logic loop should hit a ceiling, not your credit card.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Where's the human checkpoint?&lt;/strong&gt; For actions above your risk threshold, the agent prepares, a human approves. That's not a bottleneck. That's insurance.&lt;/p&gt;
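
&lt;p&gt;The budget cap from question 2 is the easiest of the three to mechanize. A minimal sketch, assuming you can estimate the cost of each model call before making it; &lt;code&gt;RunBudget&lt;/code&gt;, &lt;code&gt;BudgetExceeded&lt;/code&gt;, and &lt;code&gt;run_loop&lt;/code&gt; are hypothetical names, and the loop only simulates an agent.&lt;/p&gt;

```python
class BudgetExceeded(Exception):
    """Raised when a run would spend past its hard cap."""


class RunBudget:
    """Tracks spend for one agent run, in integer cents to avoid float drift."""

    def __init__(self, cap_cents):
        self.cap_cents = cap_cents
        self.spent_cents = 0

    def charge(self, cost_cents):
        # Refuse the call before it happens, not after the invoice arrives.
        if self.spent_cents + cost_cents > self.cap_cents:
            raise BudgetExceeded(
                f"cap of {self.cap_cents} cents reached after {self.spent_cents} cents"
            )
        self.spent_cents += cost_cents


def run_loop(budget, cost_per_iteration_cents, max_iterations):
    """Simulate a reasoning loop; return how many iterations ran before the cap."""
    completed = 0
    for _ in range(max_iterations):
        try:
            budget.charge(cost_per_iteration_cents)
        except BudgetExceeded:
            break
        completed += 1  # a real agent would make its model call here
    return completed
```

&lt;p&gt;With a 200-cent cap and 60 cents per iteration, the loop stops after three iterations no matter how many the agent wanted. That is the ceiling a logic trap should hit.&lt;/p&gt;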




&lt;h2&gt;
  
  
  The Market Will Correct
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The "fully autonomous" pitch will fade.&lt;/strong&gt; Not because the tech isn't impressive, it is. But production costs are undeniable, and enterprises don't tolerate 13-hour outages from unsupervised AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What survives:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent systems with &lt;strong&gt;defined scopes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human checkpoints&lt;/strong&gt; for high-risk actions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Captured learnings&lt;/strong&gt; so agents don't repeat mistakes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost controls&lt;/strong&gt; that prevent runaway spend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Building from the Philippines,&lt;/strong&gt; cost efficiency isn't optional; it's survival. That constraint forced us to design agent systems that are lean, supervised, and sustainable. Sometimes the best innovations come from not being able to afford the wasteful approach.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Tom Tokita. I run &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology&lt;/a&gt; out of Manila. We build AI operations and Salesforce systems for companies that need things to work, not just demo well. Building agents for production? &lt;a href="https://aether-global.com/contact" rel="noopener noreferrer"&gt;Let's talk.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Read next: &lt;a href="https://dev.to/blog/context-engineering-vs-prompt-engineering"&gt;Context Engineering: Why Your AI Strategy Needs Infrastructure, Not Better Prompts&lt;/a&gt; · &lt;a href="https://dev.to/blog/llm-wrappers-what-actually-matters"&gt;Most AI Tools Are Just LLM Wrappers. Here's What Actually Matters.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Vibe Coding Works. Until It Doesn't. What the Vercel Breach Should Teach Every Developer.</title>
      <dc:creator>Tom Tokita</dc:creator>
      <pubDate>Mon, 27 Apr 2026 07:04:40 +0000</pubDate>
      <link>https://dev.to/tomtokita/vibe-coding-works-until-it-doesnt-what-the-vercel-breach-should-teach-every-developer-386k</link>
      <guid>https://dev.to/tomtokita/vibe-coding-works-until-it-doesnt-what-the-vercel-breach-should-teach-every-developer-386k</guid>
      <description>&lt;p&gt;The vibe coding risks most developers ignore became impossible to deny on April 19, 2026. That's when Vercel, the platform half the Philippine dev community deploys on, &lt;a href="https://www.bleepingcomputer.com/news/security/vercel-confirms-breach-as-hackers-claim-to-be-selling-stolen-data/" rel="noopener noreferrer"&gt;disclosed a security breach&lt;/a&gt;. A threat group called ShinyHunters claimed to be selling stolen data for $2 million on BreachForums.&lt;/p&gt;

&lt;p&gt;The breach didn't come through a firewall exploit. It didn't come through a brute-force attack. It came through an AI tool.&lt;/p&gt;

&lt;p&gt;A Vercel employee had connected Context.ai, a third-party AI productivity tool, to their Google Workspace. Context.ai got compromised. That compromise &lt;a href="https://vercel.com/knowledge-base/security-incident-april-2026" rel="noopener noreferrer"&gt;cascaded into Vercel's internal systems&lt;/a&gt;. Customer environment variables (API keys, tokens, database credentials) were exposed. The intrusion reportedly started in June 2024. It wasn't detected until April 2026. Twenty-two months.&lt;/p&gt;

&lt;p&gt;That's the reality of building on platforms you don't understand.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vibe Coding Is Real. I Use It. But the Risks Are Not Hypothetical.
&lt;/h2&gt;

&lt;p&gt;I'm not here to tell you to stop using AI for coding. I use it every day. Claude, GPT, Gemini. I route between three to five LLMs daily in production. AI-assisted development is how I ship at the pace I do as a lean startup CEO running &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But there's a difference between using AI as a tool within a system you understand, and using AI as a replacement for understanding the system at all.&lt;/p&gt;

&lt;p&gt;That difference is what separates a production application from a demo that dies the moment real traffic hits it.&lt;/p&gt;

&lt;p&gt;The term "vibe coding" was coined to describe building software through AI prompts: describing what you want, letting the model generate the code, and shipping it without necessarily understanding every line. Platforms like &lt;a href="https://tokita.online/how-to-choose-the-right-ai-tool/" rel="noopener noreferrer"&gt;Lovable, Bolt, Cursor, and v0&lt;/a&gt; have made this accessible to anyone with a browser. That's genuinely powerful.&lt;/p&gt;

&lt;p&gt;It's also genuinely dangerous when it becomes your entire engineering strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers Behind Vibe Coding Risks
&lt;/h2&gt;

&lt;p&gt;Vibe coding risks fall into three categories: the code itself has verified security flaw rates approaching 50%, the tools generating it are under active attack, and the platforms you deploy on have been breached for months without detection. Here's the evidence.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Risk&lt;/th&gt;
&lt;th&gt;Evidence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code output&lt;/td&gt;
&lt;td&gt;Nearly half of AI-generated code has security flaws&lt;/td&gt;
&lt;td&gt;CSET Georgetown, Veracode 2026&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI tools&lt;/td&gt;
&lt;td&gt;8 CVEs in 3 months, 135K exposed instances&lt;/td&gt;
&lt;td&gt;OpenClaw, SecurityScorecard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;22-month undetected breach via AI tool&lt;/td&gt;
&lt;td&gt;Vercel / ShinyHunters 2026&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And the research keeps piling up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nearly half of AI-generated code contains exploitable bugs&lt;/strong&gt;, across five major LLMs tested (&lt;a href="https://cset.georgetown.edu/publication/cybersecurity-risks-of-ai-generated-code/" rel="noopener noreferrer"&gt;CSET Georgetown, 2024&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;45% of AI-generated code has security flaws&lt;/strong&gt; across more than 100 large language models (&lt;a href="https://www.veracode.com/blog/spring-2026-genai-code-security/" rel="noopener noreferrer"&gt;Veracode, 2026&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-generated code creates 1.7 times more issues&lt;/strong&gt; than human-authored code in pull request analysis (&lt;a href="https://www.coderabbit.ai/blog/ai-vs-human-code-gen-report" rel="noopener noreferrer"&gt;CodeRabbit&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;43% of AI-generated code changes require manual debugging in production&lt;/strong&gt;, after passing QA and staging (&lt;a href="http://lightrun.com/ebooks/state-of-ai-powered-engineering-2026" rel="noopener noreferrer"&gt;Lightrun, 2026&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4x growth in duplicated code blocks&lt;/strong&gt; since AI coding tools became mainstream, suggesting copy-paste from training data without architectural judgment (&lt;a href="https://www.gitclear.com/blog/ai_copilot_code_quality_2025_data_suggests_4x_growth_in_code_clones" rel="noopener noreferrer"&gt;GitClear, 2025&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't hypothetical risks from academic papers. These are measured failure rates from deployed systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  The AI Tools Themselves Are Getting Hacked
&lt;/h2&gt;

&lt;p&gt;It's not just the code that's the problem. The tools generating the code are under active attack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenClaw&lt;/strong&gt;, the open-source AI agent that went viral in early 2026, has accumulated eight CVEs in just three months, including:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CVE&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-25253 (CVSS 8.8)&lt;/td&gt;
&lt;td&gt;One-click remote code execution, steals your auth token through WebSocket, works even on localhost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-24763&lt;/td&gt;
&lt;td&gt;Command injection through Docker sandbox PATH manipulation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-25593&lt;/td&gt;
&lt;td&gt;Unauthenticated command injection via WebSocket config write&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-26317&lt;/td&gt;
&lt;td&gt;Cross-site request forgery, no origin validation on localhost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CVE-2026-40037&lt;/td&gt;
&lt;td&gt;Request body replay leaking sensitive data across redirects&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://securityscorecard.com/blog/how-exposed-openclaw-deployments-turn-agentic-ai-into-an-attack-surface/" rel="noopener noreferrer"&gt;SecurityScorecard found&lt;/a&gt; &lt;strong&gt;135,000 internet-exposed OpenClaw instances&lt;/strong&gt;. Infosecurity Magazine flagged &lt;strong&gt;63% as vulnerable&lt;/strong&gt;. Over 12,800 were directly exploitable via the patched RCE, meaning they hadn't even updated. Belgium's national cybersecurity center issued an emergency advisory: patch immediately.&lt;/p&gt;

&lt;p&gt;And then there's the &lt;strong&gt;ClawHavoc campaign&lt;/strong&gt;: malicious "skills" distributed through OpenClaw's community registry, deploying information-stealing malware to developers who thought they were installing productivity tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Platform, the Agent, and the Code: All Compromised
&lt;/h2&gt;

&lt;p&gt;Here's the pattern that should concern every developer in the Philippines:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your deployment platform&lt;/strong&gt; (Vercel) got breached through an AI tool an employee used. Twenty-two months of access before anyone noticed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your AI coding agent&lt;/strong&gt; (OpenClaw) has &lt;a href="https://securityscorecard.com/blog/what-are-the-real-security-risks-of-agentic-ai-and-openclaw/" rel="noopener noreferrer"&gt;eight CVEs, 135,000 exposed instances&lt;/a&gt;, and an active malware campaign targeting its plugin ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The code your AI generates&lt;/strong&gt; has a 45% security flaw rate and 1.7 times more issues than what a human writes.&lt;/p&gt;

&lt;p&gt;The entire stack, from infrastructure to agent to output, is compromised if you don't understand what you're deploying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Vibe Coding Risks Hit the Philippines Hardest
&lt;/h2&gt;

&lt;p&gt;Vercel and Next.js are the default stack for a huge segment of Filipino developers. Bootcamp graduates, freelancers on Upwork, startup CTOs: this is the ecosystem. When Vercel gets breached, it's not a distant Silicon Valley story. It's the platform your client's app is running on.&lt;/p&gt;

&lt;p&gt;The Philippines has one of the fastest-growing developer communities in Southeast Asia. AI adoption is accelerating. But the gap between "I can prompt an AI to build an app" and "I can deploy and maintain a secure production system" is enormous. The &lt;a href="https://tokita.online/ai-consultant-philippines/" rel="noopener noreferrer"&gt;2024 data on AI adoption in the Philippines&lt;/a&gt; tells the story: 92% of organizations experimented with AI, 65% got stuck in pilot, and only 3% achieved full adoption. That gap isn't a technology problem. It's a systems thinking problem.&lt;/p&gt;

&lt;p&gt;Vibe coding in the Philippines carries an additional layer of risk: many freelancers and small dev shops are building client applications on these platforms without dedicated security teams, without infrastructure expertise, and without the budget for recovery when things go wrong.&lt;/p&gt;

&lt;p&gt;Vibe coding without systems thinking is like drawing a blueprint on paper. It looks right. It communicates the idea. But the moment it gets wet (real traffic, real attackers, real edge cases), it's destroyed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Vibe Coding: What Production Actually Requires
&lt;/h2&gt;

&lt;p&gt;I'm not arguing against AI-assisted development. I'm arguing for combining it with fundamentals that vibe coding alone will never teach you:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure.&lt;/strong&gt; Understand where your code runs. Know the difference between a serverless function and a container. Know what environment variables are and why they need rotation policies. The Vercel breach exposed credentials that developers stored in plain env vars, because the platform made it easy and nobody questioned it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardening.&lt;/strong&gt; Every deployment needs security headers, input validation, authentication checks, and rate limiting. AI-generated code &lt;a href="https://checkmarx.com/blog/security-in-vibe-coding/" rel="noopener noreferrer"&gt;suggests vulnerable patterns&lt;/a&gt; more often than secure alternatives. If you can't read the code and spot what's missing, you can't ship it.&lt;/p&gt;
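
&lt;p&gt;As a small illustration of the security-headers part, here is a framework-agnostic Python sketch that fills in missing baseline headers on a response and reports which ones were absent. The header names are standard HTTP response headers; the specific values are common defaults I am assuming for the example, and the function names are hypothetical. A real application needs a Content-Security-Policy tuned to what it actually loads.&lt;/p&gt;

```python
# Baseline response headers; values are common defaults, tune them per application.
BASELINE_SECURITY_HEADERS = {
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "Referrer-Policy": "strict-origin-when-cross-origin",
    "Content-Security-Policy": "default-src 'self'",
}


def with_security_headers(headers):
    """Return a copy of `headers` with any missing baseline headers added."""
    present = {name.lower() for name in headers}
    merged = dict(headers)
    for name, value in BASELINE_SECURITY_HEADERS.items():
        # Header names are case-insensitive; never overwrite an explicit choice.
        if name.lower() not in present:
            merged[name] = value
    return merged


def missing_security_headers(headers):
    """List baseline headers a response lacks; usable as a deploy-time check."""
    present = {name.lower() for name in headers}
    return [name for name in BASELINE_SECURITY_HEADERS if name.lower() not in present]
```

&lt;p&gt;Running &lt;code&gt;missing_security_headers&lt;/code&gt; against a staging response in CI is one cheap way to catch AI-generated handlers that never set these.&lt;/p&gt;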

&lt;p&gt;&lt;strong&gt;Edge cases and failure modes.&lt;/strong&gt; AI generates code for happy paths. Production runs on unhappy paths: connections drop, requests time out, databases lock, users do things you never imagined. The &lt;a href="http://lightrun.com/ebooks/state-of-ai-powered-engineering-2026" rel="noopener noreferrer"&gt;43% debugging-in-production rate&lt;/a&gt; exists because AI doesn't think about what happens when things go wrong.&lt;/p&gt;
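
&lt;p&gt;A sketch of what handling one unhappy path looks like: bounded retries with exponential backoff around a call that may time out or lose its connection. The function name and defaults are hypothetical; the &lt;code&gt;sleep&lt;/code&gt; parameter is injectable so the behavior can be tested without waiting.&lt;/p&gt;

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.5, sleep=time.sleep,
                      retry_on=(TimeoutError, ConnectionError)):
    """Call operation(); retry transient failures with exponential backoff.

    Re-raises the last error once `attempts` is exhausted, so the caller
    still sees the failure instead of a silent None.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

&lt;p&gt;Only errors listed in &lt;code&gt;retry_on&lt;/code&gt; are retried. Anything else, such as a validation error, surfaces immediately, because retrying a bug just repeats it.&lt;/p&gt;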

&lt;p&gt;&lt;strong&gt;Dependency auditing.&lt;/strong&gt; AI tools pull in libraries without verifying them. The ClawHavoc campaign exploited exactly this: developers installing unvetted extensions because the tool made it frictionless. Every dependency is an attack surface. This is the same pattern that makes &lt;a href="https://tokita.online/autonomous-ai-agents-production-cost/" rel="noopener noreferrer"&gt;unsupervised AI agents dangerous in production&lt;/a&gt;: the absence of review loops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment pipelines.&lt;/strong&gt; If your deployment process is "push to main and Vercel handles it," you've outsourced your entire release safety to a platform that just got breached for twenty-two months. CI/CD, staging environments, rollback procedures: these exist for a reason.&lt;/p&gt;

&lt;p&gt;In the Philippines, where most dev teams are small and move fast, these fundamentals get skipped because the tooling makes it easy to skip them. That's exactly why they matter more here.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Survival Engineer's Take
&lt;/h2&gt;

&lt;p&gt;I built a production AI operations system out of necessity, not as a product, but as a survival tool for running a lean startup where I wear ten hats. That system uses AI constantly. It also has enforcement hooks, anti-fabrication rules, credential rotation, deployment gates, and rollback procedures.&lt;/p&gt;

&lt;p&gt;The AI makes me faster. The systems thinking keeps me alive.&lt;/p&gt;

&lt;p&gt;Vibe coding is a tool. A good one. But if you're building your career or your company on apps that were prompted into existence without understanding what holds them together, the Vercel breach is your preview of what's coming.&lt;/p&gt;

&lt;p&gt;Learn the fundamentals. Not instead of AI. Alongside it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is vibe coding safe for production applications?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Vibe coding can produce working prototypes quickly, but the research shows significant risks for production deployment. Veracode's 2026 report found that 45% of AI-generated code contains security flaws, and Lightrun's survey found that 43% of AI-generated code changes require manual debugging in production. Vibe coding is safe when combined with code review, security auditing, proper infrastructure knowledge, and deployment pipelines. Without those fundamentals, it's a liability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened in the Vercel breach of April 2026?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Vercel disclosed a security incident on April 19, 2026. A third-party AI tool called Context.ai was compromised, which gave attackers access to a Vercel employee's Google Workspace account. That access cascaded into Vercel's internal systems, exposing customer environment variables including API keys, tokens, and database credentials. The intrusion reportedly began in June 2024, a 22-month dwell time before detection. The threat group ShinyHunters claimed responsibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are the biggest security risks of AI-generated code?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The three main risk layers are: (1) the generated code itself has verified flaw rates approaching 50% across multiple studies, including SQL injection, XSS, and hardcoded credentials; (2) the AI coding tools have their own vulnerabilities (OpenClaw accumulated eight CVEs in three months, with 135,000 exposed instances); and (3) the deployment platforms developers rely on are themselves targets, as the Vercel breach demonstrated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How can Filipino developers reduce vibe coding risks?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Focus on five fundamentals that vibe coding alone won't teach you: understand your infrastructure (don't treat deployment as a black box), harden every deployment (security headers, input validation, rate limiting), test edge cases and failure modes (AI codes for happy paths only), audit dependencies (every library is an attack surface), and build proper deployment pipelines (CI/CD, staging, rollback). Combine AI-assisted development with these practices, the speed of AI plus the safety of systems thinking.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tom Tokita is an AI consultant and operations architect based in Manila, Philippines. He co-founded and runs &lt;a href="https://aether-global.com" rel="noopener noreferrer"&gt;Aether Global Technology Inc.&lt;/a&gt;, a Salesforce consulting partner. He routes between 3-5 LLMs daily in production, not demos, not POCs.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
