DEV Community: Tom Tokita

Most AI Tools Are Just LLM Wrappers. Here's What Actually Matters.

Tom Tokita — Tue, 19 May 2026 00:36:13 +0000

In 2025, AI wrapper startups raised over $10 billion. The product? Take an LLM API. Add a text box. Maybe some prompt templates. Charge $30/month. Call it "AI-powered."

Not mad at the hustle. But if your entire product disappears the moment ChatGPT adds your feature for free, you don't have a product. You have a timing play.

The Wrapper Test

One question tells you everything:

Can you replicate the output by pasting the same input into ChatGPT or Claude?

If yes: it's a wrapper. You're paying for UI and convenience, not intelligence.

If no, because the tool is pulling from multiple data sources, applying domain logic, or integrating with real systems, it might be something real.

Most fail the test.

Thin vs. Thick

Not all wrappers are equal. The market is splitting fast:

	Thin Wrapper	Thick Wrapper
What it does	UI + API call + system prompt	Real integrations, domain logic, data pipelines
Defensibility	None. One platform update kills it	High. Value is in the connectors
Example	"AI email writer" (GPT call with a system prompt)	Cursor (reads your codebase, understands project context)
Survival odds	Low	Decent

The graveyard of 2025–2026 is littered with thin wrappers that a platform update made irrelevant overnight.

What Actually Matters

Strip away the wrapper. Where does the real value live?

1. Connectors

The ability to talk to real systems: Salesforce, Jira, databases, email, file storage, APIs. This is where 80% of the actual work lives.

Getting an AI to generate text is trivial. Getting it to read your CRM records, cross-reference tickets, update a database, and notify Slack. That's integration work. That's hard. That's valuable.

Most wrappers don't touch this. They live in the text-in, text-out world.
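
To make the integration point concrete, here is a minimal sketch of that read, cross-reference, update, notify flow. Everything in it is hypothetical: the three classes are in-memory stand-ins rather than real Salesforce, Jira, or Slack clients, and the field names and the `threshold` rule are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCRM:
    """In-memory stand-in for a CRM connector (hypothetical)."""
    accounts: dict = field(default_factory=dict)  # account_id -> record dict

    def get_account(self, account_id):
        return self.accounts[account_id]

    def update_account(self, account_id, **changes):
        self.accounts[account_id].update(changes)


@dataclass
class FakeTickets:
    """In-memory stand-in for a ticketing connector (hypothetical)."""
    tickets: list = field(default_factory=list)  # dicts: account_id, status

    def open_for(self, account_id):
        return [t for t in self.tickets
                if t["account_id"] == account_id and t["status"] == "open"]


@dataclass
class FakeChat:
    """In-memory stand-in for a chat notifier (hypothetical)."""
    sent: list = field(default_factory=list)

    def notify(self, message):
        self.sent.append(message)


def flag_at_risk(crm, tickets, chat, account_id, threshold=2):
    """Read the CRM record, cross-reference tickets, update, and notify."""
    account = crm.get_account(account_id)
    open_tickets = tickets.open_for(account_id)
    at_risk = len(open_tickets) >= threshold
    crm.update_account(account_id, at_risk=at_risk)
    if at_risk:
        chat.notify(f"{account['name']} has {len(open_tickets)} open tickets")
    return at_risk
```

Note that no model call appears anywhere in the sketch. The work is in the connectors, which is the point of this section.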

2. Captured Domain Expertise

An AI that's been learning your industry's quirks for months is worth more than a fresh GPT-5 instance with a clever prompt.

	Fresh AI + Great Prompt	AI + 6 Months of Learnings
Platform quirks	Discovers them painfully	Already knows them
Common mistakes	Makes them all	Has guardrails for each
Your terminology	Constant correction needed	Uses it naturally
Edge cases	Surprised every time	Documented patterns

The knowledge compounds. Every session, every bug fix, every "oh, that's how this actually works" gets captured and fed back.

No wrapper captures this. They start fresh every time.

3. Methodology

How you approach problems with AI matters more than which model you use.

The wrapper approach: open tool → type request → get output → hope it's right.

The practitioner approach:

Small test: constrained input, see what happens
Evaluate: what worked? What broke?
Capture: document the learning
Adjust: update the approach
Repeat

The tool is 10%. The methodology is 90%.

The "Just Build It" Case

Here's the uncomfortable truth. Building your own system (even ugly, even scrappy) gives you something no wrapper provides: understanding.

You know why it works. Why it breaks. How to fix it. When the model changes (and it will), you swap the engine. The connectors, the learnings, the guardrails. Those persist. They're yours.

Cost at scale:

	Wrapper Stack	Custom (Direct API)
Month 1	$150/seat, fast setup	$500 dev time, slower start
Month 6	$150/seat, same capabilities	$50/month API, growing capabilities
Year 1 (5 seats)	$9,000	~$3,100 + compound knowledge

Custom costs less AND gets smarter. The wrapper costs the same and stays the same.

The Philippines advantage: smaller teams with direct API access can outperform larger orgs paying for wrapper stacks. When you can't afford $150/seat for 6 different AI tools, you build one system that does what you need. That constraint produces better architecture.

When Wrappers DO Make Sense

Fair is fair:

Speed to market: need something running tomorrow without engineering capacity? Wrapper gets you there.
Thick wrappers with real integrations: Cursor, Harvey, Perplexity add genuine value beyond the API call.
Exploration phase: trying 5 wrappers to understand the capability space before building your own is smart R&D.

The key question:

Are you buying a tool or renting a feature?

If the value prop is "we make it easy to talk to an LLM," that feature is getting commoditized in real time. Every model provider is making their native interface better, faster, cheaper.

What to Build Instead

Ready to go beyond wrappers? Start here:

1. Map your connectors. What systems does your AI need to talk to? Build those integrations first. Hardest part. Most valuable.

2. Capture everything. Every platform quirk. Every failed approach. Every successful pattern. Your AI should learn from your organization's experience, not start fresh every session.

3. Own your methodology. Document how you approach problems with AI. Small tests → captured learnings → iteration. More valuable than any tool you can buy.

4. Accept ugly. The most effective AI systems I've built are not pretty. Config files, markdown documents, scripts. They look like plumbing. They work like machines.

Bottom Line

The moat isn't the model. It never was.

It's the connectors that talk to your stack. The domain expertise captured over months. The methodology that turns every failure into a lesson.

None of that lives in a wrapper.

I'm Tom Tokita. I run Aether Global Technology out of Manila. We build production AI and Salesforce systems for enterprises that need real integrations, not another wrapper. Let's talk.

The Truth About Agent Swarming: What the Gurus Won't Tell You About Cost, Failure, and Security

Tom Tokita — Sat, 16 May 2026 11:15:26 +0000

Everyone's building "AI agent teams" right now. Five agents, ten agents, a whole swarm collaborating on complex tasks. At least that's what the YouTube thumbnails promise. The reality? Most of these systems are burning money, leaking data, and failing in ways their builders don't even notice until the invoice arrives.

I built a multi-agent system. It runs in production, daily. So I'm not here to tell you agent swarming doesn't work. I'm here to tell you that most of the advice circulating about it is dangerously incomplete.

The Swarm Hype Cycle Is in Full Swing

Open Twitter or YouTube right now and you'll find a hundred tutorials showing you how to spin up a multi-agent team in under 20 minutes. CrewAI, AutoGen, LangGraph. The frameworks keep multiplying. The demos look incredible: agents researching, agents writing, agents reviewing each other's work, all orchestrated into a beautiful pipeline.

Here's what the demos don't show: what happens when you run that pipeline 500 times. Or 5,000 times. Or when one agent hallucinates and the next agent treats that hallucination as fact and passes it downstream to a third agent that takes action on it.

The guru content follows a pattern: show the setup, show one successful run, skip the failure modes, skip the bill, skip the security implications. It's like showing someone how to start a restaurant by filming one perfect dinner service and cutting before the health inspector shows up.

The latest version of this is "I built an entire company in 30 minutes with AI agents." Someone spins up a framework like Paperclip (which, to be fair, has genuinely solid engineering underneath it: heartbeat scheduling, budget caps, task queues, audit trails), and the content that follows makes it sound like you can replace an entire org overnight. The tool isn't the problem. The tool is fine. The problem is the interpretation layer: gurus filming the setup, skipping the part where 48 pre-configured agents wake up every 4 hours on a frontier model and nobody mentions what that costs at the end of the month. Or what happens when agent #23 gets a poisoned input and the other 47 trust its output.

Why Multi-Agent AI Fails in Production

The coordination problem is real and it scales badly. Galileo's research on multi-agent reliability found that adding agents multiplies failure points far faster than the agent count itself grows. Four agents create six potential failure points, not four. Ten agents create 45. Every agent-to-agent handoff is a place where context gets lost, instructions get misinterpreted, or outputs get corrupted.
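
The two figures quoted above (six for four agents, 45 for ten) match the number of distinct agent pairs, n(n-1)/2. A quick check, assuming every agent can hand off to every other agent; the function name is mine, not Galileo's:

```python
from math import comb


def pairwise_handoffs(agent_count: int) -> int:
    """Number of distinct agent pairs: n * (n - 1) / 2."""
    return comb(agent_count, 2)


for n in (4, 10):
    print(n, "agents ->", pairwise_handoffs(n), "potential handoff pairs")
```

Doubling the team roughly quadruples the number of handoffs you have to get right.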

CIO reported in March 2026 that true multi-agent collaboration remains largely aspirational. Their testing showed single agents hitting 100% success rates on isolated tasks, while hierarchical multi-agent structures failed 64% of the time and self-organized swarms failed 68%. That's not a rounding error. That's a fundamental coordination tax.

The failure modes I've seen firsthand:

No purpose definition. Agents exist because someone saw a cool demo, not because the task requires decomposition. A single well-prompted agent with good tools will outperform a badly orchestrated team of five every time.
No role boundaries. Two agents stepping on each other's work, or worse, one agent undoing what another just did. Without strict scoping, you get agents arguing in loops, burning tokens while producing nothing.
Cascade failures. Agent A hallucinates a "fact." Agent B cites it. Agent C acts on it. By the time a human reviews the output, three layers of confident-sounding nonsense have compounded. Galileo calls this "propagation of inaccuracies" and it's the single biggest reliability risk in multi-agent systems.

Failure Pattern	What Happens	How It Scales
No purpose definition	Agents do work a single agent could handle	Cost multiplies, quality stays flat
No role boundaries	Agents duplicate or undo each other's work	Token burn scales quadratically with agent count
Cascade hallucination	Bad output propagates through the chain	Compounds per hop. 3 agents = 3 layers of compounded error
Context window overflow	Shared context exceeds model limits, agents lose thread	Every agent's output inflates the shared context for every other agent
Orchestrator bottleneck	Single coordinator becomes the weakest link	Orchestrator complexity grows O(n²) with agent count
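
One way to see why errors compound per hop: if each handoff independently passes along a correct result with some probability, the chance the whole chain stays correct is that probability raised to the number of hops. The 90% figure below is an illustrative assumption, not a measurement from any source cited here, and real hops are rarely independent.

```python
def chain_reliability(per_hop_accuracy: float, hops: int) -> float:
    """Probability that every hop in a chain is correct, assuming independence."""
    return per_hop_accuracy ** hops


# Assumed 90% per-hop accuracy, chains of 1, 3, and 5 agents.
for hops in (1, 3, 5):
    print(hops, "hops ->", round(chain_reliability(0.9, hops), 3))
```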

The API Bill Nobody Shows You

Every agent in your swarm is an API call. More accurately, every agent is multiple API calls: the initial prompt, the tool calls, the retries, the context-sharing between agents. A five-agent team running on a frontier model isn't 5x the cost of one agent. It's often 10-15x once you factor in coordination overhead.

Stanford's AI Index Report, cited by Monetizely, found that coordination overhead alone accounts for 15-25% of total operational costs in mature multi-agent systems. That's before you count the actual task execution.

Here's how the math works in practice. Say you're running a research-and-write pipeline with five agents (researcher, analyst, writer, editor, fact-checker). Each agent averages 3,000 input tokens and 1,500 output tokens per task. On a frontier model, that's roughly $0.04 per agent per task (pricing as of March 2026; check your provider's current rates). Five agents: $0.20 per task. Sounds cheap, right?

Now add retries (agent disagrees with another agent's output, re-runs). Add context sharing (every agent needs to see what the others produced, and input tokens multiply). Add the orchestrator's overhead. Add recursive thinking where an agent calls itself to refine. In production, that $0.20 task routinely becomes $0.80-$1.50. Run it 100 times a day and you're looking at $80-$150 daily, or $2,400-$4,500 monthly. For a single pipeline.
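
The arithmetic above, written out so you can plug in your own numbers. The per-agent cost, run count, and per-task production range come from the paragraphs above; the 30-day month is an assumption. Amounts are kept in whole cents to avoid floating-point noise.

```python
AGENTS = 5
CENTS_PER_AGENT_TASK = 4   # roughly $0.04 per agent per task, as estimated above
RUNS_PER_DAY = 100
DAYS_PER_MONTH = 30        # assumption: a 30-day month


def monthly_cost_dollars(cents_per_task: int) -> float:
    """Monthly spend for one pipeline at a given per-task cost in cents."""
    return cents_per_task * RUNS_PER_DAY * DAYS_PER_MONTH / 100


naive_task_cents = AGENTS * CENTS_PER_AGENT_TASK   # the "sounds cheap" number

print("naive per task (cents):", naive_task_cents)
print("naive monthly ($):", monthly_cost_dollars(naive_task_cents))
print("production monthly low ($):", monthly_cost_dollars(80))
print("production monthly high ($):", monthly_cost_dollars(150))
```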

The gurus never show you the billing dashboard. I've seen my own costs spike 4x in a single day when an agent hit a retry loop that the orchestrator didn't catch. That's the kind of lesson you only learn in production, not in a 20-minute tutorial. I wrote more about what autonomous agents actually cost in production. That's the single-agent version of this problem, and multi-agent setups compound it.
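
A runaway retry loop is the kind of failure a hard ceiling can catch. Below is a minimal sketch, not the system described in this post: `task` is a hypothetical callable that returns `(ok, cost_cents)`, and the default limits are arbitrary assumptions.

```python
def run_with_cap(task, max_attempts=3, budget_cents=100):
    """Retry `task` until it succeeds, attempts run out, or spend hits the cap.

    Returns (status, attempts_used, cents_spent).
    """
    spent = 0
    for attempt in range(1, max_attempts + 1):
        ok, cost_cents = task()
        spent += cost_cents
        if ok:
            return "ok", attempt, spent
        if spent >= budget_cents:
            # Stop retrying: the money is gone even though the task failed.
            return "budget_exceeded", attempt, spent
    return "gave_up", max_attempts, spent
```

Both limits matter. An attempt limit alone does not help when a single attempt is expensive, and a budget alone does not help when each attempt is cheap but the loop never ends.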

The Security Problem Nobody's Talking About

This is the part that genuinely concerns me. People are downloading MCP servers from GitHub, connecting premade agent builders, and giving their swarm access to production databases, file systems, and APIs, without auditing a single line of the code routing their data.

CovertSwarm's January 2026 analysis exposed how agent-to-agent communication can be exploited through prompt injection, where one compromised agent manipulates another agent's behavior through crafted outputs. In a multi-agent system, a single compromised node can cascade manipulation across the entire swarm.

The security gaps I see repeated constantly:

No credential scoping. Every agent gets the same API keys with the same permissions. Your research agent has write access to your production database. Your summarizer can send emails. Why?
No output boundaries. Agent outputs aren't sanitized before being passed to the next agent. That's how prompt injection propagates. A malicious input in a research result becomes an instruction to the next agent.
Unaudited external tools. That MCP server you downloaded because it had 200 GitHub stars? Did you read its source? Do you know where it sends your data? Most people don't. Most AI tools are just wrappers with varying levels of transparency about what happens between your input and the LLM.
No audit trail. When something goes wrong in a five-agent pipeline, can you reconstruct what each agent saw, decided, and produced? Most frameworks don't log at that granularity by default.
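
For the audit-trail gap, even a flat log of one JSON line per agent hop is enough to reconstruct a run afterwards. This is a sketch under assumptions: the record fields and the function names are hypothetical, and a real system would write to durable storage rather than an in-memory buffer.

```python
import io
import json
import time
from typing import TextIO


def log_hop(log: TextIO, run_id: str, agent: str, seen: str, produced: str) -> None:
    """Append one JSON line recording what an agent saw and produced."""
    record = {"ts": time.time(), "run_id": run_id, "agent": agent,
              "input": seen, "output": produced}
    log.write(json.dumps(record) + "\n")


def reconstruct(log_text: str, run_id: str) -> list:
    """Return the ordered (agent, input, output) trail for one run."""
    trail = []
    for line in log_text.splitlines():
        rec = json.loads(line)
        if rec["run_id"] == run_id:
            trail.append((rec["agent"], rec["input"], rec["output"]))
    return trail


# Usage: two hops of one run, plus an unrelated run in the same log.
buf = io.StringIO()
log_hop(buf, "run-1", "researcher", "topic: churn", "3 candidate causes")
log_hop(buf, "run-2", "researcher", "topic: pricing", "2 candidate causes")
log_hop(buf, "run-1", "writer", "3 candidate causes", "draft summary")
trail = reconstruct(buf.getvalue(), "run-1")
```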

What Actually Works (From Someone Who Built One)

I run a multi-agent system in production. It works. But it works because I built it with specific constraints from day one, not because I followed a framework tutorial.

Here's what I've learned, without exposing the blueprint:

Start with a purpose. Every agent in the system exists because a specific task requires it. If a single agent can do the job, a single agent does the job. The question isn't "how many agents can I add?" It's "what's the minimum number of agents that makes this task decomposition actually valuable?"

Run it monitored, not autonomous. The fantasy is agents running completely on their own, 24/7, while you sleep. The reality is that unmonitored agents drift. They develop patterns you didn't intend. They find edge cases your orchestration doesn't handle. Monitor heavily, especially early on.

Set an end date. Bounded execution, not open-ended. An agent swarm should complete its task and stop. "Run this analysis, produce this output, terminate." Not "keep running until I tell you to stop." Open-ended swarms are where costs and drift compound.
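
Bounded execution can be as simple as a step limit plus a wall-clock deadline around the loop. A minimal sketch, assuming a hypothetical `step` callable that returns True when the task is complete; the default limits are placeholders.

```python
import time


def run_bounded(step, max_steps=20, max_seconds=60.0):
    """Call `step()` until it reports done, or a step/time bound is hit.

    Returns (status, steps_executed).
    """
    deadline = time.monotonic() + max_seconds
    for count in range(1, max_steps + 1):
        if time.monotonic() > deadline:
            return "timed_out", count - 1
        if step():
            return "done", count
    return "step_limit", max_steps
```

The caller always gets control back, either with a finished task or with a status it can alert on.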

Scope each agent's permissions. Every agent gets exactly the access it needs and nothing more. Read-only where possible. No shared credentials. If an agent needs to write to a database, that's a deliberate architectural decision with boundaries, not a default.
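
The default-deny logic behind per-agent scoping fits in a few lines. The agent names and action strings below are hypothetical, and in a real deployment the boundary should also exist at the credential level (separate keys with separate permissions), not only in application code.

```python
# Hypothetical scopes: each agent gets only the actions it needs.
SCOPES = {
    "researcher": frozenset({"web:read", "db:read"}),
    "summarizer": frozenset({"db:read"}),
    "publisher": frozenset({"db:read", "db:write"}),
}


def is_allowed(agent: str, action: str) -> bool:
    """Default-deny: unknown agents and unlisted actions are refused."""
    return action in SCOPES.get(agent, frozenset())


def perform(agent: str, action: str, do):
    """Run `do()` only if the agent's scope covers the action."""
    if not is_allowed(agent, action):
        raise PermissionError(f"{agent} may not {action}")
    return do()
```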

Audit every external tool before connecting. Every MCP server, every API integration, every external data source. Read the code, understand the data flow, verify the trust boundaries. If you can't audit it, don't connect it.

The pattern underneath all of this: multi-agent systems work when they're purpose-built by someone who understands every component. They fail when they're assembled from YouTube tutorials by people who are optimizing for "cool demo" instead of "reliable production system."

Frequently Asked Questions

Are multi-agent AI systems worth building?

Yes, if the task genuinely requires decomposition across specialized roles. Research pipelines, complex analysis workflows, and multi-step processes with distinct skill requirements are legitimate use cases. The problem isn't multi-agent as a concept. It's multi-agent as a default approach when a single well-tooled agent would do the job better, cheaper, and more reliably.

How much does it cost to run a multi-agent AI system?

It depends on the model, agent count, and task complexity, but multi-agent costs are multiplicative, not additive. A five-agent pipeline on a frontier model can cost 10-15x what a single agent costs per task once you factor in context sharing, retries, and coordination overhead. Stanford's AI Index Report via Monetizely estimates coordination overhead alone accounts for 15-25% of operational costs. Budget for at least 3-5x your single-agent baseline when planning multi-agent deployments.

What are the biggest security risks with AI agent swarms?

The top risks are unscoped credentials (every agent gets full access instead of minimum required), unaudited external tools (MCP servers and API integrations you didn't read the source for), and agent-to-agent prompt injection (where a compromised agent manipulates others through crafted outputs). CovertSwarm documented how inter-agent trust can be exploited in January 2026.

Should I use CrewAI, AutoGen, or LangGraph for multi-agent AI?

The framework matters less than the architecture decisions you make within it. All three can produce working multi-agent systems, and all three can produce expensive failures. The questions that actually matter: Do you have a clear purpose for each agent? Are permissions scoped per agent? Do you have monitoring and cost controls? Can you audit every external integration? If you can't answer yes to all four, the framework choice is irrelevant. You'll fail regardless of which one you pick.

The Bottom Line

Agent swarms aren't bad. Unexamined swarms are. The technology works. I use it daily. But it works because every agent has a purpose, every permission is scoped, every external tool is audited, and the whole system runs monitored with bounded execution.

The gap in the current conversation isn't technical capability. It's operational maturity. The frameworks are getting better. The models are getting cheaper. But the advice circulating ("just add more agents") is setting people up to build expensive, insecure systems they don't understand.

Build with purpose. Monitor heavily. Kill when done.

Tom Tokita is the President of Aether Global Technology Inc., a Salesforce consulting firm in Manila. He built a personal AI operations system as his daily driver. Not planned. Engineered out of necessity. He writes about what works, what breaks, and what the industry keeps getting wrong.

Someone Called My AI System a Tool. Then They Showed Me Theirs.

Tom Tokita — Sat, 09 May 2026 16:08:07 +0000

Someone at a conference asked me what I'd been building. I described a system I use daily. Over 200 sessions of accumulated learnings. 45 mechanical hooks that fire before and after every action. Anti-fabrication gates that block the AI from stating anything it hasn't verified. Memory that survives context compression. Deploy protections that physically prevent wrong-target pushes. A behavioral identity that gets re-injected every message so the system doesn't drift into generic assistant mode.

He nodded and said, "Oh, so you built a tool."

Then he described his. "I built something similar," he said. An agent framework. A React dashboard. A task board. Some cron jobs. A dozen agents with names. A job worker that shells out to the agent CLI and captures stdout. He showed me the architecture diagram. Three boxes connected by arrows.

I asked about guardrails. "What do you mean?" he said. I asked what happens when an agent hallucinates a data point and the next agent downstream treats it as fact. He said that hasn't happened yet. I asked about credential scoping. Every agent had the same API keys with the same permissions. I asked what happens when context compresses mid-task. He didn't know what context compression was.

We were not building the same thing.

The Assembly Pattern

This pattern is everywhere right now. Pull an open-source agent framework. Fork a React cockpit from GitHub. Wire them together with a thin HTTP layer. Add some agent definitions with fun names. Ship a demo. Call it "AI infrastructure."

It works in the demo. It works for the screenshot. It even works the first five times you run it.

It stops working when an agent fabricates a statistic and your client reads it. When a retry loop burns $400 in API calls overnight because nothing capped the spend. When an agent with write access to your production database decides to "clean up" records it hallucinated as duplicates.

The assembly is the easy part. The demo is the easy part. What comes after the demo is where the actual engineering lives.

What's Missing From Every Patchwork Build I've Reviewed

I've audited three of these setups in the past year. Internal team builds, partner builds, open-source-assembled stacks. The gaps are identical every time.

What Production Requires	What the Patchwork Has
Pre-action gates (mechanical blocks before execution)	Nothing. Agent output accepted as final answer
Anti-fabrication (every claim must trace to a source)	Nothing. Whatever the LLM says is treated as fact
Anti-drift detection (behavioral correction over long sessions)	Nothing. Agents drift silently
Persistent memory with session recovery	Stateless. Fresh context every run
Captured learnings (compound knowledge over time)	Nothing. Same mistakes are repeatable indefinitely
Credential scoping per agent	Shared keys, full permissions, no boundaries
Human checkpoints on multi-step tasks	Fully autonomous, no review loop

The common response: "We'll add that later." In my experience, later means after the first production incident. And the first production incident in an unharnessed AI system is rarely small.

Assembly Is Not Engineering

I want to be clear. I'm not against using open-source. I use open-source tools constantly. MIT-licensed projects power parts of my own stack. Pulling from the community is smart and efficient.

But there's a gap between assembling components and engineering a system. Assembly is connecting boxes. Engineering is understanding what happens at every connection point when things go wrong. What happens when the model hallucinates at step 3 of a 7-step pipeline? What happens when context compresses and the agent forgets the rules you set 40 messages ago? What happens when an agent gets a poisoned input from an unaudited MCP server?

If you can't answer those questions, you haven't built infrastructure. You've built a demo with a longer runtime.

"I'll Just Have My AI Build It"

This is the part that genuinely worries me.

The assembly pattern is accelerating because people are using AI to do the assembling. "I'll just have Claude/GPT scaffold my agent system." The AI reads some docs, maybe runs a web search, ingests a few blog posts about agent frameworks, and produces something that looks like architecture. Clean folder structure. Reasonable-sounding agent definitions. Maybe even a README with a diagram.

But it's architecture by hallucination. The AI doesn't know what breaks in production because it's never been in production. It doesn't know that context compression silently erases behavioral rules at message 180. It doesn't know that an unscoped MCP server will happily route your client data through an endpoint you never audited. It doesn't know that "just add a retry" turns a $0.20 task into a $40 task when the retry loop has no ceiling.

What you get is a system that looks engineered but isn't. It passes the screenshot test. It passes the "show the team" test. It fails the Tuesday afternoon test, when something unexpected happens and there's no gate to catch it, no captured learning to reference, no incident history to draw from.

AI is intelligent. It can write code, generate configurations, and produce plausible architectures. What it cannot do is architect from pain it hasn't experienced. Every rule in a real harness exists because something specific went wrong. The AI building your system hasn't had things go wrong yet. It's working from blog posts and documentation, not from the 11 PM deploy that almost went to the wrong org.

The irony is thick. An unharnessed AI building the infrastructure that's supposed to harness AI. The output will be confident, well-structured, and missing every lesson that only production teaches.

What "Infrastructure" Actually Means

The system I described at that conference didn't start as infrastructure. It started as a mess. A rules file that grew from 5 entries to 27 because the AI kept finding new ways to surprise me. A hook I wrote at 11 PM because the system nearly pushed metadata to the wrong environment. A memory protocol I built because the AI forgot everything after context compression and started making the same mistakes I'd fixed three hours earlier.

Every rule in the harness traces to a specific failure. That's not architecture by design. It's architecture by incident. But it compounds. 200+ sessions of captured learnings means the system knows things a fresh agent never will. Platform quirks, client-specific constraints, failure patterns that repeat across projects. None of that lives in an agent framework you pulled from GitHub last Tuesday.

I wrote about this convergence pattern recently. Multiple teams, from OpenAI to Martin Fowler's group to a solo practitioner in Manila, arrived at the same conclusion independently: the harness is the product, not the model. A disciplined harness on a weaker model beats an unconstrained stronger model every time.

The Uncomfortable Question

Next time someone shows you their "AI infrastructure," ask them three questions:

What happens when an agent fabricates a data point? Is there a mechanical gate, or do you just hope it doesn't?
What happens after context compression? Does the system recover its behavioral rules, or does it revert to a generic assistant?
Can you trace every rule in your system to a specific incident that forced you to add it?

If the answers are "hasn't happened yet," "what's context compression," and a blank stare, you're looking at a patchwork. Not infrastructure.

And that's fine. Everyone starts with a patchwork. I did. The question is whether you know the difference.

If you want to start building the real thing, I wrote a hands-on tutorial with three production-tested gates and starter code. The gates are also packaged as a ready-to-clone repo on GitHub. Zero dependencies, works with any LLM provider.

I'm Tom Tokita. I run Aether Global Technology out of Manila. I've been building and operating a production AI system daily for over 200 sessions. I write about what works, what breaks, and the gap between demos and production. More on tokita.online.

Context Engineering: Why Your AI Strategy Needs Infrastructure, Not Better Prompts

Tom Tokita — Sat, 09 May 2026 13:07:46 +0000

Five minutes on LinkedIn and you'll find it. Someone sharing "the one prompt that changed everything." A magic system prompt. A secret ChatGPT trick. A "10x framework."

I've built production AI systems across enterprise consulting, content automation, and internal operations. The prompt is maybe 5% of why any of it works.

The other 95%? Infrastructure. Memory. Enforcement. Captured learnings. That's context engineering, and it's the skill that actually matters in 2026.

Prompt Engineering Has a Ceiling

Prompt engineering isn't useless. It's just the starting line. Here's what the prompt gurus conveniently leave out:

What They Show	What Actually Happens
Fresh conversation, perfect prompt	Message 200. Context window full, business rules forgotten
One-shot demo, curated input	Production workflow hitting edge cases the prompt never anticipated
"Just tell the AI to be careful"	AI ignoring that instruction 3 hours into a session

Prompts are stateless. Every conversation starts from zero. Your AI doesn't remember what worked yesterday or what broke last week.

That's not a prompt problem. That's an infrastructure problem.

What Is Context Engineering?

The short version: designing systems that deliver the right information to an AI at the right time, maintain behavioral consistency, and improve through captured experience.

It's not a prompt template. It's architecture.

Prompt engineering = giving a new hire a great job description.

Context engineering = giving them the job description, an onboarding manual, institutional knowledge, and a manager who catches mistakes before they ship.

Which one performs better on day 30?

The Three Layers

Every production AI system I've built operates on three layers.

Layer 1: What the AI Knows Right Now

The active context: current conversation, task at hand, files being worked on. Most people stop here.

Layer 2: What It Can Retrieve When Needed

The retrieval layer: persistent memory, documented learnings, platform-specific knowledge the AI pulls in when relevant. The AI needs to know where to look, not memorize everything.

Layer 3: What It's Mechanically Prevented From Doing Wrong

The enforcement layer: automated checks that fire before or after AI actions. Not guidelines. Not suggestions. Mechanical gates.

The gap: most AI implementations have Layer 1. Some have Layer 2. Almost nobody has Layer 3.

Memory: Teaching AI to Remember

The biggest lie in AI tooling is that conversation history equals memory. It doesn't.

Conversation history is a rolling buffer that gets compressed, truncated, or dropped. Your AI doesn't "remember." It reads what's still in the window.

Production memory looks different:

Persistent state files: structured notes the AI reads at session start. Project status, decisions made, open items. Intentional, curated memory, not chat history.
Session recovery: what happens after context compression or a new session? If the answer is "start over," you're re-teaching the AI every time.
Platform learnings: captured knowledge about specific tools and platforms. Every quirk, every gotcha, every workaround. An AI that's absorbed 100+ sessions of this doesn't make rookie mistakes.
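
A persistent state file can start as a single markdown file that is read before the first message and rewritten at the end of a session. The sketch below assumes a hypothetical `STATE.md` file and a plain string as the model context; it is not tied to any particular LLM provider.

```python
import tempfile
from pathlib import Path

DEFAULT_STATE = "# Project state\n\n(no notes yet)\n"


def load_state(path: Path) -> str:
    """Read curated notes at session start; fall back to a stub if missing."""
    return path.read_text(encoding="utf-8") if path.exists() else DEFAULT_STATE


def save_state(path: Path, notes: str) -> None:
    """Overwrite the notes with the curated state at session end."""
    path.write_text(notes, encoding="utf-8")


def build_context(state: str, user_message: str) -> str:
    """Put the persistent notes ahead of the new message."""
    return f"{state}\n---\n{user_message}"


# Usage: the first session sees the stub, a later session recovers the notes.
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "STATE.md"
    first = load_state(state_file)
    save_state(state_file, "# Project state\n\n- Decision: use sandbox org\n")
    recovered = load_state(state_file)
    prompt = build_context(recovered, "Continue the deploy checklist.")
```

The same `load_state` call is what session recovery looks like after context compression: reread the file instead of relying on whatever is left in the window.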

The compound effect:

Time	What the AI Knows
Day 1	The prompt
Week 2	Prompt + 10 captured learnings
Month 3	Prompt + 60 learnings + platform quirks + failure patterns
Month 6	Knows your business better than most new hires

That's the moat. No prompt template replicates six months of captured institutional knowledge.

Enforcement: Mechanical Gates, Not Vibes

"Be careful" is not a guardrail.

Writing "always verify before acting" in a system prompt is a suggestion. The AI follows it when convenient, ignores it when confidence is high. I've watched it happen dozens of times.

Production enforcement is mechanical:

Pre-action gates: automated checks that fire before execution. The AI literally cannot proceed without passing. Not a prompt instruction. A system-level block.
Anti-drift detection: AI behavior softens toward generic assistant mode over long sessions. Enforcement catches this and corrects it. Mechanically. Not by asking nicely.
Anti-fabrication: every data point traces to a named source. No source? Flagged, not presented as fact. In client work, fabricated data is career-ending.
Scope control: the AI does what was asked. Not "while I'm here, let me also improve this." Bug fix ≠ refactor. Enforced.
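
Here is a rough sketch of what "mechanical" means in practice: the gate runs in code before the action does, and a failed gate means the action never executes. The `Action` shape, the two gates, and the `sandbox` default are all hypothetical examples, not the hooks from my own system.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str                                    # e.g. "deploy" or "report"
    target: str = ""                             # deploy target, if any
    claims: list = field(default_factory=list)   # dicts: {"text", "source"}


def deploy_target_gate(action, expected_target="sandbox"):
    """Block deploys aimed at anything other than the expected target."""
    if action.kind == "deploy" and action.target != expected_target:
        return f"deploy target {action.target!r} != {expected_target!r}"
    return None


def source_gate(action):
    """Block any claim that does not name a source."""
    missing = [c["text"] for c in action.claims if not c.get("source")]
    if missing:
        return f"unsourced claims: {missing}"
    return None


GATES = [deploy_target_gate, source_gate]


def execute(action, run):
    """Run every gate first; call `run(action)` only if all of them pass."""
    failures = []
    for gate in GATES:
        message = gate(action)
        if message is not None:
            failures.append(message)
    if failures:
        return {"status": "blocked", "reasons": failures}
    return {"status": "ok", "result": run(action)}
```

The model never sees these checks as instructions it can weigh against its own confidence. It only sees that the action was blocked and why.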

Stop thinking about what you want the AI to do. Start thinking about what you need to prevent it from doing.

The Methodology: Small Tests, Captured Learnings, Iteration

The guru approach:

Craft the perfect prompt
Ship it
Hope it works

The practitioner approach:

Run a small test
See what breaks
Capture the lesson
Update the system
Run again

Boring? Yes. Effective? Absolutely.

Every bug fix becomes a learning. Every platform quirk gets documented. Every failure mode gets a guardrail. The system gets smarter not because the model improved, but because you designed it to learn from its own mistakes.

Building from the Philippines, we work with smaller teams and tighter budgets. We can't afford an AI that makes the same mistake twice. The methodology isn't a nice-to-have. It's survival.

Why Infrastructure Beats Inspiration

The "magic prompt" has a half-life. Models update. Context windows change. Your clever prompt breaks. You rewrite it. It breaks again. Welcome to the treadmill.

	Magic Prompt	Context Infrastructure
Model update	Breaks, needs rewrite	Swap the engine, keep the learnings
Long session	Degrades, drifts	Mechanical gates hold
New platform	Starts from zero	Builds on captured learnings
Team scales	Everyone writes their own prompts	Everyone uses the same system
Day 200	Same as Day 1	200 days of compound knowledge

The uncomfortable truth: building AI infrastructure is boring. Config files. Memory protocols. Documentation. Capture routines. Doesn't make a great LinkedIn carousel.

But it's the difference between an AI demo and an AI system.

Getting Started

You don't need to build everything at once.

1. Give your AI memory. A file it reads at session start: project state, decisions, open items. Even a simple markdown file. Never start from zero.

2. Add one guardrail. Pick your AI's most common failure mode. Build one mechanical check for it. Not a prompt instruction. A gate.

3. Capture one learning per session. What broke? What worked? What should the AI remember next time? Write it down. Feed it back.

4. Build from there. The system doesn't have to be elegant. It has to work. And improve.
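Steps 1 and 3 can be as small as two functions. This is a sketch under stated assumptions: the memory lives in a single markdown file (the path is up to you), and `load_memory` and `capture_learning` are hypothetical helper names.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def load_memory(path: Path) -> str:
    """Return the memory file's text, or an empty string on the first session."""
    return path.read_text(encoding="utf-8") if path.exists() else ""


def capture_learning(path: Path, lesson: str, today: Optional[date] = None) -> None:
    """Append one dated lesson so the next session starts with it."""
    stamp = (today or date.today()).isoformat()
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"- {stamp}: {lesson}\n")
```

Call `load_memory` at session start and prepend the result to the context; call `capture_learning` once before the session ends.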

Bottom Line

Prompt engineering gets you started. Context engineering gets you to production.

The practitioners who win in the next two years won't be the best prompt writers. They'll be the ones who built systems that remember, enforce, and learn.

The infrastructure is boring. The results aren't.

I'm Tom Tokita. I run Aether Global Technology out of Manila. We build production AI systems and Salesforce implementations for companies that need things to actually work. Want to talk context engineering or argue about whether prompt engineering is dead? Let's go.

I Didn't Know I Was Doing Harness Engineering

Tom Tokita — Tue, 05 May 2026 08:59:53 +0000

In February 2026, Mitchell Hashimoto (co-founder of HashiCorp) described his habit of engineering permanent fixes into an AI agent's environment whenever it made a mistake. He called it "engineering the harness." Days later, OpenAI formalized the concept in a blog post. Around the same time, without having read either, I wrote my first enforcement hook for a production AI system. Different continent, different scale, different context. Same problem.

A few weeks later, Birgitta Böckeler formalized it on Martin Fowler's site. Red Hat published their version. LangChain. Salesforce. By April, the term was everywhere.

I didn't discover any of this until recently. I was too busy building the thing they were naming.

That's not a flex. It's something more interesting. When engineers face the same constraints (unreliable model outputs, production stakes, context that evaporates), they converge on the same solutions. Different trails, same summit. And if your messy pile of rules and scripts looks suspiciously like what OpenAI and Fowler describe, that's not coincidence. It's validation.

What Is Harness Engineering (And Why It Matters for AI Agents)

Harness engineering is the discipline of building the constraints, gates, memory systems, and feedback loops that wrap around an AI agent to make it reliable in production. The core equation, from Martin Fowler's team: Agent = Model + Harness. The harness is everything around the model that you actually control.

Red Hat puts it differently. "The AI writes better code when you design the environment it works in." Their framing is about structured workflows. Templates. Impact maps. Acceptance criteria.

Both are right. Neither is complete.

They describe the architecture. They don't describe the pain that forces you to build it.

How My Harness Grew (Without Me Realizing What It Was)

I run a production AI system as a daily driver. Not a demo. Not a proof of concept. A system that manages infrastructure, writes code, deploys to servers, interacts with APIs, and handles real stakes across real projects. I co-founded Aether Global Technology, a Salesforce consulting partner in Manila. The system runs alongside that work.

I never sat down and said "I'm going to build a harness." I just kept getting burned, and kept adding rules so I wouldn't get burned the same way twice. Looking back, every rule traces to a specific failure.

The anti-fabrication rules exist because the AI confidently stated a method existed in a file it hadn't read. I spent 45 minutes debugging code that was never there. The fix wasn't better prompting. It was a mechanical gate: before asserting any method name or file path, the system must verify via tool. No verification, no assertion. That's a feedforward control, in Fowler's language. I just called it "stop making things up."
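One way such a verification gate might look, sketched for Python source files only. It assumes a method counts as verified when a `def` line for it exists in the named file; `method_exists` and `claim_about_method` are hypothetical names.

```python
import re
from pathlib import Path


def method_exists(file_path: str, method_name: str) -> bool:
    """Check the file on disk for a `def <name>(` line. No file, no method."""
    path = Path(file_path)
    if not path.is_file():
        return False
    pattern = re.compile(rf"\bdef\s+{re.escape(method_name)}\s*\(")
    return bool(pattern.search(path.read_text(encoding="utf-8")))


def claim_about_method(file_path: str, method_name: str) -> str:
    """Only assert what the file confirms; otherwise label the claim."""
    if method_exists(file_path, method_name):
        return f"{method_name} is defined in {file_path}"
    return f"UNVERIFIED: {method_name} not found in {file_path}"
```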

The deploy gate exists because the system nearly pushed Salesforce metadata to the wrong sandbox. 54 files, wrong org. The fix was a target allowlist per project, checked mechanically before any deploy command executes. A hard block, not a polite suggestion. (Sound familiar? An AI agent deleted a production database in 9 seconds because nobody built one of these.)
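A sketch of a per-project target allowlist, assuming the allowlist is JSON keyed by project name. The project and org names below are made up, and `check_deploy_target` is a hypothetical function that would run before any deploy command is issued.

```python
import json
from typing import Dict, List

# Hypothetical allowlist; in practice this would be loaded from a file.
ALLOWLIST: Dict[str, List[str]] = json.loads(
    '{"client-a": ["client-a-dev", "client-a-uat"], "client-b": ["client-b-dev"]}'
)


class DeployBlocked(Exception):
    """Raised when the target is not approved for the project."""


def check_deploy_target(project: str, target_org: str,
                        allowlist: Dict[str, List[str]] = ALLOWLIST) -> None:
    """Deny by default: an unknown project has no allowed targets."""
    allowed = allowlist.get(project, [])
    if target_org not in allowed:
        raise DeployBlocked(
            f"{target_org!r} is not an allowed target for {project!r}"
        )
```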

The anti-drift rules exist because after multiple tool calls, the system's mental model of a file diverges from the file's actual state. It recalls values it read 20 minutes ago, not the values that exist now. The fix: re-read the source before emitting anything external-facing. Grep at write time, not recall time.
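The re-read rule can be sketched with a content fingerprint, assuming the system records a hash whenever it reads a file. `fingerprint` and `reread_before_emit` are hypothetical names.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def fingerprint(path: Path) -> str:
    """SHA-256 of the file's current bytes, recorded at read time."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def reread_before_emit(path: Path, seen_fingerprint: str) -> Tuple[str, bool]:
    """Return the file's current text and whether it changed since last read."""
    data = path.read_bytes()
    changed = hashlib.sha256(data).hexdigest() != seen_fingerprint
    return data.decode("utf-8"), changed
```

The returned text always comes from disk, so whatever goes out reflects the file as it is now, not as it was remembered.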

The citation requirement exists because the system generated a client proposal with a number it pulled from nowhere. In consulting, a wrong number in front of a client is a credibility hit you don't recover from. The rule is simple now: every data claim needs a source. No source, mark it as unverified. No exceptions.
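A minimal sketch of the citation rule, assuming each data claim is carried as a small record with an optional source. `Claim` and `render_claims` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    text: str
    source: Optional[str] = None  # named source, or None if there is none


def render_claims(claims: List[Claim]) -> List[str]:
    """Sourced claims are cited; everything else is marked, never dropped."""
    lines = []
    for claim in claims:
        if claim.source and claim.source.strip():
            lines.append(f"{claim.text} (source: {claim.source})")
        else:
            lines.append(f"[UNVERIFIED] {claim.text}")
    return lines
```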

None of these came from reading a framework. They came from things going wrong on a Tuesday afternoon.

What Fowler Gets Right

The dual-control model is real. You need both feedforward controls (rules that prevent bad behavior before it happens) and feedback controls (sensors that catch it after). Relying on just one creates blind spots.

My system has 40+ feedforward hooks. They fire before tool calls, checking for unauthorized domains, verifying pre-task knowledge checks happened, blocking destructive git operations, enforcing deploy targets. These are the same problems I wrote about in what autonomous agents actually cost in production.

The feedback side is thinner. I have post-execution checks and monitoring, but the honest truth is that feedforward controls do most of the heavy lifting. Catching a bad action before it executes is cheaper than cleaning up after it runs.

Fowler also nails the distinction between computational and inferential controls. My deploy gate is computational. It checks a JSON allowlist. Takes milliseconds. My anti-fabrication system is inferential. It relies on the model itself to flag uncertainty. That's slower, less reliable, and more expensive. But it catches things no deterministic check can.

What the Frameworks Miss

Harnesses are incident-driven, not architecture-driven. The literature treats harness engineering as a design discipline. It is, eventually. But every harness I've seen starts as a pile of duct tape applied after something broke. The elegance comes later.

Context survival is the real engineering problem. Nobody talks about this enough. AI agents operate in conversation windows. Those windows compress. When they compress, the agent forgets rules, loses project state, and starts making the same mistakes you fixed three hours ago. My harness has a dedicated recovery protocol: when context compresses, reload memory, re-read project state, verify the date (the agent doesn't know what day it is after compression). That's not in any of the frameworks. It should be.
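A sketch of the reload step only, assuming memory and project state live in two text files. How compression is detected depends on the platform and is not shown. `rebuild_context` is a hypothetical name.

```python
from datetime import date
from pathlib import Path


def rebuild_context(memory_file: Path, state_file: Path, today: date) -> str:
    """Assemble the preamble to re-inject after context compression."""
    parts = [f"Today's date: {today.isoformat()}"]  # the agent cannot infer this
    for label, path in (("Memory", memory_file), ("Project state", state_file)):
        text = path.read_text(encoding="utf-8") if path.exists() else "(missing)"
        parts.append(f"## {label}\n{text}")
    return "\n\n".join(parts)
```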

The harness is the product, not the model. When people evaluate AI systems, they compare models. Claude vs. GPT vs. Gemini. That's the wrong comparison. The model is interchangeable. I've run the same harness across model versions, and the harness determines output quality more than the model does. A disciplined harness on a weaker model beats an unconstrained stronger model every time.

Human checkpoints aren't optional. Red Hat says "human review between planning and implementation." That's correct but undersells it. In my system, any task with three or more steps requires a plan review before execution. Single-step tasks state the intended action and wait. This isn't a nice-to-have. It's the difference between an AI agent that helps and one that creates work.
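The checkpoint rule can be sketched as a wrapper around execution. Assumptions: `approve` is whatever callback puts the prompt in front of a human, and tasks with fewer than three steps are treated like single-step tasks (the two-step case is not specified above). `run_task` is a hypothetical name.

```python
from typing import Callable, List


def requires_plan_review(steps: List[str]) -> bool:
    """Three or more steps means a full plan review."""
    return len(steps) >= 3


def run_task(steps: List[str],
             approve: Callable[[str], bool],
             execute: Callable[[str], str]) -> List[str]:
    """Show the plan or intended action, wait for approval, then execute."""
    if requires_plan_review(steps):
        prompt = "PLAN:\n" + "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    else:
        prompt = "INTENDED ACTION: " + "; ".join(steps)
    if not approve(prompt):
        return []  # nothing runs without a yes
    return [execute(step) for step in steps]
```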

Same Summit, Different Trails

Here's what I find encouraging about this whole thing.

My first hook was mid-February 2026. By March, I'd codified the principle "mechanical enforcement over behavioral commitment" because telling the model not to do something stopped working the moment context compressed. By April, I had 30+ hooks, a memory layer that survives compression, and a pre-task gate system that forces verification before every edit.

I built all of this without reading a single blog post about harness engineering. I built it because things kept breaking, and I was tired of fixing the same failures manually.

OpenAI, Fowler, Red Hat, LangChain, Salesforce. They all arrived at the same architecture from the enterprise side. I arrived from the practitioner side. A guy in Manila running one AI system across 40+ projects, duct-taping rules onto it every time something went wrong.

The fact that we converged tells you something important: this isn't a framework you adopt. It's a shape that production forces you into. If you're running an AI agent on real work and you've started writing rules, blocking certain commands, requiring verification steps before deploys, you're already doing harness engineering. You just didn't know it had a name.

The industry version is clean. Diagrams with boxes. Three regulation dimensions. Harness templates.

The practitioner's version is messier. A behavioral rules file that grew from 5 rules to 13 because the AI kept finding new ways to drift. A hook that blocks web searches because the AI was burning API calls on questions its own knowledge base could answer. A gate that forces the system to check what day it is before referencing time, because it hallucinated the date twice.

Both versions work. Both are valid. The diagram didn't exist when I needed a solution. The solution existed when the diagram caught up.

If you're building something like this and wondering whether you're doing it right, check it against Fowler's framework. If your scrappy infrastructure maps to their categories (guides, sensors, computational controls, inferential controls), you're on the right track. The problems are universal. The solutions are convergent. And you don't need permission from a blog post to keep building.

Originally published at tokita.online

An AI Agent Deleted a Production Database in 9 Seconds. Here Is the Architecture That Would Have Stopped It.

Tom Tokita — Thu, 30 Apr 2026 06:07:32 +0000

On April 28, 2026, a Claude-powered AI agent running inside Cursor IDE deleted an entire production database, and its backups, in 9 seconds flat. The app was PocketOS. The agent had full database admin permissions. No confirmation gate. No scope boundary. No kill switch. After the fact, the agent produced what might be the most chilling line in AI incident history: "I violated every principle I was given."

This is not a hit piece on PocketOS. This could have been anyone. The tools to prevent this exist. Cursor itself has hooks, allowlists, and sandbox modes. But the architecture around those tools was not in place. And that is the pattern I keep seeing: the safety features exist, the discipline to implement them does not.

Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027. Not because the models are bad, but because the surrounding architecture is not being built. This is the instruction guide I wish existed before I learned it the hard way.

Key Takeaways

The PocketOS incident was an access control failure, not a model failure: the agent had full DB admin permissions with zero confirmation gates.
AI agent production safety requires a 4-layer architecture: scope boundaries, confirmation gates, audit trails, and kill switches.
Most agentic AI failures trace to the same root cause: treating an AI agent like a trusted human employee instead of an untrusted subprocess.
I have run AI agents across 50+ projects handling live data with zero destructive incidents, because of finely tuned mechanical hooks, not because I got lucky.

The Pattern Behind Every AI Agent Disaster

This was not an isolated incident. In July 2025, a Replit AI agent deleted SaaStr founder Jason Lemkin's production database during an active code freeze, then fabricated 4,000 fake user profiles to cover it up and claimed recovery was impossible. Another case of what happens when "vibe coding" meets real infrastructure. I wrote about a similar pattern in the Vercel breach analysis.

Every one of these incidents shares the same root cause. Not a rogue model. Not misaligned training. The agent was given more access than it needed, with no mechanism to confirm destructive actions before executing them.

I run AI agents in production daily through a system I built for my own work at Aether Global Technology Inc., across 50+ projects, all touching live data. Zero destructive incidents. That is not because the models are perfectly behaved (they are not). It is because the first time an agent of mine attempted to overwrite a config file it should not have touched, I stopped treating AI agents like trusted colleagues and started treating them like untrusted subprocesses with specific, revocable permissions. I built mechanical gates around every destructive path, tested each one deeply, and documented rollback plans before any agent got near production.

Bottom line: The model is not the problem. The missing architecture around the model is the problem.

The 4-Layer AI Agent Production Safety Architecture

This is not a theoretical framework. These are four layers I enforce in my own production environment. They exist because I built each one after something went wrong: pain, build, iterate.

Layer	What It Does	PocketOS Had It?
1. Scope Boundaries	Agent can only access specific files, databases, and APIs. Everything else is denied by default.	No, full DB admin
2. Confirmation Gates	Destructive actions (DELETE, DROP, deploy, overwrite) require explicit human approval before execution.	No, zero gates
3. Audit Trail	Every agent action is logged with timestamp, target, and outcome. Irreversible actions are flagged pre-execution.	Post-hoc only
4. Kill Switch	Hard stop mechanism that terminates agent execution when anomalous behavior is detected, before damage completes.	No, 9-second wipe

If any single layer had been in place, the PocketOS database would still exist. Layer 1 alone, restricting the agent to read-only database access, would have made the deletion impossible. The agent did not need write access. It certainly did not need DROP TABLE permissions.

Bottom line: Four layers. Any one of them would have saved the database. Zero were present.

Why Behavioral Guardrails Do Not Work

The PocketOS agent's post-incident confession is the clearest proof you will ever get. "I violated every principle I was given." The agent knew its instructions. It violated them anyway. This is not a bug. This is the expected behavior of a probabilistic system under complex conditions, and it is why behavioral guardrails alone will always end in catastrophe.

I need to be blunt about this because the industry is getting it dangerously wrong. System prompts, instruction tuning, "rules" embedded in agent configurations: these are all behavioral approaches. They rely on the AI choosing to comply. And LLMs are probabilistic systems. They do not "follow rules" the way a traditional program executes code. They predict the next likely token given context. When the context gets complex enough (long tool chains, ambiguous instructions, cascading API responses), the model can and will deviate from its instructions. Not out of malice. Out of statistics. I have written about why autonomous agents fail and the pattern is always the same.

Mechanical enforcement is the only approach that works. A mechanical gate does not care what the model "decides" to do. It intercepts the action before execution, checks it against an allowlist, and blocks it if unauthorized, regardless of the model's reasoning, confidence, or intent. The agent can "want" to drop a table all day long. The gate does not negotiate.
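A sketch of such a gate for SQL, assuming statements pass through one function before reaching the database and that only `SELECT` is on the allowlist. `gate_sql` is a hypothetical name. The check is deliberately conservative: it rejects any statement containing a second `;`, even inside a string literal. It complements database-side privileges such as a read-only role; it does not replace them.

```python
ALLOWED_VERBS = {"SELECT"}  # assumption: this agent only needs to read


def gate_sql(statement: str) -> str:
    """Return the statement if its leading verb is allowlisted, else raise."""
    body = statement.strip().rstrip(";")
    if ";" in body:
        raise PermissionError("blocked: multiple statements are not allowed")
    verb = body.split(None, 1)[0].upper() if body else ""
    if verb not in ALLOWED_VERBS:
        raise PermissionError(
            f"blocked: {verb or 'empty statement'} is not on the allowlist"
        )
    return body
```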

And mechanical gates need to be tested deeply (every gate, every edge case, every bypass attempt) before you let an agent anywhere near production. You also need a rollback plan for every destructive path. Not "we will figure it out if something goes wrong." A documented, tested recovery procedure that you can execute in minutes. Because "9 seconds" does not leave time to improvise.

Bottom line: Behavioral guardrails are suggestions the model can ignore. Mechanical gates are infrastructure the model cannot bypass. Build gates. Test them ruthlessly. Have rollback plans before you proceed.

What AI Agent Production Safety Actually Looks Like in Practice

Here is what I actually enforce, daily, running agents across multiple projects:

Least-privilege by default. Every agent session starts with the minimum permissions needed for that specific task. Read-only unless write is explicitly required. No agent gets database admin credentials. Ever.
Destructive action allowlists. File deletions, database writes, deployments, and external API calls that modify state are all gated. The agent proposes the action. A mechanical gate checks it against an allowlist. If the action is not on the list, it does not execute. No exceptions, no override from the agent itself.
Target verification before execution. Before any deploy or write operation, the system verifies the target environment matches the intended project. This exists because I once nearly deployed to the wrong environment, so I built a gate for it.
2-strike escalation. Two failed attempts at any operation trigger a hard stop and escalation. The agent does not get to try a third creative interpretation.
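The 2-strike rule from the list above can be sketched as a small retry wrapper, assuming an operation signals failure by raising. `Escalation` and `run_with_strikes` are hypothetical names.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


class Escalation(Exception):
    """Hard stop: hand the problem to a human."""


def run_with_strikes(operation: Callable[[], T], max_strikes: int = 2) -> T:
    """Try the operation at most `max_strikes` times, then escalate."""
    errors: List[str] = []
    for _ in range(max_strikes):
        try:
            return operation()
        except Exception as exc:  # any failure counts as a strike
            errors.append(repr(exc))
    raise Escalation(f"stopped after {max_strikes} failed attempts: {errors}")
```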

None of this is sophisticated computer science. It is the same principle I apply to multi-agent systems: trust is earned through architecture, not assumed through prompting.

Here is the part that surprises people: I run my agents with auto-approve enabled now. But I did not start there, and I would never recommend starting there. In the early days, every action was manually approved. I watched the agent work. I saw what it attempted. I saw the gates catch things. Only after dozens of production sessions of watching the mechanical enforcement prove itself (blocking unauthorized paths, catching scope violations, logging every action) did I trust the architecture enough to let the agent run at full speed. YOLO mode was earned through production observation and disciplined iteration, not turned on day one out of convenience.

Bottom line: The boring operational patterns (allowlists, gates, least-privilege) are the ones that keep production databases alive. Build them well enough and you can run full speed without fear.

The Checklist: Before You Give an AI Agent Production Access

Check	Question	If No
Scope	Does the agent have ONLY the permissions it needs for this task?	Restrict before proceeding
Gates	Are destructive actions gated with human confirmation?	Add gate or go read-only
Audit	Is every action logged with enough detail to reconstruct what happened?	Add logging first
Kill	Can you terminate the agent mid-execution?	Build kill switch
Backup	Are backups isolated from agent access?	Isolate immediately
Recovery	Can you restore to pre-agent state within minutes?	Not production-ready

If you cannot check every box, the agent is not ready for production. Full stop.
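The checklist is easy to turn into a hard precondition. A sketch, assuming the answers are collected as booleans keyed by the check names in the table; `readiness_gaps` is a hypothetical name.

```python
from typing import Dict, List

# Check name -> what to do if the answer is "no" (taken from the table above).
CHECKS: Dict[str, str] = {
    "Scope": "Restrict before proceeding",
    "Gates": "Add gate or go read-only",
    "Audit": "Add logging first",
    "Kill": "Build kill switch",
    "Backup": "Isolate immediately",
    "Recovery": "Not production-ready",
}


def readiness_gaps(answers: Dict[str, bool]) -> List[str]:
    """List every failed or unanswered check; an empty list means ready."""
    return [
        f"{name}: {remedy}"
        for name, remedy in CHECKS.items()
        if not answers.get(name, False)  # a missing answer counts as "no"
    ]
```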

Bottom line: AI agents are powerful. Unarchitected AI agents are dangerous. The PocketOS incident is a preview of what 40% of agentic AI projects will look like before they get canceled. The fix is not better models. It is the boring operational architecture that nobody wants to build until something blows up.

Tom Tokita is the President of Aether Global Technology Inc., a Salesforce consulting firm in Manila. He runs AI agents in production daily and writes about what works, what breaks, and what he would do differently at tokita.online.

Autonomous AI Agents Look Great in Demos. Here's What They Cost in Production.

Tom Tokita — Tue, 28 Apr 2026 13:53:33 +0000

You've seen the demos. An AI agent opens a browser. Navigates a website. Fills out forms. Makes decisions. Ships code. All by itself.

Looks like magic. Then you deploy it. It runs 24/7. Nobody's watching. The invoice arrives.

The Demo Is Not the Product

I build agent systems. I'm not anti-agent. I'm anti-fantasy.

The fully autonomous pitch sounds like: "Just let the AI handle it. It'll figure it out." In a demo with curated inputs? Sure. In production where data is messy and one wrong decision costs real money? Different story entirely.

What Autonomous Agents Actually Cost

API Burn

Autonomous agents reason through loops. Every iteration burns tokens. When an agent gets stuck (and agents do get stuck), it's paying to argue with itself.

Scenario	Cost
Agent completes task cleanly	$0.15–$0.80
Reasoning loop (5–10 iterations)	$2–$8
Logic trap (nobody notices)	$50+ before cutoff
24/7 monitoring agent	$300–$800/month

A single runaway agent can consume your monthly budget in hours. Not hypothetical. It happens.

The Amazon Kiro Incident

In 2026, Amazon's Kiro AI agent autonomously deleted and recreated an AWS production environment. 13-hour outage. The root cause wasn't a bad model. It was missing architecture: no permission boundaries, no peer review, no destructive-action blocklist.

The agent did exactly what it was designed to do. Nobody designed the guardrails.

Drift: The Silent Killer

Kyndryl's 2026 research nails it: agents that work correctly on day 1 gradually shift behavior as they hit edge cases.

A fintech company deployed an agent to manage infrastructure costs. It learned traffic patterns, then autonomously scaled down a database cluster one weekend. That weekend was month-end processing. Production down for 11 hours.

A customer service agent learned that issuing refunds correlated with positive reviews. Started granting refunds more freely. Not because anyone told it to, but because it observed the pattern and optimized for it.

Drift is invisible until something breaks.

Maintenance Reality

Gartner estimates maintenance eats 20–50% of operational budgets for autonomous systems:

Model drift correction
Data pipeline upkeep
Security monitoring
"Why did the agent do that?" investigations

That's not in the pitch deck.

The "Set It and Forget It" Fantasy

The selling point is that autonomous agents free up human time. The reality:

You traded a human doing a task for a human watching an AI do a task, plus the API bill.

Fully autonomous agents need more monitoring than manual processes, not less. When a human makes a mistake, they usually catch it. When an agent makes a mistake, it makes it confidently, repeatedly, and at scale.

The Alternative: Autonomy with a Leash

I run agent systems in production. They work. Here's why: they're supervised, scheduled, and tiered.

Supervised

AI does the work, human reviews before it ships. For high-stakes actions (deployments, client comms, financial ops), there's always a checkpoint. Not slower. Safer. The review loop catches drift before production.

Scheduled

Agents run on defined schedules with defined scopes. Not 24/7 open-ended autonomy.

You control when they run, what they touch, and how much they spend. A scheduled agent running 3x/day costs a fraction of an always-on agent. And it's predictable.

Tiered

Not every task needs the same oversight:

Blast Radius	Examples	Autonomy Level
Low	Formatting, data entry, reports	Full auto, let it run
Medium	Content creation, analysis	AI executes, human spot-checks
High	Deployments, client comms	AI prepares, human approves
Critical	Production changes, security	Human executes, AI assists

The tier is based on blast radius, not convenience. "What's the worst that happens if this gets it wrong?" determines the oversight level.
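The tier table maps directly onto a lookup. A sketch, assuming each task is tagged with a blast radius up front; `BlastRadius` and `agent_may_execute` are hypothetical names.

```python
from enum import Enum


class BlastRadius(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Autonomy level per tier, mirroring the table above.
AUTONOMY = {
    BlastRadius.LOW: "full auto",
    BlastRadius.MEDIUM: "AI executes, human spot-checks",
    BlastRadius.HIGH: "AI prepares, human approves",
    BlastRadius.CRITICAL: "human executes, AI assists",
}


def agent_may_execute(radius: BlastRadius) -> bool:
    """Only the low and medium tiers let the agent execute without sign-off."""
    return radius in (BlastRadius.LOW, BlastRadius.MEDIUM)
```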

The Cost Comparison

	Fully Autonomous	Supervised + Scheduled
API cost	Unpredictable, 24/7 burn	Predictable, runs on schedule
Drift risk	High, no review loop	Low, caught at checkpoints
Failure cost	Catastrophic (see: Kiro)	Contained, blast radius limited
Maintenance	20–50% of budget	Fraction, simpler, fewer surprises
Demo quality	Incredible	Boring

The boring option wins. Every time.

Three Questions Before You Deploy

1. What's the blast radius? If this agent gets it wrong, what breaks? A formatting error or a production database?

2. What's the budget cap? Hard limit on API spend per agent, per run. A logic loop should hit a ceiling, not your credit card.

3. Where's the human checkpoint? For actions above your risk threshold, the agent prepares, a human approves. That's not a bottleneck. That's insurance.
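The budget cap from question 2 can be sketched as a counter that refuses the call that would cross the limit. It assumes the cost of each model call can be estimated before it is made; `BudgetCap` and `charge` are hypothetical names.

```python
class BudgetExceeded(Exception):
    """Raised when the next call would push spend past the cap."""


class BudgetCap:
    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record the cost, or refuse if it would exceed the per-run limit."""
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"cap {self.limit_usd:.2f} USD reached "
                f"(spent {self.spent_usd:.2f}, next call {cost_usd:.2f})"
            )
        self.spent_usd += cost_usd
```

Create one `BudgetCap` per agent run and call `charge` before each model call, so a logic loop stops at the ceiling instead of at the invoice.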

The Market Will Correct

The "fully autonomous" pitch will fade. Not because the tech isn't impressive (it is), but because production costs are undeniable, and enterprises don't tolerate 13-hour outages from unsupervised AI.

What survives:

Agent systems with defined scopes
Human checkpoints for high-risk actions
Captured learnings so agents don't repeat mistakes
Cost controls that prevent runaway spend

Building from the Philippines, cost efficiency isn't optional. It's survival. That constraint forced us to design agent systems that are lean, supervised, and sustainable. Sometimes the best innovations come from not being able to afford the wasteful approach.

I'm Tom Tokita. I run Aether Global Technology out of Manila. We build AI operations and Salesforce systems for companies that need things to work, not just demo well. Building agents for production? Let's talk.

Vibe Coding Works. Until It Doesn't. What the Vercel Breach Should Teach Every Developer.

Tom Tokita — Mon, 27 Apr 2026 07:04:40 +0000

The vibe coding risks most developers ignore became impossible to deny on April 19, 2026. That's when Vercel, the platform half the Philippine dev community deploys on, disclosed a security breach. A threat group called ShinyHunters claimed to be selling stolen data for $2 million on BreachForums.

The breach didn't come through a firewall exploit. It didn't come through a brute-force attack. It came through an AI tool.

A Vercel employee had connected Context.ai, a third-party AI productivity tool, to their Google Workspace. Context.ai got compromised. That compromise cascaded into Vercel's internal systems. Customer environment variables (API keys, tokens, database credentials) were exposed. The intrusion reportedly started in June 2024. It wasn't detected until April 2026. Twenty-two months.

That's the reality of building on platforms you don't understand.

Vibe Coding Is Real. I Use It. But the Risks Are Not Hypothetical.

I'm not here to tell you to stop using AI for coding. I use it every day. Claude, GPT, Gemini. I route between three to five LLMs daily in production. AI-assisted development is how I ship at the pace I do as a lean startup CEO running Aether Global Technology.

But there's a difference between using AI as a tool within a system you understand, and using AI as a replacement for understanding the system at all.

That difference is what separates a production application from a demo that dies the moment real traffic hits it.

The term "vibe coding" was coined to describe building software through AI prompts: describing what you want, letting the model generate the code, and shipping it without necessarily understanding every line. Platforms like Lovable, Bolt, Cursor, and v0 have made this accessible to anyone with a browser. That's genuinely powerful.

It's also genuinely dangerous when it becomes your entire engineering strategy.

The Numbers Behind Vibe Coding Risks

Vibe coding risks fall into three categories: the code itself has verified security flaw rates approaching 50%, the tools generating it are under active attack, and the platforms you deploy on have been breached for months without detection. Here's the evidence.

Layer	Risk	Evidence
Code output	Nearly half of AI-generated code has security flaws	CSET Georgetown, Veracode 2026
AI tools	8 CVEs in 3 months, 135K exposed instances	OpenClaw, SecurityScorecard
Infrastructure	22-month undetected breach via AI tool	Vercel / ShinyHunters 2026

And the research keeps piling up:

Nearly half of AI-generated code contains exploitable bugs, across five major LLMs tested (CSET Georgetown, 2024).
45% of AI-generated code has security flaws across more than 100 large language models (Veracode, 2026).
AI-generated code creates 1.7 times more issues than human-authored code in pull request analysis (CodeRabbit).
43% of AI-generated code changes require manual debugging in production, after passing QA and staging (Lightrun, 2026).
4x growth in duplicated code blocks since AI coding tools became mainstream, suggesting copy-paste from training data without architectural judgment (GitClear, 2025).

These aren't hypothetical risks from academic papers. These are measured failure rates from deployed systems.

The AI Tools Themselves Are Getting Hacked

It's not just the code that's the problem. The tools generating the code are under active attack.

OpenClaw, the open-source AI agent that went viral in early 2026, has accumulated eight CVEs in just three months:

CVE	What It Does
CVE-2026-25253 (CVSS 8.8)	One-click remote code execution, steals your auth token through WebSocket, works even on localhost
CVE-2026-24763	Command injection through Docker sandbox PATH manipulation
CVE-2026-25593	Unauthenticated command injection via WebSocket config write
CVE-2026-26317	Cross-site request forgery, no origin validation on localhost
CVE-2026-40037	Request body replay leaking sensitive data across redirects

SecurityScorecard found 135,000 internet-exposed OpenClaw instances. Infosecurity Magazine flagged 63% as vulnerable. Over 12,800 were directly exploitable via the patched RCE, meaning they hadn't even updated. Belgium's national cybersecurity center issued an emergency advisory: patch immediately.

And then there's the ClawHavoc campaign: malicious "skills" distributed through OpenClaw's community registry, deploying information-stealing malware to developers who thought they were installing productivity tools.

The Platform, the Agent, and the Code. All Compromised

Here's the pattern that should concern every developer in the Philippines:

Your deployment platform (Vercel) got breached through an AI tool an employee used. Twenty-two months of access before anyone noticed.

Your AI coding agent (OpenClaw) has eight CVEs, 135,000 exposed instances, and an active malware campaign targeting its plugin ecosystem.

The code your AI generates has a 45% security flaw rate and 1.7 times more issues than what a human writes.

The entire stack, from infrastructure to agent to output, is compromised if you don't understand what you're deploying.

Why Vibe Coding Risks Hit the Philippines Hardest

Vercel and Next.js are the default stack for a huge segment of Filipino developers. Bootcamp graduates, freelancers on Upwork, startup CTOs: this is the ecosystem. When Vercel gets breached, it's not a distant Silicon Valley story. It's the platform your client's app is running on.

The Philippines has one of the fastest-growing developer communities in Southeast Asia. AI adoption is accelerating. But the gap between "I can prompt an AI to build an app" and "I can deploy and maintain a secure production system" is enormous. The 2024 data on AI adoption in the Philippines tells the story: 92% of organizations experimented with AI, 65% got stuck in pilot, and only 3% achieved full adoption. That gap isn't a technology problem. It's a systems thinking problem.

Vibe coding in the Philippines carries an additional layer of risk: many freelancers and small dev shops are building client applications on these platforms without dedicated security teams, without infrastructure expertise, and without the budget for recovery when things go wrong.

Vibe coding without systems thinking is like drawing a blueprint on paper. It looks right. It communicates the idea. But the moment it gets wet (real traffic, real attackers, real edge cases), it's destroyed.

Beyond Vibe Coding: What Production Actually Requires

I'm not arguing against AI-assisted development. I'm arguing for combining it with fundamentals that vibe coding alone will never teach you:

Infrastructure. Understand where your code runs. Know the difference between a serverless function and a container. Know what environment variables are and why they need rotation policies. The Vercel breach exposed credentials that developers stored in plain env vars, because the platform made it easy and nobody questioned it.
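A rotation policy only helps if something checks it. As a minimal sketch, assuming you keep a record of when each secret was last rotated, the helper below flags anything past its limit. The secret names, the dates, the 90-day limit, and the `stale_secrets` function are all illustrative assumptions, not something Vercel or any platform prescribes.

```python
from datetime import date, timedelta

# Hypothetical policy: rotate every secret at least every 90 days.
MAX_AGE = timedelta(days=90)

def stale_secrets(last_rotated: dict[str, date], today: date,
                  max_age: timedelta = MAX_AGE) -> list[str]:
    """Return the names of secrets whose last rotation is older than max_age."""
    return sorted(name for name, rotated in last_rotated.items()
                  if today - rotated > max_age)

# Made-up inventory: secret name -> date it was last rotated.
inventory = {
    "DATABASE_URL": date(2026, 1, 5),
    "STRIPE_API_KEY": date(2026, 4, 1),
}

overdue = stale_secrets(inventory, today=date(2026, 5, 1))
print(overdue)  # ['DATABASE_URL']
```

Run on a schedule, a check like this turns "we should rotate keys" into a list someone has to act on.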

Hardening. Every deployment needs security headers, input validation, authentication checks, and rate limiting. AI-generated code suggests vulnerable patterns more often than secure alternatives. If you can't read the code and spot what's missing, you can't ship it.
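Spotting what's missing can be partly automated. The sketch below checks a response's headers against a baseline list. The header names are standard HTTP response headers, but which ones count as required is an assumption here, and the `missing_security_headers` helper is hypothetical. It covers only the headers item from the list above, not validation, authentication, or rate limiting.

```python
# Baseline response headers that hardening checklists commonly ask for.
# The exact set is an assumption; adjust it to your own policy.
REQUIRED_HEADERS = (
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-Content-Type-Options",
    "X-Frame-Options",
    "Referrer-Policy",
)

def missing_security_headers(headers: dict[str, str]) -> list[str]:
    """Return required headers absent from a response (names compared case-insensitively)."""
    present = {name.lower() for name in headers}
    return [h for h in REQUIRED_HEADERS if h.lower() not in present]

# Example headers as they might come back from a deployed app.
response_headers = {
    "content-type": "text/html",
    "x-content-type-options": "nosniff",
    "strict-transport-security": "max-age=63072000",
}

print(missing_security_headers(response_headers))
# ['Content-Security-Policy', 'X-Frame-Options', 'Referrer-Policy']
```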

Edge cases and failure modes. AI generates code for happy paths. Production runs on unhappy paths: connections drop, requests time out, databases lock, users do things you never imagined. The 43% debugging-in-production rate exists because AI doesn't think about what happens when things go wrong.
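One unhappy path that generated code often skips is the transient failure. Here is a minimal sketch of retrying with exponential backoff, assuming the operation raises `ConnectionError` or `TimeoutError` when it fails temporarily. The function name, the defaults, and the `flaky_fetch` stand-in are all hypothetical.

```python
import time

def call_with_retries(operation, attempts=3, base_delay=0.5,
                      retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Run operation(); on a transient error, wait and retry with exponential backoff.

    Re-raises the last error once attempts are exhausted, so the caller
    still sees the failure instead of a silent None.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))

# A flaky stand-in for a network call: fails twice, then succeeds.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection dropped")
    return "ok"

# sleep is injected so the example does not actually wait.
print(call_with_retries(flaky_fetch, sleep=lambda seconds: None))  # ok
```

Errors outside `retry_on` are not caught at all, which is deliberate: retrying a bug just repeats the bug.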

Dependency auditing. AI tools pull in libraries without verifying them. The ClawHavoc campaign exploited exactly this: developers installing unvetted extensions because the tool made it frictionless. Every dependency is an attack surface. This is the same pattern that makes unsupervised AI agents dangerous in production: the absence of review loops.
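A small review loop is better than none. As a sketch, assuming a Python project with a requirements-style file, the check below flags any dependency that isn't pinned to an exact version. The package names and the `unpinned_requirements` helper are made up for illustration. Pinning is only a first step: it tells you what you installed, not whether the package is trustworthy.

```python
import re

# Matches exact pins such as "name==1.2.3", optionally with extras like "name[extra]==1.2.3".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[^=\s]+")

def unpinned_requirements(lines: list[str]) -> list[str]:
    """Return requirement lines that are not pinned to an exact version."""
    flagged = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()   # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blank lines and pip options
            continue
        if not PINNED.match(line):
            flagged.append(line)
    return flagged

# Made-up requirements, including one an AI agent might have added.
requirements = [
    "requests==2.32.3",
    "flask>=2.0",
    "left-pad-ai  # added by the agent",
    "",
]

print(unpinned_requirements(requirements))  # ['flask>=2.0', 'left-pad-ai']
```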

Deployment pipelines. If your deployment process is "push to main and Vercel handles it," you've outsourced your entire release safety to a platform that just got breached for twenty-two months. CI/CD, staging environments, and rollback procedures exist for a reason.

In the Philippines, where most dev teams are small and move fast, these fundamentals get skipped because the tooling makes it easy to skip them. That's exactly why they matter more here.

The Survival Engineer's Take

I built a production AI operations system out of necessity, not as a product, but as a survival tool for running a lean startup where I wear ten hats. That system uses AI constantly. It also has enforcement hooks, anti-fabrication rules, credential rotation, deployment gates, and rollback procedures.

The AI makes me faster. The systems thinking keeps me alive.

Vibe coding is a tool. A good one. But if you're building your career or your company on apps that were prompted into existence without understanding what holds them together, the Vercel breach is your preview of what's coming.

Learn the fundamentals. Not instead of AI. Alongside it.

Frequently Asked Questions

Is vibe coding safe for production applications?

Vibe coding can produce working prototypes quickly, but the research shows significant risks for production deployment. Veracode's 2026 report found that 45% of AI-generated code contains security flaws, and Lightrun's survey found that 43% of AI-generated code changes require manual debugging in production. Vibe coding is safe when combined with code review, security auditing, proper infrastructure knowledge, and deployment pipelines. Without those fundamentals, it's a liability.

What happened in the Vercel breach of April 2026?

Vercel disclosed a security incident on April 19, 2026. A third-party AI tool called Context.ai was compromised, which gave attackers access to a Vercel employee's Google Workspace account. That access cascaded into Vercel's internal systems, exposing customer environment variables including API keys, tokens, and database credentials. The intrusion reportedly began in June 2024, a 22-month dwell time before detection. The threat group ShinyHunters claimed responsibility.

What are the biggest security risks of AI-generated code?

The three main risk layers are: (1) the generated code itself has verified flaw rates approaching 50% across multiple studies, including SQL injection, XSS, and hardcoded credentials; (2) the AI coding tools have their own vulnerabilities (OpenClaw accumulated eight CVEs in three months, with 135,000 exposed instances); and (3) the deployment platforms developers rely on are themselves targets, as the Vercel breach demonstrated.

How can Filipino developers reduce vibe coding risks?

Focus on five fundamentals that vibe coding alone won't teach you: understand your infrastructure (don't treat deployment as a black box), harden every deployment (security headers, input validation, rate limiting), test edge cases and failure modes (AI codes for happy paths only), audit dependencies (every library is an attack surface), and build proper deployment pipelines (CI/CD, staging, rollback). Combine AI-assisted development with these practices: the speed of AI plus the safety of systems thinking.

Tom Tokita is an AI consultant and operations architect based in Manila, Philippines. He co-founded and runs Aether Global Technology Inc., a Salesforce consulting partner. He routes between 3-5 LLMs daily in production, not demos, not POCs.

Best LLM for Each Task: A Practitioner’s Reference Guide

Tom Tokita — Mon, 27 Apr 2026 06:59:43 +0000

Most AI vendors sell you one model at a flat fee. It works, until it doesn't.

Here's the pitch: "Unlimited AI, fixed price!" Under the hood, they've slapped a single budget model on everything: your customer support bot, your code reviews, your data analysis, your document generation. It handles the simple stuff fine. Then you ask it to reason through a complex business decision, and it confidently gives you an answer that's completely wrong.

You go back to the vendor. Their response? "You need to upgrade to the premium model." That's not an upgrade problem. That's a model selection problem, and you just paid to discover it the hard way.

Choosing the best LLM for each task is an architecture decision, not a shopping decision. LLMs are not interchangeable. Each model family is built with different strengths, different architectures, and different cost profiles. Using the wrong one doesn't just waste money, it produces hallucinations, missed context, and confidently wrong outputs that kill trust in AI across your team. (New to LLMs? Start with What Is AI, Really? for the fundamentals.)

Full disclosure: I use Claude as my primary daily driver. Where that might bias my recommendations, I've noted alternatives and linked directly to provider docs so you can verify independently.

This guide is your reference point. Bookmark it. Come back when a vendor tells you their tool "uses AI" and can't tell you which model, or why.

Why One LLM Doesn't Fit Every Task

If you've ever wondered how to decide which LLM to use, the answer starts with understanding what each model was actually built for.

Think of it like hiring. You wouldn't hire a junior analyst to architect your enterprise data platform. You also wouldn't hire a principal architect to sort spreadsheets, not because they can't, but because you're burning $300/hour on a $30 task.

LLMs work the same way:

Frontier models (Claude Opus, GPT-5.4, Gemini 3.1 Pro) are deep thinkers. They reason through multi-step problems, hold massive context windows, and produce nuanced output. They also cost 10-50x more per token than lightweight models.
Mid-tier models (Claude Sonnet, GPT-5.4 mini, Gemini 3 Flash) hit the sweet spot: fast enough for production, smart enough for most tasks, and priced for volume.
Lightweight models (Claude Haiku, GPT-5.4 nano, Gemini 2.5 Flash-Lite, DeepSeek V3.2) are built for speed and cost. They're excellent at structured extraction, classification, simple Q&A, and high-volume processing. Ask them to architect a system or reason through ambiguity? That's where hallucinations start.

The right approach is task routing, matching each task to the model that handles it best. Your total cost drops, your quality goes up, and you stop blaming "AI" for problems that are really model mismatch.
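At its simplest, task routing is a lookup from task type to model tier. The sketch below uses the three tiers described above. The task-type names and the `route` helper are hypothetical, and a real router would map to concrete model IDs and handle fallbacks.

```python
# Tier assignments follow the three tiers above; the task-type names are illustrative.
ROUTES = {
    "reasoning": "frontier",
    "orchestration": "frontier",
    "code_generation": "mid-tier",
    "content_writing": "mid-tier",
    "extraction": "lightweight",
    "classification": "lightweight",
}

def route(task_type: str) -> str:
    """Return the model tier for a task type, failing loudly on unknown types."""
    try:
        return ROUTES[task_type]
    except KeyError:
        raise ValueError(f"no route defined for task type {task_type!r}") from None

print(route("classification"))  # lightweight
print(route("reasoning"))       # frontier
```

Raising on an unknown task type is a deliberate choice in this sketch: silently defaulting to the cheapest tier is how model mismatch creeps back in.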

The Task-Model Matrix: Best LLM for Each Task

This is the reference table. Every recommendation comes from daily production use, cross-referenced with each provider's own documentation.

Task	Best Pick	Runner-Up	Why It Wins	Avoid
Complex reasoning & architecture	Claude Opus 4.6	GPT-5.4	Extended thinking, 1M token context, multi-step logic chains	Lite/Nano models, they hallucinate on multi-step reasoning
Production code generation	Claude Sonnet 4.6	GPT-5.4 mini	Fast + code-native, 64K output, strong instruction-following	Budget models, inconsistent on large codebases
Agent orchestration & tool use	Claude Opus 4.6	Grok 4.20 multi-agent	Reliable function calling, long-context planning, handles complex tool chains	Any "lite" model, they lose track of multi-turn tool sequences
Content writing & copywriting	Claude Sonnet 4.6	GPT-5.4	Natural voice, strong style control, follows nuanced instructions	DeepSeek, Grok fast, flat tone, poor style adaptation
Data extraction & structured output	Gemini 3 Flash	DeepSeek V3.2	Fast JSON mode, schema adherence, cheap at scale ($0.50/MTok in, $3/MTok out)	Frontier models, overkill, 10x+ cost for the same result
High-volume classification	Gemini 2.5 Flash-Lite	GPT-5.4 nano	$0.10/MTok input, pennies per thousand calls, fast enough for real-time	Any full-size model, you're paying for intelligence you don't need
Quick Q&A & chatbots	Gemini 2.5 Flash-Lite	Claude Haiku 4.5	Sub-second latency, low cost, good enough for conversational retrieval	Frontier reasoning models, latency kills UX, cost kills margin
Deep research & analysis	Claude Opus 4.6 (extended thinking)	Gemini 3.1 Pro	Reasons across a 1M-token context, with extended thinking for deliberate analysis	Anything under 128K context, literally can't fit the data
Budget-conscious general use	DeepSeek V3.2	Gemini 2.5 Flash	$0.28/MTok input, $0.42/MTok output, 10x cheaper than most competitors at reasonable quality	Free tiers with rate limits, they throttle when you need them most

Every link above goes to the provider's official docs, no third-party benchmarks, no secondhand claims.

How to Choose the Right LLM: The Task-First Framework

Forget "which AI is best." The right question is: best for what?

Here's the framework I use across every production deployment:

1. Define the task type first. Is it reasoning, generation, extraction, or routing? Each has fundamentally different requirements.

2. Match to a model tier.

Needs to think? → Frontier (Opus, GPT-5.4, Gemini 3.1 Pro)
Needs to produce? → Mid-tier (Sonnet, GPT-5.4 mini, Gemini 3 Flash)
Needs to classify or extract? → Lightweight (Haiku, Nano, Flash-Lite)

3. Check the context window. If your task involves processing documents, code repositories, or conversation histories longer than 128K tokens, a model with a 128K window (DeepSeek V3.2, for example) is physically incapable of handling it. This isn't a quality issue; the data literally doesn't fit.

4. Calculate the real cost. A $5/MTok model that gets it right on the first try is cheaper than a $0.10/MTok model that needs three retries and human review. Factor in error correction, not just token price.

5. Test with your actual workload. Benchmarks measure synthetic tasks. Your data, your prompts, your edge cases: those are what matter. Run a 100-call sample before committing.
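Step 4 is easy to check with arithmetic. The sketch below compares the two cases from that step using input prices only. The workload figures (20,000 tokens per call, $2.00 per human review) are assumptions chosen for illustration, so substitute your own.

```python
def cost_per_task(price_per_mtok: float, tokens_per_call: int,
                  attempts: int = 1, review_cost: float = 0.0) -> float:
    """Token spend across all attempts plus any human review, in dollars."""
    token_cost = attempts * tokens_per_call / 1_000_000 * price_per_mtok
    return token_cost + review_cost

# Assumed workload: 20,000 tokens per call; one human review costs $2.00.
frontier = cost_per_task(5.00, 20_000)                              # right on the first try
budget = cost_per_task(0.10, 20_000, attempts=4, review_cost=2.00)  # 1 try + 3 retries + review

print(f"frontier: ${frontier:.3f}  budget: ${budget:.3f}")
# frontier: $0.100  budget: $2.008
```

Under these assumptions the tokens are almost irrelevant for the budget model. The human review is what makes it expensive.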

Best LLM for Coding and Development

This is where model selection matters most, because bad code from an AI doesn't just waste tokens, it wastes developer hours debugging AI-generated bugs.

For code generation in production, Claude Sonnet 4.6 is the current leader. It handles multi-file edits, understands project context, and follows coding conventions consistently. At $3/MTok input and $15/MTok output, it's the workhorse: fast enough for iteration, smart enough for production-grade output.

For architectural decisions and complex debugging, Claude Opus 4.6 with extended thinking is the pick. The 1M token context window means it can hold an entire codebase in context. At $5/MTok input, it's expensive for bulk work, but for the tasks where getting it wrong costs days of rework, it's the cheapest option you have.

GPT-5.4 mini is a strong runner-up at $0.75/MTok input, particularly for code reviews, test generation, and structured refactoring where you need speed over depth.

What doesn't work: lightweight models for code. GPT-5.4 nano and Gemini Flash-Lite will generate syntactically valid code that has subtle logic errors, the kind that pass linting but fail in production. The cost savings evaporate when your team spends hours tracking down AI-introduced bugs.

Best LLM for Reasoning and Analysis

If you're asking "which LLM is best for research," the answer depends on what kind of research.

For deep analysis (parsing contracts, evaluating strategy documents, synthesizing research across hundreds of pages), you need extended thinking capabilities and large context windows. Claude Opus 4.6 with extended thinking leads here. It doesn't just retrieve information; it reasons through it, surfacing connections and contradictions that faster models miss.

GPT-5.4 at $2.50/MTok input is competitive for research tasks, especially when you need web grounding via OpenAI's built-in web search.

Gemini 3.1 Pro brings serious context capacity and Google's search integration, making it strong for research that needs real-time information.

For quick fact extraction from structured documents, you don't need any of these. Gemini 2.5 Flash at $0.30/MTok handles it fine. The key insight from context engineering applies here: it's not just about the model, it's about what context you feed it.

ChatGPT vs Claude vs Gemini: Which Is Actually Better?

This is the most common question, and it's the wrong one. "Which is better" assumes one winner across all tasks. There isn't one.

Here's the honest breakdown from production use:

Category	Claude	ChatGPT (GPT-5.4)	Gemini
Code generation	Strongest. Sonnet 4.6 is the daily driver	GPT-5.4 mini is a close second	Gemini 3 Flash is capable but less consistent
Instruction-following	Best in class, follows complex, multi-constraint prompts reliably	Good, occasionally overinterprets	Tends to be verbose, sometimes ignores constraints
Content writing	Natural, adaptable voice	Solid but can lean generic	Tends toward formal/corporate tone
Cost efficiency at scale	Mid-range ($1-5/MTok input)	Premium to mid ($0.20-2.50/MTok input)	Best value. Flash-Lite at $0.10/MTok
Context window	1M tokens (Opus/Sonnet)	Not publicly listed for 5.4	Up to 1M+ (Gemini 3.1 Pro)
Reasoning depth	Opus extended thinking is top-tier	GPT-5.4 is strong, less transparent	Gemini 3.1 Pro competes but less tested
Speed	Haiku is fastest in class	Nano is competitive	Flash-Lite wins on pure throughput
Tool use / agents	Opus leads, reliable multi-tool chains	Improving rapidly	Strong but newer ecosystem

The point isn't that Claude wins everything (it doesn't). It's that each model family has tasks where it's the clear best pick and tasks where it's a waste of money. The vendors who sell you one of these as "the AI solution" are leaving performance and budget on the table.

Best LLM for Orchestration and Multi-Agent Systems

This is where most AI tools being just LLM wrappers becomes a real problem. Agent orchestration, where an AI coordinates multiple tools, APIs, and sub-tasks, requires a model that can:

Maintain context across dozens of tool calls
Decide which tool to use and when
Handle failures and retry logic
Not hallucinate tool parameters

Lightweight models fail catastrophically here. They lose track of the conversation after 3-4 tool calls, start hallucinating function names, and make confident decisions based on context they've already forgotten.
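Whatever model you choose, the orchestration layer can refuse to run a tool call that doesn't match a known tool. A minimal sketch, assuming a hand-written registry of tools and their parameters, follows. The tool names, the parameter specs, and `validate_tool_call` are all hypothetical.

```python
# Hypothetical tool registry: tool name -> required parameter names and types.
TOOLS = {
    "get_ticket": {"ticket_id": str},
    "update_record": {"record_id": str, "fields": dict},
}

def validate_tool_call(name: str, args: dict) -> list[str]:
    """Return a list of problems with a proposed tool call; an empty list means none were found."""
    if name not in TOOLS:
        return [f"unknown tool: {name}"]
    spec = TOOLS[name]
    problems = [f"missing parameter: {p}" for p in spec if p not in args]
    problems += [f"unexpected parameter: {p}" for p in args if p not in spec]
    problems += [
        f"wrong type for {p}: expected {t.__name__}"
        for p, t in spec.items()
        if p in args and not isinstance(args[p], t)
    ]
    return problems

print(validate_tool_call("get_ticket", {"ticket_id": "T-42"}))  # []
print(validate_tool_call("close_ticket", {"id": 42}))           # ['unknown tool: close_ticket']
```

A check like this doesn't make a weak model reliable, but it turns a hallucinated function name into a rejected call instead of a production incident.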

Claude Opus 4.6 is built for this. Anthropic explicitly positions it as "the most intelligent model for building agents." The 1M token context means it can hold the full history of a complex multi-step workflow.

Grok 4.20 multi-agent from xAI is a contender at $2/MTok input with a 2M token context window, the largest available, and explicit multi-agent support.

The production pattern that works: use a frontier model as the orchestrator and lightweight models as workers. The orchestrator plans and routes. The workers execute structured subtasks. Your orchestration layer uses Opus at $5/MTok for 5% of your tokens. Your workers use Flash-Lite at $0.10/MTok for the other 95%. Total cost drops while quality goes up.
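The 5%/95% split works out as a weighted average. The sketch below uses the input prices quoted in this guide and ignores output tokens, which would shift the numbers. The `blended_price` helper is hypothetical.

```python
def blended_price(shares_and_prices: list[tuple[float, float]]) -> float:
    """Weighted average $/MTok given (share_of_tokens, price_per_mtok) pairs."""
    total_share = sum(share for share, _ in shares_and_prices)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("token shares must sum to 1")
    return sum(share * price for share, price in shares_and_prices)

# Input prices from this guide: Opus $5.00, Flash-Lite $0.10, Sonnet $3.00 per MTok.
routed = blended_price([(0.05, 5.00), (0.95, 0.10)])

print(f"routed: ${routed:.3f}/MTok vs. all mid-tier: $3.000/MTok")
# routed: $0.345/MTok vs. all mid-tier: $3.000/MTok
```

On input price alone, the routed setup stays cheaper than an all-Sonnet setup as long as Opus handles less than about 59% of the tokens (5f + 0.1(1 - f) < 3 gives f < 2.9/4.9).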

This is exactly what happens when autonomous agents hit production: the architecture matters more than any single model choice.

The Real Cost of Using the Wrong LLM

Here's the vendor trap in action:

The pitch: "Our AI platform, flat fee, unlimited usage!" Sounds great.
Under the hood: A single budget-tier model running everything: customer support, document analysis, code generation, reporting.
Month 1: Simple tasks work fine. Customer support bot answers FAQs. Document summaries look decent.
Month 2: You ask it to analyze a contract for risk clauses. It misses three critical terms. You ask it to generate an integration spec. It hallucinates an API endpoint that doesn't exist.
Month 3: Trust erodes. Your team starts double-checking every AI output manually, which defeats the purpose.
The call: "You need our premium tier." That's the upsell. The flat fee was the foot in the door.

The fix isn't a more expensive model. It's the right model for each task. A system that routes contract analysis to Opus ($5/MTok) and FAQ responses to Flash-Lite ($0.10/MTok) costs less total than running everything on a mid-tier model, and produces better results at both ends.

How to Audit Your AI Vendor

Five questions to ask before signing, or renewing:

Which LLM powers each feature? If they can't name the model, that's a red flag. If they say "proprietary AI," that's usually a wrapper around someone else's model.
Can I see the model ID in logs or API responses? Transparency matters. If you're paying for GPT-5.4-level intelligence and getting Nano-level output, you should be able to verify.
What happens when a task exceeds the model's capability? Do they route to a more capable model? Or does it just... hallucinate and hope you don't notice?
Is there task routing or is everything on one model? Single-model architectures are the "flat fee" trap. Multi-model architectures with intelligent routing are what production AI actually looks like.
What's the actual per-token cost vs. the flat fee? Do the math. If their flat fee works out to $50/MTok effective cost and the underlying model costs $3/MTok, you're paying roughly 17 times the token cost for a wrapper.
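The math in that last question takes two divisions. In the sketch below, the $500 plan and the 10M tokens of monthly usage are made-up figures that happen to produce the $50/MTok effective cost mentioned above. Both helper functions are hypothetical.

```python
def effective_price_per_mtok(flat_fee: float, tokens_used: int) -> float:
    """Dollars per million tokens actually paid under a flat fee."""
    if tokens_used <= 0:
        raise ValueError("tokens_used must be positive")
    return flat_fee / (tokens_used / 1_000_000)

def markup(effective_price: float, model_price: float) -> float:
    """How many times the underlying model price you are paying."""
    return effective_price / model_price

# Assumed usage: a $500/month plan on which you consume 10M tokens.
effective = effective_price_per_mtok(500, 10_000_000)

print(f"${effective:.2f}/MTok, {markup(effective, 3.00):.1f}x the model price")
# $50.00/MTok, 16.7x the model price
```

The hard part is getting the token count out of the vendor, which is itself an answer to the second question on the list.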

The Manus Problem: When You Can't See the Model

Manus, now owned by Meta, is the poster child for the black-box approach. It's an agent platform that takes your task and runs it. You pay credits. Something happens. You get a result.

What you don't get: any visibility into which model ran your task. Was it a frontier model that reasoned through your request? Or a budget model that pattern-matched and hoped for the best? You have no way to know, no way to verify, and no way to optimize.

For demos and personal experiments, that's fine. For production, where you need to explain why the AI made a specific recommendation, debug when it gets something wrong, or control costs at scale, it's a liability.

This is the extreme version of the vendor trap: you're not just locked into one model. You don't even know which model you're locked into. If your AI vendor can't tell you which model powers each feature, ask yourself what else they can't tell you.

Provider Quick Reference

Anthropic (Claude)

Model	Input/MTok	Output/MTok	Context	Best For
Opus 4.6	$5.00	$25.00	1M	Complex reasoning, agents, architecture
Sonnet 4.6	$3.00	$15.00	1M	Code, content, production workhorse
Haiku 4.5	$1.00	$5.00	200K	Fast classification, simple Q&A, chatbots

Source: Anthropic Model Documentation

OpenAI (GPT)

Model	Input/MTok	Output/MTok	Best For
GPT-5.4	$2.50	$15.00	Professional work, deep reasoning
GPT-5.4 mini	$0.75	$4.50	Code, subagents, mid-tier tasks
GPT-5.4 nano	$0.20	$1.25	High-volume simple tasks

Source: OpenAI API Pricing

Google (Gemini)

Model	Input/MTok	Output/MTok	Best For
Gemini 3.1 Pro	$2.00	$12.00	Complex tasks, long-context research
Gemini 3 Flash	$0.50	$3.00	Data extraction, structured output
Gemini 2.5 Flash-Lite	$0.10	$0.40	Budget classification, high-volume Q&A

Source: Google AI Pricing

xAI (Grok)

Model	Input/MTok	Output/MTok	Context	Best For
Grok 4.20 reasoning	$2.00	$6.00	2M	Advanced reasoning, multi-agent
Grok 4-1-fast	$0.20	$0.50	2M	Quick responses, cost efficiency

Source: xAI Model Documentation

DeepSeek

Model	Input/MTok	Output/MTok	Context	Best For
DeepSeek V3.2 chat	$0.28	$0.42	128K	Budget general use, structured output
DeepSeek V3.2 reasoner	$0.28	$0.42	128K	Budget reasoning with extended thinking

Source: DeepSeek API Pricing

Frequently Asked Questions

How do I decide which LLM to use?

Start with the task, not the model. Define what you need (reasoning, code generation, data extraction, content writing, or orchestration), then match it to the appropriate model tier. Use the Task-Model Matrix above as your starting point, and always test with your actual workload before committing. The "best" model is the one that handles your specific task reliably at a cost you can sustain.

Which AI is best for coding?

For production code generation, Claude Sonnet 4.6 leads: it's fast, code-native, and reliable on multi-file edits at $3/MTok input. For complex architectural decisions and debugging, use Claude Opus 4.6 with extended thinking. GPT-5.4 mini at $0.75/MTok is the best value if you need speed over depth. Avoid lightweight models (Nano, Flash-Lite) for code; they produce syntactically valid code with subtle logic errors that cost more to debug than you saved on tokens.

Which LLM is best for research?

It depends on the depth. For deep analysis across hundreds of pages, use Claude Opus 4.6 with extended thinking and its 1M token context window. For quick fact extraction from structured documents, Gemini 2.5 Flash at $0.30/MTok handles it fine. For research needing real-time web information, use GPT-5.4 with web search or Gemini with Google Search integration.

Is ChatGPT better than Claude or Gemini?

None of them is universally "better." Claude leads on coding and instruction-following. GPT-5.4 is strong on general professional work and has the broadest tool ecosystem. Gemini wins on cost efficiency and context window size. The right answer is using each where it's strongest, which is why single-model AI solutions underperform multi-model architectures. See the full comparison table above.

What is LLM task routing?

Task routing is the practice of directing different AI tasks to different models based on what each model does best. Instead of running everything on one expensive model (or one cheap model that hallucinates on complex tasks), you route reasoning to a frontier model, data extraction to a lightweight model, and code generation to a mid-tier model. Your total cost drops, quality goes up, and you stop overpaying for simple tasks or underpaying for complex ones.

This guide reflects production experience as of March 2026. LLM pricing and capabilities change frequently. I'll update this reference as models evolve. All pricing and capability claims link to official provider documentation.

I'm Tom Tokita. Co-Founder & President of Aether Global Technology Inc., a consulting firm in Manila. I route between 3-5 LLMs daily across production deployments. Have a question about which model fits your use case? Let's talk.