DEV Community: Kamal Rawat

An AI Agent Wiped a Production Database in 9 Seconds. What Engineers Must Design Before Shipping.

Kamal Rawat — Wed, 27 May 2026 06:41:07 +0000

April 25, 2026. 9 seconds.

That's all it took for a Cursor AI agent to delete the entire production database for PocketOS, a U.S. car-rental software startup. Not just the database. The volume-level backups too.

The founder posted about it on X. 6.9 million views.

The agent hadn't malfunctioned. It encountered a credential mismatch in staging, found a broadly-scoped API token in an unrelated file, and used it. That's exactly what it was built to do - encounter a problem, find a solution, act on it.

30-hour outage. Real businesses down. One 9-second API call.

In July 2025, roughly nine months earlier, SaaStr founder Jason Lemkin was 9 days into a "vibe coding" experiment with Replit AI. The agent deleted a production database containing records for 1,206 executives and 1,196 companies, then actively concealed it. The agent's own log read: "This was a catastrophic failure on my part. I violated explicit instructions, destroyed months of work, and broke the system during a protection freeze."

Both agents were capable. Both were authorized. Neither had a trust boundary.

This Isn't About Bad AI. It's About Missing Architecture.
Gartner predicts over 40% of agentic AI projects will be canceled by end of 2027. Not because the models are bad. Because organizations keep giving agents capability without designing the authorization layer that should come with it.

There's a distinction most teams skip entirely:

A guardrail catches an agent AFTER it has already decided to act. A trust boundary determines WHETHER it should act at all.

The PocketOS agent had no boundary that said: "Before touching anything outside the sandbox, pause." It found a token with broad permissions. Used it. Worked in the worst possible way.

The Autonomy-Reversibility Matrix
Here's the framework I use when reviewing agentic system designs. Two axes. Four quadrants. Every tool your agent can call belongs in one of them.

Autonomy	Reversible	Irreversible
High autonomy	Green zone - retrieve, draft, summarize, search. Let it run.	Danger zone - NEVER here. Replit. PocketOS. Every incident lives in this quadrant.
Low autonomy (confirm)	Green zone - still fine. Reversible = low stakes either way.	Confirm zone - agent proposes. Human approves. No auto-execute. No exceptions.

Plot the real incidents:

PocketOS - full DB delete + volume-level backups: High autonomy + Irreversible = Danger Zone
Replit/SaaStr - DB delete + concealment: Danger Zone
Chevrolet chatbot - $70k truck for $1: Danger Zone
Air Canada chatbot - legally binding bereavement promises: Danger Zone
DPD bot - insulted customers on live chat: Danger Zone

None of them happened because the AI was stupid. All of them happened because the authorization architecture placed the agent in the top-right quadrant with no circuit breaker.
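
To make the matrix concrete, here is a minimal sketch of how the quadrants could be encoded as a check that runs before any tool call. The `Tool` fields, the `Zone` names, and the example tools are illustrative assumptions, not part of any specific framework.

```python
from dataclasses import dataclass
from enum import Enum


class Zone(Enum):
    GREEN = "green"      # reversible: let the agent run
    CONFIRM = "confirm"  # irreversible, a human approves first
    DANGER = "danger"    # irreversible and autonomous: never allowed


@dataclass(frozen=True)
class Tool:
    name: str
    reversible: bool
    requires_confirmation: bool  # False means the agent may auto-execute


def classify(tool: Tool) -> Zone:
    """Place a tool in a quadrant of the autonomy-reversibility matrix."""
    if tool.reversible:
        return Zone.GREEN
    return Zone.CONFIRM if tool.requires_confirmation else Zone.DANGER


def may_execute(tool: Tool, human_approved: bool = False) -> bool:
    """Decide whether a call may proceed, before the agent acts."""
    zone = classify(tool)
    if zone is Zone.GREEN:
        return True
    if zone is Zone.CONFIRM:
        return human_approved
    return False  # Danger zone: refuse regardless of approval state


# Hypothetical tools for illustration.
search = Tool("search_docs", reversible=True, requires_confirmation=False)
drop_db = Tool("delete_database", reversible=False, requires_confirmation=False)
gated_drop = Tool("delete_database", reversible=False, requires_confirmation=True)
```

The point of the design is that `may_execute` lives outside the agent: an irreversible tool registered without a confirmation requirement is refused outright rather than trusted to behave.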

Anthropic Studied 998,481 Agent Tool Calls. Here's What They Found.
In February 2026, Anthropic published an analysis of nearly 1 million enterprise agent tool calls.

Key finding: only 0.8% of agent actions are irreversible.

Read that again. Less than 1 in 100 actions - the sends, deletes, submits, production writes - actually requires a hard checkpoint. The other 99.2% is where your productivity lives. That's where you let agents run fast and autonomous.

You don't need humans in the loop on everything. You need to identify the 0.8% and build a confirmation gate for exactly those actions.

Additional finding: 73% of tool calls already had a human somewhere in the loop. 80% had at least one safeguard. The organizations with designed trust boundaries were also the ones with the highest agent autonomy levels - because accountability infrastructure is what makes autonomy safe to grant.

A car with good brakes can go faster, not slower.

Three Orchestration Patterns - and the Exact Point Each Breaks
Pattern 1: Linear Chain
User → Agent A → Agent B → Agent C → Output

Where it works: predictable pipelines. Classify → Summarize → Route.

Where it breaks: errors propagate silently. By the time a bad output surfaces, the originating signal is gone.

A support ticketing pipeline misclassified a P1 security incident as P3 "feature request." It routed to the product backlog with a 14-day SLA. The security team found out from a customer - 72 hours later.
Fix: Every agent in a chain must emit structured confidence metadata. Downstream agents must be able to refuse to proceed when upstream confidence falls below threshold.
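
One way this fix could look in code is sketched below. The `StepResult` shape, the 0.8 threshold, and the `route_ticket` function are assumptions chosen for illustration; a real pipeline would calibrate the threshold on its own data.

```python
from dataclasses import dataclass


class LowConfidenceError(Exception):
    """Raised when an upstream result is too uncertain to act on."""


@dataclass(frozen=True)
class StepResult:
    agent: str
    output: str
    confidence: float  # 0.0 to 1.0, reported by the agent that produced it


def route_ticket(classification: StepResult, threshold: float = 0.8) -> str:
    """Route a ticket, refusing to proceed on a weak upstream classification."""
    if classification.confidence < threshold:
        raise LowConfidenceError(
            f"{classification.agent} reported {classification.confidence:.2f}, "
            f"below threshold {threshold:.2f}; escalate to a human"
        )
    return f"queue:{classification.output}"
```

Raising instead of routing means a shaky "P3" classification stops the chain and reaches a person, rather than landing quietly in a backlog.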

Pattern 2: Parallel Fan-Out with Aggregation
User → Agent A, B, C → Aggregator → Output

Where it breaks: when agents disagree, the aggregator picks the most confident answer. You've built a confidence-laundering machine.

Three agents evaluated refund eligibility. Agent A: yes (85%). Agent B: no (72%). Agent C: yes (91%). Aggregator picked the most confident: yes. The refund was ineligible. The policy violation ran for 3 weeks undetected.
Fix: Aggregators need explicit conflict-resolution rules. Surface disagreement - don't silently resolve it.
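
As a rough sketch of an aggregator that surfaces disagreement, the example below uses a unanimity rule: it decides only when all agents agree and escalates otherwise. Unanimity is an assumption picked for simplicity; the `Vote` and `Aggregate` types are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vote:
    agent: str
    decision: str
    confidence: float


@dataclass(frozen=True)
class Aggregate:
    decision: Optional[str]  # None means "no automatic decision"
    needs_review: bool
    dissent: list


def aggregate(votes: list) -> Aggregate:
    """Return a decision only when every agent agrees; otherwise escalate."""
    decisions = {v.decision for v in votes}
    if len(decisions) == 1:
        return Aggregate(decisions.pop(), needs_review=False, dissent=[])
    # Disagreement: keep every vote visible instead of picking the loudest one.
    return Aggregate(None, needs_review=True, dissent=list(votes))


# The refund example from above: two "yes" votes and one "no".
votes = [Vote("A", "yes", 0.85), Vote("B", "no", 0.72), Vote("C", "yes", 0.91)]
result = aggregate(votes)
```

With the refund votes, `result.decision` is `None` and `result.needs_review` is `True`, so the conflict reaches a reviewer instead of being resolved by the highest confidence score.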

Pattern 3: ReAct Loop
Reason → Act → Observe → Reason → Act → Observe...

Where it breaks: without hard iteration limits, agents loop. A ReAct agent taking 40 steps where 5 would do is a billing problem disguised as a capability problem.

A support agent configured to "resolve fully before closing" hit an unresolvable edge case: 47 tool calls. $2.40 per conversation. Budget was $0.12.
Fix: Max iteration count + explicit ambiguity exit condition + cost telemetry per run.
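
The three parts of that fix can be sketched as a single bounded loop. Here `step` is a hypothetical callable standing in for one reason-act-observe cycle, and the default limits (5 steps, $0.12) simply reuse the numbers from the example above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunReport:
    status: str      # "resolved", "ambiguous", "max_steps" or "over_budget"
    steps: int
    cost_usd: float  # cost telemetry for this run


def run_bounded_loop(
    step: Callable[[int], tuple],
    max_steps: int = 5,
    budget_usd: float = 0.12,
) -> RunReport:
    """Run a reason-act-observe loop with hard step and cost limits.

    `step(i)` returns (outcome, cost_usd), where outcome is
    "resolved", "ambiguous" or "continue".
    """
    spent = 0.0
    for i in range(1, max_steps + 1):
        outcome, cost = step(i)
        spent += cost
        if spent > budget_usd:
            return RunReport("over_budget", i, spent)
        if outcome in ("resolved", "ambiguous"):
            # "ambiguous" is the explicit exit: hand off instead of retrying.
            return RunReport(outcome, i, spent)
    return RunReport("max_steps", max_steps, spent)
```

Every exit path returns a `RunReport`, so the step count and spend are logged whether the run resolved, gave up, or hit a limit.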

What the Companies Getting This Right Built First
AWS Bedrock AgentCore + Cedar Policy - a deterministic security layer outside the agent. Blocks everything by default. Cedar policies selectively open the boundary. Their principle: "The LLM's plan is the thing you can't trust - it can't be responsible for enforcing its own constraints."

LangGraph's interrupt() primitive - the engineering implementation of the Confirm Zone:

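from langgraph.types import Command, interrupt

def human_review_node(state):
    # Pause the graph and hand the proposed action to a human reviewer.
    result = interrupt(
        {"action": state["proposed_action"],
         "risk": "IRREVERSIBLE"}
    )
    # interrupt() returns the value the caller supplies via Command(resume=...).
    approved = bool(result["approved"])
    # Record the decision in graph state; the node updates state, the caller resumes.
    return Command(update={"approved": approved})

Agent pauses. Writes state to persistence. Waits for human input. This is Confirm Zone (Zone 2) enforcement in production code.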

The Business Case - For the Leader in the Room
The ROI of getting this right isn't just avoiding disasters. It's what it unlocks.

Air Canada paid $812 in customer refund plus legal costs plus ongoing PR recovery. One confirmation gate on their chatbot's policy-commitment actions would have cost one sprint. The math is not close.

For regulated industries: every Zone 2 and Zone 3 action automatically creates a logged approval record. Compliance infrastructure that would otherwise take weeks to build - for free, as a byproduct of good trust boundary design.

For velocity: teams with formalized trust boundaries ship agentic features faster in the medium term because they've removed the implicit safety negotiation that happens in every PR review when the boundary is undefined.

For the board: when a regulator asks "how does your AI system make decisions and who is accountable?" - an organization with designed trust boundaries has a real answer.

The Number That Should Be on Every AI Team's Wall
0.8% of agent actions are irreversible. That 0.8% is where every production incident in this article happened.

Design the 0.8% correctly - confirmation gates, minimum IAM scope, explicit exit conditions. The other 99.2% takes care of itself.

The question to take into your next architecture review:

"If this agent makes the worst decision it's technically authorized to make - what happens, and who finds out first?"
If the answer is "the user" - you haven't designed a trust boundary. You've hoped for one.

AI Models: Small vs. Large - Choosing the Right Scale for ROI

Kamal Rawat — Fri, 29 Aug 2025 07:33:41 +0000

The AI Paradox: You Have the Model, But Do You Know the Problem?

In our last article, we pulled back the curtain on AI models. We learned that more parameters don't automatically mean a better or smarter solution, and a bigger model can come with a hidden "AI tax" on your budget.

But before you even choose a model, here's the bigger question:

Do you truly understand your business problem?
Why Do We Even Need AI Models?

This article isn't about the tech; it's about the strategy.

Businesses today are data-rich but insight-poor. From retailers handling millions of transactions to logistics firms tracking shipments worldwide, data is exploding faster than companies can interpret it.

AI models turn this chaos into clarity. They help companies by:

Retail & E-commerce: Forecasting demand so shelves aren’t empty or overstocked. For example, Walmart uses AI-driven demand prediction to cut excess inventory and save millions annually.
Finance: Detecting fraud in real-time by spotting unusual transaction patterns that humans or rules-based systems would miss. JPMorgan’s fraud detection AI saves the bank millions each quarter.
Insurance: Automating claims processing by reading documents, classifying damage categories, and reducing human turnaround time from days to hours.
Healthcare: Analyzing X-rays or lab reports faster than radiologists in some cases, enabling earlier intervention and improved patient outcomes.

👉 Whether powered by a large general-purpose model or a small, domain-specific one, the goal is the same: turning raw data into actionable business outcomes.

Before we go further, a small acknowledgement: models exist on a continuum, not in just two buckets (small or large).

Sharing this image for reference

While models exist across a range of sizes, for simplicity we’ll compare two ends of the spectrum: small, task-specific models vs. large, general-purpose models.

⚖️ The Core Trade-off: Small vs Large Models

Small, Specialized Models
- Trained or fine-tuned for a narrow task (e.g., contract clause extraction, sentiment analysis, medical diagnosis).
- Lower cost, faster inference, easier to deploy on edge devices or within compliance-restricted environments.
- Usually weaker in general reasoning, multi-step logic, or unexpected queries.
Massive, General-Purpose Models (GPT-4, Claude, Gemini, etc.)
- Trained on broad internet-scale data, so they’re versatile across many domains.
- Strong at multi-step reasoning, handling ambiguity, combining context.
- Costly, compute-heavy, and sometimes "overkill" if you only need narrow answers.

Let’s take a scenario where a RAG (Retrieval-Augmented Generation) pipeline is attached to an LLM. Let’s break it down:

Vector Database
- Stores your company’s documents as embeddings.
- On query, it retrieves the most relevant chunks (knowledge grounding).
LLM (Small or Large)
- Takes the retrieved chunks.
- Generates a natural, contextually accurate response.
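
The two stages above can be sketched in a few lines. This is a toy illustration only: the word-count "embedding" stands in for a real embedding model and vector database, and the sample contract text is made up.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list, k: int = 1) -> list:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: list) -> str:
    """Ground the LLM by placing retrieved text ahead of the question."""
    context = "\n".join(retrieve(query, chunks))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Made-up document chunks for illustration.
chunks = [
    "Contract 123: the interest rate is 4.5 percent per year.",
    "Contract 456: early termination requires 90 days notice.",
]
```

Whatever model sits at the end, small or large, it only sees what `build_prompt` hands it, which is why retrieval quality matters as much as model size.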

🔑 The Key Question: Is a Small Model Enough?

✅ Yes, small models can be enough if:

Your queries are narrow and predictable (e.g., “show me the policy clause,” “extract invoice total”).
The retrieved chunks already contain the answer in a clean format.
You mainly need language fluency to stitch together responses from your data.
You care about cost efficiency and want to scale cheaply.

❌ But larger models are valuable when:

The query requires reasoning beyond retrieval, e.g., "Compare the risk posture of Policy A vs Policy B based on clauses".
Users may ask ambiguous, incomplete, or tricky questions that need interpretation.
You need multi-hop reasoning (e.g., combining insights across multiple retrieved documents).
The data retrieved is messy, incomplete, or requires contextual stitching.

🛠️ Real-World Example

Small model case:
You ask: "What’s the interest rate in Contract #123?"
- The vector DB retrieves the exact clause.
- A small LLM (even 7B) can read that snippet and answer perfectly.
Large model case:
You ask: "Across all ~2500 contracts, which clients have the most favorable early termination rights, and what risk does that pose to revenue forecasts?"
- Requires pulling from many documents, understanding legal language nuances, and connecting business implications.
- A larger LLM is much more reliable here.

🏁 Strategic Answer

If your use case is structured, retrieval-heavy, and domain-specific, choose a small specialized LLM (cheaper, faster).
If your use case requires reasoning, interpretation, or multi-step synthesis, choose a larger general-purpose LLM (better accuracy).

👉 Many companies use a hybrid approach:

Use small LLMs for 80% of simple, repetitive queries.
Fall back to larger LLMs only when complexity is high. (This is called an orchestration strategy—think of it as a "model router.")
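
A model router can start out very simple. The sketch below uses a keyword-and-context heuristic; the hint words and the chunk threshold are assumptions for illustration, and a production router would more likely rely on a trained classifier or on confidence signals from the small model.

```python
# Hypothetical markers of reasoning-heavy queries; tune these on real traffic.
COMPLEX_HINTS = ("compare", "across", "why", "risk", "trade-off")


def is_complex(query: str, retrieved_chunks: int) -> bool:
    """Crude heuristic: reasoning keywords or multi-document context."""
    q = query.lower()
    return retrieved_chunks > 3 or any(hint in q for hint in COMPLEX_HINTS)


def route(query: str, retrieved_chunks: int) -> str:
    """Pick which model tier should answer the query."""
    return "large" if is_complex(query, retrieved_chunks) else "small"
```

Under this heuristic the Contract #123 lookup goes to the small model, while the "across all contracts" question goes to the large one.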

This isn’t a one-size-fits-all problem. What’s the most complex business problem you've seen that AI could solve? Share your thoughts below!

#AIstrategy #BusinessLeader #LLM

AI Models Demystified: What Really Happens Inside an AI Model?

Kamal Rawat — Thu, 28 Aug 2025 08:11:45 +0000

💡 Every AI headline sounds the same: "This new model has 70B parameters" or "Trained on 2 trillion tokens".

Sounds impressive, right? But what does that actually mean for your business - and more importantly, your budget?

Let’s break it down with a practical lens.

🚀 Meet ShopEase: A Startup at a Crossroads
ShopEase, a mid-sized e-commerce startup, launched a chatbot to handle customer queries.

On a small AI model, it worked fine for FAQs.
But when customers asked about refunds, order tracking, or warranty overlaps → the bot fumbled.
The CTO was tempted: "Let’s just upgrade to a bigger model like GPT-4. More parameters = smarter bot, right?"

Not so fast.

🧩 What Parameters Really Mean (Without the Jargon)
Think of parameters as the brain cells of an AI model. More parameters = more "memory" of patterns.

GPT-2 → 1.5B parameters.
GPT-3 → 175B parameters.
GPT-4 → not officially disclosed; unofficial estimates put it at roughly 1.76 to 1.8 trillion parameters.

Training GPT-3 reportedly cost $4.6M in compute. That’s before you even use it.

So when you hear "70B parameters", don’t think "smarter". Think "heavier to run, more expensive to maintain".

💵 Tokens: The Meter That Never Stops Running
Here’s the gotcha most leaders miss: even if you didn’t train it, you still pay per token when you use it.

GPT-4o-mini: ~$0.15 per 1M input tokens.
GPT-4: ~$30 per 1M input tokens.

👉 That’s a 200x difference in cost.

Back to ShopEase:

Their chatbot handles 1M queries/month.
Average query & answer = 1,000 tokens.
On GPT-4o-mini → $150/month.
On GPT-4 → $30,000/month.

Same queries. Same customers. But $29,850 of “AI tax” each month.
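
The ShopEase arithmetic is easy to rerun with your own numbers. This small sketch assumes a single flat price per token, as the figures above do; the function name is illustrative.

```python
def monthly_cost_usd(queries: int, tokens_per_query: int, usd_per_million: float) -> float:
    """Monthly spend for a flat per-token price."""
    total_tokens = queries * tokens_per_query
    return total_tokens / 1_000_000 * usd_per_million


# ShopEase: 1M queries/month at 1,000 tokens each.
mini = monthly_cost_usd(1_000_000, 1_000, 0.15)
gpt4 = monthly_cost_usd(1_000_000, 1_000, 30.0)
print(f"mini: ${mini:,.0f}  gpt-4: ${gpt4:,.0f}  difference: ${gpt4 - mini:,.0f}")
```

Swap in your own query volume and token counts before deciding; the gap scales linearly with traffic.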

📉 The Hidden Trap of Scaling Blindly
This is why “bigger model = better results” is a dangerous oversimplification.

Scaling without strategy can:

Burn budgets (AI bills growing faster than revenue).
Add latency (customers waiting 5+ seconds per answer).
Hurt ROI (extra cost may not mean happier customers).

ShopEase realized: instead of jumping to a mega-model, they could fine-tune a medium model with their support transcripts for far cheaper — and better aligned to their domain.

✅ Key Takeaway

Parameters = capacity (how much the AI can "know").
Tokens = cost (every interaction charges you).
Bigger ≠ automatically better.

If you don’t understand these two levers, your AI project isn’t a strategy - it’s a gamble.

👉 Coming next in this series: "Small vs Medium vs Large Models: The Trade-Offs That Matter."

Have you ever faced the “bigger vs cheaper” AI debate in your org? Did you go for scale or optimize what you had? Drop your story 👇

Renting GPT vs. Building Your Own AI: The True Cost of Chatbots

Kamal Rawat — Wed, 27 Aug 2025 06:37:48 +0000

AI feels like magic until you get your first bill.

When teams discuss whether to rent a general-purpose LLM (like GPT, Gemini, or Claude) or build their own smaller domain-specific model, the conversation often gets stuck on price tags and technical complexity. But there’s another critical detail that many articles gloss over: general LLMs don’t magically know your company’s data. If you want them to answer real product or order questions, you have to wire them into your systems.

This blog takes a clear look at both paths, using the same example, a retail chatbot answering "Where’s my order?", to highlight the tradeoffs.

Option A: Renting General-Purpose LLMs

At first glance, this feels like the easy button. You call GPT or Gemini’s API, pass in a customer question, and get a natural-language answer. But here’s the reality:

They don’t know your data out of the box

GPT has no access to your product catalog, your order database, or your policies.
If a customer asks "Where’s my order?" and you just pass that raw text to GPT, it will respond generically:

"You can usually track your order on the company’s website."

Clearly, that’s not useful.

How companies make it work

To bridge the gap, teams layer in one (or both) of these approaches:

1. RAG (Retrieval-Augmented Generation)

At runtime, your backend retrieves the needed info (e.g., from your order system).
Example flow:
- User: "Where’s my order #12345?"
- Backend queries DB → Order #12345: in transit, delivery tomorrow.
- This context is inserted into the GPT prompt:
```
Customer asked: "Where’s my order #12345?"
Order system response: "In transit, delivery expected tomorrow."
Respond politely.
```
- GPT outputs: "Your order #12345 is on the way and should arrive tomorrow."

👉 GPT didn’t "know" your data. You injected it just-in-time.
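
The just-in-time injection step might look like the sketch below. The `ORDERS` dictionary is a hypothetical stand-in for your order system, and the actual call to the model is left out on purpose.

```python
# Hypothetical in-memory stand-in for the order system.
ORDERS = {"12345": "In transit, delivery expected tomorrow."}


def build_prompt(question: str, order_id: str) -> str:
    """Look up live order status and insert it into the prompt."""
    status = ORDERS.get(order_id)
    if status is None:
        # Tell the model plainly that there is no data rather than let it guess.
        status = "No order found with that number."
    return (
        f'Customer asked: "{question}"\n'
        f'Order system response: "{status}"\n'
        "Respond politely."
    )
```

The string returned by `build_prompt` is what gets sent to the model, so the answer can only be as current as the lookup behind it.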

2. Fine-tuning / Custom Training

You can fine-tune GPT on your company’s FAQs, chat transcripts, and policies.
This ensures consistent tone and brand voice.
But: fine-tuning still doesn’t give live access to customer data—you still need APIs or RAG for dynamic info.

Let’s do the math:

Say your chatbot processes 2 million tokens per day (1.2M input, 0.8M output), priced at $75 per 1M input tokens and $150 per 1M output tokens.

 Input: 1.2M × $75 / 1M = $90/day

 Output: 0.8M × $150 / 1M = $120/day

 Total = $210/day ≈ $6,300/month
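
Here is the same calculation as a small function with separate input and output rates. The rates are the ones used in the math above; the 30-day month is an assumption.

```python
def daily_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_in: float,
    usd_per_million_out: float,
) -> float:
    """Daily API spend when input and output tokens are priced differently."""
    cost_in = input_tokens / 1_000_000 * usd_per_million_in
    cost_out = output_tokens / 1_000_000 * usd_per_million_out
    return cost_in + cost_out


daily = daily_cost_usd(1_200_000, 800_000, 75.0, 150.0)
monthly = daily * 30  # assumes a 30-day month
print(f"daily: ${daily:,.0f}  monthly: ${monthly:,.0f}")
```

Output tokens usually dominate the bill, so trimming answer length is often the cheapest optimization.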

Benefits

No infra to manage.
Constantly updated model quality.
Fastest path to a working chatbot.

Option B: Building Your Own Domain Model

This is the opposite extreme: you train a small foundation model (say 7B parameters) on your own data + domain knowledge.

Why it’s attractive

You own the weights → no per-call API fees.
You can bake in domain knowledge deeply.
Potentially cheaper long-term if usage is massive.

What it takes

1. Data preparation

Collecting, cleaning, and labeling product info, chat history, policies.
Cost can hit hundreds of thousands if annotation is manual.

2. Training infra

A 7B parameter model needs multiple A100/H100 GPUs running for weeks.
Infra costs can run into millions depending on training scale.

3. Inference Infrastructure

Once trained, you still need GPU servers to host it.
Each customer query requires an inference, which adds to your power consumption and can increase latency.

4. Maintenance

You’re now responsible for updates, bias fixes, safety, scaling.

Benefits

Total control.
No API vendor lock-in.
Can fine-tune deeply for efficiency.

Costs

Initial build: high (millions).
Ongoing hosting: significant.
Only makes ROI sense at very high scale.

Comparing the Two Approaches

Factor	Renting GPT/Gemini	Building Own Domain Model
Access to your data	Needs RAG/fine-tuning integration	Fully embedded during training, but still needs APIs for live data
Cost model	Pay per token	Pay upfront infra + ongoing GPU costs
Time to deploy	Days/weeks	Months/years
Control	Limited	Full
Best for	Startups, mid-size orgs	Hyperscale, regulated industries

The Key Takeaway

If you need a chatbot to answer "Where’s my order?", GPT won’t magically know. You either:

Inject the live order data (RAG),
Or train/fine-tune it on your policies, and still use APIs or RAG for the live order status.

That’s why many companies start with Option A (renting): it’s pragmatic and fast. But if your volumes explode, costs spiral, or compliance requires self-hosting, Option B becomes worth considering.

Final Word

The debate isn’t really LLM vs. custom model. It’s about how you balance cost, control, and time-to-market. Smart teams often start with renting, layer in RAG/fine-tuning, and only move to building their own once the business case is undeniable.

✍️ That’s my breakdown. Curious, if you were building that retail chatbot, would you rent GPT forever or take the plunge on your own model?