DEV Community: Josselin Guarnelli

What I found scanning 3 AI agent codebases for unguarded tool calls

Josselin Guarnelli — Wed, 03 Jun 2026 07:55:29 +0000

669 functions that can write to a database, delete files, charge a card, spawn a subprocess, or hand control to another agent.

553 of them had no guard of any kind. No input validation, no auth check, no rate limit, no confirmation step. Nothing between the model's decision and the side effect.

That is 83%. None of the 669 were marked confirmed, a state I define below.

I got these numbers by pointing a static analyzer at three open-source TypeScript AI agent codebases and counting. Not a pen test. Not a CVE hunt. An inventory of what each agent can do and which of those capabilities have a control in the code.

This is the methodology, the full table, and — the part I care about most — the false positives I had to eliminate before I trusted any of it.

Why an unguarded tool call is a different problem in an agent

In a normal web app, a human clicks a button. The path to a side effect runs through a form, a validation layer, a confirmation dialog, a session rate limit. The dangerous call is wrapped in UI and middleware that someone designed on purpose.

In an agent, an LLM decides which function to call, with which arguments, how many times. It does not know your business rules. It can loop, hallucinate an argument, or get talked into something by injected text in a tool result.

So the guard cannot live in the UI anymore. There is no UI. The guard has to live in the code, right next to the call.

The interesting question is no longer "is this app secure." It is: for every function the model can reach that does something real, is there a control in the code — and if not, do you know?

Most teams don't. Not because they're careless, but because nobody has an inventory. You cannot review what you cannot see.

What I actually measured

I wrote diplomat-agent-ts, a static scanner built on ts-morph (a wrapper around the TypeScript compiler API). It walks the AST, finds call expressions that match a catalog of side-effect patterns, and checks whether each one has a guard in the same function. Two runtime dependencies, no config file, ~9 seconds on a 7,874-file codebase.

A tool call here is any call that matches one of 40+ patterns across 12 side-effect categories:

payment · database_write · database_delete · http_write · email · messaging · agent_invocation · llm_call · publish · dynamic_code · file_delete · destructive

A guard is an in-file control the scanner can see syntactically: input validation (Zod, Yup, class-validator), a rate limit (a @Throttle decorator, a custom limiter), an auth check, a confirmation step, an idempotency key, a retry bound.

Each call lands in one of three states:

no_checks — a side effect with no guard at all
partial_checks — some coverage, but missing at least one expected control
confirmed — explicitly acknowledged with a // checked:ok annotation

One thing to flag up front, because it matters for reading the numbers: confirmed requires an annotation that is this scanner's own convention. None of the three projects has ever heard of it. So every external codebase shows zero confirmed by construction. That number is not an accusation. It's the floor.

I ran each scan against an unmodified public clone at a pinned commit, so the findings reproduce exactly. Every command is in the repo's MANIFEST.md.

And the framing that governs all of this: it's an inventory, not a score. A high no_checks count is a map of where to look, not a grade.

The numbers

Three codebases, four scopes (I split OpenAI's framework packages from its examples because they behave differently).

Codebase (scope)	Type	TS files	Tool calls	`no_checks`	`partial_checks`
OpenClaw (`src/`)	Application	7,874	419	332 (79%)	87
Mastra (`packages/`)	Framework	2,777	185	162 (88%)	23
OpenAI Agents JS (`packages/`)	Framework	426	33	31 (94%)	2
OpenAI Agents JS (`examples/`)	Examples	302	32	28 (88%)	4
Total	—	11,379	669	553 (83%)	116

confirmed is zero across every scope, for the reason above.
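The percentages and totals in the table can be rechecked with a few lines of arithmetic. This is a small Python sketch that uses only the figures printed above; the helper name no_checks_share is mine, not something from the scanner.

```python
# Rows copied from the table above: (scope, tool_calls, no_checks, partial_checks).
ROWS = [
    ("OpenClaw (src/)", 419, 332, 87),
    ("Mastra (packages/)", 185, 162, 23),
    ("OpenAI Agents JS (packages/)", 33, 31, 2),
    ("OpenAI Agents JS (examples/)", 32, 28, 4),
]


def no_checks_share(tool_calls: int, no_checks: int) -> int:
    """Share of tool calls with no guard, as a whole-number percentage."""
    return round(100 * no_checks / tool_calls)


total_calls = sum(calls for _, calls, _, _ in ROWS)
total_no_checks = sum(no for _, _, no, _ in ROWS)
total_partial = sum(partial for _, _, _, partial in ROWS)

for scope, calls, no, partial in ROWS:
    # With confirmed at zero, every call is either no_checks or partial_checks.
    assert no + partial == calls
    print(f"{scope}: {no_checks_share(calls, no)}% no_checks")

print(f"Total: {total_no_checks}/{total_calls} = "
      f"{no_checks_share(total_calls, total_no_checks)}% no_checks")
```

The per-scope check works only because confirmed is zero everywhere; once a codebase opts in to the annotation, the two columns would no longer add up to the tool-call count.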

The 83% is the headline, but the spread is the more honest story. The leanest, most deliberately built codebase in the set, OpenAI's framework packages, still came out at 94% no_checks. That is not because the OpenAI team is sloppy. It's because guards mostly aren't where a static scanner looks. They live in middleware, in a gateway, in the runtime the framework expects you to wire up. The scanner sees the call site. It does not see the deployment.

Which is exactly the point. The gap between "what the model can reach" and "what has a visible control" is real in every one of these repos. The number just makes it countable.

What the categories reveal

Counting side effects by category across all four scopes (a single call can carry more than one):

Category	Occurrences
`destructive` (subprocess / shell)	486
`file_delete`	214
`publish`	124
`agent_invocation`	120
`http_write`	86
`llm_call`	3
`database_delete`	3
`dynamic_code`	1
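Because one call can carry several category tags, the occurrences above add up to more than the 669 tool calls. A quick Python check, using only the numbers in the table and assuming the categories not listed had no occurrences worth reporting:

```python
from collections import Counter

# Occurrences copied from the category table above.
OCCURRENCES = Counter({
    "destructive": 486,
    "file_delete": 214,
    "publish": 124,
    "agent_invocation": 120,
    "http_write": 86,
    "llm_call": 3,
    "database_delete": 3,
    "dynamic_code": 1,
})
TOTAL_TOOL_CALLS = 669  # from the results table

total_tags = sum(OCCURRENCES.values())
# Tags beyond the first one on each call, if every call carries at least one tag.
extra_tags = total_tags - TOTAL_TOOL_CALLS

print(f"{total_tags} category tags on {TOTAL_TOOL_CALLS} tool calls "
      f"({extra_tags} beyond one per call)")
```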

The shape changes with the type of codebase. The application (OpenClaw) is dominated by destructive and file_delete — it's a tool that runs commands and manages files, so a huge fraction of its "tool calls" are the product, not a bug. The frameworks lean toward publish and agent_invocation — they hand control to other agents and ship artifacts, which is what frameworks do.

I'll say the uncomfortable part myself: destructive is the biggest category and also the one most prone to "well, that's literally the app's job." A shell runner runs shells. Flagging every execSync in it is technically correct and contextually obvious. That's why the output is an inventory you triage, not a verdict you act on blindly.

On the governance side, every finding gets tagged with OWASP Agentic codes. The distribution: ASI-02 (tool misuse, the baseline tag) fires on all 669; ASI-01 (excessive agency — a side effect with no auth check) on 576; ASI-03 (privilege compromise — high-stakes op with no confirmation) on 465. The runtime-only codes (supply chain, misalignment, deception) are deliberately out of scope for static analysis.

The hard part was not finding side effects. It was not over-counting them

Anyone can grep for .delete( and exec(. That gets you a number in five minutes and the number is garbage. The work is in making it not garbage.

The design rule that keeps this honest: patterns are data, the matcher is dumb on purpose. When a real-world false positive shows up, I fix the pattern catalog, never the matching logic. Every fix below is a commit with a regression test, not a tweak to a heuristic.

Four that mattered:

regex.exec() is not a subprocess. The destructive category caught exec in any file that imported child_process, including RegExp.prototype.exec() on inline literals like /^extensions\/([^/]+)\//.exec(path). That is pure string parsing, flagged as a shell spawn. The root cause was in the AST extraction: a regex-literal receiver fell through and produced a bare exec name, indistinguishable from exec(). Adding a RegularExpressionLiteral case dropped OpenClaw by 17 findings and removed six innocent parsing functions from the report.

sandbox contains db. An early database_delete pattern matched objects named db. The string sandbox contains the substring d-b (san-db-ox), so SANDBOX_BACKEND_FACTORIES.delete() got logged as a database deletion. Substring matching on short generic names is fundamentally fragile. Fix: require the canonical receiver name (prisma) or an actual drizzle-orm import.

deploy is a verb that lives inside other words. Matching nameContains: ["deploy"] flagged cancelDeploy, getDeployStatus, listDeployments (query and management operations, not publish side effects) all over Mastra's deployer package. Switching to an exact match on a bare deploy() call removed 39 false positives in one commit. A manual audit of ten sampled findings confirmed that all ten were genuine false positives.
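The sandbox and deploy cases come down to the same thing: substring rules versus exact rules. Here is a minimal Python sketch of the difference. The scanner itself is TypeScript, and the function names substring_match and exact_match are hypothetical, chosen only for illustration; the identifiers being tested are the ones quoted above.

```python
def substring_match(name: str, needles: list[str]) -> bool:
    """Naive rule: fire if any needle appears anywhere in the name (case-insensitive)."""
    lowered = name.lower()
    return any(needle in lowered for needle in needles)


def exact_match(name: str, allowed: set[str]) -> bool:
    """Stricter rule: fire only if the whole name is in the allowed set."""
    return name in allowed


# Receiver names: the object a method is called on.
assert substring_match("SANDBOX_BACKEND_FACTORIES", ["db"])      # false positive
assert not exact_match("SANDBOX_BACKEND_FACTORIES", {"prisma"})  # correctly ignored
assert exact_match("prisma", {"prisma"})

# Function names: only a bare deploy() should count as a publish side effect.
for name in ["cancelDeploy", "getDeployStatus", "listDeployments"]:
    assert substring_match(name, ["deploy"])   # flagged by the naive rule
    assert not exact_match(name, {"deploy"})   # ignored by the exact rule
assert exact_match("deploy", {"deploy"})
```

The exact rule trades recall for precision: a wrapper named deployAll() would no longer match, which is the kind of case the pattern catalog has to cover with an explicit entry.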

client.messages.create() is Anthropic, not Twilio. Same method name, completely different side effect. This is why ambiguous patterns carry an importContains condition: the pattern only fires if the disambiguating package is imported in the file. The ordering of the pattern table encodes priority — payments first, LLM calls before database writes — so client.chat.completions.create() never gets misfiled as a DB write.
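To make the import condition and the priority ordering concrete, here is a rough Python sketch of an ordered pattern table where the first match wins. The catalog entries, the Pattern fields, and the classify function are all hypothetical stand-ins; the real catalog is TypeScript and its exact shape is not shown in this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pattern:
    category: str
    call_suffix: str                        # trailing part of the dotted call name
    import_contains: Optional[str] = None   # package that must be imported in the file


# Hypothetical catalog. Order encodes priority: the first match wins.
CATALOG = [
    Pattern("payment", "charges.create", import_contains="stripe"),
    Pattern("messaging", "messages.create", import_contains="twilio"),
    Pattern("llm_call", "messages.create", import_contains="@anthropic-ai/sdk"),
    Pattern("llm_call", "chat.completions.create"),
    Pattern("database_write", ".create", import_contains="sequelize"),
]


def classify(call: str, file_imports: list[str]) -> Optional[str]:
    """Return the category of the first pattern that matches, or None."""
    for pattern in CATALOG:
        if not call.endswith(pattern.call_suffix):
            continue
        if pattern.import_contains and not any(
            pattern.import_contains in imported for imported in file_imports
        ):
            continue  # ambiguous name, but the disambiguating package is absent
        return pattern.category
    return None


print(classify("client.messages.create", ["@anthropic-ai/sdk"]))          # llm_call
print(classify("client.messages.create", ["twilio"]))                     # messaging
print(classify("client.chat.completions.create", ["openai", "sequelize"]))  # llm_call
print(classify("User.create", ["sequelize"]))                             # database_write
```

The third call is the one the ordering protects: the file imports an ORM and the method ends in .create, but the LLM pattern sits earlier in the table, so the call is never filed as a database write.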

I'd rather report 419 findings I can defend than 471 I have to apologize for. The validation pass on OpenClaw started at a 30% false-positive rate on a sampled audit. Killing that is the actual product.

What this does not tell you

The honest limitations, because a technical reader will find them anyway:

Unguarded is not the same as vulnerable. A flagged call can be completely safe — the guard might live in middleware, a gateway, or a layer the scanner can't see. The output tells you where to look, not what's broken.
It's static only. No runtime detection. If protection is enforced outside the file, the scanner can't know that unless you annotate it.
It's intra-procedural. Guard detection looks at the same function and its immediate decorators. A guard three call-frames away in another file won't be credited. Cross-function analysis is the next milestone, not a current claim.
It needs the import for ORM patterns. Mongoose, Sequelize, and TypeORM use generic method names (.save(), .create()), so those patterns are scoped to files that import the ORM. Re-exported models get missed.
confirmed is zero for external repos by construction. The annotation is this tool's convention. Read the zero as "nobody opted in," not "nobody bothered."

If you need runtime enforcement or semantic intent analysis, this is the wrong tool. It's a scanner. It reads code and counts.

Run it on your own agent

The interesting number is not mine. It's yours.

npm install -D @diplomat-ai/diplomat-agent-ts
npx diplomat-agent-ts scan .        # or ./src, ./packages

It prints a colored report. To get a committable inventory:

npx diplomat-agent-ts scan . --output-registry toolcalls.yaml

toolcalls.yaml is like package-lock.json, but for what your agent can do instead of what it depends on:

tool_calls:
  - function: chargeCustomer
    file: src/payments.ts
    line: 42
    actions:
      - "return stripe.charges.create({ amount, currency, customer })"
    checks: []
    missing:
      - no bounds on amount
      - no rate limit
      - no idempotency key
    owasp: [ASI-01, ASI-02, ASI-03, ASI-06]

Commit it. Diff it in PRs. When the agent gains a new capability, it shows up in review before it ships.
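The review step amounts to a set difference between two versions of the registry. A minimal Python sketch, assuming both versions have already been parsed into lists of dictionaries with the function and file fields shown above; the helper names and the deleteWorkspace entry are hypothetical.

```python
def entry_key(entry: dict) -> tuple[str, str]:
    """Identify a tool call by file and function; line numbers shift too easily."""
    return (entry["file"], entry["function"])


def new_tool_calls(baseline: list[dict], current: list[dict]) -> list[dict]:
    """Entries present in the current registry but absent from the baseline."""
    known = {entry_key(entry) for entry in baseline}
    return [entry for entry in current if entry_key(entry) not in known]


baseline = [
    {"function": "chargeCustomer", "file": "src/payments.ts", "checks": []},
]
current = baseline + [
    # Hypothetical capability added in a pull request.
    {"function": "deleteWorkspace", "file": "src/files.ts", "checks": []},
]

for entry in new_tool_calls(baseline, current):
    print(f"new capability: {entry['function']} in {entry['file']}")
```

Keying on file and function rather than line number keeps an unrelated edit higher up in the file from showing up as a new capability.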

When a call is intentionally unguarded — or protected somewhere the scanner can't see — say so inline, and the next scan moves it to confirmed:

// checked:ok — protected by middleware/approval.ts
export async function chargeCustomer(amount: number, customerId: string) {
  return stripe.charges.create({ amount, currency: "usd", customer: customerId });
}

And to make new unguarded calls fail a build:

- name: Diplomat governance scan
  run: npx -y @diplomat-ai/diplomat-agent-ts scan . --fail-on-unchecked

The scanner is Apache-2.0, two dependencies, TypeScript-only. The benchmark artifacts above reproduce exactly at the pinned commits — every command is in the repo.

Repo and reproducible benchmarks: github.com/Diplomat-ai/diplomat-agent-ts

Run it on whatever you shipped last week. The 83% was three codebases I didn't write. I'm more curious what it says about the ones I did.

We Scanned 16 AI Agent Repos. 76% of Tool Calls Had Zero Guards.

Josselin Guarnelli — Sat, 28 Mar 2026 21:58:45 +0000

We scanned 16 open-source AI agent repositories — both agent frameworks (CrewAI, PraisonAI) and production agent applications (Skyvern, Dify, Khoj, and others) that ship real business logic.

76% of tool calls with real-world side effects had zero protective checks.

No rate limits. No input validation. No confirmation steps. No auth checks.

An important nuance: you'd expect framework code to lack guards — it's template code, and adding guards is the implementor's job. But the same pattern holds in production agent applications with real business logic. Skyvern (browser automation, 595 files): 76% unguarded. Dify (LLM platform, 1000+ files): 75% unguarded. The frameworks aren't the problem — the problem is that nobody adds guards when they build on top of them either.

This means a single prompt injection — or a simple hallucination — could trigger hundreds of unvalidated database writes, unchecked HTTP requests to arbitrary URLs, or file deletions without confirmation.

Here's what we found, how we found it, and how you can audit your own agent code in 60 seconds.

What We Scanned

We analyzed 16 open-source repos in two categories:

Agent applications — repos that ship real business logic: browser automation agents, AI assistants, LLM platforms with tool-calling capabilities. These are the repos where guards should exist because the code runs in production against real databases and APIs.

Agent frameworks — repos like CrewAI and PraisonAI that provide scaffolding for building agents. Framework code is intentionally generic — it exposes tool call patterns without business-specific guards, because that's the implementor's responsibility.

We report findings for both categories, but the story that matters is the application layer: even when developers build on top of frameworks and add their own logic, the guards don't show up.

For each repo, we asked a simple question: which functions can change the real world, and which ones have guards?

A "tool call with side effects" is any function that can:

Write to a database (session.commit(), .save(), .create())
Delete data (session.delete(), os.remove(), shutil.rmtree())
Make HTTP write requests (requests.post(), httpx.put())
Process payments (stripe.Charge.create())
Send emails or messages (smtp.sendmail(), slack_client.chat_postMessage())
Invoke another agent (graph.ainvoke(), agent.execute())
Execute dynamic code (exec(), eval(), importlib.import_module())

A "guard" is any check that protects that call:

Input validation (Field(le=10000), @validator)
Rate limiting (@rate_limit, @throttle)
Auth checks (Depends(), Security() in FastAPI)
Confirmation steps (confirm, approve in function body)
Idempotency (idempotency_key, get_or_create)
Retry bounds (max_retries=, @retry(stop=stop_after_attempt()))

The Results

By repo

Repo	Type	Files	Tool calls	Unguarded	% unguarded
Skyvern	Application	595	452	345	76%
Dify	Platform	1000+	1,009	759	75%
PraisonAI	Framework	—	1,028	911	89%
CrewAI	Framework	—	348	273	78%

Full results for all 16 repos: REALITY_CHECK_RESULTS.md

What we found unguarded

Across all repos, the most common unguarded categories were database writes, database deletes, HTTP write requests, subprocess/exec/eval calls, LLM calls, and email/messaging. The pattern is consistent: the more dangerous the action, the less likely it has guards.

A note on methodology: subprocess/exec/eval calls are a different class of risk — these should generally be eliminated entirely, not guarded. The scanner also prioritizes recall over precision: we'd rather flag a function that might be fine than miss one that isn't. Based on manual review, the false positive rate is roughly 15-20% — mostly from generic .save() calls that turn out to be config or file operations rather than database writes.

Why This Matters For AI Agents Specifically

You might think: "Unguarded function calls exist in every codebase. What makes agents special?"

The difference is who calls these functions.

In a traditional web app, a human user triggers actions through a UI with built-in constraints — forms with validation, buttons with confirmation dialogs, rate limits per session.

In an agent, an LLM decides which functions to call, with what arguments, how many times. The LLM doesn't know your business rules. It doesn't understand that calling refund() 200 times in a loop is catastrophic. And if an attacker crafts a prompt injection, the LLM will happily execute whatever functions it has access to — as many times as it's told.

Without guards in the code, there's nothing between the LLM's decision and the real-world consequence.
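For illustration, this is roughly what a guard sitting between the LLM's decision and the side effect could look like for the refund() case. It is a minimal sketch under stated assumptions: CallBudget, MAX_REFUND_CENTS, and the refund signature are hypothetical, and whether a scanner would credit this exact shape as a guard is not something we claim here.

```python
import time


class CallBudget:
    """Allow at most max_calls within a sliding window of window_seconds."""

    def __init__(self, max_calls: int, window_seconds: float) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._timestamps: list[float] = []

    def check(self) -> None:
        now = time.monotonic()
        # Drop calls that have aged out of the window.
        self._timestamps = [t for t in self._timestamps if now - t < self.window_seconds]
        if len(self._timestamps) >= self.max_calls:
            raise RuntimeError("refund budget exhausted; escalate to a human")
        self._timestamps.append(now)


MAX_REFUND_CENTS = 10_000  # hypothetical business limit
refund_budget = CallBudget(max_calls=3, window_seconds=3600)


def refund(order_id: str, amount_cents: int) -> str:
    """Guarded tool: validate arguments and enforce the budget before the side effect."""
    if not order_id:
        raise ValueError("order_id is required")
    if not 0 < amount_cents <= MAX_REFUND_CENTS:
        raise ValueError(f"amount_cents must be in 1..{MAX_REFUND_CENTS}")
    refund_budget.check()
    # The real payment call would go here; this sketch only returns a receipt string.
    return f"refunded {amount_cents} cents for order {order_id}"
```

With this in place, a model that loops on refund() gets three calls per hour and then a hard error, and a hallucinated amount is rejected before it consumes any of the budget.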

A concrete example from our scan: Khoj, an open-source AI assistant, exposes a function called ai_update_memories that lets the LLM delete and replace user memories. It calls session.delete() followed by session.add() with no confirmation, no rate limit, and no validation on the content. A single adversarial prompt could wipe and replace a user's entire memory store.

How We Built the Scanner

We built diplomat-agent, an AST-based static analyzer for Python. It uses Python's built-in ast module — zero required dependencies. Optional: rich for colored terminal output.

Why AST and not regex?

Regex pattern matching misses many real-world code patterns. A function call like db.session.commit() can appear as a direct call, nested inside a try/except, called through a variable alias, or buried three levels deep in a helper function. An AST captures the code's structure: the scanner works on the parsed syntax tree, not on text patterns.
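A small comparison makes the difference visible. The sketch below is not the scanner's code; it is a minimal illustration using only the standard library, and the sample source and helper names are made up. The regex counts a comment and a string as calls and misses a call split across lines, while the AST walk finds exactly the two real calls. Resolving variable aliases would need additional analysis that this sketch does not attempt.

```python
import ast
import re

SOURCE = '''
def save_order(db, order):
    # db.session.commit() is mentioned here only in a comment
    note = "remember to call db.session.commit() later"
    try:
        db.session.add(order)
        db.session.commit()
    except Exception:
        db.session.rollback()
        raise


def save_all(db, orders):
    for order in orders:
        db.session.add(order)
    (db.session
        .commit())
'''


def dotted_name(node: ast.expr) -> str:
    """Rebuild 'a.b.c' from nested Attribute/Name nodes; '' if not a plain dotted name."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return ""


def commit_calls(source: str) -> list[int]:
    """Sorted line numbers of real calls whose dotted name ends in 'session.commit'."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and dotted_name(node.func).endswith("session.commit")
    )


regex_hits = len(re.findall(r"session\.commit\(", SOURCE))
ast_hits = commit_calls(SOURCE)
print(f"regex matches: {regex_hits}, real calls found by the AST walk: {len(ast_hits)}")
```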

The scanner walks every Python file in your project (excluding tests, migrations, venv, examples, and other non-production directories), visits every function definition, and for each function:

Finds all calls that match side-effect patterns (DB writes, HTTP calls, deletes, payments, etc.)
Finds all guards in scope (validators, rate limits, auth checks, confirmation steps)
Outputs a verdict: UNGUARDED, PARTIALLY_GUARDED, GUARDED, or LOW_RISK (for read-only functions)
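The three steps above can be sketched in a few dozen lines. This is a heavily reduced illustration, not the scanner's implementation: the catalogs are hypothetical, only two kinds of guard are recognized, and the rule that one of the two kinds means PARTIALLY_GUARDED is an assumption made for the example.

```python
import ast

# Hypothetical, heavily reduced catalogs for illustration only.
SIDE_EFFECT_SUFFIXES = {
    "session.commit": "database_write",
    "session.delete": "database_delete",
    "requests.post": "http_write",
    "os.remove": "file_delete",
}
RATE_LIMIT_DECORATORS = {"rate_limit", "throttle"}
CONFIRM_WORDS = {"confirm", "approve"}


def dotted_name(node: ast.expr) -> str:
    """Rebuild 'a.b.c' from nested Attribute/Name nodes; '' if not a plain dotted name."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return ""


def decorator_name(node: ast.expr) -> str:
    """Last name segment of a decorator, with or without call parentheses."""
    target = node.func if isinstance(node, ast.Call) else node
    return dotted_name(target).split(".")[-1]


def scan(source: str) -> dict[str, str]:
    """Map each function name in the source to a verdict."""
    verdicts = {}
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        calls = [dotted_name(n.func) for n in ast.walk(func) if isinstance(n, ast.Call)]
        # Step 1: side effects reachable from this function body.
        effects = {
            category
            for name in calls
            for suffix, category in SIDE_EFFECT_SUFFIXES.items()
            if name.endswith(suffix)
        }
        # Step 2: guards visible in the same function or its decorators.
        kinds = set()
        if any(decorator_name(d) in RATE_LIMIT_DECORATORS for d in func.decorator_list):
            kinds.add("rate_limit")
        if any(name.split(".")[-1] in CONFIRM_WORDS for name in calls):
            kinds.add("confirmation")
        # Step 3: verdict (the partial/full split is an assumption of this sketch).
        if not effects:
            verdicts[func.name] = "LOW_RISK"
        elif not kinds:
            verdicts[func.name] = "UNGUARDED"
        elif len(kinds) < 2:
            verdicts[func.name] = "PARTIALLY_GUARDED"
        else:
            verdicts[func.name] = "GUARDED"
    return verdicts


SAMPLE = '''
def read_order(session, order_id):
    return session.get(order_id)

def delete_order(session, order):
    session.delete(order)
    session.commit()

@rate_limit(calls=5)
def notify(payload):
    requests.post("https://example.com/hook", json=payload)

@throttle
def purge(path, ui):
    if ui.confirm("delete?"):
        os.remove(path)
'''

for name, verdict in scan(SAMPLE).items():
    print(f"{name}: {verdict}")
```

Because the source is only parsed and never executed, names such as rate_limit or requests do not need to be importable for the scan to run.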

The default output is a terminal report showing every finding with its verdict. You can also generate a toolcalls.yaml registry (with --format registry) — a committable inventory of every function with side effects, the guards present or missing, and actionable hints.

Try It On Your Own Code

pip install diplomat-agent
diplomat-agent .

That's it. Zero config, zero required dependencies. Takes about 2-3 seconds on a 1000-file repo.

The output looks like this:

diplomat-agent — governance scan

Scanned: ./my-agent
Tools with side effects: 12

⚠ send_report(endpoint, payload)
  Rate limit:             NONE
  → Risk: agent could exhaust external API quota with 200 calls
  ⤷ no rate limit · no auth check
  Governance: ❌ UNGUARDED

⚠ send_notification(user_id, message)
  Rate limit:             NONE
  → Risk: agent could send 200 messages — spam risk
  ⤷ no rate limit · no auth check
  Governance: ❌ UNGUARDED

✓ process_order(order_id)
  Write protection:       Input Validation (FULL)
  Rate limit:             Rate Limit (FULL)
  Governance: ✅ GUARDED

────────────────────────────────────────────
RESULT: 8 with no checks · 3 with partial checks · 1 guarded (12 total)

What To Do When You Find Gaps

For each unguarded tool call, you have four options:

Fix it — add validation, rate limiting, or confirmation in code. The next scan picks it up automatically.

Acknowledge it — if the function is intentionally unguarded or protected elsewhere, add # checked:ok as a comment:

def send_alert(message):  # checked:ok — protected by API gateway
    requests.post(ALERT_URL, json={"msg": message})

Add it to CI — block PRs that introduce new unguarded tool calls:

diplomat-agent . --fail-on-unchecked

If you commit toolcalls.yaml as a baseline, only new findings block — no noise on legacy code.

Review the inventory — the toolcalls.yaml file is meant to be committed and reviewed in PRs. When someone adds a new function that can delete data, it shows up in the diff.
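One detail behind the "Acknowledge it" option: Python's ast module discards comments, so an annotation like # checked:ok has to be read from the token stream instead. The sketch below shows one way to do that with the standard library's tokenize module. It is an assumption about how such a lookup could work, not a description of diplomat-agent's internals, and it only finds the comments; mapping each one back to a function is left out.

```python
import io
import tokenize

MARKER = "checked:ok"


def acknowledged_lines(source: str) -> dict[int, str]:
    """Map line number to the stated reason for every '# checked:ok' comment."""
    found = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.COMMENT:
            continue
        text = tok.string.lstrip("#").strip()
        if text.startswith(MARKER):
            # Strip the separator (spaces, hyphen, or dash U+2014) before the reason.
            found[tok.start[0]] = text[len(MARKER):].strip(" -\u2014")
    return found


SAMPLE = '''def send_alert(message):  # checked:ok \u2014 protected by API gateway
    requests.post(ALERT_URL, json={"msg": message})

def send_report(endpoint, payload):
    requests.post(endpoint, json=payload)  # TODO: add a rate limit
'''

for line, reason in acknowledged_lines(SAMPLE).items():
    print(f"line {line}: acknowledged ({reason})")
```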

The Bigger Picture

We're building AI agents that can modify databases, send money, delete files, and call external APIs — and we're giving them zero guardrails in code.

The OWASP Top 10 for Agentic Applications (released December 2025) explicitly recommends maintaining a complete inventory of all agentic components, their permissions, and their capabilities. The EU AI Act (enforceable August 2026) requires documenting system capabilities and human oversight measures for high-risk AI systems.

toolcalls.yaml is a step toward that. It's not a complete governance solution — it's a starting point. You can't govern what you can't see.