DEV Community

Josselin Guarnelli

Posted on • Originally published at github.com

What I found scanning 3 AI agent codebases for unguarded tool calls

669 functions that can write to a database, delete files, charge a card, spawn a subprocess, or hand control to another agent.

553 of them had no guard of any kind. No input validation, no auth check, no rate limit, no confirmation step. Nothing between the model's decision and the side effect.

That is 83%. The other 116 had partial coverage, and none were confirmed.

I got these numbers by pointing a static analyzer at three open-source TypeScript AI agent codebases and counting. Not a pen test. Not a CVE hunt. An inventory of what each agent can do and which of those capabilities have a control in the code.

This is the methodology, the full table, and — the part I care about most — the false positives I had to eliminate before I trusted any of it.

Why an unguarded tool call is a different problem in an agent

In a normal web app, a human clicks a button. The path to a side effect runs through a form, a validation layer, a confirmation dialog, a session rate limit. The dangerous call is wrapped in UI and middleware that someone designed on purpose.

In an agent, an LLM decides which function to call, with which arguments, how many times. It does not know your business rules. It can loop, hallucinate an argument, or get talked into something by injected text in a tool result.

So the guard cannot live in the UI anymore. There is no UI. The guard has to live in the code, right next to the call.
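To make "right next to the call" concrete, here is a minimal sketch of a guarded tool function. Everything in it is hypothetical: `makeDeleteTool`, the injected `remove` callback (standing in for a real file-system call), the POSIX-style path check, and the per-run cap are illustration only, not code from any of the scanned projects.

```typescript
// Hypothetical tool factory: `remove` stands in for a real delete such as fs.rmSync.
function makeDeleteTool(remove: (path: string) => void, maxDeletes = 5) {
  let deletes = 0;

  // This is the function the model is allowed to call.
  return function deleteWorkspaceFile(relativePath: string): void {
    // Input validation: reject absolute paths, empty segments, "." and "..".
    const segments = relativePath.split("/");
    if (segments.some((s) => s === "" || s === "." || s === "..")) {
      throw new Error(`refusing path outside workspace: ${relativePath}`);
    }
    // Bound: a looping model cannot delete without limit.
    if (deletes >= maxDeletes) {
      throw new Error("delete limit reached for this run");
    }
    deletes += 1;
    remove(relativePath);
  };
}
```

Both checks sit in the same function body as the side effect, which is the only place a call-site scanner (or a reviewer reading one file) can see them.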

The interesting question is no longer "is this app secure?" It is: for every function the model can reach that does something real, is there a control in the code, and if not, do you know?

Most teams don't. Not because they're careless, but because nobody has an inventory. You cannot review what you cannot see.

What I actually measured

I wrote diplomat-agent-ts, a static scanner built on ts-morph (a wrapper around the TypeScript compiler API). It walks the AST, finds call expressions that match a catalog of side-effect patterns, and checks whether each one has a guard in the same function. Two runtime dependencies, no config file, ~9 seconds on a 7,874-file codebase.

A tool call here is any call that matches one of 40+ patterns across 12 side-effect categories:

payment · database_write · database_delete · http_write · email · messaging · agent_invocation · llm_call · publish · dynamic_code · file_delete · destructive

A guard is a control the scanner can see syntactically in the same function or its immediate decorators: input validation (Zod, Yup, class-validator), a rate limit (a @Throttle decorator, a custom limiter), an auth check, a confirmation step, an idempotency key, a retry bound.

Each call lands in one of three states:

  • no_checks — a side effect with no guard at all
  • partial_checks — some coverage, but missing at least one expected control
  • confirmed — explicitly acknowledged with a // checked:ok annotation
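The three states can be sketched as a small classification function. This is a toy model, not the scanner's code: the `CallSite` shape, the guard names, and the idea of a per-call `expected` list are assumptions. The article does not say how a call with every expected control but no annotation is reported, so the sketch returns "unspecified" for that case rather than guess.

```typescript
type Guard =
  | "input_validation" | "rate_limit" | "auth_check"
  | "confirmation" | "idempotency_key" | "retry_bound";
type State = "no_checks" | "partial_checks" | "confirmed";

// Hypothetical record for one detected side-effect call.
interface CallSite {
  fn: string;
  guardsFound: Guard[]; // guards seen in the same function
  expected: Guard[];    // controls expected for this kind of side effect (assumed)
  annotatedOk: boolean; // a `// checked:ok` comment was found
}

function classify(site: CallSite): { state: State | "unspecified"; missing: Guard[] } {
  const missing = site.expected.filter((g) => !site.guardsFound.includes(g));
  if (site.annotatedOk) return { state: "confirmed", missing };
  if (site.guardsFound.length === 0) return { state: "no_checks", missing };
  if (missing.length > 0) return { state: "partial_checks", missing };
  // Fully covered but not annotated: not described in the article.
  return { state: "unspecified", missing };
}
```

Note that the annotation wins over everything else in this sketch, which matches the description of confirmed as an explicit acknowledgement rather than a measure of coverage.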

One thing to flag up front, because it matters for reading the numbers: confirmed requires an annotation that is this scanner's own convention. None of the three projects has ever heard of it. So every external codebase shows zero confirmed by construction. That number is not an accusation. It's the floor.

I ran each scan against an unmodified public clone at a pinned commit, so the findings reproduce exactly. Every command is in the repo's MANIFEST.md.

And the framing that governs all of this: it's an inventory, not a score. A high no_checks count is a map of where to look, not a grade.

The numbers

Three codebases, four scopes (I split OpenAI's framework packages from its examples because they behave differently).

| Codebase (scope) | Type | TS files | Tool calls | no_checks | partial_checks |
|---|---|---|---|---|---|
| OpenClaw (src/) | Application | 7,874 | 419 | 332 (79%) | 87 |
| Mastra (packages/) | Framework | 2,777 | 185 | 162 (88%) | 23 |
| OpenAI Agents JS (packages/) | Framework | 426 | 33 | 31 (94%) | 2 |
| OpenAI Agents JS (examples/) | Examples | 302 | 32 | 28 (88%) | 4 |
| Total | | 11,379 | 669 | 553 (83%) | 116 |

confirmed is zero across every scope, for the reason above.
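For anyone who wants to check the arithmetic, the totals and percentages follow directly from the per-scope counts. The snippet below only recomputes the figures reported above; the `ScopeRow` type and `pct` helper are illustrative names.

```typescript
interface ScopeRow {
  scope: string;
  files: number;
  toolCalls: number;
  noChecks: number;
  partial: number;
}

const rows: ScopeRow[] = [
  { scope: "OpenClaw (src/)", files: 7874, toolCalls: 419, noChecks: 332, partial: 87 },
  { scope: "Mastra (packages/)", files: 2777, toolCalls: 185, noChecks: 162, partial: 23 },
  { scope: "OpenAI Agents JS (packages/)", files: 426, toolCalls: 33, noChecks: 31, partial: 2 },
  { scope: "OpenAI Agents JS (examples/)", files: 302, toolCalls: 32, noChecks: 28, partial: 4 },
];

// Share of tool calls with no guard, rounded to a whole percent.
function pct(part: number, whole: number): number {
  return Math.round((part / whole) * 100);
}

const total = rows.reduce(
  (acc, r) => ({
    files: acc.files + r.files,
    toolCalls: acc.toolCalls + r.toolCalls,
    noChecks: acc.noChecks + r.noChecks,
    partial: acc.partial + r.partial,
  }),
  { files: 0, toolCalls: 0, noChecks: 0, partial: 0 },
);

for (const r of rows) {
  console.log(`${r.scope}: ${pct(r.noChecks, r.toolCalls)}% no_checks`);
}
console.log(`Total: ${total.noChecks}/${total.toolCalls} = ${pct(total.noChecks, total.toolCalls)}%`);
```

With confirmed at zero, no_checks plus partial_checks accounts for every tool call in each row.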

The 83% is the headline, but the spread is the more honest story. The leanest, most deliberately built codebase in the set, OpenAI's framework packages, still came out at 94% no_checks. That is not because the OpenAI team is sloppy. It's because guards mostly aren't where a static scanner looks. They live in middleware, in a gateway, in the runtime the framework expects you to wire up. The scanner sees the call site. It does not see the deployment.

Which is exactly the point. The gap between "what the model can reach" and "what has a visible control" is real in every one of these repos. The number just makes it countable.

What the categories reveal

Counting side effects by category across all four scopes (a single call can carry more than one):

| Category | Occurrences |
|---|---|
| destructive (subprocess / shell) | 486 |
| file_delete | 214 |
| publish | 124 |
| agent_invocation | 120 |
| http_write | 86 |
| llm_call | 3 |
| database_delete | 3 |
| dynamic_code | 1 |

The shape changes with the type of codebase. The application (OpenClaw) is dominated by destructive and file_delete — it's a tool that runs commands and manages files, so a huge fraction of its "tool calls" are the product, not a bug. The frameworks lean toward publish and agent_invocation — they hand control to other agents and ship artifacts, which is what frameworks do.

I'll say the uncomfortable part myself: destructive is the biggest category and also the one most prone to "well, that's literally the app's job." A shell runner runs shells. Flagging every execSync in it is technically correct and contextually obvious. That's why the output is an inventory you triage, not a verdict you act on blindly.

On the governance side, every finding gets tagged with OWASP Agentic codes. The distribution: ASI-02 (tool misuse, the baseline tag) fires on all 669; ASI-01 (excessive agency — a side effect with no auth check) on 576; ASI-03 (privilege compromise — high-stakes op with no confirmation) on 465. The runtime-only codes (supply chain, misalignment, deception) are deliberately out of scope for static analysis.
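The tagging rules described above can be sketched as a pure function. This is an assumption-laden toy: the `TaggedFinding` shape and the choice of which categories count as "high-stakes" are mine, and it covers only the three codes named in this paragraph, not every code the scanner can emit.

```typescript
// Hypothetical finding shape; `checks` lists the guards detected at the call site.
interface TaggedFinding {
  category: string;
  checks: string[];
}

// Assumed set of high-stakes categories, for illustration only.
const HIGH_STAKES = new Set(["payment", "database_delete", "file_delete", "destructive"]);

function owaspTags(f: TaggedFinding): string[] {
  const tags: string[] = [];
  // ASI-01: a side effect with no auth check.
  if (!f.checks.includes("auth_check")) tags.push("ASI-01");
  // ASI-02: baseline tag, applied to every finding.
  tags.push("ASI-02");
  // ASI-03: a high-stakes operation with no confirmation step.
  if (HIGH_STAKES.has(f.category) && !f.checks.includes("confirmation")) tags.push("ASI-03");
  return tags;
}
```

Because ASI-02 is unconditional, its count equals the number of findings, which is why it fires on all 669.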

The hard part was not finding side effects. It was not over-counting them

Anyone can grep for .delete( and exec(. That gets you a number in five minutes and the number is garbage. The work is in making it not garbage.

The design rule that keeps this honest: patterns are data, the matcher is dumb on purpose. When a real-world false positive shows up, I fix the pattern catalog, never the matching logic. Every fix below is a commit with a regression test, not a tweak to a heuristic.

Four that mattered:

regex.exec() is not a subprocess. The destructive category caught exec in any file that imported child_process, including RegExp.prototype.exec() on inline literals like /^extensions\/([^/]+)\//.exec(path). That is pure string parsing, flagged as a shell spawn. The root cause was in the AST extraction: a regex-literal receiver fell through and produced a bare exec name, indistinguishable from exec(). Adding a RegularExpressionLiteral case dropped OpenClaw by 17 findings and removed six legitimately innocent parsing functions from the report.
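The fall-through is easier to see on a toy AST. The sketch below does not use ts-morph; the `Expr` union, `qualifiedName`, and the `<regex>` placeholder are hypothetical stand-ins that reproduce the shape of the bug and of the fix, not the scanner's actual extraction code.

```typescript
// Minimal stand-in for the callee expressions the extractor has to name.
type Expr =
  | { kind: "Identifier"; name: string }
  | { kind: "PropertyAccess"; object: Expr; name: string }
  | { kind: "RegularExpressionLiteral"; source: string };

// Builds a dotted name for a callee. With handleRegex = false, a regex
// receiver contributes nothing, so /re/.exec collapses to a bare "exec".
function qualifiedName(e: Expr, handleRegex: boolean): string {
  switch (e.kind) {
    case "Identifier":
      return e.name;
    case "PropertyAccess": {
      const receiver = qualifiedName(e.object, handleRegex);
      return receiver === "" ? e.name : `${receiver}.${e.name}`;
    }
    case "RegularExpressionLiteral":
      return handleRegex ? "<regex>" : "";
  }
}

// The pattern side stays dumb: it only compares names.
const looksLikeShellExec = (name: string): boolean =>
  name === "exec" || name === "child_process.exec";

// /^a/.exec(path) versus a bare exec(cmd)
const regexExec: Expr = {
  kind: "PropertyAccess",
  object: { kind: "RegularExpressionLiteral", source: "^a" },
  name: "exec",
};
const shellExec: Expr = { kind: "Identifier", name: "exec" };
```

With the regex case handled, `regexExec` gets a distinct name and stops matching, while `shellExec` is still caught.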

sandbox contains db. An early database_delete pattern matched objects named db. The string sandbox contains the substring d-b (san-db-ox), so SANDBOX_BACKEND_FACTORIES.delete() got logged as a database deletion. Substring matching on short generic names is fundamentally fragile. Fix: require the canonical receiver name (prisma) or an actual drizzle-orm import.
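A minimal before-and-after of that receiver check, under stated assumptions: the `DbCallSite` shape, the case-insensitive comparison, and both function names are hypothetical, and the real pattern catalog may express the same rule differently.

```typescript
// Hypothetical summary of one call expression.
interface DbCallSite {
  receiverRoot: string; // leftmost identifier, e.g. "prisma" in prisma.user.delete()
  method: string;
  imports: string[];    // module specifiers imported by the file
}

// Early, fragile rule: any receiver whose name contains "db".
function looseDbDelete(c: DbCallSite): boolean {
  return c.method === "delete" && c.receiverRoot.toLowerCase().includes("db");
}

// Tightened rule: canonical receiver name, or an actual drizzle-orm import.
function strictDbDelete(c: DbCallSite): boolean {
  if (c.method !== "delete") return false;
  const importsDrizzle = c.imports.some(
    (i) => i === "drizzle-orm" || i.startsWith("drizzle-orm/"),
  );
  return c.receiverRoot === "prisma" || importsDrizzle;
}

const sandboxCall: DbCallSite = {
  receiverRoot: "SANDBOX_BACKEND_FACTORIES",
  method: "delete",
  imports: [],
};
const prismaCall: DbCallSite = {
  receiverRoot: "prisma",
  method: "delete",
  imports: ["@prisma/client"],
};
```

The loose rule matches both calls; the strict rule matches only `prismaCall`.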

deploy is a verb that lives inside other words. Matching nameContains: ["deploy"] flagged cancelDeploy, getDeployStatus, listDeployments (query and management operations, not publish side effects) all over Mastra's deployer package. Switching to an exact match on a bare deploy() call removed 39 false positives in one commit. A manual audit of a ten-finding sample confirmed all ten were genuine false positives.

client.messages.create() is Anthropic, not Twilio. Same method name, completely different side effect. This is why ambiguous patterns carry an importContains condition: the pattern only fires if the disambiguating package is imported in the file. The ordering of the pattern table encodes priority — payments first, LLM calls before database writes — so client.chat.completions.create() never gets misfiled as a DB write.
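Here is a rough sketch of how an ordered pattern table with an import condition could resolve these collisions. The `Pattern` shape, the specific entries, and the first-match-wins loop are assumptions made for illustration; only the `importContains` idea and the priority order (payments first, LLM calls before database writes) come from the description above.

```typescript
interface Pattern {
  category: string;
  methodPath: string;      // trailing part of the call, e.g. "messages.create"
  importContains?: string; // pattern only fires if such a package is imported
}

// Order encodes priority: the first matching pattern wins.
const PATTERNS: Pattern[] = [
  { category: "payment", methodPath: "charges.create", importContains: "stripe" },
  { category: "llm_call", methodPath: "messages.create", importContains: "@anthropic-ai/sdk" },
  { category: "messaging", methodPath: "messages.create", importContains: "twilio" },
  { category: "llm_call", methodPath: "chat.completions.create" },
  { category: "database_write", methodPath: "create", importContains: "@prisma/client" },
];

function categorize(callText: string, imports: string[]): string | undefined {
  for (const p of PATTERNS) {
    const methodHit = callText === p.methodPath || callText.endsWith(`.${p.methodPath}`);
    const needle = p.importContains;
    const importHit = needle === undefined || imports.some((i) => i.includes(needle));
    if (methodHit && importHit) return p.category;
  }
  return undefined;
}
```

In this model, `client.chat.completions.create` in a file that also imports an ORM still resolves to llm_call, because that entry sits above the generic `create` pattern.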

I'd rather report 419 findings I can defend than 471 I have to apologize for. The validation pass on OpenClaw started at a 30% false-positive rate on a sampled audit. Killing that is the actual product.

What this does not tell you

The honest limitations, because a technical reader will find them anyway:

  • Unguarded is not the same as vulnerable. A flagged call can be completely safe — the guard might live in middleware, a gateway, or a layer the scanner can't see. The output tells you where to look, not what's broken.
  • It's static only. No runtime detection. If protection is enforced outside the file, the scanner can't know that unless you annotate it.
  • It's intra-procedural. Guard detection looks at the same function and its immediate decorators. A guard three call-frames away in another file won't be credited. Cross-function analysis is the next milestone, not a current claim.
  • It needs the import for ORM patterns. Mongoose, Sequelize, and TypeORM use generic method names (.save(), .create()), so those patterns are scoped to files that import the ORM. Re-exported models get missed.
  • confirmed is zero for external repos by construction. The annotation is this tool's convention. Read the zero as "nobody opted in," not "nobody bothered."

If you need runtime enforcement or semantic intent analysis, this is the wrong tool. It's a scanner. It reads code and counts.

Run it on your own agent

The interesting number is not mine. It's yours.

npm install -D @diplomat-ai/diplomat-agent-ts
npx diplomat-agent-ts scan .        # or ./src, ./packages

It prints a colored report. To get a committable inventory:

npx diplomat-agent-ts scan . --output-registry toolcalls.yaml

toolcalls.yaml is like package-lock.json, but for what your agent can do instead of what it depends on:

tool_calls:
  - function: chargeCustomer
    file: src/payments.ts
    line: 42
    actions:
      - "return stripe.charges.create({ amount, currency, customer })"
    checks: []
    missing:
      - no bounds on amount
      - no rate limit
      - no idempotency key
    owasp: [ASI-01, ASI-02, ASI-03, ASI-06]

Commit it. Diff it in PRs. When the agent gains a new capability, it shows up in review before it ships.
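A plain text diff of the YAML already does this job in review. If you want the same signal programmatically, the idea can be sketched as below. This is not part of the tool: `RegistryEntry` mirrors only a few fields from the example entry, `newToolCalls` is a hypothetical helper, and it assumes both registries have already been parsed from YAML into arrays.

```typescript
// Subset of the registry entry fields shown in the example above.
interface RegistryEntry {
  function: string;
  file: string;
  checks: string[];
  missing: string[];
}

// Returns entries present on the PR branch but absent from the base branch,
// keyed by file and function name (line numbers shift too easily to key on).
function newToolCalls(base: RegistryEntry[], head: RegistryEntry[]): RegistryEntry[] {
  const key = (e: RegistryEntry): string => `${e.file}#${e.function}`;
  const known = new Set(base.map(key));
  return head.filter((e) => !known.has(key(e)));
}

const baseRegistry: RegistryEntry[] = [
  { function: "chargeCustomer", file: "src/payments.ts", checks: [], missing: ["no rate limit"] },
];
const headRegistry: RegistryEntry[] = [
  ...baseRegistry,
  { function: "refundCustomer", file: "src/payments.ts", checks: [], missing: ["no rate limit"] },
];

for (const e of newToolCalls(baseRegistry, headRegistry)) {
  console.log(`new capability: ${e.file} ${e.function} (missing: ${e.missing.join(", ")})`);
}
```

Keying on file plus function name means a renamed or moved function shows up as new, which is arguably what a reviewer wants to see anyway.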

When a call is intentionally unguarded — or protected somewhere the scanner can't see — say so inline, and the next scan moves it to confirmed:

// checked:ok — protected by middleware/approval.ts
export async function chargeCustomer(amount: number, customerId: string) {
  return stripe.charges.create({ amount, currency: "usd", customer: customerId });
}

And to make new unguarded calls fail a build:

- name: Diplomat governance scan
  run: npx -y @diplomat-ai/diplomat-agent-ts scan . --fail-on-unchecked

The scanner is Apache-2.0, two dependencies, TypeScript-only. The benchmark artifacts above reproduce exactly at the pinned commits — every command is in the repo.

Repo and reproducible benchmarks: github.com/Diplomat-ai/diplomat-agent-ts

Run it on whatever you shipped last week. The 83% was three codebases I didn't write. I'm more curious what it says about the ones I did.
