DEV Community

Josselin Guarnelli

We Scanned 16 AI Agent Repos. 76% of Tool Calls Had Zero Guards.

We scanned 16 open-source AI agent repositories — both agent frameworks (CrewAI, PraisonAI) and production agent applications (Skyvern, Dify, Khoj, and others) that ship real business logic.

76% of tool calls with real-world side effects had zero protective checks.

No rate limits. No input validation. No confirmation steps. No auth checks.

An important nuance: you'd expect framework code to lack guards — it's template code, and adding guards is the implementor's job. But the same pattern holds in production agent applications with real business logic. Skyvern (browser automation, 595 files): 76% unguarded. Dify (LLM platform, 1000+ files): 75% unguarded. The frameworks aren't the problem — the problem is that nobody adds guards when they build on top of them either.

This means a single prompt injection — or a simple hallucination — could trigger hundreds of unvalidated database writes, unchecked HTTP requests to arbitrary URLs, or file deletions without confirmation.

Here's what we found, how we found it, and how you can audit your own agent code in 60 seconds.

What We Scanned

We analyzed 16 open-source repos in two categories:

Agent applications — repos that ship real business logic: browser automation agents, AI assistants, LLM platforms with tool-calling capabilities. These are the repos where guards should exist because the code runs in production against real databases and APIs.

Agent frameworks — repos like CrewAI and PraisonAI that provide scaffolding for building agents. Framework code is intentionally generic — it exposes tool call patterns without business-specific guards, because that's the implementor's responsibility.

We report findings for both categories, but the story that matters is the application layer: even when developers build on top of frameworks and add their own logic, the guards don't show up.

For each repo, we asked a simple question: which functions can change the real world, and which ones have guards?

A "tool call with side effects" is any function that can:

  • Write to a database (session.commit(), .save(), .create())
  • Delete data (session.delete(), os.remove(), shutil.rmtree())
  • Make HTTP write requests (requests.post(), httpx.put())
  • Process payments (stripe.Charge.create())
  • Send emails or messages (smtp.sendmail(), slack_client.chat_postMessage())
  • Invoke another agent (graph.ainvoke(), agent.execute())
  • Execute dynamic code (exec(), eval(), importlib.import_module())
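To make that concrete, here's a minimal hypothetical example of such a tool (the function and table names are invented for illustration, not taken from any of the scanned repos):

```python
import sqlite3

def save_note(conn: sqlite3.Connection, content: str) -> None:
    """A tool the LLM can call. The commit below is a real side effect,
    and nothing checks `content`, call frequency, or caller identity."""
    conn.execute("INSERT INTO notes (content) VALUES (?)", (content,))
    conn.commit()  # unguarded database write
```

By the scan's definition, this function would be flagged: one side effect, zero guards.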

A "guard" is any check that protects that call:

  • Input validation (Field(le=10000), @validator)
  • Rate limiting (@rate_limit, @throttle)
  • Auth checks (Depends(), Security() in FastAPI)
  • Confirmation steps (confirm, approve in function body)
  • Idempotency (idempotency_key, get_or_create)
  • Retry bounds (max_retries=, @retry(stop=stop_after_attempt()))
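As a hedged sketch, two of these guards (input validation and rate limiting) look like this in plain Python. The decorator and the limits are invented for illustration, not code from any scanned repo:

```python
import time

def rate_limited(max_calls: int, per_seconds: float):
    """Minimal in-process rate limiter (guard: rate limiting)."""
    def decorator(fn):
        calls: list[float] = []
        def wrapper(*args, **kwargs):
            now = time.monotonic()
            # keep only timestamps inside the sliding window
            calls[:] = [t for t in calls if now - t < per_seconds]
            if len(calls) >= max_calls:
                raise RuntimeError(f"{fn.__name__}: rate limit exceeded")
            calls.append(now)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@rate_limited(max_calls=3, per_seconds=60.0)
def send_message(text: str) -> str:
    # guard: input validation before the side effect
    if not text or len(text) > 500:
        raise ValueError("message must be 1-500 characters")
    return f"sent: {text}"
```

Either guard alone would move a function out of the UNGUARDED bucket; in practice you usually want both.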

The Results

By repo

| Repo | Type | Files | Tool calls | Unguarded | % |
|---|---|---|---|---|---|
| Skyvern | Application | 595 | 452 | 345 | 76% |
| Dify | Platform | 1000+ | 1,009 | 759 | 75% |
| PraisonAI | Framework | — | 1,028 | 911 | 89% |
| CrewAI | Framework | — | 348 | 273 | 78% |

Full results for all 16 repos: REALITY_CHECK_RESULTS.md

What we found unguarded

Across all repos, the most common unguarded categories were database writes, database deletes, HTTP write requests, subprocess/exec/eval calls, LLM calls, and email/messaging. The pattern is consistent: the more dangerous the action, the less likely it has guards.

A note on methodology: subprocess/exec/eval calls are a different class of risk — these should generally be eliminated entirely, not guarded. The scanner also prioritizes recall over precision: we'd rather flag a function that might be fine than miss one that isn't. Based on manual review, the false positive rate is roughly 15-20% — mostly from generic .save() calls that turn out to be config or file operations rather than database writes.

Why This Matters For AI Agents Specifically

You might think: "Unguarded function calls exist in every codebase. What makes agents special?"

The difference is who calls these functions.

In a traditional web app, a human user triggers actions through a UI with built-in constraints — forms with validation, buttons with confirmation dialogs, rate limits per session.

In an agent, an LLM decides which functions to call, with what arguments, how many times. The LLM doesn't know your business rules. It doesn't understand that calling refund() 200 times in a loop is catastrophic. And if an attacker crafts a prompt injection, the LLM will happily execute whatever functions it has access to — as many times as it's told.

Without guards in the code, there's nothing between the LLM's decision and the real-world consequence.

A concrete example from our scan: Khoj, an open-source AI assistant, exposes a function called ai_update_memories that lets the LLM delete and replace user memories. It calls session.delete() followed by session.add() with no confirmation, no rate limit, and no validation on the content. A single adversarial prompt could wipe and replace a user's entire memory store.
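A paraphrased sketch of what a guarded version of that pattern could look like. This is NOT Khoj's actual code; the `session` methods and the specific limits are hypothetical, chosen only to show the shape of the fix:

```python
def ai_update_memories_guarded(session, user: str, new_memories: list[str]) -> None:
    """Delete-then-add memory replacement with two guards added:
    a bound on how much can be replaced per call, and a length cap
    on each entry. `session.delete_all` / `session.add` are hypothetical."""
    if not new_memories or len(new_memories) > 20:
        raise ValueError("refusing to replace more than 20 memories per call")
    for content in new_memories:
        if not content or len(content) > 2000:
            raise ValueError("memory entry empty or too long")
    session.delete_all(user)        # was: an unconditional session.delete()
    for content in new_memories:
        session.add(user, content)  # was: session.add() with no validation
```

The guards don't make the operation safe in every scenario, but they bound the blast radius of a single adversarial prompt.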

How We Built the Scanner

We built diplomat-agent, an AST-based static analyzer for Python. It uses Python's built-in ast module — zero required dependencies. Optional: rich for colored terminal output.

Why AST and not regex?

Regex pattern matching misses most real-world code patterns. A function call like db.session.commit() can appear as a direct call, nested inside a try/except, called through a variable alias, or buried three levels deep in a helper function. AST understands the code structure — it parses the actual syntax tree, not text patterns.
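A small illustrative comparison (not diplomat-agent's actual detection code) shows why. The aliased call below never contains the literal text `session.commit(`, so a regex on the call site finds nothing, while the AST still exposes the attribute access:

```python
import ast
import re

# session.commit is aliased to a variable before being called,
# so the text "session.commit(" never appears in the source
source = '''
def sync(db):
    commit = db.session.commit
    try:
        commit()
    except Exception:
        pass
'''

# text search for the call site finds nothing
regex_hit = re.search(r"session\.commit\(", source)

# the syntax tree still contains the attribute access db.session.commit
ast_hits = [n for n in ast.walk(ast.parse(source))
            if isinstance(n, ast.Attribute) and n.attr == "commit"]
```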

The scanner walks every Python file in your project (excluding tests, migrations, venv, examples, and other non-production directories), visits every function definition, and for each function:

  1. Finds all calls that match side-effect patterns (DB writes, HTTP calls, deletes, payments, etc.)
  2. Finds all guards in scope (validators, rate limits, auth checks, confirmation steps)
  3. Outputs a verdict: UNGUARDED, PARTIALLY_GUARDED, GUARDED, or LOW_RISK (for read-only functions)

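The three steps above can be sketched as a toy classifier. This is illustrative only (the pattern sets are tiny and it covers just three of the four verdicts), not diplomat-agent's real implementation:

```python
import ast

SIDE_EFFECTS = {"commit", "delete", "post", "put", "remove"}  # step 1 patterns
GUARDS = {"validate", "rate_limit", "confirm"}                # step 2 patterns

def verdict(func_source: str) -> str:
    """Classify a function as UNGUARDED / GUARDED / LOW_RISK."""
    names = set()
    for node in ast.walk(ast.parse(func_source)):
        if isinstance(node, ast.Attribute):
            names.add(node.attr)
        elif isinstance(node, ast.Name):
            names.add(node.id)
    effects = names & SIDE_EFFECTS
    guards = names & GUARDS
    if not effects:
        return "LOW_RISK"          # step 3: read-only function
    return "GUARDED" if guards else "UNGUARDED"
```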
The default output is a terminal report showing every finding with its verdict. You can also generate a toolcalls.yaml registry (with --format registry) — a committable inventory of every function with side effects, the guards present or missing, and actionable hints.

Try It On Your Own Code

```shell
pip install diplomat-agent
diplomat-agent .
```

That's it. Zero config, zero required dependencies. Takes about 2-3 seconds on a 1000-file repo.

The output looks like this:

```text
diplomat-agent — governance scan

Scanned: ./my-agent
Tools with side effects: 12

⚠ send_report(endpoint, payload)
  Rate limit:             NONE
  → Risk: agent could exhaust external API quota with 200 calls
  ⤷ no rate limit · no auth check
  Governance: ❌ UNGUARDED

⚠ send_notification(user_id, message)
  Rate limit:             NONE
  → Risk: agent could send 200 messages — spam risk
  ⤷ no rate limit · no auth check
  Governance: ❌ UNGUARDED

✓ process_order(order_id)
  Write protection:       Input Validation (FULL)
  Rate limit:             Rate Limit (FULL)
  Governance: ✅ GUARDED

────────────────────────────────────────────
RESULT: 8 with no checks · 3 with partial checks · 1 guarded (12 total)
```

What To Do When You Find Gaps

For each unguarded tool call, you have four options:

Fix it — add validation, rate limiting, or confirmation in code. The next scan picks it up automatically.

Acknowledge it — if the function is intentionally unguarded or protected elsewhere, add # checked:ok as a comment:

```python
def send_alert(message):  # checked:ok — protected by API gateway
    requests.post(ALERT_URL, json={"msg": message})
```

Add it to CI — block PRs that introduce new unguarded tool calls:

```shell
diplomat-agent . --fail-on-unchecked
```

If you commit toolcalls.yaml as a baseline, only new findings block — no noise on legacy code.
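As a sketch, the CI step could look like this in GitHub Actions. The workflow wiring is an assumption for illustration; only the `--fail-on-unchecked` flag comes from the tool itself:

```yaml
# .github/workflows/agent-guards.yml (illustrative)
name: agent-guard-scan
on: [pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install diplomat-agent
      # fails the job if new unguarded tool calls appear vs the committed baseline
      - run: diplomat-agent . --fail-on-unchecked
```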

Review the inventory — the toolcalls.yaml file is meant to be committed and reviewed in PRs. When someone adds a new function that can delete data, it shows up in the diff.

The Bigger Picture

We're building AI agents that can modify databases, send money, delete files, and call external APIs — and we're giving them zero guardrails in code.

The OWASP Top 10 for Agentic Applications (released December 2025) explicitly recommends maintaining a complete inventory of all agentic components, their permissions, and their capabilities. The EU AI Act (enforceable August 2026) requires documenting system capabilities and human oversight measures for high-risk AI systems.

toolcalls.yaml is a step toward that. It's not a complete governance solution — it's a starting point. You can't govern what you can't see.

Links

The scanner is open source, the findings are reproducible. Run it on your agent code and tell me what you find.


What unguarded tool calls are hiding in your agent code? Run the scan and share your results — I'll respond to every comment.
