Brian Hall

Posted on Jun 22

Put a hard stop in front of your CrewAI crew's tool calls

#ai #agents #crewai #security

CrewAI makes it easy to stand up a crew. You give a few agents roles, hand them tools, let them delegate work to each other, and the thing mostly runs itself. That autonomy is the appeal. It's also the problem. Once a crew is moving, every agent in it can reach for a tool, and there's nothing between the model deciding to call something and the call actually happening.

The usual fix is a careful prompt and crossed fingers. Or a second LLM that "reviews" the action, which is hoping with extra latency. I wanted a check that doesn't depend on a model being in a good mood: something deterministic that runs before the tool call fires and gives a real answer. Allow it, hold it for a human, or block it.

That's what Faramesh does. It's open source and it works with CrewAI through a one-line wrapper. What follows is the end-to-end setup, with every command and every policy snippet taken from how the tool works.

The idea

A tool call is the moment an agent stops reasoning and starts doing. Reading a doc is one thing. Spending money, sending mail to a customer, or hitting a production API is another. Those are the moments worth putting a rule in front of.

Faramesh runs as a local daemon. Your whole policy lives in one file, governance.fms, and the daemon checks every tool call against it before the call runs. No LLM sits in that decision path, so the same action under the same policy always gets the same verdict. You get one of three:

permit: the call runs
defer: the call pauses and waits for a human to approve or reject
deny: the call is blocked before it happens

The point is that it's deterministic. You can read the policy, reason about it, and know what it'll do. That's the whole difference from asking a second model to babysit the first one.
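To make the three verdicts concrete, here is a minimal sketch of a gate that sits in front of a tool call. It is not Faramesh's code or API: `Verdict`, `gate`, `decide`, and the response dictionaries are hypothetical names invented for illustration, and a fixed lookup table stands in for the policy file.

```python
from enum import Enum
from typing import Any, Callable


class Verdict(Enum):
    PERMIT = "permit"
    DEFER = "defer"
    DENY = "deny"


def gate(decide: Callable[[str], Verdict], tool_name: str,
         tool: Callable[..., Any], *args: Any, **kwargs: Any) -> dict:
    """Ask `decide` for a verdict first; only PERMIT ever reaches the tool."""
    verdict = decide(tool_name)
    if verdict is Verdict.PERMIT:
        return {"status": "ok", "result": tool(*args, **kwargs)}
    if verdict is Verdict.DEFER:
        # The tool is not called; the caller is told the action is pending.
        return {"status": "pending_approval", "tool": tool_name}
    return {"status": "denied", "tool": tool_name}


# Hypothetical stand-in for a policy: a plain table, no model involved,
# so the same tool name always produces the same verdict.
TABLE = {"search": Verdict.PERMIT, "send_email": Verdict.DEFER}


def decide(tool_name: str) -> Verdict:
    return TABLE.get(tool_name, Verdict.DENY)


calls = []  # records which inputs actually reached the tool


def fake_tool(text: str) -> str:
    calls.append(text)
    return text.upper()
```

Only the permitted call ends up in `calls`. The deferred and denied calls return a structured answer without the tool ever running.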

Install

Install the CLI:

curl -fsSL https://raw.githubusercontent.com/faramesh/faramesh-core/main/install.sh | bash
faramesh --version

Then add the SDK to your CrewAI project:

pip install faramesh-sdk crewai

Generate the policy

From the root of your project:

faramesh init

Faramesh inspects the repo, finds your framework, discovers your tools, and writes a starter governance.fms. The important part of the default: every discovered tool starts at defer. Nothing runs until you've reviewed it. That's the safe direction to fail.

Wire the crew

This is the only step that's CrewAI-specific, and it's small. You wrap each agent's tools in a GovernedToolSet and give that set an identity. Here's a crew before:

from crewai import Agent, Crew, Task
from crewai_tools import SerperDevTool, BraveSearchTool

researcher = Agent(
    role="researcher",
    tools=[SerperDevTool(), BraveSearchTool()],
)
writer = Agent(role="writer", tools=[])

crew = Crew(agents=[researcher, writer], tasks=[...])

And after:

from crewai import Agent, Crew, Task
from crewai_tools import SerperDevTool, BraveSearchTool
from faramesh import GovernedToolSet

researcher_tools = GovernedToolSet(
    [SerperDevTool(), BraveSearchTool()],
    agent_id="research-crew/researcher",
)

researcher = Agent(role="researcher", tools=researcher_tools)
writer = Agent(role="writer", tools=[])

crew = Crew(agents=[researcher, writer], tasks=[...])

That's the whole integration. Use one GovernedToolSet per agent so each crew member gets its own identity in the policy. That's what lets you give the researcher and the writer different rules, which matters more than it sounds like, since in a crew the agents have genuinely different jobs and should have genuinely different permissions.

Write the rules, per role

Open governance.fms. Because each agent has its own id, you write a policy block per role. Here the researcher can search but do nothing else, and the writer can't touch tools at all:

import "github.com/faramesh/faramesh-registry/frameworks/crewai@1.0.0"

agent "research-crew/researcher" {
  default deny

  rules {
    permit serper_search
    permit brave_search
  }

  rate_limit "*_search": 30 per minute

  budget daily {
    max       $20
    on_exceed defer
  }
}

agent "research-crew/writer" {
  default deny
  rules { }
}

A few things worth reading off that:

default deny means anything not explicitly allowed gets blocked. You opt tools in, you don't opt them out. Rules are checked top to bottom and the first match wins.
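For readers who want to see first-match-wins with a default in code, here is a small sketch. It is an illustration, not how Faramesh evaluates governance.fms: the `evaluate` function and the `(effect, pattern)` tuples are hypothetical, and using glob patterns inside rules is an assumption borrowed from the `"*_search"` pattern in the rate_limit line.

```python
from fnmatch import fnmatchcase


def evaluate(rules, tool_name, default="deny"):
    """Return the effect of the first rule whose pattern matches the tool."""
    for effect, pattern in rules:
        if fnmatchcase(tool_name, pattern):
            return effect  # first match wins; later rules are never consulted
    return default  # nothing matched, so the default applies


# Mirrors the two agent blocks above.
RESEARCHER_RULES = [("permit", "serper_search"), ("permit", "brave_search")]
WRITER_RULES = []  # empty rule block: everything falls through to the default

# Order matters: the specific deny is listed before the broad permit.
ORDERED_RULES = [("deny", "brave_search"), ("permit", "*_search")]
```

With `ORDERED_RULES`, `brave_search` is denied because its rule comes first, while `serper_search` falls through to the broad permit. Swapping the two rules would permit both.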

The rate_limit line caps both search tools at 30 calls a minute, so a confused agent can't hammer an API in a loop. The budget block puts a daily ceiling on spend and, when it's hit, defers instead of denying: the work pauses for a human rather than just dying. The writer's empty rule block plus default deny means it has no tool access at all, which is exactly right for an agent whose job is to write, not act.
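The two limits can be sketched in a few lines. This is an assumed mechanism, not Faramesh's implementation: `SlidingWindowLimiter` and `DailyBudget` are hypothetical classes, the clock is passed in explicitly so the behavior is reproducible, and the daily reset is left out.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` within any `window_seconds` span."""

    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window = window_seconds
        self.stamps = deque()  # timestamps of calls still inside the window

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window.
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()
        if len(self.stamps) >= self.max_calls:
            return False
        self.stamps.append(now)
        return True


class DailyBudget:
    """Track spend against a ceiling; over the ceiling, return `on_exceed`."""

    def __init__(self, max_usd: float, on_exceed: str = "defer"):
        self.max_usd = max_usd
        self.on_exceed = on_exceed
        self.spent = 0.0

    def charge(self, cost_usd: float) -> str:
        if self.spent + cost_usd > self.max_usd:
            return self.on_exceed  # the call is not charged and not run
        self.spent += cost_usd
        return "permit"


limiter = SlidingWindowLimiter(max_calls=30, window_seconds=60)
budget = DailyBudget(max_usd=20.0, on_exceed="defer")
```

A looping agent that fires one call per second gets 30 through and is refused on the 31st. Once the oldest call is 60 seconds old, a slot opens again.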

Validate before you ship anything:

faramesh check
faramesh plan

check parses and type-checks the file. plan prints the exact decision diff, so you can see what changes before it's live.

Apply and run

Turn on enforcement and run the crew normally:

faramesh apply
python my_crew.py

A permit returns the tool result like nothing's there. A defer returns a structured response telling the agent its action is pending approval. The crew doesn't crash; the call just doesn't go through yet. You watch and clear the queue from another terminal:

faramesh approvals list
faramesh approvals approve apr-9001

Once approved, the agent's next attempt goes through. Or, if you've decided it should always be allowed, promote the rule to permit in the file and run faramesh apply again. One thing to know: apply is the only way to change the running policy. There's no quiet hot-reload where editing a file changes what your crew can do mid-run. You edit, you apply. That's deliberate.
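The defer-then-approve loop can be modeled as a small queue. This is a toy model under stated assumptions, not the daemon's behavior: `ApprovalQueue`, its methods, and the response fields are hypothetical, and treating one approval as covering exactly one later attempt is a guess made for the sketch.

```python
import itertools


class ApprovalQueue:
    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}      # approval id -> (agent_id, tool_name)
        self._approved = set()  # (agent_id, tool_name) pairs a human cleared

    def attempt(self, agent_id: str, tool_name: str) -> dict:
        key = (agent_id, tool_name)
        if key in self._approved:
            self._approved.remove(key)  # assumption: one approval, one call
            return {"status": "ok"}
        # Reuse the existing ticket if this action is already waiting.
        for approval_id, pending_key in self._pending.items():
            if pending_key == key:
                return {"status": "pending_approval", "approval_id": approval_id}
        approval_id = f"apr-{next(self._ids)}"
        self._pending[approval_id] = key
        return {"status": "pending_approval", "approval_id": approval_id}

    def approve(self, approval_id: str) -> None:
        # Move the action from pending to approved.
        self._approved.add(self._pending.pop(approval_id))


queue = ApprovalQueue()
first = queue.attempt("research-crew/researcher", "send_email")
retry = queue.attempt("research-crew/researcher", "send_email")
queue.approve(first["approval_id"])
after = queue.attempt("research-crew/researcher", "send_email")
```

The agent gets a structured "pending" answer on both early attempts, and the attempt after the human approval goes through.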

Crews delegate, so the policy understands delegation

The thing that makes CrewAI CrewAI is agents handing work to each other. Faramesh models that directly. If your researcher delegates to your writer, you can bound what that delegation is allowed to carry:

agent "research-crew/researcher" {
  delegate {
    target_agent = "research-crew/writer"
    scope        = "read-only"
    ttl          = "5m"
  }
}

The daemon validates delegation against the crew's actual structure at runtime, so one agent can't quietly hand another a capability it wasn't granted. That's a failure mode specific to multi-agent setups, and it's nice to have it covered in the same file as everything else.
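Here is a rough sketch of what bounding a delegation by scope and TTL could look like. Everything in it is an assumption for illustration: the `Delegation` dataclass, the idea that a scope is a named set of tools, the contents of `"read-only"`, and the two check functions are not taken from Faramesh.

```python
from dataclasses import dataclass

# Assumption: a scope is just a named set of tools.
SCOPES = {"read-only": frozenset({"serper_search", "brave_search"})}


@dataclass(frozen=True)
class Delegation:
    source: str
    target: str
    scope: str
    issued_at: float    # seconds on some shared clock
    ttl_seconds: float


def can_grant(d: Delegation, source_permits: frozenset) -> bool:
    """A source may only delegate tools it is itself permitted to use."""
    scope_tools = SCOPES.get(d.scope)
    return scope_tools is not None and scope_tools <= source_permits


def delegated_call_allowed(d: Delegation, agent_id: str,
                           tool_name: str, now: float) -> bool:
    if agent_id != d.target:
        return False  # the grant names one target agent only
    if now - d.issued_at > d.ttl_seconds:
        return False  # the grant has expired
    return tool_name in SCOPES.get(d.scope, frozenset())


grant = Delegation(
    source="research-crew/researcher",
    target="research-crew/writer",
    scope="read-only",
    issued_at=0.0,
    ttl_seconds=300.0,  # the "5m" from the policy block
)
```

`can_grant` is the check that stops an agent from handing over a capability it never had. `delegated_call_allowed` is the check at call time.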

Why bother

Every decision the daemon makes also lands in a tamper-evident log you can verify offline with faramesh audit verify. That matters the day someone asks what your crew actually did and "I think the prompt told it not to" isn't a good enough answer.
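One common way to make a log tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below shows that general technique only. The entry layout, the `append` and `verify` functions, and the all-zero genesis value are hypothetical and say nothing about Faramesh's actual log format or what `faramesh audit verify` checks.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always hashes the same way.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log: list, record: dict) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev_hash,
                "hash": _digest(prev_hash, record)})


def verify(log: list) -> bool:
    """Recompute every hash from the start; any edit breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append(log, {"agent": "research-crew/researcher", "tool": "serper_search", "verdict": "permit"})
append(log, {"agent": "research-crew/writer", "tool": "send_email", "verdict": "deny"})
intact = verify(log)

log[1]["record"]["verdict"] = "permit"  # rewrite history after the fact
tampered_ok = verify(log)
```

A chain like this only catches someone who edits entries without recomputing every later hash, so in practice the latest hash also has to be kept somewhere the editor can't reach.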

None of this makes your agents smarter. It means the moments that carry real risk go through a deterministic rule you wrote and can read, instead of through luck. For a single agent that's useful. For a crew, where several agents are acting and delegating at once, it's the difference between a demo and something you'd leave running.

Faramesh is open source. The repo is at github.com/faramesh/faramesh-core if you want to poke around or break it. If you wire it into a crew and something's off, tell me. That's all super helpful feedback at this point.

Top comments (6)

Aljen M • Jun 22

Hello,

One of the biggest gaps in today's AI agent ecosystem is deterministic governance.
Prompt engineering and secondary LLM reviewers are not security models; they're probabilistic safeguards.
Treating tool execution as a security boundary with explicit permissions, default-deny policies, human approvals, rate limits, budgets, delegation controls, and tamper-evident audit logs is a far more production-ready approach.
This is exactly the kind of governance layer autonomous and multi-agent systems need to move from impressive demos to trustworthy enterprise deployments.

Brian Hall • Jun 23

100% agree Aljen!

Aljen M • Jun 23

Thank you

Mike Czerwinski • Jun 22

The three-verdict architecture (permit/defer/deny) is the cleanest CrewAI-specific instantiation of what I've been writing as a status lifecycle on the decision-store side. Same shape, different artifact: defer = pending-confirm, permit = active, deny = locked-out. The thing that makes it work in both places is that the decision path doesn't depend on a model being in a good mood — your exact phrasing — and that comes from the policy file being readable, diff-able, and only mutable through an explicit apply step. File-as-interface plus operator-authored-only is the architectural shape this whole space seems to be converging on from multiple directions in the last week alone.

One operational concern worth naming: the policy is static but the call patterns it gates aren't. What catches the case where governance.fms itself becomes stale relative to actual usage? Examples that bite:

A permit rule for a tool that hasn't been called in three months — safe-low blast radius, but a standing permission no longer earning its keep. Could be demoted to defer with no real cost.
A defer queue accumulating without operator approvals — implies the rule needs to evolve into permit (operator is the bottleneck) or deny (the workflow doesn't need it).
A deny pattern hit repeatedly by the same agent on the same target — implies a legitimate use case the policy didn't anticipate.
Same shape as drift detection on decisions, applied one floor up to the policy itself. Curious if Faramesh exposes call-pattern telemetry that could feed back into "policy review" prompts for the operator, or whether the design treats the policy file as authored-and-forgotten until the operator explicitly revisits it.

Brian Hall • Jun 23

Yeah, the drift point makes sense. The stale permit one is real but pretty low stakes I think, a permit that isn't getting called isn't doing much either way, worst case it's just surface area sitting there. The defer queue and the repeated deny are the ones I'd actually act on.

Telemetry's already there for it. Faramesh logs the decision itself, not just the execution, so every one's in the hash-chained log with the action and context. What's missing is anything reading back over it to flag this stuff for you. That part's still on you to go look.

And it'd have to stay advisory anyway, since the policy only changes through apply. Most it could do is suggest a diff you actually apply.

Defer queue's the cleanest signal of the three though. Growing without approvals means the rule wants to be permit or deny, not much ambiguity there.

Thanks for the comment!

Mike Czerwinski • Jun 23

Defer queue as the clearest of the three holds up: a queue growing without approvals is one of the few unambiguous "the rule wants to be different" signals, because it's measurable in the same shape as the policy itself (count, age, repeat-key). Rate-of-growth-without-resolution is well-typed — no semantic guess about intent.

The read-back gap is where the design space sits. Hash-chained decision log gives you replayable substrate for free; what's missing is just the analyzer that walks it and surfaces diff suggestions. That keeps apply as the source of truth and the layer above as advisory — same shape as static analyzers over a git log vs auto-pushing commits.