Jason Shotwell
We Scanned 3 AI Frameworks for EU AI Act Compliance — Here's What We Found

The EU AI Act enforcement deadline is August 2, 2026. Most AI teams know the regulation exists. Almost none know what it means for their codebase.

I built AIR Blackbox, an open-source CLI that scans Python AI projects for EU AI Act technical requirements. Think of it as a linter — but instead of checking code style, it checks whether your AI system has the technical infrastructure the regulation demands.

To validate whether the scanner actually works, I ran it against three real, production-grade frameworks: CrewAI, LangFlow, and Quivr. What I found surprised me — and broke my scanner in the process. Here's the full story.

What the Scanner Checks

The EU AI Act defines technical requirements across several articles. The scanner maps these to concrete code patterns:

  • Article 9 — Risk Management: Error handling, fallback patterns, circuit breakers
  • Article 10 — Data Governance: Input validation, PII detection, schema enforcement
  • Article 11 — Technical Documentation: Docstrings, type hints, README files
  • Article 12 — Record-Keeping: Logging, tracing, audit trails (OpenTelemetry, Langfuse, etc.)
  • Article 14 — Human Oversight: HITL approval gates, kill switches, execution budgets
  • Article 15 — Accuracy & Security: Prompt injection defense, output validation, retry logic

Each article gets a PASS, WARN, or FAIL based on what the scanner finds in your actual source code.
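To make the PASS/WARN/FAIL mechanics concrete, here's a minimal sketch of how a rule-based article check might work. The function name and patterns below are illustrative, not the scanner's actual API:

```python
import re

# Hypothetical patterns for Article 12 (record-keeping); the real
# scanner's rule set is far more extensive.
AUDIT_PATTERNS = [
    r'\bimport logging\b',
    r'\bopentelemetry\b',
    r'\blangfuse\b',
]

def check_article_12(source: str) -> str:
    """Return PASS/WARN/FAIL based on how many audit patterns match."""
    hits = sum(bool(re.search(p, source, re.IGNORECASE)) for p in AUDIT_PATTERNS)
    if hits >= 2:
        return "PASS"
    if hits == 1:
        return "WARN"
    return "FAIL"
```

The real scanner aggregates many such checks per article; this shows only the shape of the verdict logic.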

The Results

Here's the comparison table across all three frameworks:

| Article | CrewAI | LangFlow | Quivr |
| --- | --- | --- | --- |
| Art. 9 — Risk Management | ⚠️ WARN | ⚠️ WARN | ⚠️ WARN |
| Art. 10 — Data Governance | ⚠️ WARN | ⚠️ WARN | ⚠️ WARN |
| Art. 11 — Technical Docs | ✅ PASS | ✅ PASS | ✅ PASS |
| Art. 12 — Record-Keeping | ✅ PASS | ✅ PASS | ⚠️ WARN |
| Art. 14 — Human Oversight | ✅ PASS | ✅ PASS | ⚠️ WARN |
| Art. 15 — Accuracy & Security | ✅ PASS | ✅ PASS | ⚠️ WARN |
| **Total Passing** | 4/6 | 4/6 | 1/6 |

A few things stand out.

CrewAI Has Real Human Oversight

CrewAI ships a @human_feedback decorator — a 560-line module dedicated to human-in-the-loop approval. Combined with their Fingerprint identity system and AgentCard for A2A identity, it's the most compliance-ready architecture I've scanned. The OpenTelemetry integration with 72 event files gives you genuine audit trail infrastructure.

LangFlow Has the Strongest Security Story

LangFlow includes a GuardrailsComponent, explicit prompt injection detection, SSRF blocking, and Fernet encryption. Their tracing story is also strong — they support 8 different tracing backends. If you're building on LangFlow and need to prove Article 15 compliance, the infrastructure is already there.

Quivr Has a Solid Foundation But Gaps

Quivr's Langfuse integration is real: LangfuseService wraps LLM calls with trace_id, user_id, and session_id. But with only one action-audit file alongside the Langfuse traces, record-keeping stays at WARN. Type-hint coverage (79%) is above average and earns Quivr its only PASS, on Article 11. There's no HITL pattern anywhere, no prompt injection defense, and PII handling is limited to Pydantic schema validation. For a RAG framework that processes user documents, those gaps are worth flagging.

How the Scanner Broke (and Why That's the Point)

Here's the part that doesn't make it into marketing slides: the first time I ran these benchmarks, the scanner was wrong.

The problem was false positives. The rule-based scanner was too lenient — it was counting patterns that look like compliance but aren't:

  • user_id appearing in 2 files was enough to PASS human oversight. That's like saying a building is fire-safe because it has a doorknob.
  • max_iterations counted as token security. It's a loop limiter, not a cost control.
  • sanitize matched sanitize_filename — a file utility, not an injection defense.
  • Bare pii matched as a substring inside longer, unrelated identifiers. Regex without word boundaries.
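That last class of bug is easy to reproduce. A minimal demonstration, using a made-up identifier for illustration:

```python
import re

# Hypothetical identifier that happens to contain "pii" as a substring
line = "mapii_config = load()"

assert re.search(r'pii', line)          # loose pattern: false positive
assert not re.search(r'\bpii\b', line)  # word boundary: correctly rejects it
assert re.search(r'\bpii\b', "# redact pii before logging")  # real mention still matches
```

The same principle applies to the sanitize case: anchoring the pattern to a security context instead of a bare word keeps file utilities like sanitize_filename from counting as injection defenses.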

Quivr's initial scan came back with a PASS on Human Oversight based on nothing more than user_id appearing in two files. A CTO looking at that result would — correctly — dismiss the entire tool.

The Fix: Strong vs. Weak Patterns

I rewrote 5 check functions to separate strong signals from weak signals:

```python
# Before: "user_id" alone = PASS
# After: requires delegation context
strong_patterns = [
    r'authorized_by', r'delegated_by', r'on_behalf_of',
    r'delegation_token', r'agent_identity',
    r'Fingerprint', r'AgentCard',  # CrewAI-specific
]
moderate_patterns = [
    r'(?:agent|llm|crew|chain|pipeline).*user_id',
    r'user_id.*(?:agent|llm|crew|chain|pipeline)',
]
```

Strong patterns (dedicated security libraries, explicit delegation tokens) trigger a PASS. Weak patterns (generic config params that happen to exist in most codebases) trigger a WARN at most. This means every framework that passes actually has the infrastructure — not just coincidental variable names.
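In code, the tiering might look like this. It's a sketch of the idea rather than the scanner's exact implementation, with the pattern lists abbreviated:

```python
import re

# Abbreviated versions of the pattern tiers described above
STRONG = [r'authorized_by', r'delegation_token', r'Fingerprint']
MODERATE = [r'(?:agent|llm|crew|chain|pipeline).*user_id']

def classify_oversight(source: str) -> str:
    """Strong signals can justify PASS; weak ones cap out at WARN."""
    if any(re.search(p, source) for p in STRONG):
        return "PASS"
    if any(re.search(p, source) for p in MODERATE):
        return "WARN"
    return "FAIL"
```

A generic `user_id` in a config file never reaches the MODERATE tier, so it can no longer carry the article on its own.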

Proportional Scoring

I also replaced the flat threshold (two or more passing checks meant the whole article passed) with proportional scoring:

```python
if s_passes > 0 and s_passes >= s_fails and (s_passes / s_total) >= 0.4:
    overall = "pass"
```

An article now needs at least 40% of its checks passing AND more passes than fails. This stopped 2/8 passing checks from carrying an entire article.
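Spelled out as a full function, the verdict logic looks like this. The WARN/FAIL split below is my assumption for illustration; the post's snippet only shows the PASS condition:

```python
def article_verdict(s_passes: int, s_fails: int, s_total: int) -> str:
    """Proportional scoring: >=40% of checks passing AND more passes than fails."""
    if s_passes > 0 and s_passes >= s_fails and s_passes / s_total >= 0.4:
        return "pass"
    if s_passes > 0:       # assumed fallback: some evidence, not enough
        return "warn"
    return "fail"

# 2 of 8 checks passing no longer carries the article:
article_verdict(2, 1, 8)   # -> "warn" (2/8 = 25%, below the 40% bar)
article_verdict(4, 2, 8)   # -> "pass" (4/8 = 50%, and passes > fails)
```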

The Hybrid Approach: Rule-Based + Fine-Tuned Model

The scanner has two engines working together. The rule-based engine scans every file in the codebase using regex patterns. A fine-tuned AI model (running locally via Ollama — your code never leaves your machine) analyzes a smart sample of compliance-relevant files and provides deeper analysis.

The problem: the model only sees ~5 files and ~12KB of code. For a framework like CrewAI with 1,900+ Python files, it misses things the rule-based scanner catches. So I built smart reconciliation:

  • Model says FAIL, but rule-based found 2+ passing checks → Override to PASS
  • Model says FAIL, but rule-based found 1 passing check → Upgrade to WARN
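Those two override rules can be sketched as a single function. This is a simplified model; the real reconciliation presumably handles more states than just the model's FAIL:

```python
def reconcile(model_verdict: str, rule_passes: int) -> str:
    """Let the rule engine's breadth override the model's narrow sample."""
    if model_verdict == "fail" and rule_passes >= 2:
        return "pass"   # rules found evidence the model's sample missed
    if model_verdict == "fail" and rule_passes == 1:
        return "warn"
    return model_verdict
```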

This gives you the breadth of regex scanning with the depth of AI analysis. After tightening the rules and adding reconciliation, validation accuracy hit:

  • CrewAI: 100% (6/6 articles match expected)
  • LangFlow: 83% (5/6 match)
  • Quivr: 83% (5/6 match)

Sharing Results With Framework Teams

The point of benchmarking isn't to rank frameworks — it's to validate the scanner and give useful feedback. I've been opening GitHub issues on each framework's repo sharing scan results and asking three questions:

  1. Did we miss something? If there's compliance infrastructure our scanner didn't catch, that's a bug in our sampling logic.
  2. Is our assessment fair? A WARN on risk management might be by design if the framework delegates that to the application layer.
  3. Is this useful for your roadmap? August 2026 is coming. If this data helps prioritize, great.

The Haystack team at deepset already gave feedback that helped us fix false positives. We're hoping for similar conversations with CrewAI, LangFlow, and Quivr.

Try It Yourself

The scanner is open-source and runs entirely on your machine:

```shell
pip install air-blackbox
air-blackbox setup          # pulls the local AI model
air-blackbox comply --scan /path/to/your/project -v --deep
```

10 seconds. 6 technical checks mapped to the actual EU AI Act articles. No cloud. No API keys. The AI model runs locally through Ollama.

If you're building on any Python AI framework and want to know where you stand before August 2026, this is a good starting point. It's a linter, not a lawyer — it checks technical requirements, not legal compliance.

GitHub: github.com/airblackbox/gateway
PyPI: air-blackbox
Website: airblackbox.ai


If you scan your project and find something wrong with the results, open an issue. Every bug report makes the scanner better — that's literally what happened with this benchmark.
