DEV Community

Waqar Javed
Waqar Javed

Posted on

I just published an open-source framework for red-teaming AI agents.

Not LLM chatbots — agents. The kind built on LangChain, CrewAI, AutoGPT-style architectures that use tools, call APIs, and take multi-step actions in the world.

Here's the problem I kept running into: teams are shipping agentic systems to production, but the red-teaming tooling hasn't kept up. Most evaluation frameworks still treat agents like chatbots. They miss the failure modes that actually matter — prompt injection through tool outputs, scope violations across reasoning steps, behavioral drift under adversarial conditions.

So I built AgentSafeLabs.

You wrap your agent in one function call. It runs a test suite aligned to the OWASP Agentic Security Initiative Top 10 — the emerging standard for agentic AI security. You get structured results: PASS, FAIL, UNCERTAIN, with reproducible test cases.

Real example from this week: We ran AgentSafeLabs against Claude Haiku as the target agent passed 2 of 3 ASI01 (prompt injection) tests. The third returned UNCERTAIN — an indirect injection through a benign-looking context prefix that partially redirected tool selection. That's the kind of edge case that doesn't show up in standard evals.

It's MIT licensed, on PyPI, CI-verified, and actively being extended.

pip install safelabs-eval

GitHub: https://github.com/AgentSafeLabs/safelabs-eval

If you're building agents and you've hit unexpected failure modes — I'd like to hear about them. And if you know someone this would be useful for, a share goes a long way for an early OSS project.

AIAgents #AISecurity #RedTeaming #AgenticAI #PromptInjection #LLMSecurity #OpenSourceAI

Top comments (0)