AIR Blackbox is an open-source observability layer that records, replays, and enforces policies on every LLM call your AI agents make. Try the live demo.
Your AI agent just sent an email, called an API, or moved money. Someone asks: "Show me exactly what it saw and why it made that decision."
Can you answer that today?
If you're running AI agents in production — LangChain chains, CrewAI crews, OpenAI function calling, AutoGen teams — you've probably hit this wall. The agent did something. Logs show fragments. Token spend spiked. But there's no complete record of the full decision chain.
We built AIR Blackbox to fix that. It's the flight recorder for autonomous AI agents — like the black box on an airplane, but for your LLM calls.
Try it right now (no install needed): Live Interactive Demo
What Does the Demo Show?
The hosted demo lets you run four scenarios and watch what happens inside the AIR Blackbox pipeline in real time:
Scenario 1: Normal Request
A standard agent request flows through the system. You'll see:
- The Gateway intercepts the call and assigns a trace ID
- The Policy Engine evaluates it (rate limit, budget cap, tool restrictions)
- The LLM responds
- The OTel Collector captures cost, latency, and PII scan results
- The Episode Store records the full interaction as a replayable episode
Everything green. This is the happy path.
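To make the happy path concrete, here is a minimal sketch of what one recorded interaction might carry as it moves through the stages above. All names and values here are illustrative, not AIR Blackbox's actual schema or API.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Episode:
    trace_id: str          # assigned by the Gateway
    policy_verdict: str    # result from the Policy Engine
    response: str          # the LLM's answer
    metrics: dict = field(default_factory=dict)  # captured by the OTel Collector

def handle_request(prompt: str) -> Episode:
    trace_id = uuid.uuid4().hex
    verdict = "allow"  # rate limit, budget cap, and tool checks all passed
    response = f"(model output for: {prompt!r})"
    metrics = {"cost_usd": 0.0004, "latency_ms": 820, "pii_fields": 0}
    return Episode(trace_id, verdict, response, metrics)

episode = handle_request("Summarize Q4 revenue")
```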
Scenario 2: Runaway Loop 🔴
This is the one that saves you money. An agent gets stuck making the same request over and over — "Check order status #4521" five times in a row.
The OTel Collector detects the repeated pattern at request 3. By request 4, it triggers the kill switch. Request 5 gets blocked before it ever reaches the LLM.
Estimated savings in the demo: $47. In production, we've seen runaway agents burn through hundreds of dollars in minutes.
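The core idea behind loop detection is simple enough to sketch: fingerprint each request, count repeats, and trip a kill switch past a threshold. This toy version is an assumption about the mechanism, not the collector's real processor; the threshold and names are made up.

```python
import hashlib
from collections import Counter

class LoopDetector:
    def __init__(self, kill_threshold: int = 4):
        self.counts = Counter()
        self.kill_threshold = kill_threshold
        self.killed = False

    def allow(self, agent_id: str, prompt: str) -> bool:
        if self.killed:
            return False  # kill switch tripped: block before the LLM is reached
        fingerprint = hashlib.sha256(f"{agent_id}:{prompt}".encode()).hexdigest()
        self.counts[fingerprint] += 1
        if self.counts[fingerprint] >= self.kill_threshold:
            self.killed = True  # repeated request trips the switch
        return True

detector = LoopDetector()
results = [detector.allow("agent-1", "Check order status #4521") for _ in range(5)]
# requests 1-4 pass through; request 4 trips the switch, request 5 is blocked
```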
Scenario 3: PII in Prompt
An agent sends a prompt containing an email address, a Social Security number, and a credit card number. This happens more often than you'd think — agents pulling data from databases or CRMs and stuffing it into prompts.
The OTel Collector detects all three PII fields and redacts them before the trace reaches your observability backend. The redacted fields are hashed so you can still correlate across traces without storing raw sensitive data.
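A simplified version of that redact-and-hash step can be written with regexes and truncated SHA-256 digests, so the same value always maps to the same placeholder. The patterns and placeholder format below are assumptions for illustration, not the collector's actual processor.

```python
import hashlib
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        def replace(m, label=label):
            # hash the raw value so traces stay correlatable without storing it
            digest = hashlib.sha256(m.group().encode()).hexdigest()[:8]
            return f"<{label}:{digest}>"
        text = pattern.sub(replace, text)
    return text

prompt = "Contact jane@example.com, SSN 123-45-6789, card 4111-1111-1111-1111"
clean = redact(prompt)
```

Because the hash is deterministic, two traces containing the same email address carry the same placeholder and can still be joined in your backend.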
Scenario 4: Dangerous Tool Call
An agent tries to execute `rm -rf /var/data/*` via a tool call. The Policy Engine blocks it instantly — the tool is on the restricted list and it's a destructive filesystem operation. The request never reaches the LLM. Cost: $0.00.
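A tool-restriction check like this one amounts to a denylist plus pattern matching on the arguments. The rule format below is a guess for illustration, not the Policy Engine's real schema.

```python
import re

RESTRICTED_TOOLS = {"shell_exec", "delete_file"}
DESTRUCTIVE_PATTERNS = [re.compile(r"\brm\s+-rf?\b")]

def evaluate_tool_call(tool: str, arguments: str) -> dict:
    if tool in RESTRICTED_TOOLS:
        return {"verdict": "block", "reason": "tool on restricted list", "cost_usd": 0.0}
    for pattern in DESTRUCTIVE_PATTERNS:
        if pattern.search(arguments):
            return {"verdict": "block", "reason": "destructive filesystem operation", "cost_usd": 0.0}
    # allowed: cost is only known after the LLM call actually happens
    return {"verdict": "allow", "reason": "", "cost_usd": None}

decision = evaluate_tool_call("shell_exec", "rm -rf /var/data/*")
```

Blocking at this layer is why the demo shows a cost of $0.00: the request is rejected before any tokens are spent.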
How It Works
AIR Blackbox sits between your AI agent and the LLM provider as an OpenAI-compatible proxy:
```
Your Agent → Gateway → Policy Engine → LLM Provider
                ↓             ↓
         OTel Collector  Episode Store
```
The Gateway (Go) intercepts every LLM call and produces structured OpenTelemetry traces. It's OpenAI-compatible, so your agents don't need code changes — just point them at a different base URL.
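In practice, "point them at a different base URL" can be as small as one environment variable, since the OpenAI SDKs honor an overridable base URL. The port and path below are assumptions; use whatever your gateway deployment actually exposes.

```shell
# Route an existing agent through the local AIR Blackbox gateway
# instead of api.openai.com (port/path assumed for this example).
export OPENAI_BASE_URL="http://localhost:8080/v1"
```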
The Policy Engine (Python/FastAPI) evaluates every request against your rules in real time. Rate limits, budget caps, tool restrictions, content matching — all configurable.
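As a sketch of what one such rule looks like, here is a toy per-agent budget cap: track cumulative spend and refuse any request that would push it over the limit. The class, limits, and field names are invented for the example.

```python
class BudgetPolicy:
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent = {}  # agent_id -> cumulative spend in USD

    def check(self, agent_id: str, estimated_cost_usd: float) -> bool:
        spent = self.spent.get(agent_id, 0.0)
        if spent + estimated_cost_usd > self.cap_usd:
            return False  # block: this request would exceed the budget cap
        self.spent[agent_id] = spent + estimated_cost_usd
        return True

policy = BudgetPolicy(cap_usd=1.00)
allowed = [policy.check("agent-1", 0.30) for _ in range(4)]
# three requests fit under the $1 cap; the fourth is blocked
```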
The OTel Collector runs custom processors for:
- PII redaction — scrub sensitive data before it hits your trace backend
- Semantic normalization — consistent attribute names across providers
- Cost tracking — per-request and cumulative spend
- Loop detection — catch runaway agents before they drain your budget
The Episode Store (Python/FastAPI) groups raw traces into task-level episodes. Think of it like a DVR for your agent — you can rewind and replay exactly what happened during an incident.
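The grouping step itself can be pictured as bucketing raw spans by trace ID and sorting each bucket into replay order. Field names here are illustrative, not the store's real data model.

```python
from collections import defaultdict

spans = [
    {"trace_id": "t1", "ts": 2, "name": "llm.call"},
    {"trace_id": "t1", "ts": 1, "name": "policy.check"},
    {"trace_id": "t2", "ts": 1, "name": "policy.check"},
    {"trace_id": "t1", "ts": 3, "name": "tool.call"},
]

def group_episodes(spans):
    episodes = defaultdict(list)
    for span in spans:
        episodes[span["trace_id"]].append(span)
    # replay order: sort each episode's spans by timestamp
    return {tid: sorted(ep, key=lambda s: s["ts"]) for tid, ep in episodes.items()}

episodes = group_episodes(spans)
```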
Get Running in 5 Minutes
```bash
git clone https://github.com/airblackbox/gateway.git
cd gateway
cp .env.example .env   # add your OPENAI_API_KEY
docker compose up --build
```

Then install the SDK and wrap your client:

```bash
pip install air-blackbox-sdk
```

```python
from openai import OpenAI
import air

client = air.air_wrap(OpenAI())

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize Q4 revenue"}],
)
# Every call is now recorded with a full audit trail
```
That's it. One wrapper function. Your agent code stays the same.
Framework Integrations
AIR Blackbox works with whatever you're already using:
```python
# LangChain
from air.integrations.langchain import air_langchain_llm
llm = air_langchain_llm("gpt-4o-mini")

# CrewAI
from air.integrations.crewai import air_crewai_llm
llm = air_crewai_llm("gpt-4o-mini")

# OpenAI Agents SDK
from air.integrations.openai_agents import air_openai_agents_provider
provider = air_openai_agents_provider()

# AutoGen
from air.integrations.autogen import air_autogen_config
config = air_autogen_config("gpt-4o-mini")
```
What It's Not
AIR Blackbox is not an agent framework. It doesn't build agents, orchestrate tasks, or manage prompts. It's infrastructure — the observability and governance layer for teams that already have agents running and need to answer:
- What did the agent do?
- Why did it make that decision?
- Did it leak any sensitive data?
- How much did it cost?
- Can I replay the incident?
The Stack
| Component | Language | What It Does |
|---|---|---|
| Gateway | Go | OpenAI-compatible proxy, OTel trace emission |
| Policy Engine | Python | Real-time policy evaluation, kill switches |
| Episode Store | Python | Trace → episode grouping, replay |
| OTel Collector | Go | PII redaction, cost metrics, loop detection |
| Python SDK | Python | `air_wrap()` + framework integrations |
| Platform | Docker | One-command full stack deployment |
22 repos. 700+ tests. CI on every push. Apache-2.0.
Try the Demo
Run all four scenarios. Watch the traces light up. Then clone the repo and try it with your own agents.
If you have questions about the architecture, the OTel pipeline, or how to write custom policies, drop a comment or open a discussion on GitHub.
AIR Blackbox is open-source under Apache 2.0. Star us on GitHub if this is useful to you.