AIR Blackbox is an open-source observability layer that records, replays, and enforces policies on every LLM call your AI agents make. Try the live demo.
Your AI agent just sent an email, called an API, or moved money. Someone asks: "Show me exactly what it saw and why it made that decision."
Can you answer that today?
If you're running AI agents in production — LangChain chains, CrewAI crews, OpenAI function calling, AutoGen teams — you've probably hit this wall. The agent did something. Logs show fragments. Token spend spiked. But there's no complete record of the full decision chain.
We built AIR Blackbox to fix that. It's the flight recorder for autonomous AI agents — like the black box on an airplane, but for your LLM calls.
Try it right now (no install needed): Live Interactive Demo
What Does the Demo Show?
The hosted demo lets you run four scenarios and watch what happens inside the AIR Blackbox pipeline in real time:
Scenario 1: Normal Request
A standard agent request flows through the system. You'll see:
- The Gateway intercepts the call and assigns a trace ID
- The Policy Engine evaluates it (rate limit, budget cap, tool restrictions)
- The LLM responds
- The OTel Collector captures cost, latency, and PII scan results
- The Episode Store records the full interaction as a replayable episode
Everything green. This is the happy path.
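To make the happy path concrete, here is a minimal sketch of what one recorded interaction might carry as it moves through the stages above. All names and values here are illustrative, not AIR Blackbox's actual schema or API.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Episode:
    trace_id: str          # assigned by the Gateway
    policy_verdict: str    # result from the Policy Engine
    response: str          # the LLM's answer
    metrics: dict = field(default_factory=dict)  # captured by the OTel Collector

def handle_request(prompt: str) -> Episode:
    trace_id = uuid.uuid4().hex
    verdict = "allow"  # rate limit, budget cap, and tool checks all passed
    response = f"(model output for: {prompt!r})"
    metrics = {"cost_usd": 0.0004, "latency_ms": 820, "pii_fields": 0}
    return Episode(trace_id, verdict, response, metrics)

episode = handle_request("Summarize Q4 revenue")
```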
Scenario 2: Runaway Loop 🔴
This is the one that saves you money. An agent gets stuck making the same request over and over — "Check order status #4521" five times in a row.
The OTel Collector detects the repeated pattern at request 3. By request 4, it triggers the kill switch. Request 5 gets blocked before it ever reaches the LLM.
Estimated savings in the demo: $47. In production, we've seen runaway agents burn through hundreds of dollars in minutes.
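The core idea behind loop detection is simple enough to sketch: fingerprint each request, count repeats, and trip a kill switch past a threshold. This toy version is an assumption about the mechanism, not the collector's real processor; the threshold and names are made up.

```python
import hashlib
from collections import Counter

class LoopDetector:
    def __init__(self, kill_threshold: int = 4):
        self.counts = Counter()
        self.kill_threshold = kill_threshold
        self.killed = False

    def allow(self, agent_id: str, prompt: str) -> bool:
        if self.killed:
            return False  # kill switch tripped: block before the LLM is reached
        fingerprint = hashlib.sha256(f"{agent_id}:{prompt}".encode()).hexdigest()
        self.counts[fingerprint] += 1
        if self.counts[fingerprint] >= self.kill_threshold:
            self.killed = True  # repeated request trips the switch
        return True

detector = LoopDetector()
results = [detector.allow("agent-1", "Check order status #4521") for _ in range(5)]
# requests 1-4 pass through; request 4 trips the switch, request 5 is blocked
```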
Scenario 3: PII in Prompt
An agent sends a prompt containing an email address, a Social Security number, and a credit card number. This happens more often than you'd think — agents pulling data from databases or CRMs and stuffing it into prompts.
The OTel Collector detects all three PII fields and redacts them before the trace reaches your observability backend. The redacted fields are hashed so you can still correlate across traces without storing raw sensitive data.
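A simplified version of that redact-and-hash step can be written with regexes and truncated SHA-256 digests, so the same value always maps to the same placeholder. The patterns and placeholder format below are assumptions for illustration, not the collector's actual processor.

```python
import hashlib
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        def replace(m, label=label):
            # hash the raw value so traces stay correlatable without storing it
            digest = hashlib.sha256(m.group().encode()).hexdigest()[:8]
            return f"<{label}:{digest}>"
        text = pattern.sub(replace, text)
    return text

prompt = "Contact jane@example.com, SSN 123-45-6789, card 4111-1111-1111-1111"
clean = redact(prompt)
```

Because the hash is deterministic, two traces containing the same email address carry the same placeholder and can still be joined in your backend.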
Scenario 4: Dangerous Tool Call
An agent tries to execute `rm -rf /var/data/*` via a tool call. The Policy Engine blocks it instantly — the tool is on the restricted list and it's a destructive filesystem operation. The request never reaches the LLM. Cost: $0.00.
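A tool-restriction check like this one amounts to a denylist plus pattern matching on the arguments. The rule format below is a guess for illustration, not the Policy Engine's real schema.

```python
import re

RESTRICTED_TOOLS = {"shell_exec", "delete_file"}
DESTRUCTIVE_PATTERNS = [re.compile(r"\brm\s+-rf?\b")]

def evaluate_tool_call(tool: str, arguments: str) -> dict:
    if tool in RESTRICTED_TOOLS:
        return {"verdict": "block", "reason": "tool on restricted list", "cost_usd": 0.0}
    for pattern in DESTRUCTIVE_PATTERNS:
        if pattern.search(arguments):
            return {"verdict": "block", "reason": "destructive filesystem operation", "cost_usd": 0.0}
    # allowed: cost is only known after the LLM call actually happens
    return {"verdict": "allow", "reason": "", "cost_usd": None}

decision = evaluate_tool_call("shell_exec", "rm -rf /var/data/*")
```

Blocking at this layer is why the demo shows a cost of $0.00: the request is rejected before any tokens are spent.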
How It Works
AIR Blackbox sits between your AI agent and the LLM provider as an OpenAI-compatible proxy:
```
Your Agent → Gateway → Policy Engine → LLM Provider
                ↓             ↓
         OTel Collector  Episode Store
```
The Gateway (Go) intercepts every LLM call and produces structured OpenTelemetry traces. It's OpenAI-compatible, so your agents don't need code changes — just point them at a different base URL.
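In practice, "point them at a different base URL" can be as small as one environment variable, since the OpenAI SDKs honor an overridable base URL. The port and path below are assumptions; use whatever your gateway deployment actually exposes.

```shell
# Route an existing agent through the local AIR Blackbox gateway
# instead of api.openai.com (port/path assumed for this example).
export OPENAI_BASE_URL="http://localhost:8080/v1"
```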
The Policy Engine (Python/FastAPI) evaluates every request against your rules in real time. Rate limits, budget caps, tool restrictions, content matching — all configurable.
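As a sketch of what one such rule looks like, here is a toy per-agent budget cap: track cumulative spend and refuse any request that would push it over the limit. The class, limits, and field names are invented for the example.

```python
class BudgetPolicy:
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent = {}  # agent_id -> cumulative spend in USD

    def check(self, agent_id: str, estimated_cost_usd: float) -> bool:
        spent = self.spent.get(agent_id, 0.0)
        if spent + estimated_cost_usd > self.cap_usd:
            return False  # block: this request would exceed the budget cap
        self.spent[agent_id] = spent + estimated_cost_usd
        return True

policy = BudgetPolicy(cap_usd=1.00)
allowed = [policy.check("agent-1", 0.30) for _ in range(4)]
# three requests fit under the $1 cap; the fourth is blocked
```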
The OTel Collector runs custom processors for:
- PII redaction — scrub sensitive data before it hits your trace backend
- Semantic normalization — consistent attribute names across providers
- Cost tracking — per-request and cumulative spend
- Loop detection — catch runaway agents before they drain your budget
The Episode Store (Python/FastAPI) groups raw traces into task-level episodes. Think of it like a DVR for your agent — you can rewind and replay exactly what happened during an incident.
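The grouping step itself can be pictured as bucketing raw spans by trace ID and sorting each bucket into replay order. Field names here are illustrative, not the store's real data model.

```python
from collections import defaultdict

spans = [
    {"trace_id": "t1", "ts": 2, "name": "llm.call"},
    {"trace_id": "t1", "ts": 1, "name": "policy.check"},
    {"trace_id": "t2", "ts": 1, "name": "policy.check"},
    {"trace_id": "t1", "ts": 3, "name": "tool.call"},
]

def group_episodes(spans):
    episodes = defaultdict(list)
    for span in spans:
        episodes[span["trace_id"]].append(span)
    # replay order: sort each episode's spans by timestamp
    return {tid: sorted(ep, key=lambda s: s["ts"]) for tid, ep in episodes.items()}

episodes = group_episodes(spans)
```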
Get Running in 5 Minutes
```bash
git clone https://github.com/airblackbox/gateway.git
cd gateway
cp .env.example .env   # add your OPENAI_API_KEY
docker compose up --build
```

Then install the SDK and wrap your client:

```bash
pip install air-blackbox-sdk
```

```python
from openai import OpenAI
import air

client = air.air_wrap(OpenAI())

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize Q4 revenue"}],
)
# Every call is now recorded with a full audit trail
```
That's it. One wrapper function. Your agent code stays the same.
Framework Integrations
AIR Blackbox works with whatever you're already using:
```python
# LangChain
from air.integrations.langchain import air_langchain_llm
llm = air_langchain_llm("gpt-4o-mini")

# CrewAI
from air.integrations.crewai import air_crewai_llm
llm = air_crewai_llm("gpt-4o-mini")

# OpenAI Agents SDK
from air.integrations.openai_agents import air_openai_agents_provider
provider = air_openai_agents_provider()

# AutoGen
from air.integrations.autogen import air_autogen_config
config = air_autogen_config("gpt-4o-mini")
```
What It's Not
AIR Blackbox is not an agent framework. It doesn't build agents, orchestrate tasks, or manage prompts. It's infrastructure — the observability and governance layer for teams that already have agents running and need to answer:
- What did the agent do?
- Why did it make that decision?
- Did it leak any sensitive data?
- How much did it cost?
- Can I replay the incident?
The Stack
| Component | Language | What It Does |
|---|---|---|
| Gateway | Go | OpenAI-compatible proxy, OTel trace emission |
| Policy Engine | Python | Real-time policy evaluation, kill switches |
| Episode Store | Python | Trace → episode grouping, replay |
| OTel Collector | Go | PII redaction, cost metrics, loop detection |
| Python SDK | Python | `air_wrap()` + framework integrations |
| Platform | Docker | One-command full stack deployment |
22 repos. 700+ tests. CI on every push. Apache-2.0.
Try the Demo
Run all four scenarios. Watch the traces light up. Then clone the repo and try it with your own agents.
If you have questions about the architecture, the OTel pipeline, or how to write custom policies, drop a comment or open a discussion on GitHub.
AIR Blackbox is open-source under Apache 2.0. Star us on GitHub if this is useful to you.