My AI agent system runs 16 teams across 4 different LLM providers. Two months ago, one team silently started hallucinating policy decisions. I caught it in 11 minutes.
Not with Datadog. Not with Honeycomb. With 47 lines of Python writing to a SQLite database.
OpenTelemetry is now working on semantic conventions for LLM tracing. That's great. But I needed this six months ago, so I built my own. Here's the full setup.
TL;DR: A SQLite-backed audit trail for multi-agent AI orchestration logs every LLM call, model routing decision, and bias detection event. 338 audit entries and 108 events exposed 3 silent failures that cost-based monitoring would have missed entirely. The system is 4 tables, runs on a 1GB Oracle Cloud free-tier instance, and replaced what would have been ~$200/month in observability tooling. Total implementation time: one weekend.
The Problem: Flying Blind With Multiple Models
Running one model is simple — you read the output. Running four different LLM models in a single orchestration pipeline creates a debugging problem that single-model setups never encounter. I route tasks across Claude Opus (implementation), Gemini 3.1 (information synthesis), GPT-5.4 (strategy reviews), and Codex (parallel task execution), organized into 16 agent teams that hand work off to each other sequentially.
With four models and 16 teams, you need answers to questions that print() can’t help with:
- Which model handled which step
- How long each step took
- Whether the output was actually used or silently dropped
- What the routing decision was based on
I evaluated existing options before building my own. LangSmith Teams: ~$400/month. Self-hosted Langfuse: requires a Postgres instance with 2GB+ RAM. OpenTelemetry's GenAI semantic conventions: still experimental, no production deployment story.
I needed something that worked today, on a server with 1GB of RAM and a $0 infrastructure budget.
Step 1: Four Tables, One Database
The minimum viable schema for multi-agent observability is four tables: one for raw call logs, one for system events, one for routing decisions, and one for agent memory. The entire schema fits in your head — and that constraint is a feature, not a limitation.
CREATE TABLE audit_log (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp TEXT DEFAULT (datetime('now')),
action TEXT NOT NULL,
agent TEXT,
model TEXT,
input_summary TEXT,
output_summary TEXT,
latency_ms INTEGER,
tokens_in INTEGER,
tokens_out INTEGER,
cost_usd REAL,
metadata TEXT -- JSON blob for anything else
);
CREATE TABLE events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp TEXT DEFAULT (datetime('now')),
event_type TEXT NOT NULL,
source TEXT,
urgency TEXT CHECK(urgency IN ('low','medium','high','critical')),
project TEXT,
payload TEXT -- JSON
);
CREATE TABLE decisions (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp TEXT DEFAULT (datetime('now')),
decision_type TEXT,
context TEXT,
outcome TEXT,
confidence REAL,
model TEXT
);
CREATE TABLE memories (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp TEXT DEFAULT (datetime('now')),
memory_type TEXT CHECK(memory_type IN
('episodic','semantic','procedural','project')),
content TEXT,
source TEXT,
relevance_score REAL
);
Four tables. No migrations framework. No ORM. **SQLite's WAL mode handles concurrent reads from the dashboard while agents write logs** — that's all the concurrency management this pattern needs.
python
import sqlite3
DB_PATH = "~/.claude/jarvis/data/jarvis.db"
def get_db():
conn = sqlite3.connect(DB_PATH)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=5000")
conn.row_factory = sqlite3.Row
return conn
The busy_timeout is critical. Without it, concurrent agent writes will throw database is locked errors. 5 seconds is generous — in practice, writes complete in under 10ms.
Step 2: The Event Bus Pattern
Agents shouldn't know they're being traced. Every agent action flows through a central event bus that routes to logging subscribers independently of the main pipeline.
class ControlPlane:
def emit(self, event_type: str, source: str,
urgency: str, project: str, payload: dict):
"""Every agent action goes through here."""
db = get_db()
# Log the event
db.execute(
"INSERT INTO events (event_type, source, urgency, project, payload) "
"VALUES (?, ?, ?, ?, ?)",
(event_type, source, urgency, project, json.dumps(payload))
)
# Route to subscribers
for subscriber in self._subscribers.get(event_type, []):
subscriber(event_type, source, payload)
db.commit()
**The subscriber pattern is where this gets powerful.** The bias firewall subscribes to all `agent_output` events. The cost tracker subscribes to `model_call` events. The dashboard subscribes to everything. None of them block the agent pipeline — if a subscriber fails, the main pipeline continues and the failure is logged separately.
python
Bias firewall subscribes non-invasively
control_plane.subscribe("agent_output", bias_firewall.check)
control_plane.subscribe("model_call", cost_tracker.record)
control_plane.subscribe("*", dashboard.update)
Step 3: Model Routing With a Paper Trail
Every routing decision gets logged, not just executed. This was the single most valuable architectural choice in the entire system — when something goes wrong, "which model handled this?" becomes a SQL query instead of a debugging session.
class ModelRouter:
ROUTING_TABLE = {
"implementation": {"model": "claude-opus-4-6", "role": "builder"},
"synthesis": {"model": "gemini-3.1-pro", "role": "researcher"},
"strategy": {"model": "gpt-5.4", "role": "chief_of_staff"},
"parallel_exec": {"model": "codex", "role": "worker"},
}
def route_task(self, task_type: str) -> dict:
route = self.ROUTING_TABLE[task_type]
# Log the routing decision
db = get_db()
db.execute(
"INSERT INTO decisions (decision_type, context, outcome, model) "
"VALUES (?, ?, ?, ?)",
("model_routing", task_type, route["role"], route["model"])
)
db.commit()
return route
After two months, I had 343 routing decisions logged. **One `GROUP BY` told a story I never expected: **23% of "strategy" tasks were actually implementation tasks mis-categorized by the dispatcher.** GPT-5.4 was doing Claude's job. Poorly. Nobody noticed until the audit trail made it obvious.
sql
SELECT decision_type, model, COUNT(*) as count,
AVG(confidence) as avg_confidence
FROM decisions
GROUP BY decision_type, model
ORDER BY count DESC;
Step 4: Bias Detection Latency Tracking
A bias firewall that cross-checks agent outputs using multiple models is only useful if you can measure its performance. I run a 4-stage pipeline: Risk Classifier → Claim Extractor → Cross-Model Verifier (Gemini 3.1) → Disagreement Preserver. Each stage gets its own audit entry with latency and detection metadata.
class BiasFirewall:
def check(self, event_type, source, payload):
start = time.monotonic()
risk = self.classify_risk(payload)
claims = self.extract_claims(payload)
verification = self.cross_verify(claims) # Gemini call
elapsed_ms = int((time.monotonic() - start) * 1000)
db = get_db()
db.execute(
"INSERT INTO audit_log "
"(action, agent, model, latency_ms, metadata) "
"VALUES (?, ?, ?, ?, ?)",
(
"bias_check",
source,
"gemini-3.1-pro-preview",
elapsed_ms,
json.dumps({
"risk_level": risk,
"claims_found": len(claims),
"disagreements": verification.get("disagreements", []),
"detection_type": verification.get("bias_type"),
})
)
)
db.commit()
**Real numbers from my audit trail**: the Gemini 3.1 Pro verifier averages 22 seconds per check; the Flash Lite fallback averages 3 seconds. **Detection rate: 100% across 6 bias types** (framing, false consensus, anchoring, availability, confirmation, authority).** I know this because every check is a row in `audit_log`, not a claim in a README.
sql
SELECT
json_extract(metadata, '$.detection_type') as bias_type,
COUNT(*) as detections,
AVG(latency_ms) as avg_latency_ms,
MIN(latency_ms) as min_latency_ms,
MAX(latency_ms) as max_latency_ms
FROM audit_log
WHERE action = 'bias_check'
GROUP BY bias_type;
Step 5: The Dashboard That Costs Nothing
A terminal dashboard (Textual TUI) refreshing every 5 seconds feeds entirely from the SQLite audit trail. Panels for active agents, recent decisions, bias firewall status, and cost tracking. The key architectural property: the dashboard is a pure read consumer — it never writes to the database, and its queries run in under 2ms on a 1GB server.
def get_recent_activity(minutes: int = 30) -> list:
db = get_db()
return db.execute(
"SELECT action, agent, model, latency_ms, timestamp "
"FROM audit_log "
"WHERE timestamp > datetime('now', ?) "
"ORDER BY timestamp DESC LIMIT 50",
(f"-{minutes} minutes",)
).fetchall()
def get_cost_summary(days: int = 7) -> dict:
db = get_db()
rows = db.execute(
"SELECT model, SUM(cost_usd) as total, COUNT(*) as calls "
"FROM audit_log "
"WHERE timestamp > datetime('now', ?) AND cost_usd IS NOT NULL "
"GROUP BY model",
(f"-{days} days",)
).fetchall()
return {r["model"]: {"cost": r["total"], "calls": r["calls"]} for r in rows}
No Grafana. No Prometheus. No InfluxDB. No server process to manage. **The monitoring layer costs $0 to run and requires zero infrastructure beyond the database file that already exists.**
## Step 6: The Silent Failure That Proved the System
Three weeks after deployment, the audit trail caught something no cost-based monitor would have flagged: a strategy review team was producing outputs, but the downstream implementer was ignoring them. Entirely. No errors. No alerts. Just silent data loss.
**The `events` table showed `agent_output` events from the strategy model, but the corresponding `decisions` table had zero `implementation_start` entries referencing those outputs** — a structural gap visible only because both sides of every handoff were being logged independently.
sql
-- Find strategy outputs with no downstream pickup
SELECT e.id, e.timestamp, e.source,
json_extract(e.payload, '$.task_id') as task_id
FROM events e
WHERE e.event_type = 'agent_output'
AND e.source = 'strategy-reviewer'
AND json_extract(e.payload, '$.task_id') NOT IN (
SELECT context FROM decisions
WHERE decision_type = 'implementation_start'
);
Seven tasks over four days. Silently dropped. The implementer agent was receiving the handoff but failing to parse a changed output format from a model update. The audit trail caught it because it tracked both sides of every handoff independently — something a single-stream log would have missed entirely.
What I'd Do Differently
Start with the decisions table, not the audit_log. The audit log is useful for latency and cost analysis, but the decisions table is where you find actual bugs. If I'd built the decisions table first, I would have caught the strategy-implementation gap in days, not weeks.
Add a correlation_id from day one. Tracing a single request across 5 agents currently means joining on timestamps and task IDs. A single UUID per pipeline run would save hours of forensic querying. This is the one thing OpenTelemetry's trace/span model gets absolutely right.
Don't log raw prompts. I initially stored full prompts in input_summary. The database hit 400MB in a week. Summaries and token counts are sufficient for debugging 95% of issues. Store raw data only when you're actively investigating a specific failure.
The Numbers
Cost Comparison
| Approach | Monthly Cost | Setup Time | RAM Required |
|---|---|---|---|
| LangSmith Teams | ~$400 | 2 hours | N/A (hosted) |
| Self-hosted Langfuse | ~$50 (Postgres) | 4–6 hours | 2GB+ |
| Datadog LLM Observability | ~$200 | 1 hour | N/A (hosted) |
| SQLite audit trail | $0 | ~8 hours | <50MB |
System Metrics (after 2 months of production use)
| Metric | Value |
|---|---|
| Audit log entries | 338 |
| Events recorded | 108 |
| Routing decisions tracked | 343 |
| Memory entries | 1,740 |
| Database size (WAL mode) | ~12MB |
| Average write latency | <10ms |
| Silent failures caught | 3 |
| Server specs | 1 OCPU / 1GB RAM (Oracle Cloud Free Tier) |
| Duplicate decisions cleaned (one-time) | 9,486 |
| Bias detection rate | 100% (6/6 types) |
That last row — 9,486 duplicate decisions requiring a one-time cleanup — is what happens when you skip dedup logic early. Add a UNIQUE constraint on (decision_type, context, outcome, timestamp) before you have 10,000 rows to clean up. Learn from my mistake.
FAQ
Why SQLite instead of Postgres?
SQLite runs embedded — no server process, no connection pooling, no port management. On a 1GB instance running 4 concurrent LLM API clients, a Postgres process consuming 200MB of shared buffers is a real constraint. SQLite in WAL mode handles the specific read/write pattern here (dashboard reads while agents write) without contention. The database file is also trivially portable — I scp it locally for deep analysis when needed.
How does this compare to OpenTelemetry's GenAI semantic conventions?
OpenTelemetry's gen_ai.* attributes cover the model call layer well: model name, token counts, latency, finish reason. What they don't cover yet is the orchestration layer — routing decisions, cross-model verification, agent-to-agent handoffs, memory retrieval. This audit trail captures both layers. When OTel's conventions stabilize, the migration path is straightforward: emit OTel spans from the same event bus while keeping the SQLite trail for orchestration-specific data that OTel doesn't address.
Won't this break at scale?
SQLite handles databases up to 281TB. Two months of multi-agent orchestration data in this system is 12MB. At the current write rate (~170 entries/month across all tables), hitting 1GB would take roughly 50 years. If you're running 10,000 agents, switch to Postgres. If you're running 16 agent teams like this setup, SQLite will outlast the project.
How do you query across related tables?
Standard SQL joins, with the advantage that everything is in one database file — no cross-service queries, no distributed tracing backends, no network latency.
SELECT a.action, a.agent, a.latency_ms,
d.decision_type, d.outcome,
e.event_type, e.urgency
FROM audit_log a
LEFT JOIN decisions d ON d.timestamp
BETWEEN datetime(a.timestamp, '-1 second')
AND datetime(a.timestamp, '+1 second')
LEFT JOIN events e ON json_extract(e.payload, '$.task_id') =
json_extract(a.metadata, '$.task_id')
WHERE a.timestamp > datetime('now', '-1 day')
ORDER BY a.timestamp DESC;
The timestamp-based join is crude — a correlation ID would be cleaner. But it works without schema changes.
### What about data retention?
A monthly cleanup archives entries older than 90 days to a compressed backup and removes them from the active database. Three months of data compresses to roughly 800KB.
bash
sqlite3 jarvis.db ".dump" | gzip > "archive_$(date +%Y%m).sql.gz"
sqlite3 jarvis.db "DELETE FROM audit_log WHERE timestamp < datetime('now', '-90 days')"
sqlite3 jarvis.db "VACUUM"
Try It Yourself
You can set this up in under an hour. Here’s the path I’d recommend:
Create the database. Copy the four
CREATE TABLEstatements above intoschema.sql. Runsqlite3 jarvis.db < schema.sql.Enable WAL mode. Run
sqlite3 jarvis.db "PRAGMA journal_mode=WAL"once. This persists across all future connections.Instrument one agent call. Pick your most critical LLM call. Add a single
INSERT INTO audit_logafter it returns with the model name, latency in milliseconds, and token counts. That's the entire day-one requirement.Add the event bus. Wrap agent calls in an
emit()function. Subscribe your logging to it. Agents and observability are now decoupled.Write one diagnostic query. Answer a question you've been guessing at: "Which model is slowest?" or "How many calls per day?" The first useful answer will validate the entire approach.
Add the decisions table last. Once calls and events are flowing, start logging routing decisions explicitly. This is where production bugs actually live.
The Honest Take
This system is not elegant. It's a SQLite file with four tables and Python glue code. No distributed tracing. No visualization layer. No SLA. The cross-table joins use BETWEEN clauses and timestamp proximity as a substitute for proper correlation IDs.
But it caught three silent failures that would have been invisible to cost-based monitoring. It runs on a free-tier cloud instance consuming under 50MB of RAM. And when OpenTelemetry's GenAI conventions reach production stability, the migration path is clear: emit OTel spans from the same event bus and keep the SQLite trail for orchestration-layer data that OTel doesn't yet model.
Sometimes the right observability tool is the one you can build in a weekend, understand completely, and trust at 3am when something breaks.
Have you tried tracing multi-model AI systems? I'd genuinely like to know — did anyone else go the "just use SQLite" route, or did you find a lightweight alternative that worked?
If this saved you from evaluating a $400/mo tracing platform, drop a bookmark. I'm writing a follow-up on the bias firewall pipeline — specifically how cross-model verification between Claude and Gemini catches subtle framing issues that single-model review misses consistently.
Follow for more AI agent infrastructure — real systems, real numbers
Top comments (0)