Nilofer 🚀

Posted on Jun 23

Context Compaction Visualizer: See Exactly What Your AI Agent Forgot Before It Costs You

#llm #opensource #machinelearning #claude

When an AI agent runs for many turns, it eventually hits its context limit and must compress or discard earlier messages. This usually happens invisibly, yet it matters: lost context can cause the agent to forget constraints, user preferences, or prior decisions. The framework moves on. The agent keeps running. And somewhere in those discarded turns is a security finding, a constraint, or a decision that the rest of the session quietly proceeds without.

Context Compaction Visualizer makes that process visible - not after something breaks, but as an inspectable artifact of every run. It is a visualization platform that helps teams understand how long-running AI agents manage and compress context over time. You upload execution traces from LangSmith, OpenTelemetry, AgentOps, or the platform's custom JSON format, then explore exactly which context was retained, compressed, or discarded, and at what cost.

What This Platform Does

The core problem is that compaction happens inside the framework's internals. There is no standard output that tells you which messages survived, which were summarized, and which were dropped - or what any of that cost in tokens. This platform reconstructs that picture from execution traces.

You upload a trace file and select its format, and the platform rebuilds the full session: every message at every turn, its fate - retained verbatim, summarized, or discarded - and any compaction events that occurred along the way. A D3.js stacked-bar timeline renders token consumption across all turns with color-coded regions for each outcome. A session replay steps through turn by turn, surfacing a diff at the exact point a compaction event fires. Token analytics compute the total cost and compression efficiency of the session. A Claude-powered information loss detector scores the risk of each compaction event and names specifically what may have been lost.
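To make the timeline concrete, here is a minimal sketch of how per-turn token totals could be grouped by message fate before being handed to a stacked-bar chart. The `MessageRecord` shape, the fate labels as strings, and the `tokens_by_turn` helper are assumptions for illustration, not the project's actual models.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three outcomes named in the article; the string labels are an assumption.
FATES = ("retained", "summarized", "discarded")


@dataclass
class MessageRecord:
    """Hypothetical record: one message, the turn it belongs to, and its fate."""
    turn: int
    tokens: int
    fate: str


def tokens_by_turn(records):
    """Return {turn: {fate: token_total}}, with every fate present for each turn."""
    totals = defaultdict(lambda: dict.fromkeys(FATES, 0))
    for rec in records:
        if rec.fate not in FATES:
            raise ValueError(f"unknown fate: {rec.fate!r}")
        totals[rec.turn][rec.fate] += rec.tokens
    # Sort by turn so the bars come out in session order.
    return {turn: totals[turn] for turn in sorted(totals)}


records = [
    MessageRecord(turn=1, tokens=120, fate="retained"),
    MessageRecord(turn=1, tokens=300, fate="discarded"),
    MessageRecord(turn=2, tokens=80, fate="summarized"),
]
bars = tokens_by_turn(records)
```

Each inner dictionary maps onto one stacked bar, with one segment per fate.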

When two traces are available - two different agents, or the same agent under two different compaction strategies - a comparative view places them side by side to show which preserved more context at lower cost.
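As a rough sketch of what "more context at lower cost" could mean in numbers, the snippet below compares two trace summaries by retention ratio and cost. The `TraceSummary` fields, the definition of retention as retained tokens over total tokens, and the sample figures are all assumptions; the platform's own metrics may be defined differently.

```python
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical per-trace summary used only for this illustration."""
    name: str
    retained_tokens: int
    total_tokens: int
    cost_usd: float

    @property
    def retention(self) -> float:
        # Share of all tokens that survived compaction (assumed definition).
        if self.total_tokens == 0:
            return 0.0
        return self.retained_tokens / self.total_tokens


def compare(a: TraceSummary, b: TraceSummary) -> dict:
    """Name the trace that retained more and the trace that cost less."""
    return {
        "more_retained": a.name if a.retention >= b.retention else b.name,
        "cheaper": a.name if a.cost_usd <= b.cost_usd else b.name,
    }


# Made-up example values, not measurements from the project.
window = TraceSummary("sliding-window", 4_000, 10_000, 0.09)
summary = TraceSummary("summarize-early", 7_000, 10_000, 0.06)
verdict = compare(window, summary)
```

When the two keys name different traces, neither strategy dominates and the trade-off has to be judged case by case.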

Quick Start

Prerequisites

Python 3.11 or newer
Node.js 18 or newer
An Anthropic API key (optional - only the Info Loss Detection feature needs it)

Set up the environment

cp .env.example .env
# Optionally add ANTHROPIC_API_KEY=sk-ant-your-key-here for info loss detection

Run the backend

cd backend
pip install -r requirements.txt
uvicorn main:app --reload --host 0.0.0.0 --port 8000

The API starts at http://localhost:8000. Interactive docs are available at http://localhost:8000/docs.

Run the frontend

The frontend runs in a separate process:

cd frontend
npm install
npm run dev

The UI opens at http://localhost:5173.

Run with Docker

cp .env.example .env
docker compose up --build

The backend runs on port 8000, and nginx serves the frontend on port 5173.

Running Tests

cd backend && python -m pytest tests/ -v

The suite has 29 tests covering all four parsers, edge cases, and the token counter. It runs in under 100 ms because nothing in it hits an external service.

API Reference

Supported Trace Formats

The platform accepts four input formats, selectable via a dropdown on upload. Each has its own parser that handles the vendor-specific schema and reduces it to the normalized structure.

LangSmith - Parses JSON exports from the LangSmith tracing platform. The parser extracts runs, the messages inside each run, token counts from usage metadata, and any chain-level summarization events.

OpenTelemetry - Parses OTEL-format JSON spans. The parser traverses the span tree, reconstructs message history from span attributes, and identifies compaction events from span names containing "compress" or "summarize".

AgentOps - Parses AgentOps session JSON exports. The parser handles the session-level event structure and normalizes message roles from AgentOps event types.

Custom JSON - A generic format for any agent framework not listed above. It expects a messages array with role, content, and optional tokens and timestamp fields. Any event with type: "compaction" or type: "summarization" is treated as a compaction event.
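The two detection rules described above (span names for OpenTelemetry, event types for custom JSON) can be sketched as simple predicates. The function names are hypothetical, and the case-insensitive match on span names is an assumption; the real parsers may be stricter or looser.

```python
# Keywords and event types as stated in the format descriptions.
COMPACTION_SPAN_KEYWORDS = ("compress", "summarize")
COMPACTION_EVENT_TYPES = {"compaction", "summarization"}


def is_compaction_span(span_name: str) -> bool:
    """OpenTelemetry rule: the span name contains "compress" or "summarize".

    Lowercasing before matching is an assumption, not confirmed behavior.
    """
    lowered = span_name.lower()
    return any(keyword in lowered for keyword in COMPACTION_SPAN_KEYWORDS)


def is_compaction_event(event: dict) -> bool:
    """Custom JSON rule: the event's type is "compaction" or "summarization"."""
    return event.get("type") in COMPACTION_EVENT_TYPES


span_names = ["agent.compress_history", "llm.call", "memory.summarize"]
flagged = [name for name in span_names if is_compaction_span(name)]
```

Substring matching is simple but coarse: a span named after an unrelated "compress" step would also be flagged, so naming spans deliberately helps.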

Project Structure

context-compaction-visualizer/
├── backend/
│   ├── main.py                  # FastAPI app, 7 endpoints
│   ├── models.py                # Trace, Message, CompactionEvent ORM
│   ├── schemas.py               # Pydantic validation schemas
│   ├── database.py              # SQLAlchemy + SQLite setup
│   ├── parsers/
│   │   ├── langsmith_parser.py
│   │   ├── otel_parser.py
│   │   ├── agentops_parser.py
│   │   └── custom_parser.py
│   ├── services/
│   │   ├── context_analyzer.py  # Claude-powered info loss detection
│   │   └── token_counter.py     # Token counting + cost estimates
│   ├── requirements.txt
│   ├── Dockerfile
│   └── tests/
│       ├── test_parsers.py      # 29 tests covering all 4 parsers
│       └── fixtures/
│           ├── langsmith_trace.json
│           ├── otel_trace.json
│           ├── agentops_trace.json
│           └── custom_trace.json
├── frontend/
│   ├── src/
│   │   ├── App.tsx              # Upload/Timeline/Replay/Analytics/Loss/Compare tabs
│   │   ├── components/
│   │   │   ├── TraceUploader.tsx
│   │   │   ├── ContextTimeline.tsx   # D3.js stacked bar chart
│   │   │   ├── SessionReplay.tsx     # Turn navigation + compaction diff
│   │   │   ├── TokenAnalytics.tsx
│   │   │   ├── InfoLossDetector.tsx
│   │   │   └── ComparativeView.tsx
│   │   ├── hooks/useD3.ts
│   │   ├── api/client.ts
│   │   └── types/index.ts
│   ├── Dockerfile
│   ├── package.json
│   └── vite.config.ts
├── docker-compose.yml
└── .env.example

The file structure reflects the normalization design directly. Every file under backend/parsers/ handles one vendor's schema and outputs the same structure. Nothing downstream - not main.py, not any frontend component - needs to know which parser ran. The two services, context_analyzer.py and token_counter.py, sit after all four parsers and only ever see the normalized output.

Key Design Decisions

Parser normalization - Each observability platform has a fundamentally different schema. Rather than handling platform-specific quirks in every component, all four parsers produce an identical normalized structure. This means the timeline, replay, analytics, and comparison views have no knowledge of the original format.

Graceful Claude fallback - The Info Loss Detector calls the Anthropic API only when ANTHROPIC_API_KEY is set. Without a key, it returns analysis_available: false with a clear message rather than failing. The rest of the platform works fully without any API key.
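The fallback can be sketched roughly as follows. Only the `ANTHROPIC_API_KEY` variable and the `analysis_available` field come from the description above; the function name, the `analyzer` callable standing in for the real API call, the `message` text, and the `result` key are assumptions.

```python
import os
from typing import Callable, Mapping, Optional


def detect_info_loss(
    event: dict,
    analyzer: Callable[[dict], dict],
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Run the analysis only when an API key is configured.

    `analyzer` is a placeholder for whatever calls the model; it is injected
    so this sketch runs without network access.
    """
    env = os.environ if env is None else env
    if not env.get("ANTHROPIC_API_KEY"):
        # Degrade gracefully instead of raising: the rest of the app keeps working.
        return {
            "analysis_available": False,
            "message": "Set ANTHROPIC_API_KEY to enable info loss detection.",
        }
    return {"analysis_available": True, "result": analyzer(event)}


# Without a key, the analyzer is never called.
no_key = detect_info_loss({"dropped_tokens": 77_000}, lambda e: {}, env={})
```

Returning a structured "not available" response lets the frontend show an explanatory state rather than an error.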

D3.js integration via hook - The useD3.ts hook manages D3's selection lifecycle within React's rendering model. D3 takes ownership of the SVG element inside the hook's effect, while React manages the wrapping div and props. This avoids the common conflict between React's virtual DOM and D3's direct DOM manipulation.

Centralized cost estimates - Token counts and cost calculations happen in token_counter.py using verified Claude pricing - $3 per million input tokens, $15 per million output tokens - defined as constants in one place, making them easy to update if pricing changes.
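A minimal sketch of that arrangement, using the two prices quoted above. The constant and function names are hypothetical; the actual token_counter.py may be organized differently.

```python
# Prices as quoted in the article, in USD per million tokens.
INPUT_USD_PER_MILLION = 3.00
OUTPUT_USD_PER_MILLION = 15.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate session cost from token counts using the constants above."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_USD_PER_MILLION + output_tokens * OUTPUT_USD_PER_MILLION
    return total / 1_000_000


# One million tokens in each direction: 3 + 15 dollars.
print(estimate_cost_usd(1_000_000, 1_000_000))  # prints 18.0
```

Because both prices live in one place, a pricing change is a two-line edit rather than a search across components.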

Environment Variables

Verified Results

The backend ships with 29 tests covering all four trace parsers, realistic multi-turn fixture data for each format, edge cases like empty inputs and missing fields, and the token counter. All tests run in under 100ms since no external services are called. The frontend builds to a 238 KB JS bundle across 600 modules.

For the info-loss detector, the ContextAnalyzer was run (using DeepSeek V4 Flash via OpenRouter for this verification pass) against a real compaction event that had dropped 77,000 tokens from a security code review session. It returned an overall risk score of 0.85 and flagged two losses. The higher-risk item, scored 0.90, was the permanent loss of three specific JWT authentication findings - a missing expiry check, absent refresh token rotation, and a weak secret key - detail precise enough that no summary would have preserved it. The second item, scored 0.70, flagged the loss of 23 tool call exchanges' worth of reasoning context. Both came back with concrete recommended actions, not generic advice.

How I Built This Using NEO

This project was built autonomously using NEO. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.

The requirement was a platform that ingests execution traces from any of the major agent observability tools, reconstructs what an agent's context looked like turn by turn, and uses an LLM to flag when something important got dropped during compaction. NEO planned and produced the entire codebase - four format parsers that each reduce a different vendor schema into one normalized structure, a FastAPI backend with seven endpoints wired to SQLAlchemy models, two backend services handling token counting and Claude-powered info loss detection, a full React and TypeScript frontend with D3.js visualizations across six components, and a 29-test suite with realistic multi-turn fixtures for all four formats. The repo's plans/ directory and ORCHESTRATOR_LOG.md document that build run directly.

The result is a fully working visualization platform that takes a raw trace file in and gives you back a complete picture of what your agent remembered, what it forgot, and what that cost.

How You Can Use This With NEO

Audit any long-running agent for context loss.
If a LangSmith, OpenTelemetry, or AgentOps trace exists for an agent run, it can be dropped straight into the platform. The timeline and session replay immediately show which turns survived compaction and which did not - no instrumentation changes, no code modifications to the agent itself.

Benchmark compaction strategies before shipping.
When evaluating two different agent configurations or memory strategies, both traces can be uploaded and placed side by side in the comparative view. The platform surfaces which strategy retained more context at lower token cost, turning a subjective comparison into a measurable one.

Catch silent information loss in security or compliance-sensitive agents.
The Claude-powered info loss detector scores each compaction event and flags specific content that may have been dropped - as demonstrated with the JWT authentication findings in the verified results. Any agent operating over sensitive or constraint-heavy sessions can be run through this check before the output is trusted. This requires ANTHROPIC_API_KEY to be set; without it the platform returns analysis_available: false for this feature.

Use the custom JSON format to bring any agent framework in.
Agents not running on LangSmith, OpenTelemetry, or AgentOps can still feed into the platform by logging to the custom JSON format - a messages array with role, content, and optional tokens and timestamp fields. Any agent framework that can write JSON can produce a trace this platform accepts.
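For example, an agent could emit a trace along these lines. This is a sketch under stated assumptions: the helper names are hypothetical, the ISO 8601 timestamp format is a guess (the article does not specify one), and the exact top-level layout the parser expects should be checked against the fixture in backend/tests/fixtures/custom_trace.json.

```python
import json
from datetime import datetime, timezone


def make_message(role, content, tokens=None, timestamp=None):
    """Build one message dict; optional fields are left out when not provided."""
    message = {"role": role, "content": content}
    if tokens is not None:
        message["tokens"] = tokens
    if timestamp is not None:
        message["timestamp"] = timestamp
    return message


def build_trace(messages):
    """Serialize a list of messages under a top-level "messages" array."""
    return json.dumps({"messages": messages}, indent=2)


# Arbitrary fixed timestamp for the example; ISO 8601 is an assumed format.
stamp = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc).isoformat()
trace_json = build_trace([
    make_message("user", "Review auth.py for security issues.", tokens=9, timestamp=stamp),
    make_message("assistant", "The JWT expiry check is missing."),
])
```

Writing `trace_json` to a file gives you something to upload with the Custom JSON option selected.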

Final Notes

Compaction is designed to be invisible - the agent keeps running, the framework handles the limit, and nothing interrupts the workflow. The cost of that invisibility is that when something is silently dropped, there is no record of what it was or what it was worth. Context Compaction Visualizer produces that record.

The code is at https://github.com/dakshjain-1616/Context-Compaction-Visualizer
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code

Top comments (4)

Alex Shev • Jun 23

Context compaction needs this kind of visibility. The scary failure is not that the agent forgets something; it is that the user cannot see what was compressed away before the next action depends on it.

Nazar Boyko • Jun 24

The comparison view is the feature I'd reach for first. Showing what got dropped after the fact is useful for a postmortem, but putting two compaction strategies side by side turns "which memory setup is better" into a number you can check before shipping. One honest tension worth noting is that the loss detector uses a model to judge what a model dropped, so its own summary of the damage is lossy in the same way. Even so, a rough score that names three specific findings beats the current default, which is finding out later that the agent forgot them.

Raffaele Zarrelli • Jun 24

What I like here is that you turned compaction from a silent event into an inspectable artifact, that alone moves the conversation forward. The angle I would add sits one step upstream: the visualizer shows what got dropped, but the sharper question it surfaces is why a load-bearing constraint was in the compaction-eligible window at all. If a decision or constraint lives as explicit operating state in a file the agent reloads by relevance, compaction never gets a vote on it, because it was never just rolling history hoping to survive the summarizer. So I would almost use your tool as a misplacement detector: if the info-loss scorer keeps flagging the same kind of finding, that is the tell it belongs in durable inspectable state, not in the window. That is the whole bet behind cowork-os, keep the load-bearing context out of the window and on files you can read, MIT and one command for Claude Cowork or Code, a star helps me prioritize if the framing lands. Of the events your detector scores high-risk, how many are ones that arguably should never have been in the compaction path to begin with?
