Nilofer 🚀

Context Time Machine: Forensic Investigation of What Your Agent Actually Saw

Long-running agent sessions fail in a specific way that is hard to debug. The agent runs 40 turns. At turn 38, it gives a wrong answer that ignores something it decided at turn 12. You look at the logs: the turn 12 decision is there. The turn 38 response is there. But you cannot see what the context window looked like at turn 38. Was the turn 12 decision still in context? Was it evicted? Was it there but semantically overwhelmed by 25 other turns?

This is the forensic problem that ContextTimeMachine solves. Unlike real-time session monitoring, it is built for deep post-hoc investigation of a session after it has already run. The key insight it is built on: the context window at any given turn is deterministic given the conversation history. You can reconstruct exactly what the model saw at turn 38, render it interactively, and query it.

Three Investigation Modes

Mode 1 - Timeline Navigator

The primary view is a vertical timeline of all turns in the session. Each turn shows the turn number, agent name if available, turn type, token count at that turn, and a sparkline showing how the context composition changed.

Click any turn to travel to it - the context window at that exact point reconstructs and renders in the main panel. You see exactly what the model saw: every message in order, with token counts, with a red line showing where the context would have been truncated if it exceeded the model's limit. Scrub through turns with keyboard arrows. Watch the context window evolve turn by turn. See turns disappear as eviction happens. See tool results arrive and push older content further back.

Mode 2 - Fact Tracker

You know something specific, a decision made at turn 5, a fact retrieved at turn 15, a user instruction given at turn 3. You want to know: at what turn did this fact leave the context window?

Enter any text snippet in the Fact Tracker search box. ContextTimeMachine embeds it locally using sentence-transformers, then searches every turn's context snapshot for the nearest matching content. It renders a presence chart: a horizontal bar across all turns, colored green where the fact is present and red where it is absent, marking the exact turn where the fact entered context and the exact turn where it left.

This answers the most common debugging question for long agent sessions: "When exactly did the agent stop knowing X?"

Mode 3 - Divergence Finder

You have two agent sessions that started identically but ended differently. One succeeded, one failed. Load both sessions and ContextTimeMachine finds the earliest turn where their context windows diverged, the point at which they started seeing different content, and highlights that turn as the likely root cause of the different outcomes.

It shows a side-by-side comparison of the two context windows at the divergence point with diffed content highlighted. This is the automated version of the manual debugging process every team does when comparing "the run that worked" against "the run that didn't."

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    ContextTimeMachine                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Frontend (React)                                               │
│  ├─ TimelineNavigator    — Turn-by-turn timeline scrubber       │
│  ├─ ContextPanel         — Renders reconstructed context        │
│  ├─ FactTracker          — Fact presence chart                  │
│  └─ DivergenceFinder     — Two-session comparison               │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  FastAPI Backend                                                │
│  ├─ /api/session/load          — Load session from file         │
│  ├─ /api/session/{id}/profile  — Get token profile              │
│  ├─ /api/session/{id}/turn/{n} — Reconstruct context at turn    │
│  ├─ /api/session/{id}/fact     — Track fact presence            │
│  ├─ /api/divergence            — Find divergence point          │
│  └─ /api/sessions              — List all sessions              │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Core Analysis Modules                                          │
│  ├─ SessionLoader        — Load from multiple formats           │
│  ├─ ContextReconstructor — Reconstruct at any turn              │
│  ├─ FactTracker          — Track presence via embeddings        │
│  ├─ DivergenceFinder     — Find divergence points               │
│  ├─ TokenAnalyzer        — Token budget analysis                │
│  └─ EmbeddingService     — Local embeddings (all-MiniLM)        │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Storage                                                        │
│  └─ SQLite DB            — Session snapshots & metadata         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Installation

Prerequisites

  • Python 3.10+
  • pip

Quick Start

# Clone the repository
git clone https://github.com/dakshjain-1616/context-time-machine.git
cd context-time-machine

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install package
pip install -e .

# Start the server
timemachine serve

Open http://localhost:8000 in your browser. The server will automatically open your browser if it can.

Usage

Loading Sessions
Sessions can be loaded from multiple formats; the two most common are shown below.

From LiveContext SQLite Export:

timemachine load --file session.db

From Generic JSON:

timemachine load --file session.json

The generic JSON format expects a turns array where each turn contains a messages list, a model_id, and a timestamp:

{
  "turns": [
    {
      "turn": 0,
      "messages": [
        {"role": "system", "content": "You are helpful.", "token_count": 3},
        {"role": "user", "content": "What is 2+2?", "token_count": 4}
      ],
      "model_id": "gpt-4",
      "timestamp": "2026-05-09T10:00:00Z"
    }
  ],
  "model_id": "gpt-4"
}
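If you want to sanity-check a file against this shape before loading it, here is a minimal validator sketch (illustrative only, not part of the tool):

```python
def validate_session(data: dict) -> list:
    """Minimal structural check for the generic JSON format (a sketch,
    not ContextTimeMachine's actual validator). Returns problems found."""
    problems = []
    if not isinstance(data.get("turns"), list):
        problems.append("missing 'turns' array")
        return problems
    for i, turn in enumerate(data["turns"]):
        # Each turn needs a messages list plus model and timing metadata
        for key in ("messages", "model_id", "timestamp"):
            if key not in turn:
                problems.append(f"turn {i} missing '{key}'")
    return problems

ok = {"turns": [{"turn": 0, "messages": [], "model_id": "gpt-4",
                 "timestamp": "2026-05-09T10:00:00Z"}], "model_id": "gpt-4"}
bad = {"turns": [{"turn": 0}]}
```

Running `validate_session(ok)` returns an empty list; the `bad` session reports each missing key, which makes malformed exports easy to spot before a load fails.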

CLI Commands
The CLI covers the full workflow from loading sessions to querying them:

# Start the web interface
timemachine serve

# Load a session
timemachine load --file session.json

# Track fact across session
timemachine fact --session <session-id> --fact "the user prefers JSON output"

# Find divergence between two sessions
timemachine diverge --session-a <id-a> --session-b <id-b>

# List all stored sessions
timemachine sessions

# Clear all sessions
timemachine clear

Python API

Every capability the CLI and web interface expose is also available as a Python library. This makes it straightforward to integrate ContextTimeMachine into evaluation pipelines or automated debugging scripts:

from context_time_machine import (
    SessionLoader,
    ContextReconstructor,
    FactTracker,
    DivergenceFinder,
    TokenAnalyzer,
)

# Load session
loader = SessionLoader()
session = loader.load("session.json")

# Reconstruct context at turn 10
reconstructor = ContextReconstructor()
context = reconstructor.reconstruct(session, turn_number=10)
print(f"Context at turn 10: {context.total_tokens} tokens")
print(f"Messages: {len(context.messages)}")
print(f"Utilization: {context.utilization_percent}%")

# Track a fact
tracker = FactTracker()
result = tracker.track(session, "specific decision from turn 5")
print(f"Fact first appeared: Turn {result.first_appeared_turn}")
print(f"Fact last present: Turn {result.last_present_turn}")
print(f"Disappeared at: Turn {result.disappeared_at_turn}")

# Analyze token budget
analyzer = TokenAnalyzer()
profile = analyzer.analyze_session(session)
print(f"Peak tokens: {profile.peak_tokens} at turn {profile.peak_turn}")
print(f"Eviction turns: {profile.eviction_turns}")

# Find divergence between sessions
session_b = loader.load("session_b.json")
finder = DivergenceFinder()
result = finder.find(session, session_b)
print(f"Divergence at turn: {result.divergence_turn}")
print(result.summary)

Supported Session Formats

  • LiveContext SQLite export
  • Generic JSON
  • Raw conversation JSON

How It Works

Context Reconstruction

For each turn N, ContextTimeMachine loads all messages from turns 0 to N and counts the total tokens using tiktoken. If the total exceeds the model's context limit, it simulates eviction using a model-specific strategy: GPT and Claude use left-truncation (oldest messages first), DeepSeek uses a sliding window with a recency bias, and Gemma uses local-global attention sampling from the middle. System messages are never evicted regardless of which strategy applies. The result is a reconstructed context with a full token breakdown: exactly what the model would have seen at that turn.
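As a rough illustration of the left-truncation strategy, here is a sketch with made-up helper names (not the tool's actual code), using a word-count stand-in for tiktoken:

```python
def left_truncate(messages, limit, count_tokens):
    """Simulate left-truncation eviction: drop the oldest non-system
    messages until the context fits the token limit. System messages
    are always kept, matching the behaviour described above."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    total = sum(count_tokens(m["content"]) for m in system + rest)
    while rest and total > limit:
        evicted = rest.pop(0)  # oldest non-system message first
        total -= count_tokens(evicted["content"])
    return system + rest

# A tiny limit forces eviction of the oldest user turn
msgs = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "turn one question"},
    {"role": "assistant", "content": "turn one answer"},
    {"role": "user", "content": "turn two question"},
]
kept = left_truncate(msgs, limit=9, count_tokens=lambda s: len(s.split()))
```

With a 9-token budget, the oldest user message is evicted while the system prompt survives, which is exactly the failure mode the Timeline Navigator lets you watch happen turn by turn.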

Fact Tracking

For each turn, ContextTimeMachine embeds the fact text using all-MiniLM-L6-v2. It then computes cosine similarity between that embedding and every message in the turn's reconstructed context. A fact is considered present if any message has a similarity above 0.75. Embeddings are cached for performance so repeated queries against the same session do not recompute embeddings. The output is a presence chart showing the fact's full lifecycle across the session.
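The presence check reduces to a per-message cosine threshold. A minimal sketch, with toy 2-d vectors standing in for the all-MiniLM-L6-v2 embeddings:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fact_present(fact_vec, message_vecs, threshold=0.75):
    # Present if any message in the reconstructed context clears the threshold
    return any(cosine(fact_vec, m) >= threshold for m in message_vecs)

def presence_chart(fact_vec, turn_contexts, threshold=0.75):
    # One boolean per turn: was the fact semantically present at that turn?
    return [fact_present(fact_vec, ctx, threshold) for ctx in turn_contexts]

fact = np.array([1.0, 0.0])
turns = [
    [np.array([0.9, 0.1])],  # close match: fact still in context
    [np.array([0.0, 1.0])],  # orthogonal: fact has left the window
]
chart = presence_chart(fact, turns)
```

The chart here comes out `[True, False]`: the fact is present at turn 0 and gone by turn 1, which is the green-to-red transition the presence chart visualizes.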

Divergence Detection

For two sessions, ContextTimeMachine aligns turns and analyzes up to the minimum length of the two sessions. At each turn it reconstructs the context for both sessions, embeds all messages, and computes an average maximum cosine similarity between the two context windows. When this similarity drops below 0.85, the turn is flagged as the divergence point. The output includes a message diff at the divergence point and a summary of what changed.
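The comparison metric can be sketched like this, again with toy vectors in place of real embeddings (function names are mine, not the tool's):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def avg_max_similarity(ctx_a, ctx_b):
    # For each message in one window, take its best match in the other,
    # average those scores, and symmetrise over both directions
    def one_way(xs, ys):
        return sum(max(cosine(x, y) for y in ys) for x in xs) / len(xs)
    return (one_way(ctx_a, ctx_b) + one_way(ctx_b, ctx_a)) / 2

def find_divergence(contexts_a, contexts_b, threshold=0.85):
    # Compare turn-aligned context windows up to the shorter session
    for turn, (a, b) in enumerate(zip(contexts_a, contexts_b)):
        if avg_max_similarity(a, b) < threshold:
            return turn
    return None

a = [[np.array([1.0, 0.0])], [np.array([1.0, 0.0])]]
b = [[np.array([1.0, 0.0])], [np.array([0.0, 1.0])]]
turn = find_divergence(a, b)
```

The two sessions match at turn 0 and diverge at turn 1, so `find_divergence` returns 1, which is the turn the UI would flag and diff.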

API Endpoints

Session Management

POST /api/session/load - load session from file or JSON
GET /api/sessions - list all stored sessions
DELETE /api/session/{id} - delete a session

Analysis

GET /api/session/{id}/profile - get token profile for session
GET /api/session/{id}/turn/{num} - reconstruct context at turn
POST /api/session/{id}/fact - track fact presence
POST /api/divergence - find divergence between sessions

Performance

  • Context Reconstruction: < 100ms for typical sessions
  • Fact Tracking: ~1-5 seconds for full session (includes embedding)
  • Divergence Detection: ~2-10 seconds for 2 sessions
  • Memory: ~50-200MB per stored session (depending on size)

Dependencies

Core

  • fastapi - Web framework
  • uvicorn - ASGI server
  • pydantic - Data validation
  • click - CLI framework
  • tiktoken - Token counting
  • sentence-transformers - Local embeddings
  • numpy - Numerical operations
  • sqlalchemy - Database ORM
  • aiofiles - Async file operations

Frontend
React, Tailwind CSS, Framer Motion, Recharts

Known Limitations

  • Frontend is a React stub - core analysis is fully functional
  • LangSmith format not yet implemented
  • No streaming support for very large sessions (>10k turns)
  • Embedding cache cleared on restart

Future Enhancements

  • Complete React frontend with real-time updates
  • WebSocket streaming for large sessions
  • LangSmith format support
  • Multi-session comparison UI
  • Export to markdown/HTML
  • Attention visualization
  • Custom eviction strategy support

How I Built This Using NEO

This project was built using NEO. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization, and end-to-end AI pipeline development.

The requirement was a forensic debugging tool for long-running agent sessions, one that could reconstruct the exact context window at any historical turn, track when specific facts entered and left context using semantic embeddings, and find the earliest point where two divergent sessions started seeing different content. The tool needed to support multiple session formats, expose a Python API alongside the web interface, and work entirely offline with local embeddings.

NEO handled all 12 specification steps autonomously, building the SessionLoader with support for LiveContext SQLite, generic JSON, and raw conversation formats, the ContextReconstructor with model-specific eviction strategies for GPT, Claude, DeepSeek, and Gemma, the FactTracker with all-MiniLM-L6-v2 embeddings and cosine similarity scoring, the DivergenceFinder with turn-aligned context comparison, the TokenAnalyzer for peak token and eviction turn detection, the FastAPI backend with all six API endpoints, the SQLite storage layer via SQLAlchemy, the Click CLI with all six commands, and the full 58-test suite covering all core modules.

How You Can Use and Extend This With NEO

Use it to find the root cause of long-session failures.
When an agent gives a wrong answer deep into a long session, load the session into ContextTimeMachine, travel to the failure turn in the Timeline Navigator, and see exactly what was in context at that point. The reconstructed view shows every message the model saw, in order, with token counts, so you can see immediately whether the relevant context was present or had been evicted.

Use Fact Tracker to measure context retention across your agent design.
Before settling on a context management strategy for your agent, run Fact Tracker against a set of real sessions. The presence chart for key decisions and instructions tells you at what turn they reliably drop out of context, giving you a data-driven basis for choosing context window sizes, eviction strategies, or compression approaches.

Use Divergence Finder to debug non-deterministic agent behaviour.
When two runs of the same agent with the same input produce different outcomes, load both into Divergence Finder. The tool identifies the exact turn where their context windows started differing and shows a diff of what changed, turning a difficult debugging problem into a specific, actionable finding.

Extend it with additional session format parsers.
SessionLoader already handles three formats following a common interface. Adding a new format (LangSmith is listed as planned) means implementing the same loader interface for the new format. It is then immediately available in the CLI, the Python API, and the web interface without touching any of the analysis modules.
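A sketch of what such a parser might look like; the `Turn` dataclass, method names, and the LangSmith export shape here are all assumptions about the loader contract, not the actual interface:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    turn: int
    messages: list

class LangSmithLoader:
    """Hypothetical parser for a new session format. The real loader
    interface in context-time-machine may differ from this sketch."""

    def can_load(self, path: str) -> bool:
        # Claim files by extension so the dispatcher can pick a loader
        return path.endswith(".langsmith.json")

    def parse(self, raw: dict) -> list:
        # Map each exported run onto the tool's turn/messages shape
        return [Turn(turn=i, messages=run.get("messages", []))
                for i, run in enumerate(raw.get("runs", []))]

loader = LangSmithLoader()
turns = loader.parse({"runs": [{"messages": [{"role": "user", "content": "hi"}]}]})
```

Because every analysis module consumes the same turn/messages shape, a parser like this is all a new format needs: the Timeline Navigator, Fact Tracker, and Divergence Finder work on it unchanged.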

Final Notes

ContextTimeMachine makes the context window visible. Instead of inferring what the model saw from its outputs, you can reconstruct and inspect the exact context at any turn, track when specific information entered and left the window, and find where two sessions diverged. For teams debugging long-running agents, that visibility is the difference between guessing and knowing.

The code is at https://github.com/dakshjain-1616/ContextTimeMachine
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code
