DEV Community

decker


Claude Code session replay: what I learned after 10 days of tracking every AI decision

I almost gave up on Claude Code three weeks in.

Not because it was bad — it was genuinely impressive. But I kept running into the same wall: I'd start a session with clear context, make real progress, and then hit the session limit. The next day I'd pick it up again, spend 20 minutes re-explaining everything, watch Claude confidently make a decision I'd already tried and discarded, and lose another hour undoing it.

That's when I started tracking every AI decision I made. Every single one. For 10 days straight.

What I learned changed how I think about AI-assisted development entirely.


Why I Started the Experiment

The trigger was a specific failure. I was building a data pipeline, and Claude had helped me design an elegant solution using a particular caching strategy. We talked through the tradeoffs, I pushed back on one approach, Claude agreed and we went with something simpler.

Three sessions later, Claude proposed the exact caching strategy we'd already ruled out. With the same reasoning. I said "we already tried this," and of course Claude had no idea what I meant.

I started wondering: how much time was I actually losing to this? How many decisions were getting re-litigated? How much of my Claude Code workflow was just... rebuilding context?

So I built a simple tracker. A markdown file per session, timestamped, with three fields for each decision point: what the AI proposed, what I decided, and why.


The Tracking System

Nothing fancy. I kept a decisions.md file in every project root:

```markdown
## Session 2026-02-15

### Decision: API retry logic
- Proposed: Exponential backoff with jitter, max 5 retries
- Decided: Simple 3-retry with fixed 2s delay
- Why: External API has rate limits that make jitter counterproductive; simpler to reason about

### Decision: Error handling strategy
- Proposed: Custom error classes per domain
- Decided: Keep native errors, add context via cause chain
- Why: Team is small, over-engineering error taxonomy costs more than it saves
```

By day three, I had something unexpected: a searchable record of my own thinking. Not Claude's thinking — mine. The decisions I made when I had full context, when I was fresh, when I understood the problem completely.
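Because every entry follows the same three-field shape, the log is easy to search programmatically. A minimal sketch of what that looks like — the field names match the template above, but the parser itself is illustrative, not a tool from the post:

```python
import re

# Sample log in the decisions.md format shown above.
SAMPLE = """\
## Session 2026-02-15

### Decision: API retry logic
- Proposed: Exponential backoff with jitter, max 5 retries
- Decided: Simple 3-retry with fixed 2s delay
- Why: External API has rate limits that make jitter counterproductive

### Decision: Error handling strategy
- Proposed: Custom error classes per domain
- Decided: Keep native errors, add context via cause chain
- Why: Over-engineering error taxonomy costs more than it saves
"""

def parse_decisions(text):
    """Split the log into (title, fields) records, one per '### Decision:' block."""
    records = []
    for block in re.split(r"^### Decision: ", text, flags=re.M)[1:]:
        lines = block.strip().splitlines()
        title = lines[0]
        fields = {}
        for line in lines[1:]:
            m = re.match(r"- (\w+): (.*)", line)
            if m:
                fields[m.group(1)] = m.group(2)
        records.append((title, fields))
    return records

def search(records, term):
    """Return decisions whose title or any field mentions the term."""
    term = term.lower()
    return [r for r in records
            if term in r[0].lower()
            or any(term in v.lower() for v in r[1].values())]

records = parse_decisions(SAMPLE)
print(search(records, "jitter")[0][0])  # -> API retry logic
```

In practice, plain text search over the file gets you most of the way; the structured version only matters once the log spans many sessions.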

I also started adding a "context dump" at the top of each session's section — a one-paragraph summary of where the project stood, what constraints were in play, and what I was trying to accomplish that day. This became the thing I'd paste into Claude at the start of each session.


What the Data Showed

After 10 days, 23 sessions, and 147 logged decision points, some patterns were hard to ignore.

Context decay is real and measurable. On average, I was spending 18 minutes per session re-establishing context before I could do productive work. That's 25% of a typical 70-minute coding session gone before writing a line. Across 23 sessions, that's roughly seven hours of lost time in just ten days.
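The arithmetic behind that estimate, using the numbers from the log:

```python
# Time lost to context re-establishment across the experiment.
minutes_per_session = 18   # avg time spent rebuilding context per session
sessions = 23
lost_hours = minutes_per_session * sessions / 60
print(round(lost_hours, 1))  # -> 6.9, i.e. roughly seven hours
```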

The most expensive decisions were the ones made in session 1. Architecture choices, data model decisions, API contract decisions — these came up again in later sessions, and without the log, Claude would sometimes propose alternatives that seemed reasonable in isolation but contradicted earlier choices. I watched this happen four distinct times. Each one cost at least 30 minutes to untangle.

"Why" was the hardest thing to capture. It was tempting to just log what I decided. But the reason mattered more. "We went with SQLite because we need zero-config deployment" is worth ten times more than "we went with SQLite." The bare decision without reasoning is almost useless in a later session. Claude can read that we chose SQLite but won't know not to suggest PostgreSQL unless it knows why.

I was better at decisions early in sessions. Looking back at the logs, my reasoning in session-start notes was cleaner and more considered. By the end of long sessions, I was making decisions faster and documenting less. Session fatigue is real. The decisions I made in the last 15 minutes of a long session were consistently lower quality. Logging made this visible.

Not all context loss is Claude's fault. I'd expected to find that Claude was the weak link — it forgets, sessions expire, the context window has limits. That's all true. But I also found that I was inconsistent. I'd remember a decision differently than how I'd logged it. My mental model of the codebase drifted from reality. The log was often more accurate than my own memory.


The Workflow That Actually Worked

By day 7, I'd settled into a rhythm:

```shell
# Start of every session
tail -n 50 decisions.md  # Review recent decisions
# Paste the last 5-10 entries into Claude as context
```
```markdown
# Template I evolved to use:

## Session context for Claude:
Recent decisions that affect this work:
- [date] Chose X over Y because Z
- [date] Rejected approach A — creates problems with B
- [date] Constraint: must maintain backward compat with API v1

Current goal: [specific, scoped task]
Out of scope today: [things Claude might try to fix that I don't want touched]
```

The difference was immediate. Sessions started faster. Claude stopped re-proposing rejected approaches. I stopped repeating myself.
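The copy-paste step lends itself to scripting. A hedged sketch of how the last few log entries could be rendered into that paste template — the format follows the examples in this post, but the helper function and its defaults are mine:

```python
import re

# Sample log in the decisions.md format used throughout this post.
SAMPLE_LOG = """\
### Decision: API retry logic
- Proposed: Exponential backoff with jitter
- Decided: Simple 3-retry with fixed 2s delay
- Why: rate limits make jitter counterproductive

### Decision: Storage engine
- Proposed: PostgreSQL
- Decided: SQLite
- Why: we need zero-config deployment
"""

def context_paste(log_text, n=5, goal="[specific, scoped task]"):
    """Render the last n logged decisions into the session-context template."""
    blocks = re.split(r"^### Decision: ", log_text, flags=re.M)[1:]
    out = ["## Session context for Claude:",
           "Recent decisions that affect this work:"]
    for block in blocks[-n:]:
        rows = block.strip().splitlines()
        title = rows[0]
        # Pull the reasoning line; the "why" is the part worth pasting.
        why = next((r[len("- Why: "):] for r in rows if r.startswith("- Why: ")), "")
        out.append(f"- {title}: because {why}")
    out.append(f"\nCurrent goal: {goal}")
    return "\n".join(out)

print(context_paste(SAMPLE_LOG, n=2))
```

Piping the output to the clipboard (`pbcopy`, `xclip`) turns session startup into one command.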

But it was also manual. Every session, I was copy-pasting context. Every session, I was making sure the log was updated before I closed the terminal. I built a small shell alias to remind me:

```shell
alias end-session='echo "Update decisions.md before you close this terminal"'
```

Embarrassingly low-tech. But it worked.
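The session-start side can be equally low-tech. A sketch of a helper that appends a dated header matching the `## Session YYYY-MM-DD` entries above — the function names and file handling are mine, not part of any tool mentioned here:

```python
from datetime import date

def new_session_header(today=None):
    """Markdown header for today's session, matching '## Session YYYY-MM-DD'."""
    today = today or date.today().isoformat()
    return f"\n## Session {today}\n"

def start_session(path="decisions.md", today=None):
    """Append today's header so the log is ready before work begins."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(new_session_header(today))

print(new_session_header("2026-02-15").strip())  # -> ## Session 2026-02-15
```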


The Hidden Cost I Hadn't Accounted For

Midway through the experiment I noticed something I hadn't expected: the act of tracking was improving my decisions in real time.

Knowing I'd have to write down "why" forced me to actually think it through before committing. A few times I started writing the reasoning and realized mid-sentence that I didn't actually have a good reason — I was just going with Claude's suggestion because it sounded fine and I was tired.

The log introduced a small amount of productive friction. Not enough to slow down the work, but enough to catch the low-effort decisions.

This is the part that doesn't show up in productivity numbers. The decisions I didn't make badly because I had to justify them in writing.


What I Was Actually Missing

About a week in, I found Mantra while looking for tools that handled this kind of tracking automatically.

The core idea aligned with what I'd been doing manually: every AI action gets bound to a git snapshot. Every decision Claude Code makes is recorded alongside the exact state of the codebase at that moment. Session replay means you can go back to any point in the AI's work on your project and see what it did, why, and what the code looked like.

What took me a week to build a rough manual version of — Mantra does automatically, and ties it to something even more useful than a markdown log: actual git history. You can go back to any checkpoint, see what Claude proposed, see what the code looked like at that moment, and replay or branch from there.

The session replay piece in particular was what I'd been missing. Not just "what did we decide" but "what did the codebase look like when we decided it." That context is the difference between a decision log and actually understanding why the code is the way it is.

I've been using it for the past week instead of my manual system. The friction of context re-establishment has dropped significantly. I spend maybe 5 minutes at the start of a session instead of 18.


What I'd Tell Someone Starting Out

If you're using Claude Code seriously and not tracking decisions, you're losing time to a problem that's easy to fix. The productivity gains from AI assistance are real, but they come with a hidden cost: your context is ephemeral, the AI's memory is ephemeral, and the combination creates a slow bleed of time and quality.

A few things that actually helped:

1. Log the "why," not just the "what."
The decision itself is secondary. In three weeks, you won't remember why you picked approach A. Future-you (and future Claude) needs the reasoning. Force yourself to write one sentence of justification for every non-trivial decision.

2. Scope sessions aggressively.
Smaller, focused sessions with clear exit criteria lose less context than marathon sessions that sprawl across three subsystems. I started ending sessions at natural decision points rather than when I ran out of time or energy.

3. The session boundary is where quality drops.
The handoff between sessions is the highest-risk moment in AI-assisted development. Whatever system you use — manual logs, automated tools, anything — make the session boundary explicit and handled. Don't just close the terminal.

4. Your past decisions are architecture documentation.
After 10 days, my decisions.md was more useful than my README for understanding the project. The reasoning trail told the story of the system in a way that code comments and docs never do. It's the "why the code is this way" document you never have time to write.

5. Track what you rejected, not just what you chose.
The rejected alternatives are often the most valuable part. "We didn't use Redis because X" is load-bearing information. Someone (or some AI) will propose Redis again. You need the rejection reason on record.


The Numbers

After 10 days of tracking:

  • 147 decision points logged across 23 sessions
  • ~7 hours estimated lost to context re-establishment across the experiment
  • 4 decisions explicitly flagged as "don't propose this again" that would have been re-proposed without the log
  • Avg session start time dropped from 18 minutes to about 8 minutes by day 10 (using the context-paste workflow)
  • 3 times the log caught me making a decision I couldn't justify

The ROI of spending 2-3 minutes per session updating a decision log: real and measurable. The ROI of automating it entirely: the direction I've moved toward.
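A back-of-envelope version of that ROI, using the numbers above (the per-session saving is treated as an upper bound, since the start-time drop only applied in later sessions):

```python
# Rough ROI of keeping the decision log, from the figures in this post.
sessions = 23
log_cost_min = 2.5 * sessions          # ~2-3 min per session updating the log
saved_per_session = 18 - 8             # start-time drop by day 10, in minutes
max_saved_min = saved_per_session * sessions  # upper bound across all sessions
print(log_cost_min, max_saved_min)     # -> 57.5 230
```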


Where to Go From Here

If you want to try the manual version, start with a single project and commit to logging every non-trivial decision for a week. Just the decision and the reason. See what you learn about your own workflow.

If you want to see what automated session replay looks like, Mantra is worth checking out: mantra.gonewx.com. It's open source and designed to work alongside Claude Code, not as a replacement. The session replay feature is the part I'd been trying to build manually.

The goal isn't to make AI development frictionless — some friction is useful. The goal is to make sure the friction you're experiencing is productive friction, not just entropy from context decay.

Ten days of logging taught me that most of the friction I was experiencing was the second kind. And most of it was fixable.
