Praveen

Posted on Jun 10

You Cannot Retroactively Capture AI Code Provenance. Here Is What You Lose Every Day You Wait.

#ai #discuss #showdev #startup

There is a failure mode in AI code governance that does not get enough attention because it is invisible until it isn't.

It is not a vulnerability. It is not a misconfiguration. It is not something a security scan will catch.

It is a gap in time: the period between "your team started using AI coding tools" and "your team started recording what those tools did." Every line of
code generated in that gap is permanently unattributable. The prompt is gone. The model is gone. Whether the suggestion was reviewed, modified, or
auto-accepted is gone.

You cannot go back. Retroactive provenance capture does not exist.

## The One-Way Door

When a developer prompts Claude Code and accepts a suggestion, the following information exists briefly in memory and in transit:

- The full prompt sent to the model
- The model identifier and version
- The generated code, before any human edits
- Whether the insertion was accepted, rejected, or applied with modifications
- The file path and surrounding context at the time of generation
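
To make that list concrete, here is one way those fields could be modeled as a single record created at generation time. This is a minimal sketch, not the LineageLens schema: the `GenerationEvent` class, the `Outcome` enum, every field name, and the example values are hypothetical.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from enum import Enum
import json


class Outcome(str, Enum):
    """What happened to the suggestion (hypothetical labels)."""
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    MODIFIED = "modified"


@dataclass(frozen=True)
class GenerationEvent:
    """Hypothetical record of one AI generation event."""
    prompt: str
    model: str
    generated_code: str           # as generated, before human edits
    outcome: Outcome
    file_path: str
    surrounding_context: str = ""
    # Stamped when the record is created, i.e. at generation time,
    # not at commit time.
    captured_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


# Example values are made up for illustration.
event = GenerationEvent(
    prompt="Add retry logic to charge()",
    model="example-model-v1",
    generated_code="def charge(): ...",
    outcome=Outcome.ACCEPTED,
    file_path="src/payments/processor.py",
)
serialized = json.dumps(asdict(event))
```

The point of the sketch is the timestamp: it is assigned when the event happens, which is exactly the information a commit cannot give you later.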

At the moment the developer's editor applies the change and moves on, most of that information disappears. Git records the diff and the commit author.
Nothing else records provenance by default.

The commit is not the generation event. These happen at different times with different context. Understanding this distinction is the precondition for any
serious AI governance posture.

If you were not capturing at the moment of generation, you were not capturing. There is no reconstruct operation.

## What You Actually Lose

Let's be specific about what "unattributable" means in practice.

Scenario 1: The incident trace.
A bug surfaces in src/payments/processor.py. You trace it to a block inserted six weeks ago. Git blame gives you a developer name and a commit hash. What
you cannot recover: the prompt that produced the block, the model that generated it, whether the developer reviewed it or accepted it from the first
suggestion, and what the risk patterns in that insertion looked like at generation time. You are debugging code whose origin is opaque.

Scenario 2: The compliance question.
Your company is asked (by an enterprise customer, an auditor, or your own legal team) to document AI usage in the SDLC. Articles 11, 12, and 14 of the
EU AI Act, which cover technical documentation, record-keeping, and human oversight for high-risk AI systems, are scheduled to apply from August 2026.
The question is: which code was AI-generated, which model produced it, and what review occurred?

If you have been capturing since January, you have a full audit trail. If you started capturing this week because someone asked the question, you have a
full audit trail from this week forward and nothing before that.

Scenario 3: The departing engineer.
A developer who used AI tools heavily leaves the team. Their code is in the codebase, some of it AI-generated. The team has no record of which blocks were
AI-generated, what was prompted, or what the risk posture of those blocks is. Onboarding the next developer is a code archaeology project that cannot be
fully resolved.

## What Starting Looks Like — Zero Configuration Required

The reason the irreversibility argument matters so much is that the cost of starting is zero.

LineageLens Base installs in one command and starts capturing immediately:

    code --install-extension karnatipraveen.lineagelens-base

No backend. No proxy. No account. No configuration. No API key.

The extension activates the moment it installs. It hooks into VS Code's onDidChangeTextDocument event and watches for insertions of 4 or more lines. When
a qualifying insertion occurs, it captures:

- File path and language
- Inserted code block
- Net lines added
- Confidence score (0.0–1.0)
- Source classification (cursor, copilot, unknown, etc.)
- UTC timestamp

Records are stored in VS Code global state, a local JSON store on the developer's machine. No data leaves the machine. The status bar shows `LL: Easy (local)`.
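
The threshold rule is simple enough to sketch. The extension itself runs inside VS Code, so the Python below only illustrates the logic and is not its implementation. The function names and record keys are hypothetical, and because this post does not describe how the confidence score is computed, the sketch takes it as an input. Counting inserted lines as "net lines added" is also a simplifying assumption.

```python
from datetime import datetime, timezone
from typing import Optional

MIN_INSERTED_LINES = 4  # threshold stated in the post


def count_lines(text: str) -> int:
    """Count lines in an inserted block (a trailing newline adds no line)."""
    return len(text.splitlines())


def make_file_diff_record(file_path: str, language: str, inserted: str,
                          confidence: float,
                          source: str = "unknown") -> Optional[dict]:
    """Return a sparse record for a qualifying insertion, else None."""
    lines = count_lines(inserted)
    if lines < MIN_INSERTED_LINES:
        return None  # too small to count as a likely AI insertion
    return {
        "file_path": file_path,
        "language": language,
        "code": inserted,
        "net_lines_added": lines,  # assumption: nothing was deleted
        "confidence": confidence,
        "source": source,
        "capture_status": "file_diff",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

A two-line edit returns `None`; a four-line function body produces a record with no prompt and no model name, only what can be observed from the file change itself.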

From this moment forward, every AI insertion of 4+ lines has a record. The record is sparse (no prompt, no model name — those require the proxy) but it
exists, and its timestamp is authoritative.

## Upgrading Record Quality Without Reinstalling

When you want full prompt and model capture, add the Lite proxy alongside it:

    git clone https://github.com/karnati-praveen/lineagelens
    cd lineagelens
    bash lineagelens-scripts/quickstart-lite.sh

Open http://localhost:8787/setup, create your admin account in three browser steps, then point your AI tools at the proxy by setting the base-URL
environment variable for each provider you use:

    export ANTHROPIC_BASE_URL=http://localhost:8788
    export OPENAI_BASE_URL=http://localhost:8788

The extension polls /proxy-health every 30 seconds. When it detects the proxy, the status bar switches from LL: Easy (local) to LL: Power automatically.
Records captured after that point include the full prompt, model identifier, and applied/rejected status.
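
The mode switch can be pictured as a small polling check. The sketch below is an illustration in Python, not the extension's code. The `/proxy-health` path and the 30-second interval come from the paragraph above; the function names are hypothetical, and treating any HTTP 200 response as "proxy is up" is an assumption, since the response format is not described here.

```python
import urllib.error
import urllib.request
from typing import Callable

POLL_INTERVAL_SECONDS = 30  # interval stated in the post


def proxy_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Assumption: any HTTP 200 from /proxy-health means the proxy is running."""
    try:
        url = f"{base_url}/proxy-health"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL.
        return False


def status_label(check: Callable[[], bool]) -> str:
    """Map the health check result to the status bar text."""
    return "LL: Power" if check() else "LL: Easy (local)"
```

A caller would run `status_label(lambda: proxy_is_up(base_url))` every `POLL_INTERVAL_SECONDS`, with `base_url` set to wherever the proxy is listening. Passing the check in as a callable keeps the label logic testable without a running proxy.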

Records captured in Easy Mode before the proxy was running remain in the store with capture_status: file_diff and confidence ~0.35. They are not
retroactively enriched. But they exist. That is the point.

## The Confidence Gradient

Here is what record quality looks like across capture configurations:

| Capture mode | capture_status | Confidence | Prompt captured? |
| --- | --- | --- | --- |
| Proxy (Power Mode) | full | 0.80 – 1.00 | Yes |
| Proxy (tunneled) | tunnel_only | 0.60 – 0.80 | Partial |
| Extension only | file_diff | 0.25 – 0.45 | No |
| No capture | n/a | n/a | n/a |
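
If you post-process records yourself, the same gradient can be written as a lookup keyed by `capture_status`. This is only a sketch: the `CaptureProfile` and `CAPTURE_MODES` names and the tuple layout are hypothetical, while the status values and ranges are copied from the table above.

```python
from typing import NamedTuple, Optional


class CaptureProfile(NamedTuple):
    mode: str
    min_confidence: float
    max_confidence: float
    prompt_captured: str  # "Yes", "Partial", or "No"


# Values copied from the table; the structure itself is illustrative.
CAPTURE_MODES = {
    "full": CaptureProfile("Proxy (Power Mode)", 0.80, 1.00, "Yes"),
    "tunnel_only": CaptureProfile("Proxy (tunneled)", 0.60, 0.80, "Partial"),
    "file_diff": CaptureProfile("Extension only", 0.25, 0.45, "No"),
}


def describe(capture_status: str) -> Optional[CaptureProfile]:
    """Look up what a record with this capture_status can be expected to hold."""
    return CAPTURE_MODES.get(capture_status)
```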

A file_diff record at confidence 0.35 is not a rich provenance record. But it is infinitely better than no record for a specific reason: it is timestamped
and file-attributed at generation time, not commit time.

When an incident occurs six months later and you are tracing a bug to a specific block, a file_diff record tells you approximately when this code appeared
in the file, that it passed the threshold for "likely AI-generated," which AI tool extension was probably active, and what the file path was. That is
enough to narrow a 90-day window to a specific week. Without the record, the investigation starts from zero.
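
As an illustration of that narrowing step, the sketch below filters a list of records down to one file inside a suspect window. It assumes the records have already been exported as dictionaries with `file_path` and an ISO-8601 UTC `timestamp`; those key names, the function name, and the sample data are all hypothetical, and exporting records from the store is not covered in this post.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


def records_for_file(records: Iterable[dict], file_path: str,
                     window_start: datetime,
                     window_end: datetime) -> List[dict]:
    """Return records for one file inside the suspect window, oldest first."""
    hits = [
        r for r in records
        if r["file_path"] == file_path
        and window_start <= datetime.fromisoformat(r["timestamp"]) <= window_end
    ]
    return sorted(hits, key=lambda r: datetime.fromisoformat(r["timestamp"]))


# Made-up sample data for illustration only.
records = [
    {"file_path": "src/payments/processor.py",
     "timestamp": "2025-03-04T10:15:00+00:00", "capture_status": "file_diff"},
    {"file_path": "src/payments/processor.py",
     "timestamp": "2024-11-01T09:00:00+00:00", "capture_status": "file_diff"},
    {"file_path": "src/other.py",
     "timestamp": "2025-03-05T08:00:00+00:00", "capture_status": "file_diff"},
]

window_end = datetime(2025, 4, 1, tzinfo=timezone.utc)
window_start = window_end - timedelta(days=90)
hits = records_for_file(records, "src/payments/processor.py",
                        window_start, window_end)
```

With this sample data, only the March record for `processor.py` falls inside the 90-day window, which is the kind of narrowing a sparse record makes possible.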

## For Teams: The Asymmetry Compounds

If your team has ten developers using a mix of Cursor, Copilot, and Claude Code, and none of them are running LineageLens, you have a growing body of
unattributable AI-generated code accumulating daily. Every sprint without capture is a sprint of records that cannot be recovered.

LineageLens Lite adds a shared backend without requiring Postgres:

    bash lineagelens-scripts/quickstart-lite.sh

Single Docker container. SQLite. Runs on a $5 VPS or a spare machine. The setup wizard creates the admin account and workspace in three browser steps.
Share the proxy URL and one environment variable with the team. From that point forward, every developer's AI tool traffic is captured and stored
centrally.

## The Direct Argument

LineageLens might be the right tool for your team or it might not be. That is worth evaluating. But the evaluation should happen today, not when you
decide you need it — because the cost of deciding you need it after the fact is permanent.
Base is free. Both Base and Lite are MIT-licensed and fully self-hosted.
The cost of starting is 30 seconds and one command:

    code --install-extension karnatipraveen.lineagelens-base

After you try it: what is the oldest piece of AI-generated code in your codebase that you cannot explain the origin of? How far back does your
unattributable window go?
