Building Lookspan: local-first observability & replay for LLM apps (v0.4.0)

#llm #observability #ai #devtools

I've been building Lookspan — a local-first observability and replay tool for apps that use LLMs — and wanted to share where it's at after the latest release.

The problem

When your app calls an LLM, what actually happened is mostly a black box: which prompt went out, what came back, which tools fired, and why the output changed between runs. Most observability stacks were built for plain HTTP services, not for the non-deterministic world of LLM calls.

What Lookspan does

Capture spans/traces of your LLM calls: prompts, responses, and tool calls. It's MCP-native (Model Context Protocol), so it plugs into the existing tool ecosystem instead of locking you in.
Replay & diff: re-run a captured trace and compare the outputs side by side. Useful for catching regressions when you tweak a prompt or swap a model.
LLM-as-judge: score outputs automatically instead of eyeballing them.
Local-first: your traces stay on your machine. No third-party vendor is involved, and nothing leaves your laptop.
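
To make the replay-and-diff idea concrete, here is a rough TypeScript sketch of comparing a captured response with a replayed one. The `CapturedSpan` shape and the `diffLines` / `diffReplay` helpers are hypothetical names invented for this illustration; they are not Lookspan's actual schema or API, and the diff is a plain line-level one based on the longest common subsequence.

```typescript
// Hypothetical shapes for illustration only; not Lookspan's actual schema or API.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface CapturedSpan {
  id: string;
  model: string;
  prompt: string;
  response: string;
  toolCalls: ToolCall[];
}

type DiffLine = { kind: "same" | "removed" | "added"; text: string };

// Line-level diff built on a longest-common-subsequence (LCS) table.
function diffLines(before: string, after: string): DiffLine[] {
  const a = before.split("\n");
  const b = after.split("\n");

  // lcs[i][j] holds the LCS length of a[i..] and b[j..].
  const lcs: number[][] = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk both sequences, emitting shared lines and the edits between them.
  const out: DiffLine[] = [];
  let ai = 0;
  let bi = 0;
  while (ai < a.length && bi < b.length) {
    if (a[ai] === b[bi]) {
      out.push({ kind: "same", text: a[ai] });
      ai++;
      bi++;
    } else if (lcs[ai + 1][bi] >= lcs[ai][bi + 1]) {
      out.push({ kind: "removed", text: a[ai] });
      ai++;
    } else {
      out.push({ kind: "added", text: b[bi] });
      bi++;
    }
  }
  while (ai < a.length) out.push({ kind: "removed", text: a[ai++] });
  while (bi < b.length) out.push({ kind: "added", text: b[bi++] });
  return out;
}

// Compare what was captured with what a replay produced.
function diffReplay(span: CapturedSpan, replayedResponse: string): DiffLine[] {
  return diffLines(span.response, replayedResponse);
}

// Made-up example data: the replay changed one line of the response.
const exampleSpan: CapturedSpan = {
  id: "span-1",
  model: "example-model",
  prompt: "Summarize the refund decision.",
  response: "Refund approved.\nAmount: $20\nThanks!",
  toolCalls: [],
};

for (const line of diffReplay(exampleSpan, "Refund approved.\nAmount: $25\nThanks!")) {
  const marker = line.kind === "added" ? "+" : line.kind === "removed" ? "-" : " ";
  console.log(`${marker} ${line.text}`);
}
```

A line-level diff is the simplest thing that works for short responses; for long free-form text, a word-level or semantic comparison would probably be more useful.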

New in v0.4.0: datasets & experiments

The headline addition is a real evaluation loop:

Define a test set of inputs.
Run a batch through your app.
Judge the results (LLM-as-judge).
See the aggregates — pass rates, diffs, trends.

It turns "I think the new prompt is better" into a number you can actually compare.
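
The four steps above can be sketched in a few lines of TypeScript. This is a minimal illustration under stated assumptions, not Lookspan's API: `TestCase`, `App`, `Judge`, `runExperiment`, and `summarize` are hypothetical names, and the app and judge are stubs so the sketch runs without a model or network access.

```typescript
// Hypothetical types for illustration only; not Lookspan's actual API.
interface TestCase {
  id: string;
  input: string;
}

interface Verdict {
  pass: boolean;
  reason: string;
}

interface CaseResult {
  caseId: string;
  output: string;
  verdict: Verdict;
}

interface Summary {
  total: number;
  passed: number;
  passRate: number;
  failedIds: string[];
}

// The app under test and the judge are both supplied by the caller.
type App = (input: string) => Promise<string>;
type Judge = (input: string, output: string) => Promise<Verdict>;

// Steps 2 and 3: run every test case through the app, then judge each output.
async function runExperiment(
  dataset: TestCase[],
  app: App,
  judge: Judge
): Promise<CaseResult[]> {
  const results: CaseResult[] = [];
  for (const testCase of dataset) {
    // Sequential on purpose, to keep the sketch simple.
    const output = await app(testCase.input);
    const verdict = await judge(testCase.input, output);
    results.push({ caseId: testCase.id, output, verdict });
  }
  return results;
}

// Step 4: reduce per-case verdicts to aggregate numbers.
function summarize(results: CaseResult[]): Summary {
  const passed = results.filter((r) => r.verdict.pass).length;
  return {
    total: results.length,
    passed,
    passRate: results.length === 0 ? 0 : passed / results.length,
    failedIds: results.filter((r) => !r.verdict.pass).map((r) => r.caseId),
  };
}

// Step 1: a tiny made-up test set, plus stub app and judge implementations.
const dataset: TestCase[] = [
  { id: "t1", input: "hello" },
  { id: "t2", input: "world" },
];
const stubApp: App = async (input) => input.toUpperCase();
const stubJudge: Judge = async (input, output) => ({
  pass: output === input.toUpperCase(),
  reason: "exact match against the uppercased input",
});

runExperiment(dataset, stubApp, stubJudge).then((results) =>
  console.log(summarize(results))
);
```

To compare two prompt variants, you would run the same dataset through each variant and compare the resulting `passRate` values, using `failedIds` to see which cases regressed.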

The road here

0.2 — multi-agent capture
0.3 — replay/diff + LLM-as-judge
0.4 — datasets & experiments

Try it

npx lookspan

It's on npm: lookspan.

It's still early and I'd love feedback — what would you want from an LLM observability tool you can run entirely locally?

DEV Community