I've been building Lookspan — a local-first observability and replay tool for apps that use LLMs — and wanted to share where it's at after the latest release.
The problem
When your app calls an LLM, what actually happened is mostly a black box: which prompt went out, what came back, which tools fired, and why the output changed between runs. Most observability stacks were built for plain HTTP services, not for the non-deterministic world of LLM calls.
What Lookspan does
- Capture spans/traces of your LLM calls — prompts, responses, tool calls. It's MCP-native, so it plugs into the ecosystem instead of locking you in.
- Replay & diff — re-run a captured trace and compare outputs side by side. Perfect for catching regressions when you tweak a prompt or swap a model.
- LLM-as-judge — score outputs automatically instead of eyeballing them.
- Local-first — your traces stay on your machine. There's no hosted vendor in the loop, and nothing leaves your laptop.
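To make the replay & diff idea concrete, here is a minimal sketch of comparing a captured run against a replayed one. The `Span` shape and the `diffRuns` helper are hypothetical names made up for illustration, not Lookspan's actual data model or API.

```typescript
// Hypothetical span shape for illustration; not Lookspan's real schema.
type Span = { id: string; prompt: string; response: string };
type Change = { id: string; before: string; after: string };

// Pair spans by id and report every span whose response changed on replay.
function diffRuns(baseline: Span[], replayed: Span[]): Change[] {
  const replayedById = new Map<string, Span>(
    replayed.map((s): [string, Span] => [s.id, s]),
  );
  const changes: Change[] = [];
  for (const b of baseline) {
    const r = replayedById.get(b.id);
    // Spans missing from the replay are skipped in this sketch.
    if (r !== undefined && r.response !== b.response) {
      changes.push({ id: b.id, before: b.response, after: r.response });
    }
  }
  return changes;
}

// Made-up data: one span is stable, one changed after a prompt tweak.
const baselineRun: Span[] = [
  { id: "s1", prompt: "Summarize the ticket", response: "User cannot log in." },
  { id: "s2", prompt: "Pick a tool", response: "search_docs" },
];
const replayedRun: Span[] = [
  { id: "s1", prompt: "Summarize the ticket", response: "User cannot log in." },
  { id: "s2", prompt: "Pick a tool", response: "open_ticket" },
];

const changes = diffRuns(baselineRun, replayedRun);
console.log(changes);
```

Exact string comparison is the simplest possible diff. Because LLM output is non-deterministic, a real comparison would likely need something more forgiving, which is where a judge comes in.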
New in v0.4.0: datasets & experiments
The headline addition is a real evaluation loop:
- Define a test set of inputs.
- Run a batch through your app.
- Judge the results (LLM-as-judge).
- See the aggregates — pass rates, diffs, trends.
It turns "I think the new prompt is better" into a number you can actually compare.
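The four steps above can be sketched as a small loop. Everything here is a stand-in: `runExperiment`, `passRate`, the two "prompt" functions, and the `nonEmpty` judge are hypothetical names for illustration, not Lookspan's API, and the calls are kept synchronous for brevity where real LLM calls would be async.

```typescript
// Hypothetical types for illustration; not Lookspan's API.
type Example = { id: string; input: string };
type Verdict = { id: string; output: string; pass: boolean };

// Steps 2 and 3: run each input through the app, then judge the output.
function runExperiment(
  dataset: Example[],
  app: (input: string) => string,
  judge: (input: string, output: string) => boolean,
): Verdict[] {
  return dataset.map((ex) => {
    const output = app(ex.input);
    return { id: ex.id, output, pass: judge(ex.input, output) };
  });
}

// Step 4: aggregate the verdicts into a single pass rate between 0 and 1.
function passRate(verdicts: Verdict[]): number {
  if (verdicts.length === 0) return 0;
  return verdicts.filter((v) => v.pass).length / verdicts.length;
}

// Step 1: a tiny test set (made-up inputs).
const dataset: Example[] = [
  { id: "a", input: "hello" },
  { id: "b", input: "" },
];

// Stand-ins for two versions of an app; a real one would call a model.
const oldPrompt = (input: string): string => input;
const newPrompt = (input: string): string => input || "(empty input)";

// Stand-in judge: passes any non-empty output. An LLM judge would go here.
const nonEmpty = (_input: string, output: string): boolean => output.length > 0;

const before = passRate(runExperiment(dataset, oldPrompt, nonEmpty));
const after = passRate(runExperiment(dataset, newPrompt, nonEmpty));
console.log({ before, after });
```

With these stand-ins, the old version passes one of two cases and the new one passes both, so the comparison becomes 0.5 versus 1 rather than a hunch.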
The road here
- 0.2 — multi-agent capture
- 0.3 — replay/diff + LLM-as-judge
- 0.4 — datasets & experiments
Try it
npx lookspan
It's on npm: lookspan.
It's still early and I'd love feedback — what would you want from an LLM observability tool you can run entirely locally?