I built a local Network tab for LLM calls (with evals), in .NET

#opensource #ai #dotnet #showdev

When you build something on top of an LLM, you mostly fly blind. You send a prompt, you get an answer, and the interesting parts are invisible: the exact text that went to the model after your code stitched it together, what that one call cost, how long it took, whether it called a tool, and whether a prompt or model change quietly made the answers worse.

In the browser you'd open the Network tab. For AI calls there wasn't a good local equivalent, especially in .NET. So I built one: Seerlens.

What it is

One command, and a local dashboard shows every LLM call your app makes, live:

dotnet tool install -g Seerlens
seerlens   # dashboard at http://localhost:5005

Then point your app at it. In .NET you wrap the IChatClient you already use:

using Seerlens.Sdk;

SeerlensTrace.Configure("http://localhost:5005");
IChatClient client = baseClient.UseSeerlens();

Every call through client shows up: the prompt, the completion, tokens, cost in dollars, latency, and any tool calls. It's local-first: everything is stored in SQLite on your machine, with no signup and no cloud.

It's built on the OpenTelemetry GenAI conventions, so it isn't .NET-only (any OTLP app works, and there are small SDKs for Python and JS). But I lead with .NET on purpose.

Why .NET

The .NET AI stack grew up fast: Microsoft.Extensions.AI, Semantic Kernel, the Aspire dashboard. That stack traces calls well. What it doesn't do is judge whether the answer was any good, or turn tokens into a budget you can act on. The Python world has Langfuse, Phoenix, Promptfoo, and Braintrust for that; .NET had basically nothing. That gap is the whole reason Seerlens exists.

The part I actually care about: evals

Tracing gets you in the door. The thing that separates "called an API once" from "ran AI for real" is catching quality regressions. You write a small golden set, score answers against it, and watch the trend over time.
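
To make the golden-set idea concrete, here is a minimal sketch in Python (chosen for brevity; it is not Seerlens code). It assumes a keyword-style scorer that gives each case the fraction of expected keywords found in the answer, then averages over the set. The names GoldenCase, keyword_score, and score_set are made up for illustration, and Seerlens's own scoring may well differ.

```python
from dataclasses import dataclass


@dataclass
class GoldenCase:
    """One hand-written case: a prompt plus keywords a good answer should mention."""
    prompt: str
    expected_keywords: list[str]


def keyword_score(answer: str, case: GoldenCase) -> float:
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    if not case.expected_keywords:
        return 1.0
    text = answer.lower()
    hits = sum(1 for kw in case.expected_keywords if kw.lower() in text)
    return hits / len(case.expected_keywords)


def score_set(answers: list[str], cases: list[GoldenCase]) -> float:
    """Mean keyword score across the golden set (one answer per case)."""
    if len(answers) != len(cases):
        raise ValueError("need exactly one answer per case")
    if not cases:
        raise ValueError("golden set is empty")
    return sum(keyword_score(a, c) for a, c in zip(answers, cases)) / len(cases)
```

Logging the score_set result on every run is what gives you the trend line: a prompt or model change that loses keywords shows up as a dip.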

You can gate CI on it:

seerlens eval support --min 0.8 --baseline .seerlens/support.base

That exits non-zero if the score drops below a floor, or regresses too far from a saved baseline. So a model swap that quietly drops answer quality becomes a red check on the PR, not a surprise in production.
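
The decision behind that exit code could look roughly like the sketch below. This is an assumption about the logic, not the seerlens implementation: gate is a hypothetical name, and the max_drop tolerance and its default are invented here, since "too far" is not quantified above.

```python
from typing import Optional


def gate(score: float, min_score: float,
         baseline: Optional[float] = None, max_drop: float = 0.05) -> int:
    """Return a process exit code: 0 if the eval passes, 1 if it should fail CI.

    max_drop is an assumed tolerance for regression against the baseline.
    """
    if score < min_score:
        return 1  # below the absolute floor
    if baseline is not None and baseline - score > max_drop:
        return 1  # regressed too far from the saved baseline
    return 0
```

A CI script would end with something like sys.exit(gate(...)), so a non-zero code turns the PR check red. Note that a score can clear the floor and still fail on the baseline comparison.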

There are three scorers: keyword (offline), llm-judge (grades against a rubric), and agent, the one I'm happy with. The agent scorer gives the model a case's tools, lets it actually call them, and scores whether it reached for the right tools in the right order. "Right answer, wrong tool path" shows up as a lower score.
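
One way to picture "right tools in the right order" is to compare the sequence of tool names the model called against the expected sequence. The sketch below uses longest common subsequence for that. It is an illustrative choice, not necessarily how the agent scorer computes its number, and tool_path_score and the tool names in the usage note are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two tool-name lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match on equal names, otherwise carry the best so far.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def tool_path_score(expected: list[str], actual: list[str]) -> float:
    """1.0 when the calls match the expected path exactly; lower for
    missing, extra, or out-of-order calls."""
    if not expected and not actual:
        return 1.0
    return lcs_length(expected, actual) / max(len(expected), len(actual))
```

With an expected path of lookup_order then issue_refund, calling the same two tools in reverse order scores 0.5 under this rule, even if the final answer text is fine.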

What it's not

Worth being upfront, because it shapes whether it fits you. It's single-user and local: no auth, no shared dashboard. SQLite is fine for the dev loop, not a production firehose. The agent scorer uses tools you declare with canned results; it doesn't execute your real tools. And cost is only as fresh as the price table (which you can override). For a team watching production traffic, a deployed platform is the right call. This is the local dev-loop tool.
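
On the price-table caveat: dollar cost is typically token counts multiplied by per-token prices, so a stale table means a stale cost. A minimal sketch of that arithmetic, with placeholder model names and made-up prices (not real rates, and not Seerlens's table format or override mechanism):

```python
# USD per 1M tokens as (input, output). Placeholder values, not real rates.
DEFAULT_PRICES = {
    "example-small": (0.50, 1.50),
    "example-large": (5.00, 15.00),
}


def call_cost(model, input_tokens, output_tokens, overrides=None):
    """Dollar cost of one call; entries in overrides win over the defaults."""
    prices = {**DEFAULT_PRICES, **(overrides or {})}
    if model not in prices:
        raise KeyError(f"no price for model {model!r}")
    in_price, out_price = prices[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```

Passing overrides is the equivalent of correcting the table yourself when a provider changes its rates before the built-in numbers catch up.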

Try it

MIT licensed, on GitHub, NuGet, PyPI and npm.

Repo: https://github.com/eladser/seerlens
Install: dotnet tool install -g Seerlens / pip install seerlens / npm install seerlens

It's a side project I've been building in the open, so I'd genuinely like feedback on where it falls short.
