Multi-agent systems need three layers of visibility: system health, run quality, and workflow state. We run a stack of Sentry, Langfuse, and LangGraph for that. Three tools, three clearly separated jobs, and none of them can do the job of the others. Here is how the combination works in our setup, and why exactly these three.
Multi-agent systems have a property that everyone underestimates the first time they see one in production. A single LLM call is transparent. You put a prompt in, you get an answer out, you see both. A pipeline of eight agents collaborating over four days is a black box with eight doors, each with a different assumption about what the other seven are currently doing.
We operate around 40 worker agents distributed across eight specialized fleets: a pipeline that designs, builds, reviews, and publishes MCP servers; a memory product squad that continuously improves our SaaS memory; an academy content pipeline; SaaS operations; and a chief-of-staff orchestrator as a layer-2 over the fleet CEOs. All of it runs on the Anthropic Claude Agent SDK in TypeScript, daily via cron.
The question is not how this scales technically. The question is how to keep quality high over time. Three tools answer that question in our architecture: Sentry, Langfuse, and LangGraph. Each solves a different problem, and none can solve the problems of the others.
What multi-agent setups actually need
Three layers must be visible, otherwise you do not learn anything.
The first layer is system health. Which MCP server is slow, which tool call returns silent JSON-RPC errors instead of exceptions, where is the latency spike that breaks the cron runs. That is classic APM work, just over LLM calls and MCP servers instead of only web requests.
The second layer is run quality. Did the reviewer agent find the real bugs or just produce boilerplate findings. Did the architect deliver an executable plan or just nice text. For a single test run, a human can judge that. For 40 workers and 30 days, a system has to do it.
The third layer is workflow state. When a build subprocess crashes after 13 minutes, you do not want to restart the entire build. You want to resume from the last good checkpoint. When a tester delivers an approval-pending output, you want human-in-the-loop without writing custom code for it.
These three layers are orthogonal. Tools that try to do all three end up doing all three half-well. Tools that do one layer first-class combine into a stack that is better than the sum of its parts.
Sentry for errors and MCP server health
Sentry has had native MCP server auto-instrumentation since April 2026. One line of code per MCP server, and you immediately get a dashboard with the most-used tools, latency distribution per tool, error rate, client segmentation, and transport distribution. For five production MCP servers, that is five lines of code for a health layer that would otherwise take weeks to build.
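What that one line looks like in practice, as a minimal sketch. It assumes the @sentry/node SDK with MCP monitoring support and the Anthropic MCP TypeScript SDK; the server name is a placeholder.

```typescript
// Minimal sketch, assuming @sentry/node with MCP monitoring support
// and the Anthropic MCP TypeScript SDK. Server name is a placeholder.
import * as Sentry from "@sentry/node";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampleRate: 1.0, // sample everything; lower for high-volume servers
});

// The one line per server: the wrapper emits spans for every tool call,
// which feeds the dashboard with per-tool latency and error rates.
const server = Sentry.wrapMcpServerWithSentry(
  new McpServer({ name: "mcp-factory-server", version: "1.0.0" })
);
```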
The most important part: the Anthropic MCP SDK treats errors as JSON-RPC responses instead of exceptions. When a tool crashes internally, the caller sees a success status with the error buried in the JSON content. Classic stack-trace tools do not see that. Sentry does.
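To make the failure mode concrete, here is a sketch of what the caller sees, following the MCP CallToolResult shape. The connected client and the "deploy" tool are hypothetical, and the transport setup is omitted.

```typescript
// Sketch of the failure mode. Field names follow the MCP CallToolResult
// shape; the connected client and the "deploy" tool are hypothetical.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";

declare const client: Client; // connected elsewhere, transport omitted

const result = await client.callTool({
  name: "deploy",
  arguments: { target: "prod" },
});

// No exception was thrown, so a classic APM tool records a success.
// The actual failure hides inside the successful JSON-RPC response:
if (result.isError) {
  console.error("Silent tool failure:", result.content);
}
```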
On the agent layer, Sentry auto-instruments the Anthropic SDK, writes tool-use loops as nested spans into the trace, and tracks token counts. Token volume remains a valuable quality proxy even when you do not pay per token: when an architect suddenly needs four times as many tokens without the output getting better, that is a drift signal.
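A drift check on top of those token counts can be trivial. The sketch below is hypothetical end to end: the stats type, the two run arrays, and notifyOnCall are placeholders for whatever your trace sink exposes.

```typescript
// Hypothetical drift check on traced token counts. RunStats, the two
// run arrays, and notifyOnCall are placeholders; the real numbers come
// from span attributes (gen_ai.usage.* under the OTel GenAI conventions).
type RunStats = { runId: string; totalTokens: number };

declare const lastMonthRuns: RunStats[]; // baseline window
declare const thisWeekRuns: RunStats[];  // recent window
declare function notifyOnCall(message: string): void;

function tokenDriftFactor(baseline: RunStats[], recent: RunStats[]): number {
  const avg = (runs: RunStats[]) =>
    runs.reduce((sum, r) => sum + r.totalTokens, 0) / runs.length;
  return avg(recent) / avg(baseline);
}

// The signal from the text: four times the tokens for the same output.
if (tokenDriftFactor(lastMonthRuns, thisWeekRuns) >= 4) {
  notifyOnCall("architect token usage quadrupled without better output");
}
```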
The stack is OpenTelemetry-compliant and implements the GenAI Semantic Conventions v1.36. Meaning every other OTel tool can read the same spans. No lock-in to Sentry-specific span formats.
Langfuse for run quality, evals, and prompt management
Langfuse is the production answer to the question of whether the agents are getting better or worse over time. MIT-licensed, self-hostable, with a dedicated Claude Agent SDK integration in the TypeScript SDK v4.
Three capabilities that make the difference in practice.
Tracing. Multi-agent calls are visualized as agent graphs, not just a linear span tree. Who calls whom, which tool calls sit between them, where the trace breaks, where token consumption explodes.
Evals via LLM-as-judge. We maintain goldsets, curated test suites with expected outputs, and run them automatically on every code change to an agent. Custom evaluators score whether the agent delivered the expected findings. That makes run quality measurable over time instead of subjective. If the reviewer agent caught 18 out of 20 cases two weeks ago and now only catches 14, you know something has drifted and can investigate the cause.
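A goldset run is not much code. The sketch below uses the Langfuse datasets API with v3-style method names (they shift slightly in v4); runReviewerAgent and judgeFindings stand in for our worker call and the LLM-as-judge evaluator.

```typescript
// Sketch of a goldset run against the reviewer agent. Uses the Langfuse
// datasets API (v3-style method names; v4 moves tracing onto OTel).
// runReviewerAgent and judgeFindings are placeholders for our own code.
import { Langfuse } from "langfuse";

const langfuse = new Langfuse();

declare function runReviewerAgent(
  input: string
): Promise<{ traceId: string; findings: string[] }>;
declare function judgeFindings(
  expected: string[],
  actual: string[]
): Promise<number>; // 0..1, scored by an LLM judge

const goldset = await langfuse.getDataset("reviewer-goldset");
for (const item of goldset.items) {
  const run = await runReviewerAgent(item.input as string);
  const score = await judgeFindings(item.expectedOutput as string[], run.findings);
  // One score per trace; the time series of these makes drift visible.
  langfuse.score({ traceId: run.traceId, name: "findings-recall", value: score });
}
await langfuse.flushAsync(); // make sure scores leave the process
```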
Prompt management with versioning. Prompts do not live hardcoded in source files. They are versioned, carry labels for A/B tests (prod-a against prod-b running in parallel), and performance per version is tracked automatically by latency, tokens, and eval score. Rollback is one click, not a git revert.
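The consuming side of that versioning is one call. A sketch, again with v3-style method names; the prompt name and template variables are placeholders.

```typescript
// Sketch of label-based prompt fetching (v3-style API). Prompt name,
// labels, and template variables are placeholders.
import { Langfuse } from "langfuse";

const langfuse = new Langfuse();

// Split traffic between the two parallel production labels.
const label = Math.random() < 0.5 ? "prod-a" : "prod-b";
const prompt = await langfuse.getPrompt("reviewer-system-prompt", undefined, {
  label,
});

// compile() fills template variables; because the fetched version is
// linked to the trace, latency, tokens, and eval scores aggregate per version.
const systemPrompt = prompt.compile({ repo: "mcp-factory" });
```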
Self-hosting runs on Docker plus Postgres plus ClickHouse, exactly the toolchain we already run on our AI server. The license is MIT and all product features come without limits; only the enterprise modules for SCIM, audit logs, and data retention policies need a license key.
LangGraph for stateful workflows
We use LangGraph selectively, not everywhere. Specifically in the sequences where a workflow runs across multiple subprocess calls and several hours, and a crash midway must not force a complete re-run. The MCP factory pipeline is the classic use case. The architect writes a plan, the builder writes the code, the reviewer produces findings, the tester runs the live smoke test. Four subprocesses, several hours, many places where something external can break: an npm install failure, a git clone timeout, an MCP tool call that hangs.
With LangGraph as a state graph with Postgres checkpointing, the workflow becomes durable. The state (plan path, build slug, findings) lives in the StateGraph, and every node output is checkpointed. A crash at step three of four means resume from the last good checkpoint, not a full restart. A tester output of PARTIAL triggers interrupt() for manual approval. Human-in-the-loop without us building it ourselves.
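A minimal sketch of that graph, assuming @langchain/langgraph and the Postgres checkpointer package. The node bodies are stubs, and runSmokeTest is a placeholder for the real tester call.

```typescript
// Sketch of the durable factory pipeline. Assumes @langchain/langgraph
// and @langchain/langgraph-checkpoint-postgres; node bodies are stubs.
import { StateGraph, Annotation, interrupt, START, END } from "@langchain/langgraph";
import { PostgresSaver } from "@langchain/langgraph-checkpoint-postgres";

const PipelineState = Annotation.Root({
  planPath: Annotation<string>,
  buildSlug: Annotation<string>,
  findings: Annotation<string[]>,
  testVerdict: Annotation<string>,
});

declare function runSmokeTest(slug: string): Promise<"PASS" | "PARTIAL" | "FAIL">;

const checkpointer = PostgresSaver.fromConnString(process.env.DATABASE_URL!);
await checkpointer.setup(); // creates the checkpoint tables once

const graph = new StateGraph(PipelineState)
  .addNode("architect", async () => ({ planPath: "plans/mcp-server.md" }))
  .addNode("builder", async () => ({ buildSlug: "mcp-server-v1" }))
  .addNode("reviewer", async () => ({ findings: [] as string[] }))
  .addNode("tester", async (state) => {
    const verdict = await runSmokeTest(state.buildSlug);
    if (verdict === "PARTIAL") {
      // Pauses the run at this checkpoint until a human resumes it.
      interrupt({ reason: "smoke test partial, approval needed" });
    }
    return { testVerdict: verdict };
  })
  .addEdge(START, "architect")
  .addEdge("architect", "builder")
  .addEdge("builder", "reviewer")
  .addEdge("reviewer", "tester")
  .addEdge("tester", END)
  .compile({ checkpointer });

// The thread_id keys the checkpoints: rerunning with the same id after
// a crash resumes from the last good node instead of restarting.
await graph.invoke({}, { configurable: { thread_id: "factory-run-42" } });
```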
We run LangGraph with a subprocess adapter. Each LangGraph node spawns our existing worker as a subprocess instead of making a LangChain ChatModel call. That has one important effect. Our workers stay unchanged on the Anthropic Claude Agent SDK, the pricing architecture stays intact, no switch to token-based billing. LangGraph orchestrates the workflow, the workers stay themselves.
The adapter is manageable. About 80 lines of TypeScript. The Postgres checkpointer is production-ready and creates three tables in its own schema. The MCP adapters from LangChain connect existing MCP servers transparently as LangChain tools, so no rewrite there either.
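The heart of the adapter is a node body that shells out instead of calling a model. A sketch; the worker CLI contract (the --input flag, JSON on stdout) is our own convention and shown here as a hypothetical.

```typescript
// Sketch of the subprocess adapter pattern: a LangGraph node body that
// spawns an existing worker instead of calling a ChatModel. The worker
// CLI contract (--input flag, JSON on stdout) is hypothetical.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

async function runWorker(worker: string, payload: object): Promise<any> {
  // The worker stays on the Claude Agent SDK with flat-rate pricing;
  // LangGraph only sees an exit code and a JSON result on stdout.
  const { stdout } = await exec(
    "node",
    [`workers/${worker}.js`, "--input", JSON.stringify(payload)],
    { timeout: 30 * 60 * 1000 } // kill hung MCP calls after 30 minutes
  );
  return JSON.parse(stdout);
}

// Used as a node body: state in, subprocess out, partial state update back.
const builderNode = async (state: { planPath: string }) => {
  const result = await runWorker("builder", { plan: state.planPath });
  return { buildSlug: result.slug as string };
};
```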
LangGraph brings one trade-off. The library itself is MIT-licensed, but the ecosystem around it is steered by a single commercial vendor, meaning lock-in risk if the pricing strategy changes. We accept that risk only where the resume value delivers real ROI. For 80 percent of our workflows, the Claude Agent SDK alone is enough.
Where the stack overlaps
There is exactly one overlap, and it has exactly one solution. Sentry and Langfuse both instrument LLM calls via OpenTelemetry. If Sentry initializes first, which it does by default, it swallows the Langfuse spans. The fix is documented: a shared TracerProvider with both SpanProcessors attached. Ninety minutes of setup, then both tools see their own spans.
If you do not know that upfront, you debug it for two days. If you know, you put it in the bootstrap file and forget about it.
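For reference, a sketch of that bootstrap, assuming @sentry/node, @sentry/opentelemetry, and Langfuse's v4 OTel package. The complete setup also wires Sentry's sampler, propagator, and context manager per the vendors' docs.

```typescript
// Sketch of the shared-TracerProvider bootstrap. Assumes @sentry/node,
// @sentry/opentelemetry, and @langfuse/otel (Langfuse v4); the complete
// setup also wires Sentry's sampler, propagator, and context manager.
import * as Sentry from "@sentry/node";
import { SentrySpanProcessor } from "@sentry/opentelemetry";
import { LangfuseSpanProcessor } from "@langfuse/otel";
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";

// Stop Sentry from installing its own global provider first.
Sentry.init({
  dsn: process.env.SENTRY_DSN,
  skipOpenTelemetrySetup: true,
});

// One provider, both processors: Sentry keeps errors and MCP health,
// Langfuse keeps agent traces and evals, neither swallows the other.
const provider = new NodeTracerProvider({
  spanProcessors: [new SentrySpanProcessor(), new LangfuseSpanProcessor()],
});
provider.register();
```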
What holds the stack together
Three properties weighed heavily in the tool selection.
Open-standards first. Sentry and Langfuse are both OTel-compliant with the GenAI Semantic Conventions. Spans travel without code changes. If we want to send these spans to a third sink tomorrow, say Honeycomb, Datadog, or an in-house system, nothing needs a rewrite.
Self-host where possible. Langfuse is self-hosted on our own infrastructure. We use Sentry Cloud for convenience, but Sentry is self-hostable too. Data sovereignty stays controllable.
Respect flat-rate pricing. Our pricing architecture is flat-rate, not per token. Tools that would force us to switch to token-based billing would be real cost drivers. The subprocess adapter pattern keeps LangGraph compatible with this architecture.
Lessons that apply to everyone building multi-agent systems
These come from the stack build and are not specific to one domain.
Token volume is a quality proxy even under flat-rate pricing. Suddenly needing four times as many tokens for the same output is a drift signal, regardless of whether you pay for it. If you do not trace this, you see the drift only weeks later, when the output gets noticeably worse.
MCP server errors delivered as JSON-RPC responses instead of exceptions are a special class of bug that classic APM tools do not catch. There is a toolchain that catches them; using it costs minutes and gives back days.
Stateful workflows with resume only make sense above a certain complexity. Single-step agents do not need them. Multi-step workflows over several hours with real external dependencies like npm, git, or third-party APIs benefit massively. The threshold for the switch is higher than most tutorials suggest. Build on LangGraph or a comparable orchestrator too early and you get more stack complexity than real-world value.
What the stack does not do
It does not make the architecture decisions. It does not do the prompt engineering work. It does not do the domain modeling. It makes visible what happens. What you learn from it is your work.
Sentry plus Langfuse plus LangGraph is three tools for three problems. If you have all three problems, the stack wins. If you have only one, install only one tool. Tooling sprawl is a real anti-pattern in solo setups and small teams.
In our fleet, all three problems show up at the same time. That is why we run all three tools.
Matthias Meyer
Founder & AI Director at StudioMeyer. Has been building websites and AI systems for 10+ years. Living on Mallorca for 15 years, running an AI-first digital studio with its own agent fleet, 680+ MCP tools and 5 SaaS products for SMBs and agencies across DACH and Spain.