Ryan Smith

Posted on May 28

I scanned Langfuse. It observes its own LLM calls through its own platform.

#webdev #javascript #ai #opensource

Post 3 of "Scanning Open Source." So far: Dub hides a fraud engine. Inbox Zero has prompt injection defense. The pattern: every project is architecturally bigger than its tagline.

Today: Langfuse — open source LLM observability platform. YC W23. 8K+ stars.

The scan

$ npx anatomia-cli scan .

langfuse                                                  web-app
TypeScript · Next.js · Prisma → PostgreSQL (65 models) · 7 packages

Stack
─────
Language     TypeScript
Framework    Next.js
Database     Prisma → PostgreSQL (65 models)
Auth         NextAuth
AI           LangChain
Payments     Stripe
Testing      Vitest, Playwright, Testing Library
UI           shadcn/ui (Tailwind)
Services     AWS S3 · Nodemailer · Sentry · PostHog · tRPC (+6 more)
Deploy       Docker · GitHub Actions
Workspace    Turborepo (pnpm)

Surfaces
────────
web      Next.js · Vitest
worker   TypeScript · Vitest

⚠ ~75 of 93 API route files may lack input validation

The scan took 5 seconds and found two surfaces: a web app and a worker. The validation warning needs context. Langfuse uses tRPC extensively, and tRPC validates input through .input() schemas in the router layer. The scanner only checks file-level imports, so it may not detect validation that lives in the router or in middleware, and the warning may overstate the gap. Here's what I found when I pulled threads.
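To make that caveat concrete, here is a minimal, dependency-free sketch of router-layer validation. It is not tRPC and not Langfuse's code: procedure, parseTraceQuery, TraceQuery, and listTraces are hypothetical names I made up to mimic the shape of the .input() pattern. The handler body contains no validation logic, which is the kind of code an import-level check could flag even though every call is validated.

```typescript
// Hypothetical stand-in for a tRPC-style procedure builder (not the real tRPC API).
type Parser<T> = (raw: unknown) => T;

function procedure<I>(parse: Parser<I>) {
  return {
    // The handler only ever sees input that has already passed the parser.
    query<O>(handler: (input: I) => O) {
      return (raw: unknown): O => handler(parse(raw));
    },
  };
}

// Hypothetical input shape for a trace lookup.
interface TraceQuery {
  projectId: string;
  limit: number;
}

// Hand-rolled parser standing in for a schema library.
const parseTraceQuery: Parser<TraceQuery> = (raw) => {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("input must be an object");
  }
  const { projectId, limit } = raw as Record<string, unknown>;
  if (typeof projectId !== "string" || projectId.length === 0) {
    throw new Error("projectId must be a non-empty string");
  }
  if (typeof limit !== "number" || Math.floor(limit) !== limit || limit < 1 || limit > 100) {
    throw new Error("limit must be an integer between 1 and 100");
  }
  return { projectId, limit };
};

// Validation is attached in the router layer, not inside the handler body.
const listTraces = procedure(parseTraceQuery).query(
  (input) => `listing ${input.limit} traces for ${input.projectId}`,
);
```

Calling listTraces({ projectId: "p1", limit: 5 }) reaches the handler, while listTraces({ projectId: "p1", limit: 0 }) throws before the handler runs.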

Langfuse traces its own LLM calls through itself

This is the finding that made me stop and reread the code.

Langfuse uses LangChain internally to power features like the playground (where users test prompts against different models) and LLM-as-judge evaluations. The scan detected AI: LangChain — but the interesting part isn't that they use LangChain. It's HOW they trace those calls.

In getInternalTracingHandler.ts, Langfuse creates a callback handler using langfuse-langchain — their own open source LangChain integration package. Every internal LLM call flows through processEventBatch, the same ingestion pipeline that handles customer traces. The observability tool is observing itself.

This isn't debugging. It's architectural dogfooding. The team's own LLM usage generates production traces through the same pipeline their customers use. If tracing breaks, they are likely to see it on their own dashboard before a customer reports it.
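The shape of that pattern can be sketched in a few lines. This is an illustration under loud assumptions, not the contents of getInternalTracingHandler.ts: TraceEvent, ingestBatch, and createInternalTracingHandler are hypothetical names, and ingestBatch only stands in for the role the real processEventBatch plays. The model call is faked so the sketch runs offline.

```typescript
// Hypothetical event shape; the real ingestion schema is not shown in this post.
interface TraceEvent {
  source: "customer" | "internal";
  name: string;
  input: string;
  output: string;
}

// One shared store standing in for the ingestion pipeline's destination.
const ingested: TraceEvent[] = [];

// Stand-in for the single ingestion entry point that all traces go through.
function ingestBatch(events: TraceEvent[]): number {
  ingested.push(...events);
  return events.length;
}

// Hypothetical internal handler: wraps an LLM call and reports it as a trace.
function createInternalTracingHandler(feature: string) {
  return {
    traceCall(input: string, callModel: (prompt: string) => string): string {
      const output = callModel(input);
      // Internal calls use the same ingestion function as customer traffic.
      ingestBatch([{ source: "internal", name: feature, input, output }]);
      return output;
    },
  };
}

// A customer SDK would end up at the same entry point.
ingestBatch([{ source: "customer", name: "chat", input: "hi", output: "hello" }]);

// An internal feature (here a fake playground call) is traced the same way.
const playground = createInternalTracingHandler("playground");
const reply = playground.traceCall("2+2?", (prompt) => `echo: ${prompt}`);
```

The design point is that there is one ingestion path, so a regression in it affects internal and customer traces alike.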

6 LLM providers through one abstraction

The scan detected LangChain as the AI SDK. When I traced the imports in fetchLLMCompletion.ts, I found six providers wired up:

import { ChatAnthropic } from "@langchain/anthropic";
import { ChatVertexAI } from "@langchain/google-vertexai";
import { ChatBedrockConverse } from "@langchain/aws";
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import { ChatOpenAI, AzureChatOpenAI } from "@langchain/openai";

Anthropic, Google Vertex, AWS Bedrock, Google Generative AI, OpenAI, and Azure OpenAI — all through LangChain as a unified interface. This powers the playground where users can test prompts across different models and the evaluation system where LLMs judge other LLMs' outputs.
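Here is a rough sketch of what "one abstraction, six providers" looks like structurally. It does not call LangChain or reproduce fetchLLMCompletion.ts: ChatModel, Provider, fakeModel, getChatModel, and runPrompt are hypothetical names, the provider keys are my own labels, and the models are fakes that echo the prompt so the example runs without API keys.

```typescript
// Hypothetical unified interface; LangChain's real chat model classes are richer.
interface ChatModel {
  invoke(prompt: string): string;
}

type Provider =
  | "anthropic"
  | "google-vertex"
  | "bedrock"
  | "google-genai"
  | "openai"
  | "azure-openai";

// Fake model used in place of a real SDK client so the sketch runs offline.
function fakeModel(provider: Provider, model: string): ChatModel {
  return { invoke: (prompt) => `[${provider}/${model}] ${prompt}` };
}

// One factory per provider. Record<Provider, ...> makes the compiler
// reject the registry if a provider is missing.
const factories: Record<Provider, (model: string) => ChatModel> = {
  anthropic: (model) => fakeModel("anthropic", model),
  "google-vertex": (model) => fakeModel("google-vertex", model),
  bedrock: (model) => fakeModel("bedrock", model),
  "google-genai": (model) => fakeModel("google-genai", model),
  openai: (model) => fakeModel("openai", model),
  "azure-openai": (model) => fakeModel("azure-openai", model),
};

function getChatModel(provider: Provider, model: string): ChatModel {
  return factories[provider](model);
}

// Callers such as a playground or an evaluator depend only on ChatModel.
function runPrompt(provider: Provider, model: string, prompt: string): string {
  return getChatModel(provider, model).invoke(prompt);
}
```

With this shape, runPrompt("openai", "model-a", "hi") and runPrompt("bedrock", "model-b", "hi") go through identical calling code, and only the registry knows the providers differ.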

24 worker queues

The scan detected two surfaces: web and worker. The worker has 253 source files and 24 separate queue processors — ingestion, evaluations, experiments, batch exports, data retention, integrations (PostHog, Mixpanel), OpenTelemetry ingestion, and more. Langfuse processes traces asynchronously — the web app accepts data, the worker processes, aggregates, evaluates, and routes it. The separation means trace ingestion never blocks the dashboard.
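The accept-then-process split can be sketched with an in-memory queue. Everything here is a simplification under my own assumptions: the queue names, acceptTrace, runWorker, and the fan-out from ingestion to evaluation are hypothetical and not taken from Langfuse's worker code, and a real deployment would use a queue backend and a separate process.

```typescript
// Hypothetical job types; the post lists ingestion, evaluations, exports, and more.
type QueueName = "ingestion" | "evaluation";

interface Job {
  queue: QueueName;
  payload: string;
}

// In-memory stand-ins for a real queue backend and a real datastore.
const pending: Job[] = [];
const processed: string[] = [];

// Web side: accept the data, enqueue it, and return immediately.
function acceptTrace(payload: string): { status: "accepted" } {
  pending.push({ queue: "ingestion", payload });
  return { status: "accepted" };
}

// Worker side: one processor per queue name.
const processors: Record<QueueName, (job: Job) => void> = {
  ingestion: (job) => {
    processed.push(`stored:${job.payload}`);
    // Assumption: processing one job can fan out into follow-up jobs.
    pending.push({ queue: "evaluation", payload: job.payload });
  },
  evaluation: (job) => {
    processed.push(`scored:${job.payload}`);
  },
};

// Drain the queue and report how many jobs were handled.
function runWorker(): number {
  let handled = 0;
  let job = pending.shift();
  while (job !== undefined) {
    processors[job.queue](job);
    handled += 1;
    job = pending.shift();
  }
  return handled;
}
```

Calling acceptTrace only enqueues a job and returns. Nothing lands in processed until runWorker is called, which is the property that keeps slow processing off the request path.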

MCP server for prompt management from your IDE

26 TypeScript files in web/src/features/mcp/. Langfuse ships a Model Context Protocol server — you can manage prompts and query observation data directly from Claude Code or any MCP-compatible tool. Create a prompt, version it, label it, without leaving your editor. If you use Langfuse for prompt management AND Claude Code for development, this closes the loop between the two.

65 Prisma models tell you what an LLM platform actually needs

The model count alone isn't the story. It's what the models ARE:

Core tracing: traces, observations, sessions, media attachments
Evaluation: eval templates, job configurations, job executions, score configs
Human review: annotation queues, queue items, queue assignments
Prompt management: prompts, prompt dependencies, protected labels, LLM schemas, LLM tools
Automation: automations, triggers, actions, automation executions, monitors
Integrations: PostHog, Mixpanel, Slack, blob storage — each with its own model

The annotation queue system is worth noting. It's a human-in-the-loop review workflow — assign traces to reviewers, score them against configurable criteria, track completion. That's the bridge between "the AI said this" and "a human confirmed this was correct." Most observability tools stop at dashboards. Langfuse has a structured process for human judgment on AI output.
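To show what "assign, score, track completion" means as data, here is a small sketch. The types and functions (AnnotationQueue, QueueItem, addTrace, assignReviewer, recordScore, completionRate) are hypothetical and do not reproduce the Prisma models; the rule that an item completes once every criterion has a score is my assumption.

```typescript
// Hypothetical shapes; the actual Prisma models are not reproduced in this post.
type ItemStatus = "PENDING" | "COMPLETED";

interface QueueItem {
  traceId: string;
  reviewer?: string;
  scores: Record<string, number>;
  status: ItemStatus;
}

interface AnnotationQueue {
  name: string;
  // Criteria every item must be scored against before it counts as done.
  criteria: string[];
  items: QueueItem[];
}

function addTrace(queue: AnnotationQueue, traceId: string): QueueItem {
  const item: QueueItem = { traceId, scores: {}, status: "PENDING" };
  queue.items.push(item);
  return item;
}

function assignReviewer(item: QueueItem, reviewer: string): void {
  item.reviewer = reviewer;
}

function recordScore(queue: AnnotationQueue, item: QueueItem, criterion: string, value: number): void {
  if (queue.criteria.indexOf(criterion) === -1) {
    throw new Error(`unknown criterion: ${criterion}`);
  }
  item.scores[criterion] = value;
  // Assumption: an item is complete once every configured criterion has a score.
  if (queue.criteria.every((c) => c in item.scores)) {
    item.status = "COMPLETED";
  }
}

// Fraction of items in the queue that reviewers have finished.
function completionRate(queue: AnnotationQueue): number {
  if (queue.items.length === 0) return 0;
  const done = queue.items.filter((i) => i.status === "COMPLETED").length;
  return done / queue.items.length;
}

const reviewQueue: AnnotationQueue = { name: "support-review", criteria: ["accuracy", "tone"], items: [] };
const first = addTrace(reviewQueue, "trace-1");
addTrace(reviewQueue, "trace-2");
assignReviewer(first, "alice");
recordScore(reviewQueue, first, "accuracy", 1);
recordScore(reviewQueue, first, "tone", 0.5);
```

After these calls the first item is complete and the second is still pending, so completionRate(reviewQueue) reports half the queue as reviewed.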

What this tells you

The self-tracing pattern is the thread that ties everything together. Langfuse runs LLM calls for the playground and evaluations. Those calls flow through their own ingestion pipeline, processed by their own worker queues, visible on their own dashboard. If you're evaluating Langfuse as an observability platform, the fact that they trust their own product with their own AI workload is the strongest signal in the codebase.

The annotation queue system is the second signal. Human review is part of the data model, with assignment, scoring, and completion tracking, rather than something left to a dashboard.

Post 3 of "Scanning Open Source." Tomorrow: Formbricks.

npx anatomia-cli scan . — GitHub
