toolfreebie

Posted on May 28 • Originally published at toolfreebie.com

Langfuse: Free Open-Source LLM Observability

What Is Langfuse?

Langfuse is a free, open-source LLM observability platform — the tool you reach for when your AI app works in the demo and then does something baffling in production. It records every model call, agent step, retrieval, and tool use as a structured trace you can open, read, and replay. Born out of Y Combinator’s W23 batch and now one of the most-starred LLM engineering projects on GitHub (langfuse/langfuse), it has become the default “what just happened?” layer for teams shipping anything more complex than a single chat completion.

The core of Langfuse is MIT-licensed and self-hostable, which is the part that matters for this blog: you can run the entire platform on your own machine or a cheap VPS with no license fee, no seat limits, and no trace caps, so the only cost is the server itself. There’s also a managed Langfuse Cloud with a genuinely free Hobby tier if you’d rather not run infrastructure. Either way, the SDKs, the integrations, and the trace UI are the same.

If you’re building with any of the free AI APIs covered here — Gemini, Groq, OpenRouter, Together — Langfuse is the missing piece that turns “I think the prompt is fine” into “here is the exact request, the exact response, the latency, and the cost.” This guide covers what LLM observability actually buys you, whether Langfuse is really free, how it compares to LangSmith and Phoenix, and how to instrument your first app in about ten minutes.

Why LLM Observability Matters

Traditional application monitoring assumes deterministic code: same input, same output, and a stack trace when something breaks. LLM apps break that assumption in three ways, and each one is a reason observability stopped being optional in 2026.

Non-determinism. The same prompt can return different answers on different days. Without a recorded trace of the exact input and output, “it gave a weird answer yesterday” is unreproducible and therefore unfixable.
Hidden multi-step chains. A single user message to an agent can fan out into a dozen model calls, retrievals, and tool invocations. When the final answer is wrong, the bug is usually three steps back — a bad retrieval, a truncated context, a tool that returned an error the model ignored. You need to see the whole tree.
Cost and latency creep. Token usage is invisible until the bill arrives. Observability surfaces per-call token counts and dollar estimates so you can catch the prompt that quietly grew to 40,000 tokens of context.

LLM observability gives you a recorded, searchable history of every AI interaction: the prompts, the completions, the latency, the token cost, the retrieved documents, and the tool calls — organized as nested traces so you can drill from a user session down to the single span that misbehaved. That’s the category Langfuse sits in, alongside LangSmith, Arize Phoenix, and Helicone.
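To make the nested-trace idea concrete, here is a minimal sketch in plain Python. It is not Langfuse's data model: the `Span` class, its fields, and `find_failures` are hypothetical names, used only to show how a trace tree lets you walk from one request down to the step that failed.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Span:
    """One step inside a trace (hypothetical shape, for illustration only)."""
    name: str
    latency_ms: float
    error: Optional[str] = None
    children: List["Span"] = field(default_factory=list)


def walk(span: Span, path: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], Span]]:
    """Yield (path, span) pairs depth-first, parents before children."""
    here = path + (span.name,)
    yield here, span
    for child in span.children:
        yield from walk(child, here)


def find_failures(root: Span) -> List[str]:
    """Return the 'a > b > c' path of every span that recorded an error."""
    return [" > ".join(path) for path, span in walk(root) if span.error]


# One user request that fanned out into a retrieval, two model calls, and a tool call.
trace = Span("handle_request", 2300, children=[
    Span("retrieve", 180),
    Span("agent", 2100, children=[
        Span("llm_call_1", 900),
        Span("tool:lookup_order", 150, error="HTTP 500"),
        Span("llm_call_2", 1000),
    ]),
])

print(find_failures(trace))
```

The final answer came from `llm_call_2`, but the walk points at the tool call two steps earlier, which is the kind of "bug three steps back" a flat log tends to hide.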

Is Langfuse Really Free? Cloud vs Self-Hosted

“Free” means two different things with Langfuse, and both are real.

Self-hosted (free forever). The Langfuse core is open source under the MIT license. You run it yourself with Docker — a Postgres database, a ClickHouse analytics store, Redis, and the Langfuse web/worker containers, all wired up by the official docker compose file. There are no trace limits, no seat limits, and no feature gates on the open-source build beyond a small set of enterprise add-ons (SSO enforcement, fine-grained RBAC, audit logs) that live behind a commercial license. For an individual or a small team, the MIT build does everything you need.

Langfuse Cloud Hobby (free tier). If you don’t want to run infrastructure, Langfuse Cloud has a free Hobby plan that includes 50,000 units per month with no credit card required, according to the Langfuse pricing page (always check the page for the current limit — these numbers move). A “unit” is roughly one ingested observation, so 50,000/month comfortably covers a side project or an early-stage app in development.

Dimension	Self-Hosted (MIT)	Cloud Hobby (Free)
Price	$0 (you pay for the server)	$0, no credit card
Trace / event volume	Unlimited	50,000 units/month
Team seats	Unlimited	Limited on free tier
Data residency	Your infrastructure	EU or US region
Setup effort	One `docker compose up`	Sign up, copy two keys
Maintenance	You own upgrades & backups	Managed for you
Enterprise extras (SSO, RBAC)	Commercial license	Paid tiers

The honest rule of thumb: prototype on Cloud Hobby because it takes ninety seconds to start, and move to self-hosted the moment you exceed the free volume, need unlimited seats, or have data-residency requirements that rule out a third party seeing your prompts.
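A quick back-of-the-envelope check can tell you which side of that line you are on. The sketch below assumes the approximation used above (one unit per ingested observation) and the 50,000 figure quoted from the pricing page; the function names and the traffic numbers are made up for illustration.

```python
HOBBY_UNITS_PER_MONTH = 50_000  # figure quoted above; verify on the pricing page


def monthly_units(requests_per_day: int, observations_per_request: int, days: int = 30) -> int:
    """Estimate ingested units, assuming roughly one unit per observation."""
    return requests_per_day * observations_per_request * days


def fits_hobby(requests_per_day: int, observations_per_request: int) -> bool:
    """True if the estimate stays within the free monthly allowance."""
    return monthly_units(requests_per_day, observations_per_request) <= HOBBY_UNITS_PER_MONTH


# A RAG request that produces 5 observations (root, embed, retrieve, rerank, generate):
print(monthly_units(200, 5))   # 200 * 5 * 30 = 30000
print(fits_hobby(200, 5))      # True
print(fits_hobby(500, 5))      # 500 * 5 * 30 = 75000, so False
```

Deeply nested agents multiply the observations per request, so an agent-heavy app can hit the cap at far lower traffic than a single-call chatbot.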

Langfuse vs LangSmith vs Phoenix vs Helicone

Four tools dominate free-tier LLM observability in 2026, and they make different trade-offs between openness, framework lock-in, and how you wire them up.

Tool	Open source	Free path	Integration model	Best for
Langfuse	Yes (MIT core)	Self-host free + Cloud Hobby (50k units/mo)	SDK + decorators + OpenTelemetry, framework-agnostic	Teams who want a full platform they can also self-host
LangSmith	No (managed SaaS)	Free Developer plan (~5,000 traces/mo, 1 seat)	Tightest with LangChain / LangGraph	Teams already all-in on the LangChain stack
Arize Phoenix	Yes	Fully free to self-host	OpenTelemetry / OpenInference, notebook-first	Data scientists debugging in notebooks & evals
Helicone	Yes	Free tier (~10,000 requests/mo)	Proxy — change one base URL	The absolute lowest-effort drop-in logging

(Free-tier numbers above are from each vendor’s public pricing/docs and change often — verify on the linked page before you rely on them.)

The clearest dividing line is how they capture data. Helicone is a proxy: you point your OpenAI base URL at Helicone and it logs every request passing through. That needs no code changes beyond the URL, but it only sees what flows through the proxy. LangSmith and Langfuse use an SDK/instrumentation model: you wrap your calls or add a decorator, which means they can capture non-LLM steps (retrievals, tool calls, business logic) as spans in the same trace. Phoenix leans on the OpenTelemetry standard, which makes it portable but a little more setup-heavy.

Langfuse’s pitch is “open like Phoenix, full-featured like LangSmith, framework-agnostic unlike either.” If you want one platform that handles tracing, prompt management, and evals, and you want the option to self-host it for free, Langfuse is the broadest pick. If you live entirely inside LangGraph, LangSmith’s deeper native hooks may win on convenience.

Core Features That Matter

1. Tracing and Spans

The foundation. A trace represents one unit of work — typically one user request — and contains nested spans for each step inside it: the retrieval, each LLM call, each tool invocation. Langfuse shows this as an expandable tree with timing, token counts, and cost on every node. When an agent gives a bad answer, you open the trace and walk down to the exact span where the context went wrong. Traces can be grouped into sessions (a multi-turn conversation) and attributed to a user, so you can answer “show me everything user 4471 did this week.”

2. Prompt Management

Langfuse stores your prompts as versioned, named objects you fetch at runtime instead of hardcoding strings. You edit a prompt in the UI, label a version production, and your app picks it up without a redeploy. Every version is linked to the traces that used it, so you can see whether v4 of your system prompt actually reduced hallucinations versus v3. This is the feature that turns prompt engineering from “edit code, commit, deploy, hope” into something measurable.
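The label mechanism is easier to picture with a toy model. The sketch below is an in-memory stand-in, not the Langfuse API: `PromptRegistry` and its methods are hypothetical, and it only illustrates how moving a `production` label changes what the app fetches without touching app code.

```python
from typing import Dict, List


class PromptRegistry:
    """Toy in-memory stand-in for a prompt store (not the Langfuse API)."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}      # name -> texts, index 0 is v1
        self._labels: Dict[str, Dict[str, int]] = {}   # name -> label -> version number

    def create_version(self, name: str, text: str) -> int:
        """Store a new version and return its 1-based version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(text)
        return len(versions)

    def set_label(self, name: str, label: str, version: int) -> None:
        """Point a label such as 'production' at an existing version."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise ValueError(f"{name!r} has no version {version}")
        self._labels.setdefault(name, {})[label] = version

    def get(self, name: str, label: str = "production") -> str:
        """Fetch whichever version the label currently points at."""
        version = self._labels[name][label]
        return self._versions[name][version - 1]


registry = PromptRegistry()
v1 = registry.create_version("support-agent", "You are a helpful support agent.")
v2 = registry.create_version("support-agent", "You are a concise support agent. Cite the policy.")

registry.set_label("support-agent", "production", v2)   # promote v2
promoted = registry.get("support-agent")

registry.set_label("support-agent", "production", v1)   # roll back, no redeploy
rolled_back = registry.get("support-agent")
```

The app only ever asks for the name and the label, so promotion and rollback are pointer moves rather than deployments.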

3. Evaluations and Scoring

Langfuse can attach scores to any trace — from explicit user thumbs-up/down, from an LLM-as-a-judge evaluator, from a custom function, or from manual human annotation in the UI. Over time these scores become quality metrics you can chart: “answer relevance dropped 8% after we switched models.” You can run evaluators automatically on a sample of production traffic or against a fixed test set.

4. Datasets

A dataset is a curated set of inputs (and optional expected outputs) you run your app against to catch regressions before they ship. The natural workflow: find a trace where the app failed, click “add to dataset,” and that real-world failure becomes a permanent test case. Re-run the dataset after every prompt or model change and compare scores side by side.
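As a rough illustration of that re-run-and-compare loop, here is a self-contained sketch. The dataset items, the `exact_match` scorer, and the two stand-in app functions are all hypothetical; in a real setup the app functions would call your model and the scores would be stored alongside the run.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical dataset: each item came from a real failure and has an expected answer.
DATASET: List[Dict[str, str]] = [
    {"input": "Do you offer refunds?", "expected": "yes"},
    {"input": "Is there a free tier?", "expected": "yes"},
    {"input": "Do you support fax?", "expected": "no"},
]


def exact_match(output: str, expected: str) -> float:
    """Simplest possible scorer: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_dataset(app: Callable[[str], str]) -> float:
    """Run every dataset item through the app and return the mean score."""
    return mean(exact_match(app(item["input"]), item["expected"]) for item in DATASET)


# Two stand-in "app versions"; in practice these would call your model.
def app_v1(question: str) -> str:
    return "no"   # the old prompt answered "no" to everything


def app_v2(question: str) -> str:
    return "no" if "fax" in question else "yes"


baseline, candidate = run_dataset(app_v1), run_dataset(app_v2)
print(f"v1={baseline:.2f} v2={candidate:.2f} delta={candidate - baseline:+.2f}")
```

Exact match is only a placeholder scorer; free-form answers usually need a fuzzier metric or an LLM judge.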

5. Playground

An in-app prompt playground lets you grab a failing trace, tweak the prompt or swap the model, and re-run it immediately to see if your fix works — without leaving the tool or wiring up a script. It connects to your model providers, so you can A/B a prompt against Gemini and Groq in the same window.

6. Metrics and Dashboards

Aggregate views over all your traces: total cost per day, p95 latency per model, token usage by feature, score trends over time. This is where you notice that one endpoint is responsible for 70% of your spend, or that latency doubled the day you added a reranking step.
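The arithmetic behind those two dashboard numbers is simple enough to sketch. The records below are invented, and the nearest-rank percentile is one common definition among several; a real dashboard may interpolate differently.

```python
import math
from collections import defaultdict
from typing import Dict, List

# Hypothetical per-call records, as a flattened trace export might look.
CALLS = [
    {"endpoint": "/chat",      "latency_ms": 820,  "cost_usd": 0.004},
    {"endpoint": "/chat",      "latency_ms": 910,  "cost_usd": 0.005},
    {"endpoint": "/summarize", "latency_ms": 2400, "cost_usd": 0.030},
    {"endpoint": "/summarize", "latency_ms": 2600, "cost_usd": 0.033},
    {"endpoint": "/classify",  "latency_ms": 300,  "cost_usd": 0.001},
]


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def spend_share(calls: List[dict]) -> Dict[str, float]:
    """Fraction of total cost attributable to each endpoint."""
    totals: Dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call["endpoint"]] += call["cost_usd"]
    grand_total = sum(totals.values())
    return {endpoint: cost / grand_total for endpoint, cost in totals.items()}


p95 = percentile([c["latency_ms"] for c in CALLS], 95)
shares = spend_share(CALLS)
top = max(shares, key=shares.get)
print(p95, top, round(shares[top], 2))
```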

How to Self-Host Langfuse for Free

The fastest way to a free, unlimited Langfuse instance is the official Docker Compose stack. On any machine with Docker installed — including a free Oracle Cloud ARM VPS:

git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up -d

That brings up the full stack (Langfuse web + worker, Postgres, ClickHouse, Redis) and serves the UI at http://localhost:3000. Create an account on first load — it’s stored in your own database — make a project, and copy the public and secret API keys it generates. You now have a production-grade observability platform that no one else can see, with no trace limits, running for the cost of the server.

For production you’ll want to put it behind HTTPS and back up Postgres and ClickHouse, but for development the compose file is genuinely one command. The official self-hosting docs cover the Kubernetes Helm chart and managed-database setups when you outgrow single-node.

Instrumenting Your App: Three Ways

Langfuse offers progressively deeper levels of instrumentation. Start with the first one; reach for the others as your app grows. All three send data to the same project — set these environment variables once and every example below works against either Cloud or your self-hosted instance:

export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
# Cloud EU: https://cloud.langfuse.com  |  Cloud US: https://us.cloud.langfuse.com
# Self-hosted: http://localhost:3000
export LANGFUSE_HOST="https://cloud.langfuse.com"
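A missing or swapped key is a common reason a first trace never shows up, so a small pre-flight check can save some head-scratching. This is a sketch using only the standard library: `check_langfuse_env` is a hypothetical helper, and the prefix checks simply mirror the key shapes shown above rather than any validation the SDK itself performs.

```python
import os
from typing import List, Mapping


def check_langfuse_env(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return human-readable problems; an empty list means the config looks sane."""
    problems: List[str] = []
    public = env.get("LANGFUSE_PUBLIC_KEY", "")
    secret = env.get("LANGFUSE_SECRET_KEY", "")
    host = env.get("LANGFUSE_HOST", "")

    # Prefix checks mirror the key shapes shown above (pk-lf-..., sk-lf-...).
    if not public.startswith("pk-lf-"):
        problems.append("LANGFUSE_PUBLIC_KEY is missing or does not start with 'pk-lf-'")
    if not secret.startswith("sk-lf-"):
        problems.append("LANGFUSE_SECRET_KEY is missing or does not start with 'sk-lf-'")
    if not host.startswith(("http://", "https://")):
        problems.append("LANGFUSE_HOST is missing or is not an http(s) URL")
    return problems


if __name__ == "__main__":
    for problem in check_langfuse_env():
        print("config problem:", problem)
```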

Way 1: The OpenAI Drop-In Wrapper (zero refactor)

If your code already uses the OpenAI SDK — which is true for most free AI APIs, since Groq, Together, Mistral, and OpenRouter are all OpenAI-compatible — you change exactly one import line:

pip install langfuse openai

# before:  from openai import OpenAI
from langfuse.openai import openai   # drop-in replacement

client = openai.OpenAI(
    base_url="https://api.groq.com/openai/v1",   # any OpenAI-compatible endpoint
    api_key="YOUR_GROQ_KEY",
)

resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain LLM observability in one sentence."}],
)
print(resp.choices[0].message.content)
# This call is now automatically traced in Langfuse: prompt, completion,
# token usage, latency, and cost — with zero other changes.

Every completion you make now shows up as a trace. This is the lowest-effort way to start and works against any OpenAI-compatible free API.

Way 2: The @observe Decorator (capture your own functions)

To see your business logic — not just the model call — wrap any function with the @observe decorator. Nested decorated functions automatically become nested spans in the same trace:

from langfuse import observe
from langfuse.openai import openai

@observe()
def retrieve(question: str) -> str:
    # your vector search here; return the context string
    return "...retrieved context..."

@observe()
def answer(question: str) -> str:
    context = retrieve(question)            # becomes a child span
    resp = openai.chat.completions.create(  # becomes another child span
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

answer("What does Langfuse trace?")
# One trace, three spans: answer -> retrieve, answer -> openai call.

Way 3: The LangChain / LangGraph Callback

If you build with LangChain or LangGraph, pass Langfuse’s callback handler and it captures the whole chain automatically:

from langfuse.langchain import CallbackHandler

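# `chain` is assumed to be a LangChain runnable you have already built.
handler = CallbackHandler()
result = chain.invoke(
    {"question": "What is LLM observability?"},
    config={"callbacks": [handler]},   # Langfuse records each step of this run
)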

For TypeScript / Node projects, the same drop-in pattern exists:

import OpenAI from "openai";
import { observeOpenAI } from "langfuse";

const client = observeOpenAI(new OpenAI({
  baseURL: "https://api.groq.com/openai/v1",
  apiKey: process.env.GROQ_API_KEY,
}));

const resp = await client.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [{ role: "user", content: "Hello from Node, traced by Langfuse." }],
});

Because Langfuse v3 is built on OpenTelemetry under the hood, any OTel-instrumented library or framework can also feed it — useful if you’re standardizing telemetry across services. Check the Langfuse docs for the current SDK API, which evolves between major versions.

Tracing a RAG Pipeline End-to-End

RAG is where observability earns its keep, because a wrong answer can come from retrieval or generation and the two failure modes look identical from the outside. Picture a typical stack: a question comes in, you embed it, search a vector database, rerank with Cohere, stuff the top chunks into a prompt, and generate an answer.

With each step wrapped in @observe, a single Langfuse trace shows you:

The exact query embedding step and its latency
The documents retrieved from the vector store, with their similarity scores — so you can instantly see if retrieval pulled garbage
The reranked order after Cohere, to confirm the reranker actually helped
The final prompt that went to the model, including exactly which chunks made it into the context window
The completion, token count, and cost

When a user reports “it said we don’t offer refunds, but we do,” you open their trace and the answer is right there: either the refund policy chunk wasn’t retrieved (a retrieval/embedding problem) or it was retrieved but the model ignored it (a prompt problem). Five seconds of looking replaces an hour of guessing. That single capability — being able to see which half of the RAG pipeline failed — is the most common reason teams adopt Langfuse.
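The "which half failed" decision can be written down as a tiny rule. The sketch below assumes you have already pulled the retrieved chunks and the final answer out of a trace; `diagnose` is a hypothetical helper, and the substring match is a deliberately crude stand-in for a proper relevance check.

```python
from typing import List


def diagnose(retrieved_chunks: List[str], final_answer: str, must_contain: str) -> str:
    """Classify a wrong answer by checking where the needed fact dropped out.

    retrieved_chunks: chunk texts recorded on the retrieval span
    final_answer:     text recorded on the generation span
    must_contain:     the fact the answer should have reflected
    """
    needle = must_contain.lower()
    if needle in final_answer.lower():
        return "ok"
    if any(needle in chunk.lower() for chunk in retrieved_chunks):
        return "generation"   # the fact was in context but the model ignored it
    return "retrieval"        # the fact never reached the prompt


chunks_missing = ["Shipping takes 3-5 days.", "We accept all major cards."]
chunks_present = ["Refunds within 30 days of purchase.", "Shipping takes 3-5 days."]
answer = "Sorry, we don't offer that."

print(diagnose(chunks_missing, answer, "refunds within 30 days"))   # retrieval
print(diagnose(chunks_present, answer, "refunds within 30 days"))   # generation
```

A "retrieval" verdict sends you to the embedding, chunking, or reranking step; a "generation" verdict sends you to the prompt.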

Prompt Management Without Redeploys

Once your prompts live in Langfuse, you fetch them by name at runtime:

from langfuse import Langfuse

langfuse = Langfuse()
prompt = langfuse.get_prompt("support-agent")   # fetches the 'production' label by default
compiled = prompt.compile(customer_name="Ada", product="Widget Pro")

# use compiled as your system prompt; cached client-side, linked to the trace

Now editing the support agent’s behavior is a UI change, not a code change. Non-engineers can iterate on copy, you can roll back a bad version with one click, and because Langfuse links each prompt version to the traces and scores it produced, you get a real before/after on quality instead of vibes. Prompts are cached on the client so the fetch doesn’t add latency to your hot path.

Running Evaluations

The maturity curve for an AI app usually goes: ship it, watch traces, notice a recurring failure, turn that failure into a dataset entry, then run evaluations so the failure can’t silently come back. Langfuse supports all of it:

Online evaluation — run an LLM-as-a-judge evaluator on a sample of live traffic and chart the score over time.
Offline evaluation — run your app against a fixed dataset before every release and diff the scores against the last run.
Human annotation — queue traces for a teammate to label in the UI, building a gold-standard set.

The judge model can be any provider you connect — including a free one. Using Gemini or a Llama model on Groq as your evaluator keeps the whole eval loop at $0, which matters because evaluation can easily run more model calls than production itself.
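Since judge calls add up, online evaluation usually runs on a sample rather than every trace. One way to pick that sample, sketched here with the standard library only, is to hash the trace ID so the decision is stable across retries. `should_evaluate` and the trace ID format are assumptions for illustration, not part of Langfuse.

```python
import hashlib


def should_evaluate(trace_id: str, sample_rate: float) -> bool:
    """Deterministically pick roughly sample_rate of traces by hashing the trace ID.

    Hashing (rather than random.random()) means the same trace always gets the
    same decision, so retries and replays do not change the sample.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Keep 53 bits so the float is exact and strictly below 1.0.
    bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return bucket < sample_rate


trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in trace_ids if should_evaluate(t, 0.10)]
print(len(sampled))   # roughly 10% of 10,000; the exact count depends on the hash
```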

When to Use Langfuse vs Alternatives

You want one open-source platform for tracing + prompts + evals, with the option to self-host free → Langfuse
You are all-in on LangChain / LangGraph and want the tightest native integration → LangSmith
You debug mostly in Jupyter notebooks and care most about evals → Arize Phoenix
You want the absolute lowest-effort logging and only call one OpenAI-compatible API → Helicone (proxy, one URL change)
You have strict data-residency rules and prompts can’t leave your network → self-hosted Langfuse or Phoenix
You’re prototyping today and want zero setup → Langfuse Cloud Hobby (free, no card)

FAQ

Is Langfuse really free?

Yes, two ways. The MIT-licensed core is free to self-host with no trace, seat, or feature caps (a few enterprise extras like SSO enforcement need a commercial license). Langfuse Cloud also has a free Hobby tier with 50,000 units/month and no credit card. You only pay if you want managed hosting above the free volume or enterprise governance features.

Does Langfuse add latency to my app?

Negligibly. The SDK sends trace data asynchronously in the background after your response is already returned, and prompts are cached client-side. Your users don’t wait on Langfuse.
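For intuition about why background sending keeps telemetry off the request path, here is a toy queue-and-worker exporter. It is an illustration of the general pattern, not Langfuse's implementation: `BackgroundExporter`, `record`, and the batch size are all hypothetical.

```python
import queue
import threading
from typing import Callable, List


class BackgroundExporter:
    """Toy illustration of fire-and-forget telemetry (not Langfuse's implementation)."""

    def __init__(self, send_batch: Callable[[List[dict]], None], batch_size: int = 10) -> None:
        self._send_batch = send_batch
        self._batch_size = batch_size
        self._queue: "queue.Queue[dict]" = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def record(self, event: dict) -> None:
        """Called on the request path: just enqueue and return immediately."""
        self._queue.put(event)

    def _run(self) -> None:
        while True:
            batch = [self._queue.get()]                 # block until there is work
            while len(batch) < self._batch_size:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            try:
                self._send_batch(batch)                 # the slow network call happens here
            except Exception:
                pass   # a real exporter would log and retry; telemetry must not crash the app
            finally:
                for _ in batch:
                    self._queue.task_done()

    def flush(self) -> None:
        """Block until everything recorded so far has been handed to send_batch."""
        self._queue.join()


sent: List[dict] = []
exporter = BackgroundExporter(send_batch=sent.extend)

for i in range(25):
    exporter.record({"span": f"llm_call_{i}"})   # returns without waiting on the network

exporter.flush()   # short-lived scripts should flush before exiting
print(len(sent))
```

The flip side of this design is that a script or serverless function can exit before the queue drains, which is why SDKs built this way typically offer a flush call; check the Langfuse docs for the exact method in your SDK version.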

Do I have to use LangChain to use Langfuse?

No — that’s the point. Langfuse is framework-agnostic. The OpenAI drop-in wrapper and the @observe decorator work with plain SDK calls, CrewAI, LlamaIndex, raw HTTP, or your own custom orchestration. LangChain is just one of many supported integrations.

What’s the difference between Langfuse and LangSmith?

LangSmith is a closed-source managed product from the LangChain team, with the deepest hooks into the LangChain ecosystem. Langfuse is open-source, can be self-hosted for free, and is deliberately framework-agnostic. If you’re not married to LangChain — or you need to keep data on your own infrastructure — Langfuse is the more flexible choice.

Can I use Langfuse with free APIs like Gemini, Groq, or DeepSeek?

Yes. Any OpenAI-compatible endpoint works with the drop-in wrapper — just set the base_url. Groq, Together, DeepSeek, Mistral, and OpenRouter all qualify, and Gemini works through its OpenAI-compatible layer or a dedicated integration.

Does Langfuse store my prompts and completions?

Yes, that’s how tracing works. On Cloud, the data lives in the Langfuse Cloud region you signed up in (EU or US). If your prompts contain sensitive data you can’t send to a third party, self-host: then the data never leaves your infrastructure. The SDK also supports masking specific fields before they’re sent.
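A masking step is, at its core, a function that rewrites a payload before it leaves your process. The sketch below shows only that function: the regexes are simplistic examples, `mask` is a hypothetical name, and how you register it with the SDK is not shown here, so check the Langfuse docs for the current hook.

```python
import re
from typing import Any

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def mask(data: Any) -> Any:
    """Recursively redact emails and card-like digit runs in strings, dicts, and lists."""
    if isinstance(data, str):
        return CARD.sub("[REDACTED_CARD]", EMAIL.sub("[REDACTED_EMAIL]", data))
    if isinstance(data, dict):
        return {key: mask(value) for key, value in data.items()}
    if isinstance(data, list):
        return [mask(item) for item in data]
    return data   # numbers, None, and other types pass through unchanged


message = {"role": "user", "content": "I'm ada@example.com, card 4111 1111 1111 1111."}
print(mask(message))
```

Pattern-based redaction misses anything the patterns do not anticipate, so for strict requirements self-hosting remains the safer answer.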

Can it track cost?

Yes. Langfuse computes per-call token usage and a dollar estimate based on each model’s pricing, then aggregates it into dashboards by day, model, user, or feature — so you can find your most expensive endpoint at a glance.

What database does self-hosted Langfuse need?

The current architecture uses Postgres for transactional data and ClickHouse for high-volume trace analytics, plus Redis for queuing. The official Docker Compose file provisions all of them, so you don’t assemble it by hand.

Use Langfuse with OpenClaw

OpenClaw is an AI agent platform for orchestrating multi-step automated workflows — exactly the kind of long-running, multi-call system where a single failed step is otherwise invisible. Pointing OpenClaw’s model calls at the Langfuse-wrapped client gives every automated run a full trace tree.

A practical pairing: OpenClaw runs an unattended nightly pipeline (summarize new tickets, draft responses, flag anomalies). Each run is one Langfuse trace, with a span for every model call and tool use. In the morning you don’t re-read logs — you scan the Langfuse dashboard for any trace with a low score or an error span, open just those, and see exactly which step went sideways. Wire OpenClaw and Langfuse to the same free OpenRouter or Gemini key and the whole observe-and-iterate loop costs nothing.

Final Verdict

Langfuse is the right default in 2026 for anyone shipping an LLM app who has been burned by a bug they couldn’t reproduce. It captures the full trace tree, manages your prompts as versioned objects, and runs evaluations — and it does all of that as an open-source platform you can self-host for free with no caps, or run on a free Cloud tier in ninety seconds. The framework-agnostic SDK means it fits whatever stack you already have, and the OpenAI drop-in wrapper means your first trace is one import line away.

LangSmith is the smoother ride if you live entirely in LangChain, Phoenix is the notebook-native choice for evals, and Helicone wins on pure zero-effort logging. But for the broadest combination of openness, features, and a real free path, Langfuse is the one to install first. Spin up the Docker stack or grab a Cloud Hobby key, change one import in your app, and watch your first trace appear — then ask yourself how you ever debugged AI without it.
