Your AI app is in production. Users complain about bad responses but you can't reproduce the issue. You don't know which prompts perform best, how much each conversation costs, or why latency spiked at 3pm. Langfuse is the missing observability layer for LLM applications.
## What Langfuse Actually Does
Langfuse is an open-source LLM engineering platform. It traces every LLM call in your application — capturing inputs, outputs, latency, token usage, cost, and user feedback. Think Datadog but specifically designed for AI applications.
The platform provides: tracing (follow a request through your entire LLM chain), prompt management (version and A/B test prompts), evaluations (automated quality scoring), analytics (cost per user, latency percentiles, token usage trends), and datasets (build test sets from production data).
Langfuse integrates with OpenAI, Anthropic, LangChain, LlamaIndex, and any custom LLM setup. You can self-host it (free, MIT-licensed open source) or use Langfuse Cloud (free tier: 50K observations/month).
## Quick Start
```bash
pip install langfuse
```
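The SDK reads its credentials from environment variables; the keys come from your Langfuse project settings (the values below are placeholders):

```python
import os

# Placeholder keys — copy the real ones from your Langfuse project settings
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
# Defaults to Langfuse Cloud; point this at your own deployment if self-hosting
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"
```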
Drop-in OpenAI replacement (zero code changes):
```python
from langfuse.openai import openai

# Use exactly like the OpenAI SDK — Langfuse traces automatically
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    metadata={"user_id": "user-123", "session_id": "sess-456"},
)
# Langfuse captures: prompt, response, tokens, cost, latency — automatically
```
Manual tracing for complex chains:
```python
from langfuse import Langfuse

langfuse = Langfuse()
trace = langfuse.trace(name="rag-pipeline", user_id="user-123")

# Trace the retrieval step
span = trace.span(name="vector-search")
docs = vector_db.search(query, top_k=5)
span.end(output={"doc_count": len(docs)})

# Trace the LLM call
generation = trace.generation(
    name="answer-generation",
    model="gpt-4",
    input=[{"role": "user", "content": query}],
    model_parameters={"temperature": 0.7},
)
response = openai_client.chat.completions.create(...)
generation.end(output=response.choices[0].message.content)
```
## 3 Practical Use Cases

### 1. Cost Tracking Per User
```python
# Tag every LLM call with user info
trace = langfuse.trace(
    name="chat",
    user_id=user.id,
    metadata={"plan": user.plan, "feature": "code-review"},
)
```

In the Langfuse dashboard you can then see:
- Cost per user per day
- Token usage by feature
- Which users consume the most
- ROI: cost vs the user's subscription revenue
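To make the rollup concrete, here is a minimal, self-contained sketch of the kind of per-user aggregation the dashboard performs; the trace records are hypothetical:

```python
from collections import defaultdict

# Hypothetical trace records, shaped like the data the dashboard aggregates
traces = [
    {"user_id": "user-123", "cost_usd": 0.012, "feature": "code-review"},
    {"user_id": "user-123", "cost_usd": 0.030, "feature": "chat"},
    {"user_id": "user-456", "cost_usd": 0.008, "feature": "chat"},
]

cost_per_user = defaultdict(float)
for t in traces:
    cost_per_user[t["user_id"]] += t["cost_usd"]

# user-123 has spent roughly $0.042, user-456 roughly $0.008
```

In practice Langfuse computes this server-side from the model's token counts and per-model pricing, so you never maintain this code yourself.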
### 2. Prompt A/B Testing
```python
# Fetch prompt from Langfuse (versioned, A/B testable)
prompt = langfuse.get_prompt("summarizer", label="production")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "system", "content": prompt.compile(max_length=200)}],
    langfuse_prompt=prompt,  # Links the trace to the prompt version
)
```
Update prompts in Langfuse dashboard — no code deploys. Compare v1 vs v2 with real production data.
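Since Langfuse serves prompts by label, one common A/B pattern is to bucket users deterministically into variant labels. This helper is hypothetical, and the label names are up to you:

```python
import hashlib

def prompt_label_for(user_id: str, split: float = 0.5) -> str:
    """Deterministically bucket a user into one of two prompt labels."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "variant-a" if bucket < split * 100 else "variant-b"

# The same user always lands in the same bucket, so their experience is stable
label = prompt_label_for("user-123")
# prompt = langfuse.get_prompt("summarizer", label=label)
```

Because every trace is linked to its prompt version, you can then compare quality scores and cost per variant in the dashboard.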
### 3. Automated Evaluation
```python
# Score responses automatically
trace.score(
    name="relevance",
    value=evaluate_relevance(query, response),  # Your custom eval
    comment="Automated relevance check",
)

# Or let users score
trace.score(
    name="user-feedback",
    value=1,  # Thumbs up
    comment="User clicked helpful",
)
```
Track quality metrics over time. Alert when scores drop.
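The `evaluate_relevance` function is whatever you supply. A deliberately naive sketch based on term overlap might look like this (real evals often use an LLM judge or embedding similarity instead):

```python
def evaluate_relevance(query: str, response: str) -> float:
    """Naive relevance score: fraction of query terms echoed in the response."""
    q_terms = set(query.lower().split())
    r_terms = set(response.lower().split())
    if not q_terms:
        return 0.0
    return len(q_terms & r_terms) / len(q_terms)
```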
## Why This Matters
Running LLMs in production without observability is like running a web app without logging. You can't debug issues, optimize costs, or improve quality. Langfuse gives you the telemetry you need with minimal integration effort. The drop-in OpenAI wrapper means you can start tracing in 2 minutes.
Need custom data extraction or web scraping solutions? I build production-grade scrapers and data pipelines. Check out my Apify actors or email me at spinov001@gmail.com for custom projects.
Follow me for more free API discoveries every week!