Your AI app is in production. Users complain about bad responses but you can't reproduce the issue. You don't know which prompts perform best, how much each conversation costs, or why latency spiked at 3pm. Langfuse is the missing observability layer for LLM applications.
## What Langfuse Actually Does
Langfuse is an open-source LLM engineering platform. It traces every LLM call in your application — capturing inputs, outputs, latency, token usage, cost, and user feedback. Think Datadog but specifically designed for AI applications.
The platform provides: tracing (follow a request through your entire LLM chain), prompt management (version and A/B test prompts), evaluations (automated quality scoring), analytics (cost per user, latency percentiles, token usage trends), and datasets (build test sets from production data).
Langfuse integrates with OpenAI, Anthropic, LangChain, LlamaIndex, and any custom LLM setup. You can self-host it (free, MIT-licensed open source) or use Langfuse Cloud (free tier: 50K observations/month).
## Quick Start
```bash
pip install langfuse
```
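The SDK reads its credentials from environment variables; the keys come from your Langfuse project settings (the values below are placeholders):

```python
import os

# Placeholder keys — copy the real ones from your Langfuse project settings
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
# Defaults to Langfuse Cloud; point this at your own deployment if self-hosting
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"
```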
Drop-in OpenAI replacement (zero code changes):
```python
from langfuse.openai import openai

# Use exactly like the OpenAI SDK — Langfuse traces automatically
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    metadata={"user_id": "user-123", "session_id": "sess-456"},
)
# Langfuse captures: prompt, response, tokens, cost, latency — automatically
```
Manual tracing for complex chains:
```python
from langfuse import Langfuse

langfuse = Langfuse()
trace = langfuse.trace(name="rag-pipeline", user_id="user-123")

# Trace the retrieval step
span = trace.span(name="vector-search")
docs = vector_db.search(query, top_k=5)
span.end(output={"doc_count": len(docs)})

# Trace the LLM call
generation = trace.generation(
    name="answer-generation",
    model="gpt-4",
    input=[{"role": "user", "content": query}],
    model_parameters={"temperature": 0.7},
)
response = openai_client.chat.completions.create(...)
generation.end(output=response.choices[0].message.content)
```
## 3 Practical Use Cases

### 1. Cost Tracking Per User
```python
# Tag every LLM call with user info
trace = langfuse.trace(
    name="chat",
    user_id=user.id,
    metadata={"plan": user.plan, "feature": "code-review"},
)
```

In the Langfuse dashboard you can then see:
- Cost per user per day
- Token usage by feature
- Which users consume the most
- ROI: cost vs the user's subscription revenue
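To make the rollup concrete, here is a minimal, self-contained sketch of the kind of per-user aggregation the dashboard performs; the trace records are hypothetical:

```python
from collections import defaultdict

# Hypothetical trace records, shaped like the data the dashboard aggregates
traces = [
    {"user_id": "user-123", "cost_usd": 0.012, "feature": "code-review"},
    {"user_id": "user-123", "cost_usd": 0.030, "feature": "chat"},
    {"user_id": "user-456", "cost_usd": 0.008, "feature": "chat"},
]

cost_per_user = defaultdict(float)
for t in traces:
    cost_per_user[t["user_id"]] += t["cost_usd"]

# user-123 has spent roughly $0.042, user-456 roughly $0.008
```

In practice Langfuse computes this server-side from the model's token counts and per-model pricing, so you never maintain this code yourself.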
### 2. Prompt A/B Testing
```python
# Fetch prompt from Langfuse (versioned, A/B testable)
prompt = langfuse.get_prompt("summarizer", label="production")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "system", "content": prompt.compile(max_length=200)}],
    langfuse_prompt=prompt,  # Links the trace to the prompt version
)
```
Update prompts in Langfuse dashboard — no code deploys. Compare v1 vs v2 with real production data.
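Since Langfuse serves prompts by label, one common A/B pattern is to bucket users deterministically into variant labels. This helper is hypothetical, and the label names are up to you:

```python
import hashlib

def prompt_label_for(user_id: str, split: float = 0.5) -> str:
    """Deterministically bucket a user into one of two prompt labels."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "variant-a" if bucket < split * 100 else "variant-b"

# The same user always lands in the same bucket, so their experience is stable
label = prompt_label_for("user-123")
# prompt = langfuse.get_prompt("summarizer", label=label)
```

Because every trace is linked to its prompt version, you can then compare quality scores and cost per variant in the dashboard.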
### 3. Automated Evaluation
```python
# Score responses automatically
trace.score(
    name="relevance",
    value=evaluate_relevance(query, response),  # Your custom eval
    comment="Automated relevance check",
)

# Or let users score
trace.score(
    name="user-feedback",
    value=1,  # Thumbs up
    comment="User clicked helpful",
)
```
Track quality metrics over time. Alert when scores drop.
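The `evaluate_relevance` function is whatever you supply. A deliberately naive sketch based on term overlap might look like this (real evals often use an LLM judge or embedding similarity instead):

```python
def evaluate_relevance(query: str, response: str) -> float:
    """Naive relevance score: fraction of query terms echoed in the response."""
    q_terms = set(query.lower().split())
    r_terms = set(response.lower().split())
    if not q_terms:
        return 0.0
    return len(q_terms & r_terms) / len(q_terms)
```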
## Why This Matters
Running LLMs in production without observability is like running a web app without logging. You can't debug issues, optimize costs, or improve quality. Langfuse gives you the telemetry you need with minimal integration effort. The drop-in OpenAI wrapper means you can start tracing in 2 minutes.
Need custom data extraction or web scraping solutions? I build production-grade scrapers and data pipelines. Check out my Apify actors or email me at spinov001@gmail.com for custom projects.
Follow me for more free API discoveries every week!