I want to tell you something that did not make it into the README.
I built zey-ollama-rag-lab to show off ZeroEntropy's retrieval quality against a base LLM, side by side in a web UI. The demo worked. I was happy with it.
Then I added Langfuse. Within five minutes of looking at real traces, I found two bugs I had no idea were there.
One was embarrassing. One was silent and dangerous. Both would have stayed hidden without observability on the pipeline.
Let me show you what I found and exactly how everything is set up.
What the project does
The app compares two paths for answering a question:
- Base LLM path: the question goes straight to TinyLlama via Ollama. No context, no retrieval.
- RAG path: ZeroEntropy retrieves relevant chunks from your indexed documents, reranks them using zerank-2, then TinyLlama generates an answer using that context.
Both answers appear side by side in a web UI. The whole thing runs locally. Your documents never leave your machine.
The claim the project makes is that the RAG path produces better answers. Langfuse is how you find out if that claim is actually true.
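Stripped of the web UI and the real services, the comparison boils down to two functions. Here is a minimal sketch of the shape of it, with `llm` and `retrieve` as hypothetical stubs standing in for TinyLlama and ZeroEntropy (none of this is the app's actual code):

```python
def llm(prompt: str) -> str:
    # Stub for TinyLlama via Ollama
    return f"answer({prompt[:20]}...)"

def retrieve(query: str) -> str:
    # Stub for ZeroEntropy retrieval + zerank-2 reranking
    return "k8s is short for Kubernetes."

def answer_base(query: str) -> str:
    # Base path: question goes straight to the model, no context
    return llm(query)

def answer_rag(query: str) -> str:
    # RAG path: retrieved context is prepended before generation
    context = retrieve(query)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")
```

Same question in, two different prompts out; the UI just renders both answers next to each other.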
Running the project
The cleanest way to run this is by cloning the repo and using Docker. The Dockerfile handles everything: it installs Ollama inside the container, pulls TinyLlama, installs the Python dependencies, and starts the FastAPI server. You do not need Python installed locally or Ollama set up separately.
git clone https://github.com/Taiwrash/zey-ollama-rag-lab.git
cd zey-ollama-rag-lab
Create your .env file using the provided example:
cp .env-example .env
Open .env and fill in your keys:
ZEROENTROPY_API_KEY=your_zeroentropy_key
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com
Then build and run:
docker build -t zey-rag-lab .
docker run --env-file .env -p 8000:8000 zey-rag-lab
The app is at http://localhost:8000. That is it. One clone, one build, one run.
The Dockerfile installs Ollama directly inside the container, exposes both port 8000 for FastAPI and 11434 for Ollama, and sets OLLAMA_HOST=http://localhost:11434 so both services find each other. The entrypoint.sh script handles starting Ollama and the FastAPI app in the right order. You do not need to think about any of that; it just works.
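If you are curious what "the right order" has to guarantee: FastAPI cannot serve queries until Ollama is accepting connections on 11434. Here is a sketch of that kind of readiness check in Python; the function name and timeout are my own, not anything from the repo's entrypoint.sh:

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP port accepts connections, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True  # the service (e.g. Ollama) is up
        except OSError:
            time.sleep(0.5)  # not listening yet, retry
    return False

# e.g. wait_for_port("localhost", 11434) before launching the FastAPI app
```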
Setting up Langfuse Cloud
For the observability side, I used Langfuse Cloud. Go to cloud.langfuse.com, create an account, create a project (mine is called rag-ranker-comp), and grab your public and secret keys from the Settings page.

Those three lines in your .env are all you need:
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com
The SDK is already in requirements.txt, so it gets installed when Docker builds the image. No extra steps.
How the Langfuse integration works in the code
The integration is already in app.py. Here is how the three key pieces work.
The observe decorator
from langfuse import Langfuse, observe
lf = Langfuse()
Langfuse() reads your keys from environment variables automatically. Then @observe wraps any function to create a span in Langfuse:
@observe(name="base-llm-generation", as_type="generation")
def _base_generation(query: str) -> str:
    response = ollama_client.chat(
        model=MODEL,
        messages=[{"role": "user", "content": query}],
    )
    text = response["message"]["content"]
    lf.update_current_generation(
        metadata={"mode": "base"},
        usage_details={"total": response.get("eval_count", 0)},
    )
    return text
Note usage_details instead of usage. That is the Langfuse v4 SDK pattern. The token count from Ollama gets attached to the generation span.
Tracing ZeroEntropy retrieval
@observe(name="zeroentropy-retrieval")
def _retrieve_context(query: str, collection: str, k: int = 3):
    snippets = zclient.queries.top_snippets(
        collection_name=collection,
        query=query,
        k=k,
        reranker="zerank-2",
    )
    chunks = [
        {"content": s.content, "score": getattr(s, "score", None)}
        for s in snippets.results
    ]
    lf.update_current_span(
        metadata={
            "collection": collection,
            "reranker": "zerank-2",
            "num_chunks": len(chunks),
            "top_score": chunks[0]["score"] if chunks else None,
        },
    )
    return "\n\n".join(c["content"] for c in chunks), chunks
Every retrieval call logs the collection name, how many chunks came back, the reranker used, and the top relevance score. When retrieval fails silently, you see exactly why.
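The useful property of that metadata is how it degrades: an empty result set still produces a valid span, just with num_chunks at 0 and top_score at None, which is exactly the signature you scan for in the dashboard. A small standalone sketch of that logic (the helper name is mine, not something in app.py):

```python
def retrieval_metadata(chunks: list, collection: str, reranker: str = "zerank-2") -> dict:
    # Mirrors the fields logged to the Langfuse span for each retrieval call
    return {
        "collection": collection,
        "reranker": reranker,
        "num_chunks": len(chunks),
        "top_score": chunks[0]["score"] if chunks else None,
    }

ok = retrieval_metadata([{"content": "k8s docs...", "score": 0.91}], "demo_collection")
empty = retrieval_metadata([], "demo_collection")  # the failure signature: 0 chunks, no score
```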
Prompt versioning for the RAG system prompt
This part is easy to miss but genuinely useful:
@observe(name="rag-llm-generation", as_type="generation")
def _rag_generation(query: str, context_text: str) -> str:
    try:
        remote_prompt = lf.get_prompt("rag-system-prompt")
        system_prompt = remote_prompt.compile(context=context_text)
        lf.update_current_generation(prompt=remote_prompt)
    except Exception:
        system_prompt = (
            "You are a head of developer experience at ZeroEntropy. "
            "Use the following retrieved context to answer accurately. "
            "If the answer is not in the context, say you don't know.\n\n"
            f"Context:\n{context_text}"
        )
The app first tries to fetch a versioned prompt called "rag-system-prompt" from Langfuse. If you have created one in the Langfuse UI, it uses that. If not, it falls back to the hardcoded version. Either way, the trace records which prompt version generated which output. You can iterate on your system prompt from the Langfuse UI without touching code or rebuilding the container.
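Under the hood, compile is essentially variable substitution: Langfuse prompt templates use {{variable}} placeholders, and compile(context=...) fills them in. A rough sketch of the idea, as an illustration only, not Langfuse's actual implementation:

```python
def compile_prompt(template: str, **variables) -> str:
    # Naive {{name}} substitution, enough to illustrate what compile() produces
    out = template
    for name, value in variables.items():
        out = out.replace("{{" + name + "}}", str(value))
    return out

template = "Use the retrieved context to answer.\n\nContext:\n{{context}}"
system_prompt = compile_prompt(template, context="ZeroEntropy is a retrieval API.")
```

The point of fetching the template remotely is that the placeholder stays fixed while everything around it can change per version.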
Human feedback endpoint
@app.post("/api/feedback")
async def feedback(data: FeedbackRequest):
    lf.score(
        trace_id=data.trace_id,
        name="user-preference",
        value=float(data.value),
        comment=f"preferred: {data.mode}",
    )
The /api/ask endpoint returns a trace_id with every response. When the user clicks a preference in the UI, that score gets attached to the exact trace it came from. Over many queries you start building real data on when RAG wins and when it does not.
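The payload the UI will eventually send is small. Here is a sketch of the request model the endpoint implies, using a dataclass as a stand-in for the Pydantic FeedbackRequest; the field types are my inference from how they are used, not the repo's definition:

```python
from dataclasses import dataclass

@dataclass
class FeedbackRequest:
    trace_id: str   # returned by /api/ask with every response
    value: int      # e.g. 1 = thumbs up, 0 = thumbs down
    mode: str       # which answer was preferred: "base" or "rag"

fb = FeedbackRequest(trace_id="abc123", value=1, mode="rag")
score_value = float(fb.value)   # what lf.score() receives as `value`
```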
What the traces actually showed me
I ran the container and asked "what is k8s?" a few times. Then I opened cloud.langfuse.com and looked at the traces.
Here is the raw export:
Traces 1 and 2, base-llm-generation:
{
  "name": "base-llm-generation",
  "input": "{\"args\": [\"what is k8s?\"], \"kwargs\": {}}",
  "output": null
}
Output is null. Twice. The trace captured the input arriving and the function running, but nothing coming back out. Without Langfuse I would have seen an empty box in the UI and assumed it was a frontend rendering bug. With the trace I knew immediately: the generation itself is returning nothing. That pointed straight at the Ollama response parsing.
Trace 3, base-llm-generation, after I dug into the code:
{
  "output": "Kubernetes (kurz für Kubernetes Engine oder K8s) ist ein Containerorchestrator..."
}
It answered. In German. For an English question. TinyLlama just decided to respond in German, confidently and fluently. In the UI this looks like text appeared so you think it worked. In the trace you can read the exact output and know it did not.
Trace 4, zeroentropy-retrieval:
{
  "name": "zeroentropy-retrieval",
  "input": "{\"args\": [\"what is k8s?\", \"demo_collection\"], \"kwargs\": {}}",
  "output": null
}
Retrieval ran and returned nothing. The reason showed up in the next trace.
Trace 5, rag-llm-generation:
{
  "input": "{\"args\": [\"what is k8s?\", \"[Retrieval error: Error code: 401 - {'detail': 'API Key Invalid'}]\"]}",
  "output": "Kubernetes (also known as K8s) is a cloud-native platform..."
}
The ZeroEntropy API key was invalid. Retrieval failed with a 401. But the app caught the exception, passed the error string as context, and TinyLlama just ignored it and answered from its own weights.
So the RAG path was silently running as a base LLM with extra steps. No crash. No visible error in the UI. The user sees a correct-looking answer and has no idea retrieval never happened. That is the kind of bug that erodes trust in a system slowly, because the answers look fine until someone checks carefully.
Langfuse caught both bugs in the first five minutes.
What the Langfuse dashboard shows
The home view after that first session showed five total traces in the past day. base-llm-generation appeared three times (I was debugging the base path). rag-llm-generation once. zeroentropy-retrieval once. Model costs at $0.00 because TinyLlama runs locally and free. Scores at zero because the feedback buttons in the UI are the next thing to wire up.
The trace breakdown in the bar chart already tells you something about what is happening in the app. Three base LLM traces to one RAG trace means the base path was the thing failing. If this were a production app with real users, that ratio would tell you which path people are actually using and whether the distribution matches what you expected.

The latency dashboard is where this becomes genuinely useful at scale. How long does zeroentropy-retrieval take on average compared to the generation spans? If your RAG path is slower than the base path, is the retrieval or the generation the bottleneck? You get answers to those questions without adding any logging code.
What I fixed
The null outputs: the issue was in how I was reading the Ollama chat response. The message key was not always present on the response object in the way I expected. Seeing output: null in two consecutive traces made this obvious in a way that a missing text box in the UI never would have.
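The fix amounts to defensive parsing: handle both the dict shape and the attribute shape a client can hand back, and fail loudly when neither carries content. A sketch of that idea under my own assumptions about the response shapes, not a copy of the repo's fix:

```python
def extract_text(response) -> str:
    # Ollama responses can show up dict-shaped or object-shaped depending on
    # client version; handle both, and raise instead of silently returning nothing.
    if isinstance(response, dict):
        message = response.get("message") or {}
        content = message.get("content")
    else:
        message = getattr(response, "message", None)
        content = getattr(message, "content", None)
    if not content:
        raise ValueError(f"empty or missing content in Ollama response: {response!r}")
    return content

text = extract_text({"message": {"content": "K8s is short for Kubernetes."}})
```

A raised exception here would have shown up as an error in both the UI and the trace, instead of a null output you have to notice.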
The German response: I added "Answer in English." to the base prompt. TinyLlama needs the explicit instruction. The traces now show consistent English output.
The 401 error: I updated .env-example in the repo to make it clearer that ZEROENTROPY_API_KEY has to be set before anything works. The silent failure was the real problem. The app should surface the retrieval error visibly in the UI rather than passing it as context to the LLM and pretending everything worked.
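One way to do that is to stop treating the error string as context at all: return a structured result and let the endpoint decide what to render. A hedged sketch with the retriever injected so the failure path is explicit; the names are mine, not app.py's:

```python
def retrieve_or_error(query: str, retriever) -> dict:
    # `retriever` stands in for _retrieve_context; any exception or empty
    # result becomes a visible error instead of fake "context" for the LLM.
    try:
        context, chunks = retriever(query)
    except Exception as exc:
        return {"ok": False, "error": f"Retrieval failed: {exc}"}
    if not chunks:
        return {"ok": False, "error": "Retrieval returned no chunks"}
    return {"ok": True, "context": context, "chunks": chunks}

def bad_retriever(query):
    # Simulates the invalid-key failure seen in trace 4
    raise RuntimeError("Error code: 401 - API Key Invalid")

result = retrieve_or_error("what is k8s?", bad_retriever)
# result["ok"] is False; the UI can show result["error"] instead of a fake RAG answer
```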
What is still left to do
The dashboard shows zero scores tracked. The /api/feedback endpoint is already in the backend code and working. I just have not added the thumbs up and thumbs down buttons to the UI yet.
Once those are wired up, every user preference flows into Langfuse as a score attached to a trace. Over enough queries you can actually answer the question this whole project is built around: how often does RAG beat the base model, on what kinds of questions, and what happens to that number when retrieval fails?
That is what turns a demo into a system you can measure and improve over time.
The full running setup
Clone, configure, build, run:
git clone https://github.com/Taiwrash/zey-ollama-rag-lab.git
cd zey-ollama-rag-lab
cp .env-example .env
# fill in ZEROENTROPY_API_KEY, LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY
docker build -t zey-rag-lab .
docker run --env-file .env -p 8000:8000 zey-rag-lab
App at http://localhost:8000. Langfuse dashboard at cloud.langfuse.com. Traces start flowing from the first query.
The Langfuse integration is already in the code. You just need the keys.
Find me on Twitter at @Taiwrash if you find something interesting in your traces.
Langfuse is open source. Cloud hosted at cloud.langfuse.com. Self-hostable if you need everything on-prem. Docs at langfuse.com/docs.