Disclosure: I learn topics like this through LLM dialogue. The prompts are mine, the depth comes from the model, the verification comes back to me, and I publish the result so others don't have to start from zero.
Repo: github.com/jmolinasoler/langfuse-ollama — MIT, Python 3.10+, no native OTLP exporter, no monkey-patching of Ollama, no MagicMock chains in the test suite.
Four files. One langfuse.openai.OpenAI import. Every local Ollama chat turn lands in Langfuse with session_id, user_id, tags, token counts, and reconstructed stream chunks — including from ollama serve running on localhost:11434. Streamlit UI, CLI, dependency-injected tests, all under MIT.
If you just want to run it:
git clone https://github.com/jmolinasoler/langfuse-ollama.git
cd langfuse-ollama
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # fill in LANGFUSE_PUBLIC_KEY / SECRET_KEY
streamlit run app.py
The rest of this post explains why each design choice exists, so you can fork it for your own provider (vLLM, LiteLLM, TGI, anything OpenAI-compatible) without re-discovering the v4 OTel migration footguns.
1. Why a wrapper, not native OTLP
Ollama exposes no OTLP endpoint of its own. The two real options are:
- Manually instrument the HTTP client with OpenTelemetry, then ship spans to Langfuse via OTLP ingest.
- Use the OpenAI-compatible endpoint Ollama already serves at /v1 and wrap the OpenAI client with Langfuse's drop-in subclass.
The repo uses option 2. The entire integration surface is in ollama_client.py:
from langfuse.openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by openai-python, ignored by Ollama
)

resp = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain MiCA Article 16"}],
)
The subclass intercepts .create() calls, opens a Langfuse generation, attaches input/output, counts tokens, measures latency, and closes the span — for both streaming and non-streaming responses. Token usage comes from the response payload, not middleware estimation. Stream chunks get reassembled inside the wrapper, not in application code.
For local LLMs specifically: no spans get lost because there is no separate exporter to crash, and there is nothing to instrument in the Ollama binary itself.
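To make that concrete, here is a minimal sketch of reading the traced response back; it assumes the client and resp from the example above, and that your Ollama version populates usage:
# Continuing the call above: the traced response is a normal openai-python
# ChatCompletion, so application code reads it exactly as it would without
# Langfuse. (Non-zero token counts assume a recent Ollama that returns usage.)
print(resp.choices[0].message.content)
print(resp.usage.prompt_tokens, resp.usage.completion_tokens)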
2. The Langfuse v4 OTel context migration (where most people break things)
Langfuse v4 is built on OpenTelemetry. This changes how trace-level metadata gets attached. Pre-v4, you passed session_id, user_id, and tags as kwargs to .create(). In v4 those fields live in the OTel context, set via propagate_attributes():
from langfuse.openai import OpenAI
from langfuse import propagate_attributes

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

with propagate_attributes(
    session_id="sess-uuid-here",
    user_id="alice",
    tags=["llama3.1", "defi-research"],
):
    resp = client.chat.completions.create(
        model="llama3.1",
        messages=messages,
        name="ollama-chat",  # this one IS a Langfuse-specific kwarg
    )
The rule:

| Field | Where to put it |
|---|---|
| session_id, user_id, tags, metadata | propagate_attributes() context |
| name (trace/generation name) | .create() kwarg |
| OpenAI-native fields (model, messages, temperature, max_tokens) | .create() kwarg |
Passing session_id directly to .create() in v4 either gets dropped or surfaces as a generation-level metadata key — not a trace-level session. The call still succeeds. The trace still shows up. Multi-turn conversations just stop grouping. This is the single most common migration footgun.
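Sketched out, the pre-v4 habit that silently degrades looks like this (client and messages as in the example above):
# Pre-v4 habit: in v4 this still returns a response and a trace, but
# session_id is no longer promoted to a trace-level field, so multi-turn
# grouping quietly breaks.
resp = client.chat.completions.create(
    model="llama3.1",
    messages=messages,
    session_id="sess-uuid-here",  # dropped or demoted to generation metadata in v4
)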
The repo encodes this rule directly in chat_complete() and chat_stream() so callers cannot accidentally pass session metadata to the wrong place.
3. Stream reconstruction in a single trace
Streaming responses present an obvious instrumentation problem: each chunk is a separate yield, but you want one trace with the full reconstructed output. The wrapper handles this transparently — but only if you drain the iterator inside the context. From ollama_client.py:
def chat_stream(client, model, messages, **kwargs):
    with propagate_attributes(
        session_id=kwargs.pop("session_id"),
        user_id=kwargs.pop("user_id"),
        tags=kwargs.pop("tags", []),
    ):
        stream = client.chat.completions.create(
            model=model,
            messages=messages,
            stream=True,
            name=kwargs.pop("trace_name", "ollama-stream"),
            **kwargs,
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content or ""
            yield delta
Two non-obvious requirements that the repo's structure enforces:
- The propagate_attributes context must wrap the entire stream consumption, not just the .create() call. Exiting the context before the iterator drains causes attribute loss on later chunks.
- Do not wrap the generator in list(...) for "convenience" inside the context; that defeats streaming. Accumulate downstream, as in the sketch below.
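What "accumulate downstream" looks like from the caller's side; a minimal sketch assuming the chat_stream signature above, with placeholder session values:
# Drain the generator (which keeps the OTel context open inside chat_stream)
# and assemble the full answer outside the traced code path.
full_text = []
for delta in chat_stream(
    client,
    "llama3.1",
    [{"role": "user", "content": "Explain MiCA Article 16"}],
    session_id="sess-uuid-here",
    user_id="alice",
):
    print(delta, end="", flush=True)  # live output to the terminal or UI
    full_text.append(delta)
answer = "".join(full_text)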
4. Lazy imports + dependency injection (why the tests don't need mocks)
The module-level import problem: from langfuse.openai import OpenAI triggers SDK initialization, which validates credentials and opens an OTel exporter. Fine in production, fatal in CI.
The fix — defer the import, inject the client:
import os

def chat_complete(messages, model, *, client=None, **kwargs):
    if client is None:
        from langfuse.openai import OpenAI  # lazy: no SDK init until first real call
        client = OpenAI(
            base_url=os.environ["OLLAMA_BASE_URL"] + "/v1",
            api_key="ollama",
        )
    return client.chat.completions.create(
        model=model, messages=messages, **kwargs
    )
Tests in tests/test_ollama_client.py pass a fake client with a .chat.completions.create() shape. No unittest.mock.patch, no MagicMock chains, no module-level monkey-patching:
import unittest
from types import SimpleNamespace

from ollama_client import chat_complete

class FakeChatCompletions:
    def create(self, **kwargs):
        return SimpleNamespace(
            choices=[SimpleNamespace(message=SimpleNamespace(content="ok"))]
        )

class FakeClient:
    chat = SimpleNamespace(completions=FakeChatCompletions())

class TestChatComplete(unittest.TestCase):
    def test_chat_complete_returns_content(self):
        resp = chat_complete(
            [{"role": "user", "content": "hi"}], "llama3.1", client=FakeClient()
        )
        assert resp.choices[0].message.content == "ok"
Faster than mock-based tests (no import-time side effects to suppress), survives SDK upgrades that rename internals, and runs entirely on python -m unittest discover — no pytest, no fixtures, no plugins.
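The same pattern extends to streaming. A sketch (not taken from the repo) of a fake whose chunks mirror the choices[0].delta.content shape, assuming chat_stream keeps the signature from section 3 and that propagate_attributes runs without a configured exporter in CI:
from types import SimpleNamespace
import unittest

from ollama_client import chat_stream

class FakeStreamingCompletions:
    def create(self, **kwargs):
        # Yield chunk-shaped objects the way a real streamed response would.
        def chunks():
            for piece in ("Hel", "lo"):
                yield SimpleNamespace(
                    choices=[SimpleNamespace(delta=SimpleNamespace(content=piece))]
                )
        return chunks()

class FakeStreamingClient:
    chat = SimpleNamespace(completions=FakeStreamingCompletions())

class TestChatStream(unittest.TestCase):
    def test_chat_stream_yields_deltas(self):
        out = "".join(
            chat_stream(
                FakeStreamingClient(),
                "llama3.1",
                [{"role": "user", "content": "hi"}],
                session_id="s-1",
                user_id="alice",
            )
        )
        assert out == "Hello"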
5. What ends up in Langfuse
For every .create() call inside a propagate_attributes block, the wrapper emits:
| Field | Source |
|---|---|
| session_id | OTel context (groups multi-turn conversations) |
| user_id | OTel context |
| tags | OTel context (list[str]) |
| name | .create() kwarg, defaults to ollama-chat |
| Input messages | .create() messages argument, full array |
| Output content | Response choices[0].message.content, or reconstructed from stream |
| Input/output tokens | Response usage field, when Ollama returns it |
| Latency | Wall-clock between .create() entry and final chunk |
| Model | .create() model argument, verbatim |
Token counts depend on Ollama returning a usage block — newer Ollama versions do, older ones return zeros. If tokens read as 0, upgrade Ollama before debugging the wrapper.
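A quick way to check before blaming the wrapper; a sketch using the stock openai-python client directly against Ollama's /v1 endpoint, bypassing Langfuse entirely:
# Ask Ollama itself what it reports for token usage.
from openai import OpenAI

plain = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = plain.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.usage)  # expect non-zero prompt/completion tokens on recent Ollama versions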
6. CLI surface for batch eval runs
For replaying fixture sets, A/B-ing prompts across models, or driving a leaderboard run, trace_cli.py composes the same client with argparse:
python trace_cli.py \
--model llama3.1 \
--prompt "Summarize ERC-4626" \
--user-id alice \
--trace-name "defi-research" \
--tags "defi,erc4626" \
--temperature 0.5
Each invocation gets a fresh session_id (UUID) by default; pass a shared one to group multiple invocations into one Langfuse session. This is the pattern for batch evaluation runs where every prompt in a fixture file should show up under a single session for aggregate scoring.
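A sketch of that pattern in Python rather than via the CLI, assuming the chat_complete signature from section 4; the user_id, tag, and two-prompt fixture list are illustrative:
# Group an entire fixture run under one Langfuse session for aggregate scoring.
import uuid

from langfuse import propagate_attributes
from ollama_client import chat_complete

session_id = str(uuid.uuid4())  # shared across every prompt in the run
fixtures = ["Summarize ERC-4626", "Summarize ERC-7540"]  # illustrative prompts

for prompt in fixtures:
    with propagate_attributes(
        session_id=session_id,
        user_id="eval-bot",
        tags=["batch-eval"],
    ):
        resp = chat_complete([{"role": "user", "content": prompt}], "llama3.1")
        print(resp.choices[0].message.content[:120])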
7. Self-hosted Langfuse: one env var
LANGFUSE_BASE_URL=http://localhost:3000 # self-hosted
# LANGFUSE_BASE_URL=https://cloud.langfuse.com # EU cloud
# LANGFUSE_BASE_URL=https://us.cloud.langfuse.com # US cloud
The wrapper reads this at SDK init. If you swap the URL mid-process, the existing client keeps the old endpoint — instantiate a new OpenAI(...) after the swap.
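A minimal sketch of that caveat, keeping the repo's env-var name and assuming it is wired through to the SDK at init:
# Set the endpoint before the SDK initializes; an already-constructed client
# keeps whatever endpoint it read, so build a fresh one after any swap.
import os

os.environ["LANGFUSE_BASE_URL"] = "http://localhost:3000"  # self-hosted instance

from langfuse.openai import OpenAI  # imported after the env var is in place

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")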
Action list
- Clone the repo and run it against your own Ollama instance: github.com/jmolinasoler/langfuse-ollama. Fork it for vLLM, LiteLLM, TGI, or any OpenAI-compatible backend — the wrapper code path is identical.
- In your own projects, replace any direct openai.OpenAI import with langfuse.openai.OpenAI for any OpenAI-compatible endpoint.
- Move session_id, user_id, and tags out of .create() kwargs and into a propagate_attributes() block; anything left on .create() in v4 is silently downgraded to metadata.
- Wrap the entire stream consumption in the context manager, not just the .create() call.
- Defer the SDK import to function bodies and accept an injected client argument; tests get faster and survive SDK refactors.
- Verify Ollama is on a version that returns a populated usage block before debugging zero-token traces.
- For batch eval runs, share a single session_id across invocations so aggregate scoring groups correctly in the Langfuse UI.
Issues, forks, and PRs welcome on the repo.