Hello! I'd love to share the story of my journey building an 'AI Journaling Assistant.' This article is the first chapter: a story about my initial battle with a chaotic, looping agent and how observability tools helped me win.
But first, let me explain the motivation behind this project.
One of the things I love most is reflection. I have countless journals in different formats: quick notes on my phone, handwritten pages, video diaries, and chats with LLMs. They're a rich history of my thoughts, but they're scattered everywhere. My goal was to unite them, to build a single, private space where I could analyze my own thoughts and memories. I also thought it was the perfect opportunity to work on a project that combines two of my greatest passions: journaling and data science.
The Setup
To begin, I needed a simple agent capable of handling a few basic tools. As my top priority in this project was privacy, my main requirement was to use a model that could run entirely locally on my machine. It was also important to me to use free and open-source tools, which led me to Ollama, where I started experimenting with several smaller models to power the agent's logic.
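To give a concrete picture of what "a few basic tools" means here, below is a minimal sketch of the shape of a tool definition and request body for Ollama's /api/chat endpoint. The search_journal_entries tool, its parameters, and the model name are hypothetical examples for illustration, not my actual implementation.

# Sketch of a tool definition in the format Ollama's chat API accepts.
# The tool name and parameters are made-up examples.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search_journal_entries",
        "description": "Search past journal entries by keyword.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Keyword or phrase to search for."},
            },
            "required": ["query"],
        },
    },
}

# Shape of a chat request with tools enabled; the model may respond with a
# tool_call instead of plain text.
payload = {
    "model": "llama3.1",  # any local model pulled via `ollama pull`
    "messages": [{"role": "user", "content": "What did I write about hiking last month?"}],
    "tools": [SEARCH_TOOL],
    "stream": False,
}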
The Problem
The initial model integration went smoothly, but I hit my first major roadblock when I started implementing tools. The agent simply wouldn't trigger them! My first instinct for this "keep it simple" project was to debug the old-fashioned way: with print() statements.
Very soon my console was flooded with an unreadable wall of text, and manually tracing the agent's logic became painful and, frankly, insufficient. Just look at this mess!
The messy console logs proved that I needed a real observability tool. My requirements were: it had to be open-source, easy to set up with my local models, and have a clean, intuitive UI. After a quick search, I chose Langfuse. It met all of my criteria, and it's a tool built specifically for the kind of LLM-native problems I was facing, like tracing agentic chains and evaluating outputs.
The setup was simple. After a quick pip install langfuse and grabbing my API keys, I started by adding the @observe decorator to the functions I wanted to track. This decorator automatically creates a trace and captures performance data.
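If you want to follow along, the keys are typically provided via environment variables. The variable names below are the ones the Langfuse SDK reads by default; the values are placeholders for your own project keys, and the host assumes Langfuse Cloud (a self-hosted instance works the same way with your own URL).

import os

# Standard environment variables read by the Langfuse SDK; values are placeholders.
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"  # or your self-hosted URL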
For more detailed tracking, I added a few lines of code to explicitly log the inputs, outputs, and custom metadata for better visibility.
from langfuse import get_client, observe

@observe(name="journaling_chat")
async def chat(self, message: str) -> str:
    """Send a message to the agent and get a response."""
    response = await chat_with_agent(message, self.context)

    client = get_client()
    client.update_current_span(
        input=message,
        output=response,
        metadata={
            "conversation_length": len(self.context.conversation_history)
        }
    )
    return response
And here is the trace I got.
I also added a context manager to trace the LLM generation itself.
with client.start_as_current_generation(
    name="ollama-request",
    model=self._model_name,
    input=messages
) as generation:
    async with httpx.AsyncClient(timeout=120) as http_client:
        response = await http_client.post(OLLAMA_API_URL, json=payload)
        response_data = response.json()
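One small note on this snippet: by default the generation span only records the request. To also see the model's reply on the span, the response can be attached before the context manager exits. A minimal sketch, assuming the usual Ollama /api/chat response shape with a "message" field:

    # Still inside the generation context: attach the model's reply to the span.
    generation.update(output=response_data.get("message", {}).get("content"))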
The Diagnosis
With full observability in place, I could finally see what was happening. My first attempt with a smaller model showed that tools weren't being triggered at all. Suspecting the model might be the issue, I switched to a more capable one. This created a new, even more dramatic problem: the agent got stuck in an infinite loop, calling the same tool over and over.
My first thought was that the problem had to be either the model's capability or a flaw in my prompt. After several hours of iterating on the prompt and switching models with no success, I decided to focus entirely on the Langfuse traces. Only then did I finally spot the real issue: I could see the tool executing successfully, but its response was never being added back into the chat history. The agent wasn't being stubborn; it was completely unaware of the tool's answer. The problem wasn't a flaw in the model's reasoning, but a bug in how I was handling the conversation history.
A simple code fix to correctly manage the chat history was all it took, and the result was a beautiful, clean trace that executed the tool once and returned the result.
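For context, the fix boiled down to appending the tool's result to the message history before calling the model again. Here is a simplified sketch of that pattern, not my exact code; execute_tool and the message shapes are illustrative stand-ins.

def handle_tool_calls(messages, response_data):
    """Append tool results to the history so the next model call can actually see them."""
    assistant_msg = response_data["message"]
    tool_calls = assistant_msg.get("tool_calls", [])
    if not tool_calls:
        return messages

    messages.append(assistant_msg)  # keep the assistant's tool request in the history
    for call in tool_calls:
        # execute_tool is a hypothetical helper that dispatches to the real tool function
        result = execute_tool(call["function"]["name"], call["function"].get("arguments", {}))
        messages.append({"role": "tool", "content": str(result)})  # the step I was missing
    return messages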
Takeaways
This whole process of building and debugging my agent taught me a few lessons:
- Don't rush to blame the model. My first thought was to blame the unpredictability of the AI, assuming the model was flawed. But in the end, the problem wasn't the model's reasoning at all; it was a simple bug in how my application handled the chat history.
- Don't underestimate the power of a good tool. At first, I resisted using an observability tool because I thought simple print() statements would be faster; I didn't want to "waste time" on setup. However, I realized that the hours I was losing to guesswork could be saved in minutes by looking at a clear trace.




