How ๐‹๐š๐ง๐ ๐’๐ฆ๐ข๐ญ๐ก, ๐‹๐š๐ง๐ ๐†๐ซ๐š๐ฉ๐ก, ๐Ž๐ฅ๐ฅ๐š๐ฆ๐š & ๐…๐€๐ˆ๐’๐’ Gave Me End-to-End Observability in a Local AI Chatbot

I just built a self-contained AI chatbot: no cloud dependencies, no API keys, just pure local power with LangGraph, Ollama, FAISS, and LangSmith.

Tech Stack:

🤖 Goal: Build a fully local AI chatbot that answers questions from uploaded documents, with zero cloud dependencies.
🔧 LangGraph: Orchestrate the chatbot logic as a modular, state-based workflow (e.g., retrieve → generate → feedback).
📚 Ollama + FAISS + TF-IDF: Run a local llama3 model for response generation, and use TF-IDF + FAISS for fast, document-based context retrieval (see the sketch below).
🖥 Streamlit: Provide an interactive web interface where users can upload files and chat with the bot in real time.
📊 LangSmith: Enable full observability: trace queries, inspect prompts, monitor latency, and analyze errors or retrieval issues end to end.
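
To make the flow concrete, here is a minimal sketch of how the pieces can fit together: TF-IDF vectors indexed in FAISS for retrieval, a local llama3 served by Ollama for generation, and LangGraph wiring the two into a retrieve → generate workflow. The chunking, names, and prompt are illustrative, not the exact code from the repo.

```python
# Minimal sketch: TF-IDF + FAISS retrieval feeding a local llama3 via Ollama,
# wired together as a LangGraph retrieve -> generate workflow.
# Assumes `ollama pull llama3` has been run and the Ollama server is up.
from typing import List, TypedDict

import faiss
from sklearn.feature_extraction.text import TfidfVectorizer
from langchain_ollama import ChatOllama
from langgraph.graph import END, START, StateGraph

# Document chunks would normally come from the uploaded files after splitting.
chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(chunks).toarray().astype("float32")
index = faiss.IndexFlatL2(tfidf.shape[1])
index.add(tfidf)

llm = ChatOllama(model="llama3")  # local model served by Ollama


class State(TypedDict):
    question: str
    context: List[str]
    answer: str


def retrieve(state: State) -> dict:
    q = vectorizer.transform([state["question"]]).toarray().astype("float32")
    _, idx = index.search(q, 3)
    return {"context": [chunks[i] for i in idx[0] if i != -1]}


def generate(state: State) -> dict:
    context = "\n".join(state["context"])
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {state['question']}"
    return {"answer": llm.invoke(prompt).content}


builder = StateGraph(State)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)
graph = builder.compile()

print(graph.invoke({"question": "What does the document cover?"})["answer"])
```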

At first, it answered questions well enough.
But the real game-changer?
I could trace every step of its reasoning.

LangSmith gave me the transparency I never knew I needed, revealing the exact document chunks retrieved, the prompts fed to the model, execution times, and even where things went off track.
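
Wiring LangSmith in was mostly configuration: with tracing enabled via environment variables, LangChain and LangGraph runs are sent to a LangSmith project automatically. A minimal sketch (the project name is just what I call it, not anything LangSmith prescribes):

```python
# Enable LangSmith tracing before the graph/app code runs.
# The API key comes from your LangSmith account settings.
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "local-rag-chatbot"  # illustrative project name
```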

🚧 The Problem: "It Works" Isn't Enough

At first, my chatbot seemed to be doing well: it returned reasonable answers to most questions. But then weird things started to happen.
● Was the wrong chunk retrieved?
● Was the prompt malformed?
● Did the model hallucinate?

Without insight into what was happening step by step, debugging was pure guesswork.

How LangSmith Helped Me Debug and Improve

⇒ Tracing
● View each query, chunk retrieval, and LLM response in real time
● Confirm the right part of the document was being used
● Inspect the exact prompt given to llama3 via Ollama (sketched below)

(Screenshot: Tracing)
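
Runs that go through LangChain or LangGraph are traced automatically once tracing is enabled; for plain-Python steps such as a hand-rolled TF-IDF lookup, LangSmith's traceable decorator can pull them into the same trace. A sketch, reusing the vectorizer, index, and chunks from the earlier snippet (the function name is my own):

```python
from langsmith import traceable


@traceable(run_type="retriever", name="tfidf_faiss_retrieve")
def retrieve_chunks(question: str, k: int = 3) -> list[str]:
    # Inputs, outputs, and latency of this call appear as a child run
    # inside the LangSmith trace of the surrounding graph invocation.
    q = vectorizer.transform([question]).toarray().astype("float32")
    _, idx = index.search(q, k)
    return [chunks[i] for i in idx[0] if i != -1]
```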

⇒ Error Analysis
● Trace misfires back to irrelevant or empty document chunks
● Compare expected vs. actual outputs
● Catch malformed inputs or slow model responses

⇒ Performance Metrics
● Track latency for each step (retriever, LLM)
● Identify slowdowns during Ollama inference
● Start tagging "slow" or "retrieval_miss" runs for dashboards (see the sketch below)

(Screenshot: Metrics)
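
Tags and metadata ride along with each traced run when they are passed through the standard config argument at invoke time, and can then be filtered on in LangSmith. The tag names and variables below are illustrative:

```python
# `user_question` and `uploaded_name` are placeholders for values
# coming from the Streamlit UI.
result = graph.invoke(
    {"question": user_question},
    config={
        "tags": ["local-rag", "streamlit-ui"],       # filterable in LangSmith
        "metadata": {"source_file": uploaded_name},  # shows up on the run
    },
)
```

From there, "slow" and "retrieval_miss" runs can be picked out by filtering on run latency and on the retrieval_miss flag from the earlier sketch.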

📊 Scaling Visibility with LangSmith Dashboards

LangSmith doesn't just log traces; it helps you monitor trends over time.
Using its dashboard tools, I now track the following (the same numbers can also be pulled with the client, as sketched further below):
🧠 Number of LLM calls
🕒 Average latency per query
📉 Retrieval failures
💸 Token usage (if using APIs like OpenAI or Anthropic)
❌ Error rates: failed runs, exceptions, or empty prompts

(Screenshot: Dashboard)
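
The numbers behind the dashboard can also be pulled programmatically with the LangSmith client, which is handy for quick checks outside the UI. A sketch, assuming the project name used earlier; it is worth double-checking the exact list_runs filter arguments against the current SDK docs:

```python
from langsmith import Client

client = Client()  # picks up LANGCHAIN_API_KEY from the environment

# Recent errored root runs for the project.
failed = client.list_runs(
    project_name="local-rag-chatbot",
    is_root=True,
    error=True,
)
for run in failed:
    print(run.name, run.error)
```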

I've published the full working project on GitHub, complete with TF-IDF + FAISS retrieval, Ollama model integration, LangSmith observability, and a Streamlit interface.
https://lnkd.in/etpCMPiS
