Stop Stalking Your Crush, Stalk Your Agents Instead: A LangSmith Deep Dive: Part - 1

#agents #ai #monitoring #tooling

LangSmith is a monitoring and observability platform built by the creators of LangChain and LangGraph for tracing AI applications.

But before diving into LangSmith, let's first understand what observability and monitoring actually mean.

Observability is essentially keeping a close eye on your AI applications while they run — tracking exactly what input goes into each step and what output comes out of it. Take a RAG (Retrieval-Augmented Generation) application as an example. It's made up of several moving parts: vector stores, retrievers, documents, embeddings, and the LLM itself. Observability here means logging every single input and output as it flows between these components. This way, when something breaks (and something always breaks), you know exactly where to look instead of guessing.

Monitoring, although it sounds similar to observability, is actually a different concept. Monitoring is the process of tracking your system or application's metrics as a whole — things like latency across different runs, the cost of one end-to-end execution, and so on.

In short: Observability tells you what happened and why, while monitoring tells you how well things are performing overall. You need both — monitoring flags that something's wrong (say, latency spiked at 3 PM), while observability helps you drill down and find out exactly which component caused it.
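To make the distinction concrete, here is a minimal, standard-library-only sketch. It is not how LangSmith works internally; the `observed` decorator, the `RECORDS` list, and the stand-in `retrieve` and `generate` steps are all hypothetical names made up for illustration.

```python
import time
from functools import wraps
from statistics import mean

# In-memory log of step records. A real setup would send these to a tracing backend.
RECORDS = []

def observed(step_name):
    """Decorator: record the input, output, and latency of one pipeline step."""
    def wrap(fn):
        @wraps(fn)
        def inner(value):
            start = time.perf_counter()
            result = fn(value)
            RECORDS.append({
                "step": step_name,
                "input": value,
                "output": result,
                "latency_s": time.perf_counter() - start,
            })
            return result
        return inner
    return wrap

@observed("retrieve")
def retrieve(question):
    # Stand-in for a vector store lookup (hypothetical data).
    return ["Returns are accepted within 30 days."]

@observed("generate")
def generate(docs):
    # Stand-in for an LLM call.
    return f"Based on {len(docs)} document(s): {docs[0]}"

def answer(question):
    return generate(retrieve(question))

answer("What is the return policy?")

# Observability: inspect exactly what a single step received and returned.
for record in RECORDS:
    print(record["step"], "| in:", record["input"], "| out:", record["output"])

# Monitoring: aggregate a metric across all recorded steps.
print("mean latency (s):", mean(r["latency_s"] for r in RECORDS))
```

The per-step records answer "what happened in this run?", while the aggregate at the end answers "how is the system doing overall?".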

Why do LLM apps specifically need this?

Traditional software is predictable. If you call a function with the same input, you get the same output — every single time. When something breaks, you add a print statement, check the logs, find the line, fix it. Done.
LLM applications don't work like that. The output is non-deterministic. The same prompt can produce a different response on every single run. So when a user complains "the answer was wrong," you can't just reproduce it and debug it. That exact run is gone, unless you logged it.
The pipeline has multiple steps, each of which can silently fail. Take a RAG application. When your app gives a wrong answer, where did it go wrong?

Did the vector store retrieve the wrong documents?
Did the retriever rank them poorly?
Did the prompt template stuff too much context in?
Did the LLM just hallucinate despite having the right context?
Or did a recent change to the prompt cause it?

Without observability, you can only guess and play catch-up; you never get to the exact root cause of the problem. You'd have to manually test each component in isolation, which is slow and doesn't reflect what actually happened during that specific run. And if the workflow or application is complex, with a lot of components, finding the issue becomes a nightmare.

Every LLM call costs money — and in a multi-step pipeline, you might be making 5 to 10 LLM calls per user request without realizing it. Without monitoring, you have no idea which step is burning your budget. Is it the query rewriter? The summarizer? The final answer generator? You won't know until your API bill arrives.
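As a rough illustration of per-step cost attribution, the sketch below totals cost from token counts. The model names, prices, and token numbers are all made-up assumptions, not real pricing or measured usage; in practice the counts would come from your traces.

```python
# Hypothetical prices in USD per 1M tokens. Check your provider's current pricing.
PRICE_PER_1M = {
    "small-model": {"input": 0.15, "output": 0.60},
    "large-model": {"input": 2.50, "output": 10.00},
}

# Hypothetical token usage for the steps of one user request.
usage = [
    {"step": "query_rewriter", "model": "small-model", "input": 200, "output": 50},
    {"step": "summarizer", "model": "small-model", "input": 3000, "output": 400},
    {"step": "answer_generator", "model": "large-model", "input": 1500, "output": 500},
]

def step_cost(record):
    """Cost of one step: tokens times the per-token price, for input and output."""
    price = PRICE_PER_1M[record["model"]]
    total = record["input"] * price["input"] + record["output"] * price["output"]
    return total / 1_000_000

costs = {r["step"]: step_cost(r) for r in usage}
most_expensive = max(costs, key=costs.get)

# Print steps from most to least expensive.
for step, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{step}: ${cost:.6f}")
print("most expensive step:", most_expensive)
```

With numbers like these, one step dominates the bill, and that kind of imbalance is invisible until usage is broken down per step.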

This is exactly why LangSmith exists.

Let's learn about some core concepts of LangSmith -

Projects - A project in LangSmith is simply a container for one of your AI applications. Every trace and run gets logged under a project so your data stays organized and separated.

Traces - A trace represents one complete end-to-end execution of your application, from the moment a user sends an input to the moment your app returns a final response.

For example, a user asks: "What is the return policy?"
That single question triggers your entire RAG pipeline — retrieval, reranking, prompt construction, LLM call, response generation. All of that together, from start to finish, is one trace.

Runs - If a Trace is the full journey, a run is each individual step along the way.

Inside that one trace of "What is the return policy?", LangSmith breaks it down into runs:

Trace: "What is the return policy?"

Run 1: Embed the user query
Run 2: Retrieve documents
Run 3: Rerank documents
Run 4: Construct prompt
Run 5: LLM call
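One way to picture the relationship is as a tree: the trace is the root, and each run is a node beneath it. The sketch below is a simplified mental model, not LangSmith's actual schema; the `Run` class and the `walk` helper are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    """One step of execution. A trace is modeled here as the root Run."""
    name: str
    children: list = field(default_factory=list)

def walk(run, depth=0):
    """Yield (depth, name) for a run and all of its descendants, in order."""
    yield depth, run.name
    for child in run.children:
        yield from walk(child, depth + 1)

# The example trace from above, with its five runs as children of the root.
trace = Run(
    name='Trace: "What is the return policy?"',
    children=[
        Run("Embed the user query"),
        Run("Retrieve documents"),
        Run("Rerank documents"),
        Run("Construct prompt"),
        Run("LLM call"),
    ],
)

for depth, name in walk(trace):
    print("  " * depth + name)
```

Because runs can have children of their own, the same structure also covers deeper pipelines, such as an agent whose tool call triggers another LLM call.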

Setting up LangSmith

Step 1: Create a LangSmith Account
Head over to smith.langchain.com and sign up. Once you're in, navigate to Settings → API Keys and generate a new API key. Copy it somewhere safe.

Step 2: Create a Project
Once you're inside the LangSmith dashboard, create a new project. Give it a meaningful name that matches your application.

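Step 3: Install the Package
pip install langsmith

The example later in this post also imports from langchain-openai, langchain-core, and python-dotenv, so install those as well if you want to run it:
pip install langchain-openai python-dotenv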

Step 4: Set these environment variables

.env
# Turns tracing on or off; set to true to enable
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your-langsmith-api-key
# Which project to send traces to
LANGCHAIN_PROJECT=your-project-name

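That's it. These three environment variables are all LangSmith needs to start capturing traces. Recent LangSmith documentation also lists LANGSMITH_-prefixed equivalents (for example LANGSMITH_TRACING and LANGSMITH_API_KEY), so check the current docs if these names don't work for your version.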
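A missing or misspelled variable usually just means no traces show up, so a quick sanity check at startup can save some head-scratching. This is a small sketch using only the standard library; the helper names `missing_tracing_vars` and `tracing_requested` are hypothetical and not part of LangSmith.

```python
import os

# The three variables described in Step 4.
REQUIRED_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY", "LANGCHAIN_PROJECT")

def missing_tracing_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

def tracing_requested(env=None):
    """True when the tracing flag is set to 'true' (case-insensitive)."""
    env = os.environ if env is None else env
    return env.get("LANGCHAIN_TRACING_V2", "").strip().lower() == "true"

missing = missing_tracing_vars()
if missing:
    print("Tracing is not fully configured. Missing:", ", ".join(missing))
else:
    print("Tracing variables found. Tracing requested:", tracing_requested())
```

Call this after `load_dotenv()` so that values from your .env file are already in the environment.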

Let's see a simple LangChain workflow in action, tracked by LangSmith -

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Load environment variables
load_dotenv()

# Prompt to generate a detailed report
prompt1 = PromptTemplate(
    template="Generate a detailed report on {topic}",
    input_variables=["topic"],
)

# Prompt to summarize the report
prompt2 = PromptTemplate(
    template="Generate a 5 pointer summary from the following text:\n\n{text}",
    input_variables=["text"],
)

# Initialize models
model1 = ChatOpenAI(model="gpt-4o-mini")
model2 = ChatOpenAI(model="gpt-4o")

# Output parser
parser = StrOutputParser()

# Create sequential chain
chain = prompt1 | model1 | parser | prompt2 | model2 | parser

# Configuration for tracing
config = {
    "tags": ["llm_app", "report generation", "summarization"]
}

# Invoke chain
result = chain.invoke(
    {"topic": "Artificial Intelligence"},
    config=config,
)

print(result)

The screenshot above shows a real LangSmith trace for a simple Sequential LLM application.

The middle panel breaks down the six runs inside this trace: PromptTemplate formatted the input, gpt-4o-mini made the first LLM call (13.68s, 1.1K tokens), StrOutputParser cleaned the output, then the chain continued with another PromptTemplate, a second LLM call via gpt-4o (4.85s, 1.4K tokens), and a final StrOutputParser. On the right, you can see the exact input for that captured run (topic: AI Opportunity in India, which differs from the topic in the code listing) and the output the LLM returned.

This is exactly what makes LangSmith powerful. When something goes wrong, you don't guess — you just open the trace and see precisely where it broke.
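Traces are easier to find later if each invocation carries consistent labels. The chain example passed `tags` through `config`; LangChain's runnable config also accepts `run_name` and `metadata` keys. The helper below is a hypothetical convenience function, not part of LangChain or LangSmith, that builds such a dict in plain Python.

```python
def build_trace_config(tags, run_name=None, **metadata):
    """Build a config dict to pass as chain.invoke(..., config=...)."""
    # Drop duplicate tags while keeping their original order.
    config = {"tags": list(dict.fromkeys(tags))}
    if run_name:
        config["run_name"] = run_name
    if metadata:
        config["metadata"] = dict(metadata)
    return config

# Hypothetical values for illustration.
config = build_trace_config(
    ["llm_app", "report generation", "llm_app"],
    run_name="report_then_summary",
    app_version="0.1.0",
    environment="dev",
)
print(config)
```

You would then call `chain.invoke({"topic": "Artificial Intelligence"}, config=config)` exactly as in the listing earlier.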

LangSmith works out of the box with LangChain and LangGraph — no extra setup needed. However, if your pipeline includes components that aren't natively part of these frameworks, LangSmith won't trace them automatically. For those cases, you can wrap the function with the @traceable decorator and LangSmith will capture it just like any other run.

from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable  # LangSmith will trace this function
def call_llm(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}]
    )
    return response.choices[0].message.content

result = call_llm("What is RAG in AI?")
print(result)

The @traceable decorator tells LangSmith — "treat this function as a run, log its input and output." Works with any Python function, any framework.
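When traced functions call each other, the inner calls should appear as child runs nested under the outer one. Here is a minimal sketch of that pattern, assuming the `run_type` and `name` arguments of `@traceable`. The retriever and prompt builder are stand-ins with hypothetical data, and no LLM or network call is made. The `try`/`except` fallback is an addition so the sketch still runs where langsmith isn't installed; it simply turns the decorator into a no-op.

```python
try:
    from langsmith import traceable
except ImportError:
    # Fallback (assumption for this sketch): a no-op decorator that supports
    # both the bare @traceable form and the @traceable(...) form.
    def traceable(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda fn: fn

@traceable(run_type="retriever", name="retrieve_docs")
def retrieve_docs(question: str) -> list:
    # Stand-in for a vector store lookup.
    return ["Returns are accepted within 30 days of purchase."]

@traceable(name="build_prompt")
def build_prompt(question: str, docs: list) -> str:
    context = "\n".join(docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

@traceable(name="rag_pipeline")
def rag_pipeline(question: str) -> str:
    # Both calls below run inside rag_pipeline, so they nest under it in the trace.
    docs = retrieve_docs(question)
    return build_prompt(question, docs)

prompt = rag_pipeline("What is the return policy?")
print(prompt)
```

With tracing enabled, this should show up as one trace rooted at `rag_pipeline` with `retrieve_docs` and `build_prompt` beneath it, mirroring the trace and run breakdown described earlier.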

Alternatives to LangSmith

Langfuse
Langfuse is open source and self-hostable, so when you host it yourself your data never leaves your own infrastructure. It works with any LLM framework, not just LangChain, and comes with prompt versioning and evaluation built in. Best choice if data privacy is a concern.

Helicone
Helicone works as a proxy between your app and the LLM provider. You point your client at Helicone's base URL (typically along with a Helicone API key header) and it starts logging every request: tokens, cost, latency. No other code changes are needed. Best for teams who just want clean cost and usage visibility without a full observability setup.

Arize Phoenix
Phoenix can run completely locally, with no cloud and no data sharing. It goes beyond just tracing, offering embeddings visualization and dataset analysis. Best suited for ML teams doing serious evaluation or fine-tuning work alongside production monitoring.

Conclusion

Let's zoom out and look at what we covered -

We started with a simple truth — LLM applications are fundamentally different from traditional software. They're non-deterministic, multi-step, and fail silently. A wrong answer doesn't throw an error. It just quietly erodes your user's trust until they stop using your product.

Observability and monitoring are your defense against that. Observability tells you what happened and why at every step. Monitoring tells you how well your system is performing over time. You need both. LangSmith gives you both — wrapped in a clean UI that integrates natively with LangChain and LangGraph with almost zero setup effort.

The best time to add observability to your LLM app is before you need it — because by the time something breaks in production and a user is complaining, you'll wish you had the trace from that exact run.

So stop stalking your crush and start stalking your agents. Every trace tells a story, and the better you understand those stories, the easier it becomes to build reliable AI applications.

In Part 2, we'll go beyond tracing and explore how LangSmith helps you evaluate prompts, create datasets, run experiments, collect user feedback, and continuously improve your LLM applications.