<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Austin Vance</title>
    <description>The latest articles on DEV Community by Austin Vance (@austinbv).</description>
    <link>https://dev.to/austinbv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F305023%2Fc978f899-9fa5-4b1e-9ad9-e3f60313fd65.jpeg</url>
      <title>DEV Community: Austin Vance</title>
      <link>https://dev.to/austinbv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/austinbv"/>
    <language>en</language>
    <item>
      <title>Your Customer Service Bot Is Slow Because It's Single-Threaded</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:16:24 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/your-customer-service-bot-is-slow-because-its-single-threaded-1gnb</link>
      <guid>https://dev.to/focused_dot_io/your-customer-service-bot-is-slow-because-its-single-threaded-1gnb</guid>
      <description>&lt;p&gt;Consider a typical enterprise support agent. A customer asks a complex compliance question and the agent dutifully queries the knowledge base, then searches the web, then checks policy docs. Sequential. Three LLM calls back to back. &lt;em&gt;That's ~12 seconds of wall time.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Users start abandoning chat around second 8.&lt;/p&gt;

&lt;p&gt;Fan out those three research calls in parallel (same calls, same models, same prompts) and &lt;em&gt;wall time drops to ~6.5 seconds.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This post covers the parallel sub-agent pattern using LangGraph and LangSmith. I'll show the code, but more importantly, I'll show you the failure modes because the pattern is simple and the bugs are not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Latency Math
&lt;/h2&gt;

&lt;p&gt;You have an agent that needs to hit three sources: internal KB, web search, and policy documents. Each LLM call takes 2–4 seconds. Sequentially:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classify query&lt;/td&gt;
&lt;td&gt;~1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research KB&lt;/td&gt;
&lt;td&gt;~3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research Web&lt;/td&gt;
&lt;td&gt;~3.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research Policy&lt;/td&gt;
&lt;td&gt;~2.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Synthesize&lt;/td&gt;
&lt;td&gt;~2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~12s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In parallel, the three research steps overlap:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classify query&lt;/td&gt;
&lt;td&gt;~1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research (all three, parallel)&lt;/td&gt;
&lt;td&gt;~3.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Synthesize&lt;/td&gt;
&lt;td&gt;~2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~6.5s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A 45% reduction from a structural change, not a prompt improvement. Every additional sub-agent you add sequentially costs another 2–4 seconds. In parallel, it's free, until you hit the slowest branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Parallel Agents Architecture
&lt;/h2&gt;

&lt;p&gt;We're building a research assistant that fans out to three parallel sub-agents, aggregates results, and synthesizes a response:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                     ┌→ [Research: KB]     ─┐
[Classify Query] ────┼→ [Research: Web]    ─┼→ [Synthesize] → END
                     └→ [Research: Policy] ─┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;LangGraph executes parallel branches within a single superstep: all three branches run concurrently, and state updates are transactional. The fan-in edge waits for all branches before proceeding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the Send API:&lt;/strong&gt; LangGraph has a &lt;code&gt;Send&lt;/code&gt; API for dynamic map-reduce where branch count is unknown at build time. Don't reach for it here. &lt;code&gt;Send&lt;/code&gt; is designed for running the same node N times with different inputs. For a fixed set of specialist agents, static edges or conditional routing are simpler, preserve graph structure, and keep every branch visible at compile time via &lt;code&gt;graph.get_graph().draw_mermaid()&lt;/code&gt;. In practice, you'll rarely need &lt;code&gt;Send&lt;/code&gt;. Start with static fan-out, graduate to conditional, reach for &lt;code&gt;Send&lt;/code&gt; as a last resort.&lt;/p&gt;
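
&lt;p&gt;For contrast, a minimal &lt;code&gt;Send&lt;/code&gt; sketch (not what this post builds; it assumes a hypothetical &lt;code&gt;topics&lt;/code&gt; list in state). The same node runs once per topic, with the branch count decided at runtime:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.types import Send


def fan_out(state: dict):
    # One Send per discovered topic: the "research" node runs N times with
    # different inputs, where N is only known at runtime.
    return [Send("research", {"topic": topic}) for topic in state["topics"]]


builder.add_conditional_edges("classify", fan_out)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;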

&lt;h2&gt;
  
  
  State: The One Thing You'll Get Wrong
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;Annotated[list, operator.add]&lt;/code&gt; reducer tells LangGraph to &lt;strong&gt;concatenate&lt;/strong&gt; results from parallel branches instead of overwriting them. Without it, parallel branches race to write the results field. The last branch to finish wins, and you silently lose the other two. This is one of the most common bugs in parallel agent systems. The synthesizer produces suspiciously narrow responses, coverage evals fail intermittently, and you spend two days blaming the prompt before realizing you're only getting one source's data.&lt;/p&gt;
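
&lt;p&gt;A before/after sketch of the fix (the full &lt;code&gt;State&lt;/code&gt; appears in the next section):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import operator
from typing import Annotated, TypedDict


class BrokenState(TypedDict):
    # Parallel branches race on this field; the last writer wins, silently.
    research_results: list[dict]


class FixedState(TypedDict):
    # The reducer tells LangGraph to concatenate branch results instead.
    research_results: Annotated[list[dict], operator.add]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;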

&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;

&lt;p&gt;State, a sub-agent factory, and three agent instances. The &lt;code&gt;@traceable&lt;/code&gt; decorator ensures each agent appears as a distinct span in LangSmith — this will be the single most important debugging decision you make.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import operator
from typing import Annotated, TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class State(TypedDict):
    question: str
    research_results: Annotated[list[dict], operator.add]
    final_response: str


def make_agent(name: str, focus: str):
    """Factory that builds a traceable research sub-agent."""

    @traceable(name=name, run_type="chain")
    def node(state: State) -&amp;gt; dict:
        response = llm.invoke([
            SystemMessage(content=f"You are the {name} agent. Focus on {focus}. "
                                  "Return a concise summary. Cite your source type."),
            HumanMessage(content=f"Research query: {state['question']}"),
        ])
        return {"research_results": [{"source": name, "content": response.content}]}

    return node


kb_agent = make_agent("knowledge_base", "internal knowledge base searches.")
web_agent = make_agent("web_search", "recent news and industry trends.")
policy_agent = make_agent("policy", "compliance, legal, and regulatory frameworks.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The synthesizer merges sub-agent outputs into one customer-facing response. The key constraint, worth knowing before you ship, is that policy information takes precedence. Without this, the synthesizer will cheerfully soften restrictions to sound more helpful.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="Synthesizer", run_type="chain")
def synthesize(state: State) -&amp;gt; dict:
    context = "\n\n".join(
        f"[{r['source']}]: {r['content']}" for r in state["research_results"]
    )
    response = llm.invoke([
        SystemMessage(
            content="Synthesize the following research into a clear, actionable "
                    "response. When policy information conflicts with or constrains "
                    "other responses, the policy statement takes precedence. "
                    "Never soften or omit policy restrictions."
        ),
        HumanMessage(
            content=f"Customer question: {state['question']}\n\n"
                    f"Research findings:\n{context}"
        ),
    ])
    return {"final_response": response.content}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Graph Assembly
&lt;/h2&gt;

&lt;p&gt;Fifteen lines of wiring. &lt;code&gt;RetryPolicy&lt;/code&gt; on every research node means a provider 429 doesn't kill the entire pipeline; successful branches are checkpointed and won't re-execute.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.graph import StateGraph, START, END
from langgraph.types import RetryPolicy

builder = StateGraph(State)

builder.add_node("kb", kb_agent, retry=RetryPolicy(max_attempts=3))
builder.add_node("web", web_agent, retry=RetryPolicy(max_attempts=3))
builder.add_node("policy", policy_agent, retry=RetryPolicy(max_attempts=3))
builder.add_node("synthesize", synthesize)

builder.add_edge(START, "kb")
builder.add_edge(START, "web")
builder.add_edge(START, "policy")
builder.add_edge(["kb", "web", "policy"], "synthesize")
builder.add_edge("synthesize", END)

graph = builder.compile()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Conditional Routing: The Upgrade
&lt;/h2&gt;

&lt;p&gt;Sometimes hitting every source is wasteful. A simple "what's our refund policy?" doesn't need web search. Conditional fan-out lets you route based on the question using structured output, no regex parsing, no brittle string matching:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections.abc import Sequence

from pydantic import BaseModel, Field


class RoutingPlan(BaseModel):
    agents: list[str] = Field(
        description="Agents to activate: kb, web, policy"
    )

structured_llm = llm.with_structured_output(RoutingPlan)


def classify_and_route(state: State) -&amp;gt; Sequence[str]:
    plan = structured_llm.invoke([
        SystemMessage(content="Decide which research agents to invoke. "
                              "Available: kb, web, policy. When in doubt, include the agent."),
        HumanMessage(content=state["question"]),
    ])
    return plan.agents or ["kb"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The tradeoff is real. Conditional routing saves latency on simple queries, but your routing logic becomes a new failure point. And with conditional fan-out, use individual edges from each node to &lt;code&gt;synthesize&lt;/code&gt;, not the list-style fan-in, or LangGraph waits forever for branches that were never dispatched.&lt;/p&gt;
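
&lt;p&gt;A wiring sketch for the conditional version, assuming the nodes and &lt;code&gt;classify_and_route&lt;/code&gt; from above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;builder = StateGraph(State)

builder.add_node("kb", kb_agent)
builder.add_node("web", web_agent)
builder.add_node("policy", policy_agent)
builder.add_node("synthesize", synthesize)

# classify_and_route returns a list of node names; LangGraph dispatches
# to every name in that list.
builder.add_conditional_edges(START, classify_and_route, ["kb", "web", "policy"])

# Individual fan-in edges. The list form add_edge(["kb", "web", "policy"], ...)
# would wait on branches that were never dispatched for this query.
builder.add_edge("kb", "synthesize")
builder.add_edge("web", "synthesize")
builder.add_edge("policy", "synthesize")
builder.add_edge("synthesize", END)

graph_conditional = builder.compile()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;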

&lt;h2&gt;
  
  
  Production Failures in Concurrent Execution
&lt;/h2&gt;

&lt;p&gt;These are the failure modes that surface once parallel agents hit real traffic.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;State Clobbering.&lt;/strong&gt; Synthesizer references only one source. Intermittent. Cause: missing &lt;code&gt;operator.add&lt;/code&gt; reducer. Parallel branches overwrite instead of appending. There's no warning; the graph runs fine, it just loses data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesizer Contradicted the Policy Agent.&lt;/strong&gt; Say a customer asks about returning an opened product. The policy agent correctly states the 30-day &lt;em&gt;unopened-only&lt;/em&gt; return policy. The KB agent mentions "hassle-free returns." The synthesizer merges these into "You can return the product within 30 days, hassle-free," omitting the unopened requirement. LangSmith traces show the policy agent's output was correct; the synthesizer span reveals where the information was lost. Fix: the policy-takes-precedence constraint in the synthesizer prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hung Branch Blocking Fan-In.&lt;/strong&gt; Response times spike from ~6s to 30s+. The fan-in waits for ALL branches. Your p50 is fine; your p99 is determined by the slowest branch on its worst day. Fix: async timeouts per branch that return partial results (&lt;code&gt;{"source": "web_search", "content": "Timed out"}&lt;/code&gt;) rather than blocking the pipeline; see the sketch after this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestrator Under-Dispatched.&lt;/strong&gt; A significant fraction of multi-domain queries will be only partially routed. Over-dispatching (an agent returning empty results) is cheap. Under-dispatching is a customer getting an incomplete answer. Fix: explicit multi-domain examples in the routing prompt and a &lt;code&gt;"when in doubt, include the agent"&lt;/code&gt; instruction.&lt;/li&gt;
&lt;/ol&gt;
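
&lt;p&gt;The timeout sketch for failure mode 3. It wraps the synchronous sub-agents from earlier in an async node; the wrapper and the 10-second budget are illustrative assumptions, not a LangGraph built-in:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio


def with_timeout(node, name: str, seconds: float = 10.0):
    """Wrap a sync node so a hung branch yields a partial result."""

    async def wrapped(state: State) -&amp;gt; dict:
        try:
            # Run the sync node in a thread so wait_for can time out the wait.
            return await asyncio.wait_for(asyncio.to_thread(node, state), timeout=seconds)
        except asyncio.TimeoutError:
            # A partial result keeps the fan-in moving instead of blocking it.
            return {"research_results": [{"source": name, "content": "Timed out"}]}

    return wrapped


# Register the wrapped agent instead of the raw one, e.g.:
# builder.add_node("web", with_timeout(web_agent, "web_search"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;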

&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Parallel agents are hard to debug without tracing. &lt;code&gt;@traceable&lt;/code&gt; on every sub-agent gives you per-branch spans in LangSmith. Tag production traces with metadata for filtering:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import tracing_context

with tracing_context(
    metadata={"customer_tier": "enterprise", "channel": "chat"},
    tags=["production", "v2"],
):
    result = graph.invoke({"question": "How does GDPR affect our data pipeline?"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first thing to check when latency spikes: is one branch consistently slower? LangSmith makes that a 10-second investigation instead of an hour of log-grepping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evals
&lt;/h2&gt;

&lt;p&gt;Shipping without evals is negligence. Three evaluators catch the most common regressions: deterministic coverage, structural fan-out validation, and LLM-as-judge for overall quality.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import Client

ls_client = Client()

dataset = ls_client.create_dataset(
    dataset_name="research-agent-evals",
    description="Parallel research agent evaluation dataset",
)

ls_client.create_examples(
    dataset_id=dataset.id,
    inputs=[
        {"question": "What is our refund policy for enterprise clients?"},
        {"question": "How does GDPR affect our data pipeline architecture?"},
        {"question": "What competitors launched AI features last quarter?"},
    ],
    outputs=[
        {"must_mention": ["refund", "enterprise", "policy"]},
        {"must_mention": ["GDPR", "data", "compliance"]},
        {"must_mention": ["competitor", "AI", "feature"]},
    ],
)


from langsmith import evaluate
from openevals.llm import create_llm_as_judge

QUALITY_PROMPT = """\
Customer query: {inputs[question]}
AI response: {outputs[final_response]}

Rate 0.0-1.0 on completeness, accuracy, and tone.
Return ONLY: {{"score": &amp;lt;float&amp;gt;, "reasoning": "&amp;lt;explanation&amp;gt;"}}"""

quality_judge = create_llm_as_judge(
    prompt=QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="quality",
)


def coverage(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the synthesizer actually address the question?"""
    text = outputs.get("final_response", "").lower()
    must_mention = reference_outputs.get("must_mention", [])
    hits = sum(1 for t in must_mention if t.lower() in text)
    return {"key": "coverage", "score": hits / len(must_mention) if must_mention else 1.0}


def source_diversity(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Is the fan-out actually working, or did it silently degrade?"""
    results = outputs.get("research_results", [])
    sources = {r["source"] for r in results if isinstance(r, dict)}
    return {"key": "source_diversity", "score": min(len(sources) / 2.0, 1.0)}


def target(inputs: dict) -&amp;gt; dict:
    return graph.invoke({"question": inputs["question"]})


results = evaluate(
    target,
    data="research-agent-evals",
    evaluators=[quality_judge, coverage, source_diversity],
    experiment_prefix="parallel-research-v1",
    max_concurrency=4,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;source_diversity&lt;/code&gt; is the only automated check that your parallel architecture is actually parallel. Without it, state clobbering can ship to production and sit there for weeks. Run this eval on every PR that touches agent code.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use parallel sub-agents when:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queries regularly span 2+ domains in a single message&lt;/li&gt;
&lt;li&gt;You need per-domain traceability for debugging and compliance&lt;/li&gt;
&lt;li&gt;Sub-agents have different tool sets or retrieval sources&lt;/li&gt;
&lt;li&gt;You're iterating on prompts and need isolated regression testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skip it when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queries are single-domain (a FAQ bot doesn't need orchestration)&lt;/li&gt;
&lt;li&gt;Latency budget is extremely tight (routing adds one LLM call)&lt;/li&gt;
&lt;li&gt;You have fewer than 3 distinct knowledge domains&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Parallel sub-agents aren't architecturally complex: a fan-out, a fan-in, and a reducer. The code is about 15 lines of graph wiring. The production hardening is everything else.&lt;/p&gt;

&lt;p&gt;Start with static fan-out. Add conditional routing when you have data showing which sources matter for which queries. Write the &lt;code&gt;source_diversity&lt;/code&gt; eval before you write the second prompt. And put &lt;code&gt;operator.add&lt;/code&gt; on your list fields; you'll thank me later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/focused-dot-io/01-parallel-sub-agents/" rel="noopener noreferrer"&gt;Parrallel Agents Github Repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/quickstart" rel="noopener noreferrer"&gt;LangGraph Quickstart (State, Reducers, Graph Construction)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/langsmith/observability" rel="noopener noreferrer"&gt;LangSmith Observbaility &amp;amp; Tracing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/langsmith/evaluation" rel="noopener noreferrer"&gt;LangSmith Evaluation Framework&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://focused.io/lab/your-customer-service-bot-is-slow-because-its-single-threaded" rel="noopener noreferrer"&gt;https://focused.io/lab/your-customer-service-bot-is-slow-because-its-single-threaded&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Your AI Just Emailed a Customer Without Permission</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:16:21 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/your-ai-just-emailed-a-customer-without-permission-38k4</link>
      <guid>https://dev.to/focused_dot_io/your-ai-just-emailed-a-customer-without-permission-38k4</guid>
      <description>&lt;p&gt;In a customer complaint handler for a fintech company you have drafted responses, checked tone, and verified responses to match company policy. Automated from end to end. Then, the agent sends a $4,200 refund approval to a customer who'd asked about a fee schedule. The LLM hallucinates the complaint, writes up a professional apology with a specific dollar amount, and fires it off before anyone on the team even knows.&lt;/p&gt;

&lt;p&gt;Better prompts won’t help because the problem isn't what the model says; it's that nothing stops it from saying it.&lt;/p&gt;

&lt;p&gt;To fix this you need an approval gate. Somewhere in the agent’s graph where execution... stops. State gets written to disk and a human looks at the draft. Only after they say "yeah, send it" does anything go out the door. LangGraph has a built-in primitive for this called &lt;code&gt;interrupt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's walk through the full pattern here. The code is straightforward but state management can trip you up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost argument (if you need one)
&lt;/h2&gt;

&lt;p&gt;If you're already sold on why AI shouldn't email customers unsupervised, skip this, but if you need to convince your PM, here's some napkin math:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Without Gate&lt;/th&gt;
&lt;th&gt;With Gate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Messages sent/day&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error rate (wrong tone/info)&lt;/td&gt;
&lt;td&gt;~3%&lt;/td&gt;
&lt;td&gt;~0.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bad messages/day&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg cost per bad message&lt;/td&gt;
&lt;td&gt;$200&lt;/td&gt;
&lt;td&gt;$200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daily risk&lt;/td&gt;
&lt;td&gt;$3,000&lt;/td&gt;
&lt;td&gt;$100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What we’re building
&lt;/h2&gt;

&lt;p&gt;A customer complaint response pipeline. Complaint comes in, AI drafts a response, a human approves or edits, system sends the final version.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Intake] → [Draft Response] → [INTERRUPT: Human Review] → [Send Response] → END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;interrupt&lt;/code&gt; is where execution pauses. All the graph state (draft, original complaint, metadata, etc.) gets checkpointed. It could be hours or days before someone reviews it, and when they do, the graph picks up right where it stopped.&lt;/p&gt;

&lt;p&gt;Even in serverless environments &lt;code&gt;interrupt&lt;/code&gt; is resilient. The Python process can crash. Server can restart. You resume with the same &lt;code&gt;thread_id&lt;/code&gt; and LangGraph reloads everything from the checkpointer. &lt;/p&gt;

&lt;h2&gt;
  
  
  The state schema
&lt;/h2&gt;

&lt;p&gt;Whatever the reviewer needs to see has to be in state before the interrupt fires.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from typing import TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class State(TypedDict):
    complaint: str
    customer_id: str
    draft_response: str
    review_decision: str
    reviewer_notes: str
    final_response: str
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  The nodes
&lt;/h2&gt;

&lt;p&gt;Let’s build three nodes: draft, review, and send. All with &lt;code&gt;@traceable&lt;/code&gt; because six months from now when someone asks "who approved sending that email to the VP of procurement at our biggest account," you want a trace showing what the AI wrote vs. what a person changed.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="draft_response", run_type="chain")
def draft_response(state: State) -&amp;gt; dict:
    response = llm.invoke([
        SystemMessage(
            content="You are a customer service agent. Draft a professional, "
                    "empathetic response to the following complaint. Be specific "
                    "about next steps. Do NOT promise refunds or credits unless "
                    "the complaint clearly warrants one. Keep it under 150 words."
        ),
        HumanMessage(
            content=f"Customer ID: {state['customer_id']}\n\n"
                    f"Complaint: {state['complaint']}"
        ),
    ])
    return {"draft_response": response.content}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The review node is where &lt;code&gt;interrupt()&lt;/code&gt; does its work.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.types import interrupt

@traceable(name="human_review", run_type="chain")
def human_review(state: State) -&amp;gt; dict:
    decision = interrupt({
        "draft": state["draft_response"],
        "customer_id": state["customer_id"],
        "complaint": state["complaint"],
        "instructions": "Review the draft. Respond with a JSON object: "
                        '{"action": "approve" | "edit" | "reject", '
                        '"edited_response": "...", "notes": "..."}'
    })
    return {
        "review_decision": decision["action"],
        "reviewer_notes": decision.get("notes", ""),
        "final_response": decision.get("edited_response", state["draft_response"])
            if decision["action"] != "reject" else "",
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The dict you pass to &lt;code&gt;interrupt()&lt;/code&gt; is the payload. It shows up in the &lt;code&gt;__interrupt__&lt;/code&gt; field of the graph's return value, which is what your UI or Slack bot reads to build the review screen. When someone calls &lt;code&gt;Command(resume={"action": "approve"})&lt;/code&gt;, that dict becomes what &lt;code&gt;interrupt()&lt;/code&gt; returns. The function resumes from the line right after the &lt;code&gt;interrupt()&lt;/code&gt; call. It looks like a normal function call but there's a checkpoint boundary hiding inside it.&lt;/p&gt;

&lt;p&gt;Send node. Don't send if it was rejected:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="send_response", run_type="chain")
def send_response(state: State) -&amp;gt; dict:
    if state["review_decision"] == "reject":
        return {"final_response": "[REJECTED] " + state["reviewer_notes"]}
    return {"final_response": state["final_response"]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Wiring it up
&lt;/h2&gt;

&lt;p&gt;The checkpointer makes interrupts durable. Use &lt;code&gt;InMemorySaver&lt;/code&gt; for dev and &lt;code&gt;PostgresSaver&lt;/code&gt; for prod. If you forget the checkpointer, &lt;code&gt;interrupt()&lt;/code&gt; throws a &lt;code&gt;RuntimeError&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.checkpoint.memory import InMemorySaver
from langgraph.graph import StateGraph, START, END

builder = StateGraph(State)

builder.add_node("draft", draft_response)
builder.add_node("review", human_review)
builder.add_node("send", send_response)

builder.add_edge(START, "draft")
builder.add_edge("draft", "review")
builder.add_edge("review", "send")
builder.add_edge("send", END)

checkpointer = InMemorySaver()
graph = builder.compile(checkpointer=checkpointer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  The full interrupt/resume cycle
&lt;/h2&gt;

&lt;p&gt;Two &lt;code&gt;invoke&lt;/code&gt; calls. The first runs until the interrupt and stops; the second picks up where it left off.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.types import Command

config = {"configurable": {"thread_id": "complaint-1234"}}

# Phase 1: Run until the interrupt
result = graph.invoke(
    {
        "complaint": "I was charged twice for my subscription last month. "
                     "Order #A-9912. I want a refund immediately.",
        "customer_id": "cust_8837",
    },
    config=config,
)

# The graph paused. Extract the interrupt payload.
interrupt_data = result["__interrupt__"][0].value
print(f"Draft for review: {interrupt_data['draft']}")
print(f"Customer: {interrupt_data['customer_id']}")

# Phase 2: Human reviews and approves (could be minutes or days later)
final_result = graph.invoke(
    Command(resume={
        "action": "edit",
        "edited_response": "We've identified the duplicate charge on Order #A-9912. "
                           "A refund of $29.99 has been initiated and will appear "
                           "in 3-5 business days. We apologize for the inconvenience.",
        "notes": "Verified duplicate charge in billing system. Approved refund.",
    }),
    config=config,  # Same thread_id — this is how LangGraph finds the checkpoint
)

print(f"Final response: {final_result['final_response']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That &lt;code&gt;thread_id&lt;/code&gt; in the config matters more than anything else here. It's the key into the checkpointer. Without a &lt;code&gt;thread_id&lt;/code&gt; you can't resume. We treat these as primary keys and map them to something stable in your system: ticket ID, conversation ID, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding risk-based routing
&lt;/h2&gt;

&lt;p&gt;The basic version sends everything through human review. Start there, but eventually reviewers get tired of approving "thanks for contacting us, we're looking into it" all day, and you'll want to auto-approve the low-risk stuff.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pydantic import BaseModel, Field


class RiskAssessment(BaseModel):
    risk_level: str = Field(description="low, medium, or high")
    reason: str = Field(description="Why this risk level was assigned")


risk_llm = llm.with_structured_output(RiskAssessment)


@traceable(name="assess_risk", run_type="chain")
def assess_risk(state: State) -&amp;gt; dict:
    assessment = risk_llm.invoke([
        SystemMessage(
            content="Assess the risk level of this customer service response. "
                    "high = involves money, legal, account changes, or could "
                    "be interpreted as a binding commitment. "
                    "medium = emotional topic, could escalate. "
                    "low = simple acknowledgment, FAQ, status update."
        ),
        HumanMessage(
            content=f"Complaint: {state['complaint']}\n\n"
                    f"Draft response: {state['draft_response']}"
        ),
    ])
    return {"review_decision": assessment.risk_level}


def route_by_risk(state: State) -&amp;gt; str:
    if state["review_decision"] == "low":
        return "send"
    return "review"


builder_v2 = StateGraph(State)

builder_v2.add_node("draft", draft_response)
builder_v2.add_node("assess", assess_risk)
builder_v2.add_node("review", human_review)
builder_v2.add_node("send", send_response)

builder_v2.add_edge(START, "draft")
builder_v2.add_edge("draft", "assess")
builder_v2.add_conditional_edges("assess", route_by_risk, {"send": "send", "review": "review"})
builder_v2.add_edge("review", "send")
builder_v2.add_edge("send", END)

graph_v2 = builder_v2.compile(checkpointer=InMemorySaver())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Fair warning: you've now introduced a second LLM call as a gate, and that gate can be wrong in both directions. Under-classify risk and messages go out without review. Over-classify and reviewers are right back to rubber-stamping everything. Run the classifier in logging-only mode for a couple weeks first (route everything through review, but record what the classifier would have done and use long term memory to tune the classifier). Then start skipping reviews on low-risk messages after you trust the data.&lt;/p&gt;
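
&lt;p&gt;A sketch of that logging-only mode. The node records what the classifier would have decided while the router still sends everything to review; the names are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="assess_risk_shadow", run_type="chain")
def assess_risk_shadow(state: State) -&amp;gt; dict:
    assessment = risk_llm.invoke([
        SystemMessage(content="Assess the risk level of this customer service "
                              "response as low, medium, or high."),
        HumanMessage(content=f"Complaint: {state['complaint']}\n\n"
                             f"Draft response: {state['draft_response']}"),
    ])
    # Record the verdict for offline comparison; do not route on it yet.
    return {"reviewer_notes": f"[shadow] classifier: {assessment.risk_level}"}


def route_by_risk_shadow(state: State) -&amp;gt; str:
    # Everything still goes through human review during the trial period.
    return "review"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;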

&lt;h2&gt;
  
  
  The bugs
&lt;/h2&gt;

&lt;p&gt;The demo works great... but...&lt;/p&gt;

&lt;h3&gt;
  
  
  Lost thread_id
&lt;/h3&gt;

&lt;p&gt;Someone approves a draft in Slack. The integration pulls out the approval decision but constructs a &lt;em&gt;new&lt;/em&gt; thread_id instead of looking up the one stored with the interrupt payload. Now &lt;code&gt;Command(resume=...)&lt;/code&gt; creates a fresh graph where the input is an approval decision, not the complaint. &lt;/p&gt;

&lt;p&gt;This happens a lot. Store the thread_id alongside the interrupt payload when you surface it to reviewers. Put it in a database. Put it in the Slack message metadata. Do not lose it.&lt;/p&gt;
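
&lt;p&gt;A sketch of storing the resume key alongside the review request; &lt;code&gt;review_queue&lt;/code&gt; stands in for whatever database or queue you use:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def enqueue_review(result: dict, thread_id: str) -&amp;gt; None:
    payload = result["__interrupt__"][0].value
    review_queue.insert({
        "thread_id": thread_id,  # the resume key; store it, never reconstruct it
        "draft": payload["draft"],
        "customer_id": payload["customer_id"],
    })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;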

&lt;h3&gt;
  
  
  Stale state
&lt;/h3&gt;

&lt;p&gt;Reviewer opens the draft at 11:30. Goes to lunch. Comes back at 1pm and hits approve. In the meantime, the customer sent two more messages and someone on the support team already replied manually. The approved draft is now responding to a conversation that moved on.&lt;/p&gt;

&lt;p&gt;LangGraph has no idea. It resumes from the checkpoint, which is frozen in time. Fix this by putting a &lt;code&gt;created_at&lt;/code&gt; timestamp in the interrupt payload and checking it against the customer record's &lt;code&gt;last_updated_at&lt;/code&gt; on resume. If anything changed, re-draft.&lt;/p&gt;
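
&lt;p&gt;A staleness-check sketch; &lt;code&gt;get_last_updated_at&lt;/code&gt; is a hypothetical lookup into your customer record store, and &lt;code&gt;created_at&lt;/code&gt; is the timestamp you added to the payload:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.types import Command


def resume_if_fresh(graph, thread_id: str, decision: dict, payload: dict):
    config = {"configurable": {"thread_id": thread_id}}
    if get_last_updated_at(payload["customer_id"]) &amp;gt; payload["created_at"]:
        # The conversation moved on while the draft sat in the queue.
        return None  # caller should trigger a re-draft instead of resuming
    return graph.invoke(Command(resume=decision), config=config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;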

&lt;h3&gt;
  
  
  Double resume
&lt;/h3&gt;

&lt;p&gt;Shared review queue. Two reviewers see the same pending draft. Both click approve. Depending on the checkpointer implementation, the second resume is either a no-op or an error, but by then the send logic already fired on the first one. Maybe that's fine. Maybe you just sent duplicate emails.&lt;/p&gt;

&lt;p&gt;Build in idempotency to check if the thread already has a &lt;code&gt;review_decision&lt;/code&gt; before doing anything with the resume.&lt;/p&gt;
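
&lt;p&gt;An idempotency sketch that inspects the checkpoint with &lt;code&gt;get_state&lt;/code&gt; before resuming:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def resume_once(graph, thread_id: str, decision: dict):
    config = {"configurable": {"thread_id": thread_id}}
    snapshot = graph.get_state(config)
    if snapshot.values.get("review_decision"):
        # Someone already resumed this thread; drop the duplicate approval.
        return None
    return graph.invoke(Command(resume=decision), config=config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;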

&lt;h3&gt;
  
  
  Interrupt reordering
&lt;/h3&gt;

&lt;p&gt;Two &lt;code&gt;interrupt()&lt;/code&gt; calls in one node (say, one for policy review and one for tone). LangGraph matches resume values to interrupts by position, not by name. There are no names. Refactor and swap the order, the policy answer goes to the tone check and vice versa.&lt;/p&gt;

&lt;p&gt;Don't put multiple interrupts in one node; use separate nodes instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracing across the gap
&lt;/h2&gt;

&lt;p&gt;Interrupt-based workflows leave a gap in the LangSmith timeline where the human review happened. The draft trace ends, then hours later the resume trace starts, and nothing connects them unless you're deliberate about it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import tracing_context

ticket_id = "TICKET-4821"
config = {"configurable": {"thread_id": ticket_id}}

# Phase 1: Draft
with tracing_context(
    metadata={"ticket_id": ticket_id, "phase": "draft"},
    tags=["production", "complaint-handler", "phase-1"],
):
    result = graph.invoke(
        {
            "complaint": "Your app crashed and I lost 3 hours of work.",
            "customer_id": "cust_2291",
        },
        config=config,
    )

# ... time passes, human reviews ...

# Phase 2: Resume
with tracing_context(
    metadata={"ticket_id": ticket_id, "phase": "resume", "reviewer": "jane@company.com"},
    tags=["production", "complaint-handler", "phase-2"],
):
    final = graph.invoke(
        Command(resume={"action": "approve", "notes": "Looks good."}),
        config=config,
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Put the ticket ID in the metadata for both phases. Now you can filter in LangSmith and see the full lifecycle of a single complaint even though draft and resume were separate invocations. The &lt;code&gt;reviewer&lt;/code&gt; field in phase 2 is your audit trail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evals
&lt;/h2&gt;

&lt;p&gt;You need to know if drafts are any good before a human ever sees them.&lt;/p&gt;

&lt;p&gt;Dataset setup and evaluators live in &lt;code&gt;evals.py&lt;/code&gt; in the companion repo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import Client, evaluate
from openevals.llm import create_llm_as_judge

from complaint_handler import graph

ls_client = Client()

DATASET_NAME = "complaint-handler-evals"

if not ls_client.has_dataset(dataset_name=DATASET_NAME):
    dataset = ls_client.create_dataset(
        dataset_name=DATASET_NAME,
        description="Human-in-the-loop complaint handler evaluation dataset",
    )
    ls_client.create_examples(
        dataset_id=dataset.id,
        inputs=[
            {
                "complaint": "Charged twice for order #A-1234. Want a refund.",
                "customer_id": "cust_001",
            },
            {
                "complaint": "App crashes every time I open the settings page.",
                "customer_id": "cust_002",
            },
            {
                "complaint": "Your CEO's tweet was offensive. Cancelling my account.",
                "customer_id": "cust_003",
            },
        ],
        outputs=[
            {
                "must_mention": ["refund", "order", "A-1234"],
                "risk": "high",
            },
            {
                "must_mention": ["crash", "settings", "investigating"],
                "risk": "medium",
            },
            {
                "must_mention": ["feedback", "understand", "account"],
                "risk": "high",
            },
        ],
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Three evaluators. LLM judge for draft quality, keyword coverage, and a check for unauthorized promises:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DRAFT_QUALITY_PROMPT = """\
Customer complaint: {inputs}
AI draft response: {outputs}

Rate 0.0-1.0 on empathy, accuracy, and professionalism.
Deduct points if the draft promises specific remedies (refunds, credits)
without explicit authorization.
Return ONLY: {{"score": &amp;lt;float&amp;gt;, "reasoning": "&amp;lt;explanation&amp;gt;"}}"""

draft_judge = create_llm_as_judge(
    prompt=DRAFT_QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="draft_quality",
    continuous=True,
)


def coverage(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the draft actually address the complaint specifics?"""
    text = outputs.get("draft_response", "").lower()
    must_mention = reference_outputs.get("must_mention", [])
    hits = sum(1 for t in must_mention if t.lower() in text)
    return {"key": "coverage", "score": hits / len(must_mention) if must_mention else 1.0}


def no_unauthorized_promises(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the draft promise refunds or credits without authorization?"""
    text = outputs.get("draft_response", "").lower()
    dangerous_phrases = ["refund has been", "credit has been", "we will refund",
                         "we will credit", "compensation of"]
    violations = sum(1 for p in dangerous_phrases if p in text)
    return {"key": "no_unauthorized_promises", "score": 1.0 if violations == 0 else 0.0}


def target(inputs: dict) -&amp;gt; dict:
    """Run the graph until the interrupt (draft phase only)."""
    config = {"configurable": {"thread_id": f"eval-{inputs['customer_id']}"}}
    result = graph.invoke(inputs, config=config)
    return {"draft_response": result.get("draft_response", "")}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;no_unauthorized_promises&lt;/code&gt; catches the failure mode from the top of this post. If the draft says "a refund has been initiated" when nobody authorized a refund, it scores zero. Run this eval every time you change the system prompt.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if &lt;strong&gt;name&lt;/strong&gt; == "&lt;strong&gt;main&lt;/strong&gt;":&lt;br&gt;
    results = evaluate(&lt;br&gt;
        target,&lt;br&gt;
        data=DATASET_NAME,&lt;br&gt;
        evaluators=[draft_judge, coverage, no_unauthorized_promises],&lt;br&gt;
        experiment_prefix="complaint-handler-v1",&lt;br&gt;
        max_concurrency=4,&lt;br&gt;
    )&lt;br&gt;
    print("\nEvaluation complete. Check LangSmith for results.")&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  When to Use Human-in-the-Loop
&lt;/h2&gt;

&lt;p&gt;If AI is writing things that go to customers, you need a gate. Processing refunds, updating account records, anything you can't undo with a quick "sorry about that" email. Regulated industries need the gate plus an audit trail of who approved what.&lt;/p&gt;

&lt;p&gt;You don't need this for internal stuff. Summarizing meeting notes, running analysis for a dashboard, generating reports that a human reads. &lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;The two function calls: &lt;code&gt;interrupt()&lt;/code&gt; and &lt;code&gt;Command(resume=...)&lt;/code&gt;. Pause execution, persist state, resume later.&lt;/p&gt;

&lt;p&gt;Most of the work is everything around those two calls. Thread IDs getting lost, the world changing during the review gap, two reviewers approving the same draft, traces that need to connect across a timeline gap of hours or days.&lt;/p&gt;

&lt;p&gt;Start by routing every response through review. Reviewers will complain. Good. Measure which categories they rubber-stamp, run your evals, and only then start auto-approving the boring stuff.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Technical References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/focused-dot-io/02-human-in-the-loop/tree/9e328bdd3770541a764134efa7f87d53de2dad6b" rel="noopener noreferrer"&gt;Human in the Loop Github Repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/interrupts" rel="noopener noreferrer"&gt;Interrupts (Human-in-the-loop / pause &amp;amp; resume)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/persistence" rel="noopener noreferrer"&gt;Persistence (Thread IDs &amp;amp; Checkpointers)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/overview" rel="noopener noreferrer"&gt;LangGraph Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/langsmith/evaluation-quickstart?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;LangSmith Eval Quickstarter&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://focused.io/lab/your-ai-just-emailed-a-customer-without-permission" rel="noopener noreferrer"&gt;https://focused.io/lab/your-ai-just-emailed-a-customer-without-permission&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Streaming Agent State with LangGraph</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:15:26 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/streaming-agent-state-with-langgraph-10kg</link>
      <guid>https://dev.to/focused_dot_io/streaming-agent-state-with-langgraph-10kg</guid>
      <description>&lt;p&gt;Your research agent takes 9 seconds to answer a question. It fans out to three sources, synthesizes results, returns a polished answer. The user sees a blank screen for all nine of those seconds. By second 5 they've refreshed the page, doubled your API costs, and still seen nothing.&lt;/p&gt;

&lt;p&gt;Streaming fixes this. Show the user what the agent is doing while it's doing it: "Searching knowledge base...", "Found 3 results...", "Synthesizing..." and then stream the final answer token by token. Same 9 seconds, but the user sees progress from millisecond 200.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Perception Math
&lt;/h2&gt;

&lt;p&gt;Identical work, different user experience:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Wall time&lt;/th&gt;
&lt;th&gt;Time to first byte&lt;/th&gt;
&lt;th&gt;Perceived wait&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;invoke()&lt;/code&gt; (no streaming)&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;Broken&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stream(stream_mode="updates")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;td&gt;Working&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stream(stream_mode=["updates", "custom", "messages"])&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;td&gt;Can see what it’s doing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;A multi-step research agent that streams three types of events to the UI: node-level progress updates, custom status messages from inside nodes, and token-by-token LLM output for the final synthesis.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                          ┌─ stream: "Searching KB..."
[Intake] → [Research KB]  ┤
                          └─ stream: {results: 3}
                                    ↓
                          ┌─ stream: "Analyzing results..."
         → [Synthesize]  ┤
                          └─ stream: tokens... t-o-k-e-n-b-y-t-o-k-e-n
                                    ↓
                                     → END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Three stream modes run simultaneously: &lt;code&gt;updates&lt;/code&gt; for graph state changes, &lt;code&gt;custom&lt;/code&gt; for application-specific progress events, and &lt;code&gt;messages&lt;/code&gt; for LLM token streaming.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Modes
&lt;/h2&gt;

&lt;p&gt;LangGraph exposes five stream modes. You'll use three in practice:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;What it streams&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;values&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full state after each superstep&lt;/td&gt;
&lt;td&gt;Debugging, state inspection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;updates&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;State delta from each node&lt;/td&gt;
&lt;td&gt;Production UIs — lightweight, shows which node ran&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;messages&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LLM tokens + metadata&lt;/td&gt;
&lt;td&gt;Chat UIs — token-by-token output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;custom&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Arbitrary data from &lt;code&gt;get_stream_writer()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Progress bars, status messages, structured events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;debug&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Everything — internal execution details&lt;/td&gt;
&lt;td&gt;Development only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In production, use &lt;code&gt;["updates", "custom", "messages"]&lt;/code&gt;. &lt;code&gt;values&lt;/code&gt; sends the entire state on every step. &lt;code&gt;debug&lt;/code&gt; is for development.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;

&lt;p&gt;State and two nodes: a research step that emits custom progress events, and a synthesizer that streams its LLM response token by token.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from typing import TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
from langgraph.config import get_stream_writer
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class State(TypedDict):
    question: str
    research: str
    answer: str
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The research node uses &lt;code&gt;get_stream_writer()&lt;/code&gt; to push status updates to the client. These show up in the &lt;code&gt;custom&lt;/code&gt; stream mode:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="research", run_type="chain")
def research(state: State) -&amp;gt; dict:
    writer = get_stream_writer()

    writer({"step": "research", "status": "starting", "message": "Searching knowledge base..."})

    response = llm.invoke([
        SystemMessage(
            content="You are a research assistant. Search for relevant information "
                    "about the user's question. Return a concise summary of findings."
        ),
        HumanMessage(content=state["question"]),
    ])

    writer({"step": "research", "status": "complete", "message": "Research complete."})

    return {"research": response.content}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The synthesizer uses the LLM normally. LangGraph automatically streams its tokens when &lt;code&gt;messages&lt;/code&gt; mode is active:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="synthesize", run_type="chain")
def synthesize(state: State) -&amp;gt; dict:
    writer = get_stream_writer()
    writer({"step": "synthesize", "status": "starting", "message": "Synthesizing answer..."})

    response = llm.invoke([
        SystemMessage(
            content="Synthesize the research into a clear, actionable answer. "
                    "Be concise but thorough."
        ),
        HumanMessage(
            content=f"Question: {state['question']}\n\nResearch:\n{state['research']}"
        ),
    ])

    writer({"step": "synthesize", "status": "complete", "message": "Done."})
    return {"answer": response.content}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Graph Assembly
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.graph import StateGraph, START, END

builder = StateGraph(State)

builder.add_node("research", research)
builder.add_node("synthesize", synthesize)

builder.add_edge(START, "research")
builder.add_edge("research", "synthesize")
builder.add_edge("synthesize", END)

graph = builder.compile()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Multi-mode Streaming
&lt;/h2&gt;

&lt;p&gt;A single &lt;code&gt;.stream()&lt;/code&gt; call can emit node updates, custom progress events, and LLM tokens simultaneously:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for mode, chunk in graph.stream(
    {"question": "What are the key differences between REST and GraphQL for mobile APIs?"},
    stream_mode=["updates", "custom", "messages"],
):
    if mode == "updates":
        # Node completed — chunk is the state delta
        node_name = list(chunk.keys())[0]
        print(f"[node] {node_name} completed")

    elif mode == "custom":
        # Custom progress event from get_stream_writer()
        print(f"[status] {chunk.get('message', chunk)}")

    elif mode == "messages":
        # LLM token — chunk is a tuple of (message_chunk, metadata)
        message_chunk, metadata = chunk
        if hasattr(message_chunk, "content") and message_chunk.content:
            print(message_chunk.content, end="", flush=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that the output shape changes with multi-mode. Single mode (&lt;code&gt;stream_mode="updates"&lt;/code&gt;) yields chunks directly. Multi-mode (&lt;code&gt;stream_mode=["updates", "custom"]&lt;/code&gt;) yields &lt;code&gt;(mode, chunk)&lt;/code&gt; tuples. Code that works with single mode breaks with multi-mode because the unpacking is different.&lt;/p&gt;
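
&lt;p&gt;The two shapes side by side (&lt;code&gt;inputs&lt;/code&gt; and &lt;code&gt;handle_update&lt;/code&gt; are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Single mode: each item is the chunk itself.
for chunk in graph.stream(inputs, stream_mode="updates"):
    handle_update(chunk)

# Multi mode: each item is a (mode, chunk) tuple, even with one mode in the list.
for mode, chunk in graph.stream(inputs, stream_mode=["updates"]):
    handle_update(chunk)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;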

&lt;h2&gt;
  
  
  Async streaming
&lt;/h2&gt;

&lt;p&gt;For production APIs, use &lt;code&gt;astream&lt;/code&gt; with &lt;code&gt;async for&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio

from langsmith import traceable


@traceable(name="stream_research", run_type="chain")
async def stream_research(question: str):
    chunks = []
    async for mode, chunk in graph.astream(
        {"question": question},
        stream_mode=["updates", "custom", "messages"],
    ):
        if mode == "messages":
            message_chunk, metadata = chunk
            if hasattr(message_chunk, "content") and message_chunk.content:
                chunks.append(message_chunk.content)
                yield {"type": "token", "content": message_chunk.content}
        elif mode == "custom":
            yield {"type": "status", "content": chunk}
        elif mode == "updates":
            yield {"type": "node_update", "content": chunk}


async def main():
    async for event in stream_research("How do vector databases work?"):
        if event["type"] == "token":
            print(event["content"], end="", flush=True)
        else:
            print(f"\n[{event['type']}] {event['content']}")

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  FastAPI + SSE
&lt;/h2&gt;

&lt;p&gt;The standard production pattern is a FastAPI endpoint that converts graph streams to SSE. SSE is one-directional (server to client), works over HTTP/1.1, and auto-reconnects:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langsmith import traceable

app = FastAPI()


@traceable(name="sse_research_stream", run_type="chain")
async def generate_sse(question: str):
    async for mode, chunk in graph.astream(
        {"question": question},
        stream_mode=["updates", "custom", "messages"],
    ):
        if mode == "messages":
            message_chunk, metadata = chunk
            if hasattr(message_chunk, "content") and message_chunk.content:
                data = json.dumps({"type": "token", "content": message_chunk.content})
                yield f"data: {data}\n\n"
        elif mode == "custom":
            data = json.dumps({"type": "status", "content": chunk})
            yield f"data: {data}\n\n"
        elif mode == "updates":
            node_name = list(chunk.keys())[0] if chunk else "unknown"
            data = json.dumps({"type": "node_complete", "node": node_name})
            yield f"data: {data}\n\n"

    yield "data: [DONE]\n\n"


@app.post("/research/stream")
async def stream_endpoint(payload: dict):
    return StreamingResponse(
        generate_sse(payload["question"]),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",
        },
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Set &lt;code&gt;X-Accel-Buffering: no&lt;/code&gt; in the response headers and &lt;code&gt;proxy_buffering off&lt;/code&gt; in your nginx config. Without these, nginx buffers the entire response before sending it to the client and your streaming pipeline becomes a regular HTTP response.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bugs
&lt;/h2&gt;

&lt;p&gt;These break under load.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reverse proxy buffering
&lt;/h3&gt;

&lt;p&gt;You deploy behind nginx or a cloud load balancer. SSE events arrive at the client in one big batch after the stream completes. Cause: proxy buffering is on by default. Set the &lt;code&gt;X-Accel-Buffering&lt;/code&gt; header, disable &lt;code&gt;proxy_buffering&lt;/code&gt; in nginx, and check your cloud provider's load balancer settings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Message chunk ordering
&lt;/h3&gt;

&lt;p&gt;With &lt;code&gt;messages&lt;/code&gt; mode, you receive &lt;code&gt;AIMessageChunk&lt;/code&gt; objects. The &lt;code&gt;content&lt;/code&gt; field is usually a string, except when the model returns tool calls where it's a list of content blocks. Concatenating &lt;code&gt;.content&lt;/code&gt; naively produces garbled output. Check &lt;code&gt;isinstance(message_chunk.content, str)&lt;/code&gt; before concatenating and handle tool-call chunks separately.&lt;/p&gt;
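
&lt;p&gt;A defensive extraction sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def extract_text(message_chunk) -&amp;gt; str:
    content = message_chunk.content
    if isinstance(content, str):
        return content
    # Otherwise content is a list of blocks; keep the text pieces and
    # skip tool-call blocks.
    return "".join(
        block.get("text", "")
        for block in content
        if isinstance(block, dict) and block.get("type") == "text"
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;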

&lt;h3&gt;
  
  
  Backpressure on slow clients
&lt;/h3&gt;

&lt;p&gt;Your agent streams tokens faster than the client can consume them (mobile on 3G, overloaded browser tab). The server-side buffer grows until memory pressure kills the process. Use bounded async queues or configure your ASGI server's per-connection send buffer limits.&lt;/p&gt;
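
&lt;p&gt;A bounded-queue sketch; &lt;code&gt;maxsize&lt;/code&gt; provides the backpressure, because the producer awaits instead of buffering without limit when the client falls behind:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio

queue: asyncio.Queue = asyncio.Queue(maxsize=100)


async def producer(question: str) -&amp;gt; None:
    # Reuses the stream_research generator from the async streaming section.
    async for event in stream_research(question):
        await queue.put(event)  # blocks when the queue is full
    await queue.put(None)  # sentinel: stream finished
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;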

&lt;h3&gt;
  
  
  Mixed single/multi mode unpacking
&lt;/h3&gt;

&lt;p&gt;Developer switches from &lt;code&gt;stream_mode="updates"&lt;/code&gt; to &lt;code&gt;stream_mode=["updates", "custom"]&lt;/code&gt; and doesn't update the unpacking code. The &lt;code&gt;for chunk in graph.stream(...)&lt;/code&gt; now yields &lt;code&gt;(mode, chunk)&lt;/code&gt; tuples, but the code tries to use the tuple as a dict. No error, just wrong data flowing through. Always use multi-mode from the start, even if you only need one mode today.&lt;/p&gt;
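
&lt;p&gt;Side by side, assuming the same graph and inputs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Single mode: each item is the chunk itself (a dict of node updates)
for chunk in graph.stream(inputs, stream_mode="updates"):
    node_name = list(chunk.keys())[0]

# Multi mode: each item is a (mode, chunk) tuple, so unpack it
for mode, chunk in graph.stream(inputs, stream_mode=["updates", "custom"]):
    if mode == "updates":
        node_name = list(chunk.keys())[0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;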

&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Stream-based workflows produce many small events. Tag your traces so you can measure stream performance in &lt;a href="https://www.langchain.com/langsmith/observability" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import tracing_context

with tracing_context(
    metadata={
        "stream_mode": "multi",
        "client_type": "web",
        "session_id": "sess_12345",
    },
    tags=["production", "streaming", "v1"],
):
    for mode, chunk in graph.stream(
        {"question": "Explain vector similarity search"},
        stream_mode=["updates", "custom", "messages"],
    ):
        pass  # process chunks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The LangSmith trace shows per-node timings. Use this to find nodes that are slow to emit their first token (high time-to-first-byte) vs. nodes that produce tokens slowly (low throughput).&lt;/p&gt;
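
&lt;p&gt;You can sanity-check those numbers client-side with a stopwatch around the stream. A rough sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

start = time.monotonic()
first_token_at = None
chunk_count = 0

for mode, chunk in graph.stream(
    {"question": "Explain vector similarity search"},
    stream_mode=["messages"],
):
    message_chunk, metadata = chunk
    if getattr(message_chunk, "content", None):
        if first_token_at is None:
            first_token_at = time.monotonic() - start  # time-to-first-token
        chunk_count += 1

elapsed = time.monotonic() - start
if first_token_at is not None:
    print(f"TTFT: {first_token_at:.2f}s, throughput: {chunk_count / elapsed:.1f} chunks/s")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;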

&lt;h2&gt;
  
  
  Evals
&lt;/h2&gt;

&lt;p&gt;Streaming doesn't change what the agent produces, it changes how the output is delivered. Evals verify that streamed output matches what &lt;code&gt;invoke()&lt;/code&gt; would return, and that custom events are emitted correctly.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import Client

ls_client = Client()

dataset = ls_client.create_dataset(
    dataset_name="streaming-agent-evals",
    description="Streaming research agent evaluation dataset",
)

ls_client.create_examples(
    dataset_id=dataset.id,
    inputs=[
        {"question": "What are the tradeoffs between REST and GraphQL?"},
        {"question": "How do vector databases enable semantic search?"},
        {"question": "What is retrieval-augmented generation?"},
    ],
    outputs=[
        {"must_mention": ["REST", "GraphQL", "tradeoff"]},
        {"must_mention": ["vector", "embedding", "similarity"]},
        {"must_mention": ["retrieval", "generation", "context"]},
    ],
)


from langsmith import evaluate
from openevals.llm import create_llm_as_judge

QUALITY_PROMPT = """\
User question: {inputs[question]}
Agent response: {outputs[answer]}

Rate 0.0-1.0 on completeness, accuracy, and clarity.
Return ONLY: {{"score": &amp;lt;float&amp;gt;, "reasoning": "&amp;lt;explanation&amp;gt;"}}"""

quality_judge = create_llm_as_judge(
    prompt=QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="quality",
)


def coverage(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the response address the key topics?"""
    text = outputs.get("answer", "").lower()
    must_mention = reference_outputs.get("must_mention", [])
    hits = sum(1 for t in must_mention if t.lower() in text)
    return {"key": "coverage", "score": hits / len(must_mention) if must_mention else 1.0}


def stream_completeness(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Does streaming produce the same output as invoke?"""
    streamed = outputs.get("answer", "")
    invoked_result = graph.invoke({"question": inputs["question"]})
    invoked = invoked_result.get("answer", "")
    # Exact match is too strict — LLM outputs vary. Check key content overlap.
    streamed_words = set(streamed.lower().split())
    invoked_words = set(invoked.lower().split())
    if not invoked_words:
        return {"key": "stream_completeness", "score": 1.0}
    overlap = len(streamed_words &amp;amp; invoked_words) / len(invoked_words)
    return {"key": "stream_completeness", "score": min(overlap, 1.0)}


def custom_events_emitted(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Were custom status events emitted during streaming?"""
    events = outputs.get("custom_events", [])
    expected_steps = {"research", "synthesize"}
    seen_steps = {e.get("step") for e in events if isinstance(e, dict)}
    coverage_score = len(seen_steps &amp;amp; expected_steps) / len(expected_steps)
    return {"key": "custom_events", "score": coverage_score}


def target(inputs: dict) -&amp;gt; dict:
    custom_events = []
    answer_chunks = []
    for mode, chunk in graph.stream(
        {"question": inputs["question"]},
        stream_mode=["updates", "custom", "messages"],
    ):
        if mode == "custom":
            custom_events.append(chunk)
        elif mode == "messages":
            message_chunk, metadata = chunk
            if hasattr(message_chunk, "content") and message_chunk.content:
                answer_chunks.append(message_chunk.content)
        elif mode == "updates":
            if "synthesize" in chunk:
                pass  # answer is captured via message chunks

    return {
        "answer": "".join(answer_chunks) if answer_chunks else "",
        "custom_events": custom_events,
    }


results = evaluate(
    target,
    data="streaming-agent-evals",
    evaluators=[quality_judge, coverage, stream_completeness, custom_events_emitted],
    experiment_prefix="streaming-agent-v1",
    max_concurrency=4,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;stream_completeness&lt;/code&gt; verifies that the streaming path produces equivalent output to &lt;code&gt;invoke()&lt;/code&gt;. This catches bugs where stream chunking drops content, like an SSE serializer silently truncating chunks that exceed a size limit.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Stream
&lt;/h2&gt;

&lt;p&gt;Use streaming for any user-facing agent interaction over 2 seconds, multi-step agents where progress indicators reduce perceived latency, and chat interfaces where token-by-token display is expected.&lt;/p&gt;

&lt;p&gt;Skip it for background jobs with no user waiting, when latency is already under a second, and when the output is structured data rather than natural language.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Three modes in production: &lt;code&gt;updates&lt;/code&gt; for node transitions, &lt;code&gt;custom&lt;/code&gt; for progress events via &lt;code&gt;get_stream_writer()&lt;/code&gt;, and &lt;code&gt;messages&lt;/code&gt; for token streaming. Combine them with &lt;code&gt;stream_mode=["updates", "custom", "messages"]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Deploy behind FastAPI + SSE with &lt;code&gt;X-Accel-Buffering: no&lt;/code&gt;. Watch for reverse proxy buffering, backpressure on slow clients, and the single-to-multi mode unpacking change.  &lt;/p&gt;

&lt;p&gt;Technical References:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/focused-dot-io/03-streaming-agents" rel="noopener noreferrer"&gt;Streaming Agent State GitHub repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/streaming" rel="noopener noreferrer"&gt;LangGraph Streaming (Python)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langchain/streaming/overview" rel="noopener noreferrer"&gt;LangChain Streaming Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/langsmith/add-metadata-tags" rel="noopener noreferrer"&gt;LangSmith Tracing Metadata &amp;amp; Tags&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://focused.io/lab/streaming-agent-state-with-langgraph" rel="noopener noreferrer"&gt;https://focused.io/lab/streaming-agent-state-with-langgraph&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Driving Value with LangSmith Insights</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:15:24 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/driving-value-with-langsmith-insights-5bp</link>
      <guid>https://dev.to/focused_dot_io/driving-value-with-langsmith-insights-5bp</guid>
      <description>&lt;p&gt;Imagine you have a deployed agentic system in production. Everything is going well, users are interacting with the product, and there are no critical issues going on. But what comes next? How can we monitor our system to understand what needs to be improved, fixed or built next? &lt;/p&gt;

&lt;p&gt;The first requirement is to have great observability. LangSmith is a great tool for this.&lt;/p&gt;

&lt;p&gt;We can use it to monitor all of our production runs, detect errors and understand how the model behaves across different interactions.&lt;/p&gt;

&lt;p&gt;In October 2025, LangChain released a new feature: &lt;a href="https://www.blog.langchain.com/insights-agent-multiturn-evals-langsmith/" rel="noopener noreferrer"&gt;&lt;strong&gt;Insights Agent&lt;/strong&gt;&lt;/a&gt;. This feature allows an agent to analyze your LangSmith traces and surface usage patterns, common behaviors, and recurring error modes automatically. Instead of manually digging through logs, you can let an agent do the analysis for you. If you want to read more about it, here's a &lt;a href="https://docs.langchain.com/langsmith/insights" rel="noopener noreferrer"&gt;link to the docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to run the Insights Agent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We are going to go through a simple demo of how to use this exciting new tool with a simple chatbot graph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Plus or Enterprise LangSmith plan&lt;/li&gt;
&lt;li&gt;A tracing project with a good amount of traces to analyze&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first thing we need to do is go to our LangSmith project. Once there, we are going to see multiple tabs on the top of the screen. Click on the one that says “Insights”.&lt;/p&gt;

&lt;p&gt;If this is our first time running Insights, we are going to see an empty page and a “Create Insight” button. We can go ahead and click it.&lt;/p&gt;

&lt;p&gt;Now, we are presented with two alternatives for how to run the Insights Agent: auto or manual. For the sake of simplicity, let’s start with the “auto” mode.  &lt;/p&gt;

&lt;p&gt;We need to answer the following questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;“What does the agent in this tracing project do?”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;“What would you like to learn about this agent?”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;“How are traces in this tracing project structured? Are there specific input/output keys to pay attention to?”&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This information will be used in our agent prompt, and will help tailor the output to our needs.&lt;/p&gt;

&lt;p&gt;We can also choose if we want to use OpenAI or Anthropic as our provider. As a note, you will need an API key for either provider.&lt;/p&gt;

&lt;p&gt;After we click on “Run Job”, we are going to see a message saying the agent has started running in the background and that we will have our results in a few minutes. If we navigate to the Insights tab we are going to see the agent run in progress as well as the results that start to come out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to understand and use the results&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For this example, we are going to be using a chatbot that answers questions about restaurants and helps with making reservations.&lt;/p&gt;

&lt;p&gt;The first part of the output is a summary of the findings. This is going to be the answer to the question we were asked earlier about what we wanted to learn about this agent. In this case, we wanted to understand what customers were asking the chatbot in order to identify user patterns.&lt;/p&gt;

&lt;p&gt;We can see that in this example, 57% of the questions being asked to our chatbot are about feature discovery, 29% are about operating hours, and only 14% are about making reservations.&lt;/p&gt;

&lt;p&gt;This kind of result is interesting because it helps us understand what customers actually need. Maybe we initially assumed that most questions would be about making reservations, but this data doesn’t support that. &lt;strong&gt;LangSmith Insights is critical because it grounds our product decisions in real user behavior, helping us invest engineering effort where it delivers the most value.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If we click on the “Hide Findings” button, we can do a deep dive into the traces, broken down by category.&lt;/p&gt;

&lt;p&gt;If we click on any of the categories we can see all runs within that category and navigate to the trace we are interested in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using evaluation + Insights to get the highest impact on value&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once we are comfortable with the categories of our generated insights, we can build evaluation datasets that mirror those categories. This way, we can understand how well our agent is answering questions across categories.&lt;/p&gt;
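
&lt;p&gt;A sketch of what that looks like with the LangSmith client, where the category names and examples are stand-ins for whatever Insights surfaced in your project:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import Client

client = Client()

# One dataset per Insights category, so eval scores map back to real usage
for category in ["feature_discovery", "operating_hours", "reservations"]:
    dataset = client.create_dataset(
        dataset_name=f"chatbot-evals-{category}",
        description=f"Examples mirroring the '{category}' Insights category",
    )
    # Populate from representative production traces (inputs/outputs elided)
    client.create_example(
        dataset_id=dataset.id,
        inputs={"question": "..."},
        outputs={"expected": "..."},
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;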

&lt;p&gt;Why do insights change this process? Imagine we run our evaluations and we discover the agent is only answering 40% of questions around reservations correctly. But insights reveal that reservation questions are actually the least common user queries. That context lowers the overall criticality of the issue and helps us prioritize fixes more intelligently.&lt;/p&gt;

&lt;p&gt;Insights add context to the analysis, but they don’t override business requirements. This is only an example: depending on the use case, a low-frequency category like reservations may still demand zero errors if the business impact is high.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We have gone through a simple example to illustrate the power of this tool. But as we’ve seen, we can ask the agent virtually any question we want. For example, we could ask, &lt;em&gt;“What types of questions is my agent hallucinating on or answering incorrectly?”&lt;/em&gt; and the agent will find all traces that match those criteria. This is extremely flexible and powerful.&lt;/p&gt;

&lt;p&gt;LangSmith is still king when it comes to building and observing production-grade AI applications, and features like this are the reason why. I encourage you to try it out and continue to create amazing applications with it!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://focused.io/lab/driving-value-with-langsmith-insights" rel="noopener noreferrer"&gt;https://focused.io/lab/driving-value-with-langsmith-insights&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Most Teams Don't Have a Data Flywheel</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 22 Apr 2026 18:56:27 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/most-teams-dont-have-a-data-flywheel-33o4</link>
      <guid>https://dev.to/focused_dot_io/most-teams-dont-have-a-data-flywheel-33o4</guid>
      <description>&lt;p&gt;&lt;em&gt;LangChain shows how the loop works. Here's why it stalls in production and what it actually takes to make it compound.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Austin Vance, CEO of &lt;a href="https://focused.io" rel="noopener noreferrer"&gt;Focused&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;LangChain has been pushing a clear idea: production data should make your agents better.&lt;/p&gt;

&lt;p&gt;The loop looks like this: production traces capture real behavior, those traces become datasets, evaluators score performance, feedback improves those evaluators, and improvements get deployed back into the system. Over time, the system compounds.&lt;/p&gt;

&lt;p&gt;That is the data flywheel.&lt;/p&gt;

&lt;p&gt;And it is directionally right.&lt;/p&gt;

&lt;p&gt;But most teams building agents today are not seeing that compounding effect. The loop exists on paper. In practice, it stalls.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Data Flywheel Actually Is
&lt;/h2&gt;

&lt;p&gt;In the LangChain ecosystem, especially with LangSmith, the flywheel connects three things: observability, evaluation, and iteration.&lt;/p&gt;

&lt;p&gt;Production traces become the source of truth. Failures are turned into datasets. Datasets become regression tests. Evaluators score performance at scale. Feedback improves those evaluators over time.&lt;/p&gt;

&lt;p&gt;The goal is simple: every production interaction should become an improvement signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where It Breaks
&lt;/h2&gt;

&lt;p&gt;The issue is not the idea. The issue is that most teams never fully implement the system required to make it work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Traces are collected, but nothing happens.&lt;/strong&gt; Teams instrument their agents. They capture inputs, outputs, tool calls, and intermediate steps. And then it stops there. The missing step is turning traces into something actionable — structured datasets, labeled failures, repeatable test cases. Without that, you are not building a flywheel. You are just logging behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. There is no real evaluation layer.&lt;/strong&gt; This is where most teams stall. They review outputs manually. They rely on intuition. They make changes based on what "looks better." There is no automated evaluation, no regression testing, no baseline performance. So when something changes, there is no way to know if it improved or regressed. If you cannot measure it, the loop does not spin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Evaluators are not trusted.&lt;/strong&gt; Even when teams introduce evaluation, it often breaks down. LLM-as-a-judge systems can scale evaluation, but only if they are clearly defined, calibrated against human feedback, and continuously refined. Without that, evaluator output becomes noisy. And noisy signals lead to random changes. If you do not trust your evaluation layer, you cannot rely on your flywheel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The loop never actually closes.&lt;/strong&gt; Even when failures are identified, prompts get tweaked ad hoc, changes are not versioned, and fixes are not tested against past failures. So nothing compounds. A real loop looks like this: a failure is captured, the failure becomes a dataset, the dataset is evaluated, a change is applied, and the change is tested against that dataset. If you skip any step, the loop breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. There is no real production pressure.&lt;/strong&gt; This is the quiet failure that kills most flywheels. If your agent is not embedded in a real system, you do not get meaningful traffic, you do not see real edge cases, and you do not generate useful data. Internal demos do not create real signals. Without real usage, the flywheel has nothing to work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Real Data Flywheel Looks Like
&lt;/h2&gt;

&lt;p&gt;At a system level, this is not a concept. It is a pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instrumentation.&lt;/strong&gt; Every step of the agent is observable — inputs, decisions, state transitions, outputs. Using structured systems like LangGraph makes this consistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset creation.&lt;/strong&gt; Production traces are turned into labeled examples, categorized failures, and reusable datasets. This is where the loop actually begins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation.&lt;/strong&gt; You define what "good" looks like and measure it — correctness, tool selection, completion quality. Evaluations run continuously, not just during development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calibration.&lt;/strong&gt; Evaluators improve over time. Human feedback corrects them, agreement is measured, alignment increases. This step is critical and often skipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iteration and deployment.&lt;/strong&gt; Changes are applied intentionally — to prompts, graph structure, and tool logic. Then tested against historical failures before being deployed. Only validated improvements ship.&lt;/p&gt;
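
&lt;p&gt;In LangSmith terms, the core of that pipeline can be sketched in a few calls. The project and dataset names are placeholders, and &lt;code&gt;improved_agent&lt;/code&gt; and &lt;code&gt;correctness_evaluator&lt;/code&gt; stand in for your own target function and calibrated judge:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import Client, evaluate

client = Client()

# 1. Capture: pull failed production runs
failures = client.list_runs(project_name="support-agent-prod", error=True)

# 2. Dataset: turn each failure into a regression example
for run in failures:
    client.create_example(
        dataset_name="support-agent-regressions",
        inputs=run.inputs,
        outputs=run.outputs or {},
    )

# 3-5. Evaluate the changed agent against past failures before shipping
results = evaluate(
    improved_agent,
    data="support-agent-regressions",
    evaluators=[correctness_evaluator],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;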

&lt;h2&gt;
  
  
  The Shift Most Teams Need to Make
&lt;/h2&gt;

&lt;p&gt;The data flywheel is often described like a product feature. That is the problem.&lt;/p&gt;

&lt;p&gt;It is not something you turn on. It is an engineering system that connects observability, evaluation, feedback, and deployment into a continuous loop. Without that system, you do not have a flywheel. You have logs and intuition.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Most teams do not have a data flywheel. They have a growing pile of traces and a sense that things might be improving.&lt;/p&gt;

&lt;p&gt;The teams that actually get better over time treat this differently. They build the system that makes improvement inevitable.&lt;/p&gt;

&lt;p&gt;If your agent only records what happened, it will stall. If your system learns from what happened, it compounds.&lt;/p&gt;

&lt;p&gt;That is the difference.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>langchain</category>
      <category>programming</category>
    </item>
    <item>
      <title>LangGraph Error Handling Patterns for Production AI Agents</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Tue, 21 Apr 2026 18:53:58 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/langgraph-error-handling-patterns-for-production-ai-agents-33p7</link>
      <guid>https://dev.to/focused_dot_io/langgraph-error-handling-patterns-for-production-ai-agents-33p7</guid>
      <description>&lt;p&gt;You have a document processing pipeline. It ingests contracts, extracts key clauses, validates them against policy, and generates a summary. Monday morning it processes 200 documents without a hiccup. Tuesday at 2 AM, Anthropic’s API returns a 429, the extraction node throws, and &lt;strong&gt;the entire pipeline stops.&lt;/strong&gt; Not just the one document — the whole batch. Your on-call engineer spends 45 minutes figuring out it was a transient rate limit that would have resolved itself with a 2-second backoff.&lt;/p&gt;

&lt;p&gt;The fix isn’t “add a try/except.” The fix is classifying errors by who can fix them and routing each class to the right handler. LangGraph gives you the primitives — &lt;code&gt;RetryPolicy&lt;/code&gt;, &lt;code&gt;Command&lt;/code&gt;, &lt;code&gt;interrupt()&lt;/code&gt;, and &lt;code&gt;ToolNode&lt;/code&gt; error handling — but the framework won’t decide your error strategy for you. That’s on you.&lt;/p&gt;

&lt;p&gt;This post shows the four error classes, the LangGraph primitives for each, and the production failures that surface when you get the classification wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Production Error Handling Classification Matrix
&lt;/h2&gt;

&lt;p&gt;Not all errors are equal. The single most important decision in your error-handling strategy is: &lt;strong&gt;who fixes this?&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Error Class&lt;/th&gt;
&lt;th&gt;Who Fixes It&lt;/th&gt;
&lt;th&gt;LangGraph Primitive&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transient&lt;/td&gt;
&lt;td&gt;System (automatic)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;RetryPolicy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;API 429, network timeout, DNS blip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM-Recoverable&lt;/td&gt;
&lt;td&gt;The LLM&lt;/td&gt;
&lt;td&gt;Error in state + loop back&lt;/td&gt;
&lt;td&gt;Tool returned bad JSON, wrong tool chosen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User-Fixable&lt;/td&gt;
&lt;td&gt;The human&lt;/td&gt;
&lt;td&gt;&lt;code&gt;interrupt()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Missing required field, ambiguous input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unexpected&lt;/td&gt;
&lt;td&gt;The developer&lt;/td&gt;
&lt;td&gt;Let it bubble up&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;TypeError&lt;/code&gt;, schema mismatch, logic bug&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Getting this wrong costs you. Retrying a user-fixable error wastes 3 attempts and 6 seconds before failing anyway. Interrupting for a transient error pages a human to click “retry” on something that would have fixed itself. Swallowing an unexpected error hides a real bug behind a generic fallback.&lt;/p&gt;
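
&lt;p&gt;To make the matrix concrete, here’s a sketch of the routing decision as code. &lt;code&gt;MissingFieldError&lt;/code&gt; is a hypothetical application exception, and the mapping is illustrative, not exhaustive:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from httpx import HTTPStatusError

class MissingFieldError(Exception):
    """Hypothetical: raised when a document lacks a required field."""

def classify_error(error: Exception) -&amp;gt; str:
    """Decide who can fix this error, per the matrix above."""
    if isinstance(error, HTTPStatusError) and error.response.status_code in (429, 502, 503):
        return "transient"        # system fixes it: RetryPolicy
    if isinstance(error, ValueError):
        return "llm_recoverable"  # LLM fixes it: error into state, loop back
    if isinstance(error, MissingFieldError):
        return "user_fixable"     # human fixes it: interrupt()
    return "unexpected"           # developer fixes it: let it bubble up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;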

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;We’re building a document processing pipeline that extracts clauses from contracts, validates them, and generates summaries. Each node has a different error profile:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import operator
from typing import Annotated, TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import AnyMessage
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class PipelineState(TypedDict):
    document: str
    messages: Annotated[list[AnyMessage], operator.add]
    extracted_clauses: list[dict]
    validation_errors: list[str]
    retry_count: int
    final_summary: str
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The extraction node calls tools and hits APIs — it gets transient errors and tool failures. The validation node needs complete data — it surfaces user-fixable gaps. The summarizer is the least error-prone but still needs retry protection.&lt;/p&gt;
&lt;h2&gt;
  
  
  State: Track Errors Explicitly
&lt;/h2&gt;

&lt;p&gt;The key insight: &lt;strong&gt;errors are data, not just exceptions.&lt;/strong&gt; Store them in state so the LLM can see what went wrong and adjust its approach.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.types import RetryPolicy

# Aggressive retry for flaky external APIs
api_retry = RetryPolicy(
    max_attempts=5,
    initial_interval=1.0,
    backoff_factor=2.0,
    max_interval=10.0,
    jitter=True,
)

# Conservative retry for LLM calls (they're expensive)
llm_retry = RetryPolicy(
    max_attempts=3,
    initial_interval=0.5,
    backoff_factor=2.0,
    max_interval=5.0,
    jitter=True,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Pattern 1: Transient Errors with RetryPolicy
&lt;/h2&gt;

&lt;p&gt;API rate limits, network blips, DNS hiccups. These fix themselves. Don’t write code for them — configure them.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from httpx import HTTPStatusError

def should_retry(error: Exception) -&amp;gt; bool:
    if isinstance(error, HTTPStatusError):
        return error.response.status_code in (429, 502, 503)
    return False

selective_retry = RetryPolicy(
    max_attempts=5,
    initial_interval=1.0,
    backoff_factor=2.0,
    retry_on=should_retry,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;RetryPolicy&lt;/code&gt; parameters worth knowing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_attempts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Total attempts including the first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;initial_interval&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;Seconds before first retry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;backoff_factor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2.0&lt;/td&gt;
&lt;td&gt;Multiplier per retry (exponential backoff)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_interval&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;128.0&lt;/td&gt;
&lt;td&gt;Cap on wait time between retries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;jitter&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;True&lt;/td&gt;
&lt;td&gt;Randomize wait to avoid thundering herd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;retry_on&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;(default exceptions)&lt;/td&gt;
&lt;td&gt;Exception types or callable to filter&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;retry_on&lt;/code&gt; parameter is where most people get it wrong. The default retries on common network/transient exceptions. If you need to retry on a custom exception type:&lt;/p&gt;
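
&lt;p&gt;A minimal sketch of that; the exception class here is hypothetical, standing in for whatever your own client raises. Note that &lt;code&gt;retry_on&lt;/code&gt; also accepts an exception type (or a sequence of types) directly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class UpstreamRateLimitError(Exception):
    """Hypothetical exception raised by an internal API client."""

# Retry only when our own client signals a transient upstream failure
custom_retry = RetryPolicy(
    max_attempts=4,
    initial_interval=1.0,
    backoff_factor=2.0,
    retry_on=UpstreamRateLimitError,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;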

&lt;p&gt;The tools themselves raise &lt;code&gt;ValueError&lt;/code&gt; on bad arguments, which is an error class for the LLM to correct rather than the retry layer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.prebuilt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToolNode&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_clause&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract a specific clause from contract text.

    Args:
        text: The contract text to search.
        clause_type: One of &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;termination&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;liability&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indemnification&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;valid_types&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;termination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;liability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;indemnification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;clause_type&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;valid_types&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid clause_type &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Must be one of: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;valid_types&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clause_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extracted &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; clause from document.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_compliance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clause&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;regulation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check if a clause complies with a specific regulation.

    Args:
        clause: The clause text to check.
        regulation: The regulation identifier (e.g., &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GDPR-Art17&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SOX-302&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;).
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;clause&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Empty clause text provided. Extract the clause first.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compliant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;regulation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;regulation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No issues found.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;extract_clause&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;check_compliance&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# handle_tool_errors=True: catch exceptions, return error as ToolMessage
&lt;/span&gt;&lt;span class="n"&gt;tool_node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_tool_errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;&lt;strong&gt;The superstep transaction rule:&lt;/strong&gt; LangGraph executes parallel branches in supersteps. If any branch in a superstep raises an exception, &lt;strong&gt;none of the state updates from that superstep apply.&lt;/strong&gt; Successful branches are checkpointed and won’t re-execute on retry, but the state snapshot rolls back to before the superstep started. This means a flaky API in one branch can block state updates from an unrelated branch that succeeded. &lt;code&gt;RetryPolicy&lt;/code&gt; per node keeps one bad branch from poisoning the whole superstep.&lt;/p&gt;
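
&lt;p&gt;Policies attach per node at graph construction time. A sketch, with &lt;code&gt;extract_node&lt;/code&gt; standing in for the pipeline’s extraction step (depending on your langgraph version the keyword is &lt;code&gt;retry&lt;/code&gt; or &lt;code&gt;retry_policy&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.graph import StateGraph, START

builder = StateGraph(PipelineState)
builder.add_node("extract", extract_node, retry=api_retry)      # flaky external APIs
builder.add_node("summarize", summarize_node, retry=llm_retry)  # expensive LLM calls
builder.add_edge(START, "extract")
builder.add_edge("extract", "summarize")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;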
&lt;h2&gt;
  
  
  Pattern 2: LLM-Recoverable Errors with ToolNode
&lt;/h2&gt;

&lt;p&gt;Tool calls fail. The LLM picks the wrong tool, passes bad arguments, or the tool returns something unparseable. The fix isn’t retrying the exact same call — it’s letting the LLM see what went wrong and try a different approach.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ToolNode&lt;/code&gt; from &lt;code&gt;langgraph.prebuilt&lt;/code&gt; has a &lt;code&gt;handle_tool_errors&lt;/code&gt; parameter that catches tool exceptions and returns the error message as a &lt;code&gt;ToolMessage&lt;/code&gt;. The LLM sees the error and can adjust:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_tool_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool failed with: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review your arguments and try again. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Check the tool&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s docstring for valid parameter values.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tool_node_custom&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_tool_errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;format_tool_error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also pass a custom error handler for more control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SystemMessage&lt;/span&gt;

&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agent_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a contract analysis agent. Use the provided tools to &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract and validate clauses. If a tool returns an error, read &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the error message carefully and adjust your arguments. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Available clause types: termination, liability, indemnification, payment.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent node calls the LLM, which may invoke tools. When a tool fails, &lt;code&gt;handle_tool_errors=True&lt;/code&gt; catches the exception and sends the error back to the LLM as a &lt;code&gt;ToolMessage&lt;/code&gt;. The LLM sees the error and tries again — usually with corrected arguments. The validation node, by contrast, escalates errors the LLM can’t fix to a human:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;interrupt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Command&lt;/span&gt;

&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate_document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;clauses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;clauses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No clauses extracted from document.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;required_types&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;termination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;found_types&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clause_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;clauses&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
    &lt;span class="n"&gt;missing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;required_types&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;found_types&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing required clause types: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;low_confidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;clauses&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;low_confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;types&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clause_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;low_confidence&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Low confidence extractions for: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;human_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;interrupt&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_errors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Document validation failed. Please review and provide corrections.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;human_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;corrected_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clauses&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_errors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_errors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pattern 3: User-Fixable Errors with interrupt()
&lt;/h2&gt;

&lt;p&gt;Some errors can’t be fixed by the system or the LLM. The document is missing a signature date. The clause references an undefined term. The input is ambiguous. These need a human.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;interrupt()&lt;/code&gt; pauses graph execution, saves state to the checkpointer, and returns a payload to the caller. When the human provides input, you resume with &lt;code&gt;Command(resume=...)&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.types import Command

checkpointer = InMemorySaver()

# First invocation — runs until interrupt
config = {"configurable": {"thread_id": "contract-review-42"}}
result = graph.invoke(
    {"document": "...", "messages": [], "extracted_clauses": [], "validation_errors": [], "retry_count": 0, "final_summary": ""},
    config,
)

# Check for interrupt
if "__interrupt__" in result:
    print("Human input needed:", result["__interrupt__"])

# Resume with corrections
corrected = Command(resume={
    "corrected_clauses": [
        {"clause_type": "termination", "text": "Either party may terminate...", "confidence": 0.95},
        {"clause_type": "payment", "text": "Payment due within 30 days...", "confidence": 0.98},
    ]
})
final_result = graph.invoke(corrected, config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Critical detail:&lt;/strong&gt; &lt;code&gt;interrupt()&lt;/code&gt; requires a checkpointer. Without one, the state is lost and you can’t resume. Use &lt;code&gt;InMemorySaver&lt;/code&gt; for development and a durable checkpointer (Postgres, SQLite) for production. Forgetting the checkpointer is a silent failure — the graph runs fine until you actually need to resume, and then it has no idea where it left off.&lt;/p&gt;
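&lt;p&gt;A minimal sketch of the production setup, assuming the &lt;code&gt;langgraph-checkpoint-postgres&lt;/code&gt; package and a reachable Postgres instance (the connection string is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langgraph.checkpoint.postgres import PostgresSaver

DB_URI = "postgresql://user:pass@localhost:5432/checkpoints"  # hypothetical

# builder, initial_state, and config are the same objects used elsewhere in
# this pipeline; only the checkpointer changes between dev and prod.
with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # create the checkpoint tables on first run
    graph = builder.compile(checkpointer=checkpointer)
    result = graph.invoke(initial_state, config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;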

&lt;h2&gt;
  
  
  Pattern 4: Building Fault-Tolerant Agents — Let Unexpected Errors Bubble
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;TypeError&lt;/code&gt;, &lt;code&gt;KeyError&lt;/code&gt;, schema mismatches, logic bugs. Don’t catch these. Don’t retry them. Don’t interrupt for them. &lt;strong&gt;Let them crash.&lt;/strong&gt; A retry just wastes time on an error that will never self-resolve. A human interrupt pages someone to look at a bug that should be in your issue tracker.&lt;/p&gt;

&lt;p&gt;The only thing to do with unexpected errors is make them observable. Wrap your graph invocation and log context:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langsmith import tracing_context

@traceable(name="process_document", run_type="chain")
def process_document(document: str, thread_id: str) -&amp;gt; dict:
    config = {"configurable": {"thread_id": thread_id}}
    with tracing_context(
        metadata={"document_length": len(document), "thread_id": thread_id},
        tags=["production", "document-pipeline"],
    ):
        return graph.invoke(
            {
                "document": document,
                "messages": [HumanMessage(content=f"Process this contract:\n\n{document}")],
                "extracted_clauses": [],
                "validation_errors": [],
                "retry_count": 0,
                "final_summary": "",
            },
            config,
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When it crashes, you get a full trace in LangSmith with the document content, the exact node that failed, and every intermediate state. That’s a 5-minute investigation, not a 2-hour log-grepping session.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Summarizer
&lt;/h2&gt;

&lt;p&gt;The final node synthesizes everything. It’s simple but gets &lt;code&gt;RetryPolicy&lt;/code&gt; protection because it calls the LLM:&lt;br&gt;
&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;clauses_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clause_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the following contract clauses into a concise executive summary. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Flag any compliance concerns.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Contract clauses:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;clauses_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Graph Assembly
&lt;/h2&gt;

&lt;p&gt;Here’s where the error classification meets the graph structure. Each node gets the retry strategy that matches its error profile:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.checkpoint.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemorySaver&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_continue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;last_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;last_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;post_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tool_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extracted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clause_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Agent node: LLM retry (expensive, conservative)
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm_retry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Tool node: API retry (cheap, aggressive) + error handling for tool failures
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;api_retry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Post-tool processing
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;post_tool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Validation: no retry (errors here are user-fixable, not transient)
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validate_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Summarizer: LLM retry
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summarize_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm_retry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;should_continue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;checkpointer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemorySaver&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice: the validation node has &lt;strong&gt;no retry policy.&lt;/strong&gt; Retrying a missing-clause error 3 times won’t make the clause appear. That’s a user-fixable problem that needs &lt;code&gt;interrupt()&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Failures
&lt;/h2&gt;

&lt;p&gt;These are the error-handling mistakes that make it past code review and into production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Retrying User-Fixable Errors.&lt;/strong&gt; The pipeline retries an extraction 3 times, burning 8 seconds and 3 LLM calls, before finally failing with the same “missing payment clause” error. The document genuinely doesn’t have a payment clause. No amount of retrying will create one. Fix: classify the error before choosing the handler. If the document is missing required content, &lt;code&gt;interrupt()&lt;/code&gt; immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Swallowing Unexpected Errors.&lt;/strong&gt; A developer wraps the entire graph invocation in &lt;code&gt;try/except Exception: return {"error": "Something went wrong."}&lt;/code&gt;. Now every &lt;code&gt;TypeError&lt;/code&gt;, every &lt;code&gt;KeyError&lt;/code&gt;, every schema mismatch disappears into a generic error message. The LangSmith trace shows the node completed “successfully” — because from the graph’s perspective, it did. It returned a value. The bug lives in production for weeks until someone notices the output quality degraded. Fix: only catch the specific exception types you know how to handle. Let everything else crash loudly.&lt;/p&gt;
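&lt;p&gt;A sketch of the fix, assuming your tools sit on &lt;code&gt;httpx&lt;/code&gt;: name the failures you can actually handle, and let the rest propagate.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
import httpx  # assumption: the HTTP client underneath your tool calls

try:
    result = graph.invoke(initial_state, config)  # initial_state/config as above
except httpx.TimeoutException:
    # Transient and understood: surface a retryable error to the caller
    result = {"final_summary": "", "validation_errors": ["upstream timeout"]}
# No bare `except Exception`: a TypeError here should crash with a stack trace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;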

&lt;p&gt;&lt;strong&gt;3. Superstep Transaction Surprise.&lt;/strong&gt; You have two parallel branches: clause extraction and metadata extraction. The metadata branch succeeds, but the clause branch hits a rate limit and throws. You expect the metadata to be saved — it succeeded, after all. But superstep transactions mean &lt;strong&gt;neither update applies.&lt;/strong&gt; The entire superstep rolls back. Your metadata extraction re-runs on retry (if you have &lt;code&gt;RetryPolicy&lt;/code&gt;) or is lost entirely (if you don’t). Fix: put &lt;code&gt;RetryPolicy&lt;/code&gt; on every node that can fail transiently. LangGraph checkpoints successful nodes within a superstep so they don’t re-execute, but the state update is still atomic.&lt;/p&gt;
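&lt;p&gt;A sketch of the fan-out, with hypothetical branch nodes; the point is that &lt;em&gt;both&lt;/em&gt; parallel nodes carry a policy:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langgraph.types import RetryPolicy

transient_retry = RetryPolicy(max_attempts=3)

def extract_clauses_node(state: PipelineState) -&amp;gt; dict: ...   # hypothetical branch
def extract_metadata_node(state: PipelineState) -&amp;gt; dict: ...  # hypothetical branch

# Both branches of the same superstep get a policy, so a rate limit in one
# branch doesn't throw away the other branch's successful work on retry.
builder.add_node("extract_clauses", extract_clauses_node, retry=transient_retry)
builder.add_node("extract_metadata", extract_metadata_node, retry=transient_retry)
builder.add_edge(START, "extract_clauses")
builder.add_edge(START, "extract_metadata")  # fan-out: one superstep, two nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;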

&lt;p&gt;&lt;strong&gt;4. Interrupt Without Checkpointer.&lt;/strong&gt; You add &lt;code&gt;interrupt()&lt;/code&gt; to your validation node and test it locally. Works great. Deploy to production without a persistent checkpointer (or with &lt;code&gt;InMemorySaver&lt;/code&gt; behind a load balancer). The interrupt pauses the graph, the user provides corrections, and... the graph starts from scratch because the in-memory state was on a different server instance. Fix: use a durable checkpointer (&lt;code&gt;PostgresSaver&lt;/code&gt;, &lt;code&gt;SqliteSaver&lt;/code&gt;) in production. &lt;code&gt;InMemorySaver&lt;/code&gt; is for tests only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Error Recovery Loop Explosion.&lt;/strong&gt; The LLM fails to call a tool correctly, the error goes back to the LLM, the LLM tries again with slightly different wrong arguments, the error goes back again. After 15 loops and $2 in API costs, you hit the recursion limit. Fix: add a &lt;code&gt;retry_count&lt;/code&gt; to state. After 3 LLM-recovery attempts, escalate to &lt;code&gt;interrupt()&lt;/code&gt; or fail with a clear error message.&lt;/p&gt;
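&lt;p&gt;A sketch of the escalation guard, reusing the &lt;code&gt;retry_count&lt;/code&gt; field already in &lt;code&gt;PipelineState&lt;/code&gt; (the threshold is an assumption; tune it per pipeline):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langgraph.types import interrupt

MAX_LLM_RECOVERIES = 3  # assumption: tune per pipeline

def recover_or_escalate(state: PipelineState) -&amp;gt; dict:
    if state["retry_count"] &amp;gt;= MAX_LLM_RECOVERIES:
        # Stop burning LLM calls; hand the accumulated errors to a human
        human_input = interrupt({
            "message": "Automatic recovery failed repeatedly.",
            "errors": state["validation_errors"],
        })
        return {
            "extracted_clauses": human_input.get("corrected_clauses", []),
            "validation_errors": [],
        }
    # Otherwise loop the error back to the LLM one more time
    return {"retry_count": state["retry_count"] + 1}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;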
&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Error handling without observability is guesswork. Here’s how to make every error path visible in LangSmith:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langsmith import tracing_context

with tracing_context(
    metadata={
        "pipeline_version": "v3",
        "error_strategy": "classified",
        "document_type": "contract",
    },
    tags=["production", "error-handling-v3"],
):
    result = process_document(
        document="Sample contract text...",
        thread_id="contract-42",
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;@traceable&lt;/code&gt; decorator on every node means every failure shows up in LangSmith with the failing node, its inputs, and every intermediate state. Filter by the &lt;code&gt;error-handling-v3&lt;/code&gt; tag to compare error rates across pipeline versions. If v3 has fewer interrupts but more retries, your error classification improved — transient errors are being handled automatically instead of paging humans.&lt;/p&gt;
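&lt;p&gt;You can also pull error rates straight from the LangSmith API. A sketch, assuming &lt;code&gt;LANGSMITH_API_KEY&lt;/code&gt; is set and a hypothetical project name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langsmith import Client

client = Client()
runs = client.list_runs(
    project_name="document-pipeline",  # hypothetical project name
    filter='has(tags, "error-handling-v3")',
)

total = errored = 0
for run in runs:
    total += 1
    errored += bool(run.error)
print(f"error rate: {errored}/{total}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;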

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openevals.llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_llm_as_judge&lt;/span&gt;

&lt;span class="n"&gt;ls_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error-handling-evals&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Document processing pipeline error handling evaluation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_examples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This contract between Party A and Party B includes: Termination: Either party may terminate with 30 days notice. Payment: Net 30 terms apply.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This agreement covers liability limitations and indemnification clauses only.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval-3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;should_succeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;termination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;should_succeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;missing_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;termination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;should_succeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;empty_document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;Then the evaluators: one deterministic check for error classification accuracy, one for recovery efficiency, and one LLM-as-judge for output quality under failure conditions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
QUALITY_PROMPT = """\
Document: {inputs[document]}
Pipeline output: {outputs[final_summary]}

Rate 0.0-1.0 on:

- Completeness: Did the summary cover all extracted clauses?
- Accuracy: Are the clause descriptions faithful to the source?
- Error handling: If the document was incomplete, did the pipeline flag it appropriately?

Return ONLY: {{"score": &amp;lt;float&amp;gt;, "reasoning": "&amp;lt;one sentence&amp;gt;"}}"""

quality_judge = create_llm_as_judge(
    prompt=QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="quality",
)

def error_classification(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the pipeline correctly classify errors?"""
    should_succeed = reference_outputs.get("should_succeed", True)
    has_summary = bool(outputs.get("final_summary"))
    has_errors = bool(outputs.get("validation_errors"))

    if should_succeed:
        score = 1.0 if has_summary and not has_errors else 0.0
    else:
        score = 1.0 if has_errors or not has_summary else 0.0
    return {"key": "error_classification", "score": score}

def recovery_efficiency(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """How many retries did it take? Lower is better."""
    retry_count = outputs.get("retry_count", 0)
    if retry_count == 0:
        score = 1.0
    elif retry_count &amp;lt;= 2:
        score = 0.7
    else:
        score = 0.3
    return {"key": "recovery_efficiency", "score": score}

def target(inputs: dict) -&amp;gt; dict:
    config = {"configurable": {"thread_id": inputs["thread_id"]}}
    try:
        return graph.invoke(
            {
                "document": inputs["document"],
                "messages": [HumanMessage(content=f"Process this contract:\n\n{inputs['document']}")],
                "extracted_clauses": [],
                "validation_errors": [],
                "retry_count": 0,
                "final_summary": "",
            },
            config,
        )
    except Exception as e:
        return {"final_summary": "", "validation_errors": [str(e)], "retry_count": 0}

results = evaluate(
    target,
    data="error-handling-evals",
    evaluators=[quality_judge, error_classification, recovery_efficiency],
    experiment_prefix="error-handling-v1",
    max_concurrency=2,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The control flow these evals exercise:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Ingest] → [Extract Clauses] → [Validate] → [Summarize] → END
              ↑        |
              |        ↓
              ← (tool error: retry with context)
                       |
                  (missing info: interrupt for human)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;recovery_efficiency&lt;/code&gt; catches the error-loop explosion problem. If your average retry count creeps above 2, your error classification is wrong — you’re retrying things that should interrupt or bubble up.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use classified error handling when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your pipeline calls external APIs that can return transient errors&lt;/li&gt;
&lt;li&gt;Documents have variable quality and may be missing required fields&lt;/li&gt;
&lt;li&gt;You need human-in-the-loop for ambiguous or incomplete inputs&lt;/li&gt;
&lt;li&gt;You're running batch processing where one failure shouldn't kill the batch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skip it when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your pipeline is a single LLM call with no tools&lt;/li&gt;
&lt;li&gt;Every error is the same type (all transient, all user-fixable)&lt;/li&gt;
&lt;li&gt;You're prototyping and don't need production resilience yet&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;The error classification matrix is the whole strategy: transient errors get &lt;code&gt;RetryPolicy&lt;/code&gt;, LLM-recoverable errors get stored in state and looped back, user-fixable errors get &lt;code&gt;interrupt()&lt;/code&gt;, and unexpected errors crash loudly. Four patterns, four primitives, zero catch-all try/excepts.&lt;/p&gt;

&lt;p&gt;The mistake everyone makes is treating errors as a single category. You either retry everything (wasting time and money) or catch everything (hiding bugs). The classification forces you to ask “who fixes this?” for every failure mode, and that question is worth more than any amount of retry logic.&lt;/p&gt;
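&lt;p&gt;Put in code, the question looks something like this; the two custom exception types are hypothetical stand-ins for whatever your tools raise:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
class ToolArgumentError(Exception): ...   # hypothetical: LLM sent malformed tool args
class MissingClauseError(Exception): ...  # hypothetical: document lacks a required clause

def choose_handler(exc: Exception) -&amp;gt; str:
    """Who fixes this? The answer picks the primitive."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return "retry"        # transient: RetryPolicy fixes it
    if isinstance(exc, ToolArgumentError):
        return "loop_to_llm"  # store in state, let the model self-correct
    if isinstance(exc, MissingClauseError):
        return "interrupt"    # only a human can supply the missing clause
    return "crash"            # bug: let it bubble with a full trace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;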

&lt;p&gt;Put &lt;code&gt;RetryPolicy&lt;/code&gt; on every node that touches a network. Put &lt;code&gt;handle_tool_errors=True&lt;/code&gt; on your &lt;code&gt;ToolNode&lt;/code&gt;. Put &lt;code&gt;interrupt()&lt;/code&gt; on validation failures. Let everything else crash. Ship the &lt;code&gt;recovery_efficiency&lt;/code&gt; eval before you ship the pipeline.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical References:  &lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/focused-dot-io/08-error-handling-production/tree/c7fd49401a5c9adf03cd0f90ac08117820013f1e#article" rel="noopener noreferrer"&gt;LangGraph Agent Error Handling in Production GitHub Repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://concepts" rel="noopener noreferrer"&gt;LangGraph Retry Policy (Handling Retries)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://langchain-ai.github.io/langgraph/how-tos/tool-calling/" rel="noopener noreferrer"&gt;LangGraph Tool Calling and ToolNode&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://langchain-ai.github.io/langgraph/how-tos/human-in-the-loop/" rel="noopener noreferrer"&gt;LangGraph Human-in-the-Loop (interrupt)&lt;/a&gt;&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Evaluation Pipelines for LangGraph Agents</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 16 Apr 2026 00:43:37 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/evaluation-pipelines-for-langgraph-agents-2aoi</link>
      <guid>https://dev.to/focused_dot_io/evaluation-pipelines-for-langgraph-agents-2aoi</guid>
      <description>&lt;p&gt;You changed a system prompt. It looks better on the three examples you tried. You ship it. Tuesday morning, support tickets spike. The agent is now hallucinating policy details on a class of queries you didn’t test. You revert, but 400 users already got bad answers.&lt;/p&gt;

&lt;p&gt;This is not a testing problem. You have unit tests.&lt;/p&gt;

&lt;p&gt;This is an evaluation problem.&lt;/p&gt;

&lt;p&gt;Traditional tests check “does the code run.” Evals check “is the output good.” For LLM applications, you need a clear verdict: pass or fail. Not a 0.73. Not “mostly correct.” The agent either got the answer right or it didn’t.&lt;/p&gt;

&lt;p&gt;Binary evaluators give you that clarity. More importantly, they give your CI pipeline a gate that actually means something.&lt;/p&gt;
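&lt;p&gt;The shape of the idea, before the real evaluators below (field names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
def answer_is_correct(outputs: dict, reference_outputs: dict) -&amp;gt; bool:
    # Pass or fail, no partial credit: the reference answer is either there or it isn't
    return reference_outputs["answer"].lower() in outputs["response"].lower()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;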

&lt;p&gt;The cost of not having evals is not “we might ship a bad prompt.” It is that you have no idea if any prompt is good.&lt;/p&gt;

&lt;p&gt;LangSmith gives you the pieces: datasets with versioned examples, custom evaluators (deterministic and LLM-as-judge), trajectory evaluation for agent behavior, experiment comparison across runs, and production trace monitoring.&lt;/p&gt;

&lt;p&gt;This post builds the whole pipeline, from dataset to CI regression detection.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Eval Tax
&lt;/h3&gt;

&lt;p&gt;Every team resists evals because they seem expensive. Here's the actual math:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Activity&lt;/th&gt;
&lt;th&gt;Time Cost&lt;/th&gt;
&lt;th&gt;Without Evals&lt;/th&gt;
&lt;th&gt;With Evals&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt change&lt;/td&gt;
&lt;td&gt;~30 min&lt;/td&gt;
&lt;td&gt;Ship and pray&lt;/td&gt;
&lt;td&gt;Run eval suite, check pass rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regression discovery&lt;/td&gt;
&lt;td&gt;Hours–days&lt;/td&gt;
&lt;td&gt;User reports, support tickets&lt;/td&gt;
&lt;td&gt;Caught in CI, before merge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Root cause analysis&lt;/td&gt;
&lt;td&gt;1–4 hours&lt;/td&gt;
&lt;td&gt;Manual trace inspection&lt;/td&gt;
&lt;td&gt;Failed evals pinpoint exactly which capability regressed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback decision&lt;/td&gt;
&lt;td&gt;Stressful&lt;/td&gt;
&lt;td&gt;"Is this really worse?"&lt;/td&gt;
&lt;td&gt;Pass rate dropped from 95% to 71%, clear signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total cost per change&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Unpredictable&lt;/td&gt;
&lt;td&gt;~15 min eval run&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The eval suite described below costs ~$0.50 per run (LLM-as-judge calls) and takes 2–3 minutes. The alternative is discovering regressions from users. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;We're building an evaluation pipeline for a Q&amp;amp;A agent that answers questions from a knowledge base — simple enough to evaluate clearly, complex enough to have real failure modes. The pipeline covers offline evals (before deploy), online monitoring (after deploy), and regression detection (across deploys). First, the agent itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TypedDict&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatAnthropic&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnyMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SystemMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.prebuilt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools_condition&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RetryPolicy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceable&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;AnyMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatAnthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5-20250929&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Search the internal knowledge base for relevant information.

    Args:
        query: The search query to find relevant documents.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;knowledge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refund policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Full refund within 30 days of purchase for unopened items. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Opened items eligible for exchange only within 14 days. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Digital products are non-refundable after download.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shipping&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Standard shipping: 5-7 business days, free over $50. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Express shipping: 2-3 business days, $12.99. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;International shipping: 10-15 business days, $24.99.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warranty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All electronics carry a 1-year manufacturer warranty. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extended warranty available for $49.99 (adds 2 years). &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Warranty does not cover accidental damage.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hours&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customer support available Monday-Friday 9am-6pm EST. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chat support available 24/7. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Phone support: 1-800-555-0123.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;query_lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;knowledge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query_lower&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query_lower&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Knowledge Base [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No relevant results found. Try rephrasing your query.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;llm_with_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;You are a customer support agent. Answer questions using the knowledge base tool.
Be concise and accurate. If the knowledge base doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have the answer, say so —
do not make up information. Always cite the source when using knowledge base results.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa_agent_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_with_tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="n"&gt;tool_node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_tool_errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RetryPolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools_condition&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;qa_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
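&lt;p&gt;Before building evals around this graph, it's worth a quick manual smoke test. A minimal sketch (illustrative only; it assumes your model credentials and tools are already configured):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langchain_core.messages import HumanMessage

# Illustrative smoke test of the compiled graph -- not part of the eval harness.
smoke = qa_agent.invoke(
    {"messages": [HumanMessage(content="What is your refund policy?")]}
)
print(smoke["messages"][-1].content)  # the agent's final answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;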



&lt;h2&gt;
  
  
  The Agent Under Test
&lt;/h2&gt;

&lt;p&gt;A Q&amp;amp;A agent that answers questions using a knowledge base. Simple enough to evaluate clearly, complex enough to have real failure modes.&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Create a Dataset
&lt;/h2&gt;

&lt;p&gt;The dataset is the foundation. Bad examples produce misleading eval scores. Each example has &lt;code&gt;inputs&lt;/code&gt; (what goes to the agent) and &lt;code&gt;outputs&lt;/code&gt; (the ground truth to evaluate against).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;ls_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-evals-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q&amp;amp;A agent evaluation dataset covering core support topics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_examples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is your refund policy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How long does express shipping take?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Does the warranty cover water damage?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are your support hours?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Can I return a digital download?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Do you sell gift cards?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the refund window for opened electronics?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Full refund within 30 days for unopened items. Opened items eligible for exchange only within 14 days. Digital products are non-refundable after download.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;30 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unopened&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exchange&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;14 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Express shipping takes 2-3 business days and costs $12.99.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2-3 business days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12.99&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The warranty does not cover accidental damage, including water damage.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;does not cover&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accidental damage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customer support is available Monday-Friday 9am-6pm EST. Chat support is available 24/7.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Monday-Friday&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9am-6pm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;24/7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Digital products are non-refundable after download.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;digital&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non-refundable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have information about gift cards in the knowledge base.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expects_no_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Opened items are eligible for exchange only within 14 days.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exchange&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;14 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
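&lt;p&gt;One practical note: &lt;code&gt;create_dataset&lt;/code&gt; raises if a dataset with that name already exists, so the snippet above isn't safely re-runnable. A small guard keeps reruns idempotent (a sketch using the client's &lt;code&gt;has_dataset&lt;/code&gt; and &lt;code&gt;read_dataset&lt;/code&gt; lookups):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Sketch: reuse the dataset on reruns instead of erroring out.
if ls_client.has_dataset(dataset_name="qa-agent-evals-v1"):
    dataset = ls_client.read_dataset(dataset_name="qa-agent-evals-v1")
else:
    dataset = ls_client.create_dataset(
        dataset_name="qa-agent-evals-v1",
        description="Q&amp;amp;A agent evaluation dataset covering core support topics",
    )
    # ...then ls_client.create_examples(...) as above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;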



&lt;p&gt;&lt;strong&gt;Seven examples is a starting point, not a finish line.&lt;/strong&gt; In production, you need 50-100 examples covering happy paths, edge cases, and adversarial inputs. But starting with seven well-chosen examples that cover your core failure modes is better than starting with zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Build the LangSmith Evaluators
&lt;/h2&gt;

&lt;p&gt;Three layers of evaluation: deterministic checks (fast, cheap, reliable), LLM-as-judge (flexible, handles nuance), and trajectory evaluation (validates agent behavior, not just output).&lt;/p&gt;

&lt;h3&gt;
  
  
  Deterministic Evaluators
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceable&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_keyword_coverage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;keyword_coverage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pass if the response mentions ALL required keywords. Fail if any are missing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;must_mention&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;must_mention&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword_coverage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;term&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;must_mention&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;term&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword_coverage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;must_mention&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_tool_usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tool_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pass if the agent called all expected tools. Fail if any are missing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;expected_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;
    &lt;span class="n"&gt;actual_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
            &lt;span class="n"&gt;actual_tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;expected_tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;expected_tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;issubset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual_tools&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_no_hallucination_on_missing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;no_hallucination_on_missing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;When the KB has no answer, pass if the agent admits it. Fail if it fabricates.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expects_no_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no_hallucination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;hedging_phrases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have information&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no information&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cannot find&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no relevant results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;i don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not in the knowledge base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;i&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m not sure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;hedged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phrase&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;phrase&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;hedging_phrases&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no_hallucination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hedged&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
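&lt;p&gt;These evaluators are plain functions, so you can sanity-check them locally before handing them to LangSmith. A throwaway check with fabricated data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Throwaway local check -- fabricated data, purely illustrative.
fake_outputs = {"response": "Express shipping takes 2-3 business days and costs $12.99."}
fake_reference = {"must_mention": ["2-3 business days", "12.99"]}

print(keyword_coverage({}, fake_outputs, fake_reference))
# {'key': 'keyword_coverage', 'score': True}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;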



&lt;p&gt;&lt;strong&gt;Why pass/fail instead of continuous scores?&lt;/strong&gt; You don't ship code when 73% of your unit tests "mostly pass." When &lt;code&gt;keyword_coverage&lt;/code&gt; fails, you know exactly what happened: the agent missed a required term. A score of 0.75 tells you something is partially wrong, but you still have to go figure out what. And binary evaluators don't suffer from judge variance: the same input produces the same verdict every time.&lt;br&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM-as-Judge Evaluators
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openevals.llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_llm_as_judge&lt;/span&gt;
&lt;span class="n"&gt;CORRECTNESS_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;You are evaluating a customer support agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s response.
Customer question: {inputs[question]}
Agent response: {outputs[response]}
Expected answer: {reference_outputs[expected_answer]}
Determine whether the agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s response is correct.
A response is CORRECT if it:
- Contains the key factual claims from the expected answer
- Does not contradict the expected answer
- Does not fabricate information beyond what the knowledge base provides
A response is INCORRECT if it:
- Misses critical factual information from the expected answer
- States anything that contradicts the expected answer
- Invents details not present in the knowledge base
Return ONLY: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: true}} or {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: false}}
with a &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; field explaining your verdict.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;correctness_judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_llm_as_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CORRECTNESS_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic:claude-sonnet-4-5-20250929&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;feedback_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;correctness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;TONE_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;You are evaluating the tone and professionalism of a customer support agent.
Customer question: {inputs[question]}
Agent response: {outputs[response]}
Determine whether the agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s tone is ACCEPTABLE or UNACCEPTABLE.
ACCEPTABLE tone: professional, helpful, concise, empathetic, and action-oriented.
UNACCEPTABLE tone: condescending, rude, excessively verbose, robotic, dismissive,
or inappropriately casual for a support context.
Return ONLY: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: true}} or {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: false}}
with a &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; field explaining your verdict.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;tone_judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_llm_as_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TONE_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic:claude-sonnet-4-5-20250929&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;feedback_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
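&lt;p&gt;Judges are callables too, so you can spot-check one against a single hand-written example before committing to a full run. A sketch (costs one model call; the keyword arguments mirror the evaluator signature openevals expects):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Spot-check the correctness judge on one example (one model call).
verdict = correctness_judge(
    inputs={"question": "How long does express shipping take?"},
    outputs={"response": "Express shipping takes 2-3 business days and costs $12.99."},
    reference_outputs={"expected_answer": "Express shipping takes 2-3 business days and costs $12.99."},
)
print(verdict)  # e.g. {'key': 'correctness', 'score': True, 'comment': '...'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;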



&lt;p&gt;&lt;strong&gt;Binary judges are more reliable than continuous ones.&lt;/strong&gt; Ask a model to rate something 0.0-1.0 and you'll get different scores on every run. Ask it "correct or incorrect?" and you'll get the same answer 95%+ of the time. The judge isn't deciding how correct the response is; it's deciding whether the response meets a bar. Easier task, more consistent results, fewer false signals in CI.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Trajectory Evaluator&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Trajectory evaluation checks not just &lt;strong&gt;what&lt;/strong&gt; the agent said, but &lt;strong&gt;how&lt;/strong&gt; it got there. Did it call the right tools? Did it call them in a reasonable order? Did it over-call or under-call?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentevals.trajectory.llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_trajectory_llm_as_judge&lt;/span&gt;
&lt;span class="n"&gt;TRAJECTORY_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;You are evaluating whether an AI agent took a reasonable path to answer a question.
The agent has access to a knowledge base search tool.
Evaluate the agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s trajectory (sequence of actions and messages):
{outputs}
A trajectory PASSES if:
- The agent called the appropriate tool(s) for the question
- The agent did not make unnecessary or redundant tool calls
- The agent used tool results to formulate its response
- The agent did not ignore relevant tool results
A trajectory FAILS if:
- The agent skipped tool calls and answered from its own knowledge
- The agent made excessive redundant calls (more than 2 calls for a simple question)
- The agent ignored tool results and fabricated an answer
- The agent called completely irrelevant tools
Return ONLY: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: true}} or {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: false}}
with a &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; field explaining your verdict.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;trajectory_judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_trajectory_llm_as_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic:claude-sonnet-4-5-20250929&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TRAJECTORY_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_trajectory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;trajectory_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pass if the agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s tool-calling trajectory was reasonable. Fail otherwise.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trajectory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;trajectory_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trajectory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
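&lt;p&gt;The LLM judge is the flexible layer here; if you also want a deterministic guard on trajectory shape, a tool-call budget is cheap to add alongside it. A sketch (the 1-2 call budget mirrors the prompt's redundancy rule and is an assumption to tune per agent):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Deterministic complement to the LLM trajectory judge: a simple call budget.
def tool_call_budget(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Pass if the agent made 1-2 tool calls; fail on zero or excessive calls."""
    n_calls = sum(
        len(getattr(msg, "tool_calls", []) or [])
        for msg in outputs.get("messages", [])
    )
    return {"key": "tool_call_budget", "score": 1 &amp;lt;= n_calls &amp;lt;= 2}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;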



&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Run the Evaluation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Wire the target function and evaluators into &lt;code&gt;evaluate()&lt;/code&gt;. The target function takes dataset inputs, runs the agent, and returns a dict with the keys your evaluators expect.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa_eval_target&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;target&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qa_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-evals-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;evaluators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;correctness_judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tone_judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;keyword_coverage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;no_hallucination_on_missing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;trajectory_eval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;experiment_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
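&lt;p&gt;The returned object doubles as a programmatic gate. A sketch of a CI assertion (assumes pandas is installed; the flattened &lt;code&gt;feedback.*&lt;/code&gt; column names are how LangSmith exports evaluator feedback, but verify against your own export):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Sketch: fail CI when the correctness pass rate drops below a bar.
df = results.to_pandas()
pass_rate = df["feedback.correctness"].mean()
assert pass_rate &amp;gt;= 0.9, f"correctness pass rate dropped to {pass_rate:.0%}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;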



&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: Compare Runs and Catch Regressions&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A single experiment tells you where the agent stands. The payoff comes from comparing a candidate run against a baseline and flagging any evaluator whose pass rate drops.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compare_experiments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compare_experiments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;baseline_prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;candidate_prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;regression_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Compare two experiment runs and flag regressions in pass rates.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;experiments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_projects&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;project_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;reference_dataset_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-evals-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;experiments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_prefix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_prefix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Could not find both experiments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;baseline_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_test_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;candidate_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_test_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_pass_rates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feedback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
                &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;baseline_rates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_pass_rates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;candidate_rates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_pass_rates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;regressions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;baseline_rates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;candidate_rates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;candidate_rates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;baseline_rates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;regression_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;regressions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;baseline_pass_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_rates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;candidate_pass_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_rates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;regressions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;regressions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;regressions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;baseline_rates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;baseline_rates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;candidate_rates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;candidate_rates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each evaluate() run creates an experiment in LangSmith, a versioned snapshot of your agent's performance. The result is a pass rate per evaluator: "correctness: 6/7 passed, tool_usage: 7/7 passed, keyword_coverage: 5/7 passed." Every future eval run with a different experiment_prefix becomes a comparable data point, and the comparison helper above turns any two of those snapshots into a regression report.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: AI Agent Testing with Regression Detection&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The power of experiments is comparison. When you change a prompt, model, or tool, run the same eval suite and compare pass rates; that is exactly what the comparison helper above automates. Comparison is only meaningful if you can tell versions apart, so tag every production entry point with version metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tracing_context&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_user_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Production entry point with trace tagging.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;tracing_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;channel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v2.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-02-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qa_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, the comparison runs in CI. A prompt change creates a new experiment, the comparison script checks for regressions, and the PR is blocked if any evaluator's pass rate drops by more than 10 percentage points. This is the single most valuable thing you can build with LangSmith — the rest is instrumentation. The threshold is 10 points rather than 5 because pass/fail metrics move in discrete jumps: on a 7-example dataset, one additional failure drops the pass rate by ~14 points. On a 50-example dataset, you can tighten the threshold to 5. Scale the threshold to your dataset size.&lt;/p&gt;
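
&lt;p&gt;A minimal version of that CI gate, assuming the comparison function above is importable as compare_experiments (the name and experiment prefixes here are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Hypothetical CI gate: compare this PR's experiment against the last
# known-good baseline and fail the build on any flagged regression.
import sys

report = compare_experiments(
    baseline_prefix="qa-agent-v1",   # last release's experiment prefix
    candidate_prefix="qa-agent-v2",  # this PR's experiment prefix
    regression_threshold=0.10,
)

if not report.get("passed", False):
    for r in report.get("regressions", []):
        print(f"REGRESSION {r['metric']}: "
              f"{r['baseline_pass_rate']} -&amp;gt; {r['candidate_pass_rate']}")
    sys.exit(1)  # the non-zero exit is what actually blocks the PR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;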

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 5: Production Monitoring&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Offline evals catch regressions before deploy. Online monitoring catches drift after deploy — the slow degradation that happens when user behavior shifts, knowledge bases get stale, or upstream APIs change. Cheap deterministic evaluators are the natural candidates to run online against sampled traces; a creeping tool-call count, for example, is often the first visible symptom of drift:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_tool_call_efficiency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tool_call_efficiency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fail if the agent made more than 3 tool calls for a single question.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;tool_call_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call_efficiency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_call_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tag every production trace with the agent version and prompt version. When you deploy a new version, you can filter traces by version and compare pass rates across versions — with real user traffic, not synthetic examples. The monitoring loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tag all production traces with version metadata&lt;/li&gt;
&lt;li&gt;Configure LangSmith online evaluators to sample 10-20% of traces&lt;/li&gt;
&lt;li&gt;Dashboard alerts on pass rate drops by version&lt;/li&gt;
&lt;li&gt;When a drop is detected, pull the failing traces, add them to your offline dataset, and fix (see the sketch after this list)&lt;/li&gt;
&lt;/ol&gt;
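
&lt;p&gt;Step 4 is the one teams skip. A rough sketch of it with the LangSmith SDK; the project name and feedback key are illustrative, and the run-filter string should be checked against the filter grammar for your SDK version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Rough sketch of step 4: pull production runs that failed the online
# correctness evaluator and promote them into the offline eval dataset.
from langsmith import Client

client = Client()

failing_runs = client.list_runs(
    project_name="qa-agent-prod",  # illustrative production project name
    filter='and(eq(feedback_key, "correctness"), eq(feedback_score, 0))',
)

for run in failing_runs:
    # Triage run.outputs manually before treating them as reference answers.
    client.create_examples(
        inputs=[run.inputs],
        outputs=[run.outputs],
        dataset_name="qa-agent-evals-v1",
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;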

&lt;h2&gt;
  
  
  &lt;strong&gt;Production Failures&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;These are the eval-specific failure modes that surface once you're running evals in CI and production. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Flaky LLM-as-Judge Verdicts.&lt;/strong&gt; The same input/output pair passes on one run and fails on the next. The judge model is non-deterministic, and your eval is measuring judge variance, not agent quality. Fix: set temperature=0 on the judge model, make your pass/fail criteria as specific as possible (list exactly what constitutes a pass), and run each evaluation three times and take the majority vote. If the same example flips verdict more than 10% of the time, your criteria need to be sharper. Binary verdicts are already far more stable than continuous scores, but ambiguous criteria still cause flakiness.&lt;/p&gt;
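
&lt;p&gt;A minimal majority-vote wrapper, assuming the wrapped judge follows the (inputs, outputs, reference_outputs) evaluator signature used throughout this post:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Sketch: run a pass/fail judge n times and take the majority verdict.
# Use an odd n so there are no ties.
def majority_vote(judge, key: str, n: int = 3):
    def wrapped(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
        votes = [
            bool(judge(inputs, outputs, reference_outputs)["score"])
            for _ in range(n)
        ]
        return {"key": key, "score": votes.count(True) &amp;gt; n // 2}
    return wrapped

# Usage: stable_correctness = majority_vote(correctness_judge, "correctness")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;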

&lt;p&gt;&lt;strong&gt;2. Eval Gaming.&lt;/strong&gt; You optimize the prompt to pass the eval dataset. The pass rate goes up. User satisfaction doesn't. Your dataset is too narrow: the agent learned your test distribution, not the actual problem. Fix: rotate examples into and out of the eval set quarterly. Pull 10% of examples from production traces each month. Never let the eval set become stale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Judge Model Disagreement.&lt;/strong&gt; You switch the judge from Claude to GPT and pass rates shift by 20%. The evaluator is measuring model preference, not quality. Fix: calibrate your judge against human ratings. Run 50 examples through both the judge and a human annotator. If they disagree on more than 10% of verdicts, your judge criteria need work. openevals provides pre-calibrated prompts as a starting point. &lt;/p&gt;
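
&lt;p&gt;Calibration doesn't need tooling; it's an agreement rate. A sketch, where judge_verdicts and human_verdicts are parallel lists of booleans you collect offline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Sketch: judge calibration reduced to an agreement rate over paired verdicts.
def agreement_rate(judge_verdicts: list, human_verdicts: list) -&amp;gt; float:
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)

# If agreement on ~50 examples is below 0.90 (more than 10% disagreement),
# rework the judge criteria before trusting its verdicts in CI.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;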

&lt;p&gt;&lt;strong&gt;4. Dataset Drift.&lt;/strong&gt; Your eval dataset was created six months ago. The product has changed: new policies, new features, different user behavior. The evals are passing, but they're testing scenarios that no longer matter while ignoring scenarios that do. Fix: timestamp your examples. Review the dataset monthly. Add production failure cases as they occur. Delete examples for deprecated features.&lt;/p&gt;
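
&lt;p&gt;Timestamps only help if something reads them. A sketch of the monthly staleness check, assuming your SDK version exposes created_at on examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Sketch: flag eval examples older than ~90 days for the monthly review.
from datetime import datetime, timedelta, timezone

from langsmith import Client

client = Client()
cutoff = datetime.now(timezone.utc) - timedelta(days=90)

for ex in client.list_examples(dataset_name="qa-agent-evals-v1"):
    if ex.created_at &amp;lt; cutoff:
        print(f"Review: {ex.id} (created {ex.created_at:%Y-%m-%d})")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;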

&lt;p&gt;&lt;strong&gt;5. Trajectory Eval False Positives.&lt;/strong&gt; The trajectory judge says the agent's path was "reasonable" even when the agent called the wrong tool first and then self-corrected. Self-correction is fine in production but expensive: it adds latency and cost. Fix: add a separate tool-call-count evaluator, like tool_call_efficiency above, that fails trajectories with more than N tool calls. Combine the trajectory pass/fail with an efficiency gate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tracing_context&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;tracing_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_run&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trigger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ci&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evaluation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ci&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-evals-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;evaluators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;correctness_judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tone_judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;keyword_coverage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tool_usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;no_hallucination_on_missing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;trajectory_eval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tool_call_efficiency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;experiment_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Observability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Every evaluator is @traceable, which means your evals themselves are traced in LangSmith. This matters more than you think: when an evaluator produces a surprising verdict, you can inspect exactly what it saw and why it ruled that way. For quick reference, here is the condensed, end-to-end suite (one LLM judge plus two deterministic checks) in a single block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openevals.llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_llm_as_judge&lt;/span&gt;
&lt;span class="n"&gt;ls_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;QUALITY_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;Question: {inputs[question]}
Response: {outputs[response]}
Expected: {reference_outputs[expected_answer]}
Does the response correctly and completely answer the question
based on the expected answer?
PASS if the response contains all key facts from the expected answer
and does not contradict it.
FAIL if the response misses critical information, contradicts the
expected answer, or fabricates details.
Return ONLY: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: true}} or {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: false}}
with a &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; field.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;quality_judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_llm_as_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;QUALITY_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic:claude-sonnet-4-5-20250929&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;feedback_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;coverage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pass if ALL required terms are present. Fail if any are missing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;must_mention&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;must_mention&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coverage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;all_present&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;must_mention&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coverage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;all_present&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;response_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fail if the response is too short to be useful or excessively long.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;word_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;word_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;target&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qa_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-evals-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;evaluators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;quality_judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_length&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;experiment_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-quick-check&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What to watch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Correctness pass rate.&lt;/strong&gt; This is your north star. If it drops, the agent is giving wrong answers. Every other metric is secondary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool usage pass rate.&lt;/strong&gt; If this drops, the agent stopped using the knowledge base — probably a prompt regression that caused it to answer from parametric memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No-hallucination pass rate.&lt;/strong&gt; If this drops, the agent is making up answers when it should be admitting ignorance. This is the most dangerous regression and the one most likely to slip through manual review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failing examples across runs.&lt;/strong&gt; Track which specific examples fail consistently. These are your hardest cases — either improve the agent to handle them or accept them as known limitations and document them (a counting sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
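
&lt;p&gt;For that last point, a counting sketch. It assumes each result row carries an example_id alongside the feedback dict that the comparison code reads above; adjust the field names to your actual result shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Sketch: count per-example failures for one metric across several runs.
from collections import Counter

def consistent_failures(runs: list, metric: str = "correctness") -&amp;gt; Counter:
    fails = Counter()
    for results in runs:  # one list of result rows per experiment
        for r in results:
            if r.get("feedback", {}).get(metric) is False:
                fails[r["example_id"]] += 1
    return fails

# Examples that fail in every run are your hardest cases:
# consistent_failures([v1_results, v2_results]).most_common(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;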

&lt;h2&gt;
  
  
  &lt;strong&gt;Evals&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This section is the whole article in miniature: the diagram below is the minimum viable pipeline you should have before shipping any agent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
┌─────────────────────────────────────────────────────────┐
│                    OFFLINE EVALS                        │
│                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────────────┐   │
│  │ Dataset  │───►│ Target   │───►│ Evaluators       │   │
│  │ (examples│    │ Function │    │ - Correctness    │   │
│  │  + refs) │    │ (agent)  │    │ - Completeness   │   │
│  └──────────┘    └──────────┘    │ - Trajectory     │   │
│                                  │ - LLM-as-Judge   │   │
│                                  └────────┬─────────┘   │
│                                           │             │
│                                  ┌────────▼─────────┐   │
│                                  │ Experiment       │   │
│                                  │(versioned scores)│   │
│                                  └──────────────────┘   │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                   ONLINE MONITORING                     │
│                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────────────┐   │
│  │ Prod     │───►│ Traces   │───►│ Online Evals     │   │
│  │ Traffic  │    │ (tagged) │    │ (sampling)       │   │
│  └──────────┘    └──────────┘    └──────────────────┘   │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                 REGRESSION DETECTION                    │
│                                                         │
│  Experiment v1 scores  ◄──── compare ────►  v2 scores   │
│                                                         │
│  Δ correctness: -0.15  ← REGRESSION DETECTED            │
└─────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;When to Use This&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Build a full eval pipeline when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're shipping prompt changes more than once a month&lt;/li&gt;
&lt;li&gt;More than one person works on the agent&lt;/li&gt;
&lt;li&gt;The agent handles queries where wrong answers have consequences (policy, pricing, compliance)&lt;/li&gt;
&lt;li&gt;You need to compare model versions (Claude vs GPT, Sonnet vs Haiku)&lt;/li&gt;
&lt;li&gt;You're running A/B tests on agent behavior &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Start with just deterministic evals when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent is in early development and schemas are changing weekly&lt;/li&gt;
&lt;li&gt;You have fewer than 5 test cases&lt;/li&gt;
&lt;li&gt;The output format is structured (JSON extraction) and correctness is binary &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skip evals when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The application is a prototype that won't see real users&lt;/li&gt;
&lt;li&gt;You're the only user and you'll notice regressions immediately&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Bottom Line&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Evals are not a nice-to-have. They're the difference between "I think the prompt is better" and "I know the prompt is better, and here are the pass rates." Dataset, evaluators, experiment, comparison. Four components. ~20 lines per evaluator. The payoff is catching every regression before your users do.&lt;/p&gt;

&lt;p&gt;Pass/fail is the right default. Continuous scores feel more sophisticated, but they create ambiguity — is 0.72 good? Is a drop from 0.81 to 0.76 a regression or noise? Pass/fail kills the question. Green or red. When you need more nuance, add more evaluators with sharper criteria instead of adding decimal places to existing ones.&lt;/p&gt;

&lt;p&gt;Start with three: one deterministic keyword check, one LLM-as-judge for correctness, one for tool usage. Run them on every PR that touches agent code. Add trajectory evaluation when your agent has more than two tools. Add production monitoring when you have traffic. And update the dataset — the dataset that stops growing is the one that stops catching bugs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Technical Resources&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/focused-dot-io/07-evaluation-testing/tree/746f37847075391b6a638be55ce8f2507c55f231" rel="noopener noreferrer"&gt;Eval Pipelines GitHub Repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.smith.langchain.com/evaluation" rel="noopener noreferrer"&gt;LangSmith Evaluation (datasets, evaluators, experiments)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.langchain.com/langgraph" rel="noopener noreferrer"&gt;LangGraph (stateful agents, tool calling, orchestration)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.smith.langchain.com/observability" rel="noopener noreferrer"&gt;LangSmith Tracing &amp;amp; Observability&lt;/a&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>programming</category>
      <category>ai</category>
      <category>langchain</category>
    </item>
    <item>
      <title>Debugging Your RAG Application: A LangChain, Python, and OpenAI Tutorial</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 15 Apr 2026 12:42:32 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/debugging-your-rag-application-a-langchain-python-and-openai-tutorial-4gke</link>
      <guid>https://dev.to/focused_dot_io/debugging-your-rag-application-a-langchain-python-and-openai-tutorial-4gke</guid>
      <description>&lt;p&gt;Let's explore a real-world example of debugging a RAG-type application. I recently undertook this process while updating our company knowledge base -- a resource for potential clients and employees to learn about us.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack:
&lt;/h2&gt;

&lt;p&gt;I work with Python and the LangChain framework, specifically using LangChain Expression Language (LCEL) to build chains. You can find the LangChain LCEL documentation &lt;a href="https://python.langchain.com/docs/how_to/#langchain-expression-language-lcel" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
This approach serves as a good alternative to LangChain's debugging tool, &lt;a href="https://www.langchain.com/langsmith" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Imports and the module-level session store this snippet relies on
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;itemgetter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ConversationBufferMemory&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_buffer_string&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.runnables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RunnableLambda&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RunnablePassthrough&lt;/span&gt;

&lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  &lt;span class="c1"&gt;# session_id -&amp;gt; ConversationBufferMemory&lt;/span&gt;

&lt;span class="c1"&gt;# Load memory
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_session_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ConversationBufferMemory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversationBufferMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;return_messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_loaded_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;get_session_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;load_memory_variables&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_memory_chain&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RunnablePassthrough&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RunnableLambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_get_loaded_memory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create Question
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_question_chain&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;get_buffer_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                               &lt;span class="p"&gt;}&lt;/span&gt;
                               &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;CONDENSE_QUESTION_PROMPT&lt;/span&gt;
                               &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;
                               &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;StrOutputParser&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Retrieve Documents
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Answer
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_answer_chain&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;final_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;combine_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;DEFAULT_DOCUMENT_PROMPT&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;final_inputs&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;ANSWER_PROMPT&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Final Chain looks like this
&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_memory_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;create_question_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;create_answer_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While debugging, I prefer using a cheaper model like gpt-3.5-turbo for its cost-effectiveness. The less advanced models are more than adequate for basic testing. For final testing and deployment to production, you might consider upgrading to gpt-4-turbo or a similar advanced model.&lt;br&gt;&lt;br&gt;
I also favor Jupyter notebooks for much of my debugging. This way, I can include the notebook in a .gitignore file, reducing cleanup from debugging shenanigans in my main code. I can also run very specific pieces of my code without plumbing overhead.  &lt;/p&gt;
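&lt;p&gt;For illustration, here is a minimal sketch of that model swap, assuming a hypothetical DEBUG_MODE environment flag (the flag is mine for this post, not from the actual codebase):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Minimal sketch (assumption: a DEBUG_MODE env flag controls model choice).
import os
from langchain_openai import ChatOpenAI

DEBUG_MODE = os.getenv("DEBUG_MODE", "true").lower() == "true"

# Cheap model while debugging, stronger model for final testing/production.
llm = ChatOpenAI(
    model="gpt-3.5-turbo" if DEBUG_MODE else "gpt-4-turbo",
    temperature=0,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;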
&lt;h2&gt;
  
  
  Initial Observations
&lt;/h2&gt;

&lt;p&gt;I noticed that basic queries received correct answers, but any follow-up question would lack the appropriate context, indicating that conversational memory was no longer functioning effectively.&lt;br&gt;&lt;br&gt;
Here's what I observed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;Love&lt;/span&gt; &lt;span class="n"&gt;Your&lt;/span&gt; &lt;span class="n"&gt;Craft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Listen&lt;/span&gt; &lt;span class="n"&gt;First&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;Learn&lt;/span&gt; &lt;span class="n"&gt;Why&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Based&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;given&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;importance&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Red&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Test&lt;/span&gt; &lt;span class="n"&gt;Driven&lt;/span&gt; &lt;span class="nc"&gt;Development &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TDD&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt; &lt;span class="err"&gt;❌&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, I expected responses more in line with explanations like "Love your craft is when you are passionate about what you do."&lt;br&gt;&lt;br&gt;
For more context, this issue with conversational memory arose while I was implementing a new feature: allowing end users to customize responses based on their role. So, for example, a developer could receive a highly technical answer while a marketing manager would see more high-level details.  &lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Debugging Steps&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. &lt;strong&gt;Ensure Role Feature Integrity&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To make sure my debugging didn't break the newly implemented role feature, I temporarily updated my system prompt so the role was overly obvious and active in every response during this session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer the question from the perspective of a {role}.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;DEBUGGING_SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer the question in a {role} accent.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's how the AI responded, clearly adhering to my updated prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
&lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pirate&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;be&lt;/span&gt; &lt;span class="n"&gt;Love&lt;/span&gt; &lt;span class="n"&gt;Your&lt;/span&gt; &lt;span class="n"&gt;Craft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Listen&lt;/span&gt; &lt;span class="n"&gt;First&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;Learn&lt;/span&gt; &lt;span class="n"&gt;Why&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;matey&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt; &lt;span class="n"&gt;be&lt;/span&gt; &lt;span class="n"&gt;talkin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; about the importance of reachin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Red&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="n"&gt;stage&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Test&lt;/span&gt; &lt;span class="n"&gt;Driven&lt;/span&gt; &lt;span class="n"&gt;Development&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;strong&gt;Creating a Visual Representation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I created a diagram of the app to visualize the process flow.&lt;br&gt;&lt;br&gt;
I began at the end of my flow and worked backward to identify issues. I first checked whether my LLM was answering questions based on the provided context. Upon inspecting the sources, I realized that the given context was a blog on TDD.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://focusedlabs.io/blog/tdd-first-step-think&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thus, I ruled out the answer component as the source of the bug.  &lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Tracing the Bug's Origin&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Next, I examined the logic for retrieving documents. I added a 'standalone question' key to every input and output chain to log runtime values, which revealed that questions were being incorrectly rephrased.&lt;br&gt;&lt;br&gt;
&lt;em&gt;Adding these keys to the chains lets us log the values each component sees at runtime. Breakpoints, by contrast, only show the chain as it is instantiated, before it is populated with real-time values.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Code Snippet with added keys
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_answer_chain&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;final_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added 
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;standalone_question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;standalone_question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;can&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="err"&gt;❌&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I expected the &lt;code&gt;standalone_question&lt;/code&gt; to be more specific, like "What can you tell me about the core value of Love your Craft?"  &lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Identifying the Exact Source&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I focused on the &lt;code&gt;chat_history&lt;/code&gt; variable, suspecting an issue with how the chat history was being recognized.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_answer_chain&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;final_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added 
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="err"&gt;❌&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Found the issue! Since the &lt;code&gt;chat_history&lt;/code&gt; was blank, it wasn't being loaded as I had assumed.  &lt;/p&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Implementing the Solution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I resolved the issue by checking my conversation memory store. As a &lt;code&gt;dict&lt;/code&gt;, the store is sensitive to the type of its keys. I had saved messages under a &lt;code&gt;str&lt;/code&gt;-converted version of &lt;code&gt;session_id&lt;/code&gt;, but invoked the chain with an &lt;code&gt;Optional[UUID]&lt;/code&gt; version. So, while the conversation memory store itself was set up correctly, I needed to update how I invoked my chain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Therefore, I updated the &lt;code&gt;session_id&lt;/code&gt; type to &lt;code&gt;str&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
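&lt;p&gt;To see why the original invocation silently missed the store, here is a minimal, self-contained sketch of the key-type mismatch (illustrative only, not code from the app):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# A UUID and its str form are different dict keys, so lookups with the
# raw UUID silently miss the memory that was saved under the string key.
from uuid import uuid4

session_id = uuid4()
store = {str(session_id): "saved conversation memory"}

print(str(session_id) in store)  # True  -- how the memory was saved
print(session_id in store)       # False -- how the chain was being invoked
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;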



&lt;h3&gt;
  
  
  6. &lt;strong&gt;Confirming the Fix&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I confirmed that the conversation memory now functioned correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;What are the Focused Labs core values?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;standalone_question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Can&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;provide&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;information&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Love&lt;/span&gt; &lt;span class="n"&gt;Your&lt;/span&gt; &lt;span class="n"&gt;Craft&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;This&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="n"&gt;means&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt; &lt;span class="n"&gt;we&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;passionate&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;being&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="n"&gt;what&lt;/span&gt; &lt;span class="n"&gt;we&lt;/span&gt; &lt;span class="n"&gt;do&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;paying&lt;/span&gt; &lt;span class="n"&gt;attention&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;every&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://www.notion.so/Who-are-we-c42efb179fa64f6bb7866deb363fb7ef&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7. &lt;strong&gt;Final Cleanup and Future-Proofing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I reverted the temporary pirate-accent debug prompt I had used to make the role feature easy to spot.&lt;br&gt;&lt;br&gt;
I decided to maintain detailed logging within the system for future debugging efforts.  &lt;/p&gt;
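&lt;p&gt;As an illustration of what stayed in place, here is a minimal sketch of a pass-through logging step (the log_step helper is mine for this post, not part of LangChain):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Minimal sketch: a pass-through runnable that logs intermediate values
# without changing the data flowing through the chain.
import logging
from langchain_core.runnables import RunnableLambda

logger = logging.getLogger("rag_debug")

def log_step(label: str) -&amp;gt; RunnableLambda:
    def _log(value):
        logger.debug("%s: %r", label, value)
        return value
    return RunnableLambda(_log)

# Usage (sketch): chain = load_memory_chain() | log_step("after_memory") | ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;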

&lt;h2&gt;
  
  
  &lt;strong&gt;Key Takeaways&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Debugging AI Systems:&lt;/strong&gt; A mix of traditional and AI-specific debugging techniques is essential.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opting for Cost-Effective Models:&lt;/strong&gt; Use more affordable models to reduce costs during repeated queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Importance of Transparency:&lt;/strong&gt; Clear visibility into each step and component of your RAG pipeline accelerates debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type Consistency:&lt;/strong&gt; Paying attention to small details, like variable types, can significantly impact functionality.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Thanks for reading!&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Stay tuned for more insights into the world of software engineering and AI. Have questions or insights? Feel free to share them in the comments below!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>python</category>
      <category>langchain</category>
      <category>ai</category>
    </item>
    <item>
      <title>OpenAI Assistants: Limited, but Incredible</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 15 Apr 2026 12:42:29 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/openai-assistants-limited-but-incredible-2lik</link>
      <guid>https://dev.to/focused_dot_io/openai-assistants-limited-but-incredible-2lik</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;In the ever-evolving realm of AI development, OpenAI has ventured into the domain of &lt;a href="https://dev.to/lab/a-practical-guide-for-migrating-classic-langchain-agents-to-langgraph"&gt;Retrieval Augmented Generation&lt;/a&gt; (RAG) with its latest offering, &lt;a href="https://platform.openai.com/docs/assistants/overview" rel="noopener noreferrer"&gt;Assistants&lt;/a&gt;. Assistants with Retrieval represents OpenAI's attempt to harness the power of RAG — a technique widely embraced and refined within open-source libraries to augment an AI's knowledge with custom information. Think ChatGPT, but with the ability to ask it about custom information. While it doesn't yet lead the pack in creating AI-driven knowledge bases, its inception marks a significant step towards more sophisticated and nuanced AI applications. As developers, our dive into this beta tool is not just about evaluating its current capabilities, but also understanding its place in the broader context of RAG's evolution and its potential to reshape our approaches to LLMs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Incredible Accuracy for Incredible Ease of Use&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;OpenAI's RAG tool is an awesome prospect for developers eager to explore the nuances of retrieval-augmented models. It serves as a canvas for experimentation, offering a glimpse into the future of sophisticated AI applications. Here's why it's a valuable experimental platform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ease of Use&lt;/strong&gt;: Its user-friendly nature invites developers to explore and experiment with minimal setup (see the sketch after this list). This lets more people try more use cases and learn where LLM retrieval could improve their workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Respectable Accuracy&lt;/strong&gt;: We swapped the gpt-3.5-turbo model in our custom chatbot for an OpenAI Assistant and saw a slightly lower but similar level of accuracy. This is so cool! Being able to spin up a custom chatbot in a couple of hours with a ~75% accuracy rate is incredible! For more information about our custom chatbot, visit our website: &lt;a href="https://dev.to/capabilities"&gt;focused.io/capabilities&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
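&lt;p&gt;To give a sense of that setup cost, here is a minimal sketch against the beta Assistants API as it exists at the time of writing (the file path, name, and instructions are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Minimal sketch of the beta Assistants API; names and paths are placeholders.
from openai import OpenAI

client = OpenAI()

# Upload a knowledge-base file for retrieval.
kb_file = client.files.create(file=open("knowledge_base.txt", "rb"), purpose="assistants")

# Create an assistant with the retrieval tool attached to that file.
assistant = client.beta.assistants.create(
    name="KB Assistant",
    instructions="Answer questions using the attached knowledge base.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[kb_file.id],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;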

&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding the Limitations: A Beta Analysis&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In its beta stage, this tool has its limitations. The current landscape of RAG in open-source libraries such as &lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;Langchain&lt;/a&gt; sets a high benchmark, and OpenAI's iteration is an ambitious stride into this territory. However, with ambition comes the teething problems of any beta technology. Here's what developers should be aware of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Source Citation Feature&lt;/strong&gt; : A vital feature for gaining user trust, source citation, is not yet functional, indicating future improvements but also a present gap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document Limitations&lt;/strong&gt;: An assistant supports only 20 documents, each up to 512 MB, which is not scalable for most enterprise datasets. For smaller datasets the file-size cap isn't the constraint, but combining multiple smaller files into one larger file is an anti-pattern that loses structure and context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of Customization&lt;/strong&gt; : While the simple interface facilitates quick setup times, high abstraction levels mean less control for developers to adjust and optimize the tool for specific use cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Polling Over Streaming&lt;/strong&gt;: Without streaming capabilities, developers are left polling for results, which hampers real-time responsiveness (a sketch of the polling loop follows this list).&lt;/li&gt;
&lt;/ol&gt;
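&lt;p&gt;For reference, here is a rough sketch of that polling loop against the beta API, reusing the client and assistant from the sketch above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Rough polling sketch for the beta Assistants API (no streaming available).
import time

thread = client.beta.threads.create(
    messages=[{"role": "user", "content": "What are our core values?"}]
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)

# Poll until the run leaves its in-flight states.
while run.status in ("queued", "in_progress"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;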

&lt;h2&gt;
  
  
  &lt;strong&gt;Tips for Maximizing Your Experience&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Navigating a beta tool requires a mix of patience and strategy. To make the most of your journey with OpenAI's RAG tool, keep these tips in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Management&lt;/strong&gt;: Work within the limited document count by converting and concatenating data wherever possible; we recommend *.txt formats for ease and compression (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Type Performance&lt;/strong&gt; : Experiment with different file types to discover which yields the best results for your specific case. For us, *.txt returned more accurate results over PDFs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use with Other Libraries:&lt;/strong&gt; LangChain supports OpenAI Assistants. While Assistants may not be the leader on their own, they are still powerful when combined with other techniques. I also recommend using &lt;a href="https://llamahub.ai/?tab=loaders" rel="noopener noreferrer"&gt;LlamaHub's loaders&lt;/a&gt; to help integrate with various data sources.&lt;/li&gt;
&lt;/ul&gt;
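&lt;p&gt;As an example of the concatenation tip above, a minimal sketch (the directory and file names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Minimal sketch: merge many small .txt docs into one file to stay under the
# 20-document limit; the source headers preserve some provenance.
from pathlib import Path

sources = sorted(Path("kb_docs").glob("*.txt"))

with Path("merged_kb.txt").open("w", encoding="utf-8") as out:
    for source in sources:
        out.write(f"\n\n===== Source: {source.name} =====\n")
        out.write(source.read_text(encoding="utf-8"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;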

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;OpenAI's entry into the RAG space is a testament to the ongoing evolution of AI and machine learning technologies. While this particular tool may not be ready for production, it offers a valuable learning curve for developers keen on the future of retrieval-augmented models. By engaging with it critically and creatively, we can contribute to its growth and simultaneously expand our own understanding of where RAG can take us in the realm of AI development.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>langchain</category>
      <category>ai</category>
    </item>
    <item>
      <title>Enhancing AI Apps with Streaming: Practical Tips for Smoother AI Generation</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 15 Apr 2026 12:41:57 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/enhancing-ai-apps-with-streaming-practical-tips-for-smoother-ai-generation-a77</link>
      <guid>https://dev.to/focused_dot_io/enhancing-ai-apps-with-streaming-practical-tips-for-smoother-ai-generation-a77</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;Interested in building an AI-powered app using generative models? You're on the right track! The realm of generative AI is brimming with untapped potential. A popular approach is to prompt the AI to generate a list of ideas based on provided context, allowing users to select their preferred option and then refine and expand.&lt;/p&gt;

&lt;p&gt;An example of a generative AI app is one my team and I created for copywriters to seamlessly integrate storytelling into marketing emails. This user-friendly wizard elevates email creation, combining the power of generative AI with the user's own expertise and creativity for impactful results. Our development journey led us to integrate streaming, which significantly reduced the time users spend waiting to see the AI's output.&lt;/p&gt;

&lt;p&gt;In this post, I will cover 2 main tips to enhance your user experience in AI-driven apps through effective use of streaming. Let's dive in!&lt;/p&gt;

&lt;h3&gt;
  
  
  Tech Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Backend: Node.js Typescript with Express&lt;/li&gt;
&lt;li&gt;Frontend: React Typescript&lt;/li&gt;
&lt;li&gt;AI Integration: &lt;a href="https://github.com/openai/openai-node" rel="noopener noreferrer"&gt;OpenAI's Node SDK&lt;/a&gt; and GPT-4&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's start with the basics.&lt;/p&gt;

&lt;p&gt;First, send requests to OpenAI using their Node.js SDK. Because of how we prompt it, the AI responds with a numbered list (“1.”, “2.”, and so on), and we use that numbering to split the response into separate options.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dotenv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="n"&gt;const&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OPEN_AI_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="n"&gt;const&lt;/span&gt; &lt;span class="n"&gt;chatModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;export&lt;/span&gt; &lt;span class="n"&gt;const&lt;/span&gt; &lt;span class="n"&gt;createIdeas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="n"&gt;occasion&lt;/span&gt; &lt;span class="p"&gt;}:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;occasion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;const&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sb"&gt;`I want to host an event for the      following occasion: ${occasion}. Write me a list of 4 separate ideas for this event`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chatModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;let&lt;/span&gt; &lt;span class="n"&gt;choiceElementElement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;ideas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;parseIdeas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choiceElementElement&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;Use&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;bullet&lt;/span&gt; &lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="nf"&gt;format &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;ideas&lt;/span&gt; &lt;span class="n"&gt;into&lt;/span&gt; &lt;span class="n"&gt;individual&lt;/span&gt; &lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;an&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;
&lt;span class="n"&gt;const&lt;/span&gt; &lt;span class="n"&gt;parseIdeas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;const&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;\&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;gm&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;const&lt;/span&gt; &lt;span class="n"&gt;messageSliced&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="n"&gt;messageSliced&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, let’s return the parsed ideas to the frontend. This example method leverages Express.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/generate-ideas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;occasion&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;const&lt;/span&gt; &lt;span class="n"&gt;generatedIdeas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;createIdeas&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="n"&gt;occasion&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generatedIdeas&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces a nicely formatted JSON response that the frontend can easily pass into the appropriate UI components.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ideas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Example first idea &lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Challenge: Latency
&lt;/h3&gt;

&lt;p&gt;The hiccup? A waiting time of up to 30 seconds before users can view the AI's suggestions. Watching a loading icon spin for half a minute is not a good user experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tip #1: Leverage Streaming
&lt;/h3&gt;

&lt;p&gt;Enter OpenAI's “streaming” feature - a savior for reducing latency. By setting OpenAI's Node SDK input parameter &lt;code&gt;stream&lt;/code&gt; to &lt;strong&gt;&lt;code&gt;true&lt;/code&gt;&lt;/strong&gt;, we display words to the user as they become available. This doesn't expedite the complete generation process, but it cuts down the wait time for the first word. Think of it as the “typewriter” effect seen in ChatGPT.&lt;/p&gt;

&lt;p&gt;To peek under the hood, the streaming feature uses an HTML5 capability called Server-Sent Events (SSE). SSE allows servers to push real-time data to web clients over a single HTTP connection. Unlike WebSockets, which are bidirectional, SSE is unidirectional, making it perfect for sending data from the server to the client in scenarios where the client doesn't need to send data back.&lt;/p&gt;

&lt;p&gt;So, we refactor the request to OpenAI to include the input parameter &lt;code&gt;stream&lt;/code&gt; and we return a Stream wrapped in an API Promise to our controller method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;export&lt;/span&gt; &lt;span class="n"&gt;const&lt;/span&gt; &lt;span class="n"&gt;createIdeas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="n"&gt;occasion&lt;/span&gt; &lt;span class="p"&gt;}:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;occasion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chatModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sb"&gt;`I want to host an event for the following occasion: ${occasion}. Write me a list of 4 separate ideas for this event`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After understanding the mechanics, our next step was to update the frontend client that communicates with the backend Express endpoint above so it supports SSE.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;export&lt;/span&gt; &lt;span class="n"&gt;const&lt;/span&gt; &lt;span class="n"&gt;requestIdeas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;occasion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;const&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sb"&gt;`${process.env.REACT_APP_API_BASE_URL}/generate-ideas`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="n"&gt;occasion&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipeThrough&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TextDecoderStream&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;getReader&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
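
&lt;p&gt;To actually render the stream, the client pumps the reader until the server closes the connection. Here is a minimal sketch of that loop (the &lt;code&gt;streamIdeas&lt;/code&gt; name and the &lt;code&gt;appendToUi&lt;/code&gt; callback are hypothetical stand-ins for your own state updates):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Pump the reader returned by requestIdeas and hand each text chunk to the UI.
export const streamIdeas = async (
    occasion: string,
    appendToUi: (chunk: string) =&amp;gt; void
) =&amp;gt; {
    const reader = await requestIdeas(occasion);
    if (!reader) return;

    while (true) {
        const { value, done } = await reader.read();
        if (done) break; // the server closed the stream
        appendToUi(value ?? ""); // value is already text thanks to TextDecoderStream
    }
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;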



&lt;h3&gt;
  
  
  The Impact
&lt;/h3&gt;

&lt;p&gt;The results are night and day. From a staggering 30-second wait, users now see the initial AI-generated content within half a second.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tip #2: UI Components of Streaming
&lt;/h3&gt;

&lt;p&gt;When users are given multiple options to choose from, the UI should split those options into separate components - for example, each option as its own radio button or div. But streaming text in real time throws a wrench in the works. How can we parse each AI-generated suggestion into a distinct UI component while it's still streaming?&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution: Add parsing based on a unique character.
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Add a unique bullet identifier:&lt;/strong&gt; Update your prompt to the AI to use an unusual character as a bullet point that is unlikely to appear in the rest of your text. We used the “¶” symbol and updated our prompt to include the following: &lt;code&gt;Start each bullet point with a new line and the ¶ character. Do not include the normal bullet point character, the '-' character, or list numbers.&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Splitting the stream:&lt;/strong&gt; We segmented each byte array from the SSE endpoint into distinct words. This separation was pivotal, given that the content of a single SSE byte array was unpredictable. Sometimes it included a single word, other times it included full phrases that contained the “¶” character, like &lt;code&gt;subject matter. ¶ Engage&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Append each word:&lt;/strong&gt; Once each word is prepared, we append its value to the appropriate UI component. Tracking the “¶” occurrences lets us assign words to the correct component; for instance, a single “¶” means the content belongs to the first-option component. We repeat this process in a loop until the SSE endpoint closes, as the code below shows.&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;export const parseResponseValue = (index: number, value: string, setters: Function[]) =&amp;gt; {
    // Separate into individual words only (no phrases)
    const splitValue = value.split(/(?! {2})/g);

    for (let word of splitValue) {
        if (word.includes('¶')) {
            // A new bullet marker: the following words belong to the next option
            index++;
        } else {
            // Append the word to the component tracked by the current index
            setters[index]((prev: string) =&amp;gt; {
                return prev + word;
            });
        }
    }
    // Return the index to the calling function for use with the next byte array received from the endpoint.
    return index;
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Though string parsing occasionally fails due to edge cases in the AI's output, it enables an overall better user experience by rendering the real-time text stream into styled, structured components. When an anomaly does occur, give users the ability to retry the generation; this generally fixes any parsing issue encountered the first time.&lt;/p&gt;

&lt;p&gt;Importantly, by generating all of the options in one request rather than multiple smaller requests, we economized on tokens sent to GPT-4. This not only curtailed costs but also enriched the result quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Harnessing the power of generative AI in applications is undeniably transformative, but it doesn't come without its challenges. As we've explored, latency can be a significant hurdle, potentially hampering user experience. However, with innovative solutions like real-time streaming and strategic UI component parsing, we can overcome these challenges, making our applications not only more responsive but also user-friendly.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Never Forget To Remember with Husky + Githooks</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 15 Apr 2026 12:41:54 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/never-forget-to-remember-with-husky-githooks-3nd5</link>
      <guid>https://dev.to/focused_dot_io/never-forget-to-remember-with-husky-githooks-3nd5</guid>
      <description>&lt;p&gt;&lt;em&gt;Don’t forget to run the tests before you merge into main! Remember to run ESLint before pushing! Please include the ticket number in your commit message so  {APP_INTEGRATION}  can do its thing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Agreeing to common conventions and processes frees up bandwidth for engineers to focus on tasks that are interesting, meaningful, and add value for our clients. Onboarding new engineers to a project takes time and energy and can be inconsistent depending on how often documentation is updated. Most importantly: executing manual tasks is a computer’s job!&lt;/p&gt;

&lt;p&gt;With Husky + githooks, developers can automate boring day-to-day housekeeping tasks, easily share them with colleagues and choose whether to make them required or “opt-in” for all contributors. Moreover, we can document and enforce our repository’s standards with code rather than an SOP or checklist that quickly goes out of date. Below is one approach you could adopt on your own projects!&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation and Setup
&lt;/h3&gt;

&lt;p&gt;First we install husky as a dev dependency.&lt;/p&gt;
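
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm install husky --save-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;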

&lt;p&gt;Now to finish setting up husky run:&lt;/p&gt;
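
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx husky install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;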

&lt;p&gt;At this point you should see a &lt;code&gt;.husky/&lt;/code&gt; directory in your project. One thing I love about Husky’s latest version is that &lt;em&gt;all&lt;/em&gt; of its magic is contained in one conspicuous, straightforward directory. &lt;em&gt;“What is this tool even doing?! Oh, it’s all there.”&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Does it work?
&lt;/h3&gt;

&lt;p&gt;Let’s test our install and setup by adding a post-commit hook that logs an important affirmation to our fellow contributors. To do that we can use the &lt;code&gt;husky add&lt;/code&gt; command:&lt;/p&gt;
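
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# the affirmation itself is up to you - this one is just an example
npx husky add .husky/post-commit "echo 'You are doing great work!'"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;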

&lt;p&gt;Now we can take comfort in these wise words after we commit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Run Tests
&lt;/h3&gt;

&lt;p&gt;Affirmations are nice and all, but it would be much more practical to do something like run unit tests before each commit so I don’t break the build for my team over lunch. To do that, we’d pull out the &lt;code&gt;husky add&lt;/code&gt; command again:&lt;/p&gt;
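
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# assumes your unit tests run via "npm test"
npx husky add .husky/pre-commit "npm test"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;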

&lt;p&gt;If I attempt to commit code that breaks tests, the commit will abort and log failures to the console.&lt;/p&gt;

&lt;h3&gt;
  
  
  Linting
&lt;/h3&gt;

&lt;p&gt;We can also lint and fix before we commit. Say I have an existing script that does linting for me (the ESLint command here is just an example):&lt;/p&gt;
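
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "scripts": {
    "lint-fix": "eslint . --fix"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;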

&lt;p&gt;then I can run &lt;code&gt;npx husky add .husky/pre-commit "npm run lint-fix"&lt;/code&gt; No more wrangling with built-in IDE/editor settings and extensions!&lt;/p&gt;

&lt;h3&gt;
  
  
  Other applications
&lt;/h3&gt;

&lt;p&gt;Rinse. Wash. Repeat! At this point you’re all set to figure out which hooks and commands make the most sense on your project.&lt;/p&gt;

&lt;h3&gt;
  
  
  FAQ + Gotchas
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why do I get the following error?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;.git can't be found&lt;/code&gt; &lt;em&gt;when trying to run &lt;code&gt;husky install&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Husky assumes that your &lt;code&gt;package.json&lt;/code&gt; and &lt;code&gt;.git&lt;/code&gt; directory are at the same level. This might not be the case in, say, a monorepo with distinct &lt;code&gt;frontend/&lt;/code&gt; and &lt;code&gt;backend/&lt;/code&gt; packages. Husky supports this by means of a &lt;a href="https://typicode.github.io/husky/#/?id=custom-directory" rel="noopener noreferrer"&gt;custom directory&lt;/a&gt;. If I only want to use githooks to manage my frontend code, I can add an npm script like this (the paths assume &lt;code&gt;frontend/&lt;/code&gt; holds the &lt;code&gt;package.json&lt;/code&gt;) to the &lt;code&gt;scripts&lt;/code&gt; object:&lt;/p&gt;
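
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "scripts": {
    "onetime-setup-husky": "cd .. &amp;&amp; husky install frontend/.husky"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;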

&lt;p&gt;Now when we run &lt;code&gt;npm run onetime-setup-husky&lt;/code&gt; we will first navigate to the directory that contains &lt;code&gt;.git&lt;/code&gt; and then install the &lt;code&gt;.husky&lt;/code&gt; directory in the &lt;code&gt;frontend/&lt;/code&gt; directory. This only needs to be run once!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;I doubt I’m going to get folks on my team to set all this up and run all these commands in the right order. Is there a way to make this easier?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes! One approach would be to create a script that does, well, &lt;em&gt;everything&lt;/em&gt; - a sketch, reusing the hook commands from the sections above:&lt;/p&gt;
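
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "scripts": {
    "onetime-setup-all-the-hooks-and-all-the-things": "husky install &amp;&amp; husky add .husky/pre-commit \"npm test\" &amp;&amp; husky add .husky/pre-commit \"npm run lint-fix\" &amp;&amp; husky add .husky/post-commit \"echo 'You are doing great work!'\""
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;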

&lt;p&gt;Now all you’d have to do is ask your colleague to run &lt;code&gt;npm run onetime-setup-all-the-hooks-and-all-the-things&lt;/code&gt; and they would be in business. I’d advise you to rename it to something a little more reasonable, but this gets the point across.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I make all of this mandatory for all contributors? How do I make sure everyone else sets it up properly?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To make setup as painless and automatic as possible, ensure your &lt;code&gt;.husky/&lt;/code&gt; directory is tracked in your repository and then rename the &lt;code&gt;onetime-setup*&lt;/code&gt; script to &lt;code&gt;prepare&lt;/code&gt; in your &lt;code&gt;package.json&lt;/code&gt; scripts, i.e.,&lt;/p&gt;
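
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "scripts": {
    "prepare": "husky install"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;(If you’re using the monorepo setup from above, keep the &lt;code&gt;cd .. &amp;&amp; husky install frontend/.husky&lt;/code&gt; body and just rename the key.)&lt;/p&gt;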

&lt;p&gt;The &lt;code&gt;prepare&lt;/code&gt; script will be executed auto-magically after &lt;code&gt;npm install&lt;/code&gt;. Check out &lt;a href="https://docs.npmjs.com/cli/v9/using-npm/scripts#prepare-and-prepublish" rel="noopener noreferrer"&gt;the npm docs&lt;/a&gt; to learn more about it. And now that the &lt;code&gt;.husky&lt;/code&gt; directory is tracked by git, you and your colleagues will be running the exact same scripts across machines, so there’s no need to inline them and pollute your &lt;code&gt;package.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I want to add husky + git hooks for myself, but my team isn’t totally convinced - can we meet in the middle?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Understandable. Nobody likes to be micro-managed and admittedly some tasks are less important than others. One dev might prefer to use their IDE’s built-in prettier/ESLint tooling. Another dev might elect to run unit tests constantly using Their Favorite Tool, in which case a pre-commit hook doing the same would be frustrating and redundant. That’s okay!&lt;/p&gt;

&lt;p&gt;For one, be sure to add &lt;code&gt;.husky/&lt;/code&gt; to your &lt;code&gt;.gitignore&lt;/code&gt; so it remains untracked and out of your colleagues’ hair.&lt;/p&gt;

&lt;p&gt;To “opt-out” of these hooks, do nothing! If you never run the &lt;code&gt;npm run onetime-setup*&lt;/code&gt; script then you are all set. If you do and later regret it: delete the &lt;code&gt;.husky/&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why didn’t you use husky’s built-in super quick auto setup?&lt;/em&gt; There is indeed an &lt;a href="https://typicode.github.io/husky/#/?id=automatic-recommended" rel="noopener noreferrer"&gt;even quicker way&lt;/a&gt; to set up Husky. For the sake of learning, however, I thought it’d be more illustrative to set up husky the slightly longer way. Additionally, the approach above works in the case of monorepos where &lt;code&gt;.git&lt;/code&gt; and &lt;code&gt;package.json&lt;/code&gt; are located in different directories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Why do you need Husky for this if githooks work out of the box?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You do not need Husky to leverage githooks. But I have found Husky quite useful for all of the reasons above: shareability, opt-in, enforcing code standards with code etc.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>A Tight Schedule: Spring, Kubernetes, and Scheduled Jobs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 15 Apr 2026 12:41:21 +0000</pubDate>
      <link>https://dev.to/focused_dot_io/a-tight-schedule-spring-kubernetes-and-scheduled-jobs-3k30</link>
      <guid>https://dev.to/focused_dot_io/a-tight-schedule-spring-kubernetes-and-scheduled-jobs-3k30</guid>
      <description>&lt;h4&gt;
  
  
  TL;DR
&lt;/h4&gt;

&lt;p&gt;Be wary of threading, locking, and job duration when using Spring Scheduler on an application deployed in Kubernetes to prevent duplicate and delayed job runs. If you're not too far down the Spring Scheduler route, consider an alternate solution like a &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/" rel="noopener noreferrer"&gt;Kubernetes CronJob&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Background
&lt;/h3&gt;

&lt;p&gt;My pair and I were tasked with implementing two daily jobs in our Spring Boot Application to email a set of users. Before jumping into solutions we outlined our problem constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our Spring Boot Application is deployed on Kubernetes&lt;/li&gt;
&lt;li&gt;The scheduled jobs:
&lt;ul&gt;
&lt;li&gt;Need to run at a specific time&lt;/li&gt;
&lt;li&gt;Have variable runtimes, usually on the order of minutes&lt;/li&gt;
&lt;li&gt;Share core application implementation, and splitting out dependencies would introduce risk and complexity&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We narrowed down our solutions to &lt;a href="https://docs.spring.io/spring-framework/docs/3.2.x/spring-framework-reference/html/scheduling.html" rel="noopener noreferrer"&gt;Spring Scheduler&lt;/a&gt; and a &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/" rel="noopener noreferrer"&gt;Kubernetes CronJob&lt;/a&gt;. We decided to start with Spring Scheduler because it was familiar, and after adding a couple of annotations we were up and running. The two jobs were scheduled to run at 11:00 and we added some logging so we could keep an eye on them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="nd"&gt;@Scheduled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cron&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 0 11 * * *&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;public&lt;/span&gt; &lt;span class="n"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;job1&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;System&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running job1.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;sendEmails&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="err"&gt;​&lt;/span&gt;
&lt;span class="nd"&gt;@Scheduled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cron&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 0 11 * * *&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;public&lt;/span&gt; &lt;span class="n"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;job2&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;System&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running job2.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;sendMoreEmails&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We pushed the changes and discussed our testing strategy while we waited for our new jobs to run. Our initial test would be to check the logs for entries related to the scheduled jobs. We needed to verify that each job:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ran at the configured time (11 AM)&lt;/li&gt;
&lt;li&gt;Ran exactly once&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We'll use a table to track our progress:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Job&lt;/th&gt;
&lt;th&gt;Ran at the configured time&lt;/th&gt;
&lt;th&gt;Ran exactly once&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;job1&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;job2&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;When we checked the logs later that day we found that the jobs had run; however, there were two log entries for each job! We also noticed that the second job (“job2”) had started several minutes after 11:00 in both cases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;00.000&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;   &lt;span class="n"&gt;scheduling&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Running&lt;/span&gt; &lt;span class="n"&gt;job1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;00.000&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;   &lt;span class="n"&gt;scheduling&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Running&lt;/span&gt; &lt;span class="n"&gt;job1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;07&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;42.915&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;   &lt;span class="n"&gt;scheduling&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Running&lt;/span&gt; &lt;span class="n"&gt;job2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;03.792&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;   &lt;span class="n"&gt;scheduling&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Running&lt;/span&gt; &lt;span class="n"&gt;job2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our test results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Job&lt;/th&gt;
&lt;th&gt;Ran at the configured time&lt;/th&gt;
&lt;th&gt;Ran exactly once&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;job1&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;job2&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We learned two things from the logs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The duplicate entries were from different application instances (based on the pod name in the log context)&lt;/li&gt;
&lt;li&gt;The jobs on each instance were running on the same thread, "scheduling-1"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Weird. Down the rabbit hole we go!&lt;/p&gt;

&lt;h3&gt;
  
  
  Duplicate Job Runs
&lt;/h3&gt;

&lt;p&gt;Since we only had two application instances deployed in our development environment, we guessed that the scheduled jobs were running on every instance. We verified this by increasing and decreasing the number of pods and checking our logs. A quick search on Stack Overflow told us &lt;a href="https://stackoverflow.com/questions/31288810/spring-scheduled-task-running-in-clustered-environment" rel="noopener noreferrer"&gt;we weren't&lt;/a&gt; &lt;a href="https://stackoverflow.com/questions/49533543/spring-and-scheduled-tasks-on-multiple-instances" rel="noopener noreferrer"&gt;the first ones&lt;/a&gt; to encounter the problem of running a scheduled job in an application deployed in Kubernetes. After looking at some solutions and chatting with colleagues we decided to check out a locking solution called &lt;a href="https://github.com/lukas-krecan/ShedLock" rel="noopener noreferrer"&gt;ShedLock&lt;/a&gt;. Briefly, ShedLock maintains a database table that can be used as a lock amongst application instances. We're going to gloss over ShedLock configuration because many folks have already &lt;a href="https://www.notion.so/f49c78bfef9349629ced849e83f83284" rel="noopener noreferrer"&gt;covered it in depth&lt;/a&gt;. Once configured, the first application instance to run the job will create a lock to prevent any other instance from running the same job. When the job is done or the max lock time is reached, the lock is released.&lt;/p&gt;

&lt;p&gt;Here's what our jobs look like with the ShedLock annotation ("@SchedulerLock"), minimum lock time ("lockAtLeastFor"), and maximum lock time (“lockAtMostFor”):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="nd"&gt;@Scheduled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cron&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 0 11 * * *&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@SchedulerLock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lockAtLeastFor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PT5m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lockAtMostFor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PT10m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;public&lt;/span&gt; &lt;span class="n"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;job1&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;System&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running job1.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;sendEmails&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="err"&gt;​&lt;/span&gt;
&lt;span class="nd"&gt;@Scheduled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cron&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 0 11 * * *&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@SchedulerLock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lockAtLeastFor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PT5m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lockAtMostFor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PT10m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;public&lt;/span&gt; &lt;span class="n"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;job2&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;System&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running job2.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;sendMoreEmails&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
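
&lt;p&gt;For reference, the configuration we glossed over can be as small as a lock provider bean plus one annotation. A minimal sketch using ShedLock's JDBC provider (the class name and the &lt;code&gt;PT10M&lt;/code&gt; default are illustrative; see the ShedLock README for the lock table schema and other providers):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import net.javacrumbs.shedlock.core.LockProvider;
import net.javacrumbs.shedlock.provider.jdbctemplate.JdbcTemplateLockProvider;
import net.javacrumbs.shedlock.spring.annotation.EnableSchedulerLock;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.EnableScheduling;

@Configuration
@EnableScheduling
// Fallback lock duration for any @SchedulerLock that omits lockAtMostFor
@EnableSchedulerLock(defaultLockAtMostFor = "PT10M")
public class SchedulerLockConfiguration {

    // ShedLock coordinates instances through a shared "shedlock" database table
    @Bean
    public LockProvider lockProvider(JdbcTemplate jdbcTemplate) {
        return new JdbcTemplateLockProvider(jdbcTemplate);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;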



&lt;p&gt;&lt;em&gt;Note: We'll come back to the "lockAtLeastFor" and "lockAtMostFor" values in a bit…&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We deployed the changes and checked the logs again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;00.000&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;   &lt;span class="n"&gt;scheduling&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Running&lt;/span&gt; &lt;span class="n"&gt;job1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;00.000&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;   &lt;span class="n"&gt;scheduling&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Running&lt;/span&gt; &lt;span class="n"&gt;job2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;03.206&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;   &lt;span class="n"&gt;scheduling&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Running&lt;/span&gt; &lt;span class="n"&gt;job1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Duplicate logs again! However, this time it was only for one of the jobs - progress!&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Job&lt;/th&gt;
&lt;th&gt;Ran at the configured time&lt;/th&gt;
&lt;th&gt;Ran exactly once&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;job1&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;job2&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Duration of the Lock
&lt;/h3&gt;

&lt;p&gt;My pair and I were stuck so we drew a diagram to plot the series of events:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kcq7nqa4c2c561x31ft.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kcq7nqa4c2c561x31ft.jpeg" alt="Timeline diagram showing two Kubernetes pods with job scheduling, locking, and unlocking between 11:00 and 11:10" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both jobs started at the same time and both locks were enabled. When "job1" finished, its lock was released. Meanwhile, "job2" was still running on the second instance with "job1" queued behind it. Once "job2" finished, "job1" was dequeued and was able to start because the lock had already been released.&lt;/p&gt;

&lt;p&gt;The solution for a lock that is too short is to make it longer than the other job. So the minimum lock time ("lockAtLeastFor") for "job1" should be longer than "job2" would take, and vice versa. Since our jobs only ran once a day we could be liberal with the lock times - we locked each job for a minimum of 15 hours and a maximum of 20 hours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="nd"&gt;@Scheduled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cron&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 0 11 * * *&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@SchedulerLock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lockAtLeastFor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PT15h&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lockAtMostFor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PT20h&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;public&lt;/span&gt; &lt;span class="n"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;job1&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;System&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running job1.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;sendEmails&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="err"&gt;​&lt;/span&gt;
&lt;span class="nd"&gt;@Scheduled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cron&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 0 11 * * *&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@SchedulerLock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lockAtLeastFor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PT15h&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lockAtMostFor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PT20h&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;public&lt;/span&gt; &lt;span class="n"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;job2&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;System&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running job2.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;sendMoreEmails&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now our timeline looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr73d40e1lkk8jxm06sig.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr73d40e1lkk8jxm06sig.jpeg" alt="Timeline diagram showing two pods with concurrent job locking where both pods remain locked through max lock time" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Locking for 15 hours is overkill for our eight-minute job, but my pair and I agreed that, in general, lock times should be maximized relative to the job's frequency to reduce the chance of it running multiple times. Our take: if your job runs every minute, lock it for 59 seconds; if it runs every hour, lock it for 59 minutes; and so on.&lt;/p&gt;

&lt;p&gt;We deployed and checked the logs again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;00.000&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;   &lt;span class="n"&gt;scheduling&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Running&lt;/span&gt; &lt;span class="n"&gt;job1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;00.000&lt;/span&gt;  &lt;span class="p"&gt;[&lt;/span&gt;   &lt;span class="n"&gt;scheduling&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pod&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Running&lt;/span&gt; &lt;span class="n"&gt;job2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Job&lt;/th&gt;
&lt;th&gt;Ran at the configured time&lt;/th&gt;
&lt;th&gt;Ran exactly once&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;job1&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;job2&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Success!&lt;/p&gt;

&lt;h3&gt;
  
  
  Scheduled Jobs on the Same Thread
&lt;/h3&gt;

&lt;p&gt;We had solved the duplicate runs with locking, but my pair and I were still hung up on the jobs running on the same thread. Our solution wasn't foolproof because adding another job would still cause a job to be queued behind another. This means that a third job would run after another job rather than at its scheduled time:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9nepfn297ulhsipsiq1.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9nepfn297ulhsipsiq1.jpeg" alt="Timeline showing three queued jobs across two pods with job3 delayed because it is queued behind job1" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More generally, the queueing issue occurs whenever there are more scheduled jobs than application instances. After reading some thrilling Spring documentation we learned that &lt;a href="https://docs.spring.io/spring-framework/docs/3.2.x/spring-framework-reference/html/scheduling.html" rel="noopener noreferrer"&gt;Spring Scheduler is configured to use a single thread by default&lt;/a&gt;. The fix was to increase the threads allocated to the scheduler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="nd"&gt;@Configuration&lt;/span&gt;
&lt;span class="n"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SchedulingConfigurerConfiguration&lt;/span&gt; &lt;span class="n"&gt;implements&lt;/span&gt; &lt;span class="n"&gt;SchedulingConfigurer&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt;​&lt;/span&gt;
        &lt;span class="nd"&gt;@Override&lt;/span&gt;
        &lt;span class="n"&gt;public&lt;/span&gt; &lt;span class="n"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;configureTasks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ScheduledTaskRegistrar&lt;/span&gt; &lt;span class="n"&gt;taskRegistrar&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;ThreadPoolTaskScheduler&lt;/span&gt; &lt;span class="n"&gt;taskScheduler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ThreadPoolTaskScheduler&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
                &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;Set&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
                &lt;span class="n"&gt;taskScheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setPoolSize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="n"&gt;taskScheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
                &lt;span class="n"&gt;taskRegistrar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setTaskScheduler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;taskScheduler&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that the scheduler has more threads, we can avoid queuing jobs and each will start at the configured time. Once a job has started, the lock is put in place preventing any other application instance from starting the same job:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ur1udro7pkovkfceq7j.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ur1udro7pkovkfceq7j.jpeg" alt="Diagram of two pods with separate scheduling threads using locks to prevent duplicate job execution" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My pair and I agreed that the safest route was to configure locking &lt;em&gt;and&lt;/em&gt; increase the number of threads. By using both solutions, we minimize the chance of any job being run more than once or not being run at the correct time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;My pair and I learned a lot about the pitfalls of Spring Scheduler and how running jobs in Kubernetes introduces new complexity. Between threading, locking, job duration, and application instances, there are a lot of potential snags that you may hit along the way. Hopefully our discoveries help at least a couple folks navigate implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/a/49533618/20514850" rel="noopener noreferrer"&gt;https://stackoverflow.com/a/49533618/20514850&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://levelup.gitconnected.com/solving-multiple-executions-of-scheduled-tasks-across-multiple-nodes-even-with-shedlock-spring-2b1d26db9356" rel="noopener noreferrer"&gt;https://levelup.gitconnected.com/solving-multiple-executions-of-scheduled-tasks-across-multiple-nodes-even-with-shedlock-spring-2b1d26db9356&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.baeldung.com/shedlock-spring" rel="noopener noreferrer"&gt;https://www.baeldung.com/shedlock-spring&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dhananjay4058.medium.com/lock-scheduled-tasks-with-shedlock-and-spring-boot-f67200dad675" rel="noopener noreferrer"&gt;https://dhananjay4058.medium.com/lock-scheduled-tasks-with-shedlock-and-spring-boot-f67200dad675&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>java</category>
      <category>programming</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
