Harish Kotra (he/him)

Building AI Behavior Lab: A Developer-First Debugger for Memory, Context, and Tooling

TL;DR

Most LLM demos look great until one of three things changes in production:

  • memory state,
  • context quality,
  • tool orchestration.

AI Behavior Lab makes those hidden layers visible by running one prompt through multiple capability configurations and exposing exactly what changed.

The Problem We Wanted to Solve

Teams often ask:

  • Why did this work yesterday but not today?
  • Why is my assistant “smart” in one flow and generic in another?
  • Why did adding tools increase latency but not quality?

The root issue is usually invisible execution state. So this app turns hidden runtime inputs into first-class UI artifacts.

Product Model

Each run is a controlled experiment over the same user prompt with different runtime capabilities.

  • Base: no memory, no context, no tools
  • Memory ON: adds conversation history
  • Context ON: adds retriever results
  • Harness ON: enables live tool interaction

For every run, the app returns the following (sketched as a rough type after the list):

  • output text,
  • latency,
  • token usage,
  • estimated cost,
  • memory/context/tool details,
  • final composed prompt,
  • timeline events.
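
As a rough sketch, the scenario flags and per-run result could be typed like this. The field names below are assumptions for illustration, not the project's actual types:

// Sketch only: field names are assumptions, not the app's real types.
type ScenarioFlags = {
  memory: boolean;   // conversation history on/off
  context: boolean;  // retriever results on/off
  harness: boolean;  // live tool interaction on/off
};

type RunResult = {
  output: string;                                  // output text
  latencyMs: number;                               // latency
  tokens: { prompt: number; completion: number };  // token usage
  estimatedCost: number;                           // estimated cost
  finalPrompt: string;                             // final composed prompt
  timeline: { stage: "memory" | "context" | "tool" | "llm"; durationMs: number }[];
};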

Architecture

(Architecture diagram)

Stack and Why

  • Next.js App Router: unified full-stack DX and simple API routes
  • Zustand: lightweight local state control for prompt/playground UI
  • LangChain: memory + retrieval + tool abstractions
  • Flue SDK: agent-oriented runtime structure in .flue/agents
  • Recharts: quick observability panels

Core Runtime Design

runBehaviorScenario() is the execution nucleus. It does four things (sketched after the list):

  1. Load memory when enabled
  2. Retrieve context when enabled
  3. Run tool path when enabled
  4. Compose final prompt and invoke selected provider model
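
A minimal sketch of that flow, reusing the ScenarioFlags shape above. Every helper here (loadMemory, retrieveContext, runToolPath, composePrompt, invokeModel) is a hypothetical stand-in for the real module internals:

// Sketch only: all helper functions below are hypothetical stand-ins.
async function runBehaviorScenario(
  payload: { input: string; sessionId: string; provider: string } & ScenarioFlags,
  key: string
) {
  // 1. Load memory when enabled
  const history = payload.memory ? await loadMemory(payload.sessionId, payload.input) : null;

  // 2. Retrieve context when enabled
  const context = payload.context ? await retrieveContext(payload.input) : null;

  // 3. Run the tool path when enabled
  const toolRun = payload.harness ? await runToolPath(payload.input) : null;

  // 4. Compose the final prompt and invoke the selected provider model
  const finalPrompt = composePrompt({ input: payload.input, history, context, toolRun });
  const { output, usage, latencyMs } = await invokeModel(payload.provider, finalPrompt);

  return { key, output, finalPrompt, usage, latencyMs };
}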

Compare mode fan-out

const runs = await Promise.all(
  scenarios.map((s) =>
    runBehaviorScenario({ ...payload, ...s.flags }, s.key)
  )
);

This gives deterministic same-input multi-path comparisons in one UI step.

Memory Engineering

We store BufferMemory per sessionId:

const memory = getSessionMemory(sessionId);                 // one BufferMemory per session
const loaded = await memory.loadMemoryVariables({ input }); // pull prior turns into this run
await memory.saveContext({ input }, { output });            // persist the new exchange

Why this matters: follow-up prompts like “make it vegetarian” now become measurable behavior changes instead of intuition.
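
getSessionMemory isn't shown above; a minimal implementation might keep one BufferMemory per session in an in-process Map (the cache strategy here is an assumption):

import { BufferMemory } from "langchain/memory";

// Assumption: one BufferMemory instance cached per sessionId in-process.
const sessions = new Map<string, BufferMemory>();

function getSessionMemory(sessionId: string): BufferMemory {
  if (!sessions.has(sessionId)) {
    sessions.set(sessionId, new BufferMemory({ returnMessages: true }));
  }
  return sessions.get(sessionId)!;
}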

Context Engineering

Context is retriever-backed rather than hardcoded string concatenation:

const store = await MemoryVectorStore.fromDocuments(docs, new FakeEmbeddings());
const retriever = store.asRetriever(2);
const results = await retriever.invoke(input);

This keeps context insertion logic aligned with production retrieval patterns.
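
The post doesn't show how the retrieved documents are folded into the prompt; one plausible composition step (the header text and formatting are assumptions) is:

// Sketch: join retrieved documents into a labelled context block
// that gets prepended to the composed prompt.
function composeContextBlock(results: { pageContent: string }[]): string {
  if (results.length === 0) return "";
  const snippets = results.map((doc, i) => `[${i + 1}] ${doc.pageContent}`).join("\n");
  return `Relevant context:\n${snippets}\n`;
}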

Harness Engineering (Tools)

Harness mode binds Tavily as a callable tool and supports model-directed tool calls:

const tavily = new TavilySearchResults({ maxResults: 3, apiKey: process.env.TAVILY_API_KEY });
const withTools = model.bindTools([tavily]);
const first = await withTools.invoke(messages);

When a tool call occurs, we execute it, feed the results back as a ToolMessage, and request the final response.
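
A sketch of that round trip using LangChain JS message types, reusing the first, tavily, withTools, and messages variables from the snippet above (the exact handling in the project may differ):

import { ToolMessage } from "@langchain/core/messages";

// Sketch: execute each requested tool call, feed the results back as
// ToolMessages, then ask the model for its final answer.
const toolCalls = first.tool_calls ?? [];
if (toolCalls.length > 0) {
  const toolMessages = await Promise.all(
    toolCalls.map(async (call) => {
      const result = await tavily.invoke(call.args); // run the Tavily search
      return new ToolMessage({ content: result, tool_call_id: call.id! });
    })
  );
  const final = await withTools.invoke([...messages, first, ...toolMessages]); // final response
}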

Observability by Default

Each run ships with diagnostics used directly in the UI:

  • timeline[] stage events (memory, context, tool, llm)
  • prompt/completion token breakdown
  • estimated cost by provider rates

This enables post-run reasoning like:

  • “Harness added 700ms but improved specificity.”
  • “Context increased token load with minimal output delta.”
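
The estimated cost figure is just token usage multiplied by per-provider rates; a minimal sketch of that calculation (the rates table and numbers below are placeholders, not real pricing):

// Sketch: placeholder per-million-token rates, not real provider pricing.
const RATES: Record<string, { promptPerM: number; completionPerM: number }> = {
  "example-provider": { promptPerM: 0.5, completionPerM: 1.5 },
};

function estimateCost(provider: string, promptTokens: number, completionTokens: number): number {
  const rate = RATES[provider];
  if (!rate) return 0;
  return (promptTokens * rate.promptPerM + completionTokens * rate.completionPerM) / 1_000_000;
}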

Health Checks Before Execution

/api/health validates config readiness for providers and Tavily so users don’t debug phantom behavior caused by missing keys.
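
A minimal App Router sketch of such a route (the env var names and response shape are assumptions; the real route likely checks more providers):

// app/api/health/route.ts - sketch only
import { NextResponse } from "next/server";

export async function GET() {
  const checks = {
    provider: Boolean(process.env.OPENAI_API_KEY), // assumed provider key name
    tavily: Boolean(process.env.TAVILY_API_KEY),
  };
  const ok = Object.values(checks).every(Boolean);
  return NextResponse.json({ ok, checks }, { status: ok ? 200 : 503 });
}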

UI Strategy

The UI is intentionally diagnostic:

  • side-by-side cards for parallel comparison,
  • a dedicated “what changed” tab,
  • a literal final prompt tab,
  • telemetry panel for latency/cost/timeline,
  • single-run playground for controlled toggles.

Lessons Learned

  1. “Prompt quality” is often runtime quality.
  2. Memory/context/tools should be observable artifacts, not hidden abstractions.
  3. Side-by-side comparison beats prose explanation for developer learning.

Where to Take It Next

  • Batch regression suite over prompt sets
  • Prompt snapshots and semantic diff
  • Golden outputs and drift alarms
  • Multi-provider benchmark lanes over repeated runs
  • Exportable run reports for team reviews

AI Behavior Lab is less a chatbot and more an instrumentation surface. The point is not just generating text; it is making behavior debuggable.

How it works

GitHub repo and more: https://www.dailybuild.xyz/project/124-ai-behavior-lab
