Harish Kotra (he/him)

Building AI Behavior Lab: A Developer-First Debugger for Memory, Context, and Tooling

TL;DR

Most LLM demos look great until one of three things changes in production:

  • memory state,
  • context quality,
  • tool orchestration.

AI Behavior Lab makes those hidden layers visible by running one prompt through multiple capability configurations and exposing exactly what changed.

The Problem We Wanted to Solve

Teams often ask:

  • Why did this work yesterday but not today?
  • Why is my assistant “smart” in one flow and generic in another?
  • Why did adding tools increase latency but not quality?

The root issue is usually invisible execution state. So this app turns hidden runtime inputs into first-class UI artifacts.

Product Model

Each run is a controlled experiment over the same user prompt with different runtime capabilities.

  • Base: no memory, no context, no tools
  • Memory ON: adds conversation history
  • Context ON: adds retriever results
  • Harness ON: enables live tool interaction

For every run, the app returns the following (sketched as a rough type after the list):

  • output text,
  • latency,
  • token usage,
  • estimated cost,
  • memory/context/tool details,
  • final composed prompt,
  • timeline events.
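
As a rough sketch, the scenario flags and per-run result could be typed like this. The field names below are assumptions for illustration, not the project's actual types:

// Sketch only: field names are assumptions, not the app's real types.
type ScenarioFlags = {
  memory: boolean;   // conversation history on/off
  context: boolean;  // retriever results on/off
  harness: boolean;  // live tool interaction on/off
};

type RunResult = {
  output: string;                                  // output text
  latencyMs: number;                               // latency
  tokens: { prompt: number; completion: number };  // token usage
  estimatedCost: number;                           // estimated cost
  finalPrompt: string;                             // final composed prompt
  timeline: { stage: "memory" | "context" | "tool" | "llm"; durationMs: number }[];
};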

Architecture

(Architecture diagram)

Stack and Why

  • Next.js App Router: unified full-stack DX and simple API routes
  • Zustand: lightweight local state control for prompt/playground UI
  • LangChain: memory + retrieval + tool abstractions
  • Flue SDK: agent-oriented runtime structure in .flue/agents
  • Recharts: quick observability panels

Core Runtime Design

runBehaviorScenario() is the execution nucleus. It does four things (sketched after the list):

  1. Load memory when enabled
  2. Retrieve context when enabled
  3. Run tool path when enabled
  4. Compose final prompt and invoke selected provider model
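
A minimal sketch of that flow, reusing the ScenarioFlags shape above. Every helper here (loadMemory, retrieveContext, runToolPath, composePrompt, invokeModel) is a hypothetical stand-in for the real module internals:

// Sketch only: all helper functions below are hypothetical stand-ins.
async function runBehaviorScenario(
  payload: { input: string; sessionId: string; provider: string } & ScenarioFlags,
  key: string
) {
  // 1. Load memory when enabled
  const history = payload.memory ? await loadMemory(payload.sessionId, payload.input) : null;

  // 2. Retrieve context when enabled
  const context = payload.context ? await retrieveContext(payload.input) : null;

  // 3. Run the tool path when enabled
  const toolRun = payload.harness ? await runToolPath(payload.input) : null;

  // 4. Compose the final prompt and invoke the selected provider model
  const finalPrompt = composePrompt({ input: payload.input, history, context, toolRun });
  const { output, usage, latencyMs } = await invokeModel(payload.provider, finalPrompt);

  return { key, output, finalPrompt, usage, latencyMs };
}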

Compare mode fan-out

const runs = await Promise.all(
  scenarios.map((s) =>
    runBehaviorScenario({ ...payload, ...s.flags }, s.key)
  )
);

This gives deterministic same-input multi-path comparisons in one UI step.

Memory Engineering

We store BufferMemory per sessionId:

const memory = getSessionMemory(sessionId);                 // one BufferMemory per session
const loaded = await memory.loadMemoryVariables({ input }); // pull prior turns into this run
await memory.saveContext({ input }, { output });            // persist the new exchange

Why this matters: follow-up prompts like “make it vegetarian” now become measurable behavior changes instead of intuition.
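
getSessionMemory isn't shown above; a minimal implementation might keep one BufferMemory per session in an in-process Map (the cache strategy here is an assumption):

import { BufferMemory } from "langchain/memory";

// Assumption: one BufferMemory instance cached per sessionId in-process.
const sessions = new Map<string, BufferMemory>();

function getSessionMemory(sessionId: string): BufferMemory {
  if (!sessions.has(sessionId)) {
    sessions.set(sessionId, new BufferMemory({ returnMessages: true }));
  }
  return sessions.get(sessionId)!;
}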

Context Engineering

Context is retriever-backed rather than hardcoded string concatenation:

const store = await MemoryVectorStore.fromDocuments(docs, new FakeEmbeddings());
const retriever = store.asRetriever(2);
const results = await retriever.invoke(input);

This keeps context insertion logic aligned with production retrieval patterns.
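
The post doesn't show how the retrieved documents are folded into the prompt; one plausible composition step (the header text and formatting are assumptions) is:

// Sketch: join retrieved documents into a labelled context block
// that gets prepended to the composed prompt.
function composeContextBlock(results: { pageContent: string }[]): string {
  if (results.length === 0) return "";
  const snippets = results.map((doc, i) => `[${i + 1}] ${doc.pageContent}`).join("\n");
  return `Relevant context:\n${snippets}\n`;
}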

Harness Engineering (Tools)

Harness mode binds Tavily as a callable tool and supports model-directed tool calls:

const tavily = new TavilySearchResults({ maxResults: 3, apiKey: process.env.TAVILY_API_KEY });
const withTools = model.bindTools([tavily]);
const first = await withTools.invoke(messages);

When a tool call occurs, we execute it, feed the results back as a ToolMessage, and request the final response.
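
A sketch of that round trip using LangChain JS message types, reusing the first, tavily, withTools, and messages variables from the snippet above (the exact handling in the project may differ):

import { ToolMessage } from "@langchain/core/messages";

// Sketch: execute each requested tool call, feed the results back as
// ToolMessages, then ask the model for its final answer.
const toolCalls = first.tool_calls ?? [];
if (toolCalls.length > 0) {
  const toolMessages = await Promise.all(
    toolCalls.map(async (call) => {
      const result = await tavily.invoke(call.args); // run the Tavily search
      return new ToolMessage({ content: result, tool_call_id: call.id! });
    })
  );
  const final = await withTools.invoke([...messages, first, ...toolMessages]); // final response
}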

Observability by Default

Each run ships with diagnostics used directly in the UI:

  • timeline[] stage events (memory, context, tool, llm)
  • prompt/completion token breakdown
  • estimated cost by provider rates

This enables post-run reasoning like:

  • “Harness added 700ms but improved specificity.”
  • “Context increased token load with minimal output delta.”
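
The estimated cost figure is just token usage multiplied by per-provider rates; a minimal sketch of that calculation (the rates table and numbers below are placeholders, not real pricing):

// Sketch: placeholder per-million-token rates, not real provider pricing.
const RATES: Record<string, { promptPerM: number; completionPerM: number }> = {
  "example-provider": { promptPerM: 0.5, completionPerM: 1.5 },
};

function estimateCost(provider: string, promptTokens: number, completionTokens: number): number {
  const rate = RATES[provider];
  if (!rate) return 0;
  return (promptTokens * rate.promptPerM + completionTokens * rate.completionPerM) / 1_000_000;
}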

Health Checks Before Execution

/api/health validates config readiness for providers and Tavily so users don’t debug phantom behavior caused by missing keys.
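
A minimal App Router sketch of such a route (the env var names and response shape are assumptions; the real route likely checks more providers):

// app/api/health/route.ts - sketch only
import { NextResponse } from "next/server";

export async function GET() {
  const checks = {
    provider: Boolean(process.env.OPENAI_API_KEY), // assumed provider key name
    tavily: Boolean(process.env.TAVILY_API_KEY),
  };
  const ok = Object.values(checks).every(Boolean);
  return NextResponse.json({ ok, checks }, { status: ok ? 200 : 503 });
}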

UI Strategy

The UI is intentionally diagnostic:

  • side-by-side cards for parallel comparison,
  • a dedicated “what changed” tab,
  • a literal final prompt tab,
  • telemetry panel for latency/cost/timeline,
  • single-run playground for controlled toggles.

Lessons Learned

  1. “Prompt quality” is often runtime quality.
  2. Memory/context/tools should be observable artifacts, not hidden abstractions.
  3. Side-by-side comparison beats prose explanation for developer learning.

Where to Take It Next

  • Batch regression suite over prompt sets
  • Prompt snapshots and semantic diff
  • Golden outputs and drift alarms
  • Multi-provider benchmark lanes over repeated runs
  • Exportable run reports for team reviews

AI Behavior Lab is less a chatbot and more an instrumentation surface. The point is not just generating text; it is making behavior debuggable.

How it works

GitHub repo and more: https://www.dailybuild.xyz/project/124-ai-behavior-lab
