DEV Community: Raju Dandigam

How to Test AI Agents Without Calling More LLMs

Raju Dandigam — Thu, 30 Jul 2026 16:35:04 +0000

AI-agent testing often starts with an expensive loop: call the agent, send its answer to another model, ask for a quality score, and hope the score is stable enough for CI.

LLM judges can be useful for genuinely semantic questions. They are a poor default for verifying tool order, retry limits, validation, budgets, state transitions, or error handling. Those behaviors have explicit contracts and can usually be tested with ordinary deterministic techniques.

The most effective agent test suite separates orchestration correctness from language quality. Test the first on every commit with fakes, traces, and rules. Evaluate the second with a smaller, calibrated model-and-human workflow.

What Can Be Deterministic?

An agent may produce variable language while still following a predictable execution contract. Useful deterministic checks include:

Input validation runs before any model or tool call.
Only authorized tools are available for the current state.
Retrieval completes before generation uses its results.
Tool arguments match a schema.
Retries stop after the configured limit.
A timeout or cancellation ends the run.
A fallback is used only for approved error categories.
Token and tool-call budgets are enforced.
Every started span is completed exactly once.
Sensitive payloads are absent from the trace.

None of these questions requires a second model. Most do not require a first model either.
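As one illustration, a tool-argument check can be an ordinary pure function. The sketch below is a minimal example under stated assumptions: `ArgSpec`, `validateArgs`, and the `lookup_invoice` fields are hypothetical names invented here, and a real project would more likely use its existing schema library.

```typescript
// Hypothetical minimal schema: every listed field is required and primitive.
type ArgSpec = Record<string, 'string' | 'number' | 'boolean'>;

function hasOwn(target: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(target, key);
}

// Returns a list of problems; an empty list means the arguments are valid.
function validateArgs(spec: ArgSpec, args: unknown): string[] {
  if (typeof args !== 'object' || args === null || Array.isArray(args)) {
    return ['arguments must be an object'];
  }

  const record = args as Record<string, unknown>;
  const problems: string[] = [];

  for (const [field, expected] of Object.entries(spec)) {
    if (!hasOwn(record, field)) {
      problems.push(`missing field: ${field}`);
    } else if (typeof record[field] !== expected) {
      problems.push(`wrong type for ${field}: expected ${expected}`);
    }
  }

  for (const field of Object.keys(record)) {
    if (!hasOwn(spec, field)) problems.push(`unexpected field: ${field}`);
  }

  return problems;
}

// Example spec for an assumed lookup_invoice tool.
const lookupInvoiceSpec: ArgSpec = { customerId: 'string', limit: 'number' };
```

Because the function returns problems instead of throwing, a test can assert on the exact rejection reason without any model in the loop.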

Use a Testing Pyramid for Agents

Layer	Model access	Purpose
Pure unit tests	None	Routing, validation, budgeting, state reducers
Orchestration tests	Scripted fake	Tool order, retries, fallbacks, termination
Tool contract tests	None or sandbox	Schemas, timeouts, error mapping
Trace replay tests	None	Regression checks against recorded execution metadata
Model integration tests	Real model	Provider compatibility and a small set of end-to-end paths
Semantic evaluations	Real model and/or human	Helpfulness, groundedness, tone, nuanced correctness

The lower layers should contain most of the suite. They are fast, reproducible, and actionable. The upper layers are valuable, but they should be deliberately smaller.

Put Models and Tools Behind Interfaces

Dependency injection makes the orchestration testable without network calls.

type ModelReply =
  | { type: 'tool_call'; tool: string; args: unknown }
  | { type: 'final'; text: string; usage: { input: number; output: number } };

interface ModelClient {
  generate(input: {
    messages: Array<{ role: string; content: string }>;
    tools: string[];
  }): Promise<ModelReply>;
}

interface ToolClient {
  execute(name: string, args: unknown): Promise<unknown>;
}

interface Clock {
  now(): number;
}

Production code receives real implementations. Tests provide scripted implementations with known behavior.

Script Model Behavior Instead of Generating It

A scripted model returns a predefined sequence and fails if the agent makes an unexpected extra call.

class ScriptedModel implements ModelClient {
  private index = 0;

  constructor(private readonly replies: ModelReply[]) {}

  async generate(): Promise<ModelReply> {
    const reply = this.replies[this.index];
    if (!reply) {
      throw new Error(`Unexpected model call at index ${this.index}`);
    }

    this.index += 1;
    return structuredClone(reply);
  }

  assertConsumed(): void {
    if (this.index !== this.replies.length) {
      throw new Error(
        `Expected ${this.replies.length} model calls, received ${this.index}`,
      );
    }
  }
}

This fake does not imitate model intelligence. It controls the branch that the orchestration must handle. One script can request a tool and then return a final answer; another can repeatedly request an invalid tool call to verify termination.

Record a Test-Friendly Trace

The trace should expose behavior without coupling tests to prose output.

type TraceStep = {
  // Identifier of this step; parentId on a child step refers to this value.
  id: string;
  sequence: number;
  name: string;
  kind: 'run' | 'model' | 'tool' | 'policy' | 'fallback';
  status: 'ok' | 'error' | 'blocked';
  parentId: string | null;
  attempt?: number;
  metadata?: Record<string, string | number | boolean | null>;
};

type AgentTrace = {
  status: 'ok' | 'error' | 'blocked';
  steps: TraceStep[];
};

Use a monotonic sequence assigned by the recorder for causal assertions. Do not infer retry behavior from consecutive names: parallel work can interleave events, and unrelated steps can separate attempts.

Keep the trace schema stable and metadata-first. Tests should assert execution contracts, not depend on raw prompts or complete tool results.
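A recorder that assigns the sequence might look like the following sketch. `TraceRecorder` and the `step-N` id format are assumptions made for illustration; the `id` field is also an assumption, included so that `parentId` has something to reference.

```typescript
type TraceStep = {
  id: string; // assumption: gives parentId something to reference
  sequence: number;
  name: string;
  kind: 'run' | 'model' | 'tool' | 'policy' | 'fallback';
  status: 'ok' | 'error' | 'blocked';
  parentId: string | null;
  attempt?: number;
};

type StepInput = Omit<TraceStep, 'id' | 'sequence'>;

// Hypothetical recorder: ids and sequence numbers come from one counter,
// never from a clock, so ordering is stable across machines and runs.
class TraceRecorder {
  private counter = 0;
  private readonly steps: TraceStep[] = [];

  record(input: StepInput): TraceStep {
    this.counter += 1;
    const step: TraceStep = {
      ...input,
      id: `step-${this.counter}`,
      sequence: this.counter,
    };
    this.steps.push(step);
    return { ...step };
  }

  // Returns copies so callers cannot mutate recorded steps.
  snapshot(): TraceStep[] {
    return this.steps.map((step) => ({ ...step }));
  }
}
```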

Build Reusable Trace Assertions

Small domain-specific helpers make failures easier to understand than generic array comparisons.

function stepIndex(trace: AgentTrace, name: string): number {
  return trace.steps.findIndex((step) => step.name === name);
}

function expectStepBefore(
  trace: AgentTrace,
  first: string,
  second: string,
): void {
  const firstIndex = stepIndex(trace, first);
  const secondIndex = stepIndex(trace, second);

  if (firstIndex < 0) throw new Error(`Missing trace step: ${first}`);
  if (secondIndex < 0) throw new Error(`Missing trace step: ${second}`);
  if (firstIndex >= secondIndex) {
    throw new Error(`Expected ${first} before ${second}`);
  }
}

function countSteps(trace: AgentTrace, name: string): number {
  return trace.steps.filter((step) => step.name === name).length;
}

When operations run in parallel, assert parentage and required dependencies rather than total ordering. Two sibling tools may complete in either order and both be correct.
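A parentage helper for that case could be sketched as follows. It assumes each step carries its own `id` alongside `parentId`; `IdentifiedStep` and `expectChildrenOf` are hypothetical names, not part of any test framework.

```typescript
// Assumption: each step has its own id in addition to parentId.
type IdentifiedStep = { id: string; name: string; parentId: string | null };

// Throws unless every named child exists and is parented by the named step.
// The relative order of the siblings is deliberately not checked.
function expectChildrenOf(
  steps: IdentifiedStep[],
  parentName: string,
  childNames: string[],
): void {
  const parent = steps.find((step) => step.name === parentName);
  if (!parent) throw new Error(`Missing parent step: ${parentName}`);

  for (const childName of childNames) {
    const child = steps.find((step) => step.name === childName);
    if (!child) throw new Error(`Missing child step: ${childName}`);
    if (child.parentId !== parent.id) {
      throw new Error(`Expected ${childName} to be a child of ${parentName}`);
    }
  }
}
```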

Test Retrieval Before Generation

test('retrieves policy before generating an answer', async () => {
  const model = new ScriptedModel([
    {
      type: 'final',
      text: 'Use the documented return process.',
      usage: { input: 420, output: 18 },
    },
  ]);

  const { trace } = await runSupportAgent(
    { model, tools: fakeTools, clock: fakeClock },
    'How do returns work?',
  );

  expectStepBefore(trace, 'retrieve_policy', 'generate_answer');
  expect(countSteps(trace, 'retrieve_policy')).toBe(1);
  model.assertConsumed();
});

The test says exactly what failed: a missing step, an invalid dependency order, or an unexpected model call. No quality score is needed.

Test Retry Limits by Attempt

test('stops a tool after two failed attempts', async () => {
  const tools = new FakeTools({
    lookup_invoice: [
      new TimeoutError(),
      new TimeoutError(),
      { invoiceId: 'should-never-be-returned' },
    ],
  });

  const { trace } = await runInvoiceAgent(
    { model: scriptedToolCaller, tools, clock: fakeClock },
    'Find the latest invoice',
  );

  const attempts = trace.steps.filter(
    (step) => step.name === 'lookup_invoice',
  );

  expect(attempts.map((step) => step.attempt)).toEqual([1, 2]);
  expect(tools.calls('lookup_invoice')).toHaveLength(2);
  expect(trace.status).toBe('error');
});

This catches an unbounded retry loop without waiting for a real service or model.
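The `FakeTools` helper used in these tests is not shown in the article. One possible shape, offered as an assumption rather than the author's implementation, scripts a queue of outcomes per tool name and records every call:

```typescript
interface ToolClient {
  execute(name: string, args: unknown): Promise<unknown>;
}

// Hypothetical fake: each tool name maps to a queue of scripted outcomes.
// An Error instance is thrown; any other value is returned.
class FakeTools implements ToolClient {
  private readonly recorded = new Map<string, unknown[]>();
  private readonly remaining = new Map<string, unknown[]>();

  constructor(script: Record<string, unknown[]>) {
    for (const [name, outcomes] of Object.entries(script)) {
      this.remaining.set(name, [...outcomes]);
    }
  }

  async execute(name: string, args: unknown): Promise<unknown> {
    const queue = this.remaining.get(name);
    if (!queue || queue.length === 0) {
      throw new Error(`Unexpected tool call: ${name}`);
    }

    // Record the call before resolving so failed attempts are counted too.
    this.recorded.set(name, [...this.calls(name), args]);

    const outcome = queue.shift();
    if (outcome instanceof Error) throw outcome;
    return outcome;
  }

  // Arguments of every scripted call made to the named tool, in order.
  calls(name: string): unknown[] {
    return this.recorded.get(name) ?? [];
  }
}
```

Like the scripted model, this fake fails loudly on a call the script did not anticipate, which turns an unbounded retry loop into an immediate test failure.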

Test Guardrails as Control Flow

A blocked request should produce no downstream model or tool activity.

test('blocks unauthorized export before external calls', async () => {
  const { trace } = await runExportAgent(
    { model: modelThatMustNotRun, tools: toolsThatMustNotRun, clock: fakeClock },
    { action: 'export_all_accounts', authorized: false },
  );

  expect(trace.status).toBe('blocked');
  expect(stepIndex(trace, 'authorize_export')).toBeGreaterThanOrEqual(0);
  expect(trace.steps.some((step) => step.kind === 'model')).toBe(false);
  expect(trace.steps.some((step) => step.kind === 'tool')).toBe(false);
});

The test uses structured input instead of asking a model to recognize a particular malicious phrase. A separate security suite can test prompt-injection resistance with representative text fixtures.

Make Time Deterministic

Wall-clock assertions are often flaky under CI load. Inject a fake clock or scheduler and advance it deliberately.

class FakeClock implements Clock {
  private value = 0;

  now(): number {
    return this.value;
  }

  advance(ms: number): void {
    this.value += ms;
  }
}

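test('uses fallback after the configured timeout', async () => {
  const clock = new FakeClock();
  // The fake tool rejects immediately with TimeoutError, so the test never
  // waits on a real timer. Any elapsed time the agent observes comes only
  // from the injected clock.
  const tools = new FakeTools({ primary_search: [new TimeoutError()] });

  const { trace } = await runSearchAgent(
    { model: scriptedToolCaller, tools, clock },
    'pricing',
  );

  expect(stepIndex(trace, 'fallback_search')).toBeGreaterThanOrEqual(0);
});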

Keep a small number of real timing tests for integration boundaries. Do not make every unit test depend on an overloaded runner finishing within an arbitrary number of milliseconds.
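To show the injected clock actually driving a timeout decision, here is a small sketch of a deadline that reads time only through the `Clock` interface. `Deadline` is a hypothetical helper introduced for this example.

```typescript
interface Clock {
  now(): number;
}

class FakeClock implements Clock {
  private value = 0;

  now(): number {
    return this.value;
  }

  advance(ms: number): void {
    this.value += ms;
  }
}

// Hypothetical helper: a deadline that never touches the wall clock.
class Deadline {
  private readonly clock: Clock;
  private readonly expiresAt: number;

  constructor(clock: Clock, timeoutMs: number) {
    this.clock = clock;
    this.expiresAt = clock.now() + timeoutMs;
  }

  expired(): boolean {
    return this.clock.now() >= this.expiresAt;
  }

  remainingMs(): number {
    return Math.max(0, this.expiresAt - this.clock.now());
  }
}
```

A test can then step to one millisecond before the limit, assert that the deadline is still open, advance once more, and assert that it has expired, all without sleeping.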

Test Budgets at the Enforcement Point

Model fakes can return explicit usage values. That lets you verify budget logic without paying for tokens.

test('stops before exceeding the session token budget', async () => {
  const model = new ScriptedModel([
    { type: 'final', text: 'first', usage: { input: 700, output: 200 } },
    { type: 'final', text: 'second', usage: { input: 700, output: 200 } },
  ]);

  const { trace } = await runBudgetedAgent(
    { model, tools: fakeTools, clock: fakeClock, maxTokens: 1_200 },
    'continue',
  );

  expect(trace.steps.some((step) => step.name === 'token_budget_block')).toBe(true);
  expect(countSteps(trace, 'model_call')).toBe(1);
});

This proves the enforcement behavior. A separate provider integration test can verify that production usage fields are mapped correctly into the internal token model.
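The enforcement point itself can be a small class that the agent consults before each model call. The sketch below is one possible design; `TokenBudget`, `canAfford`, and `record` are hypothetical names, and it assumes the limit applies to input plus output tokens.

```typescript
type Usage = { input: number; output: number };

// Hypothetical budget tracker; the limit covers input plus output tokens.
class TokenBudget {
  private spent = 0;
  private readonly maxTokens: number;

  constructor(maxTokens: number) {
    this.maxTokens = maxTokens;
  }

  // Checked before a model call, using an estimate of that call's cost.
  canAfford(estimatedTokens: number): boolean {
    return this.spent + estimatedTokens <= this.maxTokens;
  }

  // Called after a model call with the usage the model reported.
  record(usage: Usage): void {
    this.spent += usage.input + usage.output;
  }

  get total(): number {
    return this.spent;
  }
}
```

With a limit of 1,200 and a first reply that uses 900 tokens, a second 900-token call is refused, which matches the scenario in the test above.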

Replay Traces for Regression Tests

Recorded metadata traces can drive tests without replaying sensitive content or calling a model. Store a compact fixture describing model decisions, tool outcomes, usage, and expected invariants.

Avoid treating the entire trace as a snapshot that must match byte for byte. IDs, timestamps, and harmless implementation details make snapshots brittle. Normalize the trace and assert durable properties:

Required and forbidden steps
Parent-child relationships
Attempt limits
Terminal status
Budget totals
Policy outcomes

Update a fixture only when the behavioral contract intentionally changes.
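Normalization can be as simple as replacing IDs with step names and dropping timestamps. The sketch below assumes a recorded step shape (`RecordedStep`) invented for this example; adapt the fields to whatever the real trace stores.

```typescript
// Assumed recorded shape: volatile fields (ids, timestamps) plus durable ones.
type RecordedStep = {
  id: string;
  parentId: string | null;
  name: string;
  status: 'ok' | 'error' | 'blocked';
  attempt?: number;
  startedAt: string;
};

type NormalizedStep = {
  name: string;
  parent: string | null;
  status: RecordedStep['status'];
  attempt: number | null;
};

// Replaces ids with parent step names and drops timestamps, so two runs with
// the same behavior normalize to the same value.
function normalizeTrace(steps: RecordedStep[]): NormalizedStep[] {
  const nameById = new Map(steps.map((step) => [step.id, step.name] as const));

  return steps.map((step) => ({
    name: step.name,
    parent: step.parentId === null ? null : (nameById.get(step.parentId) ?? null),
    status: step.status,
    attempt: step.attempt ?? null,
  }));
}
```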

Test Failure Paths, Not Only Happy Paths

Agents fail at boundaries. Include fixtures for:

Invalid model tool arguments
Tool timeouts and rate limits
Empty retrieval results
Cancellation during streaming
Partial tool success
Unauthorized tool selection
Context-budget exhaustion
Trace-sink failure
Fallback failure

Verify that the original error category is preserved, cleanup runs, and no further model or tool work occurs after a terminal state.

Where Real Models Still Belong

Deterministic tests cannot prove that an answer is clear, grounded, polite, or semantically correct. Real-model evaluation remains useful for those questions.

Use a curated dataset and an explicit rubric. Calibrate judge scores against human labels, track disagreement, and review threshold failures. Because model behavior can change, compare distributions and trends rather than treating one score from one run as an unquestionable fact.

A practical schedule is:

Trigger	Recommended checks
Local save	Unit and scripted orchestration tests
Every commit	Trace invariants, policy, retries, budgets, tool contracts
Pull request	Deterministic suite plus a small real-model smoke set
Nightly or weekly	Larger semantic evaluation and regression trends
Release candidate	Human review of high-risk or changed workflows

Teams with stricter cost or reliability requirements can move real-model tests out of pull requests entirely. The important point is to make that trade-off explicit.

Final Thought

Agent output is probabilistic, but agent architecture does not need to be opaque. Validation, tool access, retries, fallbacks, budgets, state transitions, and trace structure can all have deterministic contracts.

Put those contracts behind injectable interfaces, drive them with scripted dependencies, and assert behavior through a stable metadata trace. Save real models and LLM judges for the semantic questions that ordinary code cannot answer.

The next article will turn these ideas into fast CI quality gates with reusable trace rules, baseline comparison, and actionable failure reports.

Building TypeScript-Native Observability: Async Context and Execution Flow

Raju Dandigam — Tue, 28 Jul 2026 20:55:08 +0000

A useful agent trace is not a list of timestamps. It is a causal tree.

When a TypeScript agent retrieves documents in parallel, calls a model, retries a tool, and falls back to cached data, each operation needs a trace ID, its own span ID, and the correct parent span. Without those relationships, completion order is easily mistaken for execution structure.

This article builds a small Node.js tracer to demonstrate the core mechanics: immutable async context, parent-child spans, reliable finalization, and a pluggable sink. It is intentionally smaller than a production observability library, but the design avoids several common mistakes found in minimal examples.

Completion Order Is Not Causality

Imagine three tools running in parallel:

80 ms   search_tickets completes
100 ms  load_account completes
120 ms  search_docs completes

Those timestamps describe completion order. The execution tree describes why the operations existed:

research_agent
└─ parallel_retrieval
   ├─ search_docs
   ├─ search_tickets
   └─ load_account

Both views are useful, but only the tree preserves the relationship between the agent decision and its child tools.

The normal JavaScript call stack cannot serve as that tree. Async work may resume later, execute concurrently, or outlive the function that scheduled it. Tracing therefore needs an explicit logical context.

The Context We Need

Each asynchronous branch needs two values:

type TraceContext = {
  traceId: string;
  parentSpanId: string | null;
};

When a new span starts, it reads the current context, records parentSpanId, creates its own spanId, and runs child work inside a new context whose parent is that span.

In Node.js, AsyncLocalStorage provides the propagation primitive. It carries a value through normal asynchronous resources without adding trace parameters to every application function.

Do not mutate one shared context object. Parallel siblings would race to replace the current span. Create a new context value for every nested span instead.
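The immutability rule can be shown without any async machinery. In this sketch, `childContext` is a hypothetical helper that derives a fresh, frozen context for each nested span and leaves the parent untouched.

```typescript
type TraceContext = {
  readonly traceId: string;
  readonly parentSpanId: string | null;
};

// Derives the context for work nested under a span. The parent context is
// never modified, so sibling branches cannot overwrite each other.
function childContext(parent: TraceContext, spanId: string): TraceContext {
  return Object.freeze({ traceId: parent.traceId, parentSpanId: spanId });
}

const rootContext: TraceContext = Object.freeze({
  traceId: 'trace-1',
  parentSpanId: null,
});

// Two siblings derived from the same parent see only their own span id.
const docsContext = childContext(rootContext, 'span-docs');
const ticketsContext = childContext(rootContext, 'span-tickets');
```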

Define a Small Event Model

Use separate start and end events so a consumer can identify spans that never completed.

type SpanKind = 'run' | 'model' | 'tool' | 'retrieval' | 'decision';
type SpanStatus = 'ok' | 'error';

type SpanStarted = {
  version: 1;
  event: 'span_started';
  traceId: string;
  spanId: string;
  parentSpanId: string | null;
  name: string;
  kind: SpanKind;
  startedAt: string;
};

type SpanEnded = {
  version: 1;
  event: 'span_ended';
  traceId: string;
  spanId: string;
  endedAt: string;
  durationMs: number;
  status: SpanStatus;
  errorCategory?:
    | 'timeout'
    | 'validation'
    | 'authorization'
    | 'dependency'
    | 'unknown';
  metadata?: Record<string, string | number | boolean | null>;
};

type SpanEvent = SpanStarted | SpanEnded;

The schema stores a controlled error category rather than an exception message or stack. Metadata should also be operation-specific in a production design; the flat record keeps this example readable.

Keep Persistence Behind a Sink

The tracer should not own a global event array. Long-running processes would retain every trace in memory, and tests could accidentally read events from unrelated runs.

Use a sink contract instead:

export interface TraceSink {
  enqueue(event: SpanEvent): boolean;
  flush(options?: { timeoutMs?: number }): Promise<void>;
}

enqueue() is deliberately synchronous and non-throwing. It places the event in a bounded buffer and returns false when the event cannot be accepted. The sink handles batching and persistence outside the application’s critical path.

A production sink should expose accepted, dropped, retried, and failed event counts. “Tracing must not crash the request” should not become “tracing may fail silently.”
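As a rough sketch of that contract, the in-memory sink below uses a drop-newest policy and exposes two of the counters. `BoundedMemorySink` and its `deliver` callback are assumptions made for this example, and `SpanEvent` is reduced to two fields to keep the block short.

```typescript
// Stand-in for the article's SpanEvent union, reduced for brevity.
type SpanEvent = { event: 'span_started' | 'span_ended'; spanId: string };

interface TraceSink {
  enqueue(event: SpanEvent): boolean;
  flush(options?: { timeoutMs?: number }): Promise<void>;
}

// Hypothetical in-memory sink with a drop-newest policy and visible counters.
class BoundedMemorySink implements TraceSink {
  accepted = 0;
  dropped = 0;
  private buffer: SpanEvent[] = [];
  private readonly capacity: number;
  private readonly deliver: (batch: SpanEvent[]) => Promise<void>;

  constructor(capacity: number, deliver: (batch: SpanEvent[]) => Promise<void>) {
    this.capacity = capacity;
    this.deliver = deliver;
  }

  // Synchronous and non-throwing: a full buffer is reported, not raised.
  enqueue(event: SpanEvent): boolean {
    if (this.buffer.length >= this.capacity) {
      this.dropped += 1;
      return false;
    }
    this.buffer.push(event);
    this.accepted += 1;
    return true;
  }

  // Hands the buffered batch to the delivery function. The timeout option is
  // omitted here; a real sink would enforce it.
  async flush(): Promise<void> {
    const batch = this.buffer;
    this.buffer = [];
    if (batch.length > 0) await this.deliver(batch);
  }
}
```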

Build the Tracer

The tracer below scopes context with AsyncLocalStorage, emits one end event from a finally block, and keeps sink failure separate from application failure.

import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

type ActiveContext = {
  traceId: string;
  parentSpanId: string | null;
};

type SpanOptions = {
  metadata?: () => SpanEnded['metadata'];
};

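function errorCategory(
  error: unknown,
): SpanEnded['errorCategory'] {
  if (!(error instanceof Error)) return 'unknown';
  // AbortSignal.timeout() rejects with a TimeoutError. AbortError usually
  // signals cancellation; this small example groups it with timeouts.
  if (error.name === 'TimeoutError' || error.name === 'AbortError') {
    return 'timeout';
  }
  if (error.name === 'ValidationError') return 'validation';
  if (error.name === 'AuthorizationError') return 'authorization';
  // Any other Error is attributed to a downstream dependency.
  return 'dependency';
}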

export class Tracer {
  private readonly context = new AsyncLocalStorage<ActiveContext>();

  constructor(
    private readonly sink: TraceSink,
    private readonly onDroppedEvent: (event: SpanEvent) => void = () => {},
  ) {}

  private emit(event: SpanEvent): void {
    try {
      const accepted = this.sink.enqueue(event);
      if (!accepted) this.onDroppedEvent(event);
    } catch {
      this.onDroppedEvent(event);
    }
  }

  async run<T>(name: string, work: () => Promise<T>): Promise<T> {
    const traceId = randomUUID();

    return this.context.run(
      { traceId, parentSpanId: null },
      () => this.span(name, 'run', work),
    );
  }

  async span<T>(
    name: string,
    kind: SpanKind,
    work: () => Promise<T>,
    options: SpanOptions = {},
  ): Promise<T> {
    const parent = this.context.getStore();
    if (!parent) throw new Error('span() must run inside run()');

    const spanId = randomUUID();
    const startedAt = Date.now();

    this.emit({
      version: 1,
      event: 'span_started',
      traceId: parent.traceId,
      spanId,
      parentSpanId: parent.parentSpanId,
      name,
      kind,
      startedAt: new Date(startedAt).toISOString(),
    });

    let status: SpanStatus = 'ok';
    let failureCategory: SpanEnded['errorCategory'];

    try {
      return await this.context.run(
        { traceId: parent.traceId, parentSpanId: spanId },
        work,
      );
    } catch (error) {
      status = 'error';
      failureCategory = errorCategory(error);
      throw error;
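    } finally {
      // A throwing metadata callback must not replace the application result
      // or error, so its failure is contained here.
      let metadata: SpanEnded['metadata'];
      try {
        metadata = options.metadata?.();
      } catch {
        metadata = undefined;
      }

      this.emit({
        version: 1,
        event: 'span_ended',
        traceId: parent.traceId,
        spanId,
        endedAt: new Date().toISOString(),
        durationMs: Date.now() - startedAt,
        status,
        errorCategory: failureCategory,
        metadata,
      });
    }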
  }
}

The application exception is rethrown unchanged. A sink problem never enters the try block that classifies application work, so an observability failure cannot turn a successful model call into a failed model span.

The metadata callback runs when the span ends, which is useful when token usage or result counts are not known at start time. Keep the callback deterministic and payload-free.

Trace Parallel Work

All three operations below inherit the parallel_retrieval span as their parent even though they complete at different times.

await tracer.run('research_agent', async () => {
  return tracer.span('parallel_retrieval', 'decision', async () => {
    const [documents, tickets, account] = await Promise.all([
      tracer.span('search_docs', 'retrieval', () => searchDocs()),
      tracer.span('search_tickets', 'tool', () => searchTickets()),
      tracer.span('load_account', 'tool', () => loadAccount()),
    ]);

    return { documents, tickets, account };
  });
});

Each call to span() creates a new immutable context for its own promise chain. Sibling branches never mutate a shared current-span value.

Make Retries Visible

A retry should be a child span, not an overwritten attempt count. That preserves the duration and error category of every attempt.

await tracer.span('load_pricing', 'tool', async () => {
  try {
    return await tracer.span('attempt_1', 'tool', () => {
      return callPricingApi({ timeoutMs: 1_000 });
    });
  } catch {
    try {
      return await tracer.span('attempt_2', 'tool', () => {
        return callPricingApi({ timeoutMs: 3_000 });
      });
    } catch {
      return tracer.span('fallback_to_cache', 'tool', loadCachedPricing);
    }
  }
});

load_pricing
├─ attempt_1          error: timeout
├─ attempt_2          error: dependency
└─ fallback_to_cache  ok

In real application code, catch only the errors that should trigger a retry or fallback. Authentication, validation, and cancellation errors usually need different handling.
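One way to express that rule is a retry helper that consults an explicit allow-list. The sketch below is illustrative only: `RETRYABLE_ERROR_NAMES`, `RateLimitError`, and `withRetry` are hypothetical, and classifying by `error.name` is an assumption borrowed from the `errorCategory` function earlier in this article.

```typescript
// Assumption: errors are classified by their `name` property.
const RETRYABLE_ERROR_NAMES = new Set(['TimeoutError', 'RateLimitError']);

function isRetryable(error: unknown): boolean {
  return error instanceof Error && RETRYABLE_ERROR_NAMES.has(error.name);
}

// Runs `attempt` up to `maxAttempts` times. Non-retryable errors are
// rethrown immediately; the last retryable error is rethrown when the
// attempts are exhausted.
async function withRetry<T>(
  maxAttempts: number,
  attempt: (attemptNumber: number) => Promise<T>,
): Promise<T> {
  if (maxAttempts < 1) throw new RangeError('maxAttempts must be at least 1');

  let lastError: unknown;
  for (let n = 1; n <= maxAttempts; n += 1) {
    try {
      return await attempt(n);
    } catch (error) {
      if (!isRetryable(error)) throw error;
      lastError = error;
    }
  }
  throw lastError;
}
```

Inside a traced operation, each call to `attempt` could be wrapped in its own child span so every attempt keeps its duration and error category.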

Preserve Function Types

A wrapper should retain argument and result types:

type AsyncFn<Args extends unknown[], Result> = (
  ...args: Args
) => Promise<Result>;

function traced<Args extends unknown[], Result>(
  tracer: Tracer,
  name: string,
  kind: SpanKind,
  fn: AsyncFn<Args, Result>,
): AsyncFn<Args, Result> {
  return (...args: Args) => {
    return tracer.span(name, kind, () => fn(...args));
  };
}

const tracedSearch = traced(
  tracer,
  'search_database',
  'tool',
  async (query: string): Promise<string[]> => searchDatabase(query),
);

const results = await tracedSearch('pricing'); // string[]

Methods that depend on this, overloaded functions, and streams need specialized wrappers. Keep those adapters explicit instead of erasing their signatures with any.

Handle Detached and Streaming Work Explicitly

Async context answers “which trace does this work belong to?” It does not decide how long a trace should remain open.

Detached work such as an unawaited background task may continue after the root span ends. Streaming work may continue after a route handler returns a response. Both need an explicit lifecycle policy:

Await work that is part of the request’s success criteria.
Start a new linked trace for a durable background job.
End a streaming span on completion, error, or cancellation.
Flush at controlled lifecycle boundaries, not after every event.
Do not assume serverless runtimes will keep executing after the response is sent.

Context propagation and lifecycle management are related, but they are not the same problem.

Control Overhead and Backpressure

Trace meaningful boundaries rather than every helper function. Agent runs, retrieval, model calls, tools, policy decisions, retries, and fallbacks usually provide enough structure.

Use a bounded queue. When the sink is slower than the event producer, choose and document a policy: drop newest, drop oldest, apply limited backpressure, or disable tracing for the run. Never allow an unbounded buffer to consume the process.

Sampling should normally happen at the trace level, because independently sampling child spans creates broken trees. Keep complete traces for the selected runs, and consider always retaining errors and high-latency outliers under a documented policy.
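A trace-level decision can be made deterministic by hashing the trace ID, so every span in a trace gets the same answer. The sketch below uses an FNV-1a-style hash over UTF-16 code units; `traceScore`, `shouldKeepTrace`, and the "always keep errors" rule are assumptions for illustration, and any stable hash would serve. Retaining errors implies the decision is made when the trace ends.

```typescript
// FNV-1a-style hash of the trace id, mapped to the range [0, 1).
function traceScore(traceId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < traceId.length; i += 1) {
    hash ^= traceId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash / 0x100000000;
}

type SampleInput = { traceId: string; hadError: boolean };

// Decides once per trace, so a kept trace keeps all of its spans.
function shouldKeepTrace(input: SampleInput, rate: number): boolean {
  if (input.hadError) return true; // assumed policy: always retain failures
  return traceScore(input.traceId) < rate;
}
```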

Test the Execution Model

A tracer is correct only if it survives concurrency and failure tests:

Run two traces concurrently and assert that their span IDs never mix.
Start three Promise.all() siblings and assert that they share the expected parent.
Throw from application work and verify the original error reaches the caller.
Force the sink to reject or throw and verify the application result is unchanged.
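Cancel a stream and verify the span ends exactly once with an outcome that identifies the cancellation. The example schema above has no such value, so it would need an added status or error category.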
Flush the sink and assert that every started span has exactly one end event.
Run tests in parallel workers and verify isolation.
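The "exactly one end event" check from the list above can be a pure function over the flushed events. This is a minimal sketch; `checkSpanBalance` and `BalanceReport` are hypothetical names, and the event type is reduced to the two fields the check needs.

```typescript
type SpanEvent =
  | { event: 'span_started'; spanId: string }
  | { event: 'span_ended'; spanId: string };

type BalanceReport = {
  neverEnded: string[];
  endedMoreThanOnce: string[];
  endedWithoutStart: string[];
};

// Reports every span id that violates "one start, exactly one end".
function checkSpanBalance(events: SpanEvent[]): BalanceReport {
  const started = new Set<string>();
  const endCounts = new Map<string, number>();

  for (const item of events) {
    if (item.event === 'span_started') {
      started.add(item.spanId);
    } else {
      endCounts.set(item.spanId, (endCounts.get(item.spanId) ?? 0) + 1);
    }
  }

  const endedIds = Array.from(endCounts.keys());
  return {
    neverEnded: Array.from(started).filter((id) => !endCounts.has(id)),
    endedMoreThanOnce: endedIds.filter((id) => (endCounts.get(id) ?? 0) > 1),
    endedWithoutStart: endedIds.filter((id) => !started.has(id)),
  };
}
```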

Also test the emitted data policy. Execution trees are valuable even when prompts, outputs, tool arguments, and retrieved content are absent.

Relationship to OpenTelemetry

The concepts map naturally to distributed tracing: a run is a trace, an operation is a span, metadata becomes attributes, and context propagation preserves parentage. A production implementation can translate this event model into OpenTelemetry or another backend.

Building the small version first is still useful. It makes the invariants visible: immutable context, one parent per span, exactly one finalization, bounded persistence, and application behavior independent of sink health.

Final Thought

The difficult part of TypeScript agent tracing is not generating IDs. It is preserving causal structure while asynchronous work branches, completes out of order, retries, streams, and occasionally outlives its original request.

AsyncLocalStorage provides a strong Node.js foundation, but reliable observability also needs lifecycle rules, sink isolation, backpressure, and concurrency tests. With those pieces in place, scattered events become an execution tree that developers can trust.

Why TypeScript AI Developers Need Native Tracing Tools

Raju Dandigam — Fri, 24 Jul 2026 16:16:02 +0000

TypeScript support is easy to claim. Publish an npm package, add a few type declarations, and a tracing product can put “JavaScript and TypeScript” on its integration list.

Native support is a higher bar.

AI applications in the TypeScript ecosystem run inside concurrent Node.js servers, serverless functions, edge runtimes, background workers, test runners, and streaming web frameworks. They cross promise chains, callbacks, tool adapters, async iterators, and package boundaries. A useful tracing tool must fit those execution models without breaking types or producing disconnected spans.

The question is therefore not only, “Does this tool have a TypeScript SDK?” It is, “Does it preserve how a TypeScript agent actually runs?”

The Runtime Is Part of the Product

Consider a small agent:

async function supportAgent(question: string) {
  const category = await classifyQuestion(question);
  const documents = await retrieveDocuments(question, category);
  return generateAnswer(question, documents);
}

The source looks sequential, but a production server may execute hundreds of these functions concurrently. A flat event stream cannot tell which retrieval or model call belongs to which request.

request A: classify -> retrieve -> generate
request B: classify -> retrieve -> generate

The tracer needs a request-level trace ID and a parent span for each nested operation. More importantly, those identifiers must remain available after every asynchronous handoff.

Async Context Must Be Correct Under Concurrency

In Node.js, AsyncLocalStorage is the usual foundation for request-scoped context. It propagates state through normal promise chains and many asynchronous resources, which is much safer than storing a current trace ID in a module-level variable.

import { AsyncLocalStorage } from 'node:async_hooks';

type TraceContext = {
  traceId: string;
  spanId: string | null;
};

const context = new AsyncLocalStorage<TraceContext>();

export function withTraceContext<T>(
  value: TraceContext,
  work: () => T,
): T {
  return context.run(value, work);
}

export function currentTraceContext(): TraceContext {
  const value = context.getStore();
  if (!value) throw new Error('No active trace context');
  return value;
}

await by itself does not cause context loss when AsyncLocalStorage is used correctly. Problems usually appear when instrumentation relies on global mutable state, registers work outside the active context, crosses an unsupported runtime boundary, or integrates with a library that manages its own scheduling.

A TypeScript tracing library should test at least these cases:

Multiple agent runs executing concurrently
Nested tools and model calls
Timers, event emitters, and queued callbacks
Detached background work
Retries that start new asynchronous branches
Tests running in parallel workers

The acceptance criterion is simple: every span belongs to exactly one trace and has the expected parent.
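A concurrency check along those lines might look like the sketch below. It runs two traces whose steps interleave in time and reports which trace ID each step observed. The helper names (`step`, `runTrace`, `observeConcurrentTraces`) and the delays are assumptions chosen for this example.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

type TraceContext = { traceId: string; spanId: string | null };

const storage = new AsyncLocalStorage<TraceContext>();

// Simulated agent step: waits, then reports the trace it believes it is in.
async function step(delayMs: number): Promise<string> {
  await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  const current = storage.getStore();
  if (!current) throw new Error('No active trace context');
  return current.traceId;
}

// Runs several steps inside one trace context and collects what they saw.
function runTrace(traceId: string, delays: number[]): Promise<string[]> {
  return storage.run({ traceId, spanId: null }, () =>
    Promise.all(delays.map((delay) => step(delay))),
  );
}

// The delays are chosen so the two traces' timers interleave.
async function observeConcurrentTraces(): Promise<{ a: string[]; b: string[] }> {
  const [a, b] = await Promise.all([
    runTrace('trace-a', [20, 5]),
    runTrace('trace-b', [10, 15]),
  ]);
  return { a, b };
}
```

If context propagation is working, every step in `a` reports `trace-a` and every step in `b` reports `trace-b`, regardless of completion order.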

Streaming Changes the Span Lifecycle

Many AI routes return a stream before generation has finished. The HTTP handler may complete from the framework’s perspective while tokens, tool calls, and usage data are still in flight.

A model span should therefore not end merely because the route returned a Response. It should end when the stream completes, fails, or is cancelled.

type StreamHooks<T> = {
  onStart(): Promise<void> | void;
  onChunk(chunk: T): Promise<void> | void;
  onComplete(): Promise<void> | void;
  onError(error: unknown): Promise<void> | void;
  onCancel(): Promise<void> | void;
};

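async function* traceStream<T>(
  source: AsyncIterable<T>,
  hooks: StreamHooks<T>,
): AsyncGenerator<T> {
  await hooks.onStart();
  // Set once a terminal hook has been chosen, so exactly one of
  // onComplete, onError, or onCancel runs.
  let settled = false;

  try {
    for await (const chunk of source) {
      await hooks.onChunk(chunk);
      yield chunk;
    }

    settled = true;
    await hooks.onComplete();
  } catch (error) {
    if (!settled) {
      settled = true;
      await hooks.onError(error);
    }
    throw error;
  } finally {
    // Still unsettled here only when the consumer stopped early,
    // for example with break or return().
    if (!settled) await hooks.onCancel();
  }
}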

Real integrations also need to avoid double-finalizing a span when an error and cancellation happen close together. They should record time to first chunk, completion status, tool activity, and final token usage without storing every chunk by default.

Streaming support is not a cosmetic feature. Without it, latency is measured incorrectly, cancellations disappear, and partial responses look like successful completions.
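The measurements mentioned above can be collected by a small recorder that stream hooks call into. The sketch below is one possible shape, not a library API: `createStreamRecorder`, `StreamSummary`, and the injected `now` function are all assumptions. The first terminal signal wins, which guards against double finalization.

```typescript
type StreamOutcome = 'completed' | 'failed' | 'cancelled';

type StreamSummary = {
  outcome: StreamOutcome;
  chunks: number;
  timeToFirstChunkMs: number | null;
  durationMs: number;
};

// Hypothetical recorder. `now` is injected so tests can control time.
function createStreamRecorder(now: () => number) {
  const startedAt = now();
  let firstChunkAt: number | null = null;
  let chunks = 0;
  let summary: StreamSummary | null = null;

  return {
    // Counts chunks without storing their content.
    onChunk(): void {
      if (summary) return; // ignore chunks that arrive after finalization
      if (firstChunkAt === null) firstChunkAt = now();
      chunks += 1;
    },

    // The first terminal signal wins; later calls return the same summary.
    finish(outcome: StreamOutcome): StreamSummary {
      if (!summary) {
        summary = {
          outcome,
          chunks,
          timeToFirstChunkMs:
            firstChunkAt === null ? null : firstChunkAt - startedAt,
          durationMs: now() - startedAt,
        };
      }
      return summary;
    },
  };
}
```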

Runtime Compatibility Is a Matrix

“TypeScript runtime” can mean several different environments:

Environment	Important tracing constraint
Long-running Node.js service	Async context, concurrency, graceful flush on shutdown
Serverless function	Cold starts, short lifetime, bounded flush time
Edge runtime	Web APIs, limited or absent Node built-ins
Background worker	Detached jobs, queue context propagation, retries
Browser	Bundle size, user privacy, no server credentials
Test runner	Isolation across files and parallel workers

A library that imports node:async_hooks or node:fs from its main entry point may fail when bundled for an edge runtime even if those features are never called. Runtime-specific code should live behind explicit exports so bundlers can exclude it.

{
  "exports": {
    ".": "./dist/core.js",
    "./node": "./dist/node.js",
    "./web": "./dist/web.js"
  }
}

The core event model can be portable. Context propagation, persistence, and flush behavior may need runtime-specific implementations.

Framework Hooks Should Follow Real Lifecycles

Framework integrations are valuable when they attach to stable lifecycle hooks rather than patching private internals. For an AI route, useful boundaries include:

Request accepted and validated
Agent run started
Retrieval started and completed
Tool call requested, executed, retried, or rejected
Model stream opened, produced its first chunk, and completed
Client disconnected or cancelled
Final usage became available
Trace flush succeeded or timed out

Different frameworks expose these boundaries differently. A TypeScript-native tracer should make the manual API first-class, then add thin adapters for frameworks such as Vercel AI SDK, LangChain.js, or OpenAI Agents SDK. The adapter should translate framework events into one internal trace model instead of forcing the core to depend on every framework.

Type Preservation Is Part of Developer Experience

Instrumentation should not erase the function signature it wraps. A generic wrapper can preserve argument and return types:

type AsyncFunction<Args extends unknown[], Result> = (
  ...args: Args
) => Promise<Result>;

function withTracing<Args extends unknown[], Result>(
  name: string,
  fn: AsyncFunction<Args, Result>,
): AsyncFunction<Args, Result> {
  return async (...args: Args): Promise<Result> => {
    return traceStep(name, () => fn(...args));
  };
}

type AgentResult = {
  answer: string;
  sources: string[];
};

const tracedAgent = withTracing(
  'support_agent',
  async (question: string): Promise<AgentResult> => {
    return runAgent(question);
  },
);

const result = await tracedAgent('Where is my order?');
result.answer;  // string
result.sources; // string[]

Overloaded functions, methods that depend on this, and streaming return types need more careful adapters. A library should document those boundaries instead of falling back to any.

Strong types also improve trace quality. Tool names, span kinds, metadata fields, and completion states can be controlled unions, which catches instrumentation mistakes before runtime.
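
As a small illustration, the sketch below uses assumed vocabularies (SpanKind and CompletionState are example names, not a published schema) to show both effects: invalid labels fail to compile, and an exhaustive switch forces a review when the vocabulary grows.

```typescript
// Illustrative vocabularies; a real tracer would define its own.
type SpanKind = 'agent' | 'model' | 'tool' | 'retrieval';
type CompletionState = 'ok' | 'error' | 'cancelled';

type SpanRecord = {
  kind: SpanKind;
  name: string;
  completion: CompletionState;
};

function recordSpan(
  kind: SpanKind,
  name: string,
  completion: CompletionState,
): SpanRecord {
  return { kind, name, completion };
}

// The never assignment makes the compiler flag this function when a new
// completion state is added to the union but not handled here.
function endedAbnormally(state: CompletionState): boolean {
  switch (state) {
    case 'ok':
      return false;
    case 'error':
    case 'cancelled':
      return true;
    default: {
      const unhandled: never = state;
      return unhandled;
    }
  }
}

const span = recordSpan('tool', 'lookup_order', 'cancelled');

// These calls would be rejected at compile time:
// recordSpan('tol', 'lookup_order', 'ok');        // typo in the span kind
// recordSpan('tool', 'lookup_order', 'success');  // unknown completion state
```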

ESM, CommonJS, and Bundlers Still Matter

Modern TypeScript packages are consumed through ESM, CommonJS, transpilers, monorepo build systems, and framework bundlers. Tracing libraries are especially sensitive because they often initialize early and integrate across package boundaries.

A production-ready package should make these behaviors clear:

Which module formats are published and tested
Whether initialization has side effects
How duplicate package copies affect global registration
Whether Node-only modules can be excluded from web and edge bundles
How source maps affect stack and code-location metadata
Whether instrumentation works before and after bundling

Automatic monkey-patching can be convenient, but explicit wrappers and adapters are easier to reason about across module systems. If automatic instrumentation is offered, it should be optional and observable.
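
Duplicate package copies are one case where observability helps. The sketch below assumes a registry stored on globalThis under a Symbol.for key, which resolves to the same symbol in every copy of a package loaded into one realm. The key string and the Registry shape are hypothetical.

```typescript
// Hypothetical registry key shared by every loaded copy of the package.
const REGISTRY_KEY = Symbol.for('example.tracer.registry');

type Registry = { version: string; registrations: number };

function getRegistry(version: string): Registry {
  const globals = globalThis as unknown as Record<symbol, unknown>;
  const existing = globals[REGISTRY_KEY] as Registry | undefined;

  if (existing) {
    // A second copy found the first one: count it so the duplicate is visible.
    existing.registrations += 1;
    return existing;
  }

  const created: Registry = { version, registrations: 1 };
  globals[REGISTRY_KEY] = created;
  return created;
}
```

A library could log a warning when registrations exceeds one or when the stored version differs from its own, instead of silently keeping two independent trace states.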

Vendor-Neutral Events Keep the Core Flexible

TypeScript-native does not need to mean backend-specific. The instrumentation layer can emit a small internal event model, while sinks translate those events to local files, OpenTelemetry, or a hosted platform.

type TraceSink = {
  write(event: TraceEvent): Promise<void>;
  flush(options?: { timeoutMs?: number }): Promise<void>;
};

This separation lets teams use local traces for development, short-lived artifacts in CI, and centralized observability in production without rewriting every adapter.

It also creates one place to enforce privacy policy. Framework integrations should emit approved metadata into the core; destinations should not decide what sensitive payloads to collect.
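
A fan-out sink is one possible way to send the same event model to several destinations. In this sketch the TraceEvent fields are minimal stand-ins, and fanOut and failureCount are assumed names rather than part of any existing library.

```typescript
// Minimal stand-in for the internal event model.
type TraceEvent = { name: string; status: 'ok' | 'error'; durationMs: number };

type TraceSink = {
  write(event: TraceEvent): Promise<void>;
  flush(options?: { timeoutMs?: number }): Promise<void>;
};

// Sends each event to every sink. allSettled keeps one failing destination
// from blocking the others; failures are counted rather than hidden.
function fanOut(sinks: TraceSink[]): TraceSink & { failureCount(): number } {
  let failed = 0;

  const settle = async (tasks: Promise<void>[]): Promise<void> => {
    const results = await Promise.allSettled(tasks);
    failed += results.filter((result) => result.status === 'rejected').length;
  };

  return {
    write: (event) => settle(sinks.map((sink) => sink.write(event))),
    flush: (options) => settle(sinks.map((sink) => sink.flush(options))),
    failureCount: () => failed,
  };
}
```

Because every sink receives an event that has already passed through the core, adding a destination does not widen what is collected.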

How to Evaluate a TypeScript Tracing Tool

Use representative tests rather than a single hello-world script.

Test	What a passing result looks like
Two concurrent requests	Separate trace trees with no mixed spans
Nested tool call	Correct parent and timing under the model or agent step
Streaming response	First-chunk, completion, error, and cancellation are distinct
Serverless invocation	Events flush within a bounded time without delaying every request
Edge build	Node-only modules are absent from the bundle
Type check	Wrapped functions preserve arguments and results
Parallel tests	Trace state is isolated by test and worker
Privacy check	Raw prompts and tool payloads are absent by default

Also inspect failure behavior. Observability should not crash the agent because a sink is unavailable, but silent data loss is not acceptable either. Libraries should expose dropped-event counts, flush failures, and backpressure policy.
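
One simple backpressure policy is a bounded buffer that drops the newest event when full and counts every drop. The sketch below is an assumption about how a library might expose that; the class name, the drop-newest choice, and the stats shape are all illustrative.

```typescript
// Bounded in-memory buffer with a drop-newest policy.
class BoundedEventBuffer<E> {
  private readonly events: E[] = [];
  private readonly capacity: number;
  private dropped = 0;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Never throws into the agent path; returns false when the event is dropped.
  push(event: E): boolean {
    if (this.events.length >= this.capacity) {
      this.dropped += 1;
      return false;
    }
    this.events.push(event);
    return true;
  }

  // Hands all buffered events to a flusher and empties the buffer.
  drain(): E[] {
    return this.events.splice(0, this.events.length);
  }

  // Exposes loss so it can be alerted on instead of going unnoticed.
  stats(): { buffered: number; dropped: number } {
    return { buffered: this.events.length, dropped: this.dropped };
  }
}
```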

A Better Definition of TypeScript-Native

A TypeScript-native tracing tool should be:

Async-aware: Correct under concurrent and nested execution.
Stream-aware: Accurate through completion, failure, and cancellation.
Runtime-aware: Explicit about Node, serverless, edge, browser, and worker support.
Framework-adaptable: Integrated through stable lifecycle hooks.
Type-preserving: Safe wrappers without unnecessary any.
Module-conscious: Predictable across ESM, CommonJS, and bundlers.
Privacy-conscious: Metadata-first with explicit payload capture.
Backend-flexible: Able to send one event model to multiple sinks.

An npm package is only the delivery mechanism. Native support is the accumulated quality of these runtime and developer-experience decisions.

Final Thought

TypeScript agents are asynchronous, streaming, and frequently deployed across more than one runtime. Their observability tools need to understand those constraints as first-class design inputs.

When evaluating a tracer, ask whether it preserves execution context, lifecycle, types, and privacy under the conditions your application actually uses. That is the difference between an SDK that compiles and instrumentation you can trust.

The next article will build the core mechanics directly: a TypeScript execution tree using AsyncLocalStorage, parent-child spans, safe completion handling, and pluggable trace sinks.

Metadata-Only Tracing: Privacy-First Observability for AI Agents

Raju Dandigam — Tue, 21 Jul 2026 19:14:31 +0000

Agent tracing is useful because it reveals execution structure: which step ran, which tool failed, where a retry occurred, how long a model call took, and how the token budget changed.

The easiest implementation is to capture every prompt, argument, result, and response. It is also the easiest way to turn an observability system into a second copy of sensitive application data.

Metadata-only tracing takes a different approach. It records the behavior of an agent without storing its raw payloads by default. The result is not zero-risk telemetry, but it is a much smaller and more governable data surface.

The Design Goal

A useful metadata trace should answer operational questions such as:

Which steps executed, and in what order?
Which model and tool operations succeeded or failed?
Where did latency accumulate?
How many retries and fallbacks occurred?
How many input, cached-input, and output tokens were used?
Did retrieval return results, and how much context was assembled?
Which validation or policy gate blocked the run?

It should not answer these questions unless a separate capture policy explicitly allows it:

What did the user say word for word?
What was the complete prompt or model response?
Which email address, account number, or authorization token was used?
What records or documents did a tool return?

That boundary keeps everyday traces useful without making full-fidelity capture the default.

Metadata-Only Does Not Mean Anonymous

Metadata can still be sensitive. A workflow name, fine-grained location, unique identifier, decision label, or rare error category may identify a person or reveal confidential business activity.

The relevant distinction is not “payload versus harmless metadata.” It is necessary, classified metadata versus unbounded content. Every field still needs a purpose, an owner, and a retention policy.

Avoid free-form metadata bags. A type such as Record<string, string | number> constrains value shapes, but it does not prevent a developer from adding email, prompt, or accessToken.

Define Metadata by Operation

Operation-specific types make the intended schema visible during code review and prevent arbitrary fields from spreading through the trace system.

type StepMetadata = {
  retrieval: {
    source: 'knowledge_base' | 'ticket_index';
    requestedTopK: number;
    resultCount: number;
    contextTokens: number;
  };
  model: {
    provider: 'openai' | 'anthropic' | 'google' | 'other';
    model: string;
    inputTokens: number;
    cachedInputTokens: number;
    outputTokens: number;
    finishReason: 'stop' | 'length' | 'tool' | 'other';
  };
  tool: {
    tool: 'lookup_order' | 'search_docs' | 'create_ticket';
    result: 'found' | 'not_found' | 'created' | 'rejected';
    retryCount: number;
  };
  policy: {
    policy: 'input_validation' | 'tool_authorization' | 'output_check';
    outcome: 'allow' | 'block';
    reason: 'valid' | 'invalid_shape' | 'not_authorized' | 'unsafe_output';
  };
};

type StepKind = keyof StepMetadata;

Controlled vocabularies are intentional. They make dashboards stable, reduce high-cardinality fields, and force new data collection to be reviewed as a schema change.

Model names may remain dynamic, but they should still be length-limited and normalized. User-controlled strings should not be copied into these fields.
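
A normalization helper could look like the sketch below. The allowed character set, the 80-character default, and the unknown fallback are assumptions chosen for illustration, not a rule from any provider.

```typescript
// Keeps only characters commonly seen in model identifiers and caps the length.
function normalizeModelName(raw: string, maxLength = 80): string {
  const cleaned = raw
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9._:\/-]/g, '')
    .slice(0, maxLength);

  // An empty result means the input had no usable characters.
  return cleaned.length > 0 ? cleaned : 'unknown';
}
```

Where the set of deployed models is known, mapping names onto an explicit allowlist is stricter than character filtering.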

Use a Small, Versioned Event Envelope

The trace envelope should support parent-child relationships without carrying application payloads.

type TraceStatus = 'ok' | 'error';

type StepCompleted<K extends StepKind = StepKind> = {
  version: 1;
  event: 'step_completed';
  traceId: string;
  spanId: string;
  parentSpanId: string | null;
  timestamp: string;
  kind: K;
  name: string;
  status: TraceStatus;
  durationMs: number;
  errorCategory?:
    | 'timeout'
    | 'rate_limit'
    | 'validation'
    | 'authorization'
    | 'dependency'
    | 'unknown';
  metadata?: StepMetadata[K];
};

Versioning matters because trace artifacts often outlive the code that produced them. A version field lets readers migrate or reject incompatible events rather than guessing their shape.
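
A reader can apply that rule with a small gate in front of its parsing logic. The sketch below assumes NDJSON input and a reader that supports only version 1; readEventLine and the reason labels are hypothetical names.

```typescript
type VersionedEvent = { version: number; event: string };

type ReadResult =
  | { ok: true; event: VersionedEvent }
  | { ok: false; reason: 'malformed' | 'unsupported_version' };

// Versions this reader knows how to interpret.
const SUPPORTED_VERSIONS = new Set([1]);

// Parses one NDJSON line and rejects events with an unknown version
// instead of guessing their shape.
function readEventLine(line: string): ReadResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return { ok: false, reason: 'malformed' };
  }

  if (typeof parsed !== 'object' || parsed === null) {
    return { ok: false, reason: 'malformed' };
  }

  const candidate = parsed as { version?: unknown; event?: unknown };
  if (typeof candidate.version !== 'number' || typeof candidate.event !== 'string') {
    return { ok: false, reason: 'malformed' };
  }
  if (!SUPPORTED_VERSIONS.has(candidate.version)) {
    return { ok: false, reason: 'unsupported_version' };
  }

  return { ok: true, event: { version: candidate.version, event: candidate.event } };
}
```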

Do not include raw error messages or stack traces in the default event. Both frequently contain payload fragments, file paths, headers, or query values. Map exceptions to a controlled category and keep richer diagnostics behind a restricted capture mode.

Preserve Parent-Child Context in TypeScript

AsyncLocalStorage can carry trace and parent-span identifiers across promise chains without passing them through every function signature. The tracer below emits completion events and requires each operation to return metadata that matches its declared kind.

import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

type TraceContext = {
  traceId: string;
  spanId: string | null;
};

type StepOutput<T, K extends StepKind> = {
  value: T;
  metadata: StepMetadata[K];
};

export interface TraceSink {
  write(event: StepCompleted): Promise<void>;
}

const traceContext = new AsyncLocalStorage<TraceContext>();

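// Maps exceptions to a controlled category by error name only. The message
// and stack are never read, so payload fragments cannot reach the trace.
function categorizeError(error: unknown): StepCompleted['errorCategory'] {
  if (!(error instanceof Error)) return 'unknown';
  // AbortError also covers caller-initiated cancellation, which this
  // vocabulary groups with timeouts.
  if (error.name === 'AbortError') return 'timeout';
  if (error.name === 'ValidationError') return 'validation';
  if (error.name === 'AuthorizationError') return 'authorization';
  // Provider-specific checks (for example, rate limits) would go here.
  return 'dependency';
}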

export async function runTrace<T>(
  work: () => Promise<T>,
): Promise<T> {
  return traceContext.run(
    { traceId: randomUUID(), spanId: null },
    work,
  );
}

export async function traceStep<K extends StepKind, T>(
  sink: TraceSink,
  kind: K,
  name: string,
  work: () => Promise<StepOutput<T, K>>,
): Promise<T> {
  const parent = traceContext.getStore();
  if (!parent) throw new Error('traceStep must run inside runTrace');

  const spanId = randomUUID();
  const startedAt = Date.now();

  let output: StepOutput<T, K>;

  try {
    output = await traceContext.run(
      { traceId: parent.traceId, spanId },
      work,
    );
  } catch (error) {
    await sink.write({
      version: 1,
      event: 'step_completed',
      traceId: parent.traceId,
      spanId,
      parentSpanId: parent.spanId,
      timestamp: new Date().toISOString(),
      kind,
      name,
      status: 'error',
      durationMs: Date.now() - startedAt,
      errorCategory: categorizeError(error),
    });

    throw error;
  }

  // Written outside the try block so a failing sink cannot emit a second,
  // misleading error event for a step that actually succeeded.
  await sink.write({
    version: 1,
    event: 'step_completed',
    traceId: parent.traceId,
    spanId,
    parentSpanId: parent.spanId,
    timestamp: new Date().toISOString(),
    kind,
    name,
    status: 'ok',
    durationMs: Date.now() - startedAt,
    metadata: output.metadata,
  });

  return output.value;
}

The sink can write to a local NDJSON file during development or export approved events to an observability backend. Capture policy belongs before the sink so changing destinations cannot silently increase what is collected.

Instrument an Agent Without Capturing Content

The model and retrieval operations can use sensitive values in memory while returning only bounded operational metadata to the tracer.

const answer = await runTrace(async () => {
  const documents = await traceStep(
    sink,
    'retrieval',
    'retrieve_support_docs',
    async () => {
      const value = await searchDocuments(userQuestion);

      return {
        value,
        metadata: {
          source: 'knowledge_base',
          requestedTopK: 5,
          resultCount: value.length,
          contextTokens: countDocumentTokens(value),
        },
      };
    },
  );

  return traceStep(
    sink,
    'model',
    'generate_support_answer',
    async () => {
      const response = await callModel(userQuestion, documents);

      return {
        value: response.text,
        metadata: {
          provider: 'other',
          model: response.model.slice(0, 80),
          inputTokens: response.usage.inputTokens,
          cachedInputTokens: response.usage.cachedInputTokens ?? 0,
          outputTokens: response.usage.outputTokens,
          finishReason: response.finishReason,
        },
      };
    },
  );
});

This trace can reveal an empty retrieval result, an oversized context, a length-limited response, or an unexpectedly expensive model call. It never needs the user question or document text.

What a Useful Trace Looks Like

support_agent  1,184 ms  ok
├─ retrieve_support_docs   96 ms  ok
│  source=knowledge_base resultCount=5 contextTokens=0
└─ generate_support_answer 1,072 ms ok
   inputTokens=640 cachedInputTokens=0 outputTokens=84 finishReason=stop

The zero-token retrieval context is immediately suspicious even though the trace does not expose the documents. Metadata narrows the investigation; a developer can then enable selective local capture for that step if the issue cannot be reproduced otherwise.
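
An outline like the one above can be rebuilt from completion events using only spanId and parentSpanId. The sketch below is a simplified renderer under assumptions: SpanSummary is a reduced event shape, root spans have a null parent, and children are printed in input order with plain indentation instead of box-drawing characters.

```typescript
type SpanSummary = {
  spanId: string;
  parentSpanId: string | null;
  name: string;
  durationMs: number;
  status: 'ok' | 'error';
};

// Groups spans by parent, then walks the tree depth-first from the roots.
function renderTree(spans: SpanSummary[]): string[] {
  const children = new Map<string | null, SpanSummary[]>();
  for (const span of spans) {
    const siblings = children.get(span.parentSpanId) ?? [];
    siblings.push(span);
    children.set(span.parentSpanId, siblings);
  }

  const lines: string[] = [];
  const seen = new Set<string>(); // Guards against duplicate span ids.

  const visit = (parentId: string | null, depth: number): void => {
    for (const span of children.get(parentId) ?? []) {
      if (seen.has(span.spanId)) continue;
      seen.add(span.spanId);
      lines.push(
        `${'  '.repeat(depth)}${span.name} ${span.durationMs} ms ${span.status}`,
      );
      visit(span.spanId, depth + 1);
    }
  };

  visit(null, 0);
  return lines;
}
```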

Enforce Runtime Limits Too

TypeScript types disappear at runtime, and trace data may come from JavaScript adapters or external libraries. Validate events before writing them:

Reject unknown keys and unsupported schema versions.
Cap names and other strings to a small maximum length.
Require finite, non-negative numeric values.
Limit metadata field counts and serialized event size.
Reject keys associated with payloads, credentials, headers, or free-form content.
Fail closed when validation cannot complete.

Schema validation libraries can enforce the structural rules. The operation-specific builders should still remain the primary policy boundary.
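
The rules above can be sketched as a hand-written validator. Every limit and key list here is an illustrative assumption, and the sketch checks unknown keys only at the envelope level; operation-specific metadata keys would still be enforced by the typed builders.

```typescript
type ValidationResult = { ok: true } | { ok: false; reason: string };

// All limits and key lists below are illustrative assumptions.
const MAX_STRING_LENGTH = 120;
const MAX_METADATA_FIELDS = 12;
const MAX_SERIALIZED_CHARS = 2048;
const ENVELOPE_KEYS = new Set([
  'version', 'event', 'traceId', 'spanId', 'parentSpanId', 'timestamp',
  'kind', 'name', 'status', 'durationMs', 'errorCategory', 'metadata',
]);
const FORBIDDEN_KEYS = new Set([
  'prompt', 'response', 'payload', 'authorization', 'cookie', 'headers',
  'email', 'accesstoken', 'password',
]);

const reject = (reason: string): ValidationResult => ({ ok: false, reason });

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Returns a static reason label, never the offending key or value.
function checkField(key: string, value: unknown): string | null {
  if (FORBIDDEN_KEYS.has(key.toLowerCase())) return 'forbidden_key';
  if (typeof value === 'string' && value.length > MAX_STRING_LENGTH) return 'string_too_long';
  if (typeof value === 'number' && (!Number.isFinite(value) || value < 0)) return 'bad_number';
  return null;
}

function validateEvent(event: unknown): ValidationResult {
  try {
    if (!isPlainObject(event)) return reject('not_an_object');
    if (event['version'] !== 1) return reject('unsupported_version');

    for (const [key, value] of Object.entries(event)) {
      if (!ENVELOPE_KEYS.has(key)) return reject('unknown_key');
      const problem = checkField(key, value);
      if (problem) return reject(problem);
    }

    const metadata = event['metadata'] ?? {};
    if (!isPlainObject(metadata)) return reject('metadata_not_an_object');
    const fields = Object.entries(metadata);
    if (fields.length > MAX_METADATA_FIELDS) return reject('too_many_metadata_fields');

    for (const [key, value] of fields) {
      if (isPlainObject(value) || Array.isArray(value)) return reject('nested_value');
      const problem = checkField(key, value);
      if (problem) return reject(problem);
    }

    if (JSON.stringify(event).length > MAX_SERIALIZED_CHARS) return reject('event_too_large');
    return { ok: true };
  } catch {
    // Fail closed: an event that cannot be validated is never written.
    return reject('validation_failed');
  }
}
```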

Test the Negative Requirements

Privacy requirements are often about what must never appear. Encode those expectations in tests.

const forbiddenKeys = [
  'prompt',
  'response',
  'args',
  'resultBody',
  'authorization',
  'cookie',
  'email',
] as const;

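// Checks serialized property names, so nested objects are covered as well.
function assertNoForbiddenKeys(event: unknown): void {
  // JSON.stringify returns undefined for inputs such as undefined or functions.
  const serialized = (JSON.stringify(event) ?? '').toLowerCase();

  for (const key of forbiddenKeys) {
    // Match `"key":` so a harmless string value such as "email" is not flagged.
    if (serialized.includes(`"${key.toLowerCase()}":`)) {
      throw new Error(`Forbidden trace key: ${key}`);
    }
  }
}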

Add representative secrets and personal data to test fixtures, run the agent, and assert that none of those values appear in the emitted trace. This is not a substitute for a broader security review, but it catches regressions when instrumentation changes.
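
The value check can be sketched in a few lines. The fixture strings below are deliberately fake, and findLeakedValues is a hypothetical helper; it compares lowercased serialized text, so it would miss values that are encoded, split across fields, or altered by JSON escaping.

```typescript
// Fake fixture values planted in test inputs; none are real credentials.
const fixtureSecrets = [
  'ada@example.com',
  'ACCT-000123',
  'sk-test-not-a-real-key',
];

// Returns every fixture value that appears anywhere in the emitted events.
function findLeakedValues(events: unknown[], secrets: string[]): string[] {
  const serialized = JSON.stringify(events).toLowerCase();
  return secrets.filter((secret) => serialized.includes(secret.toLowerCase()));
}
```

A test would collect events from an in-memory sink, call findLeakedValues(events, fixtureSecrets), and fail when the returned list is not empty.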

Know When Metadata Is Not Enough

Metadata-only tracing is excellent for timing, topology, retries, token usage, policy outcomes, and broad error localization. It cannot explain every semantic failure.

When exact content is necessary, use a separate diagnostic mode with these constraints:

Enable it for a specific trace, step, or short time window.
Keep capture local or in a restricted incident environment.
Allowlist fields instead of recording complete objects.
Apply deterministic redaction and secret scanning.
Display that enhanced capture is active.
Delete the artifact automatically after a short retention period.

The escalation path should be obvious, but it should require intent.
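
Several of these constraints can be expressed as a small, explicit grant object. The sketch below is one assumed design: CaptureGrant and captureFields are hypothetical names, and redaction, scanning, and artifact deletion would still be separate steps.

```typescript
type CaptureGrant = {
  step: string;            // The single step this grant covers.
  allowedFields: string[]; // An allowlist, never the complete object.
  expiresAt: number;       // Epoch milliseconds after which nothing is captured.
};

// Returns only allowlisted fields, and only while the grant is valid.
function captureFields(
  grant: CaptureGrant | undefined,
  step: string,
  payload: Record<string, unknown>,
  now: number,
): Record<string, unknown> {
  if (!grant || grant.step !== step || now >= grant.expiresAt) return {};

  const captured: Record<string, unknown> = {};
  for (const field of grant.allowedFields) {
    if (Object.prototype.hasOwnProperty.call(payload, field)) {
      captured[field] = payload[field];
    }
  }
  return captured;
}
```

Without a grant the function returns an empty object, so the default path stays metadata-only.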

A Practical Default

Start with a versioned event envelope, operation-specific metadata types, async parent-child context, controlled error categories, and runtime validation. Store timing, status, token usage, counts, and bounded labels. Keep prompts, outputs, tool payloads, retrieved text, headers, and environment values out of the default schema.

Metadata-only tracing will not answer every debugging question. It will answer a large portion of them while substantially reducing the amount of sensitive data your observability system must protect.

The next article will focus on the TypeScript runtime itself: async context propagation, module boundaries, serverless execution, and the hooks a tracing tool needs to handle cleanly.

Keep Your AI Agent Traces on Your Machine: A Local-First Approach

Raju Dandigam — Fri, 17 Jul 2026 19:32:17 +0000

Adding tracing to an AI agent changes more than the debugging experience. It also creates a new data path.

A trace can contain user messages, system instructions, retrieved documents, tool arguments, tool results, model output, identifiers, and internal business logic. Sending that trace to an external service may therefore be equivalent to sending the underlying application data.

The right first question is not simply, “Does this tracing SDK have a good dashboard?” It is, “What does it capture, where does it go, and who controls it?”

A local-first tracing design gives developers useful execution visibility while keeping export and sharing decisions explicit. It is not a rejection of hosted observability. It is a safer default for the development loop and a clearer boundary for sensitive data.

Why Agent Traces Need Their Own Data Policy

Traditional telemetry can also expose sensitive information, but AI traces are unusually payload-rich. The data needed to explain an agent’s behavior is often the same data the agent was asked to process.

Trace source	What it may reveal
User and model messages	Personal data, confidential requests, generated decisions
System prompts	Internal policies, workflow rules, security assumptions
Tool calls	Identifiers, query parameters, authorization context
Tool results	Customer records, database rows, third-party API data
Retrieval context	Private documents, proprietary knowledge, access-controlled text
Errors and retries	Stack traces, request fragments, infrastructure details

This does not mean that rich traces are always inappropriate. It means trace data needs an owner, a classification, a retention period, and an approved destination just like any other sensitive dataset.

What Local-First Tracing Means

In a local-first workflow, the initial trace is written to a developer machine, an isolated test workspace, a CI runner, or infrastructure controlled by the organization. Nothing is exported merely because tracing was enabled.

agent execution
    |
    v
capture policy
    |
    v
controlled local sink
    |
    +--> local inspection
    |
    +--> reviewed, reduced export --> approved shared platform

The important property is intentional movement. Developers can inspect the execution locally, reduce the data to what is needed, and share only an approved artifact.

Local-first is not the same as automatically secure. A laptop may be compromised, a workspace may be synchronized to a consumer cloud account, and a CI artifact may be visible to more people than expected. Local traces still require access controls, retention rules, and safe defaults.

Define Capture Modes Before Writing Code

A single on/off tracing switch is too coarse for most agents. Use explicit capture modes instead:

Mode	Captured data	Typical use
Metadata	Names, timing, status, token counts, safe categories	Default development and production telemetry
Selective payload	Approved fields from specific tools or steps	Focused debugging in a controlled environment
Full fidelity	Prompts, outputs, and raw payloads	Temporary incident investigation with short retention

Full-fidelity capture should require a deliberate configuration change, produce a visible warning, and expire automatically. It should never be enabled silently by a dependency update or a default environment variable.
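
Those three requirements can be sketched as a mode resolver. The configuration shape is an assumption (fullFidelityUntil is a hypothetical ISO timestamp field); the point is that full-fidelity mode is impossible without an explicit, future expiry and always produces a warning.

```typescript
type CaptureMode = 'metadata' | 'selective' | 'full';

type CaptureConfig = {
  mode?: CaptureMode;
  fullFidelityUntil?: string; // ISO timestamp, required for full mode.
};

function resolveCaptureMode(
  config: CaptureConfig,
  now: Date,
  warn: (message: string) => void,
): CaptureMode {
  const requested = config.mode ?? 'metadata';
  if (requested !== 'full') return requested;

  // Date.parse returns NaN for a missing or unparseable timestamp.
  const until = Date.parse(config.fullFidelityUntil ?? '');
  if (Number.isNaN(until) || until <= now.getTime()) {
    warn('Full-fidelity capture requested without a valid future expiry; using metadata mode.');
    return 'metadata';
  }

  warn(`Full-fidelity capture is ACTIVE until ${new Date(until).toISOString()}.`);
  return 'full';
}
```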

Make the Trace Schema Safe by Construction

The most reliable redaction strategy is not capturing unnecessary data. A typed metadata event makes the safe path easy and prevents arbitrary payloads from drifting into normal traces.

type TraceStatus = 'ok' | 'error';
type TraceKind = 'agent' | 'model' | 'tool' | 'retrieval';
type SafeValue = string | number | boolean | null;

export type TraceEvent = {
  version: 1;
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  timestamp: string;
  kind: TraceKind;
  name: string;
  status: TraceStatus;
  durationMs?: number;
  inputTokens?: number;
  outputTokens?: number;
  metadata: Record<string, SafeValue>;
};

Notice what the schema does not include: raw prompts, free-form tool arguments, complete tool results, or authorization headers. Those fields require a separate, more restrictive capture path.

For an order lookup, prefer an allowlisted projection:

type OrderLookupResult = {
  found: boolean;
  status?: 'processing' | 'shipped' | 'delivered';
  itemCount?: number;
};

function orderLookupMetadata(
  result: OrderLookupResult,
): TraceEvent['metadata'] {
  return {
    resultFound: result.found,
    orderStatus: result.status ?? 'unknown',
    itemCount: result.itemCount ?? 0,
  };
}

This event can answer whether the lookup ran, succeeded, returned a record, and produced the expected state. It does not need an email address, street address, payment details, or the complete database response.

Write NDJSON to a Controlled Local Sink

Newline-delimited JSON is a practical local format because it is append-friendly, streamable, and easy to inspect with standard tools. The following sink creates a private directory and file on platforms that support POSIX permissions:

import { appendFile, mkdir } from 'node:fs/promises';
import { dirname, resolve } from 'node:path';

export class LocalTraceSink {
  constructor(
    private readonly filePath = resolve(
      '.agent-traces',
      'traces.ndjson',
    ),
  ) {}

  async write(event: TraceEvent): Promise<void> {
    await mkdir(dirname(this.filePath), {
      recursive: true,
      mode: 0o700,
    });

    await appendFile(
      this.filePath,
      `${JSON.stringify(event)}\n`,
      {
        encoding: 'utf8',
        flag: 'a',
        mode: 0o600,
      },
    );
  }
}

For concurrent or high-volume tracing, put writes behind a queue or use a storage engine designed for parallel writers. Also verify existing file permissions at startup because the mode option applies when a file is created; it does not repair an already permissive file.
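
A startup check for existing files might look like the sketch below. It assumes POSIX permission bits (they are not meaningful on Windows), and ensurePrivateFile is a hypothetical helper; a stricter policy could refuse to start instead of tightening the mode.

```typescript
import { chmod, stat } from 'node:fs/promises';

// True when any group or other permission bit is set on a POSIX mode.
function isTooPermissive(mode: number): boolean {
  return (mode & 0o077) !== 0;
}

// Tightens an existing trace file to owner-only access. A missing file is
// fine because the sink creates it with a restrictive mode.
async function ensurePrivateFile(
  filePath: string,
): Promise<'missing' | 'ok' | 'tightened'> {
  try {
    const info = await stat(filePath);
    if (!isTooPermissive(info.mode)) return 'ok';

    await chmod(filePath, 0o600);
    return 'tightened';
  } catch (error) {
    if ((error as { code?: string }).code === 'ENOENT') return 'missing';
    throw error;
  }
}
```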

Add the trace directory to .gitignore, keep it outside folders that synchronize automatically, and do not serve it from a development web root.

.agent-traces/

Use Stable Metadata, Not Hidden Payloads

Safe metadata should describe execution, not smuggle raw content under a different key. Useful fields include:

Operation and tool names from a controlled vocabulary
Success, failure, timeout, or cancellation status
Duration, retry count, and fallback usage
Model name and token usage
Retrieval result count and score range
Input and output size in bytes or tokens
Validation result and error category
Approved business categories such as order_status

Be careful with hashes. A plain hash of a predictable identifier, such as an email address or short account number, can often be reversed by guessing. If cross-trace correlation is necessary, use a keyed HMAC managed outside the trace store, rotate the key, and document who can perform the correlation.
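
A correlation tag built with HMAC-SHA-256 from Node's crypto module could look like this sketch. The function name, the key-id prefix, and the trim-and-lowercase normalization are assumptions; normalization in particular only suits identifiers that are case-insensitive, such as email addresses.

```typescript
import { createHmac } from 'node:crypto';

// Produces a stable tag for an identifier without storing the identifier.
// The key must come from a secret store, never from the trace directory.
function correlationTag(identifier: string, key: Buffer, keyId: string): string {
  const digest = createHmac('sha256', key)
    .update(identifier.trim().toLowerCase(), 'utf8')
    .digest('hex');

  // The key id travels with the tag so values from rotated keys stay distinguishable.
  return `${keyId}:${digest}`;
}
```

Tags produced under different keys cannot be compared, which is the intended effect of rotation.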

Treat Redaction as Defense in Depth

Regex replacement is useful for catching obvious patterns, but it is not a complete privacy control. Addresses, names, medical details, credentials, and business-sensitive text do not all have reliable patterns.

A stronger export pipeline works in this order:

Start from a metadata-only event schema.
Allowlist the exact payload fields approved for a specific workflow.
Apply structured redaction by field name and data type.
Scan the resulting artifact for secrets and sensitive patterns.
Fail closed when the scan or policy check cannot complete.
Require review before external or public sharing.

Redaction should produce a new artifact. Preserve the restricted source only for its approved retention period, and never overwrite it in a way that makes the audit trail ambiguous.
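
The allowlist, structured redaction, and fail-closed steps can be sketched as follows. The field names, the simple email pattern, and the scan callback are illustrative assumptions; a real pipeline would use a dedicated secret scanner.

```typescript
type Scalar = string | number | boolean | null;
type Artifact = Record<string, Scalar>;

// Hypothetical field names and a deliberately simple email pattern.
const ALWAYS_REDACT = new Set(['customerName', 'shippingAddress']);
const EMAIL_LIKE = /[^\s@]+@[^\s@]+\.[^\s@]+/g;

// Builds a new, reduced artifact; the restricted source is never mutated.
function buildExportArtifact(source: Artifact, allowlist: string[]): Artifact {
  const reduced: Artifact = {};

  for (const field of allowlist) {
    if (!Object.prototype.hasOwnProperty.call(source, field)) continue;
    const value = source[field];
    if (value === undefined) continue;

    if (ALWAYS_REDACT.has(field)) {
      reduced[field] = '[redacted]';
    } else if (typeof value === 'string') {
      reduced[field] = value.replace(EMAIL_LIKE, '[email]');
    } else {
      reduced[field] = value;
    }
  }

  return reduced;
}

// Fail closed: a scanner that throws blocks the export instead of allowing it.
function mayExport(
  artifact: Artifact,
  containsSecret: (text: string) => boolean,
): boolean {
  try {
    return !containsSecret(JSON.stringify(artifact));
  } catch {
    return false;
  }
}
```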

Protect the Local Boundary

Local traces deserve the same operational hygiene as other sensitive developer data:

Restrict directory and file permissions.
Encrypt the disk and lock the development session when unattended.
Keep traces out of source control, terminal paste services, and automatic backups unless those systems are approved.
Store encryption keys and service credentials outside trace files.
Apply a short default retention period and provide an explicit cleanup command.
Record when full-fidelity capture is enabled and who enabled it.
Disable payload capture in shared development environments by default.

Deleting old traces is part of the design, not housekeeping. A useful default is to expire development traces after hours or days, not months.
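
An explicit cleanup command could be built on a helper like the one sketched below. It assumes traces are stored as .ndjson files in a single directory and uses file modification time as the age signal; cleanupTraces and isExpired are hypothetical names.

```typescript
import { readdir, stat, unlink } from 'node:fs/promises';
import { join } from 'node:path';

// Pure age check, kept separate so it can be tested without touching disk.
function isExpired(
  modifiedAtMs: number,
  nowMs: number,
  maxAgeHours: number,
): boolean {
  return nowMs - modifiedAtMs > maxAgeHours * 60 * 60 * 1000;
}

// Deletes expired .ndjson files in one directory and reports what was removed.
async function cleanupTraces(
  directory: string,
  maxAgeHours: number,
  nowMs: number = Date.now(),
): Promise<string[]> {
  const removed: string[] = [];

  for (const name of await readdir(directory)) {
    if (!name.endsWith('.ndjson')) continue;

    const filePath = join(directory, name);
    const info = await stat(filePath);
    if (!info.isFile() || !isExpired(info.mtimeMs, nowMs, maxAgeHours)) continue;

    await unlink(filePath);
    removed.push(filePath);
  }

  return removed;
}
```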

Build a Safe CI Workflow

CI is controlled infrastructure, but it is not necessarily private. Logs and artifacts may be available to repository contributors, external pull requests, support personnel, or integrated services.

A safer CI flow is:

test fails
  -> metadata trace stays in the isolated workspace
  -> export policy creates a reduced artifact
  -> secret and sensitive-data scans run
  -> artifact uploads only when policy passes
  -> artifact expires after a short retention window

Avoid printing raw traces to the build log. Apply extra restrictions to workflows triggered from forks, and make artifact access and retention explicit in repository settings.

Use Hosted Observability Intentionally

Production systems often need centralized metrics, alerts, trace correlation, and team-wide dashboards. Local-first tracing can coexist with those needs.

A practical split is:

Development: Metadata is stored locally; temporary payload capture is opt-in.
Pull requests: Only reduced, scanned artifacts are shared with reviewers.
Production: Approved metadata is exported to the observability platform; restricted payload capture uses a separate incident process.

Before exporting agent telemetry, review the destination’s access model, retention controls, deletion behavior, regional storage, subprocessors, and incident response process. Ensure the export matches your organization’s security and privacy requirements.

Questions to Answer Before Enabling Tracing

What data can each capture mode collect?
Does the SDK upload anything automatically?
Can raw prompts, outputs, and tool payloads be disabled?
Where are local and CI traces physically stored?
Who can access them, and how is that access revoked?
How quickly are traces deleted?
What policy controls export to a shared platform?
Can developers verify the exact artifact before external sharing?
What happens if redaction or scanning fails?

If the team cannot answer these questions, it is too early to enable full-fidelity tracing.

Final Thought

Observability should explain an agent’s execution without becoming an accidental copy of everything the agent touched. Start with typed metadata, keep the initial trace inside a controlled boundary, and make richer capture temporary and explicit.

The principle is simple: capture the minimum, inspect locally, and export intentionally.

The next article in this series will go deeper into metadata-only tracing: how to preserve execution structure, latency, token usage, retries, and error context without storing raw prompts or tool payloads.

Token Drift Explained: Why Your Agent Gets Slower and More Expensive

Raju Dandigam — Fri, 17 Jul 2026 00:20:06 +0000

Your agent feels fast during a demo. Then a real session reaches twenty turns, several tools have returned large payloads, and every response starts taking longer and costing more.

This pattern is often called token drift: the effective input context grows as an agent carries more conversation history, tool output, retrieved documents, and state into each model call. The model itself is not gradually becoming less efficient. The application is asking it to process more material on every turn.

Token drift is manageable, but only if context is treated as a budgeted system resource rather than an unlimited transcript.

What Token Drift Actually Means

Most conversational agents build each request from several sources: a system prompt, tool definitions, recent messages, retrieved context, durable memory, and sometimes a summary of older work. Whether the application sends that state on every request or a provider manages part of it, the model still has an effective context to process.

Consider an illustrative session:

Turn	Effective input	What changed
1	1,200 tokens	System prompt, tools, and one user message
8	6,900 tokens	Conversation history and two tool results
20	18,400 tokens	More history, retrieved documents, and accumulated state

The exact price and latency depend on the model, provider, cache behavior, and workload. The important signal is the trend: later calls repeatedly process a larger context.

Why Agents Accumulate Tokens So Quickly

Conversation history is only one source of growth. Production agents often accumulate tokens in several places at once:

Repeated transcripts: Every prior user and assistant message remains in context.
Tool schemas: Large tool descriptions and JSON schemas may be attached to every model call.
Tool results: Search results, stack traces, database rows, and API responses can be much larger than the user's request.
Retrieved documents: RAG pipelines sometimes add too many chunks or keep stale retrieval results across turns.
Retries and repairs: Failed tool calls and validation errors may be appended even after they stop being useful.
Duplicated memory: A summary, structured state, and raw transcript may all contain the same facts.

This is why limiting the number of chat messages is useful but incomplete. A ten-message conversation can still be expensive if one tool returned 30,000 tokens of JSON.

The Cost Curve

Suppose a request has a fixed cost of s tokens for system instructions and tools, and each turn adds roughly m tokens of new history. The input on turn t is approximately:

input(t) = s + (t * m)

Per-turn input grows roughly linearly. However, the cumulative input processed across n turns is approximately:

session input = (n * s) + m * n * (n + 1) / 2

That second term is quadratic in n. This distinction matters: each individual request grows only linearly and is never exponentially larger than the last, but repeatedly resending a growing transcript makes the total session input rise roughly with the square of the number of turns.

Prompt caching can reduce the cost of repeated prefixes on providers that support it. It does not remove context-window limits, irrelevant-history problems, or the need to control tool and retrieval payloads.
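
The two formulas can be checked with a short calculation. The values used in the note below (s = 1,000 and m = 500) are illustrative assumptions, not measurements.

```typescript
// Per-turn input: input(t) = s + t * m.
function inputOnTurn(s: number, m: number, t: number): number {
  return s + t * m;
}

// Closed form for the session: n * s + m * n * (n + 1) / 2.
function sessionInput(s: number, m: number, n: number): number {
  return n * s + (m * n * (n + 1)) / 2;
}

// The same total computed by summing each turn, used as a cross-check.
function sessionInputBySum(s: number, m: number, n: number): number {
  let total = 0;
  for (let t = 1; t <= n; t += 1) {
    total += inputOnTurn(s, m, t);
  }
  return total;
}
```

With those values, turn 20 sends 11,000 input tokens, a 20-turn session processes 125,000 in total, and a 40-turn session processes 450,000. Doubling the number of turns multiplies the session input by 3.6.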

Measure Drift Before You Trim

Use the token usage returned by the model API whenever possible. Character-based estimates are acceptable for admission control, but they are not a reliable billing metric across models or languages.

Track usage per model call, not only per user request. One agent turn may contain planning, tool calls, retries, and a final response.

type UsageSample = {
  turn: number;
  step: string;
  inputTokens: number;
  outputTokens: number;
  cachedInputTokens?: number;
  durationMs: number;
};

function averageInputGrowth(samples: UsageSample[]): number {
  if (samples.length < 2) return 0;

  const byTurn = new Map<number, number>();
  for (const sample of samples) {
    byTurn.set(
      sample.turn,
      (byTurn.get(sample.turn) ?? 0) + sample.inputTokens,
    );
  }

  const totals = [...byTurn.entries()]
    .sort(([a], [b]) => a - b)
    .map(([, tokens]) => tokens);

  if (totals.length < 2) return 0;

  const deltas = totals.slice(1).map((value, index) => {
    return value - totals[index];
  });

  return Math.round(
    deltas.reduce((sum, delta) => sum + delta, 0) / deltas.length,
  );
}

A useful trace should also record the model, operation, tool name, retry count, retrieval chunk count, and whether cached input was used. Avoid recording sensitive prompt content unless your privacy policy explicitly permits it.

Use a Context Budget, Not a Message Limit

A sliding window is a reasonable fallback, but a token budget is the stronger control because message sizes vary dramatically. Reserve space for the output and for any tool calls the model may make, then allocate the remainder to fixed context and recent turns.

Treat a complete agent turn as the unit of removal. Dropping an individual tool result while keeping the assistant message that refers to it can leave an invalid or confusing transcript.

type AgentMessage = {
  role: 'user' | 'assistant' | 'tool';
  content: string;
};

type ConversationTurn = {
  id: string;
  messages: AgentMessage[];
  tokenCount: number;
};

type ContextPlan = {
  turns: ConversationTurn[];
  omittedTurnIds: string[];
  estimatedInputTokens: number;
};

type ContextBudget = {
  maxContextTokens: number;
  reservedOutputTokens: number;
  systemPromptTokens: number;
  toolSchemaTokens: number;
  summaryTokens: number;
  retrievedContextTokens: number;
};

function planContext(
  turns: ConversationTurn[],
  budget: ContextBudget,
): ContextPlan {
  const fixedTokens =
    budget.systemPromptTokens +
    budget.toolSchemaTokens +
    budget.summaryTokens +
    budget.retrievedContextTokens;

  let remaining =
    budget.maxContextTokens -
    budget.reservedOutputTokens -
    fixedTokens;

  if (remaining <= 0) {
    throw new Error('Fixed context exceeds the available input budget');
  }

  const selected: ConversationTurn[] = [];

  for (let index = turns.length - 1; index >= 0; index -= 1) {
    const turn = turns[index];

    if (turn.tokenCount > remaining) break;

    selected.unshift(turn);
    remaining -= turn.tokenCount;
  }

  if (selected.length === 0 && turns.length > 0) {
    throw new Error('The latest turn is larger than the context budget');
  }

  const selectedIds = new Set(selected.map((turn) => turn.id));
  const omittedTurnIds = turns
    .filter((turn) => !selectedIds.has(turn.id))
    .map((turn) => turn.id);

  return {
    turns: selected,
    omittedTurnIds,
    estimatedInputTokens:
      fixedTokens +
      selected.reduce((sum, turn) => sum + turn.tokenCount, 0),
  };
}

The token counts in this example should come from the provider's tokenizer or a compatible tokenizer for the chosen model. Recalculate the plan whenever tools, retrieval results, or the summary changes.

Separate Conversation From State

Not every fact belongs in the transcript. Long-running agents become more reliable when they store important state explicitly:

type TaskState = {
  objective: string;
  constraints: string[];
  decisions: Array<{
    choice: string;
    reason: string;
  }>;
  openQuestions: string[];
  artifactIds: string[];
};

Structured state is compact, inspectable, and easier to validate than a prose summary. It also prevents a critical decision from disappearing simply because an old message fell outside the recent window.
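One way to put structured state into a prompt is to render it as a short, deterministic text block. The sketch below assumes the TaskState shape shown earlier; renderTaskState and the example values are hypothetical.

```typescript
type TaskState = {
  objective: string;
  constraints: string[];
  decisions: Array<{ choice: string; reason: string }>;
  openQuestions: string[];
  artifactIds: string[];
};

// Hypothetical helper: renders state as a compact text block.
// Empty sections are skipped so they cost no tokens.
function renderTaskState(state: TaskState): string {
  const section = (title: string, items: string[]): string[] =>
    items.length === 0 ? [] : [`${title}:`, ...items.map((item) => `- ${item}`)];

  return [
    `Objective: ${state.objective}`,
    ...section("Constraints", state.constraints),
    ...section(
      "Decisions",
      state.decisions.map((d) => `${d.choice} (reason: ${d.reason})`),
    ),
    ...section("Open questions", state.openQuestions),
    ...section("Artifacts", state.artifactIds),
  ].join("\n");
}

// Illustrative state; every value here is made up for the example.
const exampleState: TaskState = {
  objective: "Migrate the billing report to the new schema",
  constraints: ["Do not modify production data without approval"],
  decisions: [{ choice: "Use a staging copy", reason: "Approval is pending" }],
  openQuestions: [],
  artifactIds: ["report-draft-1"],
};

console.log(renderTaskState(exampleState));
```

Because the rendering is deterministic, the same state always produces the same text, which makes it easy to assert on in tests.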

A practical context often has four layers:

Stable instructions: The system prompt and safety constraints.
Structured task state: Goals, decisions, identifiers, and unresolved work.
Compressed history: A summary of older turns.
Recent turns: The verbatim messages and tool exchanges needed for continuity.

This layered design is usually more useful than importance-scoring individual messages. Heuristic scores can be difficult to explain, and they may discard a quiet but essential constraint.

Summarize Deliberately

Summarization is valuable when sessions last long enough that a recent window loses necessary context. It should not run blindly on every turn.

Create or refresh a summary only when older turns cross a threshold. Ask the summarizer to preserve facts, decisions, constraints, failures, unresolved questions, and references to external artifacts. Store the summary separately from the raw transcript so it can be reviewed or regenerated.

Summaries are lossy and can introduce errors. Keep canonical values such as account IDs, approval state, financial amounts, and file paths in structured storage rather than trusting generated prose.
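The refresh rule can be a small pure function. This is a minimal sketch: the SummaryPolicy fields and the threshold values are assumptions, and the instruction text simply restates the preservation list above.

```typescript
type OlderTurn = { id: string; tokenCount: number };

// Hypothetical policy: refresh when either threshold is crossed.
type SummaryPolicy = {
  minOlderTurns: number;
  minOlderTokens: number;
};

function shouldRefreshSummary(
  unsummarized: OlderTurn[],
  policy: SummaryPolicy,
): boolean {
  const tokens = unsummarized.reduce((sum, turn) => sum + turn.tokenCount, 0);
  return (
    unsummarized.length >= policy.minOlderTurns ||
    tokens >= policy.minOlderTokens
  );
}

// What the summarizer is asked to keep.
const PRESERVE = [
  "facts",
  "decisions",
  "constraints",
  "failures",
  "unresolved questions",
  "references to external artifacts",
];

function buildSummaryInstruction(): string {
  return `Summarize the older turns. Preserve: ${PRESERVE.join(", ")}.`;
}

// Assumed thresholds for illustration only.
const summaryPolicy: SummaryPolicy = { minOlderTurns: 8, minOlderTokens: 4000 };

console.log(
  shouldRefreshSummary(
    [
      { id: "t1", tokenCount: 300 },
      { id: "t2", tokenCount: 500 },
    ],
    summaryPolicy,
  ),
); // false
console.log(
  shouldRefreshSummary(
    [
      { id: "t1", tokenCount: 3800 },
      { id: "t2", tokenCount: 500 },
    ],
    summaryPolicy,
  ),
); // true
```

Keeping the decision separate from the summarizer call means the threshold can be unit-tested without a model.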

Bound Tools and Retrieval Too

Many apparent memory problems are actually payload problems. A useful context policy should also:

Send only the tool definitions available in the current state.
Replace large tool results with a compact, typed projection.
Store full results externally and keep an ID or URL in context.
Limit retrieval by both relevance and token budget.
Deduplicate overlapping chunks before adding them to the prompt.
Remove failed retries once their diagnostic value has expired.

For example, an agent rarely needs an entire database response. It may need five selected fields, the result count, and an identifier it can use to fetch more data later.
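A compact projection can be sketched as a generic helper that keeps selected fields, the total count, and a handle for the full result. The names projectRows, ToolProjection, and OrderRow are hypothetical, and the sketch assumes the full result is stored elsewhere under resultId.

```typescript
type ToolProjection<Row> = {
  resultId: string; // handle the agent can use to fetch more data later
  totalCount: number; // size of the full result, not of the projection
  rows: Row[];
};

function projectRows<T extends object, K extends keyof T>(
  resultId: string,
  rows: T[],
  fields: K[],
  maxRows: number,
): ToolProjection<Pick<T, K>> {
  const projected = rows.slice(0, maxRows).map((row) => {
    const compactRow = {} as Pick<T, K>;
    for (const field of fields) {
      compactRow[field] = row[field];
    }
    return compactRow;
  });

  return { resultId, totalCount: rows.length, rows: projected };
}

// Illustrative rows standing in for a large database response.
type OrderRow = {
  id: string;
  customer: string;
  total: number;
  internalNotes: string;
};

const rowsFromDb: OrderRow[] = [
  { id: "o-1", customer: "Ada", total: 120, internalNotes: "long text" },
  { id: "o-2", customer: "Lin", total: 80, internalNotes: "long text" },
  { id: "o-3", customer: "Sam", total: 45, internalNotes: "long text" },
];

const compactOrders = projectRows("result-42", rowsFromDb, ["id", "total"], 2);
console.log(JSON.stringify(compactOrders));
```

The model sees two small rows and learns that three exist, while the unused columns never enter the context.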

Choose a Strategy by Workload

Workload	Good starting strategy
Short support chat	Recent-turn window with a hard input budget
Tool-heavy workflow	Compact tool schemas and bounded results
Long research session	Structured state, summary, and recent turns
Knowledge assistant	Token-budgeted retrieval plus recent turns
High-stakes workflow	Structured state with validation and auditable summaries

Most systems do not need a sophisticated memory-ranking algorithm on day one. They need visibility, a budget, and clear rules for what may enter context.

Production Quality Gates

Context reduction is successful only if it lowers resource use without damaging task quality. Compare the policy against representative sessions and track:

Input tokens per turn and per completed task
Cached versus uncached input tokens
Model-call count, retries, and tool failures
Median and tail latency
Task completion and human correction rates
Missing facts or constraints after compaction
Summary refresh frequency and cost

Run deterministic regression cases for critical facts and tool sequences. For example, verify that an agent still remembers an approval constraint after the originating turn has moved into summarized history.
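A regression case like that needs no model call. The sketch below uses a hypothetical compact function that keeps constraints in structured state and replaces older turns with a placeholder summary; the refund rule is an invented example.

```typescript
type Turn = { id: string; text: string };

type CompactedContext = {
  constraints: string[]; // structured state, never trimmed
  summary: string; // stands in for summarized older turns
  recentTurns: Turn[]; // verbatim tail of the conversation
};

// Hypothetical policy: keep the last `keep` turns verbatim.
function compact(
  turns: Turn[],
  constraints: string[],
  keep: number,
): CompactedContext {
  const cut = Math.max(0, turns.length - keep);
  const older = turns.slice(0, cut);
  return {
    constraints,
    summary: older.length > 0 ? `${older.length} earlier turns summarized` : "",
    recentTurns: turns.slice(cut),
  };
}

// Flattens the context into the text the model would receive.
function renderContext(compacted: CompactedContext): string {
  return [
    ...compacted.constraints.map((c) => `Constraint: ${c}`),
    compacted.summary,
    ...compacted.recentTurns.map((t) => t.text),
  ]
    .filter((line) => line.length > 0)
    .join("\n");
}

// The approval rule was stated in turn 1, which has left the recent window.
const approvalRule = "Refunds above 500 USD need manager approval";
const sessionTurns: Turn[] = Array.from({ length: 12 }, (_, i) => ({
  id: `t${i + 1}`,
  text: i === 0 ? approvalRule : `Message ${i + 1}`,
}));

const compacted = compact(sessionTurns, [approvalRule], 4);
const renderedContext = renderContext(compacted);

console.log(compacted.recentTurns.some((t) => t.id === "t1")); // false
console.log(renderedContext.includes(approvalRule)); // true
```

The same shape works with a real summarizer if the test asserts on the structured state rather than on generated prose.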

A Practical Baseline

Start with a simple policy: measure real usage, reserve output headroom, cap the input context, keep structured task state, retain a small window of complete recent turns, and summarize older history only when necessary. Bound tool and retrieval payloads independently.

Token drift is not mysterious model degradation. It is accumulated application state. Once that state is measured and budgeted, cost and latency become predictable engineering trade-offs instead of late-session surprises.

The Hidden Cost of AI Agents: Tokens, Tools, Retries, and Latency

Raju Dandigam — Wed, 15 Jul 2026 22:48:10 +0000

AI agents look simple at first.

You take a model, add a prompt, maybe connect a tool, and it works. It feels like you are just making one API call and getting an answer.

That illusion disappears the moment the agent starts doing real work.

In production, an agent is not one call. It is a loop. It may call the model multiple times, retrieve memory, execute tools, retry on failure, and refine its own output.

That is where cost shows up.

Not just in tokens, but in latency, complexity, and system load.

The mental model most people miss

Most tutorials present agents like this:

User → Model → Response

Real systems look more like this:

User → Runtime → Model → Tools → Memory → Validation → Model → Response

And that flow may repeat several times.

Each step adds cost.

Not just money. Time, complexity, and failure risk.

1. Token cost is not just one request

The first hidden cost is tokens.

Developers often assume a single request with a fixed prompt. In reality, an agent may call the model multiple times within one run.

For example:

One call to decide what to do
One call after retrieving memory
One call after tool execution
One call to generate the final response

Each call includes input tokens and output tokens.

If you also include memory retrieval, the prompt gets larger over time. That increases token usage even further.

A simple way to think about it:

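// Rough estimate: assumes every call sends and returns a similar number of tokens
const estimateTotalTokens = (numberOfCalls: number, inputTokens: number, outputTokens: number) =>
  numberOfCalls * (inputTokens + outputTokens);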

The problem is that numberOfCalls is not always predictable. It depends on how the agent behaves.
This is why cost can grow faster than expected.

2. Tool calls are not free

Tool usage is often treated as a free extension of the model. It is not.

Each tool call adds:

Network latency
External API cost
Failure scenarios
Additional model calls after execution

For example, a simple flow might look like:

Model decides → Call tool → Tool responds → Model interprets → Continue

Even if the tool itself is cheap, the surrounding orchestration is not.
In many cases, the cost of using a tool is not the tool itself, but the extra model calls and latency it introduces.

3. Retries multiply everything

Retries are where costs start to compound.
If a tool fails, the agent may retry. If the model returns invalid output, the system may retry. If validation fails, the system may retry again.
Each retry is not just one extra call. It repeats the entire step.
A simple retry loop might look like this:

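// Assumes callTool(input) is defined elsewhere and rejects on failure
async function callToolWithRetry(input: unknown, maxAttempts = 3) {
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await callTool(input);
    } catch (error) {
      lastError = error; // remember why this attempt failed
    }
  }
  // Surface the failure instead of silently returning undefined
  throw lastError;
}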

Now imagine this happening inside an agent loop.

One failure can lead to:

Multiple tool calls
Multiple model calls
Longer execution time

Retries are necessary, but without limits, they become expensive quickly.

4. Latency grows with each step

Latency is often underestimated.

Each model call takes time. Each tool call adds network delay. Each retry increases total execution time.

Even if each step is fast, the combined latency can be noticeable.

A simple breakdown:

Model call: 500ms to 2s
Tool call: 200ms to 1s
Memory retrieval: 100ms to 300ms

Now combine them across multiple steps.

An agent that feels “instant” in a demo can easily take several seconds in production.

This is not always a problem, but it becomes important for user experience.
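To make the combination concrete, here is a small sketch that adds up the ranges from the breakdown for one run. The run shape (four model calls, one tool call, one memory lookup) is an assumption, and the steps are assumed to run sequentially.

```typescript
type LatencyRange = { minMs: number; maxMs: number };

// Ranges quoted in the breakdown.
const MODEL_CALL: LatencyRange = { minMs: 500, maxMs: 2000 };
const TOOL_CALL: LatencyRange = { minMs: 200, maxMs: 1000 };
const MEMORY_RETRIEVAL: LatencyRange = { minMs: 100, maxMs: 300 };

type StepCounts = {
  modelCalls: number;
  toolCalls: number;
  memoryLookups: number;
};

// Sequential steps: best and worst cases simply add up.
function estimateRunLatency(counts: StepCounts): LatencyRange {
  const total = (pick: (range: LatencyRange) => number): number =>
    counts.modelCalls * pick(MODEL_CALL) +
    counts.toolCalls * pick(TOOL_CALL) +
    counts.memoryLookups * pick(MEMORY_RETRIEVAL);

  return { minMs: total((r) => r.minMs), maxMs: total((r) => r.maxMs) };
}

// Assumed run shape: four model calls, one tool call, one memory lookup.
const latencyEstimate = estimateRunLatency({
  modelCalls: 4,
  toolCalls: 1,
  memoryLookups: 1,
});

console.log(latencyEstimate); // { minMs: 2300, maxMs: 9300 }
```

Even the best case is over two seconds, and the worst case is over nine, before any retry is counted.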

5. Memory adds both value and cost

Memory is powerful, but it is not free.

Every time you retrieve memory, you add:

Query cost (vector search or database lookup)
Additional tokens (more context sent to the model)
Complexity in prompt construction

If memory is not filtered carefully, it can:

Increase token usage significantly
Add irrelevant context
Reduce model accuracy

The key is not more memory, but better memory.

Retrieve only what is relevant to the current goal.

6. Reflection and “thinking” steps increase usage

Many modern agent patterns include reflection.

The agent may:

Evaluate its own output
Re-plan its next step
Summarize intermediate results

These patterns improve quality, but they also add more model calls.

For example:

Model → Draft response
Model → Critique response
Model → Refine response

This can double or triple token usage.
Reflection is useful, but it should be used intentionally.

7. Cost is not just money

When people talk about cost, they usually mean API pricing.

In practice, cost also includes:

Latency. Slow agents degrade the user experience.
System load. More calls mean more infrastructure usage.
Failure surface. More steps increase the chance of something going wrong.
Debugging complexity. More moving parts make issues harder to trace.

This is why cost should be treated as an architectural concern, not just a billing concern.

A simple way to think about it

If I had to simplify everything into one idea, it would be this:

Each new capability multiplies cost.

More tools → more calls
More memory → more tokens
More retries → more loops
More reasoning → more model usage

Agents do not scale linearly. They scale multiplicatively.
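A rough way to see the multiplication is to count worst-case calls from the loop limits. This is a simplified model under stated assumptions: every step makes one model call plus any reflection passes, and every step uses all of its tool attempts. The LoopLimits fields are hypothetical.

```typescript
type LoopLimits = {
  maxSteps: number; // agent loop iterations
  toolAttemptsPerStep: number; // one initial call plus retries
  reflectionPasses: number; // extra critique or refine calls per step
};

function worstCaseCalls(limits: LoopLimits) {
  const modelCalls = limits.maxSteps * (1 + limits.reflectionPasses);
  const toolCalls = limits.maxSteps * limits.toolAttemptsPerStep;
  return { modelCalls, toolCalls, total: modelCalls + toolCalls };
}

// Five steps, no retries, no reflection.
const baseline = worstCaseCalls({
  maxSteps: 5,
  toolAttemptsPerStep: 1,
  reflectionPasses: 0,
});

// Same five steps with up to three tool attempts and two reflection passes.
const withRetriesAndReflection = worstCaseCalls({
  maxSteps: 5,
  toolAttemptsPerStep: 3,
  reflectionPasses: 2,
});

console.log(baseline.total); // 10
console.log(withRetriesAndReflection.total); // 30
```

The step count did not change, yet the worst case tripled because each added capability multiplies the per-step work.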

What actually works in practice

In real TypeScript systems, I try to keep things controlled.

Limit the number of steps. Do not let the agent run indefinitely.
Keep prompts small. Only include relevant context.
Use tools intentionally. Do not expose everything.
Set retry limits. Avoid infinite loops.
Track usage. Measure tokens, latency, and failures.

These are not optimizations. They are guardrails.
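Several of these guardrails can live in one small object that the runtime consults on every step. The sketch below is one possible shape; RunBudget, RunLimits, and the limit values are hypothetical, and the token numbers stand in for usage reported by a real model call.

```typescript
type RunLimits = { maxSteps: number; maxTokens: number; maxDurationMs: number };

class RunBudget {
  private steps = 0;
  private tokens = 0;
  private readonly limits: RunLimits;
  private readonly now: () => number;
  private readonly startedAt: number;

  // The clock is injectable so time limits can be tested deterministically.
  constructor(limits: RunLimits, now: () => number = () => Date.now()) {
    this.limits = limits;
    this.now = now;
    this.startedAt = now();
  }

  // Call once per agent step, before doing the work.
  startStep(): void {
    this.steps += 1;
    if (this.steps > this.limits.maxSteps) {
      throw new Error(`Step limit exceeded (${this.limits.maxSteps})`);
    }
    if (this.now() - this.startedAt > this.limits.maxDurationMs) {
      throw new Error(`Time limit exceeded (${this.limits.maxDurationMs}ms)`);
    }
  }

  // Call after each model response with the usage the API reported.
  recordTokens(inputTokens: number, outputTokens: number): void {
    this.tokens += inputTokens + outputTokens;
    if (this.tokens > this.limits.maxTokens) {
      throw new Error(`Token limit exceeded (${this.limits.maxTokens})`);
    }
  }
}

const budget = new RunBudget({ maxSteps: 3, maxTokens: 10000, maxDurationMs: 30000 });
let stoppedBecause = "";

try {
  for (;;) {
    budget.startStep();
    budget.recordTokens(1200, 300); // stand-in for a real model call
  }
} catch (error) {
  stoppedBecause = error instanceof Error ? error.message : String(error);
}

console.log(stoppedBecause); // "Step limit exceeded (3)"
```

Throwing from the budget keeps the limit checks out of the agent logic, so the loop cannot forget to enforce them.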

The real takeaway

AI agents are powerful, but they are not free abstractions.
Every decision you add to the system has a cost. Every layer adds complexity. Every retry multiplies usage.
The goal is not to remove these features. The goal is to use them intentionally.
A simple, controlled agent that solves a specific problem is often more valuable than a complex agent that tries to do everything.
Because in production systems, efficiency and reliability matter more than flexibility.
And understanding the hidden cost is the first step toward building something that actually scales.

How to Use MCP in a Real TypeScript Agent (Minimal Example)

Raju Dandigam — Wed, 15 Jul 2026 05:54:23 +0000

MCP sounds complicated until you actually use it.

Most explanations talk about protocols, standards, and architecture diagrams. What developers really want is simpler: how does this look in a real agent?

This post walks through a minimal TypeScript example that shows where MCP fits in an actual system.

The goal is not to build a full framework. The goal is to understand the flow.

What we are building

We will build a simple agent that answers a question like:

“Find me flights to New York.”

The agent will:

Decide it needs a flight search tool
Call that tool through an MCP-style interface
Return the result

The important part is not the feature. It is the structure.

The architecture in one view

flowchart LR
    A[Agent Runtime] --> B[MCP Layer]
    B --> C[Tools]
    C --> D[Flight API]

The runtime decides what to do. MCP provides a consistent way to call tools. The tools do the actual work.

Step 1: Define a simple tool

First, define a tool the agent can use. In a real system, this could come from an MCP server. For now, we will define it locally.

type FlightSearchInput = {
  from: string;
  to: string;
};

const flightSearchTool = {
  name: "search_flights",
  execute: async (input: FlightSearchInput) => {
    return {
      flights: [
        { airline: "Delta", price: 320 },
        { airline: "United", price: 290 }
      ]
    };
  }
};

This is intentionally simple. The important part is that the tool has a name and an execute function.

Step 2: Create a minimal MCP layer

The MCP layer is just a consistent interface for calling tools.
In a real system, this would connect to external MCP servers. For this example, we simulate that behavior.

class MCPClient {
  private tools = new Map<string, any>();

  register(tool: any) {
    this.tools.set(tool.name, tool);
  }

  async callTool(name: string, input: unknown) {
    const tool = this.tools.get(name);

    if (!tool) {
      throw new Error(`Tool not found: ${name}`);
    }

    return tool.execute(input);
  }
}

This is the key idea. The agent does not call tools directly. It calls them through MCP.

Step 3: Add a simple model decision

We will simulate the model deciding what to do.
In a real system, this would be an LLM call returning structured output.

type Decision = {
  action: "call_tool" | "finish";
  toolName?: string;
  toolInput?: unknown;
};

async function decideNextStep(userInput: string): Promise<Decision> {
  if (userInput.includes("flight")) {
    return {
      action: "call_tool",
      toolName: "search_flights",
      toolInput: { from: "SFO", to: "NYC" }
    };
  }

  return { action: "finish" };
}

The important part is structure. The model returns a decision, not just text.

Step 4: Build the runtime

Now we connect everything.
The runtime asks for a decision, validates it, and executes the tool through MCP.

async function runAgent(userInput: string) {
  const mcp = new MCPClient();
  mcp.register(flightSearchTool);

  const decision = await decideNextStep(userInput);

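  if (decision.action === "call_tool") {
    // Validate the decision before executing anything
    if (!decision.toolName) {
      throw new Error("Decision is missing a tool name");
    }

    const result = await mcp.callTool(
      decision.toolName,
      decision.toolInput
    );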

    return {
      message: "Here are your flights",
      data: result
    };
  }

  return {
    message: "No action needed"
  };
}

That is the full flow.
The runtime decides what to do. MCP executes the tool. The result is returned.

What this example actually shows

This example is intentionally small, but it highlights the most important idea.
MCP is not the agent.
MCP is not the runtime.
MCP is not the decision layer.
MCP is the connection layer.
It standardizes how tools are called.
Everything else is still your responsibility.

What is missing (on purpose)

This example skips several things you will need in a real system.

Authorization. Not every user should be able to call every tool.
Validation. Tool inputs should be validated before execution.
Risk control. Some tools should require approval.
Retries. External tools can fail.
Observability. You need to know what happened during a run.
Memory. The agent should not be stateless.

These are not MCP problems. These are architecture problems.

Where MCP actually helps

MCP becomes valuable when your system grows.

Instead of manually wiring every tool, your agent can discover and call tools through a consistent interface.

That becomes especially useful when:

You have many tools across different services
You want reusable integrations
You are building multi-agent systems
You want to standardize tool access across teams

At that point, MCP reduces complexity.

Final takeaway

If you strip away the hype, MCP is simple.
It gives your agent a consistent way to reach tools.
It does not decide what to do.
It does not validate actions.
It does not make your system safe or reliable.
Your runtime still owns those responsibilities.
Once you understand that boundary, MCP becomes a very useful building block instead of a confusing abstraction.
And in real systems, that clarity matters more than anything else.

MCP for TypeScript Developers: What It Actually Solves Beyond the Hype

Raju Dandigam — Mon, 13 Jul 2026 14:14:37 +0000

MCP is one of the most talked-about ideas in AI right now.

If you read enough posts, it starts to sound like MCP is the missing piece that makes agents smarter, more capable, and production-ready.

It is not.

MCP does not make your agent smarter. It does not fix bad prompts. It does not give you memory, validation, or reliability.

What MCP actually does is much simpler, and much more important.

It standardizes how your agent connects to tools.

Once you see it that way, it becomes much easier to understand where MCP fits and where it does not.

The real problem MCP is trying to solve

Before MCP, connecting an agent to tools was messy.

Every integration looked different. Each API had its own format. Each tool had its own authentication, input shape, and execution pattern. If you wanted your agent to talk to five systems, you had to write five different integrations.

That made agents harder to build and harder to scale.

MCP solves this by introducing a consistent interface between agents and external capabilities.

Instead of writing custom glue code for every tool, the agent interacts with tools through a standard protocol. That makes tool discovery, invocation, and integration more predictable.

That is the real value of MCP.

Not intelligence. Not reasoning.

Just consistency.

MCP is plumbing, not intelligence

It helps to think of MCP the same way you think about HTTP.

HTTP did not make applications smarter. It made communication between systems consistent. It allowed browsers, servers, and APIs to talk to each other in a standard way.

MCP plays a similar role for AI agents.

It defines how tools are exposed and how agents can interact with them.

That is extremely useful, but it is also limited.

If your agent makes poor decisions, MCP will not fix that.

If your agent calls the wrong tool, MCP will not stop it.

If your agent produces invalid data, MCP will not validate it.

Those problems still belong to your application.

Where MCP fits in a TypeScript architecture

In a real TypeScript AI system, MCP sits at the boundary between your agent and external tools.

It is not the runtime. It is not the decision layer. It is not the memory system.

It is the connection layer.

A simple mental model looks like this:

Agent Runtime → MCP Layer → Tools / APIs / Services

The runtime decides what to do next. MCP provides a consistent way to execute that decision against external systems.

That is it. Everything else still belongs to your architecture.

Your application still owns control

One of the most common misconceptions is that MCP replaces the need for application logic. It does not.

Even if a tool is available through MCP, your system still needs to decide whether that tool should be used in the current context.

That includes:

Authorization. Is the user allowed to perform this action?
Validation. Is the input safe and well-formed?
Risk control. Does this action require human approval?
Retries. What happens if the tool fails?
Observability. How do you trace what the agent did?

MCP standardizes access, but it does not enforce rules.
Your application still owns those decisions.

Tools still need contracts

Even with MCP, tools should not be treated as open-ended capabilities.
In TypeScript, I still think of tools as contracts with defined inputs, outputs, and risk levels.

type Tool = {
  name: string;
  risk: "low" | "high";
  execute: (input: unknown) => Promise<unknown>;
};

MCP may expose a tool, but your system should still decide whether to register it, when to allow it, and how to validate its inputs.
The model can request a tool call.
The system decides whether to execute it.
That boundary does not change.
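That boundary can be expressed as a small gate between the model's request and the tool. The sketch below reuses the Tool type above; gateToolCall, the approval flag, and the book_flight tool are hypothetical and stand in for whatever policy your application applies.

```typescript
type Tool = {
  name: string;
  risk: "low" | "high";
  execute: (input: unknown) => Promise<unknown>;
};

type ToolRequest = { toolName: string; input: unknown };

type GateDecision =
  | { allowed: true; tool: Tool }
  | { allowed: false; reason: string };

// Hypothetical policy: unknown tools are rejected, high-risk tools need approval.
function gateToolCall(
  request: ToolRequest,
  registry: Map<string, Tool>,
  approvedByHuman: boolean,
): GateDecision {
  const tool = registry.get(request.toolName);

  if (!tool) {
    return { allowed: false, reason: `Unknown tool: ${request.toolName}` };
  }

  if (tool.risk === "high" && !approvedByHuman) {
    return { allowed: false, reason: `${tool.name} requires human approval` };
  }

  return { allowed: true, tool };
}

const toolRegistry = new Map<string, Tool>([
  [
    "search_flights",
    { name: "search_flights", risk: "low", execute: async () => ({ flights: [] }) },
  ],
  [
    "book_flight",
    { name: "book_flight", risk: "high", execute: async () => ({ booked: true }) },
  ],
]);

const bookingDecision = gateToolCall(
  { toolName: "book_flight", input: {} },
  toolRegistry,
  false,
);

console.log(bookingDecision);
// { allowed: false, reason: "book_flight requires human approval" }
```

The gate returns a decision instead of executing the tool, so the runtime can log the refusal or route it to an approval flow.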

MCP makes scale easier, not logic simpler

Where MCP really helps is scale. If your agent needs to integrate with multiple systems, MCP reduces the friction of connecting to those systems. It allows tools to be discovered and used in a consistent way.

That becomes especially useful when:

You are integrating with many internal services
You want reusable tool definitions across teams
You are building platforms where agents need access to shared capabilities
You want to avoid rewriting integration logic for every new agent

In these cases, MCP becomes a strong foundation.

But even at scale, it does not simplify your core logic.
Your runtime still needs to decide what to do.

A simple example in context

Think about a travel planning agent. The model decides it needs flight data. Through MCP, the agent can discover a flight search tool and call it in a standardized way. That is helpful.

But the system still needs to decide:

Is the user allowed to access this data?
Should this request be rate limited?
What happens if the tool returns incomplete data?
Should the agent retry or choose a different tool?
Should the result be validated before continuing?

MCP makes the connection easier. It does not answer these questions.

Where MCP fits with other tools

MCP is often discussed alongside frameworks like Vercel AI SDK, LangGraph, or OpenAI Agents SDK.

They solve different problems.

Vercel AI SDK helps with model interaction, streaming UI, and tool calling in TypeScript.
LangGraph helps with stateful workflows, branching logic, and long-running agent flows.
OpenAI Agents SDK provides structured primitives for agents, tools, and guardrails.

MCP fits below all of these.
It is the layer that standardizes how tools are exposed and consumed.
You still need the rest of the system.

When you should use MCP

MCP is a good fit when your system needs to connect to multiple tools in a consistent way.

It is especially useful when you are building platforms, internal tooling ecosystems, or multi-agent environments where shared capabilities matter.

If your agent only calls one or two APIs, MCP may not add much value yet.

If your system is growing and integrations are becoming messy, MCP becomes much more useful.

The real takeaway

MCP is not the missing piece that makes AI agents work. It is the missing piece that makes tool access consistent. That may not sound exciting, but it is exactly what many systems need. The mistake is expecting MCP to solve problems that belong elsewhere.

It will not fix reasoning.
It will not enforce safety.
It will not add memory.
It will not make your agent reliable.

Your architecture still needs to handle all of that. If you treat MCP as plumbing, it becomes very powerful. If you treat it as intelligence, it will disappoint you. In real systems, understanding that difference matters more than adopting any new standard.

5 Ways Your AI Agent Will Fail (And How to Prevent Them)

Raju Dandigam — Thu, 09 Jul 2026 16:35:00 +0000

Your agent works in testing. Then you deploy it and things break in ways you didn't expect. Here are five failure modes I've seen repeatedly, with TypeScript code to prevent each one.

1. The Infinite Loop

What happens:
Your agent calls a tool, doesn't get the result it wants, so it tries again. And again. And again. Overnight, you've made 10,000 API calls.

Why it happens:
No termination condition. The agent keeps trying until it "succeeds," but success is poorly defined or impossible to achieve.

Real scenario:

// Agent tries to search for "the perfect answer"
// Each search doesn't satisfy it, so it keeps searching
// 1000 searches later, still going...

The fix: Max iterations + success criteria

// Assumed shape of one step's result; adjust it to your agent
type StepResult = {
 status: 'success' | 'in_progress';
 confidence?: number;
 finalAnswer?: string | null;
};

abstract class BudgetedAgent {
 private readonly maxCalls = 10;

 // Supplied by the concrete agent: runs one model or tool step
 protected abstract executeStep(task: string): Promise<StepResult>;

 async run(task: string): Promise<StepResult> {
   for (let callCount = 1; callCount <= this.maxCalls; callCount++) {
     const response = await this.executeStep(task);

     // Check if we're done
     if (this.isComplete(response)) {
       return response;
     }
   }

   // Budget exhausted: fail loudly instead of looping forever
   throw new Error(
     `Max iterations (${this.maxCalls}) reached without completion`
   );
 }

 private isComplete(response: StepResult): boolean {
   // Define clear completion criteria.
   // `!= null` rejects undefined too, so a missing finalAnswer never counts as done.
   return (
     response.status === 'success' ||
     (response.confidence ?? 0) > 0.8 ||
     response.finalAnswer != null
   );
 }
}

Better: Track why it's looping

class SmartAgent {
 private attempts: Map<string, number> = new Map();
 private maxRetries = 3;


 async executeWithRetry(tool: string, args: any): Promise<any> {
   const key = `${tool}:${JSON.stringify(args)}`;
   const attemptCount = this.attempts.get(key) || 0;


   if (attemptCount >= this.maxRetries) {
     throw new Error(
       `Tool ${tool} failed after ${this.maxRetries} attempts with same args`
     );
   }


   this.attempts.set(key, attemptCount + 1);


   try {
     return await this.executeTool(tool, args);
   } catch (error) {
     // Log the failure reason
     console.error(`Attempt ${attemptCount + 1} failed:`, error);
     throw error;
   }
 }
}

Lesson: Always set hard limits. Don't trust the agent to know when to stop.

2. Context Window Overflow

What happens:
Your agent starts fast. After 20 turns, responses take 10 seconds and cost significantly more. Eventually, it crashes with "context length exceeded."

Why it happens:
Every API call sends the entire conversation history. As history grows, so does input token count. Eventually, you hit the model's context limit.

Real example:

Turn 1:  100 tokens input →  50 tokens output = 150 total
Turn 5:  500 tokens input → 200 tokens output = 700 total
Turn 10: 1200 tokens input → 400 tokens output = 1600 total
Turn 20: 4000 tokens input → context limit exceeded

The fix: Sliding window

interface Message {
 // 'system' is included because the manager preserves a leading system message
 role: 'system' | 'user' | 'assistant';
 content: string;
}


class ContextManager {
 private messages: Message[] = [];
 private maxMessages = 20;
 private preserveSystemPrompt = true;


 add(message: Message) {
   this.messages.push(message);


   // Keep within limit
   if (this.messages.length > this.maxMessages) {
     // Always keep the first message if it's system context
     if (this.preserveSystemPrompt) {
       const systemMessage = this.messages[0];
       this.messages = [
         systemMessage,
         ...this.messages.slice(-this.maxMessages + 1),
       ];
     } else {
       this.messages = this.messages.slice(-this.maxMessages);
     }
   }
 }


 getMessages(): Message[] {
   return this.messages;
 }


 estimateTokens(): number {
   // Rough estimate: ~4 characters per token
   const totalChars = this.messages.reduce(
     (sum, msg) => sum + msg.content.length,
     0
   );
   return Math.ceil(totalChars / 4);
 }
}

Better: Token-aware trimming

class TokenAwareContext {
 private messages: Message[] = [];
 private maxTokens = 100000;


 add(message: Message) {
   this.messages.push(message);
   this.trim();
 }


 private trim() {
   while (this.estimateTokens() > this.maxTokens && this.messages.length > 1) {
     // Remove the oldest message after index 0, which is kept as the system context
     this.messages.splice(1, 1);
   }
 }


 private estimateTokens(): number {
   return this.messages.reduce((sum, msg) => {
     return sum + Math.ceil(msg.content.length / 4);
   }, 0);
 }


 getMessages(): Message[] {
   return this.messages;
 }
}

Lesson: Context is not infinite. Manage it actively.

3. Hallucinated Tool Names

What happens:
The agent tries to call search_databse instead of search_database. Your code crashes because the tool doesn't exist. Or worse, it silently fails and the agent keeps trying.

Why it happens:
LLMs sometimes misspell tool names, especially if:

The name is long or complex
Similar tool names exist
The context is long (mistakes become more likely late in a long conversation)

Real examples I've seen:

search_database → searchDatabase (different casing)
get_user_data → getUserData → get-user-data (inconsistent naming)
calculate_sum → calculate_total (synonym hallucination)

The fix: Type-safe tool registry

const TOOLS = {
 search_database: {
   description: 'Search the database',
   handler: async (args: any) => { /* ... */ },
 },
 send_email: {
   description: 'Send an email',
   handler: async (args: any) => { /* ... */ },
 },
} as const;


type ToolName = keyof typeof TOOLS;


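function isToolName(name: string): name is ToolName {
 // Own-property check: `in` would also accept inherited keys such as "toString"
 return Object.prototype.hasOwnProperty.call(TOOLS, name);
}


async function invokeTool(name: string, args: any): Promise<any> {
 // Check if tool exists
 if (!isToolName(name)) {
   throw new Error(`Tool "${name}" does not exist`);
 }


 return TOOLS[name].handler(args);
}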

Better: Fuzzy matching with suggestions

import { distance } from 'fastest-levenshtein';


class ToolRegistry {
 private tools = new Map<string, Tool>();


 register(name: string, tool: Tool) {
   this.tools.set(name, tool);
 }


 async invoke(name: string, args: any): Promise<any> {
   // Exact match
   if (this.tools.has(name)) {
     return await this.tools.get(name)!.execute(args);
   }


   // Find similar names
   const similar = this.findSimilar(name, 2); // max edit distance of 2


   if (similar.length === 1) {
     console.warn(`Tool "${name}" not found. Using "${similar[0]}" instead.`);
     return await this.tools.get(similar[0])!.execute(args);
   }


   if (similar.length > 1) {
     throw new Error(
       `Tool "${name}" not found. Did you mean: ${similar.join(', ')}?`
     );
   }


   throw new Error(
     `Tool "${name}" not found. Available tools: ${Array.from(this.tools.keys()).join(', ')}`
   );
 }


 private findSimilar(name: string, maxDistance: number): string[] {
   const candidates = Array.from(this.tools.keys())
     .map(key => ({ key, distance: distance(key, name) }))
     .filter(item => item.distance <= maxDistance)
     .sort((a, b) => a.distance - b.distance)
     .map(item => item.key);


   return candidates;
 }
}

Lesson: Don't trust the LLM to spell tool names correctly. Validate and suggest.
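Casing and separator variants like the ones listed earlier can often be resolved before any fuzzy matching, by normalizing the requested name to snake_case. This is a minimal sketch; normalizeToolName, resolveToolName, and the KNOWN_TOOLS set are hypothetical, and synonym errors such as calculate_total are deliberately not corrected.

```typescript
// Converts camelCase, kebab-case, and spaced names to snake_case.
function normalizeToolName(requested: string): string {
  return requested
    .trim()
    .replace(/([a-z0-9])([A-Z])/g, "$1_$2") // searchDatabase -> search_Database
    .replace(/[-\s]+/g, "_") // get-user-data -> get_user_data
    .toLowerCase();
}

// Illustrative registry of canonical tool names.
const KNOWN_TOOLS = new Set(["search_database", "get_user_data", "calculate_sum"]);

// Returns the canonical name, or null when normalization is not enough.
function resolveToolName(requested: string): string | null {
  const normalized = normalizeToolName(requested);
  return KNOWN_TOOLS.has(normalized) ? normalized : null;
}

console.log(resolveToolName("searchDatabase")); // "search_database"
console.log(resolveToolName("get-user-data")); // "get_user_data"
console.log(resolveToolName("calculate_total")); // null
```

Normalization is deterministic and cannot map a request to a different tool, which makes it a safer first step than edit-distance matching for tools with side effects.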

4. Unvalidated Tool Arguments

What happens:
The agent calls send_email with a string instead of an object. Or passes to: undefined. Your tool crashes with cryptic errors.

Why it happens:
LLMs don't guarantee type correctness. They might:

Pass wrong types ("5" instead of 5)
Omit required fields
Add unexpected fields
Nest objects incorrectly

Real failure:

// Agent calls: send_email({ recipient: "user@example.com" })
// Tool expects: send_email({ to: string, subject: string, body: string })
// Result: args.to is undefined, so the tool fails later with a cryptic error

The fix: Validate with Zod

import { z } from 'zod';


const sendEmailSchema = z.object({
 to: z.string().email(),
 subject: z.string().min(1).max(200),
 body: z.string(),
 cc: z.array(z.string().email()).optional(),
});


type SendEmailArgs = z.infer<typeof sendEmailSchema>;


async function sendEmail(args: unknown): Promise<string> {
 try {
   // Validate and parse
   const validated = sendEmailSchema.parse(args);


   // Now TypeScript knows the exact shape
   await emailService.send({
     to: validated.to,
     subject: validated.subject,
     body: validated.body,
     cc: validated.cc,
   });


   return 'Email sent successfully';
 } catch (error) {
   if (error instanceof z.ZodError) {
     // Return a helpful error to the agent (ZodError exposes the problems as `issues`)
     const issues = error.issues.map(e => `${e.path.join('.')}: ${e.message}`);
     return `Invalid arguments: ${issues.join(', ')}`;
   }
   throw error;
 }
}

Better: Validate at registration

interface Tool {
 name: string;
 description: string;
 schema: z.ZodSchema;
 handler: (args: any) => Promise<string>;
}


function createTool<T extends z.ZodSchema>(
 name: string,
 description: string,
 schema: T,
 handler: (args: z.infer<T>) => Promise<string>
): Tool {
 return {
   name,
   description,
   schema,
   handler: async (rawArgs: unknown) => {
     const validated = schema.parse(rawArgs);
     return handler(validated);
   },
 };
}


// Usage
const emailTool = createTool(
 'send_email',
 'Send an email',
 sendEmailSchema,
 async (args) => {
   // args is typed as SendEmailArgs here
   await emailService.send(args);
   return 'Email sent successfully';
 }
);

Lesson: Validate everything that crosses the LLM boundary.
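
Tool arguments often arrive from the model as a JSON string, so a parse step usually comes before schema validation. Here is a minimal sketch, assuming a hypothetical `parseToolArguments` helper that returns a result object instead of throwing, so the error text can be fed back to the agent:

```typescript
type ParseResult =
  | { ok: true; value: Record<string, unknown> }
  | { ok: false; error: string };

// Hypothetical helper: turns the raw argument string from a tool call into a
// plain object, or into an error message that can be returned to the agent.
function parseToolArguments(raw: string): ParseResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, error: 'Arguments are not valid JSON' };
  }
  // JSON.parse also accepts strings, numbers, arrays, and null
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    return { ok: false, error: 'Arguments must be a JSON object' };
  }
  return { ok: true, value: parsed as Record<string, unknown> };
}

console.log(parseToolArguments('{"to":"user@example.com"}').ok); // true
console.log(parseToolArguments('"user@example.com"').ok);        // false
```

The `value` of a successful result is still untrusted and can go straight into a schema check such as `sendEmailSchema.parse`.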

5. The Silent Timeout

What happens:
Your agent calls a slow API. The call times out. The agent doesn't know what happened and can't recover.

Why it happens:
No timeout handling. External APIs can:

Hang indefinitely
Take longer than expected
Fail without clear errors

Real scenario:

// Tool calls external API
const data = await fetch('https://slow-api.com/data');
// This might hang for minutes
// Agent has no idea what's happening

The fix: Add a timeout

async function fetchWithTimeout(
 url: string,
 options: RequestInit = {},
 timeoutMs: number = 5000
): Promise<Response> {
 const controller = new AbortController();
 const timeout = setTimeout(() => controller.abort(), timeoutMs);


 try {
   const response = await fetch(url, {
     ...options,
     signal: controller.signal,
   });
   clearTimeout(timeout);
   return response;
 } catch (error) {
   clearTimeout(timeout);
   if (error instanceof Error && error.name === 'AbortError') {
     throw new Error(`Request timed out after ${timeoutMs}ms`);
   }
   throw error;
 }
}

Better: Retry with exponential backoff

async function fetchWithRetry(
 url: string,
 maxRetries: number = 3,
 timeoutMs: number = 5000
): Promise<Response> {
 let lastError: Error | null = null;


 for (let attempt = 0; attempt < maxRetries; attempt++) {
   try {
     return await fetchWithTimeout(url, {}, timeoutMs);
   } catch (error) {
     lastError = error as Error;

     // Don't retry on last attempt
     if (attempt < maxRetries - 1) {
       const delay = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s, ... (doubles each attempt)
       console.log(`Attempt ${attempt + 1} failed, retrying in ${delay}ms...`);
       await new Promise(resolve => setTimeout(resolve, delay));
     }
   }
 }


 throw new Error(
   `Failed after ${maxRetries} attempts: ${lastError?.message}`
 );
}
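
When many agent runs retry at the same moments, a cap and some random jitter help spread the load. The sketch below covers only the delay calculation. It assumes a hypothetical `backoffDelay` helper and the "full jitter" strategy, which is a swap-in for the fixed doubling above rather than part of the original code.

```typescript
// Hypothetical helper: delay in ms before retry number `attempt` (0-based).
// `random` is injectable so tests can make the result deterministic.
function backoffDelay(
  attempt: number,
  baseMs = 1000,
  maxMs = 30000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt); // capped exponential growth
  return Math.floor(random() * ceiling); // "full jitter": anywhere in [0, ceiling)
}

// With random fixed at 1 the output shows the ceilings: 1000, 2000, 4000, 8000, 8000
for (let attempt = 0; attempt < 5; attempt++) {
  console.log(backoffDelay(attempt, 1000, 8000, () => 1));
}
```

Passing `random` as a parameter keeps the function pure, so a test can pin the value instead of asserting on ranges.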

Best: Return partial results

async function searchWithTimeout(
 query: string
): Promise<{ results: string[]; completed: boolean }> {
 try {
   const response = await fetchWithTimeout(
     `https://api.search.com?q=${encodeURIComponent(query)}`,
     {},
     3000 // 3 second timeout
   );
   const data = await response.json();
   return { results: data.results, completed: true };
 } catch (error) {
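   // Nothing came back: report the search as incomplete instead of throwing.
   // This branch also covers network and JSON errors, not only timeouts.
   return {
     results: ['Search did not complete (timeout or request error). Try a more specific query.'],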
     completed: false,
   };
 }
}

Lesson: External calls fail. Plan for it.

Putting It All Together

Here's a tool wrapper that combines three of these safeguards: a limit on repeated calls (#1), argument validation (#4), and a timeout (#5):

class RobustToolExecutor {
 private attempts = new Map<string, number>();
 private maxRetries = 3;
 private timeout = 5000;


 async execute(
   toolName: string,
   args: unknown,
   schema: z.ZodSchema
 ): Promise<any> {
   // Prevent #1: Infinite loops
   const attemptKey = `${toolName}:${JSON.stringify(args)}`;
   const attemptCount = this.attempts.get(attemptKey) || 0;

   if (attemptCount >= this.maxRetries) {
     throw new Error(`Tool ${toolName} failed after ${this.maxRetries} attempts`);
   }

   this.attempts.set(attemptKey, attemptCount + 1);


   // Prevent #4: Unvalidated arguments
   let validatedArgs;
   try {
     validatedArgs = schema.parse(args);
   } catch (error) {
     if (error instanceof z.ZodError) {
       throw new Error(`Invalid arguments: ${error.issues.map(e => e.message).join(', ')}`);
     }
     throw error;
   }


   // Prevent #5: Silent timeouts
   try {
     return await this.executeWithTimeout(toolName, validatedArgs);
   } catch (error) {
     // Keep the attempt count for timeouts; reset it for other error types.
     // The timeout error below says "timed out", so match that exact text.
     if (!(error instanceof Error && error.message.includes('timed out'))) {
       this.attempts.delete(attemptKey);
     }
     throw error;
   }
 }


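 private async executeWithTimeout(
   toolName: string,
   args: any
 ): Promise<any> {
   let timer: ReturnType<typeof setTimeout> | undefined;
   const timeoutPromise = new Promise<never>((_, reject) => {
     timer = setTimeout(
       () => reject(new Error(`Tool ${toolName} timed out after ${this.timeout}ms`)),
       this.timeout
     );
   });

   try {
     return await Promise.race([this.executeTool(toolName, args), timeoutPromise]);
   } finally {
     // Clear the timer so a finished call does not leave it pending
     clearTimeout(timer);
   }
 }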


 private async executeTool(toolName: string, args: any): Promise<any> {
   // Your actual tool execution logic
   return { success: true };
 }
}

Common Theme

All five failures share a pattern: the agent assumes things will work, but they don't.

Prevention requires:

Hard limits (iterations, tokens, time)
Validation (types, schemas, existence)
Graceful degradation (timeouts, partial results)
Clear error messages (the agent needs to understand what failed)

These aren't bugs in your business logic. They're operational concerns that become visible under real usage.

Testing These Failures

Happy-path unit tests rarely catch these, so simulate the failure conditions explicitly:

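// Simulate failure scenarios.
// Assumes BudgetedAgent and impossibleTask from your agent code, plus a
// standalone executeWithTimeout(tool, args, timeoutMs) helper.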
describe('Agent reliability', () => {
 it('stops after max iterations', async () => {
   const agent = new BudgetedAgent({ maxCalls: 5 });
   await expect(agent.run(impossibleTask)).rejects.toThrow('Max iterations');
 });


 it('handles tool timeouts', async () => {
   const slowTool = createTool('slow', 'A slow tool', z.object({}), async () => {
     await new Promise(resolve => setTimeout(resolve, 10000));
     return 'done';
   });

   await expect(executeWithTimeout(slowTool, {}, 1000)).rejects.toThrow('timed out');
 });
});
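
Limits like these can also be checked without a model, a network, or real timers. Here is a framework-free sketch, assuming a hypothetical `CallBudget` class that mirrors the attempt counter in `RobustToolExecutor`:

```typescript
// Hypothetical guard: counts identical (tool, args) calls and refuses once a limit is hit.
class CallBudget {
  private counts = new Map<string, number>();
  private maxIdenticalCalls: number;

  constructor(maxIdenticalCalls: number) {
    this.maxIdenticalCalls = maxIdenticalCalls;
  }

  // Records one call; throws when the same call would exceed the limit.
  record(toolName: string, args: unknown): void {
    const key = `${toolName}:${JSON.stringify(args)}`;
    const next = (this.counts.get(key) ?? 0) + 1;
    if (next > this.maxIdenticalCalls) {
      throw new Error(
        `Tool ${toolName} called more than ${this.maxIdenticalCalls} times with the same arguments`
      );
    }
    this.counts.set(key, next);
  }
}

// Three identical calls are allowed, the fourth is rejected.
const budget = new CallBudget(3);
for (let i = 0; i < 3; i++) budget.record('search', { q: 'refund policy' });

let fourthCallRejected = false;
try {
  budget.record('search', { q: 'refund policy' });
} catch {
  fourthCallRejected = true;
}

budget.record('search', { q: 'shipping policy' }); // different arguments, still allowed
console.log({ fourthCallRejected }); // { fourthCallRejected: true }
```

Because the guard has no timers or I/O, the same checks translate directly into `it(...)` blocks that run in milliseconds.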

Better: Monitor these in production with metrics.

Wrapping Up

These five failures happen to everyone. The difference is whether you build safeguards before or after they burn you in production.

Start with limits and validation. Add retries and timeouts as you find slow external calls. Log everything so you know what's actually failing.

Your agent will still have bugs. But these won't be among them.

The 5 Types of AI Agent Memory Every TypeScript Developer Should Know

Raju Dandigam — Wed, 08 Jul 2026 14:29:26 +0000

Most developers try to fix AI agents with better prompts.

In practice, most agent problems are memory problems.

The agent forgets context. It repeats the same mistake. It uses outdated information. It cannot connect past actions to current decisions. Or it behaves inconsistently across sessions.

That is not a model issue. That is a memory design issue.

In real systems, memory is not one thing. It is a combination of different layers, each solving a different problem. Once you understand this, your agent design becomes much simpler and much more reliable.

Here are the five types of memory I would use in a TypeScript AI agent in 2026.

1. Short-term memory keeps the current task coherent

Short-term memory is what the agent knows right now.

It includes the current goal, recent steps, tool outputs, and intermediate results. Without it, the agent loses track of what it is doing within a single run.

Most “looping” or “confused” agent behavior comes from weak short-term memory. The model simply does not have a clear picture of what already happened.

In TypeScript, this is usually just part of your runtime state.

type AgentState = {
  goal: string;
  steps: Array<{
    name: string;
    input: unknown;
    output?: unknown;
  }>;
};

This does not need to be complicated. The important part is that every step is recorded and passed back into the next decision.

Use short-term memory when:

The agent needs to track progress within a task
Tool outputs influence the next step
You want to prevent repeated or redundant actions

If your agent keeps repeating the same tool call, this is the first place to look.
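
One way to use the recorded steps is to check for a repeat before running a tool. The sketch below assumes a hypothetical `isRepeatedCall` helper and compares inputs with `JSON.stringify`, which is only reliable when key order is stable.

```typescript
type Step = { name: string; input: unknown; output?: unknown };
type AgentState = { goal: string; steps: Step[] };

// Hypothetical helper: true if the same tool was already run with the same input.
function isRepeatedCall(state: AgentState, name: string, input: unknown): boolean {
  const wanted = JSON.stringify(input);
  return state.steps.some(
    step => step.name === name && JSON.stringify(step.input) === wanted
  );
}

const state: AgentState = {
  goal: 'Find the refund policy',
  steps: [{ name: 'search', input: { q: 'refund policy' }, output: ['doc-12'] }],
};

console.log(isRepeatedCall(state, 'search', { q: 'refund policy' })); // true
console.log(isRepeatedCall(state, 'search', { q: 'return window' })); // false
```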

2. Semantic memory stores reusable facts

Semantic memory is long-term knowledge that does not change frequently.

This includes things like user preferences, account details, product configurations, or domain knowledge that the agent should remember across sessions.

For example:

A user prefers morning flights
A customer is on an enterprise plan
A project uses React and PostgreSQL

This type of memory is usually stored in a database or a vector store and retrieved based on relevance.

type SemanticMemory = {
  userId: string;
  key: string;
  value: string;
};

The key idea is selective retrieval. You do not want to send all stored knowledge to the model. You only want to retrieve what is relevant to the current goal.

Use semantic memory when:

The agent needs to remember user preferences
Context should persist across sessions
Decisions depend on stable facts

If your agent feels “stateless” across sessions, this is what is missing.
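
Selective retrieval can be sketched with a toy relevance score. The example below uses keyword overlap as a stand-in for vector similarity. `retrieveRelevant` and `words` are hypothetical helpers, and a real system would more likely query a database or vector store.

```typescript
type SemanticMemory = { userId: string; key: string; value: string };

// Lowercased word set, used for a toy keyword-overlap relevance score.
function words(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

// Hypothetical helper: returns up to `limit` memories for this user whose
// key or value shares at least one word with the current goal.
function retrieveRelevant(
  memories: SemanticMemory[],
  userId: string,
  goal: string,
  limit = 3
): SemanticMemory[] {
  const goalWords = words(goal);
  return memories
    .filter(m => m.userId === userId)
    .map(m => {
      const memoryWords = Array.from(words(`${m.key} ${m.value}`));
      return { memory: m, overlap: memoryWords.filter(w => goalWords.has(w)).length };
    })
    .filter(item => item.overlap > 0)
    .sort((a, b) => b.overlap - a.overlap)
    .slice(0, limit)
    .map(item => item.memory);
}

const memories: SemanticMemory[] = [
  { userId: 'u1', key: 'flight preference', value: 'prefers morning flights' },
  { userId: 'u1', key: 'plan', value: 'enterprise' },
  { userId: 'u2', key: 'flight preference', value: 'prefers evening flights' },
];

// Only the first entry matches: same user, and it shares the word "flight" with the goal.
console.log(retrieveRelevant(memories, 'u1', 'Book a flight to Berlin'));
```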

3. Episodic memory remembers past actions

Episodic memory is about what happened before.

This includes past agent runs, previous decisions, failures, retries, and outcomes. It is different from semantic memory because it captures events, not facts.

For example:

The agent already created a support ticket yesterday
A payment sync failed earlier and should not be retried immediately
A previous request was escalated to a human

type EpisodicMemory = {
  event: string;
  result: string;
  timestamp: string;
};

This type of memory is useful for preventing repeated mistakes and improving consistency over time.

Use episodic memory when:

You want to avoid repeating the same action
The agent needs awareness of past outcomes
Decisions depend on historical behavior

If your agent keeps retrying something that already failed, it likely lacks episodic memory.
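
The "should not be retried immediately" case can be sketched as a cooldown check over recorded events. `failedRecently` is a hypothetical helper. It assumes that `result` holds the literal string "failed" and that timestamps are ISO 8601.

```typescript
type EpisodicMemory = { event: string; result: string; timestamp: string };

// Hypothetical helper: true if `event` failed within the last `cooldownMs`.
function failedRecently(
  episodes: EpisodicMemory[],
  event: string,
  now: Date,
  cooldownMs: number
): boolean {
  return episodes.some(e => {
    const ageMs = now.getTime() - Date.parse(e.timestamp);
    return e.event === event && e.result === 'failed' && ageMs >= 0 && ageMs < cooldownMs;
  });
}

const episodes: EpisodicMemory[] = [
  { event: 'payment_sync', result: 'failed', timestamp: '2026-07-08T10:00:00Z' },
];

const tenMinutesLater = new Date('2026-07-08T10:10:00Z');
const twoHoursLater = new Date('2026-07-08T12:00:00Z');
const oneHourMs = 60 * 60 * 1000;

console.log(failedRecently(episodes, 'payment_sync', tenMinutesLater, oneHourMs)); // true
console.log(failedRecently(episodes, 'payment_sync', twoHoursLater, oneHourMs));   // false
```

Passing `now` in as a parameter keeps the check deterministic, which makes it easy to test.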

4. Procedural memory stores how things should be done

Procedural memory is often overlooked, but it is one of the most important types.

This is not data. This is behavior.

It defines how the agent should perform tasks, what steps to follow, what rules to enforce, and what constraints to respect.

In practice, this lives in prompts, system instructions, or structured workflows.

For example:

Always validate tool input before execution
Ask for approval before sending emails
Prefer cheaper tools unless confidence is low
Stop after a fixed number of steps

const agentRules = [
  "Validate all tool inputs",
  "Do not call high-risk tools without approval",
  "Stop after 5 steps if no progress is made"
];

Use procedural memory when:

You want consistent behavior across runs
You need to enforce business rules
You want to reduce unpredictable decisions

If your agent behaves differently for similar inputs, this is usually the missing piece.
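
Where a rule must always hold, it can also be checked in code as part of a structured workflow. Below is a sketch of two of the example rules, approval for high-risk tools and a step limit. The `Policy` type and the `checkStep` function are hypothetical names.

```typescript
type Decision = 'allow' | 'needs_approval' | 'stop';

type Policy = {
  maxSteps: number;
  highRiskTools: string[]; // tools that require human approval
};

// Hypothetical policy check that runs before each tool call.
function checkStep(
  policy: Policy,
  stepsTaken: number,
  toolName: string,
  approved: boolean
): Decision {
  if (stepsTaken >= policy.maxSteps) return 'stop';
  if (policy.highRiskTools.includes(toolName) && !approved) return 'needs_approval';
  return 'allow';
}

const policy: Policy = { maxSteps: 5, highRiskTools: ['send_email'] };

console.log(checkStep(policy, 0, 'search_database', false)); // allow
console.log(checkStep(policy, 1, 'send_email', false));      // needs_approval
console.log(checkStep(policy, 1, 'send_email', true));       // allow
console.log(checkStep(policy, 5, 'search_database', false)); // stop
```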

5. Audit memory records what happened

Audit memory is not used by the model directly. It is used by developers, systems, and compliance processes.

This includes logs, traces, step history, tool calls, and decision paths.

type AuditLog = {
  runId: string;
  step: string;
  input?: unknown;
  output?: unknown;
  timestamp: string;
};

This is what allows you to answer questions like:

Why did the agent choose this tool?
What input did it send?
What did the tool return?
Why did the agent stop?

Use audit memory when:

You need debugging visibility
You want to evaluate agent behavior
You have compliance or audit requirements

If you cannot explain why your agent made a decision, you are missing audit memory.
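
Audit entries often carry tool inputs, so it helps to redact secrets before they are stored. Here is a minimal in-memory sketch. The key list in `SENSITIVE_KEYS`, the `redact` function, and `recordStep` are assumptions for illustration, and a real system would write to durable storage.

```typescript
type AuditLog = {
  runId: string;
  step: string;
  input?: unknown;
  output?: unknown;
  timestamp: string;
};

// Assumed list of sensitive field names; adjust to your own payloads.
const SENSITIVE_KEYS = new Set(['password', 'token', 'apiKey', 'authorization']);

// Recursively replaces the values of sensitive keys with a placeholder.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (typeof value === 'object' && value !== null) {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : redact(inner);
    }
    return out;
  }
  return value;
}

const auditTrail: AuditLog[] = [];

// Hypothetical recorder: appends one redacted entry per step.
function recordStep(runId: string, step: string, input?: unknown, output?: unknown): void {
  auditTrail.push({
    runId,
    step,
    input: redact(input),
    output: redact(output),
    timestamp: new Date().toISOString(),
  });
}

recordStep('run-1', 'call_api', { url: '/orders', token: 'secret-123' }, { status: 200 });
console.log(JSON.stringify(auditTrail[0].input)); // {"url":"/orders","token":"[REDACTED]"}
```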

How these memory types work together

These memory types are not independent. They work together in a layered way.

Short-term memory drives the current task.
Semantic memory provides relevant facts.
Episodic memory gives historical awareness.
Procedural memory enforces behavior.
Audit memory records everything for inspection.

Most production issues happen when one of these layers is missing.

Too much focus on prompts without memory leads to inconsistent behavior.
Too much memory without structure leads to noise and confusion.

The goal is balance.

A simple mental model

If I had to simplify everything into one idea, it would be this:

Short-term memory answers: What is happening right now?
Semantic memory answers: What do we know?
Episodic memory answers: What happened before?
Procedural memory answers: How should we act?
Audit memory answers: What actually happened?

Once you design your agent with these questions in mind, most problems become easier to reason about.

The real takeaway

Most developers try to make agents smarter by improving prompts or switching models.

In real systems, the biggest improvements come from better memory design.

When memory is structured correctly, the agent becomes more consistent, more reliable, and easier to debug. It stops repeating mistakes. It adapts to users. It follows rules. And most importantly, it becomes predictable.

That is what turns an AI agent from a demo into a system you can actually trust.

Model Context Protocol (MCP) for TypeScript Developers: A 10-Minute Guide

Raju Dandigam — Tue, 07 Jul 2026 17:01:45 +0000

Anthropic released Model Context Protocol in late 2024 to solve a real problem: every AI agent needs custom code to connect to data sources. MCP standardizes this. Here's what it is and how to use it with TypeScript.

The Problem MCP Solves

Before MCP, connecting an AI agent to your database looked like this:

// Custom integration for Agent A
async function agentAGetData() {
 const db = new DatabaseClient();
 return await db.query('SELECT * FROM users');
}


// Different custom integration for Agent B
async function agentBGetData() {
 const conn = connectToDatabase();
 return conn.fetchUsers();
}

Every agent, every tool, every integration required custom code. MCP creates a standard protocol so you write the integration once and any MCP-compatible agent can use it.

Think of it as doing for agent integrations what OAuth did for sign-in: instead of building login flows for every app, you implement OAuth once and everyone can use it.

What is MCP?

Model Context Protocol is:

A standard way for AI agents to request data from sources
A client-server architecture where servers expose data, clients (agents) consume it
Transport agnostic - stdio and HTTP are the standard transports, and custom transports (for example WebSocket) are possible

In practice: You write an MCP server that exposes your filesystem, database, or API. Any agent that speaks MCP can then access that data without custom integration code.

When to Use MCP

Use MCP when:

Multiple agents need the same data source
You want reusable integrations
You're building tools others will use

Skip MCP when:

You have one agent with 2-3 simple tools
Custom integration is faster for your use case
You're prototyping and speed matters

MCP adds structure. That's great for maintainability, but it's overhead for simple projects.

Quick Start: Your First MCP Server

We'll build a filesystem server that lets agents read files.

Setup

mkdir mcp-filesystem-server
cd mcp-filesystem-server
npm init -y
npm install @modelcontextprotocol/sdk
npm install -D typescript @types/node tsx
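npx tsc --init

The rest of this guide assumes an ES module project that compiles src/ to dist/. Set "type": "module" in package.json, and in tsconfig.json set "module" and "moduleResolution" to "NodeNext", "rootDir" to "src", and "outDir" to "dist".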

Create the Server

Create src/index.ts:

import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import {
 CallToolRequestSchema,
 ListToolsRequestSchema,
} from '@modelcontextprotocol/sdk/types.js';
import fs from 'fs/promises';
import path from 'path';


// Create the MCP server
const server = new Server(
 {
   name: 'filesystem-server',
   version: '1.0.0',
 },
 {
   capabilities: {
     tools: {},
   },
 }
);


// Define what tools this server provides
server.setRequestHandler(ListToolsRequestSchema, async () => {
 return {
   tools: [
     {
       name: 'read_file',
       description: 'Read the contents of a file',
       inputSchema: {
         type: 'object',
         properties: {
           path: {
             type: 'string',
             description: 'Path to the file to read',
           },
         },
         required: ['path'],
       },
     },
     {
       name: 'list_directory',
       description: 'List files in a directory',
       inputSchema: {
         type: 'object',
         properties: {
           path: {
             type: 'string',
             description: 'Path to the directory',
           },
         },
         required: ['path'],
       },
     },
   ],
 };
});


// Handle tool execution
server.setRequestHandler(CallToolRequestSchema, async (request) => {
 const { name, arguments: args } = request.params;


 try {
   if (name === 'read_file') {
     // `arguments` is optional in the request, so guard before reading from it
     const filePath = String(args?.path ?? '');
     const content = await fs.readFile(filePath, 'utf-8');
     return {
       content: [
         {
           type: 'text',
           text: content,
         },
       ],
     };
   }


   if (name === 'list_directory') {
     // `arguments` is optional in the request, so guard before reading from it
     const dirPath = String(args?.path ?? '');
     const files = await fs.readdir(dirPath);
     const fileList = files.join('\n');
     return {
       content: [
         {
           type: 'text',
           text: `Files in ${dirPath}:\n${fileList}`,
         },
       ],
     };
   }


   return {
     content: [
       {
         type: 'text',
         text: `Unknown tool: ${name}`,
       },
     ],
     isError: true,
   };
 } catch (error) {
   return {
     content: [
       {
         type: 'text',
         text: `Error: ${error instanceof Error ? error.message : String(error)}`,
       },
     ],
     isError: true,
   };
 }
});


// Start the server
async function main() {
 const transport = new StdioServerTransport();
 await server.connect(transport);
 console.error('Filesystem MCP server running on stdio');
}


main().catch(console.error);

Build and Test

# Build
npx tsc


# Test (the server communicates over stdio)
node dist/index.js

The server is now running and waiting for MCP requests over stdin/stdout.

What Just Happened?

You created an MCP server that declares the tools capability and handles two request types:

ListTools: Tells clients what tools are available
CallTool: Executes tools when requested

When a client (like Claude Desktop) connects:

It asks: "What tools do you have?"
Server responds: "I have read_file and list_directory"
Client requests: "Run read_file with path=/example.txt"
Server executes and returns the file contents
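
On the wire, these exchanges are JSON-RPC 2.0 messages. The sketch below shows roughly what steps 1 and 3 look like. `makeRequest` is a hypothetical helper for illustration only, since the SDK builds these messages for you, and a real session begins with an initialization handshake that is omitted here.

```typescript
// Shape of a JSON-RPC 2.0 request as used by MCP (simplified for illustration).
type JsonRpcRequest = {
  jsonrpc: '2.0';
  id: number;
  method: string;
  params?: Record<string, unknown>;
};

let nextId = 1;

// Hypothetical helper for building requests by hand.
function makeRequest(method: string, params?: Record<string, unknown>): JsonRpcRequest {
  const request: JsonRpcRequest = { jsonrpc: '2.0', id: nextId++, method };
  if (params !== undefined) request.params = params;
  return request;
}

// Step 1: "What tools do you have?"
const listToolsRequest = makeRequest('tools/list');

// Step 3: "Run read_file with path=/example.txt"
const callToolRequest = makeRequest('tools/call', {
  name: 'read_file',
  arguments: { path: '/example.txt' },
});

console.log(JSON.stringify(listToolsRequest));
// {"jsonrpc":"2.0","id":1,"method":"tools/list"}
console.log(JSON.stringify(callToolRequest));
// {"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"read_file","arguments":{"path":"/example.txt"}}}
```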

Connecting to Claude Desktop

Claude Desktop has built-in MCP support. To use your server:

1. Create MCP Configuration

On macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
On Windows: %APPDATA%\Claude\claude_desktop_config.json

{
 "mcpServers": {
   "filesystem": {
     "command": "node",
     "args": ["/absolute/path/to/your/dist/index.js"]
   }
 }
}

2. Restart Claude Desktop

After restarting, Claude can now read files and list directories using your MCP server.

Ask Claude: "What files are in my Documents folder?" and it will use your MCP server to answer.

Pre-Built MCP Servers

You don't have to write every integration. Several official servers exist:

# Optional: install pre-built servers globally (npx -y can also fetch them on demand)
npm install -g @modelcontextprotocol/server-filesystem
npm install -g @modelcontextprotocol/server-github
npm install -g @modelcontextprotocol/server-postgres

Configure them in claude_desktop_config.json:

{
 "mcpServers": {
   "filesystem": {
     "command": "npx",
     "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/you/Documents"]
   },
   "github": {
     "command": "npx",
     "args": ["-y", "@modelcontextprotocol/server-github"],
     "env": {
       "GITHUB_PERSONAL_ACCESS_TOKEN": "your-token-here"
     }
   }
 }
}

Now Claude can access your filesystem and GitHub without you writing any integration code.

Building a Database MCP Server

Let's build something more practical—a PostgreSQL server:

import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import {
 CallToolRequestSchema,
 ListToolsRequestSchema,
} from '@modelcontextprotocol/sdk/types.js';
import { Client } from 'pg';


const dbClient = new Client({
 connectionString: process.env.DATABASE_URL,
});


await dbClient.connect();


const server = new Server(
 {
   name: 'postgres-server',
   version: '1.0.0',
 },
 {
   capabilities: {
     tools: {},
   },
 }
);


server.setRequestHandler(ListToolsRequestSchema, async () => {
 return {
   tools: [
     {
       name: 'query',
       description: 'Execute a SQL query (SELECT only for safety)',
       inputSchema: {
         type: 'object',
         properties: {
           sql: {
             type: 'string',
             description: 'SQL query to execute',
           },
         },
         required: ['sql'],
       },
     },
   ],
 };
});


server.setRequestHandler(CallToolRequestSchema, async (request) => {
 const { name, arguments: args } = request.params;


 if (name === 'query') {
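   // Trim and drop one trailing semicolon so "SELECT 1;" still works
   const sql = String(args?.sql ?? '').trim().replace(/;\s*$/, '');


   // Safety check - only allow a single SELECT statement.
   // Without the semicolon check, "SELECT 1; DROP TABLE users" would pass.
   if (!/^select\b/i.test(sql) || sql.includes(';')) {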
     return {
       content: [
         {
           type: 'text',
           text: 'Only SELECT queries are allowed',
         },
       ],
       isError: true,
     };
   }


   try {
     const result = await dbClient.query(sql);
     return {
       content: [
         {
           type: 'text',
           text: JSON.stringify(result.rows, null, 2),
         },
       ],
     };
   } catch (error) {
     return {
       content: [
         {
           type: 'text',
           text: `Query error: ${error instanceof Error ? error.message : String(error)}`,
         },
       ],
       isError: true,
     };
   }
 }


 return {
   content: [{ type: 'text', text: `Unknown tool: ${name}` }],
   isError: true,
 };
});


const transport = new StdioServerTransport();
await server.connect(transport);

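Now any MCP client can run read queries against your database. Treat the string check as a first line of defense only, and connect with a database role that has nothing beyond SELECT privileges.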

MCP Client (Connecting from Your Agent)

Want to use MCP servers from your own agent? Use the client SDK:

import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';


// Connect to an MCP server
const transport = new StdioClientTransport({
 command: 'node',
 args: ['./path/to/mcp-server.js'],
});


const client = new Client(
 {
   name: 'my-agent',
   version: '1.0.0',
 },
 {
   capabilities: {},
 }
);


await client.connect(transport);


// List available tools
const tools = await client.listTools();
console.log('Available tools:', tools);


// Call a tool
const result = await client.callTool({
 name: 'read_file',
 arguments: {
   path: '/path/to/file.txt',
 },
});


console.log('Result:', result);

This lets you build agents that use MCP servers as tools.

Common Patterns

Adding Resources (Not Just Tools)

MCP servers can expose resources (data) and tools (actions):

import { ListResourcesRequestSchema, ReadResourceRequestSchema } from '@modelcontextprotocol/sdk/types.js';


// List available resources
server.setRequestHandler(ListResourcesRequestSchema, async () => {
 return {
   resources: [
     {
       uri: 'config://app/settings',
       name: 'App Settings',
       mimeType: 'application/json',
     },
   ],
 };
});


// Read a specific resource
server.setRequestHandler(ReadResourceRequestSchema, async (request) => {
 const { uri } = request.params;


 if (uri === 'config://app/settings') {
   const settings = { theme: 'dark', notifications: true };
   return {
     contents: [
       {
         uri,
         mimeType: 'application/json',
         text: JSON.stringify(settings, null, 2),
       },
     ],
   };
 }


 return { contents: [] };
});

Resources are for data that agents read; tools are for actions they execute.

Security Considerations

MCP servers run with the permissions of the process that launches them, which for a local server usually means your own user account. Consider:

Validate inputs: Don't trust agent requests blindly
Limit file access: Only expose specific directories
Read-only by default: Separate read and write tools
No credential exposure: Handle auth server-side
Rate limiting: Prevent abuse

Example safe file access:

const ALLOWED_DIR = '/safe/directory';


server.setRequestHandler(CallToolRequestSchema, async (request) => {
 if (request.params.name === 'read_file') {
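   const requestedPath = String(request.params.arguments?.path ?? '');
   const resolvedPath = path.resolve(ALLOWED_DIR, requestedPath);


   // Prevent directory traversal. Compare against the directory plus a
   // separator so that "/safe/directory-other" is not accepted.
   if (!resolvedPath.startsWith(ALLOWED_DIR + path.sep)) {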
     return {
       content: [{ type: 'text', text: 'Access denied' }],
       isError: true,
     };
   }


   // Safe to proceed
   const content = await fs.readFile(resolvedPath, 'utf-8');
   return { content: [{ type: 'text', text: content }] };
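 }

 // Every request needs a result, including calls to tools this server does not know
 return {
   content: [{ type: 'text', text: `Unknown tool: ${request.params.name}` }],
   isError: true,
 };
});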
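
Prefix checks on paths are easy to get subtly wrong: `/safe/directory-old` also starts with `/safe/directory`. The sketch below is a containment helper built on `path.relative`. `isInsideDir` is a hypothetical name, and the helper does not resolve symlinks, so call `fs.realpath` first if links may point outside the directory.

```typescript
import * as path from 'node:path';

// Hypothetical helper: true only if `requested` resolves to a path strictly inside `baseDir`.
function isInsideDir(baseDir: string, requested: string): boolean {
  const base = path.resolve(baseDir);
  const target = path.resolve(base, requested);
  const relative = path.relative(base, target);

  // A relative path that climbs out of the base starts with a ".." segment
  const escapes =
    relative === '..' ||
    relative.startsWith('..' + path.sep) ||
    path.isAbsolute(relative);

  // An empty relative path means the base directory itself, which is not a file
  return relative !== '' && !escapes;
}

console.log(isInsideDir('/safe/directory', 'notes/a.txt'));                  // true
console.log(isInsideDir('/safe/directory', '../directory-old/secret.txt')); // false
console.log(isInsideDir('/safe/directory', '../../etc/passwd'));            // false
```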

When MCP Makes Sense

Good fit:

You're building tools for multiple agents
You want a reusable GitHub/database/filesystem integration
You're part of a team with many agent projects

Overkill:

Single prototype with a few custom tools
One-off integration that won't be reused
Performance-critical paths (MCP adds overhead)

MCP trades flexibility for structure. That's great when you need consistency across projects, but it's extra work for simple cases.

What's Next?

The MCP ecosystem is early but growing. More servers are being published, and frameworks are adding MCP support.

For your projects:

Start with pre-built servers when available
Write custom servers for your specific data sources
Share useful servers—the ecosystem benefits from contributions

Wrapping Up

MCP standardizes how AI agents access data. Instead of building custom integrations for every agent, you write an MCP server once and any compatible agent can use it.

It's useful when you have multiple agents or want reusable integrations. For simple projects, custom tools are often faster.

The TypeScript SDK makes building MCP servers straightforward. The pattern is consistent: define tools, handle requests, return results. Everything else is application-specific logic.

If you're building agent infrastructure, MCP is worth learning. If you're building a single agent, evaluate whether the standardization benefits outweigh the added complexity.