Armel BOBDA

Deep Dive: Building Observable AI with Opik

How Pause turns every AI decision into a traceable, scoreable, searchable artifact.


The Problem

Most AI applications treat observability as an afterthought — a console.log here, a dashboard metric there. When something goes wrong, you're left staring at opaque LLM calls with no idea why the model said what it said.

Pause is an AI-powered financial guardian that intercepts impulse purchases with personalized interventions. Every interaction involves risk assessment, strategy selection, coupon discovery, behavioral reflection, and user feedback — a multi-step pipeline where any stage can fail silently. Without deep observability, "self-improving AI" is just a marketing claim.

We needed Opik to make the intelligence visible, auditable, and measurable. This deep dive documents our approach for the Best Use of Opik category.


Architecture Overview

[Architecture diagram — source on mermaid.live]

Key insight: Traces aren't static log entries. They're living documents that get richer over time as users make decisions and the learning pipeline processes outcomes.


Pattern 1: Automatic Tracing via Vercel AI SDK

Every Guardian streamText() call is automatically traced through the OpenTelemetry pipeline. No manual instrumentation needed.

Setup (instrumentation.ts):

import { registerOTel } from "@vercel/otel";

export async function register() {
  if (process.env.OPIK_API_KEY) {
    const { OpikExporter } = await import("opik-vercel");
    registerOTel({
      serviceName: "pause-guardian",
      traceExporter: new OpikExporter({
        tags: ["hackathon", "pause"],
      }),
    });
  }
}

Usage in the Guardian route (route.ts):

const result = streamText({
  model,
  system: systemPrompt,
  messages,
  tools,
  experimental_telemetry: getGuardianTelemetry(
    interactionId,
    { score: riskResult.score, reasoning: riskResult.reasoning },
    tier,
    isAutoApproved,
    undefined,
    undefined,
    purchaseContext,
    prediction
  ),
});

The getGuardianTelemetry() function is the centralized point where every trace gets its name, metadata, and identity. This prevents the "unnamed ai.generateText" traces that make Opik dashboards unusable.
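The helper's body isn't shown in the post, only its call signature. A minimal sketch of what such a centralized builder might look like — the return shape follows the Vercel AI SDK's `experimental_telemetry` settings object, but the naming logic and metadata fields here are assumptions, not the real implementation:

```typescript
// Hypothetical sketch — the real getGuardianTelemetry() lives in
// apps/web/src/lib/server/opik.ts; only its call signature appears above.
type RiskMeta = { score: number; reasoning: string };

interface GuardianTelemetry {
  isEnabled: boolean;
  functionId: string; // becomes the trace name in Opik
  metadata: Record<string, string | number | boolean>;
}

function getGuardianTelemetry(
  interactionId: string,
  risk: RiskMeta,
  tier: string,
  isAutoApproved: boolean
): GuardianTelemetry {
  return {
    isEnabled: true,
    // Assumed naming logic; the real code resolves one of the 16 trace names.
    functionId: isAutoApproved
      ? "guardian:analyst:auto_approved"
      : `guardian:${tier}`,
    metadata: {
      interactionId, // the key that later lookups link back to
      riskScore: risk.score,
      riskReasoning: risk.reasoning,
      tier,
      isAutoApproved,
    },
  };
}
```

The key design point is that the trace name and the `interactionId` are stamped in exactly one place, so every downstream lookup can rely on them.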


Pattern 2: 16 Named Trace Types — A Taxonomy, Not Just Labels

We defined 16 distinct trace names that map to every possible Guardian outcome:

export const TRACE_NAMES = {
  ANALYST_AUTO_APPROVED:       "guardian:analyst:auto_approved",
  NEGOTIATOR_ACCEPTED_SAVINGS: "guardian:negotiator:accepted_savings",
  NEGOTIATOR_SKIPPED_SAVINGS:  "guardian:negotiator:skipped_savings",
  NEGOTIATOR_ACCEPTED:         "guardian:negotiator:accepted",
  NEGOTIATOR_OVERRIDE:         "guardian:negotiator:override",
  THERAPIST_WAIT:              "guardian:therapist:wait",
  THERAPIST_ACCEPTED:          "guardian:therapist:accepted",
  THERAPIST_OVERRIDE:          "guardian:therapist:override",
  THERAPIST_WIZARD_BOOKMARK:   "guardian:therapist:wizard_bookmark",
  THERAPIST_WIZARD_ABANDONED:  "guardian:therapist:wizard_abandoned",
  BREAK_GLASS:                 "guardian:break_glass",
  SYSTEM_FAILURE_ANALYST_ONLY: "system:failure:analyst_only",
  SYSTEM_FAILURE_BREAK_GLASS:  "system:failure:break_glass",
  LEARNING_REFLECTION:         "learning:reflection",
  LEARNING_SKILLBOOK_UPDATE:   "learning:skillbook_update",
  CHAT_KNOWLEDGE:              "chat:knowledge",
} as const satisfies Record<string, TraceName>;

Why this matters: A judge (or developer) can filter Opik by guardian:negotiator:override and immediately see every time a user rejected the savings offer. Filter by learning:reflection to see every learning cycle. The namespace convention (guardian:, learning:, system:) creates natural groupings.

The resolveTraceName() function dynamically selects the right name based on tier, outcome, and degradation state — so traces are always semantically accurate, even during partial failures.
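The function itself isn't reproduced in the post; a hedged sketch of the kind of mapping it performs — the branch logic is an assumption, while the returned names are the TRACE_NAMES values above:

```typescript
// Hypothetical sketch of resolveTraceName(); branch order is assumed.
type Tier = "analyst" | "negotiator" | "therapist";
type Degradation = "analyst_only" | "break_glass" | null;

function resolveTraceName(
  tier: Tier,
  outcome: string | null,
  degraded: Degradation
): string {
  // Degradation wins: never mislabel a degraded run as a behavioral outcome.
  if (degraded === "analyst_only") return "system:failure:analyst_only";
  if (degraded === "break_glass") return "system:failure:break_glass";
  // The analyst tier has a single happy-path outcome.
  if (tier === "analyst") return "guardian:analyst:auto_approved";
  // Otherwise compose namespace:tier:outcome, e.g. guardian:negotiator:override.
  return outcome ? `guardian:${tier}:${outcome}` : `guardian:${tier}`;
}
```

Checking degradation first is what keeps the `system:failure:*` and `guardian:*` namespaces cleanly separated.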


Pattern 3: The Child Trace Pattern — Working Around Immutability

The problem: Opik traces are immutable after creation. But we don't know the Guardian's reasoning summary until after streaming completes. The trace is already created by the time we have the data we need.

The solution: Create a child trace linked via interactionId metadata:

export async function writeTraceMetadata(
  interactionId: string,
  metadata: Record<string, unknown>
): Promise<void> {
  const client = getOpikClient();
  if (!client) return;

  // Find the parent trace by interactionId
  const traces = await client.searchTraces({
    filterString: `metadata.interactionId = "${interactionId}"`,
    maxResults: 1,
  });

  if (traces.length > 0) {
    // Create a child trace with the metadata
    const metadataTrace = client.trace({
      name: "guardian:metadata_update",
      input: {
        interactionId,
        parentTraceId: traces[0].id,
        ...metadata,
      },
    });
    metadataTrace.end();
    await client.flush();
  }
}

This pattern is used for:

  • Reasoning summaries — written after streaming completes
  • Reflection results — written after the learning pipeline runs
  • Skillbook snapshots — written after skills are updated
  • Satisfaction feedback — written days later from Ghost Cards

Each child trace is a timestamped addition to the interaction's story.
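Reading the story back is the inverse query: fetch every trace carrying the interactionId and sort by start time. A sketch using the same searchTraces()/filterString API shown above — the client is modeled with a minimal interface, and the `startTime` field name is an assumption:

```typescript
// Sketch: rebuild an interaction's timeline from its parent and child traces.
interface TraceSummary {
  id: string;
  name: string;
  startTime: string; // ISO timestamp (assumed field name)
}

interface OpikSearchClient {
  searchTraces(q: { filterString: string; maxResults: number }): Promise<TraceSummary[]>;
}

async function getInteractionTimeline(
  client: OpikSearchClient,
  interactionId: string
): Promise<TraceSummary[]> {
  const traces = await client.searchTraces({
    filterString: `metadata.interactionId = "${interactionId}"`,
    maxResults: 50,
  });
  // Oldest first: the parent Guardian trace, then each child trace in order.
  return [...traces].sort((a, b) => a.startTime.localeCompare(b.startTime));
}
```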


Pattern 4: Feedback Scores — Turning User Decisions into Metrics

Every user decision is converted to a numerical score and attached to the original Guardian trace:

export const INTERVENTION_ACCEPTANCE_SCORES: Record<
  string,
  { value: number; reason: string }
> = {
  accepted:        { value: 1.0, reason: "User accepted Guardian suggestion" },
  accepted_savings:{ value: 1.0, reason: "User accepted savings offer" },
  wait:            { value: 1.0, reason: "User chose to wait as suggested" },
  skipped_savings: { value: 0.5, reason: "User skipped savings but accepted unlock" },
  override:        { value: 0.0, reason: "User overrode Guardian intervention" },
  wizard_bookmark: { value: 1.0, reason: "User engaged deeply with reflection wizard" },
};

And a second score layer from retrospective Ghost Card feedback:

export const REGRET_FREE_SCORES: Record<
  string,
  { value: number; reason: string } | null
> = {
  worth_it:  { value: 1.0, reason: "User reports purchase was worth it" },
  regret_it: { value: 0.0, reason: "User reports regret about purchase" },
  not_sure:  null,  // No score — insufficient signal
};

The attachment mechanism searches for the original trace by interactionId, then logs the score:

client.logTracesFeedbackScores([{
  id: traceId,
  name: scoreName,
  value: scoreValue,
  ...(reason && { reason }),
}]);

Why this matters: A single trace in Opik now carries the full lifecycle — from initial risk assessment, through the intervention strategy, to the user's immediate decision, and finally their retrospective satisfaction days later. You can filter for intervention_acceptance = 0.0 to find every failed intervention and trace exactly why the strategy didn't work.
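Putting the pieces together, the attachment flow might look like the following sketch. `attachAcceptanceScore` is a hypothetical name, the client is a minimal interface, and only two entries of the score table are inlined for brevity:

```typescript
// Hedged sketch of the lookup-then-score flow described above.
interface OpikFeedbackClient {
  searchTraces(q: { filterString: string; maxResults: number }): Promise<Array<{ id: string }>>;
  logTracesFeedbackScores(
    scores: Array<{ id: string; name: string; value: number; reason?: string }>
  ): void;
}

const ACCEPTANCE_SCORES: Record<string, { value: number; reason: string }> = {
  accepted: { value: 1.0, reason: "User accepted Guardian suggestion" },
  override: { value: 0.0, reason: "User overrode Guardian intervention" },
};

async function attachAcceptanceScore(
  client: OpikFeedbackClient,
  interactionId: string,
  outcome: string
): Promise<boolean> {
  const score = ACCEPTANCE_SCORES[outcome];
  if (!score) return false; // unknown outcome: no score logged
  const traces = await client.searchTraces({
    filterString: `metadata.interactionId = "${interactionId}"`,
    maxResults: 1,
  });
  if (traces.length === 0) return false; // original trace not found
  client.logTracesFeedbackScores([
    { id: traces[0].id, name: "intervention_acceptance", value: score.value, reason: score.reason },
  ]);
  return true;
}
```

Returning a boolean rather than throwing keeps the caller in the fire-and-forget spirit described later in the post.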


Pattern 5: Learning Pipeline Traces — Observing Self-Improvement

When the ACE learning pipeline runs, each stage creates its own trace:

Reflection trace — what the Reflector learned:

client.trace({
  name: "learning:reflection",
  input: {
    interactionId,
    parentTraceId: traces[0].id,
    reflectionAnalysis: reflectionOutput.analysis,
    helpfulSkillIds: reflectionOutput.helpful_skill_ids,
    harmfulSkillIds: reflectionOutput.harmful_skill_ids,
    newLearningsCount: reflectionOutput.new_learnings.length,
  },
  tags: ["learning", "reflection"],
});

Skillbook update trace — what changed in the knowledge base:

client.trace({
  name: "learning:skillbook_update",
  input: {
    interactionId,
    operationCount: updateBatch.operations.length,
    skillCountBefore,
    skillCountAfter,
    delta: skillCountAfter - skillCountBefore,
    operationsByType,
    reasoning: updateBatch.reasoning,
    operations: updateBatch.operations.map((op) => ({
      type: op.type,
      section: op.section,
      skill_id: op.skill_id,
    })),
    skillbook_snapshot: buildSkillbookSnapshot(skillbook),
  },
  tags: ["learning", "skillbook_update"],
});

Why this matters for judges: You can open Opik, filter to learning:skillbook_update, and watch the Skillbook grow. Each trace shows the delta — "3 skills before, 5 after, +2 added." This is the tangible proof that the AI is learning, not just a before/after screenshot.


Pattern 6: Degradation Tracing — Even Failures Are Observable

When the Guardian hits a service failure, the degradation itself is traced:

export async function logDegradationTrace(
  interactionId: string,
  degradationLevel: "analyst_only" | "break_glass",
  failureReason: string,
  riskMeta?: { score: number; reasoning: string },
): Promise<void> {
  const client = getOpikClient();
  if (!client) return;

  const trace = client.trace({
    name: degradationLevel === "analyst_only"
      ? TRACE_NAMES.SYSTEM_FAILURE_ANALYST_ONLY
      : TRACE_NAMES.SYSTEM_FAILURE_BREAK_GLASS,
    input: {
      interactionId,
      failureReason,
      degraded: true,
      degradationLevel,
      reasoning_summary: buildReasoningSummary({ ... }),
    },
  });
  trace.end();
  await client.flush();
}

System failures get their own namespace (system:failure:*) so they never pollute behavioral data. You can filter Opik to see only degradation events and understand system reliability independently from AI quality.


The Fire-and-Forget Philosophy

Every Opik operation follows one rule: telemetry failures must never disrupt the user.

// Every Opik call is wrapped like this:
try {
  const client = getOpikClient();
  if (!client) return;  // Gracefully disabled in dev

  // ... trace operations ...
  await client.flush();
} catch {
  // Telemetry failures must never disrupt the main flow
}

This means:

  • No Opik API key? App works normally, traces silently skipped
  • Opik API down? User flow completes, traces lost (acceptable trade-off)
  • Flush fails? Background retry via batch queue, user unaware

The getOpikClient() singleton returns null when OPIK_API_KEY isn't set, so every call site naturally degrades.
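A minimal sketch of that singleton — `OpikClient` is a stub interface here so the sketch is self-contained; the real code would construct the actual Opik SDK client:

```typescript
// Sketch of the null-returning singleton described above.
interface OpikClient {
  flush(): Promise<void>;
}

let cached: OpikClient | null | undefined;

function getOpikClient(): OpikClient | null {
  if (cached !== undefined) return cached; // memoized after the first call
  if (!process.env.OPIK_API_KEY) {
    cached = null; // telemetry disabled: every call site no-ops
    return cached;
  }
  // In the real code this would be e.g. a `new Opik({ ... })` construction.
  cached = { flush: async () => {} };
  return cached;
}
```

Because `null` is cached too, the env check runs once per process, not once per trace.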


Reasoning Summary Sanitization

Every trace includes a human-readable reasoning_summary field. But since Pause deals with financial behavior, we sanitize clinical terminology:

const BANNED_SUMMARY_TERMS: Record<string, string> = {
  therapy: "reflection",
  therapist: "high-risk",
  diagnosis: "assessment",
  patient: "user",
  treatment: "approach",
  clinical: "structured",
  session: "interaction",
};

Summaries are capped at 200 characters with word-boundary truncation:

if (summary.length > MAX_SUMMARY_LENGTH) {
  const spaceIdx = summary.lastIndexOf(" ", MAX_SUMMARY_LENGTH);
  summary = `${summary.slice(0, spaceIdx === -1 ? MAX_SUMMARY_LENGTH : spaceIdx)}...`;
}

This ensures Opik traces are always professional and audit-safe — no accidental clinical language in the observability layer.
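Combining the two mechanisms, a sanitizer sketch might look like this — the term map and truncation logic mirror the snippets above, while the whole-word, case-insensitive replacement strategy is an assumption:

```typescript
// Sketch of a summary sanitizer: term replacement plus bounded truncation.
const MAX_SUMMARY_LENGTH = 200;

const BANNED_SUMMARY_TERMS: Record<string, string> = {
  therapy: "reflection",
  therapist: "high-risk",
  diagnosis: "assessment",
  patient: "user",
  treatment: "approach",
  clinical: "structured",
  session: "interaction",
};

function sanitizeSummary(raw: string): string {
  let summary = raw;
  for (const [banned, replacement] of Object.entries(BANNED_SUMMARY_TERMS)) {
    // Case-insensitive, whole-word replacement (assumed strategy).
    summary = summary.replace(new RegExp(`\\b${banned}\\b`, "gi"), replacement);
  }
  if (summary.length > MAX_SUMMARY_LENGTH) {
    // Truncate at the last word boundary before the cap, as in the post.
    const spaceIdx = summary.lastIndexOf(" ", MAX_SUMMARY_LENGTH);
    summary = `${summary.slice(0, spaceIdx === -1 ? MAX_SUMMARY_LENGTH : spaceIdx)}...`;
  }
  return summary;
}
```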


What We Learned

1. Trace immutability changes your architecture

Opik traces are write-once. We expected to update traces as the interaction progressed. Instead, we developed the child trace pattern with interactionId linkage. This turned out to be better than mutation — each child trace is a timestamped event, creating a natural timeline.

2. Name your traces from Day 1

We wired getGuardianTelemetry() in the first sprint. Every feature added afterward was automatically traced with a meaningful name. Projects that retrofit tracing end up with a mix of ai.generateText.123abc and proper names — unusable for analysis.

3. Feedback scores are the killer feature

Raw traces show what the AI did. Feedback scores show whether it worked. The ability to filter Opik for intervention_acceptance = 0.0 and see exactly why users rejected interventions is what makes observability actionable, not just informational.

4. Fire-and-forget is non-negotiable

We chose Node runtime over Edge specifically because after() callbacks are more reliable on Node. Opik trace fidelity is worth the marginally slower cold starts. But even with Node, wrapping everything in try-catch ensures a flaky network never blocks a user from unlocking their card.

5. Separate behavioral data from system health

Using system:failure:* vs guardian:* namespaces means you can analyze AI quality and system reliability independently. A spike in system:failure:break_glass traces means infrastructure problems, not bad AI strategies.


The Opik + ACE Connection

Opik doesn't just observe the AI — it observes the AI learning. The learning:reflection and learning:skillbook_update traces create a visible paper trail from "user rejected strategy A" → "Reflector analyzed why" → "SkillManager added strategy B" → "next interaction used strategy B successfully."

This is the bridge to the ACE Self-Learning Deep Dive — where we explain how the Skillbook grows from seed strategies into a personalized knowledge base, with every step traced in Opik.


File Reference

| File | Role |
| --- | --- |
| apps/web/instrumentation.ts | Global OTel + OpikExporter setup |
| apps/web/src/lib/server/opik.ts | Client singleton, telemetry builder, trace operations, feedback scores |
| apps/web/src/lib/guardian/trace-names.ts | 16 named trace types with TypeScript const assertion |
| apps/web/src/lib/server/learning.ts | Learning pipeline with reflection + skillbook update traces |
| apps/web/src/app/api/ai/guardian/route.ts | Guardian streaming with experimental_telemetry injection |
| apps/web/src/app/api/ai/feedback/route.ts | Feedback score attachment to traces |
| packages/ace/src/observability/opik_integration.ts | ACE framework's own Opik integration layer |

Built with Next.js 16 + Opik by Comet ML + Vercel AI SDK v6 + ACE Framework
