DEV Community

NeuroLink AI

AI Observability: Logging, Tracing, and Monitoring Your AI Calls

Your AI is a black box. Here's how to open it.

You deployed an AI feature. Users are complaining it's slow. Sometimes it returns garbage. You have no idea which model ran, what prompt was sent, or how many tokens it consumed. You open your cloud dashboard and see a single line item: "AI API calls — $847.23 this month."

That's the state of most AI applications in production. You are flying blind.

The good news: this is a solved problem. OpenTelemetry has defined standard semantic conventions for AI systems. Langfuse gives you a beautiful UI to inspect every trace. And NeuroLink wires all of this up automatically — zero boilerplate, one config block.

This article shows you how to go from zero observability to full tracing, monitoring, and EU AI Act-ready audit logging in under 30 minutes.


Why AI Observability Is No Longer Optional

Observability has always mattered for APIs and databases. For AI, it matters more — and for reasons beyond debugging.

Debugging: When your model returns a hallucination, you need to know the exact prompt, the provider, the model version, the temperature setting, and the full token usage. Without traces, you're guessing.

Cost management: LLM API costs are non-linear. A single misbehaving agent can consume thousands of dollars in a weekend. Token-level tracing lets you catch runaway usage before it hits your invoice.
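One way to catch runaway usage early is a per-day token budget checked on every call. A minimal sketch, assuming nothing beyond plain TypeScript; `TokenBudget` and `record` are illustrative names, not part of NeuroLink's API:

```typescript
// Sketch: a simple runaway-cost guard (hypothetical helper, not a NeuroLink API).
interface UsageEvent {
  inputTokens: number;
  outputTokens: number;
}

class TokenBudget {
  private spent = 0;
  constructor(private readonly dailyLimit: number) {}

  // Record a call's usage; returns false once the budget is exhausted.
  record(usage: UsageEvent): boolean {
    this.spent += usage.inputTokens + usage.outputTokens;
    return this.spent <= this.dailyLimit;
  }

  get remaining(): number {
    return Math.max(0, this.dailyLimit - this.spent);
  }
}

const budget = new TokenBudget(1_000_000); // 1M tokens/day
const ok = budget.record({ inputTokens: 1200, outputTokens: 800 });
console.log(ok, budget.remaining); // true 998000
```

Feed each call's `result.usage` into a guard like this and alert (or stop the agent) when `record` returns false, rather than discovering the overrun on your invoice.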

Performance: Is your p99 latency 8 seconds because of the model, the network, your RAG pipeline, or a slow tool call? You can only answer that with distributed tracing.

Compliance: The EU AI Act's high-risk provisions go live August 2, 2026. Penalties reach €35M or 7% of global revenue. Among the requirements: maintaining auditable records of AI decision-making, human oversight procedures, and risk documentation. Auditable AI is now a regulatory requirement, not a best practice.


NeuroLink's Observability Stack

NeuroLink ships with two observability integrations out of the box:

  1. Langfuse — the leading open-source LLM observability platform, with a hosted cloud option and self-hosted Docker deployment
  2. OpenTelemetry — the CNCF standard for distributed tracing, compatible with Jaeger, Tempo, Honeycomb, Datadog, and any OTel-compatible backend

You can use one, the other, or both simultaneously. Here is the full configuration surface:

import { NeuroLink } from "@juspay/neurolink";

const neurolink = new NeuroLink({
  observability: {
    langfuse: {
      enabled: true,
      publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
      secretKey: process.env.LANGFUSE_SECRET_KEY!,
      environment: "production",

      // How traces are named in Langfuse
      traceNameFormat: "userId:operationName",
      // Format options:
      //   "userId:operationName"  -> "user@email.com:ai.streamText"
      //   "operationName:userId"  -> "ai.streamText:user@email.com"
      //   "operationName"         -> "ai.streamText"
      //   "userId"                -> "user@email.com"
      //   (ctx) => `[${ctx.operationName}] ${ctx.userId}`  // custom function

      autoDetectOperationName: true,

      // If your app already has an OTel setup, plug in instead of creating a new one
      useExternalTracerProvider: false,
      autoDetectExternalProvider: true,
      skipLangfuseSpanProcessor: false,
    },

    openTelemetry: {
      enabled: true,
      endpoint: "https://otel-collector.example.com",
      serviceName: "my-ai-service",
      serviceVersion: "1.0.0",
    },
  },
});

That's the entire setup. Every generate() call is now automatically instrumented.


Langfuse: Seeing Inside Every LLM Call

Langfuse traces show you the full lifecycle of each AI request: inputs, outputs, model selection, token counts, latency, and cost — all in a searchable UI.

Basic Setup

Sign up at langfuse.com (or self-host with Docker). Grab your public and secret keys from the project settings.

import { NeuroLink } from "@juspay/neurolink";

const neurolink = new NeuroLink({
  observability: {
    langfuse: {
      enabled: true,
      publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
      secretKey: process.env.LANGFUSE_SECRET_KEY!,
      environment: "production",
      traceNameFormat: "userId:operationName",
      autoDetectOperationName: true,
    },
  },
});

Correlating Traces with requestId

Every AI call can carry a requestId that flows through your entire observability stack. This is the key to answering "which AI call was responsible for this user complaint?"

// requestId appears in Langfuse traces, OTel spans, and your application logs
const result = await neurolink.generate({
  input: { text: "Analyze sentiment of customer feedback" },
  provider: "anthropic",
  model: "claude-sonnet-4-6",
  requestId: "req-customer-feedback-001",  // your request correlation ID
});

// result.analytics contains per-call metrics
console.log(`Tokens used: ${result.usage?.totalTokens}`);
console.log(`Response time: ${result.responseTime}ms`);
console.log(`Provider: ${result.provider}, Model: ${result.model}`);

In Langfuse, you can search by requestId and immediately pull up the full trace: the exact prompt, the model response, token counts, latency breakdown, and cost.

Trace Naming Strategies

The traceNameFormat option controls how traces are organized in Langfuse. For multi-tenant applications, userId:operationName groups all of a user's AI activity together. For debugging by operation type, operationName:userId lets you filter by what the AI was doing.

You can also use a custom function for complete control:

const neurolink = new NeuroLink({
  observability: {
    langfuse: {
      enabled: true,
      publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
      secretKey: process.env.LANGFUSE_SECRET_KEY!,
      // Custom trace naming for your specific context
      traceNameFormat: (ctx) => `[${ctx.operationName}] user=${ctx.userId} env=${process.env.NODE_ENV}`,
    },
  },
});

OpenTelemetry: Distributed Tracing with GenAI Semantic Conventions

OpenTelemetry's GenAI semantic conventions define a standard set of attributes for AI spans. NeuroLink automatically populates all of them on every call.

What Gets Traced

The following OTel attributes are captured on every generate() call:

gen_ai.system                  -> "anthropic" | "openai" | "vertex" | ...
gen_ai.request.model           -> "claude-sonnet-4-6" | "gpt-4o" | ...
gen_ai.response.model          -> the model that actually responded
gen_ai.usage.input_tokens      -> prompt token count
gen_ai.usage.output_tokens     -> completion token count
gen_ai.request.temperature     -> temperature setting
gen_ai.request.max_tokens      -> max tokens setting
ai.operationId                 -> operation identifier
ai.finishReason                -> "stop" | "length" | "tool_calls" | ...

These are the same attributes used by Datadog, Honeycomb, and every major OTel-compatible APM. Your AI spans integrate seamlessly with your existing infrastructure traces.
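If you also log call results yourself, you can emit the same attribute names so your logs line up with the spans. A sketch, assuming a result shape like NeuroLink's `generate()` output; the mapping function itself is illustrative:

```typescript
// Sketch: mapping a call result onto the GenAI attribute names listed above.
// The result shape mirrors generate()'s output; toGenAiAttributes is a
// hypothetical helper, not part of NeuroLink.
interface CallResult {
  provider: string;
  model: string;
  usage?: { inputTokens: number; outputTokens: number };
  finishReason?: string;
}

function toGenAiAttributes(r: CallResult): Record<string, string | number> {
  const attrs: Record<string, string | number> = {
    "gen_ai.system": r.provider,
    "gen_ai.request.model": r.model,
  };
  if (r.usage) {
    attrs["gen_ai.usage.input_tokens"] = r.usage.inputTokens;
    attrs["gen_ai.usage.output_tokens"] = r.usage.outputTokens;
  }
  if (r.finishReason) {
    attrs["ai.finishReason"] = r.finishReason;
  }
  return attrs;
}
```

Using the standard names means a Datadog or Honeycomb dashboard built for OTel GenAI spans will also understand your application-level logs.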

Connecting to Your OTel Collector

import { NeuroLink } from "@juspay/neurolink";

const neurolink = new NeuroLink({
  observability: {
    openTelemetry: {
      enabled: true,
      endpoint: "https://otel-collector.your-domain.com",
      serviceName: "ai-service",
      serviceVersion: "2.1.0",
    },
  },
});

For teams already running OTel in their application (common in microservice architectures), NeuroLink can plug into your existing tracer provider instead of creating a new one:

const neurolink = new NeuroLink({
  observability: {
    langfuse: {
      enabled: true,
      publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
      secretKey: process.env.LANGFUSE_SECRET_KEY!,
      // Don't create a new OTel provider — use the one already initialized
      useExternalTracerProvider: true,
      autoDetectExternalProvider: true,
    },
  },
});

This means your AI spans appear in the same trace as your database queries and HTTP calls — full end-to-end visibility.


Context Compaction: Observing What You Can't See

Long-running agent conversations hit context limits. When they do, something has to give. NeuroLink handles this with a 4-stage context compaction pipeline — and understanding this pipeline is critical for observability.

Stage 1: Tool Output Pruning     (no LLM call — free)
Stage 2: File Read Deduplication (no LLM call — free)
Stage 3: LLM Summarization       (LLM call — costs tokens)
Stage 4: Sliding Window Truncation (no LLM call — fallback)

The pipeline tries the cheapest option first and escalates only when necessary. But Stage 3 — LLM summarization — is itself an AI call that consumes tokens and incurs cost. Without observability, you would not know it was happening.

With Langfuse tracing enabled, each compaction stage appears as a child span in your trace. You can see exactly when context compaction triggers, which stage ran, and how many tokens the summarization consumed.
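The cheapest-first escalation can be sketched as a short loop. The stage implementations below are stubs with made-up token savings; only the control flow mirrors the pipeline described above:

```typescript
// Sketch of the cheapest-first escalation. Stage bodies are stubs;
// the numbers are illustrative, not NeuroLink's actual behavior.
type Stage = { name: string; run: (tokens: number) => number };

function compact(
  tokens: number,
  limit: number,
  stages: Stage[]
): { tokens: number; ran: string[] } {
  const ran: string[] = [];
  for (const stage of stages) {
    if (tokens <= limit) break; // stop as soon as the context fits
    tokens = stage.run(tokens);
    ran.push(stage.name); // each stage that runs shows up as a child span
  }
  return { tokens, ran };
}

const stages: Stage[] = [
  { name: "prune",       run: (t) => t - 20_000 },            // free
  { name: "deduplicate", run: (t) => t - 10_000 },            // free
  { name: "summarize",   run: (t) => Math.floor(t * 0.4) },   // LLM call
  { name: "truncate",    run: (t) => Math.floor(t * 0.5) },   // fallback
];
```

Running `compact(150_000, 100_000, stages)` would escalate through prune, deduplicate, and summarize, then stop before truncation once the context fits, which is exactly the shape you want to see in the trace.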

Here is how to configure the compaction pipeline:

import { ContextCompactor } from "@juspay/neurolink";

const compactor = new ContextCompactor({
  enablePrune: true,
  enableDeduplicate: true,
  enableSummarize: true,       // Stage 3: uses an LLM — visible in traces
  enableTruncate: true,        // Stage 4: fallback, no LLM cost

  pruneProtectTokens: 40_000,    // protect the last 40k tokens from pruning
  pruneMinimumSavings: 20_000,   // only prune if it saves 20k+ tokens
  pruneProtectedTools: ["skill"],

  summarizationProvider: "vertex",
  summarizationModel: "gemini-2.5-flash",
  keepRecentRatio: 0.3,
  truncationFraction: 0.5,
});

The choice of summarizationProvider and summarizationModel lets you route compaction calls to a cheaper model — for example, using Gemini Flash for summarization while your main agent uses Claude Sonnet. This cost optimization is visible in your Langfuse traces: you will see two different models in the same conversation trace.


HITL Audit Logging: The Compliance Layer

Human-in-the-Loop (HITL) is NeuroLink's safety system for AI agents that take real-world actions. When an agent tries to call a dangerous tool — delete, drop, truncate, kill — HITL intercepts it and waits for human approval.

HITL's audit logging is a core part of your observability stack, especially for EU AI Act compliance. Every approval, rejection, and timeout is logged with full context.

import { NeuroLink } from "@juspay/neurolink";

const neurolink = new NeuroLink({
  hitl: {
    enabled: true,
    dangerousActions: ["delete", "drop", "truncate", "remove", "kill"],
    timeout: 30000,              // 30 seconds for human to respond
    allowArgumentModification: true,   // human can edit args before approving
    autoApproveOnTimeout: false,       // reject on timeout — safe default
    auditLogging: true,                // write compliance audit trail

    customRules: [
      {
        name: "production-database-rule",
        condition: (toolName, args) => {
          return toolName.includes("database") &&
                 JSON.stringify(args).includes("production");
        },
        requiresConfirmation: true,
        customMessage: "This action touches the production database!",
      },
    ],
  },
});

The HITLAuditLog type captures everything the EU AI Act's human oversight requirements ask for:

// What the audit log captures (HITLAuditLog type):
{
  eventType:    "confirmation-request" | "confirmation-response" | "timeout",
  toolName:     string,           // which tool was called
  arguments:    unknown,          // what arguments it was called with
  approved:     boolean,          // what the human decided
  reason:       string,           // why they approved or rejected
  userId:       string,           // who made the decision
  ipAddress:    string,           // from where
  userAgent:    string,           // from which client
  responseTime: number,           // how long the human took (ms)
  timestamp:    string,           // ISO timestamp
}

This audit log is your paper trail. When a regulator asks "did a human review this AI action?", you have a timestamped, tamper-evident record.
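If you want a copy of that trail under your own retention policy, one simple option is an append-only JSON Lines file. A sketch under that assumption; NeuroLink's `auditLogging` handles its own persistence, and `writeAudit` here is a hypothetical helper:

```typescript
import { appendFileSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Sketch: mirroring HITL audit events into your own append-only JSONL store.
// writeAudit is illustrative; the event shape follows the HITLAuditLog fields.
interface AuditEvent {
  eventType: "confirmation-request" | "confirmation-response" | "timeout";
  toolName: string;
  approved?: boolean;
  userId?: string;
  timestamp: string;
}

const AUDIT_FILE = join(tmpdir(), "hitl-audit.jsonl");

function writeAudit(event: AuditEvent): void {
  // One JSON object per line: trivially greppable, trivially replayable.
  appendFileSync(AUDIT_FILE, JSON.stringify(event) + "\n");
}

writeAudit({
  eventType: "confirmation-response",
  toolName: "drop_table",
  approved: false,
  userId: "ops@example.com",
  timestamp: new Date().toISOString(),
});
```

Append-only JSONL is a reasonable baseline; for tamper evidence you would layer on write-once storage or hash chaining.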

HITL Statistics

You can query live statistics from the HITL manager at any time:

const stats = hitlManager.getStatistics();
// {
//   totalRequests:       number,  // all-time confirmation requests
//   pendingRequests:     number,  // currently awaiting human decision
//   averageResponseTime: number,  // ms average across all decisions
//   approvedRequests:    number,
//   rejectedRequests:    number,
//   timedOutRequests:    number,
// }

Expose this via a metrics endpoint and you have a live dashboard of how often your AI is asking for human oversight — and how quickly humans are responding.
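One way to expose those numbers is Prometheus text format. A sketch assuming the statistics shape shown above; the metric names are illustrative, not an official exporter:

```typescript
// Sketch: rendering HITL statistics in Prometheus exposition format.
// The stats shape matches getStatistics() above; metric names are made up.
interface HITLStatistics {
  totalRequests: number;
  pendingRequests: number;
  averageResponseTime: number;
  approvedRequests: number;
  rejectedRequests: number;
  timedOutRequests: number;
}

function toPrometheus(stats: HITLStatistics): string {
  return [
    `hitl_requests_total ${stats.totalRequests}`,
    `hitl_requests_pending ${stats.pendingRequests}`,
    `hitl_response_time_ms_avg ${stats.averageResponseTime}`,
    `hitl_requests_approved ${stats.approvedRequests}`,
    `hitl_requests_rejected ${stats.rejectedRequests}`,
    `hitl_requests_timed_out ${stats.timedOutRequests}`,
  ].join("\n");
}
```

Serve this string from a `/metrics` route and any Prometheus-compatible scraper can graph approval rates and human response times over time.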


Putting It All Together: A Production-Ready Observable AI Setup

Here is a complete production setup combining all three layers: Langfuse tracing, OTel spans, and HITL audit logging.

import { NeuroLink } from "@juspay/neurolink";

const neurolink = new NeuroLink({
  // Full observability stack
  observability: {
    langfuse: {
      enabled: true,
      publicKey: process.env.LANGFUSE_PUBLIC_KEY!,
      secretKey: process.env.LANGFUSE_SECRET_KEY!,
      environment: process.env.NODE_ENV as "production" | "development",
      traceNameFormat: "userId:operationName",
      autoDetectOperationName: true,
    },
    openTelemetry: {
      enabled: true,
      endpoint: process.env.OTEL_EXPORTER_OTLP_ENDPOINT!,
      serviceName: "ai-backend",
      serviceVersion: process.env.npm_package_version ?? "unknown",
    },
  },

  // HITL with audit logging for compliance
  hitl: {
    enabled: true,
    dangerousActions: ["delete", "drop", "truncate", "remove", "kill"],
    timeout: 30000,
    allowArgumentModification: true,
    autoApproveOnTimeout: false,
    auditLogging: true,
  },
});

// Every generate() call is now fully traced
async function analyzeCustomerFeedback(
  userId: string,
  feedbackText: string,
  requestId: string
) {
  const result = await neurolink.generate({
    input: { text: `Analyze the sentiment and key themes in: ${feedbackText}` },
    provider: "anthropic",
    model: "claude-sonnet-4-6",
    requestId,  // correlates this call across Langfuse, OTel, and your own logs
  });

  // result.analytics — per-call metrics
  console.log(`Tokens: ${result.usage?.totalTokens}`);
  console.log(`Cost: $${result.analytics?.cost?.toFixed(6)}`);
  console.log(`Latency: ${result.responseTime}ms`);
  console.log(`Provider: ${result.provider}`);

  return result.content;
}

When you call analyzeCustomerFeedback, here is what happens automatically:

  1. A trace is created in Langfuse with the name userId:ai.generate
  2. An OTel span is created with gen_ai.* attributes
  3. The span is exported to your OTel collector
  4. requestId appears in both traces, linking them
  5. If a tool call triggers HITL, the approval decision is written to the audit log
  6. result.analytics gives you the cost and token breakdown

Reading the Analytics Data

Every generate() result includes an analytics field with per-call metrics. You do not need observability enabled to access this — it is always present.

const result = await neurolink.generate({
  input: { text: "Summarize this quarterly report" },
  provider: "openai",
  model: "gpt-4o",
});

// Usage breakdown
console.log(result.usage?.inputTokens);    // prompt tokens
console.log(result.usage?.outputTokens);   // completion tokens
console.log(result.usage?.totalTokens);    // sum

// Performance
console.log(result.responseTime);          // milliseconds

// Which provider/model actually ran
console.log(result.provider);             // "openai"
console.log(result.model);               // "gpt-4o"

// Tools called (if any)
console.log(result.toolsUsed);           // ["search", "read_file"]

// Cost (if analytics enabled — it is by default)
console.log(result.analytics?.cost);     // USD float

Combine this with requestId correlation and you can build a complete picture: which user triggered which AI call, what it cost, how long it took, and which tools it used — all from your own application logs, without needing to open Langfuse.
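That picture is easiest to assemble if every call emits one structured log line. A sketch, assuming the result fields shown above; `logAiCall` is a hypothetical helper for your own logging layer:

```typescript
// Sketch: one structured log entry per AI call, keyed by requestId.
// The result fields mirror generate()'s output; logAiCall is illustrative.
interface AiCallResult {
  provider: string;
  model: string;
  responseTime: number;
  usage?: { totalTokens: number };
  analytics?: { cost: number };
  toolsUsed?: string[];
}

function logAiCall(requestId: string, userId: string, r: AiCallResult): string {
  return JSON.stringify({
    event: "ai.call",
    requestId, // same ID that appears in Langfuse and OTel
    userId,
    provider: r.provider,
    model: r.model,
    latencyMs: r.responseTime,
    totalTokens: r.usage?.totalTokens ?? null,
    costUsd: r.analytics?.cost ?? null,
    tools: r.toolsUsed ?? [],
  });
}
```

Ship these lines to your existing log pipeline and `requestId` becomes the join key between your logs, Langfuse traces, and OTel spans.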


EU AI Act Compliance Checklist

With NeuroLink's observability stack fully configured, here is what you can demonstrate to an auditor:

| Requirement | How NeuroLink Covers It |
| --- | --- |
| Audit trail for AI decisions | HITL HITLAuditLog with userId, timestamp, decision, responseTime |
| Human oversight records | HITL approval/rejection events with allowArgumentModification |
| AI system inventory | result.provider + result.model in every trace gives you a live model inventory |
| Input/output logging | Langfuse traces capture full prompt and response |
| Performance monitoring | OTel spans + result.responseTime per call |
| Cost and usage tracking | result.analytics?.cost + result.usage per call |
| Risk documentation | HITLStatistics gives aggregate oversight metrics |

The August 2026 deadline is months away. The teams scrambling to retrofit compliance into their AI stack are the ones who did not build with an observable SDK from the start.


What's Next

You have seen how NeuroLink turns your AI calls from a black box into a fully observable system. Every call is traced. Every cost is captured. Every human oversight decision is audited.

Try it yourself:

npm install @juspay/neurolink

Then sign up for a free Langfuse account, drop in your keys, and run your first traced AI call in under 5 minutes.


This is part of the NeuroLink AI Development series. Previous articles covered HITL safety systems, RAG pipelines, MCP in production, and multi-model workflow engines.
