DEV Community: grepture

LLM Observability Tools Compared: The 2026 Landscape

grepture — Sat, 02 May 2026 18:47:15 +0000

The LLM observability category is fragmented

Search for "LLM observability" today and you'll get results from eight tools that do subtly different things. One is a tracing SDK you wire into your app. Another is a reverse proxy that logs every request. A third is an evals platform that happens to include tracing. A fourth is an enterprise ML monitoring product that added LLM support last year.

They all claim the same keywords — tracing, observability, logging, cost tracking — but their architectures, data models, and strengths diverge significantly. Picking the wrong one costs you weeks of integration work and, worse, leaves blind spots in production.

This post is the map we wish we'd had when we started building Grepture. We'll cover the eight tools most teams evaluate in 2026, how they actually differ, and when to pick each. We build a tool in this space, so we'll flag that clearly — but the bulk of this post is about the other seven, because you need that context first.

What you're actually evaluating

Before the tool-by-tool walkthrough, here are the five dimensions that matter.

Architecture. Is it a proxy (requests flow through it), an SDK (you instrument your code), or both? Proxies give you coverage without code changes but add a network hop. SDKs are zero-latency but require integration in every service.

Data captured by default. Some tools log full prompts and completions. Others capture only metadata (tokens, latency, errors). This matters for privacy — if your prompts contain PII, a default-log-everything tool creates a compliance liability you probably didn't plan for.

Evals vs. monitoring orientation. Some platforms are built around experiments and LLM-as-judge evals; observability is secondary. Others are production-monitoring first with evals bolted on.

Cost tracking granularity. Token counts are table stakes. The real question is: can you attribute spend to a team, a feature, an environment, or a user? And can you set budget alerts before the CFO notices?

Deployment model. Open source self-host, cloud, or both? This is usually a compliance question, not a cost question. EU-regulated teams often need self-host; US startups rarely do.

The eight tools

1. Langfuse

Langfuse is the most widely deployed open-source LLM observability platform. It's MIT-licensed, self-hostable, and has a generous cloud free tier.

Architecture: SDK-based tracing. Not a proxy.
Strengths: Open source with active community. Rich tracing model. Built-in prompt management and evals. Self-host is genuinely usable (needs PostgreSQL, ClickHouse, Redis, blob storage).
Weaknesses: Instrumentation burden in every service. No gateway features.
Pick if: You want open-source, can instrument code, don't need a proxy.

2. Helicone

Helicone is the clearest example of "observability as a proxy." Change your OpenAI base URL to Helicone's endpoint and every request gets logged.

Architecture: HTTP proxy, primarily. Async logging mode also available.
Strengths: Zero-code integration. Strong cost tracking and user-level attribution. Caching built in.
Weaknesses: Proxy adds a network hop. Basic evals and prompt management.
Pick if: You want fastest integration and are comfortable with a proxy in the request path.

3. Arize (Phoenix + AX)

Arize comes from traditional ML observability and extended into LLMs. Phoenix is the open-source tracing library; Arize AX is the paid enterprise platform.

Architecture: OpenTelemetry-based SDK.
Strengths: Deep eval and drift-detection heritage. Best if you also monitor traditional ML models. OTel plays nicely with existing observability stacks.
Weaknesses: Enterprise-oriented pricing. Overkill for most startups.
Pick if: You're a larger org already running ML in production.

4. Braintrust

Braintrust is evals-first. Observability is there, but the product is organized around experiments, scoring, and iterating on prompts.

Architecture: SDK + strong web UI for evals.
Strengths: Best eval workflow on this list, by a wide margin. Playground, datasets, and LLM-as-judge scoring tightly integrated.
Weaknesses: More product than you need for pure monitoring. Closed source, cloud only.
Pick if: Your team iterates heavily on prompts and evals.

5. Lunary

Lunary (formerly LLMonitor) is a lightweight open-source platform aimed at indie devs and small teams.

Architecture: SDK-based tracing, also offers a proxy mode.
Strengths: Simple setup, clean UI, open source. Decent cost tracking.
Weaknesses: Smaller team and ecosystem than Langfuse. Basic evals.
Pick if: You're a small team and Langfuse feels heavy.

6. Humanloop

Humanloop leans into prompt management and evaluation more than pure observability.

Architecture: SDK-based with strong prompt versioning.
Strengths: Excellent prompt-management story — versioning, deployment, non-engineer collaboration.
Weaknesses: Observability is secondary. Closed source, enterprise pricing.
Pick if: Prompt management and non-engineer collaboration are your primary pain points.

7. LangSmith

LangSmith is LangChain's official observability and eval platform.

Architecture: SDK-based tracing, tightly integrated with LangChain primitives.
Strengths: Zero-friction if you're already in LangChain. Deep agent, tool call, and chain run support.
Weaknesses: Feels bolted-on if you're using raw SDKs. Closed source.
Pick if: You're committed to LangChain/LangGraph.

8. Grepture

Disclosure: this is us. Grepture started as a content-aware AI gateway with PII redaction and expanded into full observability.

Architecture: Proxy + SDK. Trace-only mode for zero latency overhead, full gateway mode when you need routing or redaction.
Strengths: Observability + AI gateway + PII redaction in one. Multi-provider routing and fallback. EU-hosted with GDPR defaults.
Weaknesses: Smaller eval workflow than Braintrust. Younger product than Langfuse or Helicone.
Pick if: You want observability + PII handling + cost tracking + multi-provider routing in one tool. Especially if you're EU-based.

Side-by-side comparison

Tool	Architecture	Open source	Evals	Gateway features	Cost tracking	Best for
Langfuse	SDK	Yes (MIT)	Strong	No	Good	Open-source tracing
Helicone	Proxy	Yes	Basic	Partial	Strong	Fastest integration
Arize	SDK (OTel)	Partial (Phoenix)	Strong	No	Good	Enterprise ML + LLM
Braintrust	SDK	No	Best-in-class	No	Basic	Eval-heavy workflows
Lunary	SDK + proxy	Yes	Basic	Limited	Good	Small teams
Humanloop	SDK	No	Strong	No	Good	Prompt-first teams
LangSmith	SDK	No	Strong	No	Good	LangChain users
Grepture	Proxy + SDK	No	Production-focused	Full	Strong	Obs + gateway + PII

How to decide

Start with integration constraint. Can't touch every service? You need a proxy — Helicone, Lunary (proxy mode), or Grepture. Can instrument? Everything else opens up.

Then filter on evals vs. monitoring. Daily prompt iteration → Braintrust or Humanloop. Watching production → Langfuse, Helicone, or Grepture.

Then compliance. Self-host or EU residency → Langfuse, Phoenix, Lunary, Grepture EU.

Finally, scope creep. Single-purpose obs tools tend to expand. If you know you'll need gateway, evals, and prompt management later, pick something that already has them.

Key takeaways

"LLM observability" is fragmented — eight leading tools, four architectures, overlapping but distinct strengths.
Biggest fork is proxy vs. SDK. Pick based on whether you can instrument every service.
Evals and observability are converging. Eval tools now trace; tracing tools now eval.
Default data capture varies — if your prompts contain PII, check what the tool logs before integrating.
Pure observability is solved. The interesting question is whether you want it stitched together with gateway, prompt management, and redaction — or as separate products.

Securing MCP Connections Through Your AI Gateway

grepture — Thu, 02 Apr 2026 04:20:34 +0000

MCP Gives Agents the Keys — Who's Watching the Door?

The Model Context Protocol (MCP) is rapidly becoming the standard way AI agents interact with external tools — databases, file systems, APIs, code repositories. Instead of hardcoding integrations, developers expose MCP servers that agents discover and call dynamically.

This is powerful. It's also a massive expansion of your attack surface.

MCP effectively gives an AI model the ability to read files, query databases, make HTTP requests, and execute code — all based on instructions it receives in its context window. If an attacker can influence that context, they can influence what the agent does with your tools.

Most MCP security guidance focuses on building secure servers. That's important, but it's only half the picture. If your team consumes third-party MCP servers — or even internal ones you didn't write — you need security at the point where traffic flows: the gateway.

The MCP Attack Surface

Before diving into defenses, let's map the threats. MCP introduces four categories of risk that didn't exist with traditional API calls:

Tool poisoning — A malicious MCP server can advertise tools with deceptive descriptions. The tool named read_file might actually exfiltrate data to an external endpoint. Since the model selects tools based on their descriptions, a poisoned description can redirect agent behavior without any visible change to the user.

Data exfiltration via tool outputs — An MCP server returns data to the model as tool results. If the server has access to sensitive systems (databases, internal APIs), it can surface PII, credentials, or proprietary data into the model's context — where it may leak into logs, responses, or downstream tool calls.

Prompt injection through tool descriptions — MCP tool descriptions are included in the model's system context. An attacker who controls a tool description can inject instructions that override the user's intent. This is indirect prompt injection applied to tool metadata.

Over-permissive server configurations — MCP servers often expose more capabilities than needed. A file system server might grant read/write access to the entire disk when the agent only needs one directory. There's no built-in permission model in MCP itself.

Why Server-Side Security Isn't Enough

OWASP published a practical guide for secure MCP server development that covers input validation, output sanitization, and least-privilege configurations. It's solid guidance — for server authors.

But here's the thing: most teams aren't writing their own MCP servers. They're consuming them. Community-built servers for GitHub, Slack, Jira, databases, and file systems are being plugged into agent workflows with minimal review. You're trusting that every server you connect:

Validates its inputs correctly
Doesn't leak sensitive data in tool results
Has descriptions that accurately reflect behavior
Doesn't phone home with your data

That's a lot of trust. And even for internal servers you do control, there's no centralized visibility into what's actually flowing through MCP connections at runtime.

This is the same problem that API gateways solved for microservices a decade ago: you need a chokepoint where you can inspect, log, and enforce policy on all traffic — regardless of what's on either end.

Gateway-Level MCP Security

An AI gateway sitting between your agent and its MCP servers can provide controls that neither the client nor the server can enforce alone:

Inspect tool call arguments — Before a tool call reaches the MCP server, the gateway can scan arguments for PII (names, emails, credit card numbers, API keys) and either redact them or block the call entirely. This prevents your agent from accidentally sending customer data to a third-party tool.

Audit all MCP traffic — Every tool call, every result, every error — logged with full context including the trace ID, the originating prompt, and the model's reasoning. This creates the audit trail that compliance teams need and that MCP doesn't provide natively.

Detect and block injection — If the gateway detects prompt injection patterns in tool descriptions or tool results, it can block the response before it reaches the model. This is the critical difference between logging an attack and preventing it.

Rate-limit tool calls — An agent stuck in a loop can burn through API quotas and rack up costs. Gateway-level rate limiting per tool, per server, or per trace prevents runaway agents from causing damage.

Enforce allowlists — Only permit tool calls to approved MCP servers. If a poisoned tool description tries to redirect the agent to an unauthorized endpoint, the gateway blocks it.

Using Evals to Understand MCP Tool Usage

Logging tells you what happened. Evals tell you whether it was the right thing.

When MCP traffic flows through your gateway, you can run LLM-as-a-judge evaluations on tool call patterns to answer questions that logs alone can't:

Which tools is the model actually calling? — Track tool call distribution across your MCP servers. If a model suddenly starts calling a tool it's never used before, that's worth investigating.

Are tool calls relevant to the user's request? — An eval can score whether each tool call was necessary and appropriate given the original prompt. A low relevance score might indicate the model is being manipulated via indirect injection or is simply confused.

Is the model leaking data across tool calls? — Evaluate whether sensitive information from one tool's output is being passed into another tool's input. This catches data exfiltration patterns that per-call inspection might miss.

Quality scoring for tool results — Not all MCP servers are equal. Eval scores on tool result quality help you identify servers that return noisy, incomplete, or misleading data — before your users notice.

Running evals on production MCP traffic turns your gateway from a passive observer into an active quality and security monitor.

Inspect, Audit, and Block — Not Just Log

Most observability tools treat MCP traffic as just another set of log entries. That's not enough when your agent has write access to production systems.

The security model for MCP needs three layers:

Layer 1: Real-time inspection — Every tool call is scanned in-flight. PII detection runs on arguments and results. Injection patterns are matched against tool descriptions and outputs. This happens synchronously, before the data reaches its destination.

Layer 2: Active blocking — When inspection finds a threat, the gateway doesn't just flag it — it blocks the call. The model receives an error response, the trace records the blocked call with the reason, and an alert fires. This is the difference between "we detected an injection attempt in our logs" and "we stopped an injection attempt before it executed."

Layer 3: Continuous evaluation — Evals run asynchronously on completed traces, catching patterns that real-time inspection can't — like gradually escalating privilege across a chain of tool calls, or a model being slowly steered toward a specific tool by repeated subtle injections.

// Example: MCP tool call flowing through a gateway
// The gateway inspects, logs, and can block at each step

const response = await fetch("https://proxy.grepture.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": "Bearer gpt_your_key",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "claude-sonnet-4-5-20250514",
    messages: [
      { role: "user", content: "Summarize the Q1 sales report" }
    ],
    tools: [
      {
        type: "function",
        function: {
          name: "read_document",
          description: "Read a document from the company drive",
          parameters: {
            type: "object",
            properties: {
              path: { type: "string" }
            }
          }
        }
      }
    ]
  }),
});

// The gateway:
// 1. Logs the full tool call chain in a trace
// 2. Scans tool arguments for PII/secrets before forwarding
// 3. Checks tool descriptions for injection patterns
// 4. Blocks the call if a threat is detected
// 5. Runs async evals on tool call relevance and data flow

How Grepture Helps

Grepture sits in the request path as an AI gateway — which means MCP tool calls that flow through your LLM API already pass through Grepture. Here's what you get out of the box:

Full trace visibility — Every tool call appears in the trace waterfall, showing the complete chain of tool invocations with timing, arguments, and results. You can see exactly what your agent did and in what order.

PII detection on tool traffic — Grepture's detection rules run on tool call arguments and results, catching sensitive data before it leaves your infrastructure or enters your model's context. Over 50 built-in PII patterns, plus custom rules.

Injection detection and blocking — Prompt injection detection applies to the full request context, including tool descriptions and results. When an injection is detected, Grepture can block the request and log the attempt.

Evals on tool call patterns — Run evaluators on your MCP traffic to score tool call relevance, detect anomalous patterns, and track quality over time. Custom eval prompts let you define domain-specific quality criteria for your agent's tool usage.

Cost and usage tracking — Track token usage and cost per trace, so you know exactly how much each MCP-powered workflow costs — including the overhead of tool call chains.

Key Takeaways

MCP security doesn't stop at the server. If you consume MCP servers you didn't write, you need visibility and control at the gateway layer.
Inspect and block, don't just log. Real-time PII scanning and injection detection on tool call traffic prevents attacks instead of documenting them.
Evals add the "why" layer. Logging shows what tools were called; evals reveal whether those calls were appropriate, relevant, and safe.
Treat MCP like any other API surface. Gateway-level controls (rate limiting, allowlists, audit trails) are the same patterns that secured microservices — applied to AI agent workflows.
The August 2026 EU AI Act deadline makes this urgent. Article 14 requires human oversight of high-risk AI systems. An unmonitored agent with MCP tool access is the opposite of oversight.

LLM Evals on Real Traffic — Not Just Test Suites

grepture — Sat, 21 Mar 2026 21:24:25 +0000

The eval gap

Most teams know they should be evaluating their LLM outputs. Few actually do it in production.

The typical setup looks like this: you build a test suite with a handful of golden examples, run it in CI before deploys, and hope those examples are representative of what real users actually send. Sometimes they are. Often they're not. The prompts users write in production are messier, longer, and weirder than anything in your test fixtures. The edge cases that matter most are the ones you didn't think to include.

Meanwhile, the interesting data — the actual requests and responses flowing through your AI pipeline every day — sits in logs that nobody looks at until something breaks.

We think evals should run where the data already is.

Evals on production traffic

At Grepture, we built an AI gateway that sits in the request path of every LLM call — handling PII redaction, prompt management, cost tracking, and observability. That means every request and response is already logged with full context.

Starting today, Grepture can automatically evaluate that production traffic using LLM-as-a-judge scoring. You create an evaluator — from a template or with a custom judge prompt — tell it which traffic to score, and it runs in the background against your real logs. Each response gets a 0-to-1 score with written reasoning.

No synthetic datasets. No separate evaluation pipeline. No batch jobs to manage. Your production traffic is the test suite.

Setting up an evaluator

Evaluators are judge prompts with variables. At minimum, you need {{output}} — the LLM's response. You can also use {{input}} (the user's message) and {{system}} (the system prompt) for more context-aware scoring.

We ship six templates to get you started:

Relevance — does the response actually address the question?
Helpfulness — is the response actionable and useful?
Toxicity — is the response safe and appropriate?
Conciseness — does the response convey information efficiently?
Instruction following — does the response honour what the system prompt asked for?
Hallucination — is the response grounded in what was provided?

Pick a template, adjust the prompt if you want, and enable it. Each evaluator also supports filters — only score traffic from a specific model, provider, or prompt ID — and a sampling rate so you control how much you spend on judge tokens.

Why production traffic matters

Here's what you learn from evaluating real traffic that you can't learn from a test suite:

Distribution shifts. Your test suite reflects what you thought users would ask when you wrote it. Production traffic reflects what they actually ask today. When user behaviour changes — and it always does — evals on real traffic catch it.

Long-tail failures. The requests that cause the worst outputs are usually the ones nobody anticipated. A 5% hallucination rate across your test suite might hide a 40% hallucination rate on a specific class of user query you never tested for.

Model regressions. Providers update models without notice. A minor version bump to GPT-4o or Claude might improve average quality but degrade performance on your specific use case. If you're only testing pre-deploy, you won't catch regressions introduced by the model provider.

Prompt drift. If you're managing prompts separately from code (and you should be), every prompt change is a potential quality change. Evals on real traffic give you a continuous quality signal that follows prompt versions automatically.

Controlling evals

Running a judge LLM on every request can be overkill. Two levers help:

Sampling rate. Set each evaluator to score 10% of matching traffic and you get statistically meaningful quality signals at a tenth of the cost. For high-volume workloads, even 1-5% gives you enough data to spot trends.

Filters. Only evaluate what matters. Score production traffic but skip development requests. Evaluate only your customer-facing model. Focus on a specific prompt you're actively iterating on.

Why the gateway is the right place for this

Other evaluation tools require you to export logs, set up a separate pipeline, and manage another integration. That works, but it's friction — and friction means most teams never get around to it.

When your gateway already has every request and response logged with full context, evaluation is a natural extension. No data to export, no pipeline to build, no integration to maintain.

What's coming next

Evals today give you a quality score in the dashboard. We're building toward evals that actively tell you when something goes wrong:

Email and Slack alerts when average scores drop below a threshold
Webhook integrations to pipe results into your existing monitoring
Scheduled reports with weekly quality digests

The goal: quality monitoring as hands-off as the rest of your AI infrastructure.

If you're building with LLMs and want continuous quality visibility on your production traffic, give Grepture a try. Drop in the SDK, point your traffic through the proxy, and you'll have both cost visibility and quality scoring from day one.

Stop Leaking PII Through Your OpenAI API Calls

grepture — Thu, 05 Mar 2026 19:48:10 +0000

Every chat.completions.create call sends your prompt to OpenAI's servers. If that prompt contains user data — support tickets, form inputs, CRM records — there's a good chance it includes names, emails, phone numbers, and worse.

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    {
      role: "user",
      content: `Summarize this support ticket:

      From: Sarah Chen <sarah.chen@acme.com>
      Phone: (415) 555-0142
      SSN: 521-44-8832

      My order #38291 hasn't arrived. I live at
      742 Evergreen Terrace, Springfield, IL 62704.`,
    },
  ],
});

That single request just sent a name, email, phone number, SSN, and home address to an external service. Under GDPR, CCPA, or HIPAA, that's a compliance incident waiting to happen.

The problem is invisible

Most teams don't audit what's inside their AI prompts. The Authorization header is your OpenAI key — that's expected. The problem is the request body.

PII shows up in places you don't expect:

Support tickets — customer names, emails, account numbers embedded in the text
RAG chunks — documents from your vector store may contain PII from the original source
Chat history — previous messages in a conversation accumulate identifiers
CRM data — customer records pulled into prompts for personalization
Code snippets — hardcoded credentials, API keys, database connection strings

And it's not just direct identifiers. Under GDPR, data is personal if it can be combined with other information to identify someone. A user ID + timestamp + location? That's personal data.

What you can do about it

There are three approaches, from manual to automated:

1. Manual redaction (doesn't scale)

Write regex patterns or use string replacement to strip known PII patterns before each API call. This works for obvious cases (emails, phone numbers) but misses freeform PII like names in unstructured text.

// Fragile and incomplete
const sanitized = input
  .replace(/[\w.-]+@[\w.-]+\.\w+/g, "[EMAIL]")
  .replace(/\d{3}-\d{2}-\d{4}/g, "[SSN]");

Problems: you have to maintain the patterns, they miss edge cases, and you can't restore the original values in the response.

2. NER-based detection (better, but heavy)

Run a Named Entity Recognition model (spaCy, Presidio, etc.) on every prompt before sending it. More accurate for names and organizations, but adds latency and infrastructure complexity.

3. Proxy-level redaction

Put a scanning proxy between your app and the AI provider. Every request is inspected and sanitized before it leaves your infrastructure. No code changes in your application.

This is the approach I built Grepture around — it's an open-source security proxy that sits in front of any AI API. Here's what the setup looks like:

import OpenAI from "openai";
import { Grepture } from "@grepture/sdk";

const grepture = new Grepture({
  apiKey: process.env.GREPTURE_API_KEY!,
  proxyUrl: "https://proxy.grepture.com",
});

const openai = new OpenAI({
  ...grepture.clientOptions({
    apiKey: process.env.OPENAI_API_KEY!,
    baseURL: "https://api.openai.com/v1",
  }),
});

// Every request is now scanned — your code doesn't change
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: userInput }],
});

clientOptions() reroutes traffic through the proxy. Your OpenAI key is forwarded securely. The proxy scans every request against 50+ detection patterns (80+ on Pro) — emails, phone numbers, SSNs, credit cards, API keys, IBANs, and more.

Reversible redaction: the key feature

Plain redaction breaks things. If you strip all names from a support ticket, the AI's summary is useless — "The customer [REDACTED] has an issue with [REDACTED]."

Reversible redaction (mask-and-restore) solves this. PII is replaced with consistent tokens:

What OpenAI sees:

Summarize this support ticket:
From: [PERSON_1] <[EMAIL_1]>
Phone: [PHONE_1]
SSN: [SSN_1]
My order #38291 hasn't arrived. I live at [ADDRESS_1].

What your app gets back:

The customer Sarah Chen (sarah.chen@acme.com) is asking about
order #38291 which hasn't been delivered to 742 Evergreen Terrace,
Springfield, IL 62704.

The model processes clean data with consistent entity references. Your application receives the full, personalized response. No PII ever reaches OpenAI.

Works with any provider

While I used OpenAI in these examples, the same proxy approach works with any AI provider — Anthropic, Google Gemini, Azure OpenAI, AWS Bedrock, Mistral, Groq. You just change the baseURL and apiKey:

// Anthropic
const anthropic = new Anthropic({
  ...grepture.clientOptions({
    apiKey: process.env.ANTHROPIC_API_KEY!,
    baseURL: "https://api.anthropic.com",
  }),
});

// Google Gemini (OpenAI-compatible endpoint)
const gemini = new OpenAI({
  ...grepture.clientOptions({
    apiKey: process.env.GEMINI_API_KEY!,
    baseURL: "https://generativelanguage.googleapis.com/v1beta/openai",
  }),
});

For non-SDK calls (webhooks, custom HTTP requests), there's a drop-in fetch replacement:

const response = await grepture.fetch("https://api.example.com/data", {
  method: "POST",
  body: JSON.stringify(payload),
});

GDPR angle: why this matters now

If you're processing EU user data through AI APIs, every API call is a data transfer to a third-party processor. GDPR requires:

Data minimization — only send what's necessary
Data Processing Agreements — signed with every AI provider
Transfer Impact Assessments — for cross-border transfers to US providers

The simplest way to satisfy data minimization? Don't send personal data at all. Redact before the API call, restore after.

I wrote a longer guide on this: How to Make AI API Calls GDPR-Compliant.

Getting started

npm install @grepture/sdk
Get an API key at grepture.com — free tier includes 1,000 requests/month
Wrap your AI client with clientOptions() or use grepture.fetch()

The docs have setup guides for every major provider.