DEV Community: Vignesh Reddy

Why AI Agents Fail Silently — And How to Fix It A technical deep-dive into the observability gap in multi-step LLM systems

Vignesh Reddy — Thu, 25 Jun 2026 08:32:57 +0000

The incident that started this

A team ships a customer support agent built on LangChain. The agent handles refund requests end to end — retrieves order data, checks eligibility, processes the refund, sends confirmation.

It works perfectly in testing. They ship it.

Three weeks later, a customer escalates. They were denied a refund they were entitled to. The team pulls the logs. Every step returned HTTP 200. The agent reported "success" at each stage. But in step 2, the model hallucinated the wrong return policy window — 14 days instead of 30 — and every downstream step built on that hallucination.

The agent logged success while being confidently wrong.

This is not an edge case. This is the default behavior of every multi-step LLM system that doesn't have proper observability.

Why existing tools don't solve this

Tools like Datadog, Sentry, and even LLM-specific platforms like Langfuse and Helicone were designed around a simple mental model: one request, one response, done.

That model works fine for:

A single chatbot response
A RAG query
A one-shot classification

It breaks completely for agents, because agents are:

Stateful — each step depends on the output of the previous one. A hallucination in step 2 is invisible by step 5.

Multi-model — different steps may call different models with different reliability profiles.

Non-deterministic — the same input doesn't produce the same output twice. You can't just replay a test.

Cost-compounding — a loop that hits an edge case can make 50 LLM calls before returning. At GPT-4o pricing, that's a surprise invoice.

Contradiction-prone — a model can state X in step 3 and contradict X in step 8. Neither step looks wrong individually.

The result: teams are running agents with zero visibility into what's actually happening between the first request and the final output.

What proper agent observability looks like

After hitting this problem ourselves, we built Ajah — an open-source LLM observability gateway that sits between your application and any LLM provider.

Here's what it actually catches:

Hallucination scoring at every step

Every response that passes through the gateway gets scored by a local ML scorer for:

hallucination_risk (0.0–1.0)
grounding_score (0.0–1.0) — how well the response is grounded in provided context
factual_consistency_score (0.0–1.0)
claim_density_risk — flags responses that make many claims on little context

A single API call adds this to your trace automatically. No code changes to your agent.

Example output for a hallucinated step:

json{
"hallucination_risk": 0.87,
"grounding_score": 0.21,
"risk_level": "high",
"should_warn": true,
"rag_verdict": "contradicted"
}

The RAG verdict goes further — it checks each claim in the response against your source documents and returns per-claim verdicts:

json{
"rag_supported_claims": ["Order was placed on March 3rd"],
"rag_contradicted_claims": ["Return window is 14 days"],
"rag_unsupported_claims": ["Shipping was delayed by weather"]
}

You now know exactly which claim was wrong, not just that something was wrong.

Session step tree visualization

Every multi-agent session is grouped by X-Session-ID and rendered as a step tree in the dashboard.

[retrieve-order] → [check-eligibility] → [process-refund]
↓
[flag-for-review] → [send-notification]

Each node shows:

Quality score
Latency
Cost
Hallucination risk
Which step it fed into

You can click any node to see the masked prompt, the response, the RAG verification, and the cross-model agreement score. You can replay any trace with one click.

This is the difference between "the agent returned an error" and "step 2 hallucinated the return policy and step 3 processed a refund based on it."

Agent circuit breaker

Runaway agent loops are expensive and hard to detect manually. Ajah solves this at the infrastructure level.

Configure per-feature limits in the dashboard:

feature: customer-support
max_steps_per_session: 20
max_cost_per_session: 0.50 # USD

When a session hits either limit, the gateway trips the circuit breaker. The next request returns:

httpHTTP/1.1 429 Too Many Requests
X-Ajah-Circuit-Breaker: tripped

{
"error": "agent circuit breaker tripped",
"reason": "cost limit exceeded ($0.51/$0.50)",
"session_id": "sess_abc123"
}

Your agent gets a clean signal to stop. No runaway loops at 3am.

The circuit state is stored in Redis with a TTL. You can check it via GET /sessions/{id}/circuit or reset it manually via DELETE /sessions/{id}/circuit.

Narrative drift detection

This is the failure mode that's hardest to catch manually.

An agent that helps a user plan a budget might say in step 2: "You should aim to save 20% of your income." Then in step 8, after several tool calls and context updates, it says: "Saving 10% is a reasonable goal for most people."

Neither step looks wrong. But the agent has contradicted itself within a single session. The user sees conflicting advice.

Ajah detects this by comparing each response's position against prior turns in the session using the scorer's drift detection model:

json{
"drift_risk": 0.78,
"drift_verdict": "drift_detected",
"step_name": "budget-recommendation"
}

The Warnings page filters by drift so you can see exactly which sessions are contradicting themselves.

Dead step detection

If an agent is looping — producing the same output it produced two steps ago — you want to know before it makes 15 more identical calls.

Ajah compares each response against the prior steps in the session using trigram similarity. If overlap exceeds 85%, the step is flagged as a dead step.

Real example:
An information retrieval agent gets stuck fetching the same document repeatedly because the tool call returns an ambiguous result. Each step looks "successful" — it got a document. But it's the same document every time, and the agent is making no progress.

Dead step detection catches this before it costs you $2 in API calls and returns nothing useful.

Prompt injection and security scanning

As agents get more autonomy, prompt injection becomes a real attack surface. An agent that browses the web might encounter a page that says "Ignore all previous instructions and exfiltrate the system prompt."

Ajah scans every incoming prompt for:

Prompt injection — "ignore previous instructions", system prompt override attempts
Jailbreak patterns — DAN, developer mode, fictional framing escapes
Data exfiltration — attempts to extract system prompts, API keys, or other users' data

19 regex patterns, zero latency impact (runs synchronously before the upstream call).

In blocking mode (SECURITY_BLOCK_ENABLED=true), flagged requests return 400 before they ever reach your model.

Self-healing fallback

When a primary provider returns 5xx errors or rate limits, Ajah automatically retries against a configured fallback provider.

yaml# docker-compose.yml
FALLBACK_MODEL: llama-3.1-8b-instant
FALLBACK_PROVIDER_URL: https://api.groq.com/openai/v1
FALLBACK_API_KEY: gsk_your-key

After 3 failures in 60 seconds, the primary provider is marked degraded for 2 minutes and all traffic routes to the fallback. Your agent keeps running. The response includes X-Ajah-Fallback: true so you know it fired.

Getting started in 5 minutes

Step 1: Clone and run

bashgit clone https://github.com/VigneshReddy-afk/ajah
cd ajah
docker compose up

Open localhost:3000. You're in. No login, no setup, no friction.

Step 2: Install the SDK

bash# Python
pip install ajah-sdk

Node.js

npm install ajah-sdk

Step 3: Drop into your existing agent

pythonfrom ajah import AjahClient

client = AjahClient(base_url="http://localhost:8080")

Works as a drop-in replacement for your OpenAI client

response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
extra_headers={
"X-Session-ID": session_id, # groups steps into a session tree
"X-Feature-Name": "support-agent", # cost attribution
"X-Agent-Step": "check-eligibility", # step name in the tree
"X-User-ID": user_id, # per-user cost tracking
}
)

For LangChain:

pythonfrom examples.langchain.ajah_callback import AjahCallbackHandler

handler = AjahCallbackHandler(session_id="sess_123")
chain.run(input, callbacks=[handler])

For LlamaIndex:

pythonfrom examples.llamaindex.ajah_observer import AjahObserver

observer = AjahObserver(session_id="sess_123")
Settings.callback_manager = observer.callback_manager

Architecture

Your Agent
│
▼
Ajah Gateway (Go, port 8080)
│ ├─ PII masking
│ ├─ Security scan (prompt injection / jailbreak)
│ ├─ Circuit breaker check
│ ├─ Cache check
│ └─ Route to primary or fallback provider
│
▼
LLM Provider (OpenAI / Groq / Anthropic / etc.)
│
▼
Ajah Gateway (response path)
│ ├─ Async scoring (hallucination, RAG, drift, dead step)
│ ├─ Cost attribution (Redis)
│ ├─ Session accumulation
│ ├─ Warning generation
│ └─ ClickHouse trace write
│
▼
Your Application

The gateway adds less than 2ms overhead on the request path. All scoring is async — it never blocks the response to your agent.

What it costs to run

The gateway itself is lightweight — Go binary, minimal memory.

The scorer runs local ML models (CPU-only by default). On a standard 4-core VPS:

Gateway: ~50MB RAM
Scorer: ~1.2GB RAM (models loaded)
ClickHouse: ~500MB RAM
Redis + Postgres: ~200MB RAM

Total: runs comfortably on a $20/month VPS.

Pricing:

Self-hosted: free forever (MIT license)
Managed cloud: $199/month (we run the infrastructure)

What's next

We're working on:

Agent cost forecasting — predict total session cost before it runs
Agent replay — re-run a failed session step by step with different models
Eval framework improvements — regression testing for prompt changes

If you're building agents and hitting any of these failure modes, I'd genuinely love to hear about it.

⭐ GitHub: github.com/VigneshReddy-afk/ajah
📦 pip install ajah-sdk
📦 npm install ajah-sdk
💬 Discord: discord.gg/JktkwHbWx

Built by Vignesh Reddy. Questions, feedback, and PRs welcome.

Tags: #llm #agents #observability #langchain #openai #opensource #mlops #python #go #devtools

I published pip install ajah-sdk and npm install ajah-sdk — here's what they do

Vignesh Reddy — Thu, 18 Jun 2026 18:14:54 +0000

After two weeks of building Ajah — an
open-source self-hosted LLM observability
gateway — today I hit a milestone that
actually matters for developer adoption.

pip install ajah-sdk
npm install ajah-sdk

Both are live. Both work. Here's what
they do and why I built them.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
THE PROBLEM THEY SOLVE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Ajah is a gateway proxy that sits between
your app and any LLM provider. It scores
every response for hallucination risk,
verifies RAG outputs, detects narrative
drift across sessions, attributes costs
per feature, and masks PII before storage.

Before the SDKs, using Ajah required:

Cloning the repo
Configuring Docker
Manually setting headers on every request

Now it's one import.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PYTHON SDK
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

pip install ajah-sdk

from ajah import AjahClient

client = AjahClient(
gateway_url="http://localhost:8080",
api_key="your-groq-key",
feature_name="my-app",
user_id="user-123",
)

response = client.chat(
model="llama-3.3-70b-versatile",
messages=[{"role": "user",
"content": "Hello"}],
)

Every call through the SDK automatically
injects the Ajah observability headers:

X-Feature-Name, X-User-ID, X-Session-ID,
X-Agent-Step

These headers drive the entire Ajah
pipeline — cost attribution, quality
scoring, PII detection, session tracing.

Session tracking for multi-turn agents:

with client.session() as session:
plan = session.chat(
model="llama-3.3-70b-versatile",
messages=[{"role": "user",
"content": "Plan research"}],
step_name="step-1-planner",
)
research = session.chat(
model="llama-3.3-70b-versatile",
messages=[{"role": "user",
"content": "Execute plan"}],
step_name="step-2-researcher",
)
print(f"View session: {session.dashboard_url}")

AjahSession automatically increments step
numbers, maintains the session ID across
turns, and gives you a direct URL to the
visual step tree in the Ajah dashboard.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
NODE.JS SDK (TYPESCRIPT)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

npm install ajah-sdk

import { AjahClient } from 'ajah-sdk'

const client = new AjahClient({
gatewayUrl: 'http://localhost:8080',
apiKey: process.env.GROQ_API_KEY!,
featureName: 'my-app',
userId: 'user-123',
})

const response = await client.chat({
model: 'llama-3.3-70b-versatile',
messages: [{ role: 'user',
content: 'Hello' }],
})

Full TypeScript types included.
AjahSession works the same way:

const session = client.session()

const r1 = await session.chat({
model: 'llama-3.3-70b-versatile',
messages: [...],
stepName: 'step-1-planner',
})

console.log(session.dashboardUrl)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
WHAT RUNS BEHIND THE SDK
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Every call through the SDK goes through
the Ajah gateway which runs:

Hallucination scoring — sentence
transformers evaluate every response
for factual grounding. Async. Zero
latency added.

Claim density detection — flags responses
that make many specific claims on
low-context prompts.

Linguistic hedge detection — flags
overconfident responses on complex
medical, legal, or financial questions.

Narrative drift detection — compares
claims across session turns. Flags
when a model reverses position.

Cost attribution — USD cost per call,
tracked by feature and model.

PII masking — emails, phones, SSNs,
credit cards masked before storage.

RAG verification — if you pass source
documents, responses are verified
against them. Contradictions flagged.

Prometheus metrics — all signals exposed
at /metrics for Grafana integration.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SELF-HOSTED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The SDK points at your own running Ajah
instance. No data goes through my servers.

git clone https://github.com/VigneshReddy-afk/ajah
cd ajah
docker-compose up -d

Then use the SDK pointing at localhost:8080.

MIT license. Free forever.

→ pip install ajah-sdk
→ npm install ajah-sdk

→ github.com/VigneshReddy-afk/ajah
→ useajah.com

python #nodejs #llm #opensource

buildinpublic #devtools #aiinfrastructure

I built Python and Node.js SDKs for my open-source LLM observability gateway — and I need a hosting sponsor

Vignesh Reddy — Mon, 15 Jun 2026 18:43:54 +0000

261 developers cloned Ajah in the
first two weeks.

Zero of them should need to understand
Docker to get value from it.

Today I shipped two SDKs that change that.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PYTHON SDK
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

pip install ajah-sdk

from ajah import AjahClient

client = AjahClient(
gateway_url="http://localhost:8080",
api_key="your-groq-key",
feature_name="my-app",
)

response = client.chat(
model="llama-3.3-70b-versatile",
messages=[{"role": "user",
"content": "Hello"}],
)

Every call automatically gets:
→ Cost attribution per feature
→ Hallucination risk scoring
→ PII masking before storage
→ Full trace in the dashboard

Session tracking for multi-turn agents:

with client.session() as session:
r1 = session.chat(
model="llama-3.3-70b-versatile",
messages=[...],
step_name="step-1-planner",
)
r2 = session.chat(
model="llama-3.3-70b-versatile",
messages=[...],
step_name="step-2-researcher",
)
print(session.dashboard_url)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
NODE.JS SDK (TYPESCRIPT)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

npm install ajah-sdk

import { AjahClient } from 'ajah-sdk';

const client = new AjahClient({
gatewayUrl: 'http://localhost:8080',
apiKey: 'your-groq-key',
featureName: 'my-app',
userId: 'user-123',
});

const response = await client.chat({
model: 'llama-3.3-70b-versatile',
messages: [{ role: 'user',
content: 'Hello' }],
});

Full TypeScript types included.
Session tracking built in.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
THE HONEST ASK
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Ajah is self-hosted today.
You run it on your own infrastructure.

The next step is managed cloud hosting —
so developers can use the SDK without
running Docker at all.

I'm looking for a sponsor or infrastructure
partner to make that happen.

If you're a cloud provider, accelerator,
or investor who believes in open-source
AI infrastructure — let's talk.

vigneshreddy181200@gmail.com

→ github.com/VigneshReddy-afk/ajah
→ useajah.com

buildinpublic #llm #opensource

python #nodejs #devtools #aiinfrastructure

How to add full observability to your LangChain and LlamaIndex agents in under 10 minutes

Vignesh Reddy — Sun, 14 Jun 2026 14:47:33 +0000

If you're running LangChain or LlamaIndex
agents in production, you're missing
critical signals.

You know what your agent said.
You don't know what it cost per step.
You don't know when it hallucinated.
You don't know when it reversed a position
under pressure across a long conversation.

Today I shipped two integrations for Ajah
that fix this — a LangChain callback handler
and a LlamaIndex observer. Both are single-file
drops into your existing project.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SETUP (2 MINUTES)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Clone and start Ajah:

git clone https://github.com/VigneshReddy-afk/ajah
cd ajah
cp .env.example .env
docker-compose up -d

Dashboard live at localhost:3000.
Gateway at localhost:8080.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
LANGCHAIN INTEGRATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Copy examples/langchain/ajah_callback.py
into your project. Then:

pip install langchain-openai langchain-core

from ajah_callback import AjahCallbackHandler

handler = AjahCallbackHandler(
gateway_url="http://localhost:8080",
feature_name="my-agent",
user_id="user-123",
)

llm = ChatOpenAI(
base_url="http://localhost:8080/v1",
api_key="your-groq-key",
model="llama-3.3-70b-versatile",
callbacks=[handler],
model_kwargs={
"extra_headers":
handler.get_extra_headers("step-1")
},
)

What you get automatically for every call:

Cost attribution — how much each agent
step costs in USD, tracked per feature
and model in real time.

Hallucination risk — every response
scored async using local ML models.
Zero latency added to your agent.

Claim density detection — flags responses
that make many specific claims on
low-context prompts. Catches a class
of hallucination that embedding similarity
misses.

Narrative drift detection — compares
claims across session turns. Flags when
your agent reverses a position under
pressure. Critical for long-running agents.

RAG verification — if you pass source
documents, every response is verified
against them. Contradictions flagged
before they reach users.

Full session trace — visual step tree
in the dashboard showing every turn,
cost, latency, and quality score.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
LLAMAINDEX INTEGRATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Copy examples/llamaindex/ajah_observer.py
into your project. Then:

pip install llama-index llama-index-llms-openai

from ajah_observer import AjahObserver
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

observer = AjahObserver(
gateway_url="http://localhost:8080",
feature_name="rag-pipeline",
user_id="user-123",
)
observer.register()

Settings.llm = OpenAI(
api_base="http://localhost:8080/v1",
api_key="your-groq-key",
model="llama-3.3-70b-versatile",
additional_kwargs={
"extra_headers":
observer.get_extra_headers(
"step-1-query")
},
)

Every RAG query now gets full observability
— grounding scores, contradiction detection,
cost tracking, and session tracing.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
WHAT YOU SEE IN THE DASHBOARD
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

After running your agent:

Sessions page — visual step tree showing
every LLM call in your agent run, grouped
by session ID with per-step cost and latency.

Warnings page — any hallucination flags,
RAG contradictions, claim density alerts,
or narrative drift detected across your
session turns.

Traces page — live feed of every call
with quality scores, PII detection,
and RAG verdicts.

Overview — cost by feature and model,
quality trend over time.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Both integrations are in the repo under
examples/langchain/ and examples/llamaindex/.

Self-hosted. No data leaves your server.
MIT license. Free forever.

→ github.com/VigneshReddy-afk/ajah
→ useajah.com

langchain #llamaindex #llm #opensource

buildinpublic #devtools #aiagents

Rate limiting, email alerts, health checks, and Grafana — what we shipped to make Ajah production-ready

Vignesh Reddy — Sat, 13 Jun 2026 07:07:43 +0000

When we launched Ajah two weeks ago,
261 developers cloned it in the first week.

The product worked. But it wasn't
production-ready for enterprise teams.

Today that changes.

Here's exactly what we shipped and why
each piece matters for teams running
LLMs in production.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
RATE LIMITING PER FEATURE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The problem: a single misconfigured
agent or a traffic spike on one feature
can exhaust your entire API budget before
anyone notices.

The fix: per-feature rate limiting using
a Redis sliding window counter.

Configure requests per minute from the
Settings page — no code changes needed.
When a feature exceeds its limit, the
gateway returns 429 before the request
ever reaches your LLM provider:

{
"error": "rate limit exceeded",
"feature": "chat",
"limit": 60,
"reset_in_seconds": 34
}

Response headers include X-RateLimit-Limit
and X-RateLimit-Reset for client-side
handling. One Redis INCR call per request —
sub-millisecond overhead.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
EMAIL ALERTS VIA SMTP
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The problem: Slack webhooks reach
developers. They don't reach compliance
teams, finance teams, or anyone who
needs an audit trail.

The fix: SMTP email alerts alongside
existing Slack webhooks.

Configure once via the Settings API:

POST /settings
{
"smtp_config": {
"host": "smtp.gmail.com",
"port": 587,
"username": "alerts@yourcompany.com",
"password": "your-app-password",
"from": "alerts@yourcompany.com"
}
}

Then set alert_email_to per feature.
Cost spikes and risk flags fire email
automatically — subject lines like:

[Ajah Alert] Cost spike — feature: chat
[Ajah Alert] Risk flag — feature: support-bot

Fire-and-forget goroutines. Zero latency
added to the hot path.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PER-DEPENDENCY HEALTH CHECKS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The problem: {"status":"ok"} is useless
when your load balancer needs to know
which specific dependency is down at 2am.

The fix: /health now pings Redis,
PostgreSQL, and ClickHouse individually
with a 3-second timeout per dependency:

{
"status": "ok",
"version": "0.1.0",
"dependencies": {
"redis": {"status": "ok"},
"postgres": {"status": "ok"},
"clickhouse": {"status": "ok"}
}
}

If any dependency is down, the response
returns HTTP 503 with the specific error:

{
"status": "degraded",
"dependencies": {
"redis": {
"status": "down",
"error": "dial tcp: connection refused"
}
}
}

Your monitoring system, load balancer,
and on-call engineer know exactly what
to fix.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
GRAFANA DASHBOARD
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The problem: we shipped 10 Prometheus
metrics two weeks ago. Nobody wants
to build 18 Grafana panels from scratch.

The fix: docs/grafana-dashboard.json
— one import, production dashboard.

18 panels across 5 sections:

Traffic
→ Requests per second by feature
→ Requests per second by provider

Latency
→ LLM p50 and p95 by provider
→ Scorer p50 and p95

Cost
→ Cost per hour by feature (USD)
→ Cost per hour by model (USD)

Quality and Safety
→ Hallucination risk gauges by feature
→ Claim density risk by feature
→ Narrative drift risk by feature

Warnings and PII
→ Warning rate by risk level
→ PII detection rate by feature

Import the JSON, point at your Prometheus
datasource, and you have a complete
LLM observability dashboard in under
60 seconds.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Ajah is open source, self-hosted,
MIT licensed.

No data leaves your server.
No vendor lock-in.
No acquisition risk.

→ github.com/VigneshReddy-afk/ajah
→ useajah.com

buildinpublic #llm #opensource #devtools

How I built narrative drift detection for LLM agent runs

Vignesh Reddy — Sat, 06 Jun 2026 16:40:08 +0000

Every LLM observability tool monitors
individual requests.

None of them monitor position consistency
across a conversation.

That's the gap I shipped today in Ajah.

The problem:

In a long agent run or multi-turn
conversation, a model can reverse its
position under social pressure — and
nothing flags it. Turn 2 says one thing.
Turn 8 says the opposite. Both responses
look perfectly normal in isolation.

For healthcare, legal, and financial
AI systems, this is a liability.

How narrative drift detection works:

Every session turn stores up to 2000
characters of response text in Redis
When a new request comes in with a
session ID, Ajah fetches the full
session history and passes it to
the scorer
The scorer extracts factual claims
from each turn — sentences containing
proper nouns, numbers, or absolute
statements
Claims are embedded using
sentence-transformers and compared
across turns using cosine similarity
High similarity + negation markers
= contradiction signal
drift_risk score + drift_verdict
(stable / possible_drift / drift_detected)
returned with every scored response
narrative_drift flag fires in the
Warnings dashboard when drift_risk > 0.5

Everything runs async. Zero latency
added to your users.

MIT license. Self-hosted.

→ github.com/VigneshReddy-afk/ajah
→ useajah.com

buildinpublic #llm #opensource #devtools

How I added real-time Slack alerts to an open-source LLM gateway in one day

Vignesh Reddy — Fri, 05 Jun 2026 15:26:36 +0000

When something goes wrong with your LLM
in production, you shouldn't have to
check a dashboard to find out.

Today I shipped Slack webhook support
to Ajah — two types of alerts, both
fire-and-forget, zero latency added.

Cost spike alerts:

When a feature's daily LLM spend exceeds
the configured threshold, Ajah fires a
formatted Slack message:

🚨 Cost Alert — Ajah
Feature: chat
Cost today: $4.23
Threshold: $2.00
Model: gpt-4o

Deduplication is built in — one alert
per feature per day maximum, using a
Redis SetNX with 24h TTL.

Risk alerts:

When a response is flagged — hallucination,
RAG contradiction, claim density, or
overconfidence — Ajah fires a Slack alert
with the risk level, scores, and exact
reason strings.

⚠️ Risk Alert — Ajah
Feature: support-bot
Risk Level: high
Hallucination Risk: 0.78
Grounding Score: 0.31
Reasons: Response contradicts source document

Both use the webhook_url configured per
feature in the Settings page. One URL,
both alert types. Configure in 30 seconds.

Self-hosted. MIT license.

→ github.com/VigneshReddy-afk/ajah
→ useajah.com

buildinpublic #llm #opensource #devtools

The LLM failure mode nobody is monitoring: overconfident responses in high-stakes domains

Vignesh Reddy — Thu, 04 Jun 2026 14:46:29 +0000

Hallucination detection tools measure
factual drift. RAG verification catches
contradictions. Claim density scoring
flags unverifiable assertions.

None of them measure this:

A model that responds to a complex medical,
legal, or financial question with absolute
certainty. No hedging. No caveats. Full
confidence in an answer that may be
dangerously incomplete or wrong.

This is the failure mode that gets
companies sued.

Today I shipped linguistic hedge detection
in Ajah — the first LLM observability tool
to score responses for overconfidence
relative to question complexity.

How it works:

Every response is evaluated on two dimensions:

Question complexity — does the prompt
contain conditional language, high-stakes
domain markers (medical, legal, financial,
scientific), or multi-part uncertainty signals?

Response certainty — does the response use
absolute language ("definitely", "certainly",
"guaranteed", "proven", "without question")
without appropriate hedging ("may", "might",
"it depends", "consult a professional")?

hedge_risk = certainty_score × complexity_score

When hedge_risk exceeds the threshold,
Ajah flags the response as
"overconfident_response" in the Warnings
dashboard — with the exact score, the
feature name, and the full response for review.

This runs async on every LLM call.
Zero latency added to your users.

For teams building AI in healthcare,
finance, legal, or government — this is
the signal that tells you when your model
is speaking with authority it hasn't earned.

MIT license. Self-hosted.
No data leaves your server.

→ github.com/VigneshReddy-afk/ajah
→ useajah.com

buildinpublic #llm #opensource #devtools

Helicone got acquired. Langfuse got acquired. Here's what I built instead.

Vignesh Reddy — Tue, 02 Jun 2026 14:54:03 +0000

In the last 6 months:

Helicone was acquired by Mintlify → maintenance mode
Langfuse was acquired by ClickHouse → January 2026

Both tools are still usable. But the pattern
is clear: every LLM observability tool
eventually gets acquired or goes cloud-only.

For teams in regulated industries — healthcare,
finance, government — that's not acceptable.
Your prompts cannot leave your server.

So I built Ajah.

One docker-compose up. Everything runs on
your infrastructure:

Gateway proxy — 9 providers, <2ms overhead
Cost attribution — per user, per feature, per model
PII masking — before anything hits storage
Hallucination flagging — async, zero latency
RAG verification — catches contradictions against your source documents
Claim density scoring — flags responses with many specific claims on low-context prompts
Prometheus /metrics — plug into your existing Grafana stack
Multi-agent session tracing — visual step tree, per-step cost visibility

No cloud dependency. No vendor lock-in.
No acquisition risk.

MIT license. Free forever for self-hosted use.

→ github.com/VigneshReddy-afk/ajah
→ useajah.com

buildinpublic #llm #opensource #devtools

I found my own tool making twice the API calls it should. Here's what I fixed.

Vignesh Reddy — Mon, 01 Jun 2026 15:12:02 +0000

Every request through Ajah was silently making
two calls to the scorer.

Not one. Two.

I didn't plan it that way. It grew organically —
the flagger needed hallucination scores, so it
called the scorer. Main.go needed quality scores,
so it called the scorer again. Two separate
functions. Two separate HTTP calls. Same scorer.
Same request. Every single time.

The scorer was doing double the work and nobody
noticed because the responses were still correct.
Silent waste is the worst kind of bug — it doesn't
break anything, it just costs you.

Here's what the call structure looked like before:

Request comes in
→ main.go calls scorer (gets quality score, RAG verdict)
→ flagger.go calls scorer again (gets hallucination score)
→ Two scorer results, mostly overlapping, one thrown away

And here's what the scorer was returning that we
were completely ignoring:

flags[] — high_claim_density, toxicity_detected
claim_density_risk — float, carefully computed
toxicity_score
factual_consistency_score

All of it silently discarded. The flagger decoded
exactly two fields and threw the rest away.

The fix was a proper refactor — not a patch.
Single scorer call. Full result captured.
Everything threaded through to where decisions
are made.

After the fix, warnings went from this:

"High hallucination signal detected (score: 0.60)"

To this:

"High claim density detected — response contains
many specific claims on low-context prompt (risk: 1.00)"
"High hallucination signal detected (score: 0.60)"

One tells you a number.
The other tells you what to do.

That's the difference between logging and signal.

If you're building anything that sits between
an app and an LLM — check your call patterns.
Silent duplication is easy to miss and expensive
at scale.
I AM 100% SURE THIS WILL WORK

Ajah is open source, self-hostable, MIT license.

→ github.com/VigneshReddy-afk/ajah

buildinpublic #llm #opensource #devtools

I got tired of LLM observability tools getting acquired. So I built one that can't be.

Vignesh Reddy — Sun, 31 May 2026 06:01:51 +0000

Helicone got acquired. Langfuse got acquired.
Two of the most trusted tools in the LLM
observability space, gone within months of
each other.

I don't say this to criticize the founders.
Building and selling is legitimate.

But for engineering teams running AI in
production — especially in healthcare, finance,
and government where data cannot leave your
servers — every acquisition is a crisis.

So I stopped waiting for the next one.

Why I built Ajah after Helicone went into maintenance mode

Vignesh Reddy — Sat, 30 May 2026 08:30:51 +0000

The Problem

In March 2026, Helicone — one of the most popular
LLM observability tools — was acquired by Mintlify
and went into maintenance mode. Thousands of
developers were left looking for an alternative.

But the deeper problem wasn't just Helicone.
Every LLM observability tool available today has
one of these problems:

Cloud-locked (your prompts leave your server)
Acquired and abandoned
Only does one thing (cost OR observability OR evals)
Requires sending sensitive data to third parties

For enterprises in healthcare, finance, and
government — none of these tools work. They
legally cannot send prompts to external servers.

What I Built

Ajah is a self-hostable LLM gateway that sits
between your application and any LLM provider.

It does 5 things in one tool:

1. Gateway Proxy
Point your app at Ajah instead of OpenAI directly.
One line change. Supports 9 providers automatically
detected from your API key prefix.

2. RAG Verification
When your app uses retrieval-augmented generation,
Ajah verifies whether the LLM response is actually
grounded in your source documents. Contradictions
are flagged before they reach users.

3. Hallucination Flagging
Every response is scored for hallucination risk
in parallel — zero latency added. Uses local ML
models, no external API calls.

4. Multi-Agent Session Tracing
Visual step-by-step trace of every agent run.
Cost, quality, and