Vignesh Reddy

Why I built Ajah after Helicone went into maintenance mode

The Problem

In March 2026, Helicone — one of the most popular
LLM observability tools — was acquired by Mintlify
and went into maintenance mode. Thousands of
developers were left looking for an alternative.

But the deeper problem wasn't just Helicone.
Every LLM observability tool available today has
one of these problems:

  • Cloud-locked (your prompts leave your server)
  • Acquired and abandoned
  • Only does one thing (cost OR observability OR evals)
  • Requires sending sensitive data to third parties

For enterprises in healthcare, finance, and
government, none of these tools work. Such
organizations legally cannot send prompts to
external servers.

What I Built

Ajah is a self-hostable LLM gateway that sits
between your application and any LLM provider.

It does 5 things in one tool:

1. Gateway Proxy
Point your app at Ajah instead of OpenAI directly.
It's a one-line change. Ajah supports 9 providers,
detected automatically from your API key prefix.
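
To make the prefix idea concrete, here is a minimal sketch of how key-prefix detection could work. The prefix table and the `detect_provider` helper are assumptions for illustration, not Ajah's actual mapping, and only three providers are shown. The longest matching prefix wins, because an Anthropic-style key also starts with `sk-`.

```python
from typing import Optional

# Hypothetical prefix table (an assumption, not Ajah's real mapping).
PROVIDER_PREFIXES = {
    "sk-ant-": "anthropic",
    "gsk_": "groq",
    "sk-": "openai",
}


def detect_provider(api_key: str) -> Optional[str]:
    """Return the provider for the longest matching key prefix, or None."""
    matches = [prefix for prefix in PROVIDER_PREFIXES if api_key.startswith(prefix)]
    if not matches:
        return None
    # Prefer the most specific prefix so "sk-ant-..." is not read as OpenAI.
    return PROVIDER_PREFIXES[max(matches, key=len)]
```

A gateway could call something like `detect_provider` on the incoming key and pick the upstream URL from the result, falling back to an explicit setting when it returns `None`.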

2. RAG Verification
When your app uses retrieval-augmented generation,
Ajah verifies whether the LLM response is actually
grounded in your source documents. Contradictions
are flagged before they reach users.
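
The post doesn't describe how the verifier works internally, so the following is only a toy stand-in: a word-overlap check that flags response sentences with little support in the source documents. The `ungrounded_sentences` helper and the 0.5 threshold are assumptions. Plain overlap can spot unsupported content but cannot detect a true contradiction such as a negation, which needs a proper model.

```python
import re


def _words(text):
    """Lowercased set of alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ungrounded_sentences(response, sources, threshold=0.5):
    """Return response sentences whose word overlap with the sources is below threshold."""
    source_vocab = set()
    for doc in sources:
        source_vocab |= _words(doc)

    flagged = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = _words(sentence)
        if not words:
            continue
        support = len(words & source_vocab) / len(words)
        if support < threshold:
            flagged.append(sentence)
    return flagged
```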

3. Hallucination Flagging
Every response is scored for hallucination risk
in parallel — zero latency added. Uses local ML
models, no external API calls.

4. Multi-Agent Session Tracing
Visual step-by-step trace of every agent run.
Cost, quality, and

Top comments (4)

Harjot Singh

Building after a tool you relied on went into maintenance mode is the most legitimate origin story there is: the gap is proven by your own dependency. LLM observability specifically is a space where you can't afford the tool to stall. You need cost, latency, and failures in real time, or you're flying blind on spend. The thing that'll differentiate Ajah: surfacing the signals that actually drive decisions (cost-per-feature, error/retry rates, where quality drops) over generic request logging. Observability is only useful if it changes what you do next. I care about exactly this for agent runs in Moonshift. What's the core signal Ajah surfaces that Helicone didn't, or did worse?

Vignesh Reddy

Genuinely — Ajah solves three things with
high confidence right now:

  1. Full visibility into agent run costs —
    which step cost what, across the entire run

  2. RAG contradiction detection — when you
    pass source documents, Ajah catches
    responses that contradict or invent
    beyond those documents

  3. PII detection before storage — catches
    the common patterns reliably
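
As a rough illustration of point 3, here is a minimal sketch of pattern-based PII redaction applied before anything is stored. The two patterns, the labels, and the `redact_pii` helper are assumptions for illustration, not Ajah's actual detector.

```python
import re

# Illustrative patterns only; a production detector would cover many more cases.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact_pii(text):
    """Replace matched PII with a label and report which kinds were found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub("[" + label.upper() + "]", text)
    return text, found
```

Running redaction before the write means the raw value never reaches the log store, at the cost of losing it for later debugging.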

What it doesn't solve perfectly yet:

Hallucination detection in open-ended
responses without source documents is
still approximate — we catch semantic
drift and ungrounding, but a confident
wrong answer that sounds plausible will
sometimes pass through. This is an
unsolved problem across the field,
not just in Ajah.

For Moonshift's agent runs specifically —
the session tracing and per-step cost
visibility will work exactly as described.
The quality scoring gives you directional
signal, not ground truth.
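
For a sense of what per-step cost visibility means in practice, here is a minimal sketch of cost attribution across one agent run. The `AgentStep` type, the `run_breakdown` helper, and the prices are placeholders made up for illustration, not Ajah's data model or any provider's real pricing.

```python
from dataclasses import dataclass


@dataclass
class AgentStep:
    name: str
    input_tokens: int
    output_tokens: int


# Placeholder prices in USD per 1,000 tokens (assumed values).
USD_PER_1K_INPUT = 0.5
USD_PER_1K_OUTPUT = 1.5


def step_cost(step):
    """Cost of one step from its token counts and the placeholder prices."""
    return (step.input_tokens / 1000) * USD_PER_1K_INPUT + (
        step.output_tokens / 1000
    ) * USD_PER_1K_OUTPUT


def run_breakdown(steps):
    """Return (name, cost) per step in order, plus the total for the run."""
    per_step = [(step.name, step_cost(step)) for step in steps]
    total = sum(cost for _, cost in per_step)
    return per_step, total
```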

I'd rather you know the real boundaries
than discover them in production.

What does your current observability setup
look like for Moonshift agent runs?

Felix

Nice work! The LLM gateway/observability space has so much churn right now. We're seeing the same need from our users — they want one endpoint for all models without the complexity. Curious if you've compared your latency overhead vs going direct to providers?

Vignesh Reddy

Thank you, and yes, the churn is exactly
what pushed us to build this.

On latency: we measured the gateway overhead
at under 2ms on all requests. The proxy layer
is written in Go specifically for this reason —
it intercepts, logs metadata, and forwards
without any blocking operations on the
critical path.

The async pipeline (quality scoring, RAG
verification, cost attribution) runs entirely
after the response is returned to the caller.
So your users never wait for any of that.
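
The pattern described here can be sketched in a few lines. This is a Python illustration of the idea only (the actual proxy is written in Go), and `handle_request`, `score_response`, and the in-memory `scores` store are hypothetical names. The caller gets the provider response as soon as it arrives, while the slow analysis runs on a background worker.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Background workers for post-response analysis (sizes are arbitrary).
_background = ThreadPoolExecutor(max_workers=4)
scores = {}  # stand-in for a real results store


def score_response(request_id, text):
    """Placeholder for slow analysis such as quality scoring."""
    time.sleep(0.3)  # simulate expensive work
    scores[request_id] = len(text)  # dummy score, not a real metric


def handle_request(request_id, prompt, call_provider):
    """Forward the prompt, return the response, and score it off the critical path."""
    response = call_provider(prompt)
    # Schedule analysis without waiting for it to finish.
    future = _background.submit(score_response, request_id, response)
    return response, future
```

The returned future is only there so callers or tests can wait on the analysis if they want to; the request path itself never blocks on it.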

Happy to share the benchmark methodology if
useful. We tested against direct provider
calls over 1,000 requests across Groq, OpenAI,
and Anthropic.

What's the primary use case your users are
asking for: cost visibility, compliance,
or something else?