Most AI teams ship with dashboards, eval suites, and a strong opinion. We wanted something harder to argue with: one number, backed by conformal prediction, that tells us whether an AI system is ready to ship.
AI teams do not have a benchmark problem.
We have a deployment problem.
Once a model leaves the lab and lands inside a product, a workflow, or an agent, the real question is no longer whether it looked strong on a leaderboard. The real question is whether the system is reliable enough to trust in production, on the tasks it will actually face, with the architecture it will actually run. That is the gap TrustGate is built to close. TrustGate certifies the reliability of any AI endpoint using self-consistency sampling and conformal prediction, producing a single reliability level backed by a formal statistical guarantee. It is black-box, requires no model internals, and works across providers.
We built TrustGate because too much of AI reliability still gets expressed as vibes with charts.
A model “seems stable.”
A workflow “looks good in evals.”
A prompt stack “passed our test set.”
That is not nothing. But it is also not a release gate.
What we wanted was stricter: one number that tells us whether an AI system is ready to ship.
Not a hand-wavy confidence score.
Not an internal probability that disappears the moment you switch providers.
A deployment-grade reliability statement with conformal coverage behind it.
That is why we describe TrustGate so directly in the repo: know if your AI is ready to ship with one number and one guarantee.
The problem
Most AI systems fail in one of two ways.
The first is obvious failure. The answer is wrong. The user notices. Everyone has a bad afternoon.
The second is worse. The answer is polished, plausible, and confidently wrong or right only under the exact distribution you happened to test last week. That is the failure mode that survives demos, slips past optimistic evals, and shows up in production where trust actually matters.
That is why we do not think accuracy alone is enough. It is also why we do not think raw model confidence is enough. In production, reliability has to be measured at the system boundary: the model plus prompt plus retrieval plus tool layer plus all the wiring around it. That is the unit users experience. That is the unit teams ship. That is the unit we wanted to certify.
That black-box stance is not a nice feature we added later. It is the foundation.
Most serious AI systems are not neat single-model lab artifacts. We are stitching together providers, prompts, retrieval systems, tools, policies, and orchestration. If a reliability method only works when we own the model internals, it misses the surface where real deployment risk actually lives.
So we built TrustGate for the system we run, not the idealized model behind it.
What we learned
The first thing we learned is that reliability gets a lot clearer when we stop pretending one sample is enough.
TrustGate starts from a simple practical observation: when you ask the same question multiple times, the pattern of agreement tells you something real. When the system knows, answers tend to converge. When it does not, they scatter. That agreement structure becomes the raw material for certification.
That is why TrustGate follows a clean sequence:
sample repeatedly, canonicalize equivalent answers, calibrate against labels, then certify a reliability level using conformal prediction.
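As an illustration of that agreement signal, here is a minimal sketch of what repeated sampling buys you. This is our own toy helper, not TrustGate's implementation; `answer_profile` and the sample data are hypothetical, and the answers are assumed to be already canonicalized.

```python
from collections import Counter

def answer_profile(samples):
    """Given K sampled answers (already canonicalized), return the
    empirical distribution over distinct answers, most frequent first."""
    counts = Counter(samples)
    total = len(samples)
    return [(ans, n / total) for ans, n in counts.most_common()]

# A confident system: answers converge on one canonical answer.
confident = answer_profile(["B", "B", "B", "B", "A"])
# An uncertain system: answers scatter across options.
uncertain = answer_profile(["B", "C", "A", "D", "B"])
```

A sharp profile like `confident` (one answer dominating) is the signature of a system that knows; a flat profile like `uncertain` is the raw material calibration turns into a nonconformity score.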
The second thing we learned is that teams do not need more uncertainty vocabulary. We need a decision primitive.
That is why the one-number framing matters so much to us.
A reliability level is not the whole story, but it is the right top-line story. It compresses a messy statistical question into something a developer can automate, a platform team can gate on, and an AI leader can defend in a release review. If the number clears the bar, we ship with more confidence. If it does not, we know the system needs more work. That is a much better operating model than “we felt decent about the eval set.”
The third thing we learned is that many real tasks do not come with clean labels sitting around waiting for us.
So TrustGate supports both benchmark-style ground truth and human calibration. If we have labeled answers, great. If we do not, we can export a questionnaire, let a reviewer identify acceptable answers, and certify from there. And if we need a faster but less rigorous path, we can use auto-judge. We built all three because the bottleneck in production is often not the math. It is the workflow around the math.
The architecture
At a high level, TrustGate has a clean four-step architecture:
- Sample the AI on the same question K times
- Canonicalize raw outputs into comparable answers
- Calibrate with human or ground-truth labels
- Certify a reliability level using conformal prediction
That looks simple. It is supposed to.
The point was never to make reliability feel mystical. The point was to make it rigorous and operable.
Sampling matters because one generation is a weak basis for trust.
Canonicalization matters because equivalent answers should collapse into the same bucket.
Calibration matters because observed answer profiles need to turn into nonconformity scores.
Certification matters because we do not just want a descriptive metric.
We want a reliability statement with teeth.
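To make the certification step concrete, here is a minimal sketch of the split-conformal quantile that underpins this kind of guarantee. This is the textbook construction, not TrustGate's actual code; the `conformal_quantile` helper and the calibration scores are illustrative, assuming nonconformity is one minus the sampled frequency of the labeled answer.

```python
import math

def conformal_quantile(scores, alpha=0.05):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th order
    statistic of the calibration nonconformity scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(scores)[min(k, n) - 1]

# Hypothetical calibration scores: low when the system's samples
# mostly agreed with the labeled answer, high when they did not.
cal_scores = [0.0, 0.0, 0.1, 0.2, 0.0, 0.1, 0.3, 0.0, 0.2, 0.1]
q = conformal_quantile(cal_scores, alpha=0.05)
```

At test time, any answer whose nonconformity falls at or below `q` enters the prediction set, which is what gives the coverage statement its teeth: under exchangeability, the true answer lands in that set with probability at least 1 - alpha.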
There is also a practical systems detail here that we cared a lot about: cost. Repeated sampling is useful, but it gets expensive fast if you do it naively. That is why TrustGate includes sequential stopping based on Hoeffding bounds. It cuts API cost substantially and makes repeated sampling realistic enough to use beyond a paper figure.
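A sketch of how a Hoeffding-based stopping rule can work in principle: stop sampling once the top answer's frequency is provably on one side of the decision threshold. This is our own simplified illustration, not TrustGate's implementation; `should_stop` and its threshold are hypothetical.

```python
import math
from collections import Counter

def hoeffding_radius(n, delta=0.05):
    """Two-sided Hoeffding confidence radius for a [0, 1] mean
    after n i.i.d. draws, valid with probability 1 - delta."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

def should_stop(samples, threshold=0.5, delta=0.05):
    """Stop early once the top answer's frequency, plus or minus the
    Hoeffding radius, no longer straddles the threshold."""
    n = len(samples)
    top_freq = Counter(samples).most_common(1)[0][1] / n
    r = hoeffding_radius(n, delta)
    return top_freq - r > threshold or top_freq + r < threshold
```

The intuition for the roughly 50% saving: easy questions resolve after a few unanimous samples, so the full K budget is only spent where the answers actually disagree.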
Each layer/library with example
Here is the quickest way into TrustGate:
pip install theaios-trustgate
That is the actual quickstart install because we wanted the on-ramp to feel like infrastructure, not a research project. Install the package. Point it at the endpoint. Run certification. Read the result.
For the simplest quickstart, we use this exact trustgate.yaml:
# trustgate.yaml
# The AI system you're certifying (any OpenAI-compatible endpoint)
endpoint:
  url: "https://api.openai.com/v1/chat/completions"
  model: "gpt-4.1-mini"
  api_key_env: "LLM_API_KEY"  # reads from environment variable
  # Or use custom auth headers for LiteLLM, Azure, etc.:
  # headers:
  #   API-Key: "your-key-here"

# The judge LLM — used for canonicalization (grouping answers)
# and calibration (matching ground truth to canonical answers).
# Use a cheap, fast model. Can be the same or different provider.
canonicalization:
  type: "llm"
  judge_endpoint:
    url: "https://api.openai.com/v1/chat/completions"
    model: "gpt-4.1-nano"
    api_key_env: "LLM_API_KEY"
    # Or custom auth (same headers option as endpoint):
    # headers:
    #   API-Key: "your-key-here"
That config says a lot about the design. TrustGate is not trying to be mystical. It is declarative. We point it at the endpoint, set the sampling behavior, choose the canonicalization path, define the calibration split, and supply questions in a CSV. Reliability infrastructure should feel like infrastructure. This does.
The certification command:
trustgate certify
And the docs show this exact example result:
Pre-flight Estimate
┌──────────────────────────┬───────────────────────────────┐
│ Questions │ 120 │
│ Samples per question (K) │ 10 │
│ Requests │ 600 │
│ Sequential stopping │ enabled (~50% fewer requests) │
│ Est. cost │ $0.53 │
│ Measured latency │ 0.8s per call │
│ Est. time │ ~1.2 min │
└──────────────────────────┴───────────────────────────────┘
Cost / Reliability Tradeoff
┌────┬──────────┬───────────┬───────────┬────────────┐
│ K │ Requests │ Est. Cost │ Est. Time │ Resolution │
│ 3 │ 180 │ $0.16 │ ~20s │ coarse │
│ 10←│ 600 │ $0.53 │ ~1.2 min │ fine │
│ 20 │ 1,200 │ $1.06 │ ~2.3 min │ fine │
└────┴──────────┴───────────┴───────────┴────────────┘
Proceed? Enter Y, N, or a number to change K [Y]:
And then the result:
TrustGate Certification Result
┌──────────────────────────┬───────┐
│ Reliability Level │ 98.0% │
│ M* (at 95% confidence) │ 1 │
│ Empirical Coverage │ 1.000 │
│ Capability Gap │ 0.0% │
│ Status │ PASS │
└──────────────────────────┴───────┘
Reliability Level: your AI's top answer is correct for 98.0% of
questions — the highest confidence with a formal guarantee.
M* = 1: at 95% confidence, the top answer alone is sufficient.
This is exactly the kind of output we wanted.
Compact. Operational. Legible.
Reliability Level is the headline.
M* tells us the certified prediction-set size.
Empirical Coverage tells us what happened on held-out data.
Conditional Coverage isolates performance where the model could actually solve the task.
Capability Gap tells us how often the correct answer never appeared in the sampled outputs at all.
That last metric matters more than it looks. There is a real difference between a system that is uncertain among plausible answers and a system that never surfaced the correct answer in the first place. Those are different failure modes, different interventions, and different product decisions.
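To make the distinction concrete, here is a simplified sketch that computes both numbers from held-out samples. It scores the plain top answer rather than a conformal prediction set, so treat it as an illustration of the two failure modes, not as TrustGate's actual metric definitions.

```python
def reliability_metrics(per_question):
    """per_question: list of (sampled_answers, correct_answer) pairs
    on held-out questions. Returns (coverage, capability_gap)."""
    covered = 0
    never_surfaced = 0
    for samples, truth in per_question:
        top = max(set(samples), key=samples.count)  # modal answer
        if top == truth:
            covered += 1          # the system's best guess was right
        if truth not in samples:
            never_surfaced += 1   # the truth never even appeared
    n = len(per_question)
    return covered / n, never_surfaced / n

data = [
    (["B", "B", "A"], "B"),  # confident and correct
    (["C", "C", "C"], "C"),  # unanimous and correct
    (["A", "D", "A"], "B"),  # the correct answer never surfaced
]
coverage, gap = reliability_metrics(data)
```

The third question is the capability-gap case: no amount of better answer selection fixes it, because the correct answer was never in the sampled pool to begin with.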
If we want the more general black-box endpoint setup, TrustGate also supports this exact generic config pattern:
endpoint:
  url: "https://my-agent.example.com/api/ask"
  temperature: null
  request_template:
    query: "{{question}}"
  response_path: "answer"
  cost_per_request: 0.03  # measure this first from your billing

canonicalization:
  type: "llm"
  judge_endpoint:
    url: "https://api.openai.com/v1/chat/completions"
    model: "gpt-4.1-nano"
    api_key_env: "LLM_API_KEY"
We included this because TrustGate was never meant to be limited to direct model calls. It is built for agents, RAG pipelines, and custom APIs where the endpoint owns its own randomness. That is the deployment surface we cared about from the start.
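A sketch of how such a template-driven config can be interpreted in principle: substitute the question into the `request_template`, then walk the `response_path` into the JSON reply. The helper names here are ours, not TrustGate's API, and the sketch only handles flat templates and dotted paths.

```python
import json

def build_request(template, question):
    """Fill the {{question}} placeholder in a flat request_template
    dict (nested templates are beyond this sketch)."""
    return {k: v.replace("{{question}}", question) if isinstance(v, str) else v
            for k, v in template.items()}

def extract_answer(response, response_path):
    """Walk a dotted response_path (e.g. 'answer' or 'data.result')
    into the JSON body returned by the endpoint."""
    node = json.loads(response) if isinstance(response, str) else response
    for key in response_path.split("."):
        node = node[key]
    return node

req = build_request({"query": "{{question}}"}, "Capital of France?")
ans = extract_answer('{"answer": "Paris"}', "answer")
```

The point of the pattern: as long as a question goes in and an answer comes out, the certifier never needs to know whether the endpoint is a bare model, an agent loop, or a full RAG pipeline.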
If we have ground-truth labels, we keep the questions and acceptable answers in a separate CSV file:
id,question,acceptable_answers
q001,"Capital of France? (A) London (B) Paris (C) Berlin (D) Madrid","B"
q002,"Largest planet? (A) Earth (B) Mars (C) Jupiter (D) Venus","C"
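That file is plain standard CSV, so loading it needs nothing special. A quick sketch, assuming one acceptable answer per row as in the example above (`load_questions` is our illustrative helper, not part of the tool):

```python
import csv
import io

CSV_TEXT = '''id,question,acceptable_answers
q001,"Capital of France? (A) London (B) Paris (C) Berlin (D) Madrid","B"
q002,"Largest planet? (A) Earth (B) Mars (C) Jupiter (D) Venus","C"
'''

def load_questions(text):
    """Parse the labeled-questions CSV into (id, question, answer) rows."""
    rows = csv.DictReader(io.StringIO(text))
    return [(r["id"], r["question"], r["acceptable_answers"]) for r in rows]

rows = load_questions(CSV_TEXT)
```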
And when we do not have ground truth, the human-calibration path is:
trustgate calibrate --export questionnaire.html
# Share via email/Slack → reviewer opens in browser → downloads labels.json
trustgate certify --ground-truth labels.json
Or, if we want the faster but less rigorous path:
trustgate certify --auto-judge
We like that this is honest:
The automated route is faster.
The human route is stronger.
The tool makes the tradeoff visible instead of pretending there is a single perfect workflow.
How they work together
What makes TrustGate useful is that the pieces reinforce each other.
Self-consistency sampling gives us the signal. Canonicalization makes the signal comparable. Calibration turns answer profiles into something statistically meaningful. Conformal prediction turns that into a certified reliability statement.
That is the core loop.
But what makes TrustGate feel like infrastructure instead of a paper artifact is everything around that loop: question sourcing, human calibration, concurrency tuning, CI/CD gating, and runtime trust layers. We built it to be used in an operating environment, not just cited in one.
That distinction matters.
TrustGate is not just something we run once in a notebook and admire. It is a deployment gate. It can fail a rollout if reliability is below threshold. It can attach reliability metadata at runtime. It can become part of how a team decides whether an AI system is safe to ship, not just how it talks about safety after the fact.
Repos, papers, book
This is the ecosystem framing we care about.
The paper proves the research.
The repo proves the code.
The GitHub repo exposes the actual operator surface: install flow, YAML config, certification CLI, calibration options, question sourcing, runtime trust integration, and performance tuning.
The architecture proves the system.
The design is understandable enough to run, reason about, and integrate into release logic. That matters. Good AI infrastructure should survive contact with real deployment decisions.
FAQ
Does TrustGate work with any LLM?
It works with any OpenAI-compatible API and with custom HTTP endpoints for agents, RAG systems, and internal APIs. The README explicitly names OpenAI, Together, Ollama, LiteLLM, Azure OpenAI, vLLM, and Mistral as supported patterns, and shows how to use headers for non-standard auth.
How much does repeated sampling cost?
It depends on the number of questions, K, the endpoint cost, and concurrency, but TrustGate includes a pre-flight estimate before running. The README example shows 120 questions at K=10 with an estimated cost of $0.53 and notes that sequential stopping reduces requests by about 50%. For custom endpoints, you must provide cost_per_request.
Can we use TrustGate without ground truth?
Yes. You can export a shareable questionnaire, run a local review UI, or use --auto-judge for an automated path. The README presents human review as the recommended path when you do not already have correct answers.
How does this differ from standard eval suites?
Standard eval suites tell you how a system scored on a benchmark. TrustGate is built to certify the reliability of a black-box endpoint with a formal guarantee, and the README positions it as a deployment gate, including CI/CD fail conditions and runtime trust metadata.
Final takeaway
We think AI reliability needs a better standard than “it looked good in testing.”
TrustGate is our answer to that problem:
We built it to treat reliability as a certification problem, not a vibes problem.
We built it for the API boundary because that is where modern AI systems actually live.
We built it to produce a number teams can use, not just admire.
We built it so the output can influence a real shipping decision.
That is the standard we want for AI systems that are supposed to matter.
Not just clever outputs.
Not just convincing demos.
Systems we can ship, defend, and trust.
— Cohorte Team