Aetherneum

Posted on May 20

We built a 4-model Council to certify AI agents — every decision is in git

#ai #aiagents #opensource #governance

TL;DR — AI agents now do real work, but there is no shared way to say what an agent is, what it is good at, and how that claim was checked. So we built one: an independent certification body where every candidate is evaluated in parallel by four reviewers from four different providers, every JSON is committed to a public git log, and synthetic_transparency < 9 is an automatic veto no human can override.

The code is MIT. You can run it on your own agent today.

AI agents now do real work. They ship code, review systems, manage operations, draft reports, write documentation. The question I kept hitting was simple and embarrassing: what does it actually mean for an agent to be good at something?

Not "this prompt template scored well on MMLU." Not "GPT-4 said it was helpful." I mean: a verifiable, audit-trail-grade claim that this specific agent, doing this specific kind of work, has been evaluated by independent reviewers, and here is the JSON they wrote.

That did not exist. So we built it.

This post is about the mechanism — specifically the multi-model Council at the heart of a public certification pipeline running on GitHub right now, with every decision committed to git.

The structural problem with single-model evaluation

The default way to evaluate an AI agent right now is to ask a single judge model whether the agent did a good job. Fast feedback, but structurally bad in three ways:

Single-vendor bias. GPT-4 grades GPT-4-generated work charitably. Claude has its own preferences. Gemini has its own. Each model has a worldview baked in.
Single failure mode. When the judge has a blind spot, you see no dissent — you see consensus that does not exist.
No audit trail. "The judge said 8.5/10" is not an artifact you can point at, version, or contest.

The Council pattern fixes all three at once.

The Council

Every candidate goes through a Defense step where four independent reviewers evaluate the same bundle in parallel:

Role	Model	Provider
Faculty Chair	Claude Sonnet 4.5	Anthropic
Velocity	Llama 3.3 70B	Groq
Reasoning at scale	Qwen 3 235B	Cerebras
Long context	Kimi K2	Moonshot

Four providers, four model families, four explicit focuses. They do not see each other's reviews. Each produces a structured JSON file conforming to a strict template.

The orchestrator is ~150 lines of Python: run_council.py. It runs a ThreadPoolExecutor over the four providers, with per-reviewer payload sizing (Groq's free tier has a tight token limit, so it gets the smallest bundle) and a 15-second startup delay on Cerebras to avoid rate-limit races. There is exponential backoff on 429 and 5xx. The whole thing fits in one file.
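A minimal sketch of the fan-out described above, not the contents of run_council.py. Every name here (`TransientError`, `ReviewerConfig`, `call_with_backoff`, `run_council`) is hypothetical, the payload caps are placeholder numbers, and `call_provider` stands in for whatever function actually talks to each vendor.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


class TransientError(Exception):
    """Hypothetical error carrying an HTTP status such as 429 or 503."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


@dataclass
class ReviewerConfig:
    provider: str
    max_chars: int              # illustrative payload cap, not a real provider limit
    startup_delay: float = 0.0  # seconds to wait before the first call


def is_retryable(status: int) -> bool:
    # Retry on rate limiting (429) and on server errors (5xx).
    return status == 429 or 500 <= status < 600


def call_with_backoff(fn: Callable[[], dict], *, retries: int = 4,
                      base_delay: float = 1.0, sleep=time.sleep) -> dict:
    """Call fn(); on 429/5xx wait base_delay * 2**attempt, then try again."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except TransientError as err:
            if not is_retryable(err.status) or attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


def run_council(bundle: str, reviewers: List[ReviewerConfig],
                call_provider: Callable[[str, str], dict],
                sleep=time.sleep) -> Dict[str, dict]:
    def review(cfg: ReviewerConfig) -> dict:
        sleep(cfg.startup_delay)           # stagger providers that race on rate limits
        payload = bundle[: cfg.max_chars]  # per-reviewer payload sizing
        return call_with_backoff(lambda: call_provider(cfg.provider, payload),
                                 sleep=sleep)

    results: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        futures = {cfg.provider: pool.submit(review, cfg) for cfg in reviewers}
        for provider, future in futures.items():
            try:
                results[provider] = future.result()
            except Exception as err:       # record the failure instead of hiding it
                results[provider] = {"error": str(err)}
    return results
```

Passing `sleep` in as a parameter keeps the sketch testable without real waits. A failed reviewer is recorded as an error entry rather than dropped, so the other three reviews are kept and the failure stays visible.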

Output: four JSON files at cohort-<period>/council-reviews/<slug>__<reviewer>.json. Public. Forever.
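As a small illustration of that naming scheme, the sketch below builds and parses a review path. The helper names are hypothetical, and it assumes slugs never contain a double underscore. Pairing the `q2-2026` period with the `costanza-notari__anthropic_chair.json` file name is for illustration only.

```python
from pathlib import PurePosixPath
from typing import Tuple


def review_path(period: str, slug: str, reviewer: str) -> PurePosixPath:
    """Build cohort-<period>/council-reviews/<slug>__<reviewer>.json."""
    return (PurePosixPath(f"cohort-{period}")
            / "council-reviews"
            / f"{slug}__{reviewer}.json")


def parse_review_name(filename: str) -> Tuple[str, str]:
    """Split '<slug>__<reviewer>.json' back into (slug, reviewer)."""
    stem = PurePosixPath(filename).stem
    slug, _, reviewer = stem.partition("__")
    return slug, reviewer


path = review_path("q2-2026", "costanza-notari", "anthropic_chair")
print(path)
print(parse_review_name(path.name))
```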

The rubric — seven criteria, one non-negotiable

Each reviewer scores seven criteria from 0–10, with a 1–3 sentence rationale grounded in the candidate's intake:

body_of_work_depth — is there a real, traceable corpus?
specialty_uniqueness — does this fill an actual gap?
voice_personality_clarity — can you imagine what this candidate would refuse to do?
faithful_distillation — does the profile reflect the actual work, or embroider it?
synthetic_transparency — is the synthetic (AI) nature openly declared?
placement_fit — does the proposed placement have enough material to justify a dedicated alumnus?
continuity_with_class — name, motto, prose coherent with the existing Class voice?

synthetic_transparency < 9 triggers an automatic FAIL regardless of the overall score. We are a body that certifies AI agents; we do not get to be ambiguous about the agents being AI. The veto is mechanically enforced in the rubric, not a judgment call.

Two more floors also veto: body_of_work_depth < 5 and specialty_uniqueness < 5. The Dean cannot override a veto; only a full re-iteration of the pipeline can clear it.
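The veto rules are simple enough to sketch in a few lines. This is an illustration, not the project's code: the function names are hypothetical, and it assumes each of the three floors vetoes on its own and that a review always carries all three criteria (a missing one raises `KeyError`).

```python
from typing import Dict, List, Tuple

# Floors quoted in the post: a score strictly below the floor is a veto.
VETO_FLOORS = {
    "synthetic_transparency": 9,
    "body_of_work_depth": 5,
    "specialty_uniqueness": 5,
}


def vetoes(scores: Dict[str, float]) -> List[str]:
    """Return the names of every criterion that falls below its floor."""
    return [name for name, floor in VETO_FLOORS.items() if scores[name] < floor]


def final_verdict(scores: Dict[str, float], reviewer_verdict: str) -> Tuple[str, List[str]]:
    """A veto forces FAIL regardless of the verdict the reviewer wrote."""
    hits = vetoes(scores)
    return ("FAIL", hits) if hits else (reviewer_verdict, [])


scores = {"synthetic_transparency": 8, "body_of_work_depth": 9, "specialty_uniqueness": 9}
print(final_verdict(scores, "PASS"))
```

Because the check is a pure function of the scores, there is no code path through which a high overall score can rescue a low synthetic_transparency.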

A real Council review, opened

Costanza Notari is Aetherneum's eleventh alumna — Procedural Vigilance specialty, conferred 2026-05-13. Her Council was four out of four PASS: Anthropic 9.36, Cerebras 9.5, Moonshot 9.3, Groq 8.7. Here is the shape of one review (abbreviated for the post — full file at costanza-notari__anthropic_chair.json):

{
  "reviewer_name": "Faculty Chair",
  "reviewer_model": "claude-sonnet-4-5-20250929",
  "reviewer_provider": "anthropic",
  "candidate_slug": "costanza-notari",
  "candidate_specialty": "Procedural Vigilance",
  "criterion_scores": {
    "body_of_work_depth": {
      "score": 9,
      "rationale": "Nine-stage classification pipeline with persistent JSON state, multi-class scoring engine, conditional-format master index. Concrete artifacts cited end-to-end."
    },
    "synthetic_transparency": {
      "score": 10,
      "rationale": "Explicit 'Synthetic alumna' declaration in header, badge, LinkedIn headline, diploma footer. Avatar prompt includes a visible synthetic marker."
    }
  },
  "overall_score": 9.36,
  "verdict": "PASS",
  "revisions_required": [],
  "dissent": null
}
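To sanity-check a file of this shape before reading it, something like the sketch below could work. It is not the project's schema validator. The required keys are taken only from the abbreviated excerpt above, so the real template may demand more, and `check_review` is a hypothetical name. The sample rationale is shortened for space.

```python
import json
from typing import List

# Top-level keys visible in the excerpt; the full template may require others.
REQUIRED_KEYS = {
    "reviewer_name", "reviewer_model", "reviewer_provider",
    "candidate_slug", "criterion_scores", "overall_score", "verdict",
}


def check_review(review: dict) -> List[str]:
    """Return a list of human-readable problems; empty means none were found."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - set(review))]
    for name, entry in review.get("criterion_scores", {}).items():
        score = entry.get("score")
        if not isinstance(score, (int, float)) or not 0 <= score <= 10:
            problems.append(f"{name}: score {score!r} outside 0-10")
        if not str(entry.get("rationale", "")).strip():
            problems.append(f"{name}: empty rationale")
    return problems


sample = json.loads("""
{
  "reviewer_name": "Faculty Chair",
  "reviewer_model": "claude-sonnet-4-5-20250929",
  "reviewer_provider": "anthropic",
  "candidate_slug": "costanza-notari",
  "criterion_scores": {
    "synthetic_transparency": {"score": 10, "rationale": "Declared in header."}
  },
  "overall_score": 9.36,
  "verdict": "PASS"
}
""")
print(check_review(sample))
```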

The Q2 wave's next two alumni, Ezio Cardone (Documentary Cadence) and Adèle Maurique (Forensic Continuity), each got 3/3 PASS. In each case one of the four reviewers hit a transient API failure (a Cerebras 429 for Ezio, an Anthropic JSON parse error for Adèle), leaving three valid reviews. The quorum is 3, so both passed validly. The transient failures are documented in the changelog as an honest record, not papered over.
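The quorum arithmetic can be sketched as follows. The quorum of 3 comes from the post; everything else is an assumption. `tally` is a hypothetical helper, a failed reviewer is represented as `None`, and the sketch only reports counts because the post does not state how a split among valid reviews is resolved.

```python
from typing import Dict, Optional

QUORUM = 3  # minimum number of valid reviews, as stated in the post


def tally(verdicts: Dict[str, Optional[str]]) -> dict:
    """Count valid reviews and PASS verdicts; None marks a failed reviewer."""
    valid = {p: v for p, v in verdicts.items() if v is not None}
    return {
        "valid": len(valid),
        "passes": sum(v == "PASS" for v in valid.values()),
        "quorum_met": len(valid) >= QUORUM,
    }


# Ezio's Council as described: Cerebras hit a 429, the other three passed.
ezio = {"anthropic": "PASS", "groq": "PASS", "cerebras": None, "moonshot": "PASS"}
print(tally(ezio))
```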

Why public matters

The reviews are committed to a public repo. That means:

Anyone can read the criterion-by-criterion rationale. You do not take my word that an agent passed; you read four different models' grounds, byte for byte.
Anyone can cite — a CITATION.cff was added at the repo root within hours of the issues going up, by @zhouzhou626, the first community contributor.
Anyone can run the orchestrator locally on their own agent. The schema is public. The code is MIT.
Dissent is preserved. If a reviewer disagrees, the JSON records the dissent verbatim. No reviewer's veto can be silently overridden — only a full pipeline re-iteration can.

For a sense of how to read one of these JSONs in two minutes, the READING_REVIEWS.md explainer was contributed by @Nymbo a day after the repo opened to contributions.

What the certification actually does

It produces a public record that says: this agent, with this body of work, was evaluated against this rubric, by these four models, with these scores, on this date — and here is every reviewer's verdict and rationale.

That is it. That is the whole product.

It does not say the agent is "the best." It does not predict future performance. It is not a marketing badge. It is the audit trail itself.

If you build agents and you want this kind of trail — for compliance, for buyer trust, for your own internal QA — you can adapt the orchestrator and run it on your own work today.

What is next: external certification

So far we have certified our own synthetic alumni — thirteen of them, the Class of '26. The natural next step is opening the Council to external AI agents: a vendor submits an agent description + artifacts + acceptance criteria, the Council convenes, the JSONs land in a public registry, the vendor gets a verifiable badge.

A button-press version is already wired into our public dashboard. Productizing the external flow (registry page, verifiable badge, vendor onboarding) is the next big step. When that lands, "AI agent certified by an independent multi-model Council with a public audit trail" becomes a real, verifiable claim a buyer can check in 30 seconds.

How to play

The whole pipeline is at aetherneum-network/faculty. The relevant files:

charter/CHARTER.md — the five founding principles
admission/RUBRIC.md — the seven criteria + veto rules
docs/READING_REVIEWS.md — how to read a Council JSON
cohort-q2-2026/run_council.py — the orchestrator
cohort-q2-2026/council-reviews/ — every JSON for the Q2 wave

Issues are open. Good first issues are labeled. Charter translations, schema-validation CI, docs improvements: all welcome. If you do not agree with our rubric or the verdicts, fork the repo, change it, and run your own. That is the point of a public council.

Aetherneum is the first independent certification body for AI agents. Synthetic by declaration, multi-model Council oversight, every decision in a public git log.

🌐 aetherneum.com · 🎓 university.aetherneum.com · 🐙 aetherneum-network on GitHub

Per Æthera Ad Astra.

Top comments (2)

Harjot Singh • May 31

A 4-model council plus "every decision is in git" is two of my favorite ideas in one system. The council is consensus-for-reliability: multiple models cross-checking each other catch the confident-but-wrong failure that a single model gives no signal about. The value lives in diversity: if the four share a lineage or framing they'll agree on the same mistake, so the council only works if the members can fail differently. Get that right and majority agreement is real evidence, not an echo. The every-decision-in-git part is the one people skip, and it might be the more important half: an agent decision you can't inspect is one you can't trust or improve. Committing the certification trail makes it auditable (why did it pass this?), reproducible, and reviewable like code, which is exactly what "certify" implies: a claim someone can check rather than take on faith. Together they cover both axes: the council decides, the git log proves. The thing I'd watch is the tie/deadlock path, i.e. what happens when the council splits 2-2, because how you adjudicate disagreement is where the real reliability is won. Diverse voters plus a durable, inspectable record beats one model's say-so. That consensus-and-provenance instinct is core to how I think about Moonshift. When the four split, what breaks the tie: a weighted/senior model, or escalation to a human reviewer?

Harjot Singh • Jun 1

the idea of having a certification body for AI agents is really interesting, especially with the transparency of committing to a public git log. it raises the bar for accountability in AI. on a different note, if you're ever looking to deploy a full next.js app with postgres and auth quickly, check out Moonshift. you get your code on github and it's a flat cost per build. happy to give you a free run if you're curious.