lenin coronel

Posted on Jul 1 • Edited on Jul 4

Synod: Teaching AI Agents to Actually Collaborate, Not Just Coexist

#alibabachallenge #ai #agents #hackathon

The Question That Started It

Track 3 asked for a multi-agent system where agents work together through
"task division, dialogue, and negotiation." I imagined most submissions would
spin up five agents, run them in parallel, and glue the outputs together with
a summary. That's not collaboration; that's five monologues in a trench coat.

I wanted agents that actually depended on each other. So I built Synod.

The Name

A synod is a council that convenes to deliberate and reach a verdict — not a
crowd shouting in parallel, but a structured body where each member has a
defined seat and a defined voice. That's the model I wanted for code review.

The Problem With "More Agents"

My first instinct — and my first mistake — was to throw more agents at the
problem. I benchmarked a single generalist model against a six-agent panel
with full debate rounds. The multi-agent version found 8-11x more "findings."

Then I checked precision. It had collapsed to near zero. The council wasn't
finding more real bugs. It was drowning three real vulnerabilities in a
hundred hallucinated ones, then calling it thoroughness. F1 score, the number
that actually matters, was worse than the single agent's in most samples.

That result reshaped the whole project. Scale isn't the win condition.
Signal is.
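
To make the precision collapse concrete, here is a minimal sketch of the arithmetic. The counts are hypothetical, chosen only to match the shape of the story (three real bugs buried under about a hundred false findings); they are not Synod's actual benchmark numbers.

```python
def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts, guarding against division by zero."""
    reported = true_pos + false_pos   # everything the reviewer flagged
    actual = true_pos + false_neg     # everything that was really wrong
    precision = true_pos / reported if reported else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical single agent: finds 2 of 3 real bugs, raises 2 false alarms
single = precision_recall_f1(true_pos=2, false_pos=2, false_neg=1)

# Hypothetical panel: finds all 3 real bugs, plus 100 hallucinated findings
panel = precision_recall_f1(true_pos=3, false_pos=100, false_neg=0)

print(f"single agent F1: {single[2]:.3f}")
print(f"panel F1:        {panel[2]:.3f}")
```

With these made-up counts the panel has perfect recall and still loses badly on F1, because precision drags the harmonic mean down with it.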

The Architecture: Three Agents With a Reason to Talk

Synod runs three core agents, each with a role that doesn't overlap with the
others, chained so that context flows downstream instead of everyone
analyzing in a vacuum:

Cartographer goes first. It doesn't judge the code — it maps it: modules,
dependencies, entry points. Pure reconnaissance.

Inspector and Sentinel run next, in parallel, but both receive
Cartographer's map as context. Inspector hunts code quality issues —
complexity, anti-patterns, maintainability traps. Sentinel hunts security —
CWE-mapped vulnerabilities, from SQL injection to command injection to unsafe
eval().
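
The ordering described above (Cartographer first, then Inspector and Sentinel in parallel with the map as shared context) can be sketched with asyncio. The agent functions here are stubs with assumed names and return shapes; in a real build each one would call the model, and Synod's actual signatures may differ.

```python
import asyncio


async def cartographer(source: str) -> dict:
    """Stub: map the code without judging it (a real agent would call the model)."""
    return {"entry_points": ["main"], "line_count": len(source.splitlines())}


async def inspector(source: str, code_map: dict) -> list[dict]:
    """Stub code-quality agent; receives the map as context."""
    return [{"agent": "inspector", "title": "placeholder quality finding", "context": code_map}]


async def sentinel(source: str, code_map: dict) -> list[dict]:
    """Stub security agent; receives the same map as context."""
    return [{"agent": "sentinel", "title": "placeholder security finding", "context": code_map}]


async def review(source: str) -> list[dict]:
    # Step 1: the map must exist before anyone analyzes anything
    code_map = await cartographer(source)
    # Step 2: the two reviewers run concurrently, both fed the same map
    quality, security = await asyncio.gather(
        inspector(source, code_map),
        sentinel(source, code_map),
    )
    return quality + security


findings = asyncio.run(review("def main():\n    pass\n"))
```

The dependency is structural: neither reviewer can start until the `await` on the map resolves, so there is no way for them to analyze in a vacuum.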

Arbiter closes the loop. It doesn't just concatenate findings — it:

Deduplicates near-identical findings by title similarity
Validates every cited line number against the actual source, dropping anything a model hallucinated
Escalates severity when two agents independently flag the same issue — real corroboration, not vibes
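
A minimal sketch of those three Arbiter steps, using only the standard library. The finding fields (`agent`, `title`, `line`, `severity`), the four-level severity scale, and the 0.8 similarity threshold are all assumptions made for illustration, not Synod's actual schema or tuning.

```python
from difflib import SequenceMatcher

SEVERITIES = ["low", "medium", "high", "critical"]  # assumed scale


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two titles as the same finding when their similarity ratio clears the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def arbitrate(findings: list[dict], source: str) -> list[dict]:
    line_count = len(source.splitlines())
    merged: list[dict] = []
    for finding in findings:
        # Validate: drop anything citing a line that does not exist in the source
        if not 1 <= finding["line"] <= line_count:
            continue
        # Deduplicate: look for an already-kept finding with a near-identical title
        match = next((m for m in merged if similar(m["title"], finding["title"])), None)
        if match is None:
            merged.append({**finding, "agents": {finding["agent"]}})
            continue
        # Escalate: bump severity one level only when a different agent corroborates
        if finding["agent"] not in match["agents"]:
            match["agents"].add(finding["agent"])
            level = SEVERITIES.index(match["severity"])
            match["severity"] = SEVERITIES[min(level + 1, len(SEVERITIES) - 1)]
    return merged


source = (
    "import sqlite3\n"
    "query = 'SELECT * FROM users WHERE id = ' + user_id\n"
    "cursor.execute(query)\n"
)
findings = [
    {"agent": "sentinel", "title": "SQL injection in query", "line": 2, "severity": "high"},
    {"agent": "inspector", "title": "SQL injection in query string", "line": 2, "severity": "medium"},
    {"agent": "sentinel", "title": "Unsafe eval() call", "line": 40, "severity": "high"},
]
result = arbitrate(findings, source)
```

In this sample the two SQL injection reports collapse into one escalated finding, and the `eval()` report is dropped because line 40 does not exist in a three-line file.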

For high-severity findings, an optional fourth agent — Smith — steps in
with a proposed fix. Sentinel then re-reviews Smith's fix before it's
accepted, in a short bounded loop (max 2 iterations). Generate, critique,
refine — not a free-for-all debate, a targeted second opinion where it
actually earns its cost in tokens.
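
The bounded generate-and-critique loop can be sketched as follows. The `generate` and `critique` callables are hypothetical stand-ins for Smith and Sentinel, and the stubs below fake their behavior so the loop can run without a model; only the cap of two iterations comes from the description above.

```python
from typing import Callable, Optional

MAX_ITERATIONS = 2  # the bound described above


def propose_fix(
    finding: str,
    generate: Callable[[str, Optional[str]], str],
    critique: Callable[[str, str], tuple[bool, str]],
    max_iterations: int = MAX_ITERATIONS,
) -> Optional[str]:
    """Return an approved fix, or None if the reviewer never accepts one within the bound."""
    feedback: Optional[str] = None
    for _ in range(max_iterations):
        fix = generate(finding, feedback)           # Smith proposes (or revises)
        approved, feedback = critique(finding, fix)  # Sentinel re-reviews
        if approved:
            return fix
    return None  # give up rather than keep spending tokens


def smith_stub(finding: str, feedback: Optional[str]) -> str:
    # First attempt still formats SQL by hand; the revision uses a bound parameter
    if feedback is None:
        return "cursor.execute('SELECT * FROM users WHERE id = %s' % user_id)"
    return "cursor.execute('SELECT * FROM users WHERE id = ?', (user_id,))"


def sentinel_stub(finding: str, fix: str) -> tuple[bool, str]:
    # Toy check: accept only fixes that use a placeholder
    if "?" in fix:
        return True, ""
    return False, "still builds SQL by string formatting; use a bound parameter"


accepted = propose_fix("SQL injection (CWE-89)", smith_stub, sentinel_stub)
```

Returning `None` on exhaustion keeps the failure explicit: an unapproved fix is never silently accepted just because the loop ran out of turns.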

This is the pattern the literature on agent orchestration keeps pointing to:
prompt chaining for deterministic dependency, evaluator-optimizer for the one
step that benefits from iteration. Not a debate club. A pipeline with
judgment built in.

What Changed My Mind Mid-Build

I originally planned four debate rounds — individual analysis, cross-debate,
refinement, negotiation — modeled on how human review panels argue. I
scrapped it. The rounds added latency and token cost without adding
precision; agents mostly restated each other's points or, worse, converged
on a wrong consensus because three models arguing don't automatically
out-reason one correct observation. Sycophancy cascades are real in
multi-agent systems — agents tend to agree with the majority, not the truth.

Cutting the debate and replacing it with structural dependency (Cartographer
→ Inspector/Sentinel) and evidence validation (Arbiter checking real line
numbers) did more for output quality than any amount of extra dialogue.

Deploying on Alibaba Cloud

Synod runs as a single FastAPI service in Docker on an Alibaba Cloud ECS
instance, talking to Qwen Cloud through its OpenAI-compatible endpoint
(qwen3-coder-plus for review, chosen specifically because it's a
code-specialized model rather than a general one). One container, one
docker-compose up, one exposed port. No database cluster to babysit before
a demo — working memory lives in a per-session dict, which is honest about
what a hackathon build needs: fast to deploy, nothing to migrate, nothing to
lose sleep over the night before judging.

What a Real Review Looks Like

Point Synod at an intentionally vulnerable sample and it comes back with
Sentinel catching hardcoded credentials (CWE-798), SQL injection (CWE-89),
command injection via both os.system and subprocess calls with shell=True
(CWE-78), and unsafe eval() (CWE-94) — while Inspector independently flags
the SQL injection from a code-quality angle, and Arbiter treats that overlap
as corroboration and escalates it. That's the collaboration piece actually
doing something: two different lenses converging on the same real problem,
not two agents padding a findings count.

What I'd Tell Past Me

Precision before scale. A ten-agent system that's 95% noise is worse
than a three-agent system that's mostly signal. Measure F1, not finding
count.

Give agents a reason to depend on each other. Parallel agents with no
shared context aren't a society, they're a queue. Structural dependency —
one agent's output becoming another's input — is what makes "multi-agent"
mean something instead of being a buzzword on the tin.

Bound your loops. Iteration helps exactly where verification is cheap
and drift is a real risk. Everywhere else, it's just more tokens for the
same answer.

Simple deploys win demos. No cluster, no orchestration, no 2am migration
panic. One container that starts in fifteen seconds beats an impressive
architecture diagram that doesn't survive a live run.

Repository

Open source, MIT licensed:
github.com/02NIN20/Synod

Built for the Global AI Hackathon Series with Qwen Cloud — Track 3: Agent
Society.
