DEV Community: lenin coronel

Synod Update: Adding a Deterministic Safety Net (and Proving It Helps)

lenin coronel — Wed, 08 Jul 2026 22:53:30 +0000

Quick update on Synod, the multi-agent code reviewer I've been building for
the Qwen Cloud hackathon.

What changed

I added a Semgrep pre-filter in front of the security agent. Before, every
finding came purely from the LLM reading the code and reasoning about it —
which works, but LLMs are stochastic. Same file, different run, sometimes a
different result.

Now Semgrep scans first with deterministic rules, and the security agent
validates and enriches those candidates instead of starting from zero every
time.
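
To make the scan-first flow concrete, here is a minimal sketch of turning a Semgrep JSON report into candidates for the security agent. It is not Synod's actual code: the `Candidate` type, the `build_prompt` wording, and the `--config auto` default are assumptions for illustration. Only the `semgrep scan --json` invocation and the report fields it reads (`results`, `check_id`, `path`, `start.line`, `extra.message`) come from Semgrep itself.

```python
import json
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One scanner hit, handed to the security agent (hypothetical type)."""
    rule_id: str
    path: str
    line: int
    message: str


def parse_semgrep_json(raw: str) -> list[Candidate]:
    """Turn Semgrep's JSON report into a list of candidates."""
    report = json.loads(raw)
    return [
        Candidate(
            rule_id=r["check_id"],
            path=r["path"],
            line=r["start"]["line"],
            message=r.get("extra", {}).get("message", ""),
        )
        for r in report.get("results", [])
    ]


def scan(target: str, config: str = "auto") -> list[Candidate]:
    """Run the Semgrep CLI on a path; requires semgrep on PATH."""
    proc = subprocess.run(
        ["semgrep", "scan", "--config", config, "--json", target],
        capture_output=True, text=True, check=True,
    )
    return parse_semgrep_json(proc.stdout)


def build_prompt(source: str, candidates: list[Candidate]) -> str:
    """Assumed prompt shape: the agent validates candidates instead of starting cold."""
    listed = "\n".join(
        f"- {c.rule_id} at line {c.line}: {c.message}" for c in candidates
    )
    return (
        "Validate and enrich these scanner candidates:\n"
        f"{listed}\n\nSource:\n{source}"
    )
```

Keeping the parser separate from the subprocess call means the candidate handling can be exercised with a saved report, without Semgrep installed.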

Did it actually help?

I was skeptical of my own change, so I benchmarked it properly instead of
assuming. Ran a single-agent baseline against the full council, with and
without the pre-filter, same vulnerable file, checked against known ground
truth:

| Method | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Single agent | 75% | 75%, but ranged 0–75% across runs | 75% |
| Council, LLM-only | 75% | same variance issue | 75% |
| Council + Semgrep | 100% | 100%, every run | 100% |

The interesting part wasn't the top-line numbers — it was that the
single-agent and LLM-only council both had real run-to-run variance.
Sometimes it caught everything, sometimes it missed half. That's not a
reviewer you can trust in CI.

Adding the deterministic scanner as a floor fixed that. It's not smarter,
it's just consistent — and consistency turned out to matter more than I
expected.
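
A rough way to see why a deterministic floor removes the variance, sketched with invented data. The finding sets below are illustrative placeholders chosen to mirror the 0–75% spread, not the actual benchmark runs, and a real scanner only covers what its rules cover.

```python
def recall(found: set[str], truth: set[str]) -> float:
    """Share of known issues that a run actually reported."""
    return len(found & truth) / len(truth) if truth else 1.0


def recall_range(runs, truth, floor=frozenset()):
    """Min and max recall across runs, with scanner hits always included."""
    scores = [recall(set(run) | set(floor), truth) for run in runs]
    return min(scores), max(scores)


# Illustrative data only (assumed, not measured).
TRUTH = {"CWE-798", "CWE-89", "CWE-78", "CWE-94"}
LLM_RUNS = [
    {"CWE-798", "CWE-89", "CWE-78"},  # a good run
    {"CWE-89"},                       # a weak run
    set(),                            # a run that found nothing
]
SCANNER_HITS = {"CWE-798", "CWE-89", "CWE-78", "CWE-94"}

llm_only = recall_range(LLM_RUNS, TRUTH)
with_floor = recall_range(LLM_RUNS, TRUTH, floor=SCANNER_HITS)
```

Because the scanner's hits are unioned into every run, the worst run can never score below what the scanner alone finds, which is the whole point of calling it a floor.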

Also shipped

- A GitHub webhook: open a PR, and Synod reviews the diff and comments
  directly, with findings grouped by severity.
- A small CLI, closer to how tools like Claude Code feel in the terminal,
  for reviewing files or whole directories without touching the API
  directly.

Repo's still open source: github.com/02NIN20/Synod

Built for the Global AI Hackathon Series with Qwen Cloud — Track 3: Agent
Society.

What Six Arguing AI Agents Taught Me About Building One That Actually Works

lenin coronel — Sat, 04 Jul 2026 18:23:06 +0000

I broke my own project on purpose, twice, before it worked. Here's the
story.

Round one: the debate club

My first idea for this hackathon sounded great in my head. Six AI agents,
each with a "role" — security, architecture, performance, whatever — and
they'd debate each other across multiple rounds before agreeing on a final
answer. Like a mini panel of experts arguing it out.

I built it. I ran it against some vulnerable test code. It came back with
127 findings.

I got excited for about four minutes. Then I actually read them.

Maybe three were real. The other 124 were the agents politely agreeing with
each other about problems that didn't exist, or restating the same bug five
different ways because five different agents happened to notice it.
Precision was somewhere around 2%. Worse than a single model working alone.

That stung a little, not going to lie. I'd spent days on the debate logic.

Round two: quieter, and better

So I ripped it apart. No more debate rounds. No more six agents shouting
over each other. I went down to four, gave each one exactly one job, and —
this is the part that actually fixed things — made them depend on each
other in order instead of all firing at once.

One agent maps out the code first. Two others use that map to look at
security and quality separately. A last one compares what they found,
throws out duplicates, and — importantly — actually checks the line numbers
against the real file instead of trusting the AI's word for it.

Same test file. This time: real vulnerabilities, correctly flagged, nothing
made up. Point it at clean code afterward and it correctly said nothing was
wrong, which honestly felt like a bigger win than finding the bugs did.

The annoying lesson

I wanted this project to feel impressive. More agents, more debate, more
"look how sophisticated this is." What actually worked was the boring
answer: fewer agents, clear roles, one checking the other's work instead of
everyone talking at once.

I named the final version Synod, after the idea of a council that actually
deliberates and reaches a verdict, instead of a crowd that just makes noise.

The version that's on GitHub today is the second architecture, not the
first. It's running on Alibaba Cloud, powered by Qwen, and includes a CLI so
I can review code, chat about a project, or scan an entire repo right from
the terminal. If you want to poke at it or roast my code, it's open source:
github.com/02NIN20/Synod


Synod: Teaching AI Agents to Actually Collaborate, Not Just Coexist

lenin coronel — Wed, 01 Jul 2026 08:06:00 +0000

The Question That Started It

Track 3 asked for a multi-agent system where agents work together through
"task division, dialogue, and negotiation." I imagined most submissions would
spin up five agents, run them in parallel, and glue the outputs together with
a summary. That's not collaboration — that's five monologues in a trench coat.

I wanted agents that actually depended on each other. So I built Synod.

The Name

A synod is a council that convenes to deliberate and reach a verdict — not a
crowd shouting in parallel, but a structured body where each member has a
defined seat and a defined voice. That's the model I wanted for code review.

The Problem With "More Agents"

My first instinct — and my first mistake — was to throw more agents at the
problem. I benchmarked a single generalist model against a six-agent panel
with full debate rounds. The multi-agent version found 8-11x more "findings."

Then I checked precision. It had collapsed to near zero. The council wasn't
finding more real bugs — it was drowning three real vulnerabilities in a
hundred hallucinated ones, then calling it thoroughness. F1 score, the number
that actually matters, was worse than the single agent in most samples.

That result reshaped the whole project. Scale isn't the win condition.
Signal is.
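
The arithmetic behind that collapse, as a small sketch. The counts are illustrative: three true positives and roughly a hundred false positives match the description above, but the one missed issue (`fn=1`) and the single-agent counts are assumptions picked to make the comparison concrete.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Noisy panel: 3 real findings buried under 100 hallucinated ones.
panel = prf(tp=3, fp=100, fn=1)

# Single agent: same 3 real findings, 1 false alarm (assumed).
single = prf(tp=3, fp=1, fn=1)
```

Both systems have the same recall here, yet the panel's F1 lands around 0.06 against 0.75 for the single agent, because F1 is a harmonic mean and gets dragged toward the weaker of the two numbers.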

The Architecture: Three Agents With a Reason to Talk

Synod runs three core agents, each with a role that doesn't overlap with the
others, chained so that context flows downstream instead of everyone
analyzing in a vacuum:

Cartographer goes first. It doesn't judge the code — it maps it: modules,
dependencies, entry points. Pure reconnaissance.

Inspector and Sentinel run next, in parallel, but both receive
Cartographer's map as context. Inspector hunts code quality issues —
complexity, anti-patterns, maintainability traps. Sentinel hunts security —
CWE-mapped vulnerabilities, from SQL injection to command injection to unsafe
eval().
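
The ordering described so far (map first, then two reviewers in parallel on the same map) can be sketched with `asyncio`. The three agent functions are deterministic stand-ins for LLM calls, and their names, return shapes, and toy heuristics are assumptions, not Synod's implementation.

```python
import asyncio


async def cartographer(source: str) -> dict:
    """Stand-in for the mapping agent: list function names, judge nothing."""
    names = [
        ln[4:].split("(")[0]
        for ln in source.splitlines()
        if ln.startswith("def ")
    ]
    return {"functions": names}


async def inspector(source: str, code_map: dict) -> list[dict]:
    """Stand-in quality check: flag overly long lines."""
    return [
        {"agent": "inspector", "title": "Line too long", "line": i,
         "known_functions": code_map["functions"]}
        for i, ln in enumerate(source.splitlines(), start=1)
        if len(ln) > 100
    ]


async def sentinel(source: str, code_map: dict) -> list[dict]:
    """Stand-in security check: flag eval() calls."""
    return [
        {"agent": "sentinel", "title": "Unsafe eval()", "line": i,
         "known_functions": code_map["functions"]}
        for i, ln in enumerate(source.splitlines(), start=1)
        if "eval(" in ln
    ]


async def review(source: str) -> dict:
    # Step 1: the map must exist before anyone reviews.
    code_map = await cartographer(source)
    # Step 2: both reviewers run concurrently, each given the same map.
    quality, security = await asyncio.gather(
        inspector(source, code_map),
        sentinel(source, code_map),
    )
    return {"map": code_map, "inspector": quality, "sentinel": security}


result = asyncio.run(review("def f(x):\n    return eval(x)\n"))
```

The `await` before `gather` is what encodes the dependency: neither reviewer can start until the map is ready, but they do not wait on each other.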

Arbiter closes the loop. It doesn't just concatenate findings — it:

- Deduplicates near-identical findings by title similarity
- Validates every cited line number against the actual source, dropping anything a model hallucinated
- Escalates severity when two agents independently flag the same issue (real corroboration, not vibes)
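
Those three Arbiter steps can be sketched in a few lines. This is an assumed shape rather than Synod's code: the finding dict keys, the 0.8 similarity threshold, the four-level severity ladder, and the use of `difflib` for title similarity are all illustrative choices.

```python
from difflib import SequenceMatcher

SEVERITIES = ["low", "medium", "high", "critical"]  # assumed ladder


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two titles as the same issue above an assumed threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def arbitrate(findings: list[dict], source: str) -> list[dict]:
    total_lines = len(source.splitlines())
    merged: list[dict] = []
    for f in findings:
        # Validate: drop findings that cite a line the file does not have.
        if not 1 <= f["line"] <= total_lines:
            continue
        # Deduplicate: fold near-identical titles into one finding.
        match = next((m for m in merged if similar(m["title"], f["title"])), None)
        if match is None:
            merged.append({**f, "agents": {f["agent"]}})
        else:
            match["agents"].add(f["agent"])
    # Escalate: bump severity one step when two agents agree.
    for m in merged:
        if len(m["agents"]) >= 2:
            idx = SEVERITIES.index(m["severity"])
            m["severity"] = SEVERITIES[min(idx + 1, len(SEVERITIES) - 1)]
    return merged


sample_source = "def get_user(db, name):\n    q = 'SELECT * FROM u WHERE n=' + name\n    return db.execute(q)\n"
sample_findings = [
    {"agent": "sentinel", "title": "SQL injection in get_user", "line": 2, "severity": "high"},
    {"agent": "inspector", "title": "SQL Injection in get_user()", "line": 2, "severity": "medium"},
    {"agent": "sentinel", "title": "Unsafe eval()", "line": 99, "severity": "high"},
]
verdict = arbitrate(sample_findings, sample_source)
```

Tracking the set of agents per merged finding keeps escalation honest: the same agent repeating itself never counts as corroboration.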

For high-severity findings, an optional fourth agent — Smith — steps in
with a proposed fix. Sentinel then re-reviews Smith's fix before it's
accepted, in a short bounded loop (max 2 iterations). Generate, critique,
refine — not a free-for-all debate, a targeted second opinion where it
actually earns its cost in tokens.
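
A minimal sketch of that bounded generate-and-critique loop. The `propose` and `review` callables stand in for Smith and Sentinel, and their signatures are assumptions. The only detail taken from the description is the cap of two iterations.

```python
from typing import Callable, Optional


def fix_with_review(
    finding: dict,
    propose: Callable[[dict, str], str],
    review: Callable[[dict, str], tuple[bool, str]],
    max_iterations: int = 2,
) -> tuple[Optional[str], int]:
    """Return (accepted_patch, attempts), or (None, attempts) if none passes."""
    feedback = ""
    for attempt in range(1, max_iterations + 1):
        patch = propose(finding, feedback)            # generate
        approved, feedback = review(finding, patch)   # critique
        if approved:
            return patch, attempt
    return None, max_iterations                       # give up, stay bounded


# Stub agents for illustration only.
def stub_smith(finding: dict, feedback: str) -> str:
    return "parameterized query" if feedback else "escape quotes by hand"


def stub_sentinel(finding: dict, patch: str) -> tuple[bool, str]:
    ok = patch == "parameterized query"
    return ok, "" if ok else "manual escaping is still injectable"


accepted = fix_with_review({"title": "SQL injection"}, stub_smith, stub_sentinel)
rejected = fix_with_review({"title": "SQL injection"}, stub_smith, lambda f, p: (False, "no"))
```

Returning `None` when the cap is hit makes the failure explicit, so an unreviewed patch never slips through just because the loop ran out of turns.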

This is the pattern the literature on agent orchestration keeps pointing to:
prompt chaining for deterministic dependency, evaluator-optimizer for the one
step that benefits from iteration. Not a debate club. A pipeline with
judgment built in.

What Changed My Mind Mid-Build

I originally planned four debate rounds — individual analysis, cross-debate,
refinement, negotiation — modeled on how human review panels argue. I
scrapped it. The rounds added latency and token cost without adding
precision; agents mostly restated each other's points or, worse, converged
on a wrong consensus because three models arguing don't automatically
out-reason one correct observation. Sycophancy cascades are real in
multi-agent systems — agents tend to agree with the majority, not the truth.

Cutting the debate and replacing it with structural dependency (Cartographer
→ Inspector/Sentinel) and evidence validation (Arbiter checking real line
numbers) did more for output quality than any amount of extra dialogue.

Deploying on Alibaba Cloud

Synod runs as a single FastAPI service in Docker on an Alibaba Cloud ECS
instance, talking to Qwen Cloud through its OpenAI-compatible endpoint
(qwen3-coder-plus for review, chosen specifically because it's a
code-specialized model rather than a general one). One container, one
docker-compose up, one exposed port. No database cluster to babysit before
a demo — working memory lives in a per-session dict, which is honest about
what a hackathon build needs: fast to deploy, nothing to migrate, nothing to
lose sleep over the night before judging.

What a Real Review Looks Like

Point Synod at an intentionally vulnerable sample and it comes back with
Sentinel catching hardcoded credentials (CWE-798), SQL injection (CWE-89),
command injection via both os.system and subprocess calls with shell=True
(CWE-78), and unsafe eval() (CWE-94) — while Inspector independently flags
the SQL injection from a code-quality angle, and Arbiter treats that overlap
as corroboration and escalates it. That's the collaboration piece actually
doing something: two different lenses converging on the same real problem,
not two agents padding a findings count.

What I'd Tell Past Me

Precision before scale. A ten-agent system that's 95% noise is worse
than a three-agent system that's mostly signal. Measure F1, not finding
count.

Give agents a reason to depend on each other. Parallel agents with no
shared context aren't a society, they're a queue. Structural dependency —
one agent's output becoming another's input — is what makes "multi-agent"
mean something instead of being a buzzword on the tin.

Bound your loops. Iteration helps exactly where verification is cheap
and drift is a real risk. Everywhere else, it's just more tokens for the
same answer.

Simple deploys win demos. No cluster, no orchestration, no 2am migration
panic. One container that starts in fifteen seconds beats an impressive
architecture diagram that doesn't survive a live run.

Repository

Open source, MIT licensed:
github.com/02NIN20/Synod
