Vadym Arnaut

Posted on Jun 9 • Edited on Jul 9

I ran 4 AI agents on yesterday's PRs. Two real security bugs surfaced.

#ai #security #testing #python

After every coding session I run a 4-agent parallel audit on the diff I just shipped. A recent session of mine was seven PRs landing a new daily-challenge feature on my open-source LMS. Two of the audit findings were real security or integrity bugs that my human review missed. This is the playbook.

The four agents

I split the audit into four narrow roles. Narrow because a generalist agent tells you everything is fine; a specialist with a clear mandate tells you what is broken.

Cleanup agent. Looks for leftover patterns from the work just done: dead references to removed roles, unused i18n keys, orphan test fixtures, dual-named scripts where one should have died.
Security agent. Auth, tokens, rotation, CSP, RLS, secrets in code, error messages that leak structure. Treats every new endpoint as hostile until proven otherwise.
Test integrity agent. Reads each test that was added or changed and asks the harder question: would this test fail if the code under test were wrong? Hunts for tautologies, tests that pass by insertion order, and assertions that match the implementation instead of the requirement.
Live prod verification agent. Pulls actual production data, hits the new endpoints with real auth, confirms the change reached prod and behaves like the diff implies.

Each agent runs in parallel with a 15-20 minute budget, not a 6-minute skim. The output is a findings list, severity-tagged, with file:line references and proposed fixes.
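One way the fan-out described above could be wired up, as a minimal sketch. The four role names come from this post; `run_agent`, the `Finding` fields, the four-level severity scale, and the budget constant are illustrative assumptions, and `run_agent` is a stub standing in for whatever agent runner you actually use.

```python
import asyncio
from dataclasses import dataclass

# Mandates paraphrase the four roles above; the wording is illustrative.
ROLES = {
    "cleanup": "Find leftovers from the work just done: dead references, unused keys, orphan fixtures.",
    "security": "Treat every new endpoint as hostile: auth, tokens, rotation, CSP, RLS, secrets.",
    "test_integrity": "For each changed test, ask whether it would fail if the code were wrong.",
    "live_prod": "Confirm the change reached prod and behaves like the diff implies.",
}
BUDGET_SECONDS = 20 * 60  # upper end of the 15-20 minute budget
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}  # assumed scale


@dataclass
class Finding:
    role: str
    severity: str
    location: str  # "path/to/file.py:123"
    summary: str
    proposed_fix: str


async def run_agent(role: str, mandate: str, diff: str) -> list[Finding]:
    """Hypothetical stub: replace with a call to your real agent runner."""
    await asyncio.sleep(0)
    return []


async def audit(diff: str, runner=run_agent, budget: float = BUDGET_SECONDS) -> list[Finding]:
    async def bounded(role: str, mandate: str) -> list[Finding]:
        try:
            return await asyncio.wait_for(runner(role, mandate, diff), timeout=budget)
        except asyncio.TimeoutError:
            # A role that blows its budget is reported rather than silently dropped.
            return [Finding(role, "low", "-", "agent exceeded its time budget", "rerun")]

    # All four roles run concurrently; results are flattened and ranked by severity.
    per_role = await asyncio.gather(*(bounded(r, m) for r, m in ROLES.items()))
    findings = [f for batch in per_role for f in batch]
    return sorted(findings, key=lambda f: SEVERITY_RANK[f.severity])
```

Calling `asyncio.run(audit(diff_text))` with a real runner would return one merged list with the most severe findings first, which is the shape the rest of this post assumes.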

The audit's haul: two real bugs

Critical: PyJWT `decode` does not check how old `iat` is

The diff included an iCal subscription feature with a 365-day token TTL and a "rotate token" button. Rotation generated a new JWT with a fresh iat and updated nothing else. The implicit promise was that the old token would stop working.

It did not. The security agent flagged this in one line: "pyjwt.decode does not validate iat by default. Old tokens with the original iat remain valid for the full 365-day TTL after rotation."

import jwt

# Before: rotation does nothing useful
payload = jwt.decode(token, SECRET, algorithms=["HS256"])
# iat is in the payload but never compared to anything

# After: refuse any token issued before the user's rotation floor
floor = current_user.calendar_ical_min_iat
# "require" makes a token with no iat fail at decode instead of raising KeyError below
payload = jwt.decode(token, SECRET, algorithms=["HS256"], options={"require": ["iat"]})
if floor is not None and payload["iat"] < floor:
    raise InvalidToken()

Schema change: one new column on profiles, calendar_ical_min_iat BIGINT NULL. Rotation writes the new token's iat to that floor. Verification refuses any token whose iat is below it. Three tests pin the contract: rotation invalidates the old token, a token signed with a different secret fails, the floor read works on first rotation.
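The rotation contract above can be sketched without a database or a JWT library. This is a minimal illustration, not the project's code: `Profile`, `rotate`, and `check_floor` are hypothetical names, the payload is assumed to have already passed `jwt.decode` (signature and `exp`), and only the `calendar_ical_min_iat` column name comes from this post.

```python
import time
from dataclasses import dataclass
from typing import Optional


class InvalidToken(Exception):
    """Raised when a token predates the user's rotation floor."""


@dataclass
class Profile:
    # Mirrors the nullable BIGINT column: None until the first rotation.
    calendar_ical_min_iat: Optional[int] = None


def rotate(profile: Profile, now: Optional[int] = None) -> dict:
    """Issue a fresh payload and move the floor up to its iat."""
    iat = int(time.time()) if now is None else now
    profile.calendar_ical_min_iat = iat
    return {"iat": iat}  # a real payload would also carry sub, exp, and so on


def check_floor(payload: dict, profile: Profile) -> dict:
    """Reject an already-decoded payload issued before the floor."""
    floor = profile.calendar_ical_min_iat
    if floor is not None and payload["iat"] < floor:
        raise InvalidToken("token was issued before the last rotation")
    return payload


profile = Profile()
old = {"iat": 1_000}           # issued before any rotation
check_floor(old, profile)      # accepted: no floor yet
new = rotate(profile, now=2_000)
check_floor(new, profile)      # accepted: iat equals the floor
try:
    check_floor(old, profile)
except InvalidToken:
    print("old token rejected after rotation")
```

One design detail worth knowing: `iat` has one-second resolution and the comparison is strict, so a token issued in the same second as the rotation would still pass. Whether that window matters depends on your threat model.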

PyJWT validates exp and nbf by default when those claims are present. Its iat check is a different kind of check: verify_iat is already on by default, but it only confirms that iat is a number and, in recent releases, that it is not in the future. It never asks whether iat is recent enough. If your auth design assumes "rotation invalidates everything older," you have to enforce it yourself.

High: a test that was passing by insertion order

The test integrity agent caught a test named test_prefers_live_over_archive_attempt. The function under test ordered results by is_archive ASC. The test inserted the live attempt first, then the archive attempt, then asserted the live one came back.

That assertion passed for the wrong reason. SQLite returned rows in insertion order absent an effective ORDER BY. If the ORDER BY had been deleted from the production code, the test would have stayed green. The agent flagged this as "tautological under the current backend; rewrite so only the ORDER BY can make the test pass." I reversed the insertion order. The test passes for the right reason now.
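A minimal, self-contained reproduction of the rewritten test, using Python's built-in `sqlite3`. The `attempts` table, its columns, and the helper names are assumptions for illustration; only the test name and the `is_archive ASC` ordering come from this post. The archive row goes in first, so the live row can only come back first if the ORDER BY does its job.

```python
import sqlite3


def preferred_attempt(conn: sqlite3.Connection, user_id: int) -> str:
    """Return the attempt kind to show, preferring live (is_archive = 0)."""
    row = conn.execute(
        "SELECT kind FROM attempts WHERE user_id = ? "
        "ORDER BY is_archive ASC LIMIT 1",
        (user_id,),
    ).fetchone()
    return row[0]


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical attempts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE attempts ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, is_archive INTEGER, kind TEXT)"
    )
    return conn


def test_prefers_live_over_archive_attempt() -> None:
    conn = make_db()
    # Archive first, live second: insertion order now points at the wrong row,
    # so deleting the ORDER BY from preferred_attempt should turn this red.
    conn.execute("INSERT INTO attempts (user_id, is_archive, kind) VALUES (1, 1, 'archive')")
    conn.execute("INSERT INTO attempts (user_id, is_archive, kind) VALUES (1, 0, 'live')")
    assert preferred_attempt(conn, 1) == "live"


test_prefers_live_over_archive_attempt()
```

SQL makes no promise about row order without an ORDER BY, so the original test was leaning on an implementation detail of the backend rather than on the code under test.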

This is the finding that scared me most. The other one was a security bug, but at least it was a bug I would have learned from when it bit. A test that passes for the wrong reason is a bug you never learn from, because it tells you everything is fine while the implementation rots underneath.

Why specialist agents beat one generalist

The first version of this workflow used one agent with a long prompt that said "find security issues, cleanup issues, test issues, and verify against prod." The output was a list of fifteen "potential concerns" that read like a checklist someone copied from a blog post. Half were false positives, none were severity-ranked, and the actually-critical PyJWT thing was buried at item nine.

Splitting the same audit into four specialists with narrow mandates and clear severity rubrics changed the output. The cleanup agent comes back with "zero items" when there is nothing, instead of inventing four items to look thorough. The security agent comes back with two findings, both real, both severity-tagged, both with file:line refs. The test integrity agent finds the tautology that nobody who wrote the test would ever see, because the human author is exactly the wrong person to ask whether the test would catch the bug.

What this is not

This is not a code-review bot replacing human review. It runs after the human review and merge, on the change as it exists in main. It is not a substitute for thinking about the diff while you write it. It is the second pass that catches what familiarity blinds you to, on the surface area the diff actually touches.

If a human review took 30 minutes and missed the PyJWT thing, a four-agent audit that catches it is the cheapest security investment in the workflow. The cost is the agent budget. The value is the bug that would have been a year-long production leak.

I will not be writing production code without the four-agent post-session audit again.

Equip is open source under MIT at github.com/ArVaViT/equip. PR #626 closes the bugs above and adds the defensive tests.

ArVaViT / equip

Free, open-source LMS for Bible schools, ministries, and nonprofit educational programs. React + FastAPI + Supabase.

Equip

A free, open-source learning management system built for Bible schools, church ministries, and nonprofit educational programs

Live demo · Roadmap · Contributing · Support · Changelog

Screenshots

Sign in (light) · Sign in (dark) · Account creation with student / teacher role picker · Mobile (390px)

Live at equipbible.com. Teacher and admin views (gradebook, course editor, analytics) are behind sign-in — create a free account to explore.

Why this project?

Hundreds of small Bible schools, home churches, and missionary training programs around the world still manage courses on paper, WhatsApp, or spreadsheets. Commercial LMS platforms are expensive, overkill, or require technical expertise that volunteer-run organizations simply don't have.

Equip is designed to change that:

Free forever — MIT-licensed, no paywalls, no "premium" tiers.
Simple to deploy — one-click Vercel deploy with a free Supabase database. No Docker, no servers to manage.
Built for small scale — optimized for 20-100 students, not…


Top comments (3)

Alex Shev • Jun 12

This is the kind of agent use case that makes sense to me: narrow scope, recent diffs, concrete findings, and a human review path. Agents are much better as extra reviewers over a bounded change than as vague “secure my whole app” systems.

Vadym Arnaut • Jun 17

@alexshev bounded scope is exactly why it caught anything. One of the two was an RLS grant leaking is_correct on our quiz_options table to any authenticated user through the browser's anon key. Backend code was clean - a "secure my whole app" pass would've sailed right past it. It only showed because the agents diffed the actual Postgres grants against that day's migration. With a public anon key, the grants are the security boundary, not the API layer.

Alex Shev • Jun 18

That RLS example is exactly why the bounded scan worked. The dangerous boundary was not the backend route, it was the database grant exposed through the browser key.

That is a good pattern for agent security reviews: give the agent a narrow artifact class and ask it to compare actual deployed permissions against the intended model. Broad security review would probably miss it.