pponali

Posted on May 18

I shipped my first agentic workflow on GitHub Actions — 4 AI reviewers on every PR

#githubactions #ai #devops #automation

I run Khetisahayak — a small agritech app — solo. Every PR I open has nobody to review it but me, and I review my own code about as well as anyone reviews their own code. So I built a GitHub Actions workflow that drops four AI reviewers and a test runner onto every PR. This is the first real "agentic" workflow I've shipped, and it now greets every commit I push with five separate opinions.

This post walks through exactly how it works — trigger, sequence, and flow — plus what I'd change.

Here's what one PR actually looks like in production (this is the fallback path when the Gemini API rate-limits us — more on that at the end):

Four bot comments, one per persona — Senior Developer, CSO, Engineering Lead, Software Architect — and a fifth from the test runner. When the AI call succeeds, those "Automated review could not be completed" blocks are replaced with actual line-by-line review feedback.

The two files that do everything

The whole system is two files:

.github/workflows/code-review-agents.yml — the GitHub Actions workflow (the orchestration)
.github/scripts/review_agent.py — a ~120-line Python script that calls Gemini once per reviewer persona

That's it. No Lambda, no queue, no separate review service. GitHub Actions is the runtime.

1. The trigger

on:
  pull_request:
    branches: [main, develop]
    types: [opened, synchronize, reopened]

Fires when a PR targeting main or develop is:

opened — PR created
synchronize — new commits pushed to the PR branch
reopened — closed PR reopened

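It does not fire on direct pushes to main, PRs targeting other branches, comments, or label changes. That's deliberate: I don't want a reviewer firing every time someone reacts with a 👍. One caveat: draft PRs still emit opened and synchronize events, so skipping them takes an explicit condition such as if: github.event.pull_request.draft == false on each job, plus ready_for_review in types so the run fires once the draft is marked ready.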

Permissions are scoped tight:

permissions:
  contents: read
  pull-requests: write
  issues: write

Read the code, write comments. Nothing else.

2. The sequence — five jobs, one DAG

The workflow defines six jobs: five that do the work and a final review-gate. GitHub Actions schedules them according to their needs: declarations:

flowchart TD
    A([PR event]) --> B[L1: peer-review]
    A --> D[L4: security-review]
    A --> E[test-execution]
    B --> C[L2: lead-review]
    C --> F[L3: architect-review]
    B --> G[review-gate]
    C --> G
    F --> G
    D --> G
    E --> G

    style B fill:#dbeafe,stroke:#2563eb
    style C fill:#dbeafe,stroke:#2563eb
    style F fill:#dbeafe,stroke:#2563eb
    style D fill:#fee2e2,stroke:#dc2626
    style E fill:#dcfce7,stroke:#16a34a
    style G fill:#f3f4f6,stroke:#6b7280

Three things kick off in parallel the moment a PR event lands:

peer-review (L1) — code style, correctness, DRY
security-review (L4) — OWASP, secrets, injection, auth
test-execution — actually runs Jest + Flutter tests in a real Postgres container

A sequential chain runs alongside:

lead-review waits on peer-review finishing
architect-review waits on lead-review finishing

The final gate waits on all five:

review-gate:
  needs: [peer-review, lead-review, architect-review, security-review, test-execution]
  if: always()

if: always() means the gate runs even if upstream jobs fail. Right now it just echoes each job's result — it doesn't actually block the merge. To make it blocking, you add a step that exits non-zero on any failure, then mark review-gate as a required check in branch protection. (I have not done this yet. Don't be like me.)
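A blocking gate could be as small as the sketch below. It assumes the workflow hands each upstream needs.<job>.result value to the script through environment variables; the RESULT_* variable names and both helper functions are hypothetical, not part of the original workflow.

```python
import os

# Upstream jobs the gate waits on (names taken from the needs: list above).
JOBS = ["peer-review", "lead-review", "architect-review",
        "security-review", "test-execution"]


def results_from_env(env=os.environ) -> dict[str, str]:
    """Collect job results from the environment.

    Hypothetical convention: RESULT_PEER_REVIEW=success, and so on.
    A job with no variable set is reported as "missing".
    """
    return {
        job: env.get("RESULT_" + job.upper().replace("-", "_"), "missing")
        for job in JOBS
    }


def gate_exit_code(results: dict[str, str]) -> int:
    """Return 0 only if every upstream job reported 'success'."""
    failed = {job: res for job, res in results.items() if res != "success"}
    for job, res in failed.items():
        print(f"{job}: {res}")
    return 1 if failed else 0
```

The gate step would end with sys.exit(gate_exit_code(results_from_env())). One thing to keep in mind: a step marked continue-on-error: true does not fail its job, so the test job would need to surface failures some other way before a gate like this could see them.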

Observable wall-clock timing on a typical PR:

T=0: peer + security + tests start
T≈45s: peer finishes → lead starts
T≈90s: lead finishes → architect starts
T≈3–8 min: tests finish (usually the slowest)
T≈4–9 min: review-gate fires

End-to-end: roughly 5–10 minutes per PR, with 4 AI comments + 1 test report appearing on the PR.
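The timeline falls out of the needs: graph: a job starts once its slowest dependency has finished. Here is a rough sketch of that calculation. It uses the ~45s reviewer timings above and assumes durations for architect-review and security-review (45s each) and test-execution (180s, the low end of the observed range). Runner queue and setup time are ignored, and finish_times is a hypothetical helper.

```python
from functools import lru_cache

# job -> (duration in seconds, jobs it needs)
JOBS = {
    "peer-review": (45, []),
    "lead-review": (45, ["peer-review"]),
    "architect-review": (45, ["lead-review"]),   # assumed duration
    "security-review": (45, []),                 # assumed duration
    "test-execution": (180, []),                 # assumed: low end of 3-8 min
    "review-gate": (0, ["peer-review", "lead-review", "architect-review",
                        "security-review", "test-execution"]),
}


def finish_times(jobs: dict) -> dict:
    """Earliest finish time of each job, given durations and dependencies."""

    @lru_cache(maxsize=None)
    def finish(name: str) -> int:
        duration, needs = jobs[name]
        # A job with no dependencies starts at T=0.
        start = max((finish(dep) for dep in needs), default=0)
        return start + duration

    return {name: finish(name) for name in jobs}


times = finish_times(JOBS)
print(times["architect-review"])  # 135: end of the sequential review chain
print(times["review-gate"])       # 180: tests are the critical path
```

Under these assumptions the review chain finishes well before the tests, so the test job decides when the gate fires.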

3. The flow inside one review job

Every AI review job is the same 4-step pattern. Here's peer-review as the template:

peer-review:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
      with: { fetch-depth: 0 }

    - name: Get PR diff
      env:
        GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      run: |
        gh pr diff ${{ github.event.pull_request.number }} > /tmp/pr_diff.txt
        head -c 30000 /tmp/pr_diff.txt > /tmp/pr_diff_truncated.txt

    - uses: actions/setup-python@v6
      with: { python-version: '3.11' }

    - name: Run Peer Review
      env:
        GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        PR_TITLE: ${{ github.event.pull_request.title }}
        PR_BODY: ${{ github.event.pull_request.body }}
        PR_NUMBER: ${{ github.event.pull_request.number }}
      run: python3 .github/scripts/review_agent.py peer

    - name: Post Peer Review Comment
      env:
        GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      run: |
        gh pr comment ${{ github.event.pull_request.number }} \
          --repo ${{ github.repository }} \
          --body-file /tmp/review_peer.md

Step by step:

Checkout: full clone (fetch-depth: 0). Honestly overkill, since we use gh pr diff, which hits the GitHub API rather than local git. A shallow clone would be fine.
Get the diff: gh pr diff <num> writes the diff to disk, then head -c 30000 caps it at 30,000 bytes to keep the prompt bounded (the script trims it again to 25,000 characters).
Run the Python agent: picks the right persona prompt, posts to Gemini, writes a markdown file.
Post the comment: gh pr comment --body-file drops the markdown onto the PR.

The four reviewer jobs are nearly identical — only the persona argument changes (peer, lead, architect, security).
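One caveat with head -c: it cuts at a byte offset, so the diff can end mid-line or even mid-character. A line-aware alternative could live in the Python script instead. The sketch below is an assumption about how that might look; truncate_diff and the marker text are hypothetical and not part of the original script.

```python
def truncate_diff(diff_text: str, limit: int = 25000) -> str:
    """Trim a diff to at most `limit` characters, cutting at a line boundary.

    Assumes `limit` is comfortably larger than the truncation marker.
    """
    if len(diff_text) <= limit:
        return diff_text
    marker = "\n[diff truncated]\n"
    budget = max(limit - len(marker), 0)
    # Cut at the last newline that still fits inside the budget.
    cut = diff_text.rfind("\n", 0, budget)
    if cut == -1:
        # A single very long line: fall back to a hard cut.
        cut = budget
    return diff_text[:cut] + marker
```

Telling the model that the diff was truncated also gives it a chance to say so in its review rather than commenting on a half-visible hunk.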

The Python script — one file, four personas

PROMPTS = {
    "peer": {
        "role": "Senior Developer (Peer Reviewer)",
        "emoji": "👨‍💻",
        "focus": (
            "You are a senior developer doing a peer code review.\n"
            "Focus on:\n"
            "1. Code correctness — logic bugs, off-by-one, null checks\n"
            "2. Code style — naming, readability, comments\n"
            "3. DRY violations\n"
            "4. Error handling\n"
            "5. Edge cases\n"
            "Rate: APPROVE, REQUEST_CHANGES, or COMMENT."
        ),
    },
    "lead":      { "role": "Engineering Lead",          "focus": "..." },
    "architect": { "role": "Software Architect",        "focus": "..." },
    "security":  { "role": "Chief Security Officer",    "focus": "..." },
}

prompt = (
    f"{config['focus']}\n\n"
    f"PR #{PR_NUMBER}: {PR_TITLE}\n"
    f"Description: {PR_BODY}\n\n"
    f"Diff:\n```\n{diff_text[:25000]}\n```\n\n"
    f"Respond in markdown. Start with your verdict, then list findings."
)

url = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key={API_KEY}"
req = urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req, timeout=60) as r:
    output = json.loads(r.read())["candidates"][0]["content"]["parts"][0]["text"]

No SDK, no framework. Stdlib urllib, one POST to Gemini, write the response to /tmp/review_<level>.md. The whole script is ~120 lines.
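The excerpt leaves out how payload is built and what happens when the call fails. Below is a minimal sketch of how those pieces might fit together, assuming the usual generateContent request body and a fallback message like the one in the screenshot. The names build_payload, extract_text, call_gemini and run_review, and the injectable call_model parameter, are hypothetical; the real script may be organized differently.

```python
import json
import urllib.request

FALLBACK = "Automated review could not be completed"


def build_payload(prompt: str) -> bytes:
    """Encode the prompt as a generateContent request body."""
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return json.dumps(body).encode("utf-8")


def extract_text(response: dict) -> str:
    """Pull the first candidate's text out of a generateContent response."""
    return response["candidates"][0]["content"]["parts"][0]["text"]


def call_gemini(url: str, prompt: str) -> dict:
    """POST the prompt to the given endpoint and return the parsed JSON."""
    req = urllib.request.Request(
        url,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as r:
        return json.loads(r.read())


def run_review(level: str, prompt: str, call_model, out_dir: str = "/tmp") -> str:
    """Run one persona review and write review_<level>.md without raising."""
    try:
        text = extract_text(call_model(prompt))
    except Exception as exc:  # HTTP 429, timeout, unexpected response shape
        text = f"{FALLBACK}: {exc}"
    path = f"{out_dir}/review_{level}.md"
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path
```

In the workflow this might be called as run_review("peer", prompt, lambda p: call_gemini(url, p)). Passing the model call in as a parameter keeps the fallback path testable without network access.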

The security persona prompt is the most useful to look at, because it's the one that has actually caught things:

You are a security officer reviewing this PR for vulnerabilities.
Check for:
1. OWASP Top 10 — SQL injection, XSS, CSRF, broken auth, SSRF
2. Secrets — hardcoded API keys, tokens, passwords, connection strings
3. Input validation — unsanitized user input, path traversal
4. Authentication/Authorization — missing auth checks, privilege escalation
5. Data exposure — PII leaks in logs, overly permissive CORS
6. Dependency risk — known CVEs in added dependencies
7. Cryptography — weak algorithms, insecure random generation
For each finding, provide: severity (Critical/High/Medium/Low), location, and fix.

It once flagged a JWT_SECRET I'd inlined into a .env.example "for convenience." Worth the API spend by itself.

4. The flow inside the test-execution job

The test job has a different shape — it actually runs code instead of asking an LLM about code:

test-execution:
  runs-on: ubuntu-latest  # assumed; service containers require a Linux runner
  services:
    postgres:
      image: postgres:14
      env:
        POSTGRES_USER: test
        POSTGRES_PASSWORD: test
        POSTGRES_DB: kheti_sahayak_test
      ports: ['5432:5432']
      options: >-
        --health-cmd pg_isready
        --health-interval 10s
        --health-timeout 5s
        --health-retries 5

  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-node@v4
      with: { node-version: '20.x' }
    - uses: subosito/flutter-action@v2
      with: { flutter-version: '3.29.0', channel: 'stable', cache: true }

    - name: Run Backend Tests
      continue-on-error: true
      working-directory: kheti_sahayak_backend
      env:
        DATABASE_URL: postgresql://test:test@localhost:5432/kheti_sahayak_test
      run: |
        npm install
        npm test -- --coverage --json --outputFile=/tmp/backend_test_results.json

Two things worth calling out:

Real Postgres in a service container — no SQLite-in-memory shortcut. The backend tests hit a real DB, the same major version as prod.
continue-on-error: true on each suite — if Jest fails, Flutter tests still run. The job ends with a markdown report that shows ✅/❌ per suite and gets posted as a comment.
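The per-suite report line could be built from the JSON file Jest writes. The sketch below assumes the top-level success, numPassedTests, numFailedTests and numTotalTests fields of Jest's --json output. summarize_jest and summarize_file are hypothetical helpers, and the real report format may differ.

```python
import json


def summarize_jest(results: dict, suite: str = "Backend (Jest)") -> str:
    """Render one markdown line for a Jest --json results object."""
    icon = "✅" if results.get("success") else "❌"
    passed = results.get("numPassedTests", 0)
    failed = results.get("numFailedTests", 0)
    total = results.get("numTotalTests", 0)
    return f"{icon} **{suite}**: {passed}/{total} passed, {failed} failed"


def summarize_file(path: str, suite: str = "Backend (Jest)") -> str:
    """Same as above, but tolerate a missing or unreadable results file."""
    try:
        with open(path, encoding="utf-8") as f:
            return summarize_jest(json.load(f), suite)
    except (OSError, ValueError):
        # ValueError also covers json.JSONDecodeError.
        return f"❌ **{suite}**: no results file produced"
```

Handling the missing-file case matters here because continue-on-error: true lets the job carry on even when npm test crashes before writing any output.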

5. The gotchas (and the 429 in that screenshot)

Now the part I'd actually change.

1. The sequential L1→L2→L3 chain wastes time. Nothing in the lead prompt actually uses peer-review output; each reviewer reads the same diff independently. Running all four review jobs in parallel would cut the review chain's wall time roughly 3× (three back-to-back reviews collapse into one slot), although test-execution would still set the end-to-end time. The chain made sense when I imagined later reviewers reading earlier reviewers' comments. They don't.

2. review-gate doesn't actually gate anything. It just echoes results. Branch protection has to do the real enforcement. Easy to miss.

3. Every job re-runs gh pr diff. Cheap but wasteful. A single setup job that uploads the diff as an artifact would be cleaner.

4. fetch-depth: 0 on every job is overkill. gh pr diff is an API call.

5. The 429 in the screenshot up top. Four jobs hitting Gemini within ~30 seconds of each other, free tier, no backoff. The Python script catches the exception and writes the "Automated review could not be completed" message you see in the screenshot — so the workflow stays green, but the PR ends up with four useless comments. Two fixes I'm considering: add a small retry-with-jittered-backoff in the script, and stagger the jobs (security can wait 60s; it's not on the critical path).
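The retry idea from the last item might look like the sketch below: exponential backoff with random jitter, retrying only on HTTP 429. The function name, attempt count and delays are assumptions to tune against the actual rate limits, not values from the original script.

```python
import random
import time
import urllib.error


def retry_on_429(call, attempts: int = 4, base_delay: float = 2.0,
                 sleep=time.sleep, jitter=random.random):
    """Call `call()`; on HTTP 429, wait with exponential backoff plus jitter.

    Any other HTTP error, or a 429 on the final attempt, is re-raised.
    """
    for attempt in range(attempts):
        try:
            return call()
        except urllib.error.HTTPError as err:
            last_try = attempt == attempts - 1
            if err.code != 429 or last_try:
                raise
            # Base delays of 2s, 4s, 8s, each stretched by up to 100% jitter
            # so parallel jobs do not all retry at the same instant.
            delay = base_delay * (2 ** attempt) * (1 + jitter())
            sleep(delay)
```

The jitter is what helps when several jobs start together: without it, all of them would back off by the same amount and collide again on the retry.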

Why I built it this way

I wanted four opinions, not one synthesized one. When a real team reviews a PR, the security person notices different things than the architect. Combining the personas into one prompt blurs that. Splitting them into separate jobs costs me a few extra API calls but gives me four distinct comment threads on the PR — which is exactly the shape a human reviewer cluster would have.

Total cost: roughly $0.01 per PR at Gemini 2.0 Flash prices, plus GitHub Actions minutes (free for me on a personal repo). Cheaper than coffee, more thorough than me reviewing my own PR at 11pm.

The files, in full

If you want to drop this into your own repo:

Put the workflow at .github/workflows/code-review-agents.yml
Put the script at .github/scripts/review_agent.py
Add GEMINI_API_KEY to your repo's Actions secrets
Open a PR

That's the whole setup. No external service, no webhook, no infrastructure.

If you build something on top of this, I'd love to hear what you change — especially how you handle the rate-limit problem and whether you ever managed to get one reviewer to actually read another's comments.

Top comments (1)

foxck016077 • May 18

The closing line — "whether you ever managed to get one reviewer to actually read another's comments" — is the single hardest problem with persona-based review fan-out and most write-ups skip it because the answer is uncomfortable. Right now your four reviewers are siblings, not a thread. The Senior Developer and the Software Architect each post their independent take based on the diff alone, neither sees the other's comment, and the PR conversation never actually compounds. The result reads like a meeting where everyone gives their opinion in parallel and nobody responds to anyone.

Two questions from someone running a different Python-on-Actions workflow (single Apify actor build instead of a multi-reviewer PR loop):

On the 429 fallback specifically — did you find Gemini's free-tier rate limit triggers per-second or per-minute when you fan out four concurrent persona calls? The reproducer image suggests all four hit at once, which would explain a clean 429 on every persona simultaneously. If you stagger by 5-10s does the fallback still trigger?
On making the personas actually read each other — a hacky but cheap approach: serialize the fan-out, pass each persona the previous persona's review as additional context. Cost is 4x more tokens on the later personas, but the Software Architect getting to actually respond to the Senior Developer's "this is over-engineered" comment is a much more useful PR review than four parallel monologues. Has anyone in the agentic-workflow space tried this and decided the token cost wasn't worth it?

The two-file simplicity is the right move. Once you add Lambda or a queue the operational tax kills the iteration speed.