---
title: "Agentic Code Review with Qwen3.6-35B-A3B on Your Local GPU"
published: true
description: "A hands-on guide to self-hosting Qwen3.6-35B-A3B for agentic code review in CI — covering quantization, serving, constrained decoding, and GitHub Actions integration at zero API cost."
tags: devops, architecture, cicd, performance
canonical_url: https://blog.mvpfactory.co/agentic-code-review-qwen3-6-35b-a3b-local-gpu
---
## What We're Building
Let me show you how to run a full agentic code review gate on your own hardware using the Qwen3.6-35B-A3B mixture-of-experts model. By the end of this tutorial, you will have a self-hosted GitHub Actions workflow that reviews every PR diff with a locally served LLM, outputs structured JSON verdicts, and gates merges — all at zero API cost. The model ships under Apache 2.0, so commercial CI use is fine.
## Prerequisites
- A workstation GPU with 16-32 GB VRAM (RTX 4090, A5000, or A6000 Ada)
- `llama.cpp` compiled with CUDA support (or a vLLM installation)
- A GitHub repository with Actions enabled and a self-hosted runner configured
- Familiarity with GGUF quantization formats and basic CI/CD concepts
## Step 1: Understand Why This Works Now
The real blocker to AI-assisted code review was never model quality. It was cost predictability and data sovereignty. Sending every diff to a cloud API at $3–15 per million tokens adds up fast when your team pushes 50+ PRs a day, and plenty of organizations flat-out cannot send proprietary code to third-party endpoints.
Qwen3.6-35B-A3B makes self-hosting realistic. As a mixture-of-experts architecture, it activates only ~3B of its 35B parameters per forward pass, so inference fits on hardware that would choke on a dense 35B model. The model was built for agentic coding workflows — tool calling, structured output, multi-step reasoning — exactly what a CI review gate needs.
## Step 2: Pick Your Quantization
Here is the gotcha that will save you hours: teams default to Q4_K_M without benchmarking whether the quality drop matters for their use case. Worse, they forget that VRAM consumption isn't just model weights. KV cache overhead adds 2–6 GB depending on context length, and that will push you over the edge on boundary hardware.
These estimates assume a 4K-token context window. If you plan to feed full PR diffs at 8K–16K tokens, add 3–6 GB to the VRAM figures.
| Quantization | Model Size | VRAM (weights + KV @ 4K ctx) | Quality Impact | Best For |
|---|---|---|---|---|
| Q5_K_S | ~24 GB | ~28–30 GB | Minimal degradation | Code review where precision matters |
| Q4_K_M | ~20 GB | ~24–26 GB | Slight degradation on nuanced reasoning | General refactoring suggestions, linting |
| Q3_K_M | ~16 GB | ~20–22 GB | Noticeable quality loss | Rough triage, classification only |
A 24 GB card (RTX 4090, A5000) is tight for Q5_K_S once KV cache is factored in. You will likely need to cap context length or drop to Q4_K_M. With 32 GB (A6000 Ada), Q5_K_S at 8K context is comfortable. On a 16 GB card, Q4_K_M only works at short context windows.
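You can sanity-check these numbers yourself with the standard KV-cache formula. A minimal sketch — the architecture values below (layer count, KV heads, head dimension) are placeholders, not published Qwen3.6-35B-A3B specs; substitute the real values from the model's `config.json`:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size = 2 (K and V) * layers * kv_heads * head_dim * tokens * element size."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem / 1024**3

# Hypothetical architecture numbers -- read the real ones from the model's
# config.json before trusting the result.
print(kv_cache_gb(n_layers=48, n_kv_heads=8, head_dim=128, ctx_tokens=8192))  # → 1.5
```

Grouped-query attention (fewer KV heads than attention heads) is why the cache can stay modest; a model without it at the same depth would multiply that figure several times over.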
Practical note on context budget: truncate or chunk large diffs to stay within your VRAM budget. A 500-line diff runs roughly 4K–6K tokens. For larger PRs, split the diff by file and review in batches. The model handles focused, single-file context better anyway.
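The per-file chunking above is a few lines of Python. A sketch, assuming standard `git diff` output and the rough 4-characters-per-token heuristic (function names are mine, not from any library):

```python
import re

def split_diff_by_file(diff_text: str) -> dict[str, str]:
    """Split a unified git diff into per-file chunks, keyed by the post-change path."""
    chunks: dict[str, str] = {}
    current_path, current_lines = None, []
    for line in diff_text.splitlines():
        m = re.match(r'^diff --git a/(\S+) b/(\S+)$', line)
        if m:
            if current_path is not None:
                chunks[current_path] = "\n".join(current_lines)
            current_path, current_lines = m.group(2), []
        current_lines.append(line)
    if current_path is not None:
        chunks[current_path] = "\n".join(current_lines)
    return chunks

def estimate_tokens(text: str) -> int:
    """Crude heuristic: ~4 characters per token for code-heavy text."""
    return len(text) // 4
```

Feed each chunk through the review endpoint separately and merge the findings arrays afterward; you also get cleaner per-file attribution in the output.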
## Step 3: Choose Your Serving Engine
This decision comes down to concurrency.
| Factor | vLLM | llama.cpp (llama-server) |
|---|---|---|
| Throughput (concurrent) | High, continuous batching, PagedAttention | Lower, single-sequence optimized |
| Setup complexity | Requires Python env, CUDA toolkit | Single binary, minimal dependencies |
| Quantization support | GPTQ, AWQ, FP8 | GGUF (Q2–Q8, imatrix) |
| Structured output | Via outlines / guided decoding | Via GBNF grammars |
| Ideal for | Shared team server, multiple PRs queued | Single-runner, sequential review |
For a self-hosted GitHub Actions runner processing one PR at a time, llama.cpp's simplicity wins. If you are building a centralized review service behind an API that multiple repos hit, vLLM's continuous batching justifies the extra setup.
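For the llama.cpp path, launching the server is a single command. A sketch — the GGUF filename is hypothetical, and you should check your build's `--help` for the exact flag set:

```shell
# -m: model path (hypothetical filename), -c: context window (keep within your
# VRAM budget), -ngl 99: offload all layers to the GPU.
./llama-server \
  -m models/qwen3.6-35b-a3b-Q4_K_M.gguf \
  -c 8192 \
  -ngl 99 \
  --host 127.0.0.1 --port 8080
```

Binding to `127.0.0.1` keeps the endpoint local to the runner; widen it only if you deliberately want a shared team server, and put auth in front of it if you do.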
## Step 4: Enforce Structured Output with Constrained Decoding
The docs do not mention this, but the piece that makes this actually work in CI is constrained decoding. You need JSON conforming to a schema so your CI script can programmatically extract verdicts, file paths, and suggested diffs.
With llama.cpp, you do this via GBNF grammars. Here is a minimal review verdict schema:
```json
{
  "verdict": "approve | request_changes | comment",
  "findings": [
    {
      "file": "src/queue.js",
      "line": 42,
      "severity": "warning",
      "message": "Unbounded queue growth — consider a max-size with backpressure."
    }
  ]
}
```
Pass the corresponding GBNF grammar to the server's `--grammar` flag or per-request via the `grammar` field in the completions API. This guarantees every response is valid JSON matching your schema. No regex post-processing, no retry loops.
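For reference, a GBNF grammar matching that schema might look like the sketch below. This is a deliberately simplified version: the `string` rule rejects escaped quotes and newlines, and the `info`/`error` severity levels are my assumption alongside the article's `warning` — adapt both to your needs:

```gbnf
# Simplified sketch -- strings here don't allow escaped quotes or newlines.
root     ::= "{" ws "\"verdict\"" ws ":" ws verdict ws "," ws "\"findings\"" ws ":" ws findings ws "}"
verdict  ::= "\"approve\"" | "\"request_changes\"" | "\"comment\""
findings ::= "[" ws (finding (ws "," ws finding)*)? ws "]"
finding  ::= "{" ws "\"file\"" ws ":" ws string ws "," ws
             "\"line\"" ws ":" ws number ws "," ws
             "\"severity\"" ws ":" ws severity ws "," ws
             "\"message\"" ws ":" ws string ws "}"
severity ::= "\"info\"" | "\"warning\"" | "\"error\""
string   ::= "\"" [^"\n]* "\""
number   ::= [0-9]+
ws       ::= [ \t\n]*
```

Save it as `review-schema.gbnf` so the workflow in the next step can pick it up.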
## Step 5: Wire It Into GitHub Actions
Pay close attention here. Shell-interpolating raw diff content into a JSON heredoc will break on quotes, backslashes, and newlines — and it is a command-injection vector. Use `jq` to safely encode the diff as a JSON string. Don't skip this.
```yaml
# .github/workflows/ai-review.yml
name: AI code review
on: pull_request

jobs:
  code-review:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Generate diff
        run: git diff origin/main...HEAD > /tmp/pr.diff
      - name: Run AI review
        run: |
          jq -n \
            --arg diff "$(cat /tmp/pr.diff)" \
            --arg grammar "$(cat review-schema.gbnf)" \
            '{
              model: "qwen3.6-35b-a3b",
              messages: [
                {role: "system", content: "You are a code reviewer. Output JSON only."},
                {role: "user", content: $diff}
              ],
              grammar: $grammar
            }' | \
          curl -s http://localhost:8080/v1/chat/completions \
            -H "Content-Type: application/json" \
            -d @- | \
          jq '.choices[0].message.content | fromjson' > review.json
      - name: Gate on verdict
        run: |
          verdict=$(jq -r '.verdict' review.json)
          if [ "$verdict" = "request_changes" ]; then exit 1; fi
```
By using `jq -n --arg`, the diff content is properly escaped into valid JSON regardless of what characters appear in the source code. This runs entirely on your hardware. Zero tokens billed. Full control over the model, the prompt, and the review criteria.
## Gotchas
- **VRAM math must include KV cache.** The model weights alone fit, but at 8K+ context your KV cache can add 3–6 GB. Benchmark with representative diffs before committing to a quantization level, because synthetic benchmarks won't tell you how it handles your codebase's idioms.
- **Freeform text output in CI is a reliability problem.** Enforce structured output from day one with GBNF grammars or guided decoding. One malformed response breaks your gate, and you will not notice until a PR is blocked at 2 AM.
- **Don't ship a blocking gate on day one.** Start with the reviewer as advisory, not authoritative. Wire it as a non-blocking check (`continue-on-error: true`), watch its findings for a few weeks, then tighten to a blocking gate once you have calibrated the prompt and thresholds against your actual code. I've seen teams skip this step and burn trust with developers by shipping a gate that flags nonsense on day one.
- **Q4_K_M on a 24 GB card is the practical sweet spot for most teams.** Only go Q5_K_S if you have 32+ GB or can keep context under 4K tokens.
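Even with grammar-constrained decoding, it is cheap to defend the gate script against a malformed response. A minimal validation sketch in Python — the `info`/`error` severity levels are my assumption beyond the article's `warning`; adjust the sets to your schema:

```python
import json

ALLOWED_VERDICTS = {"approve", "request_changes", "comment"}
ALLOWED_SEVERITIES = {"info", "warning", "error"}  # assumed levels; adapt to your schema

def validate_review(raw: str) -> dict:
    """Parse a review verdict and raise ValueError on any schema deviation."""
    data = json.loads(raw)
    if data.get("verdict") not in ALLOWED_VERDICTS:
        raise ValueError(f"bad verdict: {data.get('verdict')!r}")
    for finding in data.get("findings", []):
        if not isinstance(finding.get("file"), str) or not isinstance(finding.get("line"), int):
            raise ValueError(f"malformed finding: {finding!r}")
        if finding.get("severity") not in ALLOWED_SEVERITIES:
            raise ValueError(f"bad severity: {finding.get('severity')!r}")
    return data
```

Run this between the curl step and the gate step; a validation failure should fail the job loudly rather than silently approving the PR.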
## Wrapping Up
You now have a pattern I use in every project that needs AI-assisted review without cloud dependencies: a locally quantized MoE model, constrained to emit structured JSON, wired directly into your CI pipeline. The entire stack — model, serving, and integration — runs on a single workstation GPU. Start advisory, calibrate your prompts, then promote to a blocking gate when you trust the output. That is how you ship this responsibly.