Chunking Strategies for AI Code Review on Large Repos

#ai #programming #python #opensource

i spent the last few days building an open-source AI code reviewer called Basira. one of the hardest design problems was figuring out how to feed entire github repos to an LLM without blowing past the context window or burning the budget. here's what i landed on.

The Problem

a medium repo is 50-200 files, 5-50k lines. claude sonnet has a 200k token context window, but stuffing the whole repo in is wasteful: most files don't need review at the same time, and the model loses focus on a wall of unrelated code.

Naive Approaches That Don't Work

One file per call: explodes API costs and loses cross-file context. an issue in auth.py might depend on a model defined in users.py.
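Whole repo in one call: hits the context limit on anything much bigger than a medium repo (at the ~6 tokens per line from the scan below, 200k tokens is roughly 30k lines), and quality drops well before that because the model can't focus on what matters.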
Random chunks: breaks logical units. you get half a class or half a function reviewed.

Three-Pass Chunking

Pass 1: Inventory

walk the repo and build a file inventory with size, language, and an estimated token count per file. skip binaries, lockfiles, generated code, and vendored deps, then apply user-configured ignore patterns. there are no LLM calls in this pass, so it's cheap.

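from pathlib import Path

# FileEntry, should_skip, detect_language and estimate_tokens live elsewhere in the module
def inventory_repo(repo_path: Path) -> list[FileEntry]:
    """Pass 1: record every reviewable file in the repo. No LLM calls."""
    entries = []
    for path in repo_path.rglob("*"):
        # rglob also yields directories, so keep regular files only
        if not path.is_file() or should_skip(path):
            continue
        entries.append(FileEntry(
            path=path,
            size=path.stat().st_size,
            language=detect_language(path),
            tokens=estimate_tokens(path),
        ))
    return entries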
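
the post doesn't show the helpers that `inventory_repo` calls. here's a minimal sketch of what `should_skip` and `estimate_tokens` could look like. the skip lists are hypothetical (not Basira's actual config), and the token count uses the rough "about 4 characters per token" rule of thumb rather than a real tokenizer:

```python
from pathlib import Path

# hypothetical skip lists for illustration, not Basira's actual configuration
SKIP_DIRS = {".git", "node_modules", "vendor", "dist", "build", "__pycache__"}
SKIP_NAMES = {"package-lock.json", "yarn.lock", "poetry.lock", "go.sum", "Cargo.lock"}
SKIP_SUFFIXES = (".png", ".jpg", ".gif", ".pdf", ".zip", ".so", ".exe", ".min.js")


def should_skip(path: Path) -> bool:
    """Name-based filter. Assumes `path` is relative to the repo root, so a
    parent directory outside the repo can't accidentally match SKIP_DIRS."""
    if any(part in SKIP_DIRS for part in path.parts):
        return True
    if path.name in SKIP_NAMES:
        return True
    # str.endswith accepts a tuple, which also handles ".min.js"
    return path.name.endswith(SKIP_SUFFIXES)


def estimate_tokens(path: Path) -> int:
    """Crude estimate: roughly 4 characters per token (an assumption)."""
    text = path.read_text(encoding="utf-8", errors="ignore")
    return len(text) // 4
```

a character-count estimate is only good enough for binning. if the budget has to be exact, a real tokenizer count would replace `estimate_tokens`.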

Pass 2: Grouping

bin files into chunks of ~8k tokens each, but keep related files together. files in the same directory tend to depend on each other, so they go in the same chunk. tests follow their source file when possible.
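
a minimal sketch of the grouping pass, assuming a greedy directory-first binning. `group_into_chunks` and the trimmed-down `FileEntry` here are illustrative names, not necessarily what the repo uses. it only pairs tests with sources that sit in the same directory (by stripping a `test_` prefix when sorting), and a single file larger than the budget gets a chunk to itself:

```python
from collections import defaultdict
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class FileEntry:
    path: PurePosixPath
    tokens: int


def group_into_chunks(entries: list[FileEntry], budget: int = 8_000) -> list[list[FileEntry]]:
    """Walk directories in sorted order and fill chunks greedily, starting a
    new chunk whenever the next file would push the current one over budget."""
    by_dir: dict[PurePosixPath, list[FileEntry]] = defaultdict(list)
    for entry in entries:
        by_dir[entry.path.parent].append(entry)

    chunks: list[list[FileEntry]] = []
    current: list[FileEntry] = []
    used = 0
    for directory in sorted(by_dir):
        # sort so that test_auth.py lands right after auth.py
        files = sorted(
            by_dir[directory],
            key=lambda e: (e.path.name.removeprefix("test_"), e.path.name),
        )
        for entry in files:
            if current and used + entry.tokens > budget:
                chunks.append(current)
                current, used = [], 0
            current.append(entry)
            used += entry.tokens
    if current:
        chunks.append(current)
    return chunks
```

because directories are visited in sorted order, small neighbouring directories end up sharing a chunk, while a large directory spills across several.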

Pass 3: Review

send each chunk to claude with a structured prompt asking for findings as JSON, each with a severity, line numbers, and reasoning. chunks run in parallel, but with a cap on concurrency so we don't hit anthropic's rate limits.
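
one way to sketch the "parallel but rate-limited" part is an `asyncio.Semaphore` around the model call. `call_model` is a hypothetical async function (prompt in, raw reply text out) standing in for the real API client, and the findings are assumed to come back as a JSON array. note that a semaphore caps calls in flight, not requests per minute, so it's only a rough stand-in for a proper rate limiter:

```python
import asyncio
import json
from typing import Awaitable, Callable

# hypothetical signature: takes a prompt, returns the model's raw text reply
CallModel = Callable[[str], Awaitable[str]]


async def review_chunks(
    prompts: list[str],
    call_model: CallModel,
    max_concurrent: int = 4,
) -> list[list[dict]]:
    """Review all chunks concurrently with at most `max_concurrent` calls in
    flight. Returns one list of findings per prompt, in the same order."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def review_one(prompt: str) -> list[dict]:
        async with semaphore:
            raw = await call_model(prompt)
        try:
            findings = json.loads(raw)
        except json.JSONDecodeError:
            return []  # a real reviewer would probably retry or log this
        # anything other than a JSON array is treated as "no findings"
        return findings if isinstance(findings, list) else []

    return await asyncio.gather(*(review_one(p) for p in prompts))
```

usage would be something like `asyncio.run(review_chunks(prompts, call_model))`, with `call_model` wrapping whichever client library the project uses.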

Tradeoffs

chunk boundary loss: if a function defined in chunk A is misused in chunk B, the reviewer won't catch it, because it never sees both at once. partly mitigated by including a project summary in each chunk's prompt.
token budget per chunk: 8k is a sweet spot for sonnet. smaller = more API calls = more cost. bigger = quality drops.
ordering: putting more important files first means that if the budget runs out, the critical stuff has already been reviewed. determining "important" is the hard part. currently it's a heuristic (entry points + recently changed files).
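
to make the ordering heuristic concrete, here's a rough sketch that scores entry points and recently changed files. the filename list, the weights, and the 30-day decay are all made up for illustration, and `days_since_change` is assumed to be computed elsewhere (for example from git history):

```python
from dataclasses import dataclass

# hypothetical entry-point filenames; a real project would tune this list
ENTRY_POINT_NAMES = {"main.py", "app.py", "__main__.py", "main.go", "index.ts", "index.js"}


@dataclass
class Candidate:
    path: str                 # repo-relative path with "/" separators
    days_since_change: float  # assumed to come from git history


def priority(candidate: Candidate) -> float:
    """Higher score = review earlier. Weights are arbitrary placeholders."""
    score = 0.0
    filename = candidate.path.rsplit("/", 1)[-1]
    if filename in ENTRY_POINT_NAMES:
        score += 10.0
    # recency bonus: 5 points for a file changed today, decaying with age
    score += 5.0 / (1.0 + candidate.days_since_change / 30.0)
    return score


def order_by_priority(candidates: list[Candidate]) -> list[Candidate]:
    return sorted(candidates, key=priority, reverse=True)
```

with these weights an entry point always outranks a non-entry-point, and recency only breaks ties within each group. whether that's the right balance is exactly the open question.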

Real Numbers

a scan of my own LogHunter repo (96 files, ~15k lines of python+go+react):

8 chunks
93k tokens in, 7k tokens out
$0.39 total
3 min wall clock
65 findings (7 critical, 32 major, 26 minor)
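
as a sanity check on the cost line, here's the arithmetic, assuming sonnet list pricing of $3 per million input tokens and $15 per million output tokens (an assumption, so check current pricing before relying on it):

```python
def estimate_cost(
    tokens_in: int,
    tokens_out: int,
    usd_per_m_in: float = 3.00,    # assumed input price per million tokens
    usd_per_m_out: float = 15.00,  # assumed output price per million tokens
) -> float:
    """Cost in USD for a scan, given total input and output tokens."""
    return tokens_in / 1_000_000 * usd_per_m_in + tokens_out / 1_000_000 * usd_per_m_out


cost = estimate_cost(93_000, 7_000)
print(f"${cost:.3f}")  # 0.279 for input + 0.105 for output
```

the rounded token counts give about $0.38, which lines up with the $0.39 reported above once rounding is accounted for.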

What I Don't Know Yet

how this scales to monorepos (100k+ files). probably needs a different strategy entirely, maybe diff-based review.
whether semantic clustering (group files by what they do, not where they sit) beats directory-based grouping. would need embeddings.
if there's a way to get cross-chunk context without re-sending shared files.

Code

implementation is open source under MIT. chunking logic lives in backend/app/services/scan_engine.py. happy to discuss design decisions or take feedback.

repo: github.com/2lba/basira

if you've solved this differently i'd genuinely like to hear how.