i spent the last few days building an open-source AI code reviewer called Basira. one of the hardest design problems was figuring out how to feed entire github repos to an LLM without blowing past the context window or burning the budget. here's what i landed on.
The Problem
a medium repo is 50-200 files, 5-50k lines. claude sonnet has a 200k token context window, but stuffing the whole repo in is wasteful: most files don't need review at the same time, and the model loses focus on a wall of unrelated code.
Naive Approaches That Don't Work
One file per call: explodes API costs and loses cross-file context. an issue in auth.py might depend on a model defined in users.py.
Whole repo in one call: hits context limits on anything past a few thousand files, and quality drops as the model can't focus on what matters.
Random chunks: breaks logical units. you get half a class or half a function reviewed.
Three-Pass Chunking
Pass 1: Inventory
walk the repo, build a file tree with sizes and language. skip binaries, lockfiles, generated code, vendored deps. apply user-configured ignore patterns. no LLM calls in this pass, it's cheap.
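from pathlib import Path

def inventory_repo(repo_path: Path) -> list[FileEntry]:
    """Walk the repo and record size, language, and a token estimate per file."""
    entries = []
    for path in repo_path.rglob("*"):
        # rglob also yields directories; only regular files can be tokenized
        if not path.is_file() or should_skip(path):
            continue
        entries.append(FileEntry(
            path=path,
            size=path.stat().st_size,
            language=detect_language(path),
            tokens=estimate_tokens(path),
        ))
    return entries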
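the helpers called above aren't shown in this post, so here's a rough sketch of what should_skip and estimate_tokens could look like. the skip lists, the ignore_patterns parameter, and the 4-characters-per-token ratio are all assumptions for illustration, not the actual Basira code.

```python
from pathlib import Path

# Hypothetical skip rules; the real lists in Basira may differ.
SKIP_DIRS = {".git", "node_modules", "vendor", "dist", "build", "__pycache__"}
SKIP_NAMES = {"package-lock.json", "yarn.lock", "poetry.lock", "go.sum", "Cargo.lock"}
BINARY_SUFFIXES = {".png", ".jpg", ".gif", ".pdf", ".zip", ".so", ".exe", ".woff"}


def should_skip(path: Path, ignore_patterns: tuple[str, ...] = ()) -> bool:
    """Return True for paths the reviewer should never see."""
    if any(part in SKIP_DIRS for part in path.parts):
        return True
    if path.name in SKIP_NAMES or path.suffix.lower() in BINARY_SUFFIXES:
        return True
    # User-configured glob patterns, e.g. "*.min.js"
    return any(path.match(pattern) for pattern in ignore_patterns)


def estimate_tokens(path: Path) -> int:
    """Rough estimate assuming ~4 characters per token (not a real tokenizer)."""
    text = path.read_text(encoding="utf-8", errors="ignore")
    return max(1, len(text) // 4)
```

a character-count estimate is deliberately crude: it keeps this pass free of API calls, and the chunk budget only needs to be approximately right.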
Pass 2: Grouping
bin files into chunks of ~8k tokens each, but keep related files together. files in the same directory tend to depend on each other, so they go in the same chunk. tests follow their source file when possible.
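a minimal sketch of directory-first packing, assuming a simplified FileEntry with just a path and a token count. group_into_chunks is a hypothetical name, and the test-follows-source pairing is left out to keep it short.

```python
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FileEntry:
    path: Path
    tokens: int


def group_into_chunks(entries: list[FileEntry], budget: int = 8_000) -> list[list[FileEntry]]:
    """Greedily pack files into chunks of roughly `budget` tokens, one directory at a time."""
    by_dir: dict[Path, list[FileEntry]] = defaultdict(list)
    for entry in entries:
        by_dir[entry.path.parent].append(entry)

    chunks: list[list[FileEntry]] = []
    current: list[FileEntry] = []
    used = 0
    for directory in sorted(by_dir):
        for entry in sorted(by_dir[directory], key=lambda e: e.path.name):
            # Start a new chunk when this file would overflow the budget.
            # A single file larger than the budget still gets a chunk of its own.
            if current and used + entry.tokens > budget:
                chunks.append(current)
                current, used = [], 0
            current.append(entry)
            used += entry.tokens
    if current:
        chunks.append(current)
    return chunks
```

walking directories in sorted order keeps siblings adjacent, so files from the same directory only get split when the budget forces it.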
Pass 3: Review
send each chunk to claude with a structured prompt asking for findings in JSON, with severity, line numbers, and reasoning. parallelize chunks, but rate-limit them so we stay under anthropic's rate limits.
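one way the parallel review could be wired up, as a sketch. call_model is a hypothetical async function standing in for the actual API client, the prompt wording is made up, and a concurrency cap via asyncio.Semaphore is only the simplest stand-in for real rate limiting.

```python
import asyncio
import json
from typing import Awaitable, Callable

# Hypothetical prompt; the real one in Basira may look different.
PROMPT_TEMPLATE = """Project summary:
{summary}

Review the files below. Respond with only a JSON array of findings, each with
"file", "line", "severity" (critical, major, or minor) and "reasoning".

{files}"""

ModelCall = Callable[[str], Awaitable[str]]


async def review_chunks(
    chunks: list[str],
    summary: str,
    call_model: ModelCall,
    max_concurrent: int = 3,
) -> list[dict]:
    """Review every chunk with at most `max_concurrent` requests in flight."""
    limiter = asyncio.Semaphore(max_concurrent)

    async def review_one(files: str) -> list[dict]:
        prompt = PROMPT_TEMPLATE.format(summary=summary, files=files)
        async with limiter:
            raw = await call_model(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return []  # Drop unparseable replies; a real run might retry instead
        return parsed if isinstance(parsed, list) else []

    results = await asyncio.gather(*(review_one(chunk) for chunk in chunks))
    # Flatten per-chunk findings into a single list
    return [finding for chunk_findings in results for finding in chunk_findings]
```

passing the model call in as a parameter keeps the sketch testable with a fake, and the project summary rides along in every prompt so each chunk gets at least some cross-chunk context.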
Tradeoffs
chunk boundary loss: if a function in chunk A is misused in chunk B, you won't catch it. mitigated partly by including a project summary in each chunk's prompt.
token budget per chunk: 8k is a sweet spot for sonnet. smaller = more API calls = more cost. bigger = quality drops.
ordering: putting more important files first means that if the budget runs out, you've already reviewed the critical stuff. determining "important" is the hard part. i'm currently using a heuristic (entry points + recently changed files).
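to make the ordering heuristic concrete, here's a small sketch of scoring by entry point and recency. the entry-point filenames, the weights, and the 30-day decay are all invented for illustration and aren't taken from the Basira code.

```python
from dataclasses import dataclass

# Hypothetical entry-point filenames; a real list would depend on the stack.
ENTRY_POINT_NAMES = {"main.py", "app.py", "manage.py", "main.go", "index.ts", "index.js"}


@dataclass
class Candidate:
    path: str
    days_since_change: int


def importance(candidate: Candidate) -> float:
    """Higher score means review earlier."""
    score = 0.0
    name = candidate.path.rsplit("/", 1)[-1]
    if name in ENTRY_POINT_NAMES:
        score += 2.0
    # Recency bonus decays linearly from 1.0 (changed today) to 0 at 30+ days
    score += max(0.0, 1.0 - candidate.days_since_change / 30)
    return score


def order_for_review(candidates: list[Candidate]) -> list[Candidate]:
    return sorted(candidates, key=importance, reverse=True)
```

with these weights an entry point always outranks a non-entry-point, and recency only breaks ties within each group.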
Real Numbers
a scan of my own LogHunter repo (96 files, ~15k lines of python+go+react):
- 8 chunks
- 93k tokens in, 7k tokens out
- $0.39 total
- 3 min wall clock
- 65 findings (7 critical, 32 major, 26 minor)
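the cost figure can be sanity-checked from the token counts. this assumes list prices of $3 per million input tokens and $15 per million output tokens, which are assumptions here and should be checked against current pricing.

```python
# Assumed prices in USD per million tokens; verify against current pricing.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def scan_cost(tokens_in: int, tokens_out: int) -> float:
    """Estimated cost in USD for one scan."""
    return (tokens_in * INPUT_PRICE_PER_MTOK + tokens_out * OUTPUT_PRICE_PER_MTOK) / 1_000_000


print(f"${scan_cost(93_000, 7_000):.2f}")  # prints $0.38
```

that lands within a cent of the $0.39 above, and the gap is plausibly just rounding in the token counts. average input works out to about 11.6k tokens per chunk, so each request presumably carries a few thousand tokens of prompt and summary on top of the ~8k of code.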
What I Don't Know Yet
how this scales to monorepos (100k+ files). probably needs a different strategy entirely, maybe diff-based review.
whether semantic clustering (group files by what they do, not where they sit) beats directory-based grouping. would need embeddings.
if there's a way to get cross-chunk context without re-sending shared files.
Code
implementation is open source under MIT. chunking logic lives in backend/app/services/scan_engine.py. happy to discuss design decisions or take feedback.
repo: github.com/2lba/basira
if you've solved this differently i'd genuinely like to hear how.