The expensive part of selective doc fetching isn't the files — it's the commits

#ai #git #documentation #opensource

In my last post I argued for resolving before you fetch: use the doc manifest to figure out which 50 files matter before you pull 200,000. That solves the obvious waste. But once we put it into production across dozens of large repos — working closely with the Azure DevOps and content-version provider teams — we found the real cost was hiding somewhere I didn't expect.

It wasn't the file content. It was the commit data.

The surprise: walking commit history is the bottleneck

When you ask a git provider "what changed, and at what version?", you're not making one cheap call. Commit data is nested: commits reference parent commits, each commit points to a root tree, and trees reference subtrees, so a naive walk fans out into a deep recursive traversal. For a build that needs 50 markdown files, we were spending most of our time not downloading those files, but resolving the commit graph around them.

The content fetch was the easy part. The version resolution was the tax.
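To make the fan-out concrete, here is a minimal sketch in Python. The object store, the IDs, and the `naive_walk` helper are all hypothetical stand-ins, and each dictionary lookup represents one round trip to a provider. It is a toy model of the traversal, not our production code.

```python
from typing import Dict, List, Tuple

# Toy object store (hypothetical): tree id -> list of (name, kind, object id).
# Each lookup in this dict stands in for one provider round trip.
TREES: Dict[str, List[Tuple[str, str, str]]] = {
    "t_root": [("docs", "tree", "t_docs"), ("src", "tree", "t_src")],
    "t_docs": [("intro.md", "blob", "b1"), ("guide", "tree", "t_guide")],
    "t_guide": [("setup.md", "blob", "b2")],
    "t_src": [("app", "tree", "t_app"), ("lib", "tree", "t_lib")],
    "t_app": [("main.py", "blob", "b3")],
    "t_lib": [("util.py", "blob", "b4")],
}


def naive_walk(tree_id: str, prefix: str = "") -> Tuple[Dict[str, str], int]:
    """Return ({path: blob id}, number of tree lookups) for a full recursive walk."""
    calls = 1  # one lookup for this tree
    files: Dict[str, str] = {}
    for name, kind, oid in TREES[tree_id]:
        path = f"{prefix}{name}"
        if kind == "tree":
            sub_files, sub_calls = naive_walk(oid, path + "/")
            files.update(sub_files)
            calls += sub_calls
        else:
            files[path] = oid
    return files, calls


files, calls = naive_walk("t_root")
docs = {p: b for p, b in files.items() if p.endswith(".md")}
print(f"{calls} tree lookups to locate {len(docs)} docs")  # 6 tree lookups to locate 2 docs
```

Even in this tiny tree, every directory gets visited to find two markdown files. Scale the same shape to 200,000 files and the lookups dominate.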

The fix: cache the metadata, not the content

The instinct is to cache file content. We did the opposite.

Commit data has one beautiful property: it's immutable. A commit SHA points to exactly one tree, forever. So instead of caching the bytes of intro.md (the content at that path changes from commit to commit, and we want it fresh), we cache the commit metadata: the tree structure, the file-to-blob mappings, the version graph.

The pattern looks like this:

One full recursive fetch, once. Pay the deep-walk cost a single time per repo and store the resulting commit/tree metadata.
Build incrementally with diffs. After that, never walk the full history again. Ask only for the commit diff since the last known SHA, and patch your cached metadata forward.

An expensive recursive traversal becomes a cheap incremental update. And because the cached layer is immutable metadata, there's no invalidation headache — you're only ever appending knowledge of new commits, never invalidating old ones.
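The two steps above can be sketched roughly as follows. This is an illustration under assumptions, not our actual implementation: `RepoMetadata`, `apply_diff`, and the `(status, path, blob)` diff shape are hypothetical names, and a rename is modeled as a delete plus an add.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical diff entry: (status, path, new blob id).
# Status is "A" (added), "M" (modified) or "D" (deleted).
DiffEntry = Tuple[str, str, Optional[str]]


@dataclass
class RepoMetadata:
    head_sha: str
    path_to_blob: Dict[str, str] = field(default_factory=dict)


def apply_diff(meta: RepoMetadata, new_sha: str, diff: List[DiffEntry]) -> RepoMetadata:
    """Build metadata for new_sha from the previous snapshot plus a commit diff.

    The previous snapshot is copied, never mutated, so old SHAs stay valid.
    """
    paths = dict(meta.path_to_blob)
    for status, path, blob in diff:
        if status == "D":
            paths.pop(path, None)
        elif blob is None:
            raise ValueError(f"{status} entry for {path} needs a blob id")
        else:
            paths[path] = blob
    return RepoMetadata(new_sha, paths)


# Step 1: the result of the one-time full walk (values are made up).
snapshots: Dict[str, RepoMetadata] = {}
base = RepoMetadata("sha1", {"docs/intro.md": "b1", "docs/old.md": "b2"})
snapshots[base.head_sha] = base

# Step 2: patch forward with the diff since the last known SHA.
diff = [("M", "docs/intro.md", "b9"), ("D", "docs/old.md", None), ("A", "docs/new.md", "b7")]
latest = apply_diff(base, "sha2", diff)
snapshots[latest.head_sha] = latest  # append-only: sha1 is still there
```

Keying snapshots by SHA is what keeps the cache append-only: a new commit adds an entry, and nothing already stored ever needs to be rewritten.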

The unexpected bonus: honest freshness

This caching layer turned out to do more than save time. It made freshness honest.

If you're feeding docs into a RAG pipeline or an agent's knowledge base, the scariest failure mode is an answer sourced from a stale or irrelevant file. The commit diff is the cleanest possible signal for this: it tells you precisely which docs were modified, moved, added, or deleted, at the version level. You stop re-indexing on a blind schedule and start re-indexing on actual change. "Is this doc still current?" becomes a cheap lookup against {repo, commitSha, path} instead of a re-fetch.
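As a rough sketch of that lookup, assume the cached metadata is stored as a mapping from (repo, commit SHA) to a path-to-blob table, plus the latest known SHA per repo. Both shapes and the `is_current` helper are hypothetical. A doc is current when the blob it was indexed from is still the blob at that path at the latest commit.

```python
from typing import Dict, Tuple

# Hypothetical cache shape: (repo, commit SHA) -> {path: blob id}.
Snapshots = Dict[Tuple[str, str], Dict[str, str]]


def is_current(repo: str, commit_sha: str, path: str,
               snapshots: Snapshots, heads: Dict[str, str]) -> bool:
    """True if path has the same blob at the indexed commit and at the latest one."""
    indexed_blob = snapshots[(repo, commit_sha)].get(path)
    latest_blob = snapshots[(repo, heads[repo])].get(path)
    return indexed_blob is not None and indexed_blob == latest_blob


snapshots: Snapshots = {
    ("docs-repo", "sha1"): {"docs/intro.md": "b1", "docs/setup.md": "b2", "docs/old.md": "b3"},
    ("docs-repo", "sha2"): {"docs/intro.md": "b1", "docs/setup.md": "b9"},
}
heads = {"docs-repo": "sha2"}

unchanged = is_current("docs-repo", "sha1", "docs/intro.md", snapshots, heads)  # True
modified = is_current("docs-repo", "sha1", "docs/setup.md", snapshots, heads)   # False
deleted = is_current("docs-repo", "sha1", "docs/old.md", snapshots, heads)      # False
```

No file content is fetched anywhere in this check; it only compares IDs that are already in the cache.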

Where this bites differently: monorepo vs. many small repos

The same problem shows up in opposite shapes depending on how your org structures things.

In a monorepo, the hard part is precision. One commit graph, but docs are buried among 200k files. The commit diff is your friend — one cached tree plus a diff tells you exactly which of those files changed, and the manifest's src scoping and exclude patterns keep a greedy **/*.md from sweeping in changelogs and archived folders. Freshness is comparatively easy because everything shares one version.
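Here is a hedged sketch of that filtering step applied to the paths from a commit diff. The `src`, `include`, and `exclude` parameters mirror the idea of manifest scoping but are not a real manifest schema, and `glob_to_regex` is a small hand-rolled helper covering only `**` and `*`, written for this example.

```python
import re
from typing import List


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a small glob subset to a regex.

    '**/' spans zero or more directories, '**' spans anything,
    and '*' stays within a single path segment.
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def select_docs(paths: List[str], src: str,
                include: List[str], exclude: List[str]) -> List[str]:
    """Keep paths under src that match an include glob and no exclude glob."""
    inc = [glob_to_regex(p) for p in include]
    exc = [glob_to_regex(p) for p in exclude]
    prefix = src.rstrip("/") + "/"
    selected = []
    for path in paths:
        if not path.startswith(prefix):
            continue  # outside the manifest's src scope
        rel = path[len(prefix):]
        if any(r.fullmatch(rel) for r in inc) and not any(r.fullmatch(rel) for r in exc):
            selected.append(path)
    return selected


# Paths reported by a commit diff (made-up values).
changed = [
    "docs/intro.md",
    "docs/guide/setup.md",
    "docs/CHANGELOG.md",
    "docs/archive/2019/old.md",
    "src/README.md",
    "docs/img/logo.png",
]
selected = select_docs(
    changed,
    src="docs",
    include=["**/*.md"],
    exclude=["**/CHANGELOG.md", "archive/**"],
)
print(selected)  # ['docs/intro.md', 'docs/guide/setup.md']
```

Without the `src` scope and the two exclude patterns, the same `**/*.md` would also have picked up the changelog, the archived page, and the README under `src/`.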

Across many repos under one org, it's a coordination and provenance problem. Fifty repos, fifty commit graphs, fifty manifests — some with no manifest at all. This is where the cached commit metadata earns its keep: you can answer "what changed across all of them?" by diffing each cheaply, instead of re-walking fifty histories on every build. A few things that helped us:

Trust the per-repo manifest as the definition of "what is a doc here." Don't impose a global glob — that's how irrelevant files leak across repo boundaries.
Pin everything to {repo, commitSha, path} and re-resolve on a commit diff, not a timer. Staleness is almost always a re-index-cadence problem, not a fetch problem.
Treat no-manifest repos as out-of-scope, explicitly. The "just index everything" fallback is the single biggest source of irrelevant answers downstream.
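Those three rules can be sketched together as a small planning function. Everything named here is hypothetical: `RepoState`, `DocRef`, `plan_reindex`, and the `diff_paths` callback stand in for whatever your provider client and manifest loader look like, and the sketch only handles added or modified paths.

```python
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple


class DocRef(NamedTuple):
    repo: str
    commit_sha: str
    path: str


class RepoState(NamedTuple):
    last_sha: str   # last commit we indexed
    head_sha: str   # latest known commit
    manifest: Optional[Callable[[str], bool]]  # None means the repo has no manifest


def plan_reindex(
    repos: Dict[str, RepoState],
    diff_paths: Callable[[str, str, str], List[str]],
) -> Tuple[List[DocRef], List[str]]:
    """Return (pinned docs to re-index, repos skipped for lacking a manifest)."""
    to_index: List[DocRef] = []
    skipped: List[str] = []
    for name, state in sorted(repos.items()):
        if state.manifest is None:
            skipped.append(name)  # explicitly out of scope, no "index everything" fallback
            continue
        if state.last_sha == state.head_sha:
            continue  # no new commits, nothing to re-resolve
        for path in diff_paths(name, state.last_sha, state.head_sha):
            if state.manifest(path):  # the per-repo manifest decides what counts as a doc
                to_index.append(DocRef(name, state.head_sha, path))
    return to_index, skipped


# Made-up repos and diffs for illustration.
repos = {
    "api-docs": RepoState("a1", "a2", lambda p: p.startswith("docs/") and p.endswith(".md")),
    "legacy-tools": RepoState("l1", "l2", None),
    "sdk-docs": RepoState("s5", "s5", lambda p: p.endswith(".md")),
}
diffs = {("api-docs", "a1", "a2"): ["docs/auth.md", "src/client.py"]}


def fake_diff(repo: str, old_sha: str, new_sha: str) -> List[str]:
    return diffs.get((repo, old_sha, new_sha), [])


to_index, skipped = plan_reindex(repos, fake_diff)
```

Returning the skipped repos alongside the plan keeps the out-of-scope decision visible, rather than letting those repos quietly fall through to a catch-all glob.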

The takeaway

Selective fetching gets you from 200,000 files to 50. But the speed and the trustworthiness of the whole pipeline come from a layer underneath it: caching the immutable commit metadata once, then riding commit diffs forever. Monorepos push you toward better filtering; many-repos push you toward better provenance. The commit-diff layer is what makes both fast — and what lets an agent know its docs are actually current.

That same cached commit history powers something else, too: links to your docs that don't break when files get moved or renamed. That's the subject of the next post in this series.

This is part 2 of the **Docs-as-code at scale** series:

Stop cloning entire repos for your doc builds
The expensive part of selective doc fetching isn't the files — it's the commits (you are here)
Links that don't break when your docs move

Sai Pramod Upadhyayula is a Senior Software Engineer at Microsoft working on AI-powered enterprise knowledge platforms, and a contributor to the DocFX open-source ecosystem.