sai pramod upadhyayula

Posted on May 27

Stop Cloning Entire Repos for Your Doc Builds

#documentation #opensource #typescript #devops

Your docs live next to your code. That's the docs-as-code promise — version control, pull request reviews, CI/CD pipelines. It works beautifully.

Until your repo hits 100,000 files.

The problem nobody talks about

Our team runs a documentation portal that pulls content from dozens of large repositories. Each doc build needs a handful of markdown files and images from repos containing hundreds of thousands of files. The naive approach — git clone — is painfully slow and wasteful.

We tried sparse checkout. We tried shallow clones. We tried the git provider APIs directly. Each came with its own problems:

Full clones: Minutes of download time for a build that needs 50 files
API file downloads: Hit rate limits after a few hundred files
Sparse checkout: Still requires git history negotiation and doesn't help with API-based pipelines

The irony? The manifest already declares exactly which files are needed. The docfx.json (or whatever config your static site generator uses) lists every content glob, every resource pattern. We just weren't using that information early enough.

Why this matters even more now: AI agents

This isn't just a build-speed problem anymore. If you're building AI agents that answer questions about your product, help onboard developers, or assist with internal processes — they need access to your documentation. Not your code. Not your tests. The docs.

The challenge scales fast:

RAG pipelines need to ingest documentation from dozens of repos — cloning all of them is absurd
Incremental indexing requires knowing which files are documentation vs. code — the manifest already tells you
Multi-repo knowledge bases need a fast, selective way to pull only content files across many repos

The faster and more precisely you can extract documentation from your repositories, the fresher and more accurate your agents' knowledge becomes. Solving the selective fetch problem unlocks both faster builds and reliable AI-powered documentation experiences.

The idea: resolve before you fetch

What if we flipped the order?

Instead of: clone everything → build → throw away 99% of the files

We do: get the file listing → match against manifest → fetch only what matches

┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│  Git Provider    │     │ selective-repo-fetch  │     │  Doc Pipeline   │
│  (file listing)  │────▶│  (manifest matching   │────▶│  (build only    │
│                  │     │   + reference filter) │     │   matched files)│
└─────────────────┘     └──────────────────────┘     └─────────────────┘

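A file tree listing from GitHub, Azure DevOps, or GitLab is a cheap metadata request: it returns paths, not file contents. It is usually a single call, though providers may paginate or truncate the listing for very large trees. Match that listing against your manifest patterns, and you know exactly what to fetch.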
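To make the "match before you fetch" step concrete, here is a minimal sketch of filtering a listing with glob patterns. It is not the library's implementation: `TreeEntry`, `globToRegExp`, and `selectPaths` are hypothetical names, and the matcher only understands `**`, `*`, and `?`.

```typescript
// Hypothetical shape of a listing entry; real provider responses carry more fields.
interface TreeEntry {
  path: string;
}

// Convert a simple glob (supports **, *, and ?) into an anchored RegExp.
// A simplified stand-in for a real glob library.
function globToRegExp(glob: string): RegExp {
  let pattern = '';
  let i = 0;
  while (i < glob.length) {
    const ch = glob.charAt(i);
    if (ch === '*' && glob.charAt(i + 1) === '*') {
      if (glob.charAt(i + 2) === '/') {
        pattern += '(?:.*/)?'; // "**/" matches zero or more directories
        i += 3;
      } else {
        pattern += '.*'; // bare "**" matches anything, including "/"
        i += 2;
      }
    } else if (ch === '*') {
      pattern += '[^/]*'; // "*" stays within one path segment
      i += 1;
    } else if (ch === '?') {
      pattern += '[^/]';
      i += 1;
    } else {
      pattern += ch.replace(/[.+^${}()|[\]\\]/g, '\\$&'); // escape regex metacharacters
      i += 1;
    }
  }
  return new RegExp('^' + pattern + '$');
}

// Keep only the listed paths that match at least one glob.
function selectPaths(entries: TreeEntry[], globs: string[]): string[] {
  const matchers = globs.map(globToRegExp);
  return entries
    .map((entry) => entry.path)
    .filter((path) => matchers.some((re) => re.test(path)));
}

const treeEntries: TreeEntry[] = [
  { path: '/docs/intro.md' },
  { path: '/docs/guide/setup.md' },
  { path: '/src/index.ts' },
  { path: '/README.md' },
];

const toFetch = selectPaths(treeEntries, ['/docs/**/*.md']);
console.log(toFetch); // → [ '/docs/intro.md', '/docs/guide/setup.md' ]
```

Only the two paths under `/docs` survive, so only those two would be downloaded.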

Introducing selective-repo-fetch

We open-sourced this logic as a TypeScript library: selective-repo-fetch. It's MIT-licensed and provider-agnostic.

npm install github:microsoft/selective-repo-fetch

Here's the core workflow:

import { resolveFileMatches, filterReferencedResources } from 'selective-repo-fetch';

// Your manifest declares what your doc site needs
const manifest = {
  build: {
    content: [{ files: ['**/*.md'], src: 'docs' }],
    resource: [{ files: ['**/*.{png,jpg,svg}'], src: 'docs/images' }],
  },
};

// Step 1: Get file listing from any git API (one cheap metadata call)
// getTreeListing is your own provider-specific helper, not a library export
const repoFiles = await getTreeListing(); // returns [{ path: '/docs/intro.md' }, ...]

// Step 2: Resolve manifest → content + resource matches
const matched = resolveFileMatches(repoFiles, manifest, '/', '/docfx.json');
// matched.contentMatches → only the markdown files your build needs
// matched.resourceMatches → only images/videos matching resource globs

From 200,000 files down to the 50 that matter. One function call.

Going further: filtering unreferenced resources

Glob matching is great, but it can be too generous. A **/*.png pattern in your resource section will match every image under that folder — even the ones no markdown file actually references.

For large repos, this matters. Unreferenced images can be megabytes of wasted downloads.

So we added a second pass:

// Step 3: Fetch the content files (small text — fast and cheap)
const contentFileTexts: Record<string, string> = {};
for (const filePath of matched.contentMatches) {
  contentFileTexts[filePath] = await fetchFileContent(filePath);
}

// Step 4: Filter resources to only those actually referenced
const referencedResources = filterReferencedResources(
  matched.resourceMatches,
  contentFileTexts
);
// Scans markdown/HTML for ![](path), <img src="path">, [text](path), etc.
// Drops any resource not referenced by any content file

This scans your content files for markdown image references (![](path)), links ([text](path)), and HTML attributes (src="path", href="path"). If a resource file isn't referenced anywhere in your content, it gets dropped.
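As a rough illustration of what such a scan involves, the sketch below pulls link and image targets out of content with two regular expressions and keeps only the resources whose file name is referenced. This is an assumption about the general technique, not the library's code: `extractReferences`, `baseName`, and `keepReferenced` are hypothetical names, and the sketch ignores reference-style links and code fences.

```typescript
// Collect link and image targets from markdown and HTML text.
function extractReferences(text: string): string[] {
  const patterns = [
    /!?\[[^\]]*\]\(\s*([^)\s]+)/g, // ![alt](target) and [text](target)
    /\b(?:src|href)\s*=\s*["']([^"']+)["']/gi, // src="target" and href="target"
  ];
  const refs: string[] = [];
  for (const pattern of patterns) {
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(text)) !== null) {
      const target = match[1];
      if (target) refs.push(target);
    }
  }
  return refs;
}

// Lowercased last path segment, used for a case-insensitive file name comparison.
function baseName(path: string): string {
  const segments = path.split('/');
  return (segments[segments.length - 1] ?? '').toLowerCase();
}

// Keep only resources whose file name appears in at least one content file.
function keepReferenced(
  resources: string[],
  contentByPath: Record<string, string>,
): string[] {
  const referenced = new Set<string>();
  for (const file of Object.keys(contentByPath)) {
    for (const ref of extractReferences(contentByPath[file] ?? '')) {
      referenced.add(baseName(ref));
    }
  }
  return resources.filter((resource) => referenced.has(baseName(resource)));
}

const sampleContent: Record<string, string> = {
  '/docs/intro.md': 'See ![diagram](images/flow.png) and <img src="images/Logo.SVG">.',
};

const keptResources = keepReferenced(
  ['/docs/images/flow.png', '/docs/images/logo.svg', '/docs/images/unused.png'],
  sampleContent,
);
console.log(keptResources); // → [ '/docs/images/flow.png', '/docs/images/logo.svg' ]
```

Matching on file name alone is deliberately loose: it may keep a few extra files with the same name in different folders, but it should not drop one that is actually referenced.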

The full pipeline

Here's what it looks like end-to-end with the GitHub API:

import { Octokit } from '@octokit/rest';
import { resolveFileMatches, filterReferencedResources } from 'selective-repo-fetch';

const octokit = new Octokit({ auth: token });

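// 1. One API call to get the full file tree (metadata only, no content)
const { data } = await octokit.git.getTree({
  owner, repo, tree_sha: 'HEAD', recursive: 'true'
});
// GitHub sets data.truncated when a recursive tree exceeds its size limits;
// in that case, walk subtrees individually instead of trusting this listing.
if (data.truncated) {
  throw new Error('Tree listing was truncated; fetch subtrees non-recursively');
}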

const files = data.tree
  .filter(item => item.type === 'blob')
  .map(item => ({ path: '/' + item.path }));

// 2. Resolve manifest patterns
const manifest = JSON.parse(manifestText); // manifestText: raw contents of your docfx.json
const matched = resolveFileMatches(files, manifest, '/', '/docfx.json');

// 3. Fetch content files (small text)
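const contentTexts: Record<string, string> = {};
for (const path of matched.contentMatches) {
  const { data: file } = await octokit.repos.getContent({ owner, repo, path: path.slice(1) });
  // getContent returns an array for directories; only single-file responses carry base64 content
  if (Array.isArray(file) || file.type !== 'file') continue;
  contentTexts[path] = Buffer.from(file.content, 'base64').toString('utf8');
}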

// 4. Filter resources to only referenced ones
const resources = filterReferencedResources(matched.resourceMatches, contentTexts);

// 5. Fetch only referenced resources
// You now have the exact list — nothing wasted
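Step 5 is left open above. One way to approach it, sketched here under assumptions, is to download the referenced resources with a small concurrency cap so you stay well under provider rate limits. `mapWithConcurrency` and `fetchResource` are hypothetical helpers; `fetchResource` is a stub standing in for whatever download call your provider offers.

```typescript
// Run an async worker over items with at most `limit` calls in flight,
// returning results in the same order as the input.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}

// Stub download: replace with a real provider call that returns file bytes.
async function fetchResource(path: string): Promise<{ path: string; size: number }> {
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
  return { path, size: path.length };
}

async function demo(): Promise<void> {
  const referenced = ['/docs/images/flow.png', '/docs/images/logo.svg', '/docs/images/arch.png'];
  const downloaded = await mapWithConcurrency(referenced, 2, fetchResource);
  console.log(downloaded.map((item) => item.path));
}

demo().catch((error) => console.error(error));
```

The same helper could wrap the content-file loop in step 3 if sequential requests turn out to be the bottleneck.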

What it handles

The manifest matching is thorough:

Glob patterns with brace expansion (*.{md,yml})
src path resolution relative to manifest location
Per-section excludes (exclude: ["**/draft/**"])
Templates, metadata files, .order files — auto-included
External references via src: "../other-folder" — discovered before you fetch
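Two of these behaviors are easy to picture in isolation. The sketch below shows one plausible way to expand brace groups and to resolve a `src` value relative to the manifest's folder, including detecting when it points outside that folder. The function names are hypothetical and the logic is an illustration, not the library's code.

```typescript
// Expand {a,b} groups: "*.{md,yml}" becomes ["*.md", "*.yml"].
function expandBraces(glob: string): string[] {
  const match = /\{([^{}]*)\}/.exec(glob); // innermost group first
  if (match === null) return [glob];
  const before = glob.slice(0, match.index);
  const after = glob.slice(match.index + match[0].length);
  const expanded: string[] = [];
  for (const option of (match[1] ?? '').split(',')) {
    for (const rest of expandBraces(before + option + after)) expanded.push(rest);
  }
  return expanded;
}

// Collapse ".", "..", and duplicate slashes in a POSIX-style path.
function normalizePath(path: string): string {
  const out: string[] = [];
  for (const segment of path.split('/')) {
    if (segment === '' || segment === '.') continue;
    if (segment === '..') out.pop();
    else out.push(segment);
  }
  return '/' + out.join('/');
}

// Folder that contains the manifest, e.g. "/site/docfx.json" gives "/site".
function manifestFolder(manifestPath: string): string {
  return normalizePath(manifestPath.slice(0, manifestPath.lastIndexOf('/') + 1));
}

// Resolve a manifest entry's src against the manifest's folder.
function resolveSrc(manifestPath: string, src: string): string {
  return normalizePath(manifestFolder(manifestPath) + '/' + src);
}

// A src is external when it resolves outside the manifest's folder.
function isExternalSrc(manifestPath: string, src: string): boolean {
  const folder = manifestFolder(manifestPath);
  const prefix = folder === '/' ? '/' : folder + '/';
  return !(resolveSrc(manifestPath, src) + '/').startsWith(prefix);
}

console.log(expandBraces('**/*.{png,jpg,svg}')); // → [ '**/*.png', '**/*.jpg', '**/*.svg' ]
console.log(resolveSrc('/site/docfx.json', 'docs')); // → /site/docs
console.log(resolveSrc('/site/docfx.json', '../other-folder')); // → /other-folder
console.log(isExternalSrc('/site/docfx.json', '../other-folder')); // → true
```

Because resolution works on the listing alone, an external folder can be flagged before any file content is downloaded.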

The reference filter handles:

Markdown images and links: ![alt](path), [text](path)
HTML attributes: <img src="path">, <video src="path">, <a href="path">
Path normalization: strips ~/, leading /, query strings, anchors
Skips external URLs, data URIs, mailto:, javascript:
Case-insensitive filename matching
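The normalization rules in this list can be sketched as a single function. This is a simplified reading of the list, not the library's code: `normalizeReference` is a hypothetical name, and lowercasing the whole path is a blunter stand-in for the case-insensitive file name matching described above.

```typescript
// Return a comparable, repo-relative form of a reference,
// or null when the reference should be skipped entirely.
function normalizeReference(ref: string): string | null {
  const trimmed = ref.trim();

  // Skip anything with a URL scheme (https:, data:, mailto:, javascript:)
  // and protocol-relative URLs such as //cdn.example.com/x.png.
  if (/^[a-z][a-z0-9+.-]*:/i.test(trimmed) || trimmed.startsWith('//')) return null;

  // Drop the query string and anchor.
  const withoutSuffix = trimmed.split(/[?#]/)[0] ?? '';

  // Strip the "~/" root prefix and any leading slashes.
  const withoutPrefix = withoutSuffix.replace(/^~\//, '').replace(/^\/+/, '');

  // Pure anchors like "#section" leave nothing to match.
  if (withoutPrefix === '') return null;

  return withoutPrefix.toLowerCase();
}

console.log(normalizeReference('~/images/Flow.PNG?raw=true#top')); // → images/flow.png
console.log(normalizeReference('/media/logo.svg')); // → media/logo.svg
console.log(normalizeReference('https://example.com/a.png')); // → null
console.log(normalizeReference('#section')); // → null
```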

When to use this

Documentation portals pulling from multiple repos — resolve before you clone
Monorepo doc builds — your manifest knows what matters, use it
CI/CD pipelines — cut build times by fetching only what changed
Any static site generator (DocFX, MkDocs, Sphinx, Docusaurus) that uses a manifest

Why this matters for AI agents

There's a downstream benefit we didn't anticipate when we first built this: making documentation efficiently available to AI agents. As noted earlier, agents need your docs, not your entire codebase.

The manifest-driven approach gives you exactly that separation:

Selective ingestion — only pull documentation into your RAG pipeline, not code, tests, or CI configs
Incremental updates — when a doc changes, you know it's a doc (not a code file) because the manifest says so
Multi-repo knowledge bases — pull docs from 50+ repos without cloning any of them

// Feed docs from multiple repos into your agent's knowledge base
for (const repo of repositories) {
  const files = await getTreeListing(repo);
  const matched = resolveFileMatches(files, repo.manifest, '/', '/docfx.json');

  // Only index documentation — not code, not tests, not configs
  for (const docPath of matched.contentMatches) {
    const content = await fetchFile(repo, docPath);
    await knowledgeBase.ingest({ path: docPath, repo: repo.name, content });
  }
}
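The loop above re-ingests everything on each run. For the incremental case, one possible shape is sketched below: take the paths changed between two commits, keep only those the manifest resolved as documentation, and record where each ingested file came from. All names here are hypothetical, the change list is assumed to come from your provider's compare or diff endpoint, and pinning records to repo, commit, and path is an assumption about how you might track provenance.

```typescript
type ChangeKind = 'added' | 'modified' | 'removed';

interface ChangedPath {
  path: string;
  kind: ChangeKind;
}

// Provenance stored alongside every ingested document.
interface IndexedDoc {
  repo: string;
  commitSha: string;
  path: string;
}

interface ReindexPlan {
  ingest: IndexedDoc[];
  remove: string[];
}

// Decide what to re-ingest and what to delete for one repo at one commit.
// docPaths: manifest-matched content paths at the new commit.
// indexedPaths: paths already present in the knowledge base.
function planReindex(
  repo: string,
  commitSha: string,
  changes: ChangedPath[],
  docPaths: ReadonlySet<string>,
  indexedPaths: ReadonlySet<string>,
): ReindexPlan {
  const plan: ReindexPlan = { ingest: [], remove: [] };
  for (const change of changes) {
    if (change.kind === 'removed') {
      // Only delete what was actually indexed before.
      if (indexedPaths.has(change.path)) plan.remove.push(change.path);
    } else if (docPaths.has(change.path)) {
      // Added or modified documentation; code and config changes fall through.
      plan.ingest.push({ repo, commitSha, path: change.path });
    }
  }
  return plan;
}

const samplePlan = planReindex(
  'product-docs',
  'abc1234', // placeholder commit id
  [
    { path: '/docs/intro.md', kind: 'modified' },
    { path: '/src/index.ts', kind: 'modified' },
    { path: '/docs/old.md', kind: 'removed' },
    { path: '/docs/new.md', kind: 'added' },
  ],
  new Set(['/docs/intro.md', '/docs/new.md']),
  new Set(['/docs/intro.md', '/docs/old.md']),
);
console.log(samplePlan.ingest.map((doc) => doc.path)); // → [ '/docs/intro.md', '/docs/new.md' ]
console.log(samplePlan.remove); // → [ '/docs/old.md' ]
```

This sketch does not handle a manifest change that makes a previously indexed file stop matching; a real pipeline would also need to compare the old and new resolved sets.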

Efficient content fetching is the foundation of a reliable AI-powered docs experience.

Try it out

The library is MIT-licensed and has zero opinions about your git provider — it works with any API that can give you a file listing.

npm install github:microsoft/selective-repo-fetch

GitHub: microsoft/selective-repo-fetch

If your doc builds are slow because of large repos, give it a try. And if you have ideas for improvements, PRs are welcome.

What's the worst monorepo doc build experience you've had? I'd love to hear about it in the comments.

Top comments (2)

Taylor Dolezal • May 28

Hello, Sai! What I like here is that the manifest already knows which files are docs, and that you're reading it BEFORE the fetch rather than afterward.

We have run into this too where I work (Dosu) when working with docs across many OSS repos. What we've seen is a big problem downstream, when someone trusts an answer their agent sourced from out-of-date files or docs that were not relevant.

I'm curious how you see this playing out in a monorepo versus many small repos under one org. The many-repos case feels like where teams struggle most.

Great write-up, looking forward to the next one!

sai pramod upadhyayula • Jun 10

Hello Taylor, this comment is exactly what nudged me to write two follow-ups, so thank you for the prompt.

On monorepo vs. many small repos, I think they fail in opposite ways. Monorepos are a precision problem: one commit graph, but docs buried among hundreds of thousands of files, so the risk is over-matching, such as a greedy **/*.md pulling in changelogs and archived folders. The manifest's src scoping and exclude patterns do most of the work there. Many repos under one org are a coordination and provenance problem, and I agree it's where teams struggle most: every repo has its own manifest (or none), so you can't assume "docs" means the same thing across 50 of them. What's worked for us is trusting each repo's manifest as the source of truth, pinning every ingested file to {repo, commitSha, path}, re-resolving on a commit diff rather than a timer, and treating no-manifest repos as explicitly out-of-scope instead of falling back to "index everything", because that fallback is the single biggest source of irrelevant answers.

The staleness point you raised turned out to deserve its own writeup. The surprise for us was that the dominant cost wasn't fetching content — it was walking commit history. Caching that immutable commit metadata once and riding diffs is what makes freshness honest. I wrote it up here: The expensive part of selective doc fetching isn't the files — it's the commits (dev.to/saipramod/the-expensive-par...).

And the downstream half — making sure a cited link still resolves even after a doc moves in source control — became this one: Links that don't break when your docs move (dev.to/saipramod/links-that-dont-b...).

Curious how you're handling the no-manifest repos at Dosu — that's the gap I still find messiest.