DEV Community

sai pramod upadhyayula


Stop Cloning Entire Repos for Your Doc Builds

Your docs live next to your code. That's the docs-as-code promise — version control, pull request reviews, CI/CD pipelines. It works beautifully.

Until your repo hits 100,000 files.

The problem nobody talks about

Our team runs a documentation portal that pulls content from dozens of large repositories. Each doc build needs a handful of markdown files and images from repos containing hundreds of thousands of files. The naive approach — git clone — is painfully slow and wasteful.

We tried sparse checkout. We tried shallow clones. We tried the git provider APIs directly. Each came with its own problems:

  • Full clones: Minutes of download time for a build that needs 50 files
  • Shallow clones: Skip the history but still download every file in the current tree
  • API file downloads: Hit rate limits after a few hundred files
  • Sparse checkout: Still requires git history negotiation and doesn't help with API-based pipelines

The irony? The manifest already declares exactly which files are needed. The docfx.json (or whatever config your static site generator uses) lists every content glob, every resource pattern. We just weren't using that information early enough.

Why this matters even more now: AI agents

This isn't just a build-speed problem anymore. If you're building AI agents that answer questions about your product, help onboard developers, or assist with internal processes — they need access to your documentation. Not your code. Not your tests. The docs.

The challenge scales fast:

  • RAG pipelines need to ingest documentation from dozens of repos — cloning all of them is absurd
  • Incremental indexing requires knowing which files are documentation vs. code — the manifest already tells you
  • Multi-repo knowledge bases need a fast, selective way to pull only content files across many repos

The faster and more precisely you can extract documentation from your repositories, the fresher and more accurate your agents' knowledge becomes. Solving the selective fetch problem unlocks both faster builds and reliable AI-powered documentation experiences.

The idea: resolve before you fetch

What if we flipped the order?

Instead of: clone everything → build → throw away 99% of the files

We do: get the file listing → match against manifest → fetch only what matches

┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│  Git Provider    │     │ selective-repo-fetch  │     │  Doc Pipeline   │
│  (file listing)  │────▶│  (manifest matching   │────▶│  (build only    │
│                  │     │   + reference filter) │     │   matched files)│
└─────────────────┘     └──────────────────────┘     └─────────────────┘

A file tree listing from GitHub, Azure DevOps, or GitLab is cheap: it returns metadata, not file contents, and usually takes a single API call (very large trees may need pagination or per-folder listings). Match that listing against your manifest patterns, and you know exactly what to fetch.
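To make the "match the listing against the manifest" step concrete, here is a minimal sketch of glob matching over a file listing. It is not the library's implementation: `globToRegExp` and `selectFiles` are hypothetical helpers written for illustration, and they only cover `*`, `**`, and brace expansion.

```typescript
// Illustrative only: a tiny glob matcher, not the library's actual implementation.
interface RepoFile {
  path: string;
}

// Escape characters that have a special meaning in regular expressions.
function escapeLiteral(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Convert a glob such as "**/*.{png,jpg}" into an anchored regular expression.
function globToRegExp(glob: string): RegExp {
  let source = '';
  let i = 0;
  while (i < glob.length) {
    const ch = glob.charAt(i);
    if (glob.startsWith('**/', i)) {
      source += '(?:[^/]+/)*'; // zero or more directory segments
      i += 3;
    } else if (glob.startsWith('**', i)) {
      source += '.*'; // anything, including slashes
      i += 2;
    } else if (ch === '*') {
      source += '[^/]*'; // any characters within one path segment
      i += 1;
    } else if (ch === '{' && glob.indexOf('}', i) !== -1) {
      const close = glob.indexOf('}', i);
      const alternatives = glob.slice(i + 1, close).split(',').map(escapeLiteral);
      source += `(?:${alternatives.join('|')})`;
      i = close + 1;
    } else {
      source += escapeLiteral(ch);
      i += 1;
    }
  }
  return new RegExp(`^${source}$`);
}

// Keep only the paths under srcDir whose relative path matches one of the patterns.
function selectFiles(listing: RepoFile[], srcDir: string, patterns: string[]): string[] {
  const prefix = srcDir.replace(/\/+$/, '') + '/';
  const matchers = patterns.map(globToRegExp);
  return listing
    .map(file => file.path)
    .filter(p => p.startsWith(prefix))
    .filter(p => matchers.some(re => re.test(p.slice(prefix.length))));
}

const sampleListing: RepoFile[] = [
  { path: '/docs/intro.md' },
  { path: '/docs/guide/setup.md' },
  { path: '/docs/images/logo.png' },
  { path: '/src/index.ts' },
];
const markdownDocs = selectFiles(sampleListing, '/docs', ['**/*.md']);
const docImages = selectFiles(sampleListing, '/docs/images', ['**/*.{png,jpg,svg}']);
```

Nothing here touches file contents, which is the point: the decision about what to download is made from paths alone.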

Introducing selective-repo-fetch

We open-sourced this logic as a TypeScript library: selective-repo-fetch. It's MIT-licensed and provider-agnostic.

npm install github:microsoft/selective-repo-fetch

Here's the core workflow:

import { resolveFileMatches, filterReferencedResources } from 'selective-repo-fetch';

// Your manifest declares what your doc site needs
const manifest = {
  build: {
    content: [{ files: ['**/*.md'], src: 'docs' }],
    resource: [{ files: ['**/*.{png,jpg,svg}'], src: 'docs/images' }],
  },
};

// Step 1: Get file listing from any git API (one cheap metadata call)
const repoFiles = await getTreeListing(); // returns [{ path: '/docs/intro.md' }, ...]

// Step 2: Resolve manifest → content + resource matches
const matched = resolveFileMatches(repoFiles, manifest, '/', '/docfx.json');
// matched.contentMatches → only the markdown files your build needs
// matched.resourceMatches → only images/videos matching resource globs

From 200,000 files down to the 50 that matter. One function call.

Going further: filtering unreferenced resources

Glob matching is great, but it can be too generous. A **/*.png pattern in your resource section will match every image under that folder — even the ones no markdown file actually references.

For large repos, this matters. Unreferenced images can be megabytes of wasted downloads.

So we added a second pass:

// Step 3: Fetch the content files (small text, so fast and cheap)
const contentFileTexts: Record<string, string> = {};
for (const filePath of matched.contentMatches) {
  // fetchFileContent stands in for your provider's file-download call
  contentFileTexts[filePath] = await fetchFileContent(filePath);
}

// Step 4: Filter resources to only those actually referenced
const referencedResources = filterReferencedResources(
  matched.resourceMatches,
  contentFileTexts
);
// Scans markdown/HTML for ![](path), <img src="path">, [text](path), etc.
// Drops any resource not referenced by any content file

This scans your content files for markdown image references (![](path)), links ([text](path)), and HTML attributes (src="path", href="path"). If a resource file isn't referenced anywhere in your content, it gets dropped.
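As a rough picture of what such a second pass involves, the sketch below extracts link targets with two simplified regular expressions and keeps only the resources whose file name is referenced. It is an assumption-laden illustration, not the library's code: `extractReferences`, `fileNameOf`, and `keepReferenced` are hypothetical names, and matching on file name alone is a deliberate simplification.

```typescript
// Illustrative only: simplified patterns, not the library's actual implementation.
function extractReferences(text: string): string[] {
  const patterns = [
    /!?\[[^\]]*\]\(\s*([^)\s]+)/g, // ![alt](path) and [text](path)
    /\b(?:src|href)\s*=\s*["']([^"']+)["']/gi, // src="path" and href="path"
  ];
  const refs: string[] = [];
  for (const pattern of patterns) {
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(text)) !== null) {
      refs.push(match[1]);
    }
  }
  return refs;
}

// Lowercased file name, so that Logo.png and logo.png compare equal.
function fileNameOf(p: string): string {
  return p.slice(p.lastIndexOf('/') + 1).toLowerCase();
}

// Keep a resource only if some content file references its file name.
function keepReferenced(resources: string[], contentTexts: Record<string, string>): string[] {
  const referenced = new Set<string>();
  for (const key of Object.keys(contentTexts)) {
    for (const ref of extractReferences(contentTexts[key])) {
      referenced.add(fileNameOf(ref));
    }
  }
  return resources.filter(resource => referenced.has(fileNameOf(resource)));
}

const sampleTexts: Record<string, string> = {
  '/docs/intro.md': 'See ![Logo](images/Logo.png) and <img src="images/diagram.svg">.',
};
const keptResources = keepReferenced(
  ['/docs/images/logo.png', '/docs/images/diagram.svg', '/docs/images/unused.jpg'],
  sampleTexts,
);
```

In this sample, `unused.jpg` is dropped because no content file mentions it.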

The full pipeline

Here's what it looks like end-to-end with the GitHub API:

import { Octokit } from '@octokit/rest';
import { resolveFileMatches, filterReferencedResources } from 'selective-repo-fetch';

// token, owner, repo, and manifestText are placeholders you supply
const octokit = new Octokit({ auth: token });

// 1. One API call to get the full file tree (metadata only, no content)
const { data: tree } = await octokit.git.getTree({
  owner, repo, tree_sha: 'HEAD', recursive: 'true'
});
// GitHub caps recursive listings (100,000 entries or 7 MB) and flags the overflow
if (tree.truncated) {
  throw new Error('Tree listing truncated; list subtrees individually instead');
}

const files = tree.tree
  .filter(item => item.type === 'blob')
  .map(item => ({ path: '/' + item.path }));

// 2. Resolve manifest patterns
const manifest = JSON.parse(manifestText); // manifestText: the contents of your docfx.json
const matched = resolveFileMatches(files, manifest, '/', '/docfx.json');

// 3. Fetch content files (small text)
const contentTexts: Record<string, string> = {};
for (const path of matched.contentMatches) {
  const { data: file } = await octokit.repos.getContent({ owner, repo, path: path.slice(1) });
  // getContent returns an array for directories; only single files carry base64 content
  if (!Array.isArray(file) && 'content' in file) {
    contentTexts[path] = Buffer.from(file.content, 'base64').toString('utf8');
  }
}

// 4. Filter resources to only referenced ones
const resources = filterReferencedResources(matched.resourceMatches, contentTexts);

// 5. Fetch only referenced resources
// You now have the exact list, so nothing is wasted
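Step 5 is left as a comment above. One way to do it without bursting through API rate limits is to cap the number of downloads in flight. The helper below is a generic sketch and not part of selective-repo-fetch; `mapWithConcurrency` is a hypothetical name, and the download function you pass in is whatever your provider's client offers.

```typescript
// Illustrative only: a generic helper, not part of selective-repo-fetch.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  const lanes = Array.from({ length: laneCount }, () => runLane());
  await Promise.all(lanes);
  return results; // same order as the input
}
```

With the pipeline above you might call it as `await mapWithConcurrency(resources, 4, p => downloadBinary(p))`, where `downloadBinary` is your own (here hypothetical) function. A limit of 4 is an arbitrary starting point; tune it against your provider's rate limits.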

What it handles

The manifest matching is thorough:

  • Glob patterns with brace expansion (*.{md,yml})
  • src path resolution relative to manifest location
  • Per-section excludes (exclude: ["**/draft/**"])
  • Templates, metadata files, .order files — auto-included
  • External references via src: "../other-folder" — discovered before you fetch
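To illustrate what resolving `src` relative to the manifest location means, here is a small sketch. `resolveSrc` is a hypothetical helper, not the library's API, and it assumes repo-rooted paths with forward slashes as in the examples above.

```typescript
// Illustrative only: resolves a manifest "src" against the manifest's own folder.
function resolveSrc(manifestPath: string, src: string): string {
  const manifestDir = manifestPath.slice(0, manifestPath.lastIndexOf('/'));
  const segments: string[] = [];
  for (const part of `${manifestDir}/${src}`.split('/')) {
    if (part === '' || part === '.') {
      continue; // skip empty and current-folder segments
    }
    if (part === '..') {
      segments.pop(); // step up one folder, never above the repo root
    } else {
      segments.push(part);
    }
  }
  return '/' + segments.join('/');
}

const localDocs = resolveSrc('/docfx.json', 'docs');
const externalDocs = resolveSrc('/site/docfx.json', '../other-folder');
```

Because this works on paths alone, a `src` that points outside the manifest's folder is known before any file is downloaded.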

The reference filter handles:

  • Markdown images and links: ![alt](path), [text](path)
  • HTML attributes: <img src="path">, <video src="path">, <a href="path">
  • Path normalization: strips ~/, leading /, query strings, anchors
  • Skips external URLs, data URIs, mailto:, javascript:
  • Case-insensitive filename matching
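The normalization rules in this list can be read as a small function. The sketch below is one possible interpretation, not the library's code; `normalizeReference` is a hypothetical name, and returning `null` for skipped references is an assumption made for the example.

```typescript
// Illustrative only: one possible reading of the normalization rules listed above.
function normalizeReference(ref: string): string | null {
  const trimmed = ref.trim();
  // Skip anything with a URI scheme (https:, mailto:, javascript:, data:)
  // as well as protocol-relative URLs such as //cdn.example.com/x.png
  if (/^[a-z][a-z0-9+.-]*:/i.test(trimmed) || trimmed.startsWith('//')) {
    return null;
  }
  const withoutSuffix = trimmed.split(/[?#]/)[0]; // drop query string and anchor
  const relative = withoutSuffix.replace(/^~\//, '').replace(/^\/+/, '');
  return relative.length > 0 ? relative : null; // a bare "#anchor" has no path left
}

const normalizedSamples = [
  '~/images/logo.png?v=2#top',
  '/images/a.png',
  'https://example.com/x.png',
  'mailto:docs@example.com',
  '#section',
].map(normalizeReference);
```

Only the first two samples survive, as `images/logo.png` and `images/a.png`; the rest are skipped.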

When to use this

  • Documentation portals pulling from multiple repos: resolve before you clone
  • Monorepo doc builds: your manifest knows what matters, use it
  • CI/CD pipelines: cut build times by fetching only the files your manifest selects
  • Any static site generator (DocFX, MkDocs, Sphinx, Docusaurus) that uses a manifest

Why this matters for AI agents

There's a downstream benefit we didn't anticipate when we first built this: making documentation efficiently available to AI agents.

Agents that answer product questions, help onboard developers, or assist with internal processes need your documentation, not your entire codebase.

The manifest-driven approach gives you exactly that separation:

  1. Selective ingestion: only pull documentation into your RAG pipeline, not code, tests, or CI configs
  2. Incremental updates: when a doc changes, you know it's a doc (not a code file) because the manifest says so
  3. Multi-repo knowledge bases: pull docs from 50+ repos without cloning any of them

// Feed docs from multiple repos into your agent's knowledge base
for (const repo of repositories) {
  const files = await getTreeListing(repo);
  const matched = resolveFileMatches(files, repo.manifest, '/', '/docfx.json');

  // Only index documentation — not code, not tests, not configs
  for (const docPath of matched.contentMatches) {
    const content = await fetchFile(repo, docPath);
    await knowledgeBase.ingest({ path: docPath, repo: repo.name, content });
  }
}
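For the incremental-update case, one approach is to remember a content hash per document between runs and re-ingest only what differs. The sketch below assumes each tree entry carries such a hash (GitHub's tree listing returns a `sha` per entry; other providers may name it differently). `diffDocs`, `TreeEntry`, and `DocChanges` are hypothetical names, not part of the library.

```typescript
// Illustrative only: assumes each tree entry carries a content hash such as GitHub's `sha`.
interface TreeEntry {
  path: string;
  sha: string;
}

interface DocChanges {
  changed: string[]; // new or modified docs to (re)ingest
  removed: string[]; // docs to delete from the index
}

// previous maps doc path -> hash from the last run; docPaths is the manifest-matched set.
function diffDocs(
  previous: Record<string, string>,
  current: TreeEntry[],
  docPaths: string[],
): DocChanges {
  const docSet = new Set(docPaths);
  const seen = new Set<string>();
  const changed: string[] = [];
  for (const entry of current) {
    if (!docSet.has(entry.path)) {
      continue; // ignore code, tests, and configs
    }
    seen.add(entry.path);
    if (previous[entry.path] !== entry.sha) {
      changed.push(entry.path);
    }
  }
  const removed = Object.keys(previous).filter(p => !seen.has(p));
  return { changed, removed };
}

const docChanges = diffDocs(
  { '/docs/a.md': '1', '/docs/b.md': '2', '/docs/old.md': '9' },
  [
    { path: '/docs/a.md', sha: '1' },
    { path: '/docs/b.md', sha: '3' },
    { path: '/docs/c.md', sha: '4' },
    { path: '/src/x.ts', sha: '5' },
  ],
  ['/docs/a.md', '/docs/b.md', '/docs/c.md'],
);
```

In the sample, `b.md` (modified) and `c.md` (new) are re-ingested, `old.md` is removed from the index, and the code file is ignored because it is not in the manifest-matched set.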

Efficient, selective content fetching keeps that knowledge base fresh, and it is the foundation of a reliable AI-powered docs experience.

Try it out

The library is MIT-licensed and has zero opinions about your git provider — it works with any API that can give you a file listing.

npm install github:microsoft/selective-repo-fetch

GitHub: microsoft/selective-repo-fetch

If your doc builds are slow because of large repos, give it a try. And if you have ideas for improvements, PRs are welcome.


What's the worst monorepo doc build experience you've had? I'd love to hear about it in the comments.
