This is the technical companion to my article *I Built an llms.txt Generator, Showed It to the Creator of the Standard, and Had to Rewrite Everything*. The human side of the story lives there; this post covers just the engineering.
The goal: automatically generate a proper llms.txt hierarchy for any website — not a flat index of summaries, but a structured set of MD files where semantically related pages are merged into coherent documents. Here's how each layer works and what broke along the way.
The Architecture
Sitemap → Crawler → Embedder → Clusterer → Summarizer → llms.txt + MD files
Five stages. Each runs at a different speed. Each has its own failure modes.
Stage 1: Crawling
Standard crawling with content extraction. The output per page: path, title, clean text.
Pages that fail to crawl are tracked but don't stop the pipeline — a missing page just doesn't contribute to its cluster.
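For reference, the per-page record that flows downstream can be modeled as a small typed structure. A sketch; the field names are illustrative, not the project's exact types:

// Illustrative shape of the crawler's output
interface CrawledPage {
  path: string;   // URL path, e.g. "/payments/checkout"
  title: string;  // extracted title or main heading
  text: string;   // boilerplate-stripped body text
}

interface CrawlResult {
  pages: CrawledPage[];
  failedPaths: string[]; // tracked, but they don't stop the pipeline
}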
Stage 2: Embeddings + Caching
Each page gets converted to a vector using Gemini's gemini-embedding-001 model.
Vectors are cached in Redis keyed by hostname + model + path. On reprocessing the same site, already-embedded pages are served from cache instantly — no API calls, no cost.
// hashKey encodes hostname + model; one hash field per page path
const allCached = await this.cacheService.hmget(
  hashKey,
  allPaths.map(p => `vectors:${p}`)
);
// Cache hits skip embedding entirely
This matters because embeddings are the most parallelizable step and you don't want to redo them on retries or restarts.
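On a cache miss, the page gets embedded and written back. A minimal sketch of that path using the @google/genai SDK; `pageTextByPath` is a hypothetical lookup, and `cacheService.hmset` is assumed to mirror the `hmget` above:

// Sketch: embed uncached pages and persist vectors back to the Redis hash
const missing = allPaths.filter((_, i) => !allCached[i]);
for (const path of missing) {
  const res = await this.ai.models.embedContent({
    model: 'gemini-embedding-001',
    contents: pageTextByPath[path] // hypothetical path -> text lookup
  });
  const vector = res.embeddings?.[0]?.values ?? [];
  await this.cacheService.hmset(hashKey, {
    [`vectors:${path}`]: JSON.stringify(vector)
  });
}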
Stage 3: K-Means Clustering with Cosine Similarity
Pages cluster by semantic meaning, not URL structure. Two pages about the same concept with different URL paths end up in the same cluster. Three pages under /payments/checkout/ that cover different topics end up in different clusters.
K-means is implemented directly in TypeScript, with no external ML libraries. The distance metric is cosine similarity rather than Euclidean distance; cosine compares direction instead of magnitude, which behaves much better for high-dimensional embedding vectors.
private kMeans(vectors: number[][], k: number, maxIterations = 100): number[] {
  const dim = vectors[0].length;
  // Seed centroids with the first k vectors (deterministic init)
  let centroids = vectors.slice(0, k).map(v => [...v]);
  let assignments = new Array<number>(vectors.length).fill(0);

  for (let iter = 0; iter < maxIterations; iter++) {
    // Assignment step: move each vector to its nearest centroid,
    // using cosine distance (1 - cosine similarity)
    const newAssignments = vectors.map((v) => {
      let minDist = Infinity, nearest = 0;
      for (let c = 0; c < centroids.length; c++) {
        const dist = 1 - this.cosineSimilarity(v, centroids[c]);
        if (dist < minDist) { minDist = dist; nearest = c; }
      }
      return nearest;
    });
    const changed = newAssignments.some((a, i) => a !== assignments[i]);
    assignments = newAssignments;
    if (!changed) break; // converged: no vector switched clusters

    // Update step: recompute each centroid as the mean of its members
    centroids = Array.from({ length: k }, (_, c) => {
      const members = vectors.filter((_, i) => assignments[i] === c);
      if (members.length === 0) return centroids[c]; // keep empty clusters in place
      const sum = new Array<number>(dim).fill(0);
      for (const v of members) {
        for (let d = 0; d < dim; d++) sum[d] += v[d];
      }
      return sum.map(s => s / members.length);
    });
  }
  return assignments;
}
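The cosineSimilarity helper referenced above isn't shown in the post; the standard formulation is the dot product divided by the product of the vector norms:

private cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Guard against zero vectors to avoid dividing by zero
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}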
The number of clusters is dynamic: it is derived from the actual content distribution rather than fixed as a parameter.
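The post doesn't say how k is chosen; one common lightweight heuristic, shown purely as an illustration and not as the project's method, scales k with corpus size and clamps it to a sane range:

// Illustrative only: derive k from the page count, bounded to [2, 50]
const k = Math.min(Math.max(Math.round(Math.sqrt(pages.length / 2)), 2), 50);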
Stage 4: Cluster Summarization with Context Caching
Each cluster goes through a two-phase generation process.
Phase 1 — Structure: One LLM call that returns the section name, description, and how many output MD pages to generate. The LLM decides whether to merge source pages, split them, or skip irrelevant ones — I don't impose that from outside.
Phase 2 — Content: One call per output page, generating filename, title, summary, and full markdown content.
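Concretely, the Phase 1 response can be modeled as a small typed structure. Field names here are illustrative:

// Illustrative Phase 1 output: the LLM's plan for the cluster
interface ClusterStructure {
  sectionName: string;     // e.g. "Account Setup And Management"
  description: string;     // one-line section description for llms.txt
  outputPageCount: number; // how many MD files Phase 2 should produce
}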
The Context Caching Problem
Naive implementation: send all cluster page content as context on every call. For a cluster generating 5 MD files, you pay for those input tokens 5 times.
Fix: Gemini Context Caching. Upload the cluster content once, get a cache reference, use it for all subsequent calls within the cluster.
private async createCacheStrategy(
  model: string,
  systemInstruction: string,
  pagesText: string,
  baseConfig: Record<string, unknown>
) {
  try {
    // Upload the cluster content once and reuse it across calls
    const cached = await this.ai.caches.create({
      model,
      config: {
        ttl: '600s',
        systemInstruction,
        contents: `Pages:\n${pagesText}`
      }
    });
    return {
      // Subsequent calls reference the cache by name instead of resending pagesText
      config: { ...baseConfig, cachedContent: cached.name },
      getContents: (prompt: string) => prompt,
      dispose: async () => {
        await this.ai.caches.delete({ name: cached.name }).catch(() => {});
      },
      refreshIfNeeded: () => {
        void this.ai.caches.update({
          name: cached.name,
          config: { ttl: '600s' }
        }).catch(() => {});
      }
    };
  } catch (err) {
    // Gemini requires a minimum token count for caching.
    // Small clusters fall back to inline context.
    if (err instanceof ApiError && err.status === 400
      && err.message.includes('min_total_token_count')) {
      return {
        config: { ...baseConfig, systemInstruction },
        getContents: (prompt: string) => `Pages:\n${pagesText}\n\n${prompt}`,
        dispose: () => Promise.resolve(),
        refreshIfNeeded: () => {}
      };
    }
    throw err;
  }
}
The cache TTL is 600 seconds. For large clusters that take longer, refreshIfNeeded() is called after each page generation to reset the TTL before it expires.
The cache is deleted in the finally block after the cluster finishes — no leaking paid cache slots.
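Put together, the per-cluster flow looks roughly like this; `plannedPages` and `buildPagePrompt` are hypothetical names, the rest follows the strategy object above:

const strategy = await this.createCacheStrategy(
  model, systemInstruction, pagesText, baseConfig
);
try {
  for (const planned of plannedPages) {
    const res = await this.ai.models.generateContent({
      model,
      contents: strategy.getContents(buildPagePrompt(planned)), // hypothetical prompt builder
      config: strategy.config
    });
    // ...parse res.text into filename/title/summary/markdown...
    strategy.refreshIfNeeded(); // reset the 600s TTL for long-running clusters
  }
} finally {
  await strategy.dispose(); // no leaked paid cache slots
}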
Stage 5: Buffers Between Layers
Crawling, embedding, and summarizing run at completely different speeds. The crawler is fast. The embedder processes in batches. The summarizer is slow — LLM calls take seconds to minutes each.
Wire them together naively and they block each other. The crawler piles up thousands of pages while the embedder waits. The summarizer starves waiting for embeddings to flush.
Solution: in-memory buffers between each layer. Each stage writes to its output buffer and reads from the previous stage's buffer. Concurrency is controlled independently per stage via the AIMD queue.
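A minimal sketch of the kind of async buffer that decouples two stages (not the project's actual implementation):

// Sketch: an async queue where the producer never blocks
// and the consumer awaits until an item is available
class AsyncBuffer<T> {
  private items: T[] = [];
  private waiters: Array<(item: T) => void> = [];

  push(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) waiter(item); // hand off directly to a waiting consumer
    else this.items.push(item);
  }

  pull(): Promise<T> {
    if (this.items.length > 0) {
      return Promise.resolve(this.items.shift() as T);
    }
    return new Promise(resolve => this.waiters.push(resolve));
  }
}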
The AIMD Queue
The core reliability mechanism. AIMD — Additive Increase Multiplicative Decrease — is the same algorithm TCP uses for congestion control, applied to LLM API calls.
private onSuccess(): void {
  this.successStreak++;
  // Additive increase: one full successful round at the current
  // concurrency earns one more parallel slot
  if (this.successStreak >= this.concurrency) {
    const next = this.concurrency + 1;
    this.concurrency = this.maxConcurrency !== null
      ? Math.min(next, this.maxConcurrency)
      : next;
    this.successStreak = 0;
  }
}

private onRateLimit(kind: ErrorKind): void {
  // Multiplicative decrease: halve concurrency, never dropping below 1
  this.concurrency = Math.max(Math.floor(this.concurrency / 2), 1);
  this.successStreak = 0;
}
Rules:
- Start at concurrency = 1
- After a full successful round: concurrency += 1
- On 429 or 503: concurrency = floor(concurrency / 2)
- Failed tasks re-enqueue at the front of the queue, not the back — so they're not starved behind new work
For 429 responses specifically, the queue reads the actual delay from the google.rpc.RetryInfo entry in the error's details payload instead of using a fixed backoff:
private static extractRetryDelayMs(err: unknown): number {
  const errObj = err as Record<string, unknown>;
  const details = (errObj?.error as Record<string, unknown>)
    ?.details as Record<string, unknown>[] | undefined;
  const retryInfo = details?.find(
    d => d['@type'] === 'type.googleapis.com/google.rpc.RetryInfo'
  );
  // retryDelay is a protobuf Duration string such as "15s";
  // parseFloat reads its numeric prefix
  const retryDelayStr = retryInfo?.['retryDelay'] as string | undefined;
  if (retryDelayStr) {
    const parsed = parseFloat(retryDelayStr);
    if (!isNaN(parsed)) return parsed * 1000;
  }
  return 15000; // fallback when the API provides no hint
}
If the API says wait 15 seconds, wait 15 seconds. Not 1 second (too aggressive), not 60 seconds (too slow).
Max 16 attempts per task. A 5-minute timeout per call. After that, permanent failure.
Typed LLM Exceptions
LLM failures aren't all the same. Treating them identically means either retrying things that can't succeed or giving up on things that would succeed on the next attempt.
// Response isn't valid JSON at all
class LlmJsonValidationException extends LlmBaseException {}

// Asked for N summaries, got M
class LlmResponseCountMismatchException extends LlmBaseException {
  constructor(
    message: string,
    public readonly expected: number,
    public readonly received: number,
    public readonly invalidResponse: unknown,
    attemptNumber: number
  ) { ... }
}

// JSON valid but missing required fields
class LlmInvalidSummaryFieldException extends LlmBaseException {}
Each type carries the context needed to decide the retry strategy. LlmResponseCountMismatchException includes what was expected vs received — useful for adjusting the prompt on retry.
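How those types might feed a retry decision, as a hypothetical sketch (the exact policy isn't shown in the post):

// Hypothetical dispatcher: each exception type implies a different strategy
function retryStrategy(
  err: unknown,
  attempt: number
): 'fail' | 'retry' | 'retry-with-count-hint' {
  if (attempt >= 16) return 'fail'; // hard attempt cap from the AIMD queue
  if (err instanceof LlmResponseCountMismatchException) {
    // expected/received let the retry prompt state the exact count required
    return 'retry-with-count-hint';
  }
  if (err instanceof LlmJsonValidationException ||
      err instanceof LlmInvalidSummaryFieldException) {
    return 'retry'; // malformed output often succeeds on a fresh sample
  }
  return 'fail';
}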
Resumable Processing
Tasks have a max attempt count. After 16 failures, a task is dropped from the queue permanently.
But the order stays in the database. The user can restart it from the frontend.
And because all intermediate results — vectors, generated MD files — are cached in Redis, restarting picks up where it left off. Pages that already embedded don't re-embed. Clusters that already generated their MD files don't regenerate. Only the failed work gets retried.
This matters because a large order can fail partway through due to external issues (API downtime, budget limits, the German spaces problem described below) and you don't want to throw away hours of completed work.
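A sketch of what the resume check can look like, assuming hypothetical Redis keys for generated MD files:

// Hypothetical: skip clusters whose MD files already exist in Redis
const cachedMd = await this.cacheService.hmget(
  hashKey,
  clusterIds.map(id => `md:${id}`)
);
const pending = clusterIds.filter((_, i) => !cachedMd[i]);
// Only `pending` clusters go back through summarization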
The German Spaces Problem
For reasons I still don't fully understand, Gemini sometimes responds to German-language content with a valid response followed by approximately 2,000,000 spaces. This hits max token limits and the call crashes.
It's not consistent. It's not reproducible on demand. It just happens sometimes with German text.
The handling: 3 retries, then permanent failure for that task. The order remains restartable. The resumable processing above means retrying the order doesn't cost anything for already-completed work.
What the Output Looks Like
# docs.stripe.com
Comprehensive documentation for integrating Stripe payments...
## Account Setup And Management
- [Account Setup Overview](/account-setup-and-management/overview.md): ...
## Payments
...
ZIP archive: llms.txt at root, subfolders with the .md files.
Same site, both strategies side by side:
- Flat strategy output — one summary per URL
- Clustered strategy output — semantic clusters, MD files, ZIP
Stripe's ~4,000 pages: under an hour on a 4-core / 8GB server.
What's Not Solved Yet
Multilingual sites. When Stripe added German translations, the URL count jumped from ~3,500 to ~4,300. The generator processed everything and produced German MD files for what should be an English documentation site. Planned fix: a URL pattern filter as an order parameter; you specify which URL patterns to include and the crawler respects it, as sketched below.
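The planned filter could be as simple as a list of regexes applied to each sitemap URL's path. An illustrative sketch; the parameter doesn't exist yet and the patterns are made up:

// Illustrative: keep only URLs whose path matches an include pattern
const includePatterns = [/^\/docs\/(?!de\/)/]; // hypothetical "skip German" rule
const kept = sitemapUrls.filter(u =>
  includePatterns.some(re => re.test(new URL(u).pathname))
);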
Custom prompts. The summarization prompt is currently fixed. Making it a user parameter would allow different output styles — technical reference, narrative guides, API docs — without changing the pipeline.
An Observation About the Future
There are already tools that generate complete human-readable documentation sites from markdown source files.
This suggests an interesting workflow inversion:
- Generate llms.txt + structured markdown (the AI-oriented layer)
- Generate the human-facing website from those same markdown files
The markdown becomes the source of truth. Write once, publish twice — one version for humans, one for AI agents.
Try It
Generator: llmstxtgenerator.svcpool.com. Free tier up to 5,000 pages. API available.
Full spec: llmstxt.org.