# Public Docs Are the Easiest RAG Source to Get Wrong
Every AI support project eventually reaches for public documentation. The pages are already written. They are structured. They explain the product better than any internal wiki. A crawler can fetch them in minutes.
Then the problems start.
The docs site changes its navigation and half your URLs 404. Versioned pages duplicate the same paragraph across /v2/, /v3/, and /latest/. Code blocks lose their indentation and language annotations when your HTML parser flattens everything to text. Tables — the part developers actually reference — turn into a wall of unseparated words. The retrieval system answers confidently from a page that was deleted three months ago. And because there are no citations, the user cannot tell whether the model is quoting official docs or hallucinating something plausible.
Public documentation looks like low-friction input. In practice, it is harder to get right than your own internal content — because you do not control the source, the structure, or the lifecycle.
Three categories of problems make public docs uniquely tricky for RAG: legal constraints that do not apply to your own content, structural fidelity that most ingestion pipelines destroy, and lifecycle management for content you did not author and cannot predict. Each one can quietly degrade your system. Together, they are why the simplest-looking RAG source often produces the least reliable answers.
## You Do Not Control This Content
When you build RAG over your own knowledge base, you control the content. You know when it changes, why it changes, and whether it should be in the index. Public docs are someone else's content, served on someone else's infrastructure, under someone else's rules.
That changes the ingestion calculus in ways developers tend to skip over.
### Robots.txt Is a Signal, Not a Contract
Start with robots.txt, but do not stop there.
robots.txt tells automated agents which paths the site owner asks crawlers not to access. It is not a complete legal framework, and it is not a universal answer. But ignoring it is a bad default. If the docs site disallows a path for automated access, your ingestion job should treat that as a stop sign unless you have a separate permission basis.
The nuances matter for documentation sites specifically:
- **Versioned paths may have different rules.** A site might allow `/docs/latest/` but disallow `/docs/v1/` through `/docs/v5/`. If your RAG system needs historical versions, that distinction is the difference between allowed and disallowed ingestion.
- **API reference pages are often disallowed separately from guides.** Auto-generated reference pages have different crawl rules than hand-written tutorials. Check both.
- **Crawl-delay guidance exists.** Some `robots.txt` files specify a `Crawl-delay` directive. A docs site asking for 10 seconds between requests is telling you something about their infrastructure. A full re-index that ignores that directive can get your IP blocked.
- **Sitemaps are listed in `robots.txt`.** A sitemap-driven ingestion job is easier to control than a recursive crawler. Sitemaps give you canonical URLs, `<lastmod>` timestamps, and the complete page inventory. Use them.
A robots.txt check is five minutes of work and saves you from the most obvious compliance failures. But it is only the first check, not the last.
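To make the pre-flight check concrete, here is a minimal sketch in TypeScript, assuming a Node 18+ runtime with built-in `fetch`. The matching is deliberately simplified (prefix-only, no wildcards); a production crawler should use a dedicated `robots.txt` parser.

```typescript
// Minimal robots.txt pre-flight for an ingestion job. Simplified:
// real robots.txt matching has wildcard and precedence rules that
// this sketch does not implement.
interface RobotsRules {
  disallow: string[]; // path prefixes disallowed for our user agent
  crawlDelay: number; // seconds to wait between requests (0 = unspecified)
  sitemaps: string[]; // sitemap URLs listed in the file
}

async function fetchRobots(origin: string, userAgent: string): Promise<RobotsRules> {
  const rules: RobotsRules = { disallow: [], crawlDelay: 0, sitemaps: [] };
  const res = await fetch(new URL("/robots.txt", origin));
  if (!res.ok) return rules; // no robots.txt: nothing explicitly disallowed

  let applies = false; // inside a User-agent block that covers us?
  for (const raw of (await res.text()).split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    const [key, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    switch (key.toLowerCase()) {
      case "user-agent":
        applies = value === "*" || userAgent.toLowerCase().includes(value.toLowerCase());
        break;
      case "disallow":
        if (applies && value) rules.disallow.push(value);
        break;
      case "crawl-delay":
        if (applies) rules.crawlDelay = Number(value) || 0;
        break;
      case "sitemap":
        rules.sitemaps.push(value); // sitemap lines apply to all agents
        break;
    }
  }
  return rules;
}

function isAllowed(rules: RobotsRules, path: string): boolean {
  return !rules.disallow.some((prefix) => path.startsWith(prefix));
}
```

Run `isAllowed` before every fetch, honor `crawlDelay` between requests, and feed `sitemaps` into the sitemap-driven inventory described later.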
### Terms Restrict More Than You Expect
Documentation sites often have terms of service that restrict automated access, redistribution, commercial use, or derivative datasets. Developers tend to skip these because the pages are public and technical. That shortcut is risky when the RAG system is part of a commercial product.
What to look for:
- **Automated access restrictions.** Some terms explicitly prohibit scraping, crawling, or automated retrieval — even if `robots.txt` allows the path. The terms are the legal layer; `robots.txt` is the technical layer. They do not always agree.
- **Caching and copying restrictions.** Terms may allow you to read the page but not store a copy. That matters because RAG systems store chunks indefinitely by default.
- **Competitive use clauses.** If you are building a product that competes with the docs owner, some terms explicitly prohibit using their documentation for that purpose.
- **Attribution requirements.** Many open-source project docs are licensed under Creative Commons or similar licenses that require attribution. Your RAG system's citations need to comply with those requirements — which means attribution is not optional.
If the documentation belongs to your own product, these concerns are simple. If it belongs to a vendor, an open-source project, or a competitor, the review matters more.
The safest ingestion pattern: have explicit permission, use the docs for a legitimate purpose, retain only what you need, and link users back to the source.
### Your Own Docs Are the Easy Case
For most teams building RAG, the first and best target is your own documentation. You control the content, you control the terms, you know when pages change, and you have every right to process them however you want.
The irony is that many teams skip their own docs ("we already have them in the CMS") and jump straight to ingesting third-party documentation where every legal and lifecycle question is harder. Start with your own content. Build the pipeline right. Then extend it to external sources where the permission basis is clear.
## HTML to Text Is Where Quality Dies
The technical problem with public docs is not fetching them. Any HTTP client can fetch an HTML page. The problem is what happens next.
Most ingestion pipelines strip HTML tags and dump the text content. That destroys the structure that makes documentation useful for retrieval.
Consider a typical API reference page. In markdown terms, its structure looks like this:
````markdown
## Create an API Key

Send a POST request to create a new API key for your project.

```bash
curl -X POST https://api.example.com/keys \
  -H "Authorization: Bearer TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "production", "scope": "project-123"}'
```

### Parameters

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| name | string | yes | Human-readable key name |
| scope | string | no | Optional project scope |

> Warning: API keys are shown only once. Store them securely.
````
After a typical `innerText` extraction, the same content becomes:
```text
Create an API Key Send a POST request to create a new API key for your project. curl -X POST https://api.example.com/keys -H Authorization: Bearer TOKEN -H Content-Type: application/json -d {"name": "production", "scope": "project-123"} Parameters Field Type Required Description name string yes Human-readable key name scope string no Optional project scope Warning: API keys are shown only once. Store them securely.
```
The second version still contains the words. It no longer contains the documentation.
The heading that marks a section boundary is gone — there is no way to chunk by topic. The code block lost its formatting, its language annotation, and the line continuations that make the `curl` command readable. The parameter table is now a run-on sentence where "name string yes" could mean anything. The warning callout, which is visually distinct in HTML, is indistinguishable from body text.
This matters for RAG in two concrete ways. First, embeddings trained on well-structured text perform worse on flattened strings where code, prose, and table data are concatenated without boundaries. Second, when the LLM generates an answer from a retrieved chunk, it cannot reproduce a table it never saw as a table, or format a code example it received as a single line.
### Markdown Preserves What Matters
The fix is to normalize HTML into markdown before chunking. Markdown keeps:
- **Headings as semantic boundaries.** `## Create an API Key` is a chunk boundary. A topic marker. A citation anchor.
- **Code blocks as code.** Indentation, language tags, and multi-line structure survive.
- **Tables as tables.** Rows, columns, and headers remain queryable and reproducible.
- **Links as references.** Internal links become navigation signals for the retrieval system.
- **Callouts as distinct elements.** Warnings, notes, and tips remain identifiable.
The conversion is not trivial. Real documentation HTML includes navigation sidebars, footer links, cookie banners, breadcrumb trails, search widgets, and JavaScript-rendered content that is invisible in the raw HTML. A good HTML-to-markdown pipeline strips all of that and keeps only the content.
Here is what the conversion looks like using the **Document to Markdown** API, which handles HTML cleanup, table preservation, and code block detection:
```typescript
import { IterationLayer } from "iterationlayer";
const client = new IterationLayer({ apiKey: "YOUR_API_KEY" });
const result = await client.convertDocumentToMarkdown({
file: {
type: "url",
url: "https://docs.example.com/api/authentication",
},
});
// result.markdown contains clean markdown with headings, code blocks, and tables preserved
```
For JavaScript-rendered documentation sites (React, Next.js, single-page apps), add `fetch_options` to request browser-based retrieval:
```json
{
"file": {
"type": "url",
"url": "https://docs.example.com/api/authentication",
"fetch_options": {
"should_render_javascript": true
}
}
}
```
The output is clean markdown — ready for chunking, embedding, and retrieval.
## Chunk by Documentation Structure, Not Token Windows
Once you have clean markdown, the next question is how to split it into chunks for embedding. The generic approach — fixed-size token windows with overlap — works for undifferentiated text. Documentation is not undifferentiated text.
Documentation has natural boundaries that token-window chunking ignores:
- A section under a `##` heading is a self-contained topic
- A parameter table belongs with its endpoint description
- A code example belongs with the paragraph that explains it
- A warning or note is a unit of meaning
Split a table from its header, and the chunk is ambiguous. Split a code example from the paragraph that sets it up, and the chunk loses intent. Merge two unrelated sections into one chunk because they happen to fit in the token window, and the embedding represents neither topic well.
### Use Heading Hierarchy as Your Chunk Map
Markdown headings give you a free topic tree. Use it.
For API reference pages, keep endpoint descriptions, parameter tables, and code examples together as one chunk. A chunk that contains only a parameter table without the endpoint path is not useful — the embedding model does not know what API those parameters belong to.
For guide pages, split at `##` and `###` headings, but carry the parent heading chain into chunk metadata. A chunk titled "Authentication" is ambiguous across products. A chunk titled "Payments API > Authentication > OAuth2 Flow" is specific enough that retrieval can rank it correctly.
For long sections, treat code blocks and tables as atomic units. If a section is too long for your embedding model's context window, split at the nearest heading or paragraph boundary — never in the middle of a table or code block.
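Here is a minimal sketch of heading-based splitting along those lines, assuming the clean markdown from the conversion step. `chunkByHeadings` is an illustrative helper, not a library API; it tracks fenced code blocks so a `#` inside a code sample never triggers a split.

```typescript
interface DocChunk {
  headingPath: string[]; // e.g. ["Payments API", "Authentication", "OAuth2 Flow"]
  content: string;
}

// Split markdown at #, ##, and ### headings, carrying the parent
// heading chain into every chunk. Code fences are treated as atomic.
function chunkByHeadings(markdown: string): DocChunk[] {
  const chunks: DocChunk[] = [];
  const path: string[] = [];
  let current: string[] = [];
  let inFence = false;

  const flush = () => {
    const content = current.join("\n").trim();
    if (content) chunks.push({ headingPath: path.filter(Boolean), content });
    current = [];
  };

  for (const line of markdown.split("\n")) {
    if (line.trimStart().startsWith("```")) inFence = !inFence;
    const match = !inFence && line.match(/^(#{1,3})\s+(.*)$/);
    if (match) {
      flush(); // close the previous section before starting a new one
      const level = match[1].length;     // 1 for #, 2 for ##, 3 for ###
      path.splice(level - 1);            // drop headings deeper than this level
      path[level - 1] = match[2].trim(); // set the heading at this level
    }
    current.push(line);
  }
  flush();
  return chunks;
}
```

Per the rules above, a further refinement: if a chunk still exceeds your embedding model's context window, split it at paragraph boundaries while keeping any table or code fence whole.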
### Carry Context Into Every Chunk
A chunk without context is a chunk that retrieves for the wrong queries. Every chunk in your index should carry:
- **Source URL** — the canonical page URL, not a crawler-generated path
- **Heading path** — the full heading chain (e.g., `["API Reference", "Authentication", "OAuth2"]`)
- **Page title** — for display in citations and for retrieval filtering
- **Product or project name** — especially important when ingesting docs from multiple sources
- **Docs version** — if the site has versioned docs, this prevents retrieving from the wrong version
- **Content hash** — for efficient update detection (see lifecycle section)
- **Retrieved-at timestamp** — so users and operators know how fresh the index is
This metadata is not optional overhead. It is what makes retrieval reliable and citations trustworthy.
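Expressed as a record type, that looks like the sketch below (field names are illustrative, not a prescribed schema):

```typescript
interface ChunkRecord {
  id: string;            // stable ID, e.g. hash of sourceUrl + headingPath
  content: string;       // the markdown chunk that gets embedded
  sourceUrl: string;     // canonical page URL, not a crawler-generated path
  headingPath: string[]; // e.g. ["API Reference", "Authentication", "OAuth2"]
  pageTitle: string;     // for display in citations and retrieval filtering
  product: string;       // disambiguates multi-source indexes
  docsVersion?: string;  // e.g. "v2"; absent for unversioned sites
  contentHash: string;   // hash of the page markdown, for update detection
  retrievedAt: string;   // ISO timestamp of the fetch that produced this chunk
}
```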
## Extract Page Metadata as Typed Fields
Some of that metadata — page title, product area, last-updated date — is visible on the page but not in the markdown body. You could try to infer it from the heading or parse it from HTML meta tags. Or you can extract it as typed, validated fields.
This is where **Website Extraction** fits into the pipeline. While Document to Markdown converts the page body for embedding, Website Extraction pulls structured metadata from the same page:
```typescript
import { IterationLayer } from "iterationlayer";
const client = new IterationLayer({ apiKey: "YOUR_API_KEY" });
const metadata = await client.extractWebsite({
file: {
type: "url",
url: "https://docs.example.com/api/authentication",
},
schema: {
fields: [
{ name: "page_title", type: "TEXT", description: "The title of the documentation page" },
{ name: "product_area", type: "TEXT", description: "The product area or API category this page covers" },
{ name: "last_updated", type: "DATE", description: "The visible last-updated date, if shown on the page" },
{ name: "api_version", type: "TEXT", description: "The API version this page documents, if specified" },
{
name: "prerequisites",
type: "ARRAY",
description: "Prerequisites or requirements listed on the page",
fields: [
{ name: "prerequisite", type: "TEXT", description: "A single prerequisite or requirement" },
],
},
],
},
});
```
```json
{
"success": true,
"data": {
"page_title": {
"type": "TEXT",
"value": "Authentication",
"confidence": 0.98,
"citations": ["Authentication"],
"source": "authentication.html"
},
"product_area": {
"type": "TEXT",
"value": "API Reference",
"confidence": 0.94,
"citations": ["API Reference > Authentication"],
"source": "authentication.html"
},
"last_updated": {
"type": "DATE",
"value": "2026-03-15",
"confidence": 0.91,
"citations": ["Last updated: March 15, 2026"],
"source": "authentication.html"
},
"api_version": {
"type": "TEXT",
"value": "v2",
"confidence": 0.89,
"citations": ["API v2"],
"source": "authentication.html"
},
"prerequisites": {
"type": "ARRAY",
"value": [
{
"prerequisite": {
"type": "TEXT",
"value": "An active API key",
"confidence": 0.95,
"citations": ["You need an active API key"],
"source": "authentication.html"
}
}
],
"confidence": 0.95,
"citations": ["Prerequisites: An active API key"],
"source": "authentication.html"
}
},
"metadata": {
"url": "https://docs.example.com/api/authentication"
}
}
```
This extracted metadata flows directly into your chunk records. Instead of guessing the product area from the URL path or regex-parsing a "Last updated" string from the HTML, you get typed, validated fields with confidence scores. The `product_area` field becomes a retrieval filter. The `last_updated` date becomes a freshness signal. The `prerequisites` array becomes context the LLM can use when generating answers.
### Two APIs, One Pipeline
The ingestion workflow for a single documentation page becomes:
1. **Convert** the page to markdown with Document to Markdown — clean body text for chunking and embedding
2. **Extract** metadata with Website Extraction — typed fields for indexing, filtering, and citations
3. **Chunk** the markdown by heading hierarchy, attaching extracted metadata to every chunk
4. **Embed** and store in your vector database
Both API calls use the same auth token and the same credit pool. The first gives you content for retrieval. The second gives you structure for filtering and display. Together, they turn an HTML page into index-ready chunks with reliable metadata — without writing a custom parser for every documentation site's layout.
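Sketched end to end, using `chunkByHeadings` from earlier plus hypothetical `contentHash` and `embedAndStore` helpers standing in for your hasher and vector store (and assuming the SDK surfaces the response shapes shown above; the hash function is sketched in the lifecycle section below):

```typescript
import { IterationLayer } from "iterationlayer";

const client = new IterationLayer({ apiKey: "YOUR_API_KEY" });

async function ingestPage(url: string) {
  // 1. Convert: clean markdown body for chunking and embedding.
  const doc = await client.convertDocumentToMarkdown({
    file: { type: "url", url },
  });

  // 2. Extract: typed metadata for indexing, filtering, and citations.
  const meta = await client.extractWebsite({
    file: { type: "url", url },
    schema: {
      fields: [
        { name: "page_title", type: "TEXT", description: "The title of the documentation page" },
        { name: "api_version", type: "TEXT", description: "The API version this page documents, if specified" },
      ],
    },
  });

  // 3. Chunk by heading hierarchy, attaching metadata to every chunk.
  //    Field access assumes the response JSON shape shown earlier.
  const hash = contentHash(doc.markdown);
  const chunks = chunkByHeadings(doc.markdown).map((chunk) => ({
    ...chunk,
    sourceUrl: url,
    pageTitle: meta.data.page_title?.value,
    docsVersion: meta.data.api_version?.value,
    contentHash: hash,
    retrievedAt: new Date().toISOString(),
  }));

  // 4. Embed and store in your vector database.
  await embedAndStore(chunks);
}
```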
## Stale Chunks Are Worse Than Missing Chunks
The lifecycle problem with public docs is unique: you do not know when the source changes, and you have no webhook to tell you.
When a page is updated, your index still contains the old version. When a page is deleted, your index still serves answers from it. When a product deprecates a feature, your RAG system confidently recommends the deprecated approach — because the old chunk is still in the vector database with a high similarity score.
This is worse than having no answer at all. A missing answer prompts the user to search elsewhere. A confident wrong answer from stale docs erodes trust in the entire system.
### Build a Refresh Pipeline, Not a One-Time Crawl
The initial crawl is the easy part. The hard part is keeping the index aligned with the source over weeks and months.
A reliable refresh pipeline needs:
- **Sitemap-driven page inventory.** Use the sitemap as the source of truth for which pages exist. When a URL disappears from the sitemap, remove its chunks from the index. Do not rely on crawling navigation links — navigation changes break crawlers.
- **Content hashing for efficient updates.** Hash the markdown content of each page. On the next refresh cycle, re-convert the page and compare hashes. If the hash matches, skip re-embedding — the content has not changed. If the hash differs, re-chunk, re-embed, and replace the old chunks. This keeps refresh cycles fast and embedding costs proportional to actual changes (see the sketch after this list).
- **`<lastmod>` as a hint, not a guarantee.** Sitemaps include `<lastmod>` timestamps, but many sites do not update them reliably. Use them to prioritize which pages to re-fetch first, but do not skip pages just because `<lastmod>` has not changed.
- **404 handling.** When a page returns 404, remove all chunks for that URL from the index. Do not retry indefinitely. A page that 404s for two consecutive refresh cycles is gone.
- **Version pruning.** If you ingest versioned docs (`/v1/`, `/v2/`, `/v3/`), decide which versions to keep. Most RAG systems should index only the latest version unless users explicitly need historical answers. Old versions that stay in the index dilute retrieval quality — the same concept explained three slightly different ways across three versions produces confusing results.
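Here is the hash-diff loop as a sketch, building on the `ingestPage` function from earlier. `fetchSitemapUrls`, `indexedUrls`, `storedHashForUrl`, `recordFetchMiss`, and `deleteChunksForUrl` are hypothetical helpers over your sitemap parser and vector store:

```typescript
import { createHash } from "node:crypto";

const contentHash = (markdown: string) =>
  createHash("sha256").update(markdown).digest("hex");

async function refreshIndex() {
  // The sitemap is the source of truth for which pages exist.
  const liveUrls = await fetchSitemapUrls("https://docs.example.com/sitemap.xml");

  // Pages that left the sitemap are gone: remove their chunks.
  for (const url of await indexedUrls()) {
    if (!liveUrls.includes(url)) await deleteChunksForUrl(url);
  }

  for (const url of liveUrls) {
    let markdown: string;
    try {
      const doc = await client.convertDocumentToMarkdown({ file: { type: "url", url } });
      markdown = doc.markdown;
    } catch {
      // Unreachable page: record the miss; delete after two consecutive cycles.
      await recordFetchMiss(url);
      continue;
    }

    // Unchanged content: skip re-chunking and re-embedding entirely.
    if (contentHash(markdown) === (await storedHashForUrl(url))) continue;

    // Changed content: replace the old chunks wholesale.
    await deleteChunksForUrl(url);
    await ingestPage(url);
  }
}
```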
### Retention Is Not Just a Compliance Question
Beyond freshness, think about what you keep and for how long.
RAG systems become accidental archives. A page is fetched once, embedded, and then lives in a vector database indefinitely — long after the source page changed, moved, or was removed on purpose. The source owner may have removed content for a reason: a security issue, a legal retraction, a product pivot. Your system should not keep serving that content.
Public docs can also include personal data that triggers retention obligations: author names, contributor emails, support contact paths, example data with real-looking identifiers. Even though the pages are public, storing that data indefinitely in your own systems has compliance implications — especially under GDPR, where the original publication does not automatically give you a separate lawful basis for indefinite storage in a different context.
Set clear policies (sketched as configuration after this list):
- Re-fetch on a defined schedule (weekly or biweekly for active docs, monthly for stable references)
- Delete chunks for pages that no longer exist in the sitemap or return 404
- Expire old versions unless you have a specific reason to keep them
- Keep source metadata for audit trails without retaining raw HTML you do not need
- Make re-indexing deterministic so stale chunks do not silently accumulate
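One way to keep those policies enforceable is to make them explicit configuration for the refresh job rather than implicit cron settings; a sketch with illustrative names and values:

```typescript
// Illustrative retention policy, consumed by the refresh job above.
const retentionPolicy = {
  refreshIntervalDays: 7,        // weekly for active docs; ~30 for stable references
  deleteAfterConsecutive404s: 2, // a page missing two cycles in a row is gone
  keepOldVersions: false,        // index only the latest version unless users need history
  retainRawHtml: false,          // keep markdown and metadata; drop the fetched HTML
};
```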
## Attribution Is Not Optional
A RAG answer without a citation is an assertion without evidence. For public documentation, where the whole point is that the source is authoritative, stripping attribution defeats the purpose.
When a developer or support agent asks a question and gets an answer derived from public docs, they need to know:
- Which page the answer came from
- Which section of that page
- When the index last fetched that page
- Whether the cited content matches the current live page
This is not just a UX nicety. It is what separates a useful tool from a liability. An answer that cites "Stripe API Reference > Authentication > API Keys, retrieved April 20, 2026" is verifiable. An answer that says "you should use API keys" with no source is indistinguishable from a hallucination.
Good attribution requires ingestion-time metadata. If you throw away the source URL and heading path during conversion, you cannot reconstruct reliable citations later. This is why the metadata extraction step matters — it captures the citation anchors before you lose them.
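With the chunk record defined earlier, assembling a citation is a few lines; a sketch:

```typescript
// Build a verifiable citation from metadata captured at ingestion time,
// e.g. "API Reference > Authentication > OAuth2
// (https://docs.example.com/api/authentication, retrieved 2026-04-20)".
function formatCitation(chunk: ChunkRecord): string {
  const section = chunk.headingPath.join(" > ");
  const date = chunk.retrievedAt.slice(0, 10); // ISO timestamp to YYYY-MM-DD
  return `${section} (${chunk.sourceUrl}, retrieved ${date})`;
}
```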
Design the response format around citations from the beginning. Adding them later means re-processing your entire index.
## Build the Pipeline Around Source Truth
Public documentation is valuable because it is authoritative. Your RAG system should preserve that authority instead of turning docs into anonymous text chunks.
The pipeline that works:
1. Check `robots.txt` and terms before you ingest anything
2. Convert pages to clean markdown that preserves headings, code, and tables
3. Extract typed metadata for indexing, filtering, and citations
4. Chunk by documentation structure, not token windows
5. Refresh on a schedule, hash-diff to minimize re-embedding costs
6. Delete chunks when source pages disappear
7. Cite the source URL and section in every answer
For implementation, read the [Document to Markdown docs](https://iterationlayer.com/docs/document-to-markdown) for page-to-markdown conversion and the [Website Extraction docs](https://iterationlayer.com/docs/website-extraction) for schema-based metadata extraction from public pages. Both APIs accept website URLs and handle HTML cleanup, JavaScript-rendered pages, and structured output — same auth, same credit pool.