DEV Community

Pedro

How I Built a Markdown Conversion API for AI Agents in Rust (and deployed it for $0.000003 per request)

AI agents are only as good as the context you feed them. The problem? Most documentation pages are wrapped in layers of navigation menus, sidebars, cookie banners, and footer links that waste tokens and confuse models. I built CleanMark to fix that — a single-endpoint API that takes any URL and returns clean, structured Markdown ready for LLM consumption.

Here's the full story: why I built it, the technical decisions behind it, and what I learned.


The Problem

If you've ever built a RAG pipeline or an AI agent that reads documentation, you've hit this wall:

# What you want
docs = load("https://docs.stripe.com/api/charges")
agent.feed(docs)

# What you actually get
"Skip to main content · Products · Docs · Sign in · Sign up ·
☰ Menu · API Reference · Charges · ... · © 2024 Stripe, Inc.
Terms · Privacy · Cookies · Do Not Sell My Data"

LangChain's WebBaseLoader uses BeautifulSoup under the hood and grabs everything. You spend more tokens on navigation than on actual content. And if you try to clean it manually, you're writing custom scrapers for every site you touch.

I wanted a single API call that just works — send a URL, get back clean Markdown.


Why Rust + AWS Lambda

I had two constraints: low latency and near-zero cost.

Rust was the obvious choice for a stateless text-processing workload:

  • The binary is ~2.5 MB stripped
  • Memory usage stays under 25 MB at peak
  • No garbage collector means predictable latency
  • The scraper crate (built on html5ever) is fast and battle-tested

AWS Lambda ARM64 (Graviton2) for the runtime:

  • ~20% faster and ~20% cheaper than x86_64
  • 128 MB memory tier is enough — Rust doesn't need more
  • Custom runtime (provided.al2023) means no managed runtime overhead
  • Cold start: ~35ms. Warm invocation: ~300–500ms (dominated by the upstream HTTP fetch, not our processing)

Cost per request:

| Component | Cost |
| --- | --- |
| Lambda compute (500 ms, 128 MB, ARM64) | ~$0.0000021 |
| API Gateway HTTP API | ~$0.0000010 |
| **Total** | **~$0.000003** |

At 1 million requests/month: ~$3 in AWS costs. The margins make this viable as a commercial API.
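As a sanity check, the monthly figure follows from the table's own per-request estimates (these are the article's numbers, not official AWS pricing):

```python
# Per-request component costs, taken from the table above (estimates).
lambda_compute = 0.0000021  # 500 ms at 128 MB on ARM64
api_gateway = 0.0000010     # HTTP API per-request charge

per_request = lambda_compute + api_gateway  # ~$0.0000031
monthly = per_request * 1_000_000           # ~$3.10 at 1M requests/month
print(f"${monthly:.2f}/month")
```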


Architecture

Client
  │
  ▼
RapidAPI Gateway (auth, rate limiting, metering)
  │  X-RapidAPI-Proxy-Secret injected
  ▼
AWS API Gateway HTTP API (v2)
  │
  ▼
AWS Lambda (Rust, ARM64, 128MB)
  ├── Auth check (proxy secret)
  ├── Parse request (JSON body or ?url= query param)
  ├── Fetch HTML (reqwest + rustls, 8s timeout, 5MB cap)
  ├── Parse DOM (scraper / html5ever)
  ├── Extract content (noise filtering + content selectors)
  ├── Convert to GFM Markdown (custom renderer)
  └── Post-process (clean citations, fix typography, resolve URLs)
  │
  ▼
JSON Response { title, markdown, processing_time_ms }

No database, no cache, no queue. Pure stateless function.


The HTML → Markdown Pipeline

This is where most of the work lives. Let me walk through each stage.

Stage 1: Content Selection

Before rendering anything, we need to find the main content container. HTML pages have a predictable hierarchy of candidates:

const CONTENT_SELECTORS: &[&str] = &[
    "#mw-content-text",           // Wikipedia
    "article",                    // Semantic HTML5
    "[role='main']",              // ARIA
    "main",                       // HTML5
    ".post-content",              // Common blog CMS
    ".entry-content",
    "#content", "#main-content",
    // ... more fallbacks
];

We try each selector in order and take the first match. If nothing matches, we fall back to <body>.
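The first-match-wins loop is simple; here's a language-agnostic sketch in Python, where the `select` callback stands in for the scraper crate's selector query (the names and the abbreviated selector list are illustrative, not the production API):

```python
# Abbreviated version of the selector priority list from the article.
CONTENT_SELECTORS = ["#mw-content-text", "article", "[role='main']", "main"]

def pick_content(select, selectors=CONTENT_SELECTORS):
    """Return the first element matched by the priority list.

    `select(css)` must return a list of matches; if no selector
    hits, fall back to the <body> element.
    """
    for sel in selectors:
        hits = select(sel)
        if hits:
            return hits[0]
    return select("body")[0]
```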

Stage 2: Noise Filtering

Even inside the content container, there's noise. We filter at two levels:

Tag-level — some elements are always noise:

const NOISE_TAGS: &[&str] = &[
    "script", "style", "noscript", "nav", "header", "footer",
    "aside", "form", "button", "input", "iframe", "svg", "canvas",
];

Class/ID-level — structural UI patterns:

const NOISE_WORDS: &[&str] = &[
    "navigation", "sidebar", "breadcrumb", "pagination",
    "cookie", "banner", "advertisement", "social", "share",
    "related-posts", "newsletter", "subscribe", "comments",
    // ... ~50 patterns
];

Link density check — nav-heavy blocks that slipped through get caught here:

fn is_link_dense(el: &ElementRef) -> bool {
    // In production this selector is parsed once and reused.
    let a_sel = Selector::parse("a").unwrap();
    let total: usize = el.text().map(|t| t.len()).sum();
    if total == 0 || total >= 400 { return false; }
    let linked: usize = el.select(&a_sel)
        .flat_map(|a| a.text())
        .map(|t| t.len())
        .sum();
    (linked as f64 / total as f64) > 0.60
}

If more than 60% of the text in an element is inside <a> tags and the block is small, it's navigation — skip it.
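Stripped of the DOM plumbing, the heuristic is two thresholds over character counts; a hypothetical Python version:

```python
def is_link_dense(total_chars, linked_chars, max_block=400, threshold=0.60):
    """A block is treated as navigation if it's small and mostly link text."""
    if total_chars == 0 or total_chars >= max_block:
        return False  # empty or large blocks are never treated as nav
    return linked_chars / total_chars > threshold
```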

Stage 3: DOM → Markdown Rendering

I wrote a custom renderer instead of using an existing crate. The main reason: full control over edge cases that matter for AI consumption (tables, nested lists, code blocks with syntax highlighting).

The renderer has two contexts:

  • Block context (dom_to_md): headings, paragraphs, lists, tables, code blocks
  • Inline context (inline_md): bold, italic, inline code, links

A few interesting decisions:

Tables get rendered as proper GFM tables with pipe syntax. LLMs handle these well and they preserve the relational structure of data.
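The pipe-table shape itself is mechanical; a minimal sketch of the GFM output format (not the actual renderer):

```python
def to_gfm_table(header, rows):
    """Render a header row plus data rows as a GFM pipe table."""
    def line(cells):
        return "| " + " | ".join(cells) + " |"
    out = [line(header), line(["---"] * len(header))]  # header + delimiter row
    out += [line(r) for r in rows]
    return "\n".join(out)
```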

Code blocks go through a cleaning pipeline:

fn render_code_block(el: ElementRef) -> String {
    let lang = detect_code_lang(&el);
    let raw = code_text(&el);
    let trimmed = trim_code_edges(&raw);
    let cleaned = clean_code_text(&trimmed);            // strip literal color tags
    let formatted = maybe_format_json(&lang, &cleaned); // pretty-print JSON
    format!("\n```{lang}\n{formatted}\n```\n\n")
}

The clean_code_text step handles a real-world issue: some doc sites (FastAPI, for example) render terminal output with HTML color tags (<font color="">, <span style="color:...">) inside <pre> blocks. After html5ever decodes the entities, these become literal strings in the text. We strip them.

The maybe_format_json step pretty-prints single-line JSON blobs — common in API reference docs where the response example is a one-liner.
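That step can be approximated in a few lines; a Python sketch under the same "only touch valid single-line JSON" rule:

```python
import json

def maybe_format_json(lang, code):
    """Pretty-print single-line JSON blobs; return anything else untouched."""
    if lang != "json" or "\n" in code.strip():
        return code
    try:
        return json.dumps(json.loads(code), indent=2)
    except ValueError:
        return code  # not actually JSON; leave it alone
```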

Lists track their own depth independently of DOM depth. This was the trickiest bug to fix: Stripe's docs use 6+ levels of structural <li> wrappers with no text, and naively incrementing depth on every <ul> produces absurd indentation (12+ spaces). The fix is to only increment list depth when the current <li> actually contains text.
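The fix is easier to see on a toy tree. In this Python sketch each item is a `(text, children)` pair, and only items that carry text advance the depth (illustrative; the renderer's real data structure is the DOM):

```python
def render_list(items, depth=0):
    """Indent by logical list depth, not DOM depth.

    Structural wrapper items (no text of their own) recurse into
    their children without adding a level of indentation.
    """
    lines = []
    for text, children in items:
        if text:
            lines.append("  " * depth + "- " + text)
            lines.extend(render_list(children, depth + 1))
        else:
            # wrapper <li>: pass depth through unchanged
            lines.extend(render_list(children, depth))
    return lines
```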

Stage 4: Post-processing

After the raw Markdown is generated, a pipeline of pure string transformations cleans it up:

fn postprocess(md: &str, base_url: &str) -> String {
    let md = trim_before_first_heading(md);     // drop any text before the H1
    let md = remove_trailing_sections(&md);     // cut "See also", "References"
    let md = resolve_relative_urls(&md, base_url); // /docs/api → https://...
    let md = clean_citations(&md);              // remove [1], [2] footnote refs
    let md = fix_spaced_identifiers(&md);       // "receipt _ email" → "receipt_email"
    let md = normalize_typography(&md);         // curly quotes → straight
    collapse_blank_lines(&md)                   // max 2 consecutive blank lines
}

The fix_spaced_identifiers step handles Stripe docs specifically — their parameter names are displayed with spaces around underscores for visual reasons (receipt _ email), which is useless and confusing for LLMs.
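That transformation boils down to a one-line regex; a Python equivalent (hypothetical, the production code is Rust):

```python
import re

def fix_spaced_identifiers(md):
    """Collapse 'receipt _ email' back into 'receipt_email'.

    Lookarounds keep the neighboring word characters unconsumed,
    so chains like 'a _ b _ c' collapse fully in one pass.
    """
    return re.sub(r"(?<=\w) _ (?=\w)", "_", md)
```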


Deployment

The build and deploy flow uses cargo-lambda:

# 1. Build for ARM64 Lambda
cargo lambda build --release --arm64

# 2. Re-zip (cargo-lambda doesn't update the ZIP automatically — learned this the hard way)
zip -j target/lambda/bootstrap/bootstrap.zip target/lambda/bootstrap/bootstrap

# 3. Deploy
RAPIDAPI_PROXY_SECRET="your-secret" serverless deploy

Step 2 is critical. cargo lambda build rebuilds the binary but does not update the .zip artifact if it already exists. If you skip it, Serverless Framework deploys the old binary. I chased this bug for an embarrassing amount of time.

The serverless.yml is minimal:

provider:
  name: aws
  runtime: provided.al2023
  architecture: arm64
  memorySize: 128
  timeout: 10

  httpApi:
    cors:
      allowedOrigins: ["*"]
    disableDefaultEndpoint: false

  environment:
    RAPIDAPI_PROXY_SECRET: ${env:RAPIDAPI_PROXY_SECRET, ""}
    MAX_BODY_BYTES: "5242880"

Security

The API is published on RapidAPI, which handles auth, rate limiting, and metering. On the backend, every request must carry an X-RapidAPI-Proxy-Secret header — a shared secret between RapidAPI and the Lambda that prevents callers from bypassing RapidAPI and hitting the AWS URL directly.

if let Ok(expected) = std::env::var("RAPIDAPI_PROXY_SECRET") {
    if !expected.is_empty() {
        let provided = event.payload
            .get("headers")
            .and_then(|h| h.get("x-rapidapi-proxy-secret"))
            .and_then(|v| v.as_str())
            .unwrap_or("");
        if provided != expected {
            return Ok(error_response(401, "unauthorized"));
        }
    }
}

Note: HTTP API v2 lowercases all header names before forwarding, so we check x-rapidapi-proxy-secret (lowercase), not the original casing.
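A defensive way to handle this is to normalize casing at lookup time instead of hard-coding one form; a small Python sketch of the idea:

```python
def get_header(headers, name):
    """Look up an HTTP header case-insensitively, defaulting to ''."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get(name.lower(), "")
```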


Validation

Before publishing, I ran the API against 200 documentation pages across 20 batches:

| Category | Pages | Result |
| --- | --- | --- |
| GitHub REST API | 10 | ✅ 9 OK, 1 minor nbsp |
| Stripe API | 10 | ✅ 10 OK |
| MDN Web Docs | 10 | ✅ 10 OK |
| React / Vue / Angular | 10 | ✅ 10 OK |
| Python / Node / Rust / Go | 10 | ✅ 10 OK |
| PostgreSQL / Redis / MongoDB | 10 | ✅ 10 OK |
| AWS / Kubernetes / Docker | 10 | ✅ 9 OK, 1 minor indent |
| FastAPI / Django / Flask | 10 | ✅ 10 OK |
| GraphQL / Apollo / Prisma | 10 | ✅ 8 OK, 2 SPA |
| Testing tools | 10 | ✅ 10 OK |

96 of the 100 pages above extracted cleanly. The two hard failures were SPAs (Nuxt- and Remix-style JavaScript-rendered pages with no static HTML content); the other two misses were minor formatting nits (a stray nbsp, an indentation glitch). The SPA cases are a known limitation, not a bug.


Limitations

Being upfront about what doesn't work:

  • SPAs: Sites that render content via JavaScript (Notion, some dashboards, HashiCorp Developer Portal) return little or no content. You'd need a headless browser for those.
  • Auth-gated pages: The API fetches anonymously.
  • 5 MB page cap: Large pages are truncated.
  • 8s timeout: Slow servers may time out.

Using the API

Available on RapidAPI with a free tier (100 requests/month):

👉 rapidapi.com/pedroneto2/api/cleanmark

Quick example in Python:

import requests

response = requests.post(
    "https://cleanmark.p.rapidapi.com/convert",
    json={"url": "https://docs.python.org/3/library/json.html"},
    headers={
        "X-RapidAPI-Key": "YOUR_KEY",
        "X-RapidAPI-Host": "cleanmark.p.rapidapi.com"
    }
)

data = response.json()
print(data["markdown"])

With LangChain:

from langchain.schema import Document
import requests

def fetch_doc(url: str) -> Document:
    res = requests.post(
        "https://cleanmark.p.rapidapi.com/convert",
        json={"url": url},
        headers={
            "X-RapidAPI-Key": "YOUR_KEY",
            "X-RapidAPI-Host": "cleanmark.p.rapidapi.com"
        }
    )
    data = res.json()
    return Document(
        page_content=data["markdown"],
        metadata={"source": url, "title": data["title"]}
    )

What I'd do differently

Add caching. A Redis layer (Upstash would be free at this scale) could cache popular pages for a few hours. Most documentation pages don't change minute-to-minute, and eliminating the upstream fetch would cut latency from ~400ms to ~20ms for cache hits.
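The cache layer could be as simple as a TTL-keyed map in front of the fetch. An in-memory Python sketch of the shape (in practice the dict would be replaced by Redis/Upstash calls; all names here are illustrative):

```python
import time

def make_cached_fetch(fetch, ttl_seconds=3600):
    """Wrap a fetch(url) callable with a time-based in-memory cache."""
    cache = {}
    def cached(url, now=time.time):
        entry = cache.get(url)
        if entry and now() - entry[0] < ttl_seconds:
            return entry[1]            # cache hit: skip the upstream fetch
        result = fetch(url)
        cache[url] = (now(), result)   # store (timestamp, payload)
        return result
    return cached
```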

Streaming response. For large pages, it would be better to stream the Markdown as it's generated rather than buffering the whole thing. Lambda supports response streaming, though it requires a Lambda Function URL — API Gateway HTTP API buffers responses.

Headless browser fallback. For SPAs, spin up a Chromium instance on demand (AWS Lambda supports it with a layer). The cost would be ~10x higher per request, but it would handle Notion, Shopify, and similar sites.


Conclusion

The whole project — from idea to deployed API — took about a weekend. Rust made the core logic fast to write once I understood the scraper crate's API. The hardest parts were the edge cases: list depth tracking, code block cleaning, and the cargo-lambda zip bug.

If you're building AI agents or RAG pipelines and need a reliable way to ingest documentation, give it a try. The free tier is there, no credit card required.

Source code and full test results are available on request — happy to open source it if there's interest.


Built with Rust · AWS Lambda ARM64 · Serverless Framework · RapidAPI
