DEV Community

Pedro

How I Built a Markdown Conversion API for AI Agents in Rust (and deployed it for $0.000003 per request)

AI agents are only as good as the context you feed them. The problem? Most documentation pages are wrapped in layers of navigation menus, sidebars, cookie banners, and footer links that waste tokens and confuse models. I built CleanMark to fix that — a single-endpoint API that takes any URL and returns clean, structured Markdown ready for LLM consumption.

Here's the full story: why I built it, the technical decisions behind it, and what I learned.


The Problem

If you've ever built a RAG pipeline or an AI agent that reads documentation, you've hit this wall:

# What you want
docs = load("https://docs.stripe.com/api/charges")
agent.feed(docs)

# What you actually get
"Skip to main content · Products · Docs · Sign in · Sign up ·
☰ Menu · API Reference · Charges · ... · © 2024 Stripe, Inc.
Terms · Privacy · Cookies · Do Not Sell My Data"

LangChain's WebBaseLoader uses BeautifulSoup under the hood and grabs everything. You spend more tokens on navigation than on actual content. And if you try to clean it manually, you're writing custom scrapers for every site you touch.

I wanted a single API call that just works — send a URL, get back clean Markdown.


Why Rust + AWS Lambda

I had two constraints: low latency and near-zero cost.

Rust was the obvious choice for a stateless text-processing workload:

  • The binary is ~2.5 MB stripped
  • Memory usage stays under 25 MB at peak
  • No garbage collector means predictable latency
  • The scraper crate (built on html5ever) is fast and battle-tested

AWS Lambda ARM64 (Graviton2) for the runtime:

  • ~20% faster and ~20% cheaper than x86_64
  • 128 MB memory tier is enough — Rust doesn't need more
  • Custom runtime (provided.al2023) means no managed runtime overhead
  • Cold start: ~35ms. Warm invocation: ~300–500ms (dominated by the upstream HTTP fetch, not our processing)

Cost per request:

| Component | Cost |
| --- | --- |
| Lambda compute (500 ms, 128 MB, ARM64) | ~$0.0000021 |
| API Gateway HTTP API | ~$0.0000010 |
| **Total** | **~$0.000003** |

At 1 million requests/month: ~$3 in AWS costs. The margins make this viable as a commercial API.
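As a sanity check, the monthly figure follows from the table's own per-request estimates (these are the article's numbers, not official AWS pricing):

```python
# Per-request component costs, taken from the table above (estimates).
lambda_compute = 0.0000021  # 500 ms at 128 MB on ARM64
api_gateway = 0.0000010     # HTTP API per-request charge

per_request = lambda_compute + api_gateway  # ~$0.0000031
monthly = per_request * 1_000_000           # ~$3.10 at 1M requests/month
print(f"${monthly:.2f}/month")
```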


Architecture

Client
  │
  ▼
RapidAPI Gateway (auth, rate limiting, metering)
  │  X-RapidAPI-Proxy-Secret injected
  ▼
AWS API Gateway HTTP API (v2)
  │
  ▼
AWS Lambda (Rust, ARM64, 128MB)
  ├── Auth check (proxy secret)
  ├── Parse request (JSON body or ?url= query param)
  ├── Fetch HTML (reqwest + rustls, 8s timeout, 5MB cap)
  ├── Parse DOM (scraper / html5ever)
  ├── Extract content (noise filtering + content selectors)
  ├── Convert to GFM Markdown (custom renderer)
  └── Post-process (clean citations, fix typography, resolve URLs)
  │
  ▼
JSON Response { title, markdown, processing_time_ms }

No database, no cache, no queue. Pure stateless function.


The HTML → Markdown Pipeline

This is where most of the work lives. Let me walk through each stage.

Stage 1: Content Selection

Before rendering anything, we need to find the main content container. HTML pages have a predictable hierarchy of candidates:

const CONTENT_SELECTORS: &[&str] = &[
    "#mw-content-text",           // Wikipedia
    "article",                    // Semantic HTML5
    "[role='main']",              // ARIA
    "main",                       // HTML5
    ".post-content",              // Common blog CMS
    ".entry-content",
    "#content", "#main-content",
    // ... more fallbacks
];

We try each selector in order and take the first match. If nothing matches, we fall back to <body>.
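The first-match-wins loop is simple; here's a language-agnostic sketch in Python, where the `select` callback stands in for the scraper crate's selector query (the names and the abbreviated selector list are illustrative, not the production API):

```python
# Abbreviated version of the selector priority list from the article.
CONTENT_SELECTORS = ["#mw-content-text", "article", "[role='main']", "main"]

def pick_content(select, selectors=CONTENT_SELECTORS):
    """Return the first element matched by the priority list.

    `select(css)` must return a list of matches; if no selector
    hits, fall back to the <body> element.
    """
    for sel in selectors:
        hits = select(sel)
        if hits:
            return hits[0]
    return select("body")[0]
```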

Stage 2: Noise Filtering

Even inside the content container, there's noise. We filter at two levels:

Tag-level — some elements are always noise:

const NOISE_TAGS: &[&str] = &[
    "script", "style", "noscript", "nav", "header", "footer",
    "aside", "form", "button", "input", "iframe", "svg", "canvas",
];

Class/ID-level — structural UI patterns:

const NOISE_WORDS: &[&str] = &[
    "navigation", "sidebar", "breadcrumb", "pagination",
    "cookie", "banner", "advertisement", "social", "share",
    "related-posts", "newsletter", "subscribe", "comments",
    // ... ~50 patterns
];

Link density check — nav-heavy blocks that slipped through get caught here:

fn is_link_dense(el: &ElementRef) -> bool {
    // In production this selector is parsed once and reused.
    let a_sel = Selector::parse("a").unwrap();
    let total: usize = el.text().map(|t| t.len()).sum();
    if total == 0 || total >= 400 { return false; }
    let linked: usize = el.select(&a_sel)
        .flat_map(|a| a.text())
        .map(|t| t.len())
        .sum();
    (linked as f64 / total as f64) > 0.60
}

If more than 60% of the text in an element is inside <a> tags and the block is small, it's navigation — skip it.
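Stripped of the DOM plumbing, the heuristic is two thresholds over character counts; a hypothetical Python version:

```python
def is_link_dense(total_chars, linked_chars, max_block=400, threshold=0.60):
    """A block is treated as navigation if it's small and mostly link text."""
    if total_chars == 0 or total_chars >= max_block:
        return False  # empty or large blocks are never treated as nav
    return linked_chars / total_chars > threshold
```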

Stage 3: DOM → Markdown Rendering

I wrote a custom renderer instead of using an existing crate. The main reason: full control over edge cases that matter for AI consumption (tables, nested lists, code blocks with syntax highlighting).

The renderer has two contexts:

  • Block context (dom_to_md): headings, paragraphs, lists, tables, code blocks
  • Inline context (inline_md): bold, italic, inline code, links

A few interesting decisions:

Tables get rendered as proper GFM tables with pipe syntax. LLMs handle these well and they preserve the relational structure of data.
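The pipe-table shape itself is mechanical; a minimal sketch of the GFM output format (not the actual renderer):

```python
def to_gfm_table(header, rows):
    """Render a header row plus data rows as a GFM pipe table."""
    def line(cells):
        return "| " + " | ".join(cells) + " |"
    out = [line(header), line(["---"] * len(header))]  # header + delimiter row
    out += [line(r) for r in rows]
    return "\n".join(out)
```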

Code blocks go through a cleaning pipeline:

fn render_code_block(el: ElementRef) -> String {
    let lang = detect_code_lang(&el);
    let raw = code_text(&el);
    let trimmed = trim_code_edges(&raw);
    let cleaned = clean_code_text(&trimmed);            // strip literal color tags
    let formatted = maybe_format_json(&lang, &cleaned); // pretty-print JSON
    format!("\n```{lang}\n{formatted}\n```\n\n")
}

The clean_code_text step handles a real-world issue: some doc sites (FastAPI, for example) render terminal output with HTML color tags (<font color="">, <span style="color:...">) inside <pre> blocks. After html5ever decodes the entities, these become literal strings in the text. We strip them.

The maybe_format_json step pretty-prints single-line JSON blobs — common in API reference docs where the response example is a one-liner.
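That step can be approximated in a few lines; a Python sketch under the same "only touch valid single-line JSON" rule:

```python
import json

def maybe_format_json(lang, code):
    """Pretty-print single-line JSON blobs; return anything else untouched."""
    if lang != "json" or "\n" in code.strip():
        return code
    try:
        return json.dumps(json.loads(code), indent=2)
    except ValueError:
        return code  # not actually JSON; leave it alone
```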

Lists track their own depth independently of DOM depth. This was the trickiest bug to fix: Stripe's docs use 6+ levels of structural <li> wrappers with no text, and naively incrementing depth on every <ul> produces absurd indentation (12+ spaces). The fix is to only increment list depth when the current <li> actually contains text.
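The fix is easier to see on a toy tree. In this Python sketch each item is a `(text, children)` pair, and only items that carry text advance the depth (illustrative; the renderer's real data structure is the DOM):

```python
def render_list(items, depth=0):
    """Indent by logical list depth, not DOM depth.

    Structural wrapper items (no text of their own) recurse into
    their children without adding a level of indentation.
    """
    lines = []
    for text, children in items:
        if text:
            lines.append("  " * depth + "- " + text)
            lines.extend(render_list(children, depth + 1))
        else:
            # wrapper <li>: pass depth through unchanged
            lines.extend(render_list(children, depth))
    return lines
```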

Stage 4: Post-processing

After the raw Markdown is generated, a pipeline of pure string transformations cleans it up:

fn postprocess(md: &str, base_url: &str) -> String {
    let md = trim_before_first_heading(md);     // drop any text before the H1
    let md = remove_trailing_sections(&md);     // cut "See also", "References"
    let md = resolve_relative_urls(&md, base_url); // /docs/api → https://...
    let md = clean_citations(&md);              // remove [1], [2] footnote refs
    let md = fix_spaced_identifiers(&md);       // "receipt _ email" → "receipt_email"
    let md = normalize_typography(&md);         // curly quotes → straight
    collapse_blank_lines(&md)                   // max 2 consecutive blank lines
}

The fix_spaced_identifiers step handles Stripe docs specifically — their parameter names are displayed with spaces around underscores for visual reasons (receipt _ email), which is useless and confusing for LLMs.
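That transformation boils down to a one-line regex; a Python equivalent (hypothetical, the production code is Rust):

```python
import re

def fix_spaced_identifiers(md):
    """Collapse 'receipt _ email' back into 'receipt_email'.

    Lookarounds keep the neighboring word characters unconsumed,
    so chains like 'a _ b _ c' collapse fully in one pass.
    """
    return re.sub(r"(?<=\w) _ (?=\w)", "_", md)
```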


Deployment

The build and deploy flow uses cargo-lambda:

# 1. Build for ARM64 Lambda
cargo lambda build --release --arm64

# 2. Re-zip (cargo-lambda doesn't update the ZIP automatically — learned this the hard way)
zip -j target/lambda/bootstrap/bootstrap.zip target/lambda/bootstrap/bootstrap

# 3. Deploy
RAPIDAPI_PROXY_SECRET="your-secret" serverless deploy

Step 2 is critical. cargo lambda build rebuilds the binary but does not update the .zip artifact if it already exists. If you skip it, Serverless Framework deploys the old binary. I chased this bug for an embarrassing amount of time.

The serverless.yml is minimal:

provider:
  name: aws
  runtime: provided.al2023
  architecture: arm64
  memorySize: 128
  timeout: 10

  httpApi:
    cors:
      allowedOrigins: ["*"]
    disableDefaultEndpoint: false

  environment:
    RAPIDAPI_PROXY_SECRET: ${env:RAPIDAPI_PROXY_SECRET, ""}
    MAX_BODY_BYTES: "5242880"

Security

The API is published on RapidAPI, which handles auth, rate limiting, and metering. On the backend, every request must carry an X-RapidAPI-Proxy-Secret header — a shared secret between RapidAPI and the Lambda that prevents callers from bypassing RapidAPI and hitting the AWS URL directly.

if let Ok(expected) = std::env::var("RAPIDAPI_PROXY_SECRET") {
    if !expected.is_empty() {
        let provided = event.payload
            .get("headers")
            .and_then(|h| h.get("x-rapidapi-proxy-secret"))
            .and_then(|v| v.as_str())
            .unwrap_or("");
        if provided != expected {
            return Ok(error_response(401, "unauthorized"));
        }
    }
}

Note: HTTP API v2 lowercases all header names before forwarding, so we check x-rapidapi-proxy-secret (lowercase), not the original casing.
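A defensive way to handle this is to normalize casing at lookup time instead of hard-coding one form; a small Python sketch of the idea:

```python
def get_header(headers, name):
    """Look up an HTTP header case-insensitively, defaulting to ''."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get(name.lower(), "")
```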


Validation

Before publishing, I ran the API against 200 documentation pages across 20 batches:

| Category | Pages | Result |
| --- | --- | --- |
| GitHub REST API | 10 | ✅ 9 OK, 1 minor nbsp |
| Stripe API | 10 | ✅ 10 OK |
| MDN Web Docs | 10 | ✅ 10 OK |
| React / Vue / Angular | 10 | ✅ 10 OK |
| Python / Node / Rust / Go | 10 | ✅ 10 OK |
| PostgreSQL / Redis / MongoDB | 10 | ✅ 10 OK |
| AWS / Kubernetes / Docker | 10 | ✅ 9 OK, 1 minor indent |
| FastAPI / Django / Flask | 10 | ✅ 10 OK |
| GraphQL / Apollo / Prisma | 10 | ✅ 8 OK, 2 SPA |
| Testing tools | 10 | ✅ 10 OK |

96 of the 100 pages above extracted cleanly. The two hard failures were SPAs (Nuxt- and Remix-style JavaScript-rendered pages with no static HTML content); the other two misses were minor formatting nits (a stray nbsp, an indentation glitch). The SPA cases are a known limitation, not a bug.


Limitations

Being upfront about what doesn't work:

  • SPAs: Sites that render content via JavaScript (Notion, some dashboards, HashiCorp Developer Portal) return little or no content. You'd need a headless browser for those.
  • Auth-gated pages: The API fetches anonymously.
  • 5 MB page cap: Large pages are truncated.
  • 8s timeout: Slow servers may time out.

Using the API

Available on RapidAPI with a free tier (100 requests/month):

👉 rapidapi.com/pedroneto2/api/cleanmark

Quick example in Python:

import requests

response = requests.post(
    "https://cleanmark.p.rapidapi.com/convert",
    json={"url": "https://docs.python.org/3/library/json.html"},
    headers={
        "X-RapidAPI-Key": "YOUR_KEY",
        "X-RapidAPI-Host": "cleanmark.p.rapidapi.com"
    }
)

data = response.json()
print(data["markdown"])

With LangChain:

from langchain.schema import Document
import requests

def fetch_doc(url: str) -> Document:
    res = requests.post(
        "https://cleanmark.p.rapidapi.com/convert",
        json={"url": url},
        headers={
            "X-RapidAPI-Key": "YOUR_KEY",
            "X-RapidAPI-Host": "cleanmark.p.rapidapi.com"
        }
    )
    data = res.json()
    return Document(
        page_content=data["markdown"],
        metadata={"source": url, "title": data["title"]}
    )

What I'd do differently

Add caching. A Redis layer (Upstash would be free at this scale) could cache popular pages for a few hours. Most documentation pages don't change minute-to-minute, and eliminating the upstream fetch would cut latency from ~400ms to ~20ms for cache hits.
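The cache layer could be as simple as a TTL-keyed map in front of the fetch. An in-memory Python sketch of the shape (in practice the dict would be replaced by Redis/Upstash calls; all names here are illustrative):

```python
import time

def make_cached_fetch(fetch, ttl_seconds=3600):
    """Wrap a fetch(url) callable with a time-based in-memory cache."""
    cache = {}
    def cached(url, now=time.time):
        entry = cache.get(url)
        if entry and now() - entry[0] < ttl_seconds:
            return entry[1]            # cache hit: skip the upstream fetch
        result = fetch(url)
        cache[url] = (now(), result)   # store (timestamp, payload)
        return result
    return cached
```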

Streaming response. For large pages, it would be better to stream the Markdown as it's generated rather than buffering the whole thing. Lambda supports response streaming, though it requires a Lambda Function URL — API Gateway HTTP API buffers responses.

Headless browser fallback. For SPAs, spin up a Chromium instance on demand (AWS Lambda supports it with a layer). The cost would be ~10x higher per request, but it would handle Notion, Shopify, and similar sites.


Conclusion

The whole project — from idea to deployed API — took about a weekend. Rust made the core logic fast to write once I understood the scraper crate's API. The hardest parts were the edge cases: list depth tracking, code block cleaning, and the cargo-lambda zip bug.

If you're building AI agents or RAG pipelines and need a reliable way to ingest documentation, give it a try. The free tier is there, no credit card required.

Source code and full test results are available on request — happy to open source it if there's interest.


Built with Rust · AWS Lambda ARM64 · Serverless Framework · RapidAPI
