How I Built a Markdown Conversion API for AI Agents in Rust (and deployed it for $0.000003 per request)
AI agents are only as good as the context you feed them. The problem? Most documentation pages are wrapped in layers of navigation menus, sidebars, cookie banners, and footer links that waste tokens and confuse models. I built CleanMark to fix that — a single-endpoint API that takes any URL and returns clean, structured Markdown ready for LLM consumption.
Here's the full story: why I built it, the technical decisions behind it, and what I learned.
The Problem
If you've ever built a RAG pipeline or an AI agent that reads documentation, you've hit this wall:
# What you want
docs = load("https://docs.stripe.com/api/charges")
agent.feed(docs)
# What you actually get
"Skip to main content · Products · Docs · Sign in · Sign up ·
☰ Menu · API Reference · Charges · ... · © 2024 Stripe, Inc.
Terms · Privacy · Cookies · Do Not Sell My Data"
LangChain's WebBaseLoader uses BeautifulSoup under the hood and grabs everything. You spend more tokens on navigation than on actual content. And if you try to clean it manually, you're writing custom scrapers for every site you touch.
I wanted a single API call that just works — send a URL, get back clean Markdown.
Why Rust + AWS Lambda
I had two constraints: low latency and near-zero cost.
Rust was the obvious choice for a stateless text-processing workload:
- The binary is ~2.5 MB stripped
- Memory usage stays under 25 MB at peak
- No garbage collector means predictable latency
- The scraper crate (built on html5ever) is fast and battle-tested
AWS Lambda ARM64 (Graviton2) for the runtime:
- ~20% faster and ~20% cheaper than x86_64
- 128 MB memory tier is enough — Rust doesn't need more
- Custom runtime (provided.al2023) means no managed runtime overhead
- Cold start: ~35ms. Warm invocation: ~300–500ms (dominated by the upstream HTTP fetch, not our processing)
Cost per request:
| Component | Cost |
|---|---|
| Lambda compute (500ms, 128MB, ARM64) | ~$0.0000021 |
| API Gateway HTTP API | ~$0.0000010 |
| Total | ~$0.000003 |
At 1 million requests/month: ~$3 in AWS costs. The margins make this viable as a commercial API.
Architecture
Client
│
▼
RapidAPI Gateway (auth, rate limiting, metering)
│ X-RapidAPI-Proxy-Secret injected
▼
AWS API Gateway HTTP API (v2)
│
▼
AWS Lambda (Rust, ARM64, 128MB)
├── Auth check (proxy secret)
├── Parse request (JSON body or ?url= query param)
├── Fetch HTML (reqwest + rustls, 8s timeout, 5MB cap)
├── Parse DOM (scraper / html5ever)
├── Extract content (noise filtering + content selectors)
├── Convert to GFM Markdown (custom renderer)
└── Post-process (clean citations, fix typography, resolve URLs)
│
▼
JSON Response { title, markdown, processing_time_ms }
No database, no cache, no queue. Pure stateless function.
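For context, the handler itself is a thin async function. A minimal sketch of the entrypoint, assuming lambda_runtime, tokio, and serde_json (the real fetching and conversion happen inside handler and are elided here):
use lambda_runtime::{run, service_fn, Error, LambdaEvent};
use serde_json::{json, Value};
async fn handler(event: LambdaEvent<Value>) -> Result<Value, Error> {
    // auth check, URL parsing, fetch, and DOM -> Markdown conversion go here
    Ok(json!({ "statusCode": 200, "body": "{\"markdown\": \"...\"}" }))
}
#[tokio::main]
async fn main() -> Result<(), Error> {
    run(service_fn(handler)).await
}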
The HTML → Markdown Pipeline
This is where most of the work lives. Let me walk through each stage.
Stage 1: Content Selection
Before rendering anything, we need to find the main content container. HTML pages have a predictable hierarchy of candidates:
const CONTENT_SELECTORS: &[&str] = &[
    "#mw-content-text",   // Wikipedia
    "article",            // Semantic HTML5
    "[role='main']",      // ARIA
    "main",               // HTML5
    ".post-content",      // Common blog CMS
    ".entry-content",
    "#content", "#main-content",
    // ... more fallbacks
];
We try each selector in order and take the first match. If nothing matches, we fall back to <body>.
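The selection loop itself is tiny. Roughly, with the scraper crate (the function name is made up for this post):
use scraper::{ElementRef, Html, Selector};
fn find_content_root(doc: &Html) -> ElementRef<'_> {
    for sel in CONTENT_SELECTORS {
        if let Ok(selector) = Selector::parse(sel) {
            if let Some(el) = doc.select(&selector).next() {
                return el;
            }
        }
    }
    // Nothing matched: fall back to <body>, which html5ever always creates.
    let body = Selector::parse("body").unwrap();
    doc.select(&body).next().expect("parsed document always has a body")
}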
Stage 2: Noise Filtering
Even inside the content container, there's noise. We filter at two levels:
Tag-level — some elements are always noise:
const NOISE_TAGS: &[&str] = &[
    "script", "style", "noscript", "nav", "header", "footer",
    "aside", "form", "button", "input", "iframe", "svg", "canvas",
];
Class/ID-level — structural UI patterns:
const NOISE_WORDS: &[&str] = &[
    "navigation", "sidebar", "breadcrumb", "pagination",
    "cookie", "banner", "advertisement", "social", "share",
    "related-posts", "newsletter", "subscribe", "comments",
    // ... ~50 patterns
];
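The class/ID filter is a plain substring match against the element's attributes. An illustrative sketch, not the exact production code:
fn is_noise_element(el: &scraper::ElementRef) -> bool {
    // An element counts as noise if its class or id contains any noise word.
    let attrs = format!(
        "{} {}",
        el.value().attr("class").unwrap_or(""),
        el.value().attr("id").unwrap_or("")
    )
    .to_lowercase();
    NOISE_WORDS.iter().any(|w| attrs.contains(w))
}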
Link density check — nav-heavy blocks that slipped through get caught here:
fn is_link_dense(el: &ElementRef) -> bool {
    let a_sel = Selector::parse("a").unwrap();
    let total: usize = el.text().map(|t| t.len()).sum();
    if total == 0 || total >= 400 { return false; }
    let linked: usize = el.select(&a_sel)
        .flat_map(|a| a.text())
        .map(|t| t.len())
        .sum();
    (linked as f64 / total as f64) > 0.60
}
If more than 60% of the text in an element is inside <a> tags and the block is small, it's navigation — skip it.
Stage 3: DOM → Markdown Rendering
I wrote a custom renderer instead of using an existing crate. The main reason: full control over edge cases that matter for AI consumption (tables, nested lists, code blocks with syntax highlighting).
The renderer has two contexts:
- Block context (dom_to_md): headings, paragraphs, lists, tables, code blocks
- Inline context (inline_md): bold, italic, inline code, links
A few interesting decisions:
Tables get rendered as proper GFM tables with pipe syntax. LLMs handle these well and they preserve the relational structure of data.
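For illustration, the table rendering boils down to something like this (names are made up; the real renderer walks the <thead>/<tbody> nodes directly):
fn render_table(headers: &[String], rows: &[Vec<String>]) -> String {
    let mut out = String::new();
    out.push_str(&format!("| {} |\n", headers.join(" | ")));
    // GFM separator row: one "---" cell per column.
    out.push_str(&format!("|{}\n", " --- |".repeat(headers.len())));
    for row in rows {
        out.push_str(&format!("| {} |\n", row.join(" | ")));
    }
    out
}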
Code blocks go through a cleaning pipeline:
fn render_code_block(el: ElementRef) -> String {
    let lang = detect_code_lang(&el);
    let raw = code_text(&el);
    let trimmed = trim_code_edges(&raw);
    let cleaned = clean_code_text(&trimmed);             // strip leftover color tags
    let formatted = maybe_format_json(&lang, &cleaned);  // pretty-print JSON
    format!("\n```{lang}\n{formatted}\n```\n\n")
}
The clean_code_text step handles a real-world issue: some doc sites (FastAPI, for example) render terminal output with HTML color tags (<font color="">, <span style="color:...">) inside <pre> blocks. After html5ever decodes the entities, these become literal strings in the text. We strip them.
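A minimal sketch of that cleanup, assuming the regex crate (the real clean_code_text may cover more cases):
fn clean_code_text(code: &str) -> String {
    // Remove leftover <font ...>, </font>, <span ...>, </span> strings that
    // survive entity decoding inside <pre> blocks.
    let re = regex::Regex::new(r"(?i)</?(?:font|span)[^>]*>").unwrap();
    re.replace_all(code, "").into_owned()
}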
The maybe_format_json step pretty-prints single-line JSON blobs — common in API reference docs where the response example is a one-liner.
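And a sketch of the JSON step, assuming serde_json (the real heuristics may differ):
fn maybe_format_json(lang: &str, code: &str) -> String {
    // Only touch blocks that look like JSON and sit on a single line.
    let looks_like_json = lang == "json"
        || code.trim_start().starts_with(|c: char| c == '{' || c == '[');
    if looks_like_json && !code.trim().contains('\n') {
        if let Ok(value) = serde_json::from_str::<serde_json::Value>(code) {
            if let Ok(pretty) = serde_json::to_string_pretty(&value) {
                return pretty;
            }
        }
    }
    code.to_string()
}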
Lists track their own depth independently of DOM depth. This was the trickiest bug to fix: Stripe's docs use 6+ levels of structural <li> wrappers with no text, and naively incrementing depth on every <ul> produces absurd indentation (12+ spaces). The fix is to only increment list depth when the current <li> actually contains text.
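In sketch form, with an invented helper name but the same rule:
fn child_list_depth(li: &scraper::ElementRef, current_depth: usize) -> usize {
    // Look only at the <li>'s direct text children, not text inside nested lists.
    let has_own_text = li
        .children()
        .filter_map(|node| node.value().as_text())
        .any(|t| !t.trim().is_empty());
    // Wrapper-only <li>s contribute no extra indentation; real items indent one level.
    if has_own_text { current_depth + 1 } else { current_depth }
}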
Stage 4: Post-processing
After the raw Markdown is generated, a pipeline of pure string transformations cleans it up:
fn postprocess(md: &str, base_url: &str) -> String {
    let md = trim_before_first_heading(md);        // drop any text before the H1
    let md = remove_trailing_sections(&md);        // cut "See also", "References"
    let md = resolve_relative_urls(&md, base_url); // /docs/api → https://...
    let md = clean_citations(&md);                 // remove [1], [2] footnote refs
    let md = fix_spaced_identifiers(&md);          // "receipt _ email" → "receipt_email"
    let md = normalize_typography(&md);            // curly quotes → straight
    collapse_blank_lines(&md)                      // max 2 consecutive blank lines
}
The fix_spaced_identifiers step handles Stripe docs specifically — their parameter names are displayed with spaces around underscores for visual reasons (receipt _ email), which is useless and confusing for LLMs.
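A minimal sketch of that step, assuming the regex crate (the real implementation may differ):
fn fix_spaced_identifiers(md: &str) -> String {
    let re = regex::Regex::new(r"(\w) _ (\w)").unwrap();
    let mut out = md.to_string();
    // Run to a fixed point so chains like "automatic _ payment _ methods" collapse fully.
    loop {
        let next = re.replace_all(&out, "${1}_${2}").into_owned();
        if next == out { return out; }
        out = next;
    }
}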
Deployment
The build and deploy flow uses cargo-lambda:
# 1. Build for ARM64 Lambda
cargo lambda build --release --arm64
# 2. Re-zip (cargo-lambda doesn't update the ZIP automatically — learned this the hard way)
zip -j target/lambda/bootstrap/bootstrap.zip target/lambda/bootstrap/bootstrap
# 3. Deploy
RAPIDAPI_PROXY_SECRET="your-secret" serverless deploy
Step 2 is critical. cargo lambda build rebuilds the binary but does not update the .zip artifact if it already exists. If you skip it, Serverless Framework deploys the old binary. I chased this bug for an embarrassing amount of time.
The serverless.yml is minimal:
provider:
  name: aws
  runtime: provided.al2023
  architecture: arm64
  memorySize: 128
  timeout: 10
  httpApi:
    cors:
      allowedOrigins: ["*"]
    disableDefaultEndpoint: false
  environment:
    RAPIDAPI_PROXY_SECRET: ${env:RAPIDAPI_PROXY_SECRET, ""}
    MAX_BODY_BYTES: "5242880"
Security
The API is published on RapidAPI, which handles auth, rate limiting, and metering. On the backend, every request must carry a X-RapidAPI-Proxy-Secret header — a shared secret between RapidAPI and our Lambda that prevents callers from bypassing RapidAPI and hitting the AWS URL directly.
if let Ok(expected) = std::env::var("RAPIDAPI_PROXY_SECRET") {
    if !expected.is_empty() {
        let provided = event.payload
            .get("headers")
            .and_then(|h| h.get("x-rapidapi-proxy-secret"))
            .and_then(|v| v.as_str())
            .unwrap_or("");
        if provided != expected {
            return Ok(error_response(401, "unauthorized"));
        }
    }
}
Note: HTTP API v2 lowercases all header names before forwarding, so we check x-rapidapi-proxy-secret (lowercase), not the original casing.
Validation
Before publishing, I ran the API against 200 documentation pages across 20 batches:
| Category | Pages | Result |
|---|---|---|
| GitHub REST API | 10 | ✅ 9 OK, 1 minor nbsp |
| Stripe API | 10 | ✅ 10 OK |
| MDN Web Docs | 10 | ✅ 10 OK |
| React / Vue / Angular | 10 | ✅ 10 OK |
| Python / Node / Rust / Go | 10 | ✅ 10 OK |
| PostgreSQL / Redis / MongoDB | 10 | ✅ 10 OK |
| AWS / Kubernetes / Docker | 10 | ✅ 9 OK, 1 minor indent |
| FastAPI / Django / Flask | 10 | ✅ 10 OK |
| GraphQL / Apollo / Prisma | 10 | ✅ 8 OK, 2 SPA |
| Testing tools | 10 | ✅ 10 OK |
96 of the 100 pages above extracted cleanly. The remaining four were either minor formatting glitches (a stray nbsp, an indentation hiccup) or SPAs like Nuxt and Remix, JavaScript-rendered pages with no static HTML content. That's a known limitation, not a bug.
Limitations
Being upfront about what doesn't work:
- SPAs: Sites that render content via JavaScript (Notion, some dashboards, HashiCorp Developer Portal) return little or no content. You'd need a headless browser for those.
- Auth-gated pages: The API fetches anonymously.
- 5 MB page cap: Large pages are truncated.
- 8s timeout: Slow servers may time out.
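Both the size cap and the timeout are enforced at the fetch step. A rough sketch with reqwest and rustls, purely illustrative (the real code may stream and stop reading earlier instead of truncating after the fact):
use std::time::Duration;
// Assumes reqwest with the rustls-tls feature enabled.
async fn fetch_html(url: &str, max_bytes: usize) -> Result<String, reqwest::Error> {
    let client = reqwest::Client::builder()
        .use_rustls_tls()
        .timeout(Duration::from_secs(8))   // slow servers hit this limit
        .build()?;
    let body = client.get(url).send().await?.bytes().await?;
    // Enforce the page cap by dropping anything beyond max_bytes.
    let slice = &body[..body.len().min(max_bytes)];
    Ok(String::from_utf8_lossy(slice).into_owned())
}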
Using the API
Available on RapidAPI with a free tier (100 requests/month):
👉 rapidapi.com/pedroneto2/api/cleanmark
Quick example in Python:
import requests
response = requests.post(
    "https://cleanmark.p.rapidapi.com/convert",
    json={"url": "https://docs.python.org/3/library/json.html"},
    headers={
        "X-RapidAPI-Key": "YOUR_KEY",
        "X-RapidAPI-Host": "cleanmark.p.rapidapi.com"
    }
)
data = response.json()
print(data["markdown"])
With LangChain:
from langchain.schema import Document
import requests
def fetch_doc(url: str) -> Document:
    res = requests.post(
        "https://cleanmark.p.rapidapi.com/convert",
        json={"url": url},
        headers={
            "X-RapidAPI-Key": "YOUR_KEY",
            "X-RapidAPI-Host": "cleanmark.p.rapidapi.com"
        }
    )
    data = res.json()
    return Document(
        page_content=data["markdown"],
        metadata={"source": url, "title": data["title"]}
    )
What I'd do differently
Add caching. A Redis layer (Upstash would be free at this scale) could cache popular pages for a few hours. Most documentation pages don't change minute-to-minute, and eliminating the upstream fetch would cut latency from ~400ms to ~20ms for cache hits.
Streaming response. For large pages, it would be better to stream the Markdown as it's generated rather than buffering the whole thing. Lambda supports response streaming, but through function URLs rather than API Gateway (which buffers responses), so adopting it would mean changing the front door.
Headless browser fallback. For SPAs, spin up a Chromium instance on demand (AWS Lambda supports it with a layer). The cost would be ~10x higher per request, but it would handle Notion, Shopify, and similar sites.
Conclusion
The whole project — from idea to deployed API — took about a weekend. Rust made the core logic fast to write once I understood the scraper crate's API. The hardest parts were the edge cases: list depth tracking, code block cleaning, and the cargo-lambda zip bug.
If you're building AI agents or RAG pipelines and need a reliable way to ingest documentation, give it a try. The free tier is there, no credit card required.
Source code and full test results are available on request — happy to open source it if there's interest.
Built with Rust · AWS Lambda ARM64 · Serverless Framework · RapidAPI