Edgaras

Posted on May 20

A Self-Hosted Web Content Extraction API

#ai #scraping #webscraping #rag

Getting clean content out of a web page is harder than it looks, especially at scale. Every site is put together differently, so a scraper that works on one page falls apart on the next, and the part you actually care about is buried in menus, ads, cookie banners, and scripts. You can feed the whole page to an LLM and let it pull the content out, or pay for an extraction API, but both get expensive once you are handling more than a handful of pages. Many sites also render their content with JavaScript, so a plain HTTP fetch returns almost nothing to begin with.

The Web Loader Engine is a web content extraction service built in Rust that handles this. It loads each page in a headless Chromium browser and runs it through Mozilla Readability, so you get the actual content without the clutter around it. It can return Markdown, HTML, plain text, or a screenshot, all through a single HTTP API call. It is useful for RAG pipelines, scraping, archiving pages, and taking screenshots.

Quick start

Pull and run the prebuilt image:

docker run -d -p 14786:14786 --name web-loader \
  edgaras0x4e/web-loader-engine:latest

Check that it is up:

curl http://localhost:14786/health

That is the whole setup. No headless browser to install, no driver to wire up.

Your first extraction

Send a URL to /load and get structured content back:

curl -X POST http://localhost:14786/load \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Response:

{
  "url": "https://example.com",
  "title": "Example Domain",
  "content": "Title: Example Domain\nURL Source: https://example.com\n\n---\n\nThis domain is for use in documentation examples without needing permission. Avoid use in operations.\n\n[Learn more](https://iana.org/domains/example)",
  "metadata": {
    "processing_time_ms": 4,
    "cached": false
  }
}

The content field is ready to chunk and embed. The metadata tells you how long it took and whether the response came from cache.
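To make "ready to chunk and embed" concrete, here is a minimal Python sketch of the same call. It assumes the container from the quick start is listening on localhost:14786. The names build_load_request, load, and chunk_content are hypothetical helpers invented for illustration, and the paragraph-packing chunker is just one simple strategy, not something the engine provides.

```python
import json
import urllib.request

BASE_URL = "http://localhost:14786"  # assumption: port from the quick start


def build_load_request(url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /load request without sending it."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/load",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def load(url: str) -> dict:
    """Send the request; needs the engine container to be running."""
    with urllib.request.urlopen(build_load_request(url), timeout=30) as resp:
        return json.load(resp)


def chunk_content(content: str, max_chars: int = 500) -> list:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Hypothetical helper: a single paragraph longer than max_chars
    is kept whole rather than split.
    """
    chunks, current = [], ""
    for para in content.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With the service running, chunk_content(load("https://example.com")["content"]) would give you a list of strings to pass to whatever embedding step you use.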

Pick your output format

The same endpoint returns different formats based on the x-respond-with header. Accepted values are markdown, html, text, screenshot, and pageshot.

# Plain text, no markup
curl -X POST http://localhost:14786/load \
  -H "Content-Type: application/json" \
  -H "x-respond-with: text" \
  -d '{"url": "https://example.com"}'

Markdown is the default and is usually what you want for LLM input. Plain text is handy for keyword indexing, and HTML keeps structure when you need it.
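If you call the API from code, it can help to validate the format before sending anything. The sketch below assumes only the header name and the five accepted values listed above. format_headers is a hypothetical helper, not part of the engine.

```python
ACCEPTED_FORMATS = {"markdown", "html", "text", "screenshot", "pageshot"}


def format_headers(respond_with: str = "markdown") -> dict:
    """Return request headers that select an output format.

    Raises ValueError for anything outside the accepted values,
    so a typo fails locally instead of at the server.
    """
    if respond_with not in ACCEPTED_FORMATS:
        raise ValueError(f"unsupported format: {respond_with!r}")
    return {
        "Content-Type": "application/json",
        "x-respond-with": respond_with,
    }
```

Sending x-respond-with: markdown explicitly should be equivalent to omitting the header, since Markdown is the default.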

Screenshots

For visual snapshots, ask for a screenshot (viewport) or pageshot (full page):

curl -X POST http://localhost:14786/load \
  -H "Content-Type: application/json" \
  -H "x-respond-with: screenshot" \
  -d '{"url": "https://example.com"}'

The engine renders the page, saves a PNG to SCREENSHOT_DIR, and returns a relative URL in screenshot_url that you can fetch from the same server:

{
  "url": "https://example.com",
  "title": null,
  "content": "",
  "screenshot_url": "/screenshots/httpsexamplecom_6eb6a747-ba80-47bf-91c5-2767aae1c5d0.png",
  "metadata": { "processing_time_ms": 1064, "cached": false }
}
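Since screenshot_url comes back as a path rather than a full URL, a client has to join it with the server address before downloading. A small Python sketch of that step, assuming the server is reachable at localhost:14786 (screenshot_download_url is a hypothetical helper):

```python
from typing import Optional
from urllib.parse import urljoin


def screenshot_download_url(
    response: dict, base_url: str = "http://localhost:14786"
) -> Optional[str]:
    """Turn the relative screenshot_url into an absolute URL.

    Returns None when the response has no screenshot, for example
    when a text format was requested instead.
    """
    path = response.get("screenshot_url")
    return urljoin(base_url, path) if path else None
```

The resulting URL can then be fetched with any HTTP client to retrieve the PNG.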

Precision extraction

Request headers give you finer control over what gets extracted:

x-wait-for-selector waits for an element matching a CSS selector before extracting, so dynamic content has time to load.
x-target-selector extracts only a specific region of the page.
x-remove-selector strips elements you do not want, such as ads or footers.
x-with-links-summary and x-with-images-summary add a summary of the links or images found on the page.
x-set-cookie and x-user-agent let you control the request; setting x-user-agent to rotate turns on automatic user agent rotation.
x-no-cache: true forces a fresh fetch, bypassing the cache. This is useful when you change selectors, since cached responses are returned as-is.

curl -X POST http://localhost:14786/load \
  -H "Content-Type: application/json" \
  -H "x-no-cache: true" \
  -H "x-wait-for-selector: h1" \
  -H "x-target-selector: div" \
  -H "x-remove-selector: a" \
  -d '{"url": "https://example.com"}'
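The same header combination can be assembled in Python. This sketch covers only the four headers whose values appear in the curl example above. The summary and cookie headers are left out because their exact values are not shown here. extraction_headers is a hypothetical helper.

```python
from typing import Optional


def extraction_headers(
    wait_for: Optional[str] = None,
    target: Optional[str] = None,
    remove: Optional[str] = None,
    no_cache: bool = False,
) -> dict:
    """Build /load headers, adding only the options that were set."""
    headers = {"Content-Type": "application/json"}
    if no_cache:
        headers["x-no-cache"] = "true"
    if wait_for:
        headers["x-wait-for-selector"] = wait_for
    if target:
        headers["x-target-selector"] = target
    if remove:
        headers["x-remove-selector"] = remove
    return headers
```

Passing no_cache=True while you experiment with selectors avoids getting back a cached response produced with the old ones.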

Process URLs in batch

For crawling or bulk indexing, send many URLs at once to /load/batch and let the browser pool handle them concurrently:

curl -X POST http://localhost:14786/load/batch \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com/1", "https://example.com/2"]}'

{
  "results": [
    {
      "url": "https://example.com/1",
      "response": {
        "url": "https://example.com/1",
        "title": "Page Title",
        "content": "...",
        "metadata": { "processing_time_ms": 500, "cached": false }
      }
    }
  ],
  "total_processing_time_ms": 1234
}

The URLs are fetched in parallel rather than one after another, so a batch is much faster than looping. Four pages that each take about two seconds came back in roughly 4.4 seconds together, against about 18 seconds when fetched one at a time.
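On the client side, the batch response is easiest to work with once it is indexed by URL. The Python sketch below relies on the response shape shown above. How a failed URL is reported is not shown in this post, so skipping entries without a response object is an assumption, and both function names are hypothetical.

```python
import json


def build_batch_body(urls) -> bytes:
    """Encode the JSON body for POST /load/batch."""
    return json.dumps({"urls": list(urls)}).encode("utf-8")


def contents_by_url(batch: dict) -> dict:
    """Map each requested URL to its extracted content."""
    out = {}
    for item in batch.get("results", []):
        response = item.get("response")
        # Assumption: entries that failed carry no "response" object.
        if response:
            out[item["url"]] = response["content"]
    return out
```

Comparing the keys of the returned dict against the URLs you sent is a quick way to spot pages that need a retry.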

User agent rotation

Some sites get cranky when every request shows up with the same user agent. The engine can rotate through a pool of them for you. You set the strategy and the pool with environment variables, where USER_AGENT_ROTATION is off, round_robin, or random:

environment:
  - USER_AGENT_ROTATION=round_robin
  - USER_AGENT_POOL=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36|Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36|Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36
  - DEFAULT_USER_AGENT=Mozilla/5.0 (compatible; web-loader)

With round_robin, each request moves to the next entry in the pool. You can also decide per request with the x-user-agent header: set it to rotate to force rotation even when the strategy is off, to default to skip rotation for that one call, or to any other string to send that string as the user agent.

curl -X POST http://localhost:14786/load \
  -H "Content-Type: application/json" \
  -H "x-user-agent: rotate" \
  -d '{"url": "https://example.com"}'

Browser pool resilience

Rendering runs through a pool of headless Chromium pages, sized by BROWSER_POOL_SIZE (default 10). That pool is what lets batches run in parallel, and it is also the part most likely to break, since browsers crash. The engine checks each browser and recreates any that die or drop their connection, and a request retries a few times instead of failing on the first hiccup. You can kill every browser in the container and the next request still comes back fine, with the pool rebuilt behind it.

The /health endpoint shows the current state:

curl http://localhost:14786/health

{
  "status": "ok",
  "version": "0.1.4",
  "browser_pool": { "available": 10, "total": 10, "healthy": true, "recreation_count": 2 }
}

recreation_count goes up each time a dead browser is replaced, so it is worth keeping an eye on for a long-running deployment.
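A monitoring job could poll /health and flag the pool when something looks off. The Python sketch below works on the JSON shape shown above. The alerting rule (unhealthy status, or recreation_count rising since the last poll) is one possible policy chosen for illustration, and pool_needs_attention is a hypothetical helper.

```python
from typing import Optional


def pool_needs_attention(
    health: dict, previous_recreations: Optional[int] = None
) -> bool:
    """Decide whether a /health response deserves a closer look."""
    pool = health.get("browser_pool", {})
    if health.get("status") != "ok" or not pool.get("healthy", False):
        return True
    if previous_recreations is not None:
        # Browsers were replaced since the previous poll.
        return pool.get("recreation_count", 0) > previous_recreations
    return False
```

Store the recreation_count from each poll and pass it in on the next one. A count that keeps climbing suggests browsers are crashing regularly even though requests still succeed.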

Built-in features

Beyond extraction, the engine ships with the parts you would otherwise build yourself:

Caching with a configurable TTL, so repeated URLs come back from cache instead of being fetched again.
Per-domain rate limiting and circuit breakers, so a slow or failing host does not take down your pipeline.
SSRF protection that blocks internal IP ranges, important when you accept URLs from users.
Proxy support that honors HTTP_PROXY, HTTPS_PROXY, and NO_PROXY, including Chromium egress routing.
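To show the kind of check SSRF protection involves, here is a Python sketch using the standard ipaddress module. It is an illustration of the idea, not the engine's actual rules: it only inspects IP literals and the name localhost, and does no DNS resolution, which a real guard would need. Both function names are hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def is_internal_ip(value: str) -> bool:
    """True for addresses a public fetcher should normally refuse."""
    ip = ipaddress.ip_address(value)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def should_block(url: str) -> bool:
    """Pre-flight check on a URL's host, without DNS resolution."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return True
    host = parts.hostname
    if host == "localhost":
        return True
    try:
        return is_internal_ip(host)
    except ValueError:
        # Not an IP literal. A real guard would resolve the name and
        # check every resulting address; this sketch lets it through.
        return False
```

The link-local range matters in particular on cloud hosts, where metadata services commonly live at 169.254.169.254.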

Configuration

Everything is driven by environment variables. The defaults are sensible, and you only override what you need:

Variable	Default	Purpose
`API_PORT`	14786	Server port
`API_KEY`	(none)	Optional bearer-token auth
`BROWSER_POOL_SIZE`	10	Concurrent browser instances
`REQUEST_TIMEOUT`	30	Timeout in seconds
`CACHE_TTL`	3600	Cache lifetime in seconds
`SCREENSHOT_DIR`	/app/screenshots	Screenshot storage
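A client can mirror two of these settings. The Python sketch below derives the base URL from API_PORT and adds an auth header when API_KEY is set. The table only says "bearer-token auth", so the Authorization: Bearer format is an assumption based on the usual convention, and both helper names are hypothetical.

```python
import os
from typing import Mapping, Optional


def base_url(env: Mapping[str, str] = os.environ, host: str = "localhost") -> str:
    """Build the server address, falling back to the default port."""
    port = env.get("API_PORT", "14786")
    return f"http://{host}:{port}"


def client_headers(api_key: Optional[str] = None) -> dict:
    """Return request headers, with auth only when a key is configured."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: the standard bearer scheme is what the server expects.
        headers["Authorization"] = f"Bearer {api_key}"
    return headers
```

Reading both values from the same environment the container uses keeps client and server configuration from drifting apart.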

A typical Compose setup:

services:
  web-loader:
    image: edgaras0x4e/web-loader-engine:latest
    ports:
      - "14786:14786"
    environment:
      - BROWSER_POOL_SIZE=10
      - CACHE_TTL=3600
    volumes:
      - screenshots:/app/screenshots
    restart: unless-stopped

volumes:
  screenshots:

Bottom line

The point is that one service takes you from a URL to clean content. Rendering, output formats, caching, rate limiting, security, batch fetching, user agent rotation, and a browser pool that recovers on its own are all part of it, so you are not stitching together a browser driver, a parser, and a cache by hand. It runs as a single container that you point at a URL to get back something you can actually use.

Try it:

docker run -d -p 14786:14786 edgaras0x4e/web-loader-engine:latest

GitHub: https://github.com/Edgaras0x4E/web-loader-engine
Docker Hub: https://hub.docker.com/r/edgaras0x4e/web-loader-engine
