Khalid Hussein for Xberg.io

Posted on Jun 29

Introducing Crawlberg v1.0.0

#webdev #ai #programming #rust

Crawlberg v1.0.0 is now available. It succeeds kreuzcrawl and freezes the public API under the new project name. All technical features described below shipped in v0.3.0 (2026-06-23); v1.0.0 is a stability declaration and a rename, not a new feature release.

The four production-facing changes most likely to require operational action:

Package and env var rename - every artifact identifier has changed; see the migration table.
SSRF defense is now on by default - internal crawl targets (localhost, RFC 1918, cloud metadata) will fail without CRAWLBERG_ALLOW_PRIVATE_NETWORK=1.
CrawlError::WafBlocked is now a struct variant - exhaustive match arms will not compile until updated.
max_retries semantics changed - off-by-one fixed; max_retries=3 now produces exactly 3 retries.

Precompiled binaries cover Linux (x86_64/aarch64), macOS (ARM64 and x86_64), and Windows x64. Homebrew bottles and Docker images on GHCR are also available.

What Is Crawlberg?

Crawlberg is a web crawling engine written primarily in Rust that exposes a single consistent API across 14 language runtimes. It handles HTTP transport, JavaScript rendering, robots.txt compliance, per-domain rate limiting, SSRF safety, and structured extraction. Extension points (Frontier, RateLimiter, CrawlStore, EventEmitter, ContentFilter, WafClassifier, ProxyProvider) are injectable traits; wire in your own frontier, storage backend, or proxy pool without forking the engine.

A single scrape() call returns text, metadata, links, images, assets, JSON-LD, Open Graph tags, hreflang, favicons, headings, response headers, and clean HTML→Markdown. When a site requires JavaScript, the optional headless browser tier handles it transparently.

v1.0.0 promotes v1.0.0-rc.2 and freezes the public API under the new project name. The features described in the sections below represent the platform that 1.0.0 declares stable; they shipped in v0.3.0.

What v1.0.0 Declares Stable

These capabilities shipped in v0.3.0 (2026-06-23). v1.0.0 freezes their API and declares them production-stable under the new crawlberg package name. Engineers running 0.3.0 already have the runtime features; upgrading to 1.0.0 means: rename packages, update env vars, get the stable API contract.

Project rename: `kreuzcrawl` → `crawlberg`

The most operationally significant change is the rename. Every artifact identifier has changed:

Artifact	Old	New
Crate (crates.io)	`kreuzcrawl`	`crawlberg`
PyPI	`kreuzcrawl`	`crawlberg`
npm	`@kreuzberg/crawl`	`@xberg-io/crawlberg`
Composer	`kreuzberg/kreuzcrawl`	`xberg-io/crawlberg`
Maven groupId	`dev.kreuzberg`	`io.xberg.crawlberg`
NuGet	`KreuzbergDev.KreuzCrawl`	`XbergIo.Crawlberg` (verify the exact package ID in the C# README)
Go module	`github.com/kreuzberg-dev/kreuzcrawl/...`	`github.com/xberg-io/crawlberg/packages/go`
C FFI symbol prefix	`kcrawl_*`	`cberg_*`
Env vars	`KREUZBERG_*`	`CRAWLBERG_*`
Docs	`docs.kreuzcrawl.kreuzberg.dev`	`docs.crawlberg.xberg.io`

Relative to v0.3.0, behavior and API shape are identical. This is a rename, not a rewrite.

Tiered dispatch engine

The crawl engine chains HTTP → bypass → headless browser, driven by per-attempt signals rather than a static configuration flag. When a response indicates a WAF challenge, the engine escalates; when it succeeds, it records the outcome in per-domain state and adjusts the starting tier for subsequent visits.

Public types: Tier, EscalationStrategy, EscalationReason, AttemptOutcome, RetryDirective, RetryPolicy, WafSignal, DispatchProfile. All dispatch enums are #[non_exhaustive], so future tiers are non-breaking additions.
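As a rough illustration of the escalation chain described above, the following Python sketch walks a request through the HTTP, bypass, and browser tiers until one succeeds. The Tier enum borrows the tier names from the text, but fetch_with_escalation, the fetchers mapping, and the "waf_challenge" outcome string are hypothetical names chosen for this example; they are not Crawlberg's API.

```python
from enum import IntEnum
from typing import Callable, Dict, List, Tuple


class Tier(IntEnum):
    HTTP = 0
    BYPASS = 1
    BROWSER = 2


# Hypothetical fetcher signature: takes a URL, returns "ok" or "waf_challenge".
Fetcher = Callable[[str], str]


def fetch_with_escalation(
    url: str, fetchers: Dict[Tier, Fetcher], start: Tier = Tier.HTTP
) -> Tuple[Tier, List[Tier]]:
    """Try each tier from `start` upward; return the tier that succeeded
    and the list of tiers attempted along the way."""
    attempted: List[Tier] = []
    for tier in Tier:  # iterates in definition order: HTTP, BYPASS, BROWSER
        if tier < start:
            continue
        attempted.append(tier)
        if fetchers[tier](url) == "ok":
            return tier, attempted
    raise RuntimeError(f"all tiers exhausted for {url}")


# Stub fetchers: plain HTTP is challenged, the bypass tier succeeds.
fetchers = {
    Tier.HTTP: lambda url: "waf_challenge",
    Tier.BYPASS: lambda url: "ok",
    Tier.BROWSER: lambda url: "ok",
}
winner, attempted = fetch_with_escalation("https://example.com", fetchers)
print(winner.name, [t.name for t in attempted])
```

The start parameter stands in for the per-domain starting tier: a domain with a history of challenges could begin at a higher tier and skip the cheaper probes.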

WAF detection with hot-reload fingerprints

A TOML fingerprint corpus (rules/waf_fingerprints.toml, 34 fingerprints) feeds an Aho-Corasick automaton. TomlClassifier::watch() watches the file with a debounced watcher and swaps the compiled automaton atomically via ArcSwap, so no process restart is needed. This is safe for Kubernetes ConfigMap updates: mount the TOML as a ConfigMap volume, edit it, and the running engine picks up the new corpus within seconds.

Per-domain block rates are tracked with EwmaDomainState, an exponentially weighted moving average that automatically promotes or demotes the starting tier based on recent history.
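For readers unfamiliar with the technique, an exponentially weighted moving average updates as rate = alpha * sample + (1 - alpha) * previous_rate, so recent observations count more than old ones. The sketch below shows that update for a 0/1 "blocked" signal. The class name, the alpha value, the threshold, and the starting_tier policy are all assumptions made for illustration; they do not describe how EwmaDomainState is implemented.

```python
class EwmaBlockRate:
    """Exponentially weighted moving average of a 0/1 'blocked' signal."""

    def __init__(self, alpha: float = 0.5) -> None:
        self.alpha = alpha  # weight of the newest observation (assumed value)
        self.rate = 0.0     # start by assuming the domain never blocks

    def observe(self, blocked: bool) -> float:
        sample = 1.0 if blocked else 0.0
        self.rate = self.alpha * sample + (1.0 - self.alpha) * self.rate
        return self.rate

    def starting_tier(self, threshold: float = 0.5) -> str:
        # Hypothetical policy: start at the browser tier when blocks dominate.
        return "browser" if self.rate > threshold else "http"


state = EwmaBlockRate(alpha=0.5)
state.observe(True)           # rate becomes 0.5
state.observe(True)           # rate becomes 0.75
print(state.starting_tier())  # browser
state.observe(False)          # rate decays to 0.375
print(state.starting_tier())  # http
```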

SSRF defense, on by default

Every fetch path runs URL validation before the network call and after each redirect hop. Blocked address ranges:

127.0.0.0/8 (loopback)
RFC 1918 private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
169.254.0.0/16 (link-local, including cloud metadata endpoints such as 169.254.169.254)
0.0.0.0/8 (this-network/reserved per RFC 1122 §3.2.1.3)
224.0.0.0/4 (multicast)
IPv6 ULA fc00::/7, link-local fe80::/10, multicast ff00::/8
Any non-http(s) scheme
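To make the list above concrete, here is a minimal Python sketch of an address and scheme check built on the standard-library ipaddress module. The network list is copied from the ranges above; the function names is_blocked_ip and is_allowed_scheme are hypothetical, and this is not Crawlberg's validation code.

```python
import ipaddress
from urllib.parse import urlsplit

# Ranges copied from the list above.
BLOCKED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "127.0.0.0/8", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
    "169.254.0.0/16", "0.0.0.0/8", "224.0.0.0/4",
    "fc00::/7", "fe80::/10", "ff00::/8",
)]


def is_blocked_ip(ip: str) -> bool:
    """True when the address falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(
        addr.version == net.version and addr in net
        for net in BLOCKED_NETWORKS
    )


def is_allowed_scheme(url: str) -> bool:
    """Only http and https URLs pass; file:, ftp:, gopher: and the rest do not."""
    return urlsplit(url).scheme in ("http", "https")


print(is_blocked_ip("169.254.169.254"))         # True (cloud metadata endpoint)
print(is_blocked_ip("8.8.8.8"))                 # False
print(is_allowed_scheme("file:///etc/passwd"))  # False
```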

Three protection layers work together:

DNS-rebinding mitigation: every resolved IP must pass the policy, not just the hostname at call time.
Redirect-chain re-validation: each hop re-resolves and re-validates, bounded by ssrf.max_redirects (default 5).
Link-enqueue validation: URLs are validated against the SSRF policy before being added to the crawl frontier, not only at fetch time.

Allowlisting is available via HostMatcher (Exact/Suffix/Cidr variants). Opt out entirely with CRAWLBERG_ALLOW_PRIVATE_NETWORK=1.

Memory-bounded streaming crawl

crawl_stream() and batch_crawl_stream() previously accumulated every CrawlEvent::Page in memory. They now yield each page and drop it immediately. Based on internal measurements documented in the changelog, peak working-set drops from approximately 2.5 GB to approximately 20 MB on large crawls. The batch crawl() API, which returns all pages at once, is unchanged.
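The difference between accumulating and yielding can be sketched with a plain Python generator. This is a generic illustration of the pattern, not Crawlberg's implementation; fetch_page, crawl_batch, and crawl_streaming are hypothetical stand-ins.

```python
from typing import Iterator, List


def fetch_page(n: int) -> str:
    # Stand-in for a real fetch; returns a synthetic page body.
    return f"page-{n}"


def crawl_batch(total: int) -> List[str]:
    """Batch style: every page stays in memory until the caller gets the list."""
    return [fetch_page(n) for n in range(total)]


def crawl_streaming(total: int) -> Iterator[str]:
    """Streaming style: each page is yielded and can be freed once the
    consumer moves on to the next one."""
    for n in range(total):
        yield fetch_page(n)


# The consumer holds one page at a time, so memory does not grow with `total`.
count = 0
for page in crawl_streaming(1000):
    count += 1
print(count)  # 1000
```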

MCP server and full CLI parity

The CLI exposes batch-scrape, batch-crawl, download, citations, and version, matching the core and MCP surfaces 1:1. The MCP server serves tools over both stdio and rmcp Streamable HTTP at /mcp. HTTP transport requires the binary to be compiled with the api and mcp Cargo features; the release CLI binary includes both. Each tool carries read_only/destructive/open_world safety annotations for agent orchestration frameworks that need to reason about side effects before calling tools.

Public substrate parsers

crawlberg::robots and crawlberg::sitemap are now public modules, usable without spinning up the full crawl engine. parse_robots_txt, is_path_allowed, RobotsRules, parse_sitemap_xml, parse_sitemap_index, and is_sitemap_index are all available standalone, which is useful for robots/sitemap preprocessing in pipelines that manage their own fetch layer.

Deep Technical Highlights

Escalation budget injection

EscalationBudget is a user-injectable trait. You can implement per-domain, per-hour browser budget caps, or tie escalation policy to real-time proxy cost signals. The built-in EwmaDomainState is designed for zero-configuration deployment; the trait interface is designed for when you have stronger opinions.
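One way to picture a per-domain, per-hour browser budget cap is a rolling window of timestamps per domain. The Python sketch below shows that idea only; HourlyBrowserBudget, try_spend, and the injectable clock are hypothetical names, and the real EscalationBudget trait may expose a different interface.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class HourlyBrowserBudget:
    """Allow at most `cap` browser escalations per domain per rolling hour."""

    def __init__(self, cap: int, clock: Callable[[], float] = time.monotonic) -> None:
        self.cap = cap
        self.clock = clock  # injectable so the window can be tested without waiting
        self.spent: Dict[str, Deque[float]] = defaultdict(deque)

    def try_spend(self, domain: str) -> bool:
        now = self.clock()
        window = self.spent[domain]
        # Drop escalations that are at least one hour old.
        while window and now - window[0] >= 3600.0:
            window.popleft()
        if len(window) >= self.cap:
            return False  # budget exhausted: caller should not escalate
        window.append(now)
        return True


fake_now = [0.0]
budget = HourlyBrowserBudget(cap=2, clock=lambda: fake_now[0])
print([budget.try_spend("example.com") for _ in range(3)])  # [True, True, False]
fake_now[0] = 3600.0  # one hour later the window has emptied
print(budget.try_spend("example.com"))  # True
```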

Lock-free corpus hot-reload

The WAF automaton lives under an ArcSwap. Readers take a guard for the duration of a single classification call (on the order of microseconds) and never contend with the writer. The writer side compiles a new automaton (tens of milliseconds for a large corpus) and swaps it in a single atomic store. In-flight requests complete against the old automaton; new requests use the new one immediately. Readers never block on corpus updates.
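The snapshot-and-swap pattern can be approximated in Python, with two caveats stated up front: Python has no ArcSwap, so this sketch relies on attribute assignment being a single reference swap in CPython, and plain substring matching stands in for the Aho-Corasick automaton. FingerprintCorpus and the pattern strings are hypothetical.

```python
import threading
from typing import FrozenSet


class FingerprintCorpus:
    """Holds an immutable compiled corpus; readers snapshot it, the writer swaps it."""

    def __init__(self, patterns: FrozenSet[str]) -> None:
        self._current = frozenset(patterns)
        self._write_lock = threading.Lock()  # serializes writers only

    def classify(self, body: str) -> bool:
        snapshot = self._current  # one reference read; readers take no lock
        return any(p in body for p in snapshot)

    def reload(self, new_patterns: FrozenSet[str]) -> None:
        compiled = frozenset(new_patterns)  # build the new corpus off to the side
        with self._write_lock:
            self._current = compiled        # single reference swap


corpus = FingerprintCorpus(frozenset({"cf-challenge"}))
print(corpus.classify("<div id='cf-challenge'>"))  # True
corpus.reload(frozenset({"captcha-wall"}))
print(corpus.classify("<div id='cf-challenge'>"))  # False
```

A reader that took its snapshot before the reload keeps using the old set until its call returns, which mirrors the in-flight behavior described above.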

SSRF at every redirect hop

URL validation at the call site alone is insufficient: a hostname can pass the initial check, then DNS resolves to a private address after a short TTL expires (DNS rebinding). Crawlberg re-resolves and re-validates at every redirect, bounded by ssrf.max_redirects (default 5). The SsrfPolicy::from_env serde default means CrawlConfig deserialized from JSON automatically honors the environment variable, which matters for container deployments where env vars are the primary configuration channel.
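The per-hop logic can be sketched as a loop that re-resolves and re-validates before following each redirect. This is a simplified model under stated assumptions: hosts stand in for full URLs, the resolver and redirect table are injected test doubles, and follow_redirects and ip_is_private are hypothetical names rather than Crawlberg functions. The default of 5 comes from the text above.

```python
import ipaddress
from typing import Callable, Dict, List, Optional

Resolver = Callable[[str], List[str]]   # hostname -> resolved IPs
RedirectMap = Dict[str, Optional[str]]  # host -> next host, absent when final


def ip_is_private(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return addr.is_private or addr.is_loopback or addr.is_link_local


def follow_redirects(host: str, resolve: Resolver, redirects: RedirectMap,
                     max_redirects: int = 5) -> str:
    """Validate every resolved IP at every hop; return the final host."""
    hops = 0
    while True:
        ips = resolve(host)  # re-resolve at each hop, never reuse an old answer
        if not ips or any(ip_is_private(ip) for ip in ips):
            raise PermissionError(f"SSRF policy violation at {host}")
        nxt = redirects.get(host)
        if nxt is None:
            return host
        hops += 1
        if hops > max_redirects:
            raise RuntimeError("too many redirects")
        host = nxt


# A public start page that redirects to a host with one private A record.
dns = {"start.example": ["1.1.1.1"], "evil.example": ["8.8.8.8", "10.0.0.5"]}
try:
    follow_redirects("start.example", lambda h: dns[h], {"start.example": "evil.example"})
except PermissionError as err:
    print(err)
```

Checking every resolved address, not just the first, is what catches the mixed public/private answer in the example.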

Browser pool lifecycle

BrowserPool is public. Construct and warm(n) the pool at startup; pass it via CrawlEngineBuilder::with_browser_pool(). Browser instances are reused across crawl jobs rather than spawned per escalation event. CrawlEngineHandle::from_engine() produces a cloneable handle, so multiple async tasks can share a single engine and pool without additional coordination.

Asset downloads through the SSRF filter

Before this release, download_documents was honored only by single-page scrape(); the crawl loop fetched, flagged, and discarded the bytes. Downloads now route through http_fetch, the same transport as page fetches, so every file download is subject to the SSRF policy and per-domain rate limiting.

Performance Implications

The most directly measurable change is streaming memory: ~2.5 GB → ~20 MB peak working-set on large crawls (figures from the changelog; no external benchmark suite has been published for this release). The practical implication is that crawl corpus size is no longer bounded by available RAM when using crawl_stream.

Throughput: the tiered dispatch model adds a small latency overhead for requests that escalate, namely one additional HTTP probe before browser spin-up. Domains that respond normally to plain HTTP never pay this cost. EWMA per-domain state promotes well-behaved domains to start at the HTTP tier, avoiding unnecessary bypass or browser escalation for clean domains.

The ArcSwap-backed corpus reload is lock-free from the reader's perspective, so fingerprint corpus updates do not introduce latency spikes in production.

No benchmark numbers for throughput, requests-per-second, or latency percentiles are published in this release. Teams evaluating Crawlberg for high-throughput workloads should run their own benchmarks against the stable 1.0.0 surface; the Criterion benchmarks in the repository cover the WAF subsystem and are a starting point for extending coverage.

Language Bindings Spotlight

All 14 bindings are generated from the same Rust core by alef and contain no per-language extraction logic. The code snippets below are illustrative; check the per-language READMEs for exact API signatures.

Python

import asyncio

from crawlberg import CrawlConfig, scrape, crawl_stream


async def main() -> None:
    config = CrawlConfig(max_depth=3, max_pages=500, concurrency=20)

    # Single-page extraction
    result = await scrape("https://example.com", config=config)
    print(result.markdown)

    # Memory-bounded streaming crawl: each page is handled, then dropped
    async for event in crawl_stream("https://example.com", config=config):
        if event.page:
            print(event.page.url, event.page.title)


# await is only valid inside a coroutine, so run the example via asyncio
asyncio.run(main())

pip install crawlberg

JavaScript / Node.js

import { scrape, crawlStream, CrawlConfig } from "@xberg-io/crawlberg";

const config = new CrawlConfig({ maxDepth: 3, maxPages: 500, concurrency: 20 });

const result = await scrape("https://example.com", { config });
console.log(result.markdown);

for await (const event of crawlStream("https://example.com", { config })) {
  if (event.page) console.log(event.page.url, event.page.title);
}

npm install @xberg-io/crawlberg

PHP

use XbergIo\Crawlberg\CrawlConfig;
use XbergIo\Crawlberg\Crawlberg;

$config = (new CrawlConfig())
    ->withMaxDepth(3)
    ->withMaxPages(500)
    ->withConcurrency(20);

$result = Crawlberg::scrape('https://example.com', $config);
echo $result->markdown;

composer require xberg-io/crawlberg

PHP 8.2, 8.3, and 8.4 are supported; precompiled NTS extensions ship for Linux (glibc, aarch64/x86_64), macOS (ARM64/x86_64), and Windows (VS16/VS17 x86_64).

Breaking Changes / Compatibility Notes

v1.0.0 is a breaking release for users of the pre-release kreuzcrawl / kreuzberg-namespaced packages. Users already on the crawlberg name from 0.x pre-releases can skip the rename rows; the remaining rows are the behavioral breaking changes:

Area	Old behavior	New behavior	Action required
Package identifiers	`kreuzcrawl` everywhere	`crawlberg` / `@xberg-io/crawlberg` etc.	Update dependency declarations in all manifests
Env vars	`KREUZBERG_*`	`CRAWLBERG_*`	Update shell configs, CI env blocks, K8s Secrets
C FFI symbols	`kcrawl_*`	`cberg_*`	Recompile; update header includes and linker references
Go module path	`github.com/kreuzberg-dev/kreuzcrawl/...`	`github.com/xberg-io/crawlberg/packages/go`	`go get` new path; update all import statements
`CrawlError::WafBlocked`	Unit variant	Struct variant `{ vendor, message }`	Update match arms to destructure
`NetworkErrorKind`	Exhaustive enum	`#[non_exhaustive]` applied	Add wildcard `_` arms to exhaustive matches; recompile
`CrawlError` / dispatch enums	Exhaustive enums	`#[non_exhaustive]` applied	Add wildcard `_` arms to exhaustive matches; recompile
`SimpleRetryPolicy`	`max_retries=3` → 2 actual retries (off-by-one)	`max_retries=3` → 3 actual retries (fixed)	Audit retry budgets if behavior depended on the old count
`DomainStatePort`	Mutation model	Observation model (`recommend`/`observe`)	Update trait implementations if you implemented this trait
SSRF policy	Disabled by default	Enabled by default	Add `CRAWLBERG_ALLOW_PRIVATE_NETWORK=1` for internal crawls
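The SimpleRetryPolicy row is easiest to reason about as arithmetic: assuming "retries" excludes the initial attempt (which is how the table reads), max_retries=3 means up to 4 attempts in total. The helper below is a hypothetical sketch of that counting, not the library's retry loop.

```python
from typing import Callable


def run_with_retries(operation: Callable[[], bool], max_retries: int) -> int:
    """Run `operation` once, then retry up to `max_retries` more times.
    Returns the total number of attempts made."""
    attempts = 0
    for _ in range(1 + max_retries):  # 1 initial attempt + max_retries retries
        attempts += 1
        if operation():
            break
    return attempts


# An operation that always fails uses the full budget: 1 initial try + 3 retries.
print(run_with_retries(lambda: False, max_retries=3))  # 4
# An operation that succeeds immediately never retries.
print(run_with_retries(lambda: True, max_retries=3))   # 1
```

Under the old off-by-one behavior described in the table, the same setting would have stopped one retry earlier, so backoff totals computed against the old count need to be rechecked.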

Upgrade Guide

Step-by-step for existing `kreuzcrawl` users

1. Update package declarations

# Python
pip install crawlberg        # replaces kreuzcrawl

# Node.js
npm install @xberg-io/crawlberg   # replaces @kreuzberg/crawl

# PHP
composer require xberg-io/crawlberg   # replaces kreuzberg/kreuzcrawl

# Go
go get github.com/xberg-io/crawlberg/packages/go@v1.0.0

# Rust
cargo add crawlberg@1.0.0    # replaces kreuzcrawl

# C# - verify the exact package ID in the C# README before running
dotnet add package XbergIo.Crawlberg

2. Update environment variables and configuration

# Old
KREUZBERG_ALLOW_PRIVATE_NETWORK=1
# New
CRAWLBERG_ALLOW_PRIVATE_NETWORK=1

3. Audit SSRF settings before first run

If you crawl internal networks (CI test targets, internal APIs, localhost), set CRAWLBERG_ALLOW_PRIVATE_NETWORK=1 before upgrading. Without it, requests to RFC 1918 and loopback addresses will fail with CrawlError::SsrfPolicyViolation.

4. Update C FFI call sites (if applicable)

Replace all kcrawl_ symbol references with cberg_. Regenerate cbindgen headers.

5. Fix CrawlError::WafBlocked match arms

// Old - unit variant
CrawlError::WafBlocked => { /* ... */ }

// New - struct variant
CrawlError::WafBlocked { vendor, message } => {
    eprintln!("WAF block: {vendor}: {message}");
}

6. Add wildcard arms for #[non_exhaustive] enums

CrawlError, NetworkErrorKind, and all dispatch enums (EscalationReason, EscalationStrategy, etc.) are now #[non_exhaustive]. Any exhaustive match on these types will fail to compile until a wildcard _ => { ... } arm is added.

Verification checklist

[ ] crawlberg --version prints 1.0.0
[ ] scrape() returns a result for a known public URL
[ ] A streaming crawl over a multi-page site completes without OOM
[ ] CrawlError::WafBlocked match arms compile (struct variant)
[ ] All CRAWLBERG_* env vars are present in CI/CD
[ ] No residual KREUZBERG_* vars shadow the new names in your process environment
[ ] C FFI header compilation succeeds with cberg_ symbol names
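The two env var items in the checklist can be checked mechanically. The following sketch scans an environment mapping for leftover KREUZBERG_* names and reports the CRAWLBERG_* name each one maps to; find_stale_vars is a hypothetical helper, and the prefix swap assumes the suffix after the prefix is unchanged, as in the KREUZBERG_ALLOW_PRIVATE_NETWORK example in step 2.

```python
import os
from typing import Dict, Mapping

OLD_PREFIX = "KREUZBERG_"
NEW_PREFIX = "CRAWLBERG_"


def find_stale_vars(env: Mapping[str, str]) -> Dict[str, str]:
    """Map each leftover KREUZBERG_* name to the CRAWLBERG_* name that replaces it."""
    return {
        name: NEW_PREFIX + name[len(OLD_PREFIX):]
        for name in env
        if name.startswith(OLD_PREFIX)
    }


# Report against the current process environment.
stale = find_stale_vars(os.environ)
for old, new in sorted(stale.items()):
    status = "already set" if new in os.environ else "MISSING"
    print(f"{old} -> {new} ({status})")
```

Running this in a CI job before the first crawl gives an empty report when the migration is complete.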

Rollback

Previous kreuzcrawl packages remain published at their last version. Pin your dependency to the last kreuzcrawl version and revert env vars if you need to roll back while investigating issues.

Operational Guidance for Production

Tuning recommendations

Concurrency: Start at 10–20 per domain. The global concurrency cap and per-domain rate limiter are enforced independently; the global cap prevents resource exhaustion, the per-domain limiter prevents hammering individual targets.
Depth and page limits: Set max_depth and max_pages conservatively. Even with crawl_stream's bounded memory, an uncapped frontier can grow large on deeply-linked sites.
Retry budget: max_retries is now exact (off-by-one fixed). If your configuration relied on the old count, validate your retry/backoff math before deploying.
Browser pool pre-warming: Call BrowserPool::warm(n) at startup. Lazy browser spin-up adds significant latency to the first escalated request per domain; pre-warming eliminates that spike.
Proxy rotation: Implement ProxyProvider for production anti-blocking workloads. The trait is async; you can call an external proxy API per request without blocking the crawl loop.

Observability checklist

Wire crawlberg_waf_fingerprint_matches_total and crawlberg_escalations_total into your metrics system. These counters are available via the OpenTelemetry integration. Enabling OTLP export requires a Cargo feature; check the crate's Cargo.toml for the exact feature name before adding it to your dependency declaration.
Set RUST_LOG=crawlberg=info for structured tracing output in production; =debug for request-level detail.
Treat CrawlError::SsrfPolicyViolation as a security event; log the violating URL and source.
Track the ratio of HTTP-tier successes to browser-tier escalations per domain. A sustained high escalation ratio for a domain you expect to be cooperative signals a misconfigured fingerprint corpus or an actual WAF rollout.
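The last item can be computed from two per-domain counters. The sketch below expresses it as the fraction of requests that escalated, which is one reasonable way to normalize the ratio; the function names, the input shape, and the 0.5 threshold are assumptions for illustration, not metrics Crawlberg exports in this form.

```python
from typing import Dict, List, Tuple


def escalation_ratios(counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """counts maps domain -> (http_tier_successes, browser_tier_escalations).
    Returns the fraction of requests per domain that escalated."""
    ratios: Dict[str, float] = {}
    for domain, (http_ok, escalations) in counts.items():
        total = http_ok + escalations
        ratios[domain] = escalations / total if total else 0.0
    return ratios


def domains_over(counts: Dict[str, Tuple[int, int]], threshold: float = 0.5) -> List[str]:
    """Domains whose escalation fraction exceeds the (assumed) alert threshold."""
    return sorted(d for d, r in escalation_ratios(counts).items() if r > threshold)


counts = {"docs.example": (3, 1), "shop.example": (1, 3)}
print(escalation_ratios(counts))  # {'docs.example': 0.25, 'shop.example': 0.75}
print(domains_over(counts))       # ['shop.example']
```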

Failure modes and mitigations

Mode	Signal	Mitigation
Rate-limited by target	HTTP 429, escalation signal indicating rate limiting (check `EscalationReason` type docs for the exact variant)	Increase `rate_limit_delay_ms`; reduce concurrency
WAF block loop	`crawlberg_escalations_total` elevated for one domain	Inspect fingerprint corpus; tune `EscalationBudget`
OOM on large crawl	RSS growing unboundedly	Confirm you are using `crawl_stream`, not `crawl`
SSRF violations in CI	`SsrfPolicyViolation` errors on test targets	Add `CRAWLBERG_ALLOW_PRIVATE_NETWORK=1` to CI environment
Browser pool exhausted	Slow escalation, request queue buildup	Increase pool size or reduce browser-tier concurrency
MCP HTTP endpoint not responding	No tools returned from `/mcp`	Verify binary was built with the `api` and `mcp` Cargo features
Swift package not resolving	SwiftPM `branch not found` error	Upgrade to 1.0.0; the `release/swift/1.0.0` branch is now correctly created

Security and Responsible Crawling

Crawlberg's default-on SSRF defense closes the most common server-side request forgery vector for multi-tenant deployments, but responsible crawling requires attention beyond internal network safety:

robots.txt: The engine fetches and respects robots.txt automatically. If a site specifies a Crawl-delay, set rate_limit_delay_ms to at least that value.
Rate limiting: As a general starting point (not a framework default), 1,000–2,000 ms between requests per domain avoids hammering most public sites. Aggressive crawling harms target infrastructure regardless of technical capability.
User-Agent: Set a descriptive User-Agent that identifies your crawler and includes a contact URL or email address. This gives site operators a channel to reach you.
Terms of service: SSRF defense and robots.txt compliance are technical safeguards, not legal authorization. Review the ToS of any site you crawl at scale.
Data retention: Crawled content may contain personal data. Apply your jurisdiction's retention and deletion requirements.
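The Crawl-delay advice in the first item can be sketched with Python's standard-library urllib.robotparser. Note that this uses Python's parser as a stand-in, not crawlberg::robots, and that min_delay_ms, the sample robots.txt, and the "my-crawler" user agent are hypothetical. The 1,000 ms floor reflects the general starting point suggested above; the result is the value you would assign to rate_limit_delay_ms.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""


def min_delay_ms(robots_txt: str, user_agent: str, floor_ms: int = 1000) -> int:
    """Return the larger of the site's Crawl-delay (in ms) and our own floor."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay_s = parser.crawl_delay(user_agent)  # None when no Crawl-delay is set
    site_ms = int(delay_s * 1000) if delay_s is not None else 0
    return max(site_ms, floor_ms)


print(min_delay_ms(ROBOTS_TXT, "my-crawler"))  # 2000
```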

Contributor and Ecosystem Notes

Crawlberg's 14 language bindings are generated by alef (pinned at 0.26.6 in this release). Contributing to a binding means editing alef templates, not the generated crates directly.

Areas where the community can contribute:

WAF fingerprint corpus: rules/waf_fingerprints.toml benefits from real-world signal data. PRs adding fingerprints with test cases and site-class annotations are a concrete, low-barrier contribution.
Per-language ergonomics: The generated APIs are consistent but conservative. Per-language maintainers are welcome to propose binding-layer ergonomic improvements that stay within the generated API contract.
Benchmarks: Criterion benchmarks for the WAF subsystem ship in the repo. Throughput and latency benchmarks for the HTTP and streaming layers are an open gap.
MCP integrations: The MCP server opens Crawlberg to any agent framework that speaks Model Context Protocol. Reference integration guides for popular frameworks (LangChain, CrewAI, Pydantic AI) are high-value additions.
Docs: docs.crawlberg.xberg.io is built from docs/. Use-case guides (data pipelines, AI agent integrations, e-commerce monitoring, academic web archiving) are welcome.

FAQ

Q: Is this a drop-in replacement for kreuzcrawl?
Functionally yes; the API shape is identical. The package names, env vars, C FFI symbols, and Go module path have changed. Follow the upgrade guide; the migration is mechanical.

Q: Does the SSRF defense break internal test environments?
It will if your tests crawl localhost or RFC 1918 addresses. Set CRAWLBERG_ALLOW_PRIVATE_NETWORK=1 in your test environment, or call CrawlConfig::allow_private_networks(true) in test setup code.

Q: When should I use crawl_stream instead of crawl?
Use crawl_stream for any crawl larger than a few hundred pages. It bounds peak memory at ~20 MB regardless of corpus size. Use the batch crawl() only when you need the full result set in memory at once, which is uncommon in production pipelines.

Q: Is the 1.0.0 API stable across all 14 bindings?
The Rust crate, CLI, and MCP tool definitions are stable. For Elixir specifically: the repository's Quick Start currently shows {:crawlberg, "~> 0.3"} — verify the current published version at hex.pm/packages/crawlberg before pinning.

Q: Does tiered dispatch slow down crawls that don't need browser rendering?
No. Domains that respond normally at the HTTP tier never escalate. EWMA per-domain state promotes well-behaved domains to start at the HTTP tier, so they avoid bypass and browser probes entirely after the initial warm-up within a session.

Learn more

Release: github.com/xberg-io/crawlberg/releases/tag/v1.0.0
Repository: github.com/xberg-io/crawlberg
Documentation: docs.crawlberg.xberg.io
Getting started: pip install crawlberg · npm install @xberg-io/crawlberg · cargo add crawlberg
Discord: discord.gg/xt9WY3GnKR
