Crawlberg v1.0.0 is the first stable release of the project previously published as kreuzcrawl. It freezes the public API under the new project name. All technical features below shipped in v0.3.0 (2026-06-23); v1.0.0 is a stability declaration and rename, not a new feature release.
The four production-facing changes most likely to require operational action:
- Package and env var rename - every artifact identifier has changed; see the migration table.
- SSRF defense is now on by default - internal crawl targets (localhost, RFC 1918, cloud metadata) will fail without CRAWLBERG_ALLOW_PRIVATE_NETWORK=1.
- CrawlError::WafBlocked is now a struct variant - exhaustive match arms will not compile until updated.
- max_retries semantics changed - off-by-one fixed; max_retries=3 now produces exactly 3 retries.
Precompiled binaries cover Linux (x86_64/aarch64), macOS (ARM64 and x86_64), and Windows x64. Homebrew bottles and Docker images on GHCR are also available.
What Is Crawlberg?
Crawlberg is a web crawling engine written primarily in Rust that exposes a single consistent API across 14 language runtimes. It handles HTTP transport, JavaScript rendering, robots.txt compliance, per-domain rate limiting, SSRF safety, and structured extraction. Extension points (Frontier, RateLimiter, CrawlStore, EventEmitter, ContentFilter, WafClassifier, ProxyProvider) are injectable traits; wire in your own frontier, storage backend, or proxy pool without forking the engine.
A single scrape() call returns text, metadata, links, images, assets, JSON-LD, Open Graph tags, hreflang, favicons, headings, response headers, and clean HTML→Markdown. When a site requires JavaScript, the optional headless browser tier handles it transparently.
v1.0.0 promotes v1.0.0-rc.2 and freezes the public API under the new project name. The features described in the sections below represent the platform that 1.0.0 declares stable; they shipped in v0.3.0.
What v1.0.0 Declares Stable
These capabilities shipped in v0.3.0 (2026-06-23). v1.0.0 freezes their API and declares them production-stable under the new crawlberg package name. Engineers running 0.3.0 already have the runtime features; upgrading to 1.0.0 means: rename packages, update env vars, get the stable API contract.
Project rename: kreuzcrawl → crawlberg
The most operationally significant change is the rename. Every artifact identifier has changed:
| Artifact | Old | New |
|---|---|---|
| Crate (crates.io) | kreuzcrawl | crawlberg |
| PyPI | kreuzcrawl | crawlberg |
| npm | @kreuzberg/crawl | @xberg-io/crawlberg |
| Composer | kreuzberg/kreuzcrawl | xberg-io/crawlberg |
| Maven groupId | dev.kreuzberg | io.xberg.crawlberg |
| NuGet | KreuzbergDev.KreuzCrawl | XbergIo.Crawlberg (see note below) |
| Go module | github.com/kreuzberg-dev/kreuzcrawl/... | github.com/xberg-io/crawlberg/packages/go |
| C FFI symbol prefix | kcrawl_* | cberg_* |
| Env vars | KREUZBERG_* | CRAWLBERG_* |
| Docs | docs.kreuzcrawl.kreuzberg.dev | docs.crawlberg.xberg.io |
Behavior and API shape are identical. This is a rename, not a rewrite.
Tiered dispatch engine
The crawl engine chains HTTP → bypass → headless browser, driven by per-attempt signals rather than a static configuration flag. When a response indicates a WAF challenge, the engine escalates; when it succeeds, it records the outcome in per-domain state and adjusts the starting tier for subsequent visits.
Public types: Tier, EscalationStrategy, EscalationReason, AttemptOutcome, RetryDirective, RetryPolicy, WafSignal, DispatchProfile. All dispatch enums are #[non_exhaustive] — future tiers are non-breaking additions.
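The escalation chain can be modelled in a few lines. This is an illustrative Python sketch, not Crawlberg's implementation: the Tier ordering mirrors the HTTP → bypass → headless chain, while fetch_with_escalation, the attempt callback, and the "ok" / "waf_challenge" outcome strings are hypothetical names introduced for the example.

```python
from enum import IntEnum
from typing import Callable, Dict


class Tier(IntEnum):
    HTTP = 0
    BYPASS = 1
    BROWSER = 2


def fetch_with_escalation(
    domain: str,
    attempt: Callable[[Tier], str],
    start_tiers: Dict[str, Tier],
) -> Tier:
    """Try tiers in order, starting from the tier remembered for this domain.

    `attempt` is a hypothetical callback returning "ok" or "waf_challenge".
    Returns the tier that succeeded and records it as the next starting tier.
    """
    tier = start_tiers.get(domain, Tier.HTTP)
    while True:
        if attempt(tier) == "ok":
            start_tiers[domain] = tier
            return tier
        if tier == Tier.BROWSER:
            raise RuntimeError(f"{domain}: blocked at every tier")
        tier = Tier(tier + 1)  # escalate one step


# Simulated site that only answers once a browser is used.
state: Dict[str, Tier] = {}
result = fetch_with_escalation(
    "example.com",
    lambda t: "ok" if t == Tier.BROWSER else "waf_challenge",
    state,
)
print(result.name, state["example.com"].name)
```

The sketch only ever raises the remembered starting tier; lowering it again after a run of clean responses is the job of the per-domain block-rate tracking.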
WAF detection with hot-reload fingerprints
A TOML fingerprint corpus (rules/waf_fingerprints.toml, 34 fingerprints) feeds an Aho-Corasick automaton. TomlClassifier::watch() watches the file with a debounced watcher and swaps the compiled automaton atomically via ArcSwap — no process restart needed. This is safe for Kubernetes ConfigMap updates: mount the TOML as a ConfigMap volume, edit it, and the running engine picks up the new corpus within seconds.
Per-domain block rates are tracked with EwmaDomainState, an exponentially weighted moving average that automatically promotes or demotes the starting tier based on recent history.
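To make the EWMA idea concrete, here is a minimal sketch of a per-domain block-rate estimate. The class name, the smoothing factor of 0.3, and the 0.5 threshold are assumptions chosen for illustration; the post does not state the values EwmaDomainState actually uses.

```python
from dataclasses import dataclass


@dataclass
class EwmaBlockRate:
    """Exponentially weighted moving average of block outcomes for one domain."""

    alpha: float = 0.3  # assumed smoothing factor; higher reacts faster
    rate: float = 0.0   # current block-rate estimate in [0, 1]

    def observe(self, blocked: bool) -> float:
        sample = 1.0 if blocked else 0.0
        self.rate = self.alpha * sample + (1.0 - self.alpha) * self.rate
        return self.rate

    def start_tier(self, threshold: float = 0.5) -> str:
        # Assumed rule: start at a heavier tier while blocks dominate.
        return "browser" if self.rate > threshold else "http"


domain = EwmaBlockRate()
for blocked in (True, True, True):
    domain.observe(blocked)
print(round(domain.rate, 3), domain.start_tier())
```

Because old samples decay geometrically, a few clean responses pull the estimate back under the threshold without any explicit reset.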
SSRF defense, on by default
Every fetch path runs URL validation before the network call and after each redirect hop. Blocked address ranges:
- 127.0.0.0/8 (loopback)
- RFC 1918 private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
- 169.254.0.0/16 (link-local, including cloud metadata endpoints such as 169.254.169.254)
- 0.0.0.0/8 (this-network/reserved per RFC 1122 §3.2.1.3)
- 224.0.0.0/4 (multicast)
- IPv6 ULA fc00::/7, link-local fe80::/10, multicast ff00::/8
- Any non-http(s) scheme
Three protection layers work together:
- DNS-rebinding mitigation: every resolved IP must pass the policy, not just the hostname at call time.
- Redirect-chain re-validation: each hop re-resolves and re-validates, bounded by ssrf.max_redirects (default 5).
- Link-enqueue validation: URLs are validated against the SSRF policy before being added to the crawl frontier, not only at fetch time.
Allowlisting is available via HostMatcher (Exact/Suffix/Cidr variants). Opt out entirely with CRAWLBERG_ALLOW_PRIVATE_NETWORK=1.
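The address policy above can be approximated with Python's standard ipaddress module. This is a sketch under stated assumptions, not the engine's code: ip_is_blocked and url_scheme_allowed are hypothetical helper names, and the allowlist is reduced to CIDR entries only (HostMatcher's Exact and Suffix variants are not modelled).

```python
import ipaddress
from urllib.parse import urlsplit

# The blocked ranges listed in this section.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "127.0.0.0/8", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
        "169.254.0.0/16", "0.0.0.0/8", "224.0.0.0/4",
        "fc00::/7", "fe80::/10", "ff00::/8",
    )
]


def ip_is_blocked(ip_text: str, allow_cidrs=()) -> bool:
    """Return True when the address falls in a blocked range and is not allowlisted."""
    ip = ipaddress.ip_address(ip_text)
    if any(ip in ipaddress.ip_network(cidr) for cidr in allow_cidrs):
        return False
    return any(ip in net for net in BLOCKED_NETWORKS)


def url_scheme_allowed(url: str) -> bool:
    """Only http and https URLs pass."""
    return urlsplit(url).scheme.lower() in ("http", "https")


print(ip_is_blocked("169.254.169.254"), url_scheme_allowed("file:///etc/passwd"))
```

The sketch does not normalize IPv4-mapped IPv6 addresses such as ::ffff:127.0.0.1, which a production policy would also need to cover.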
Memory-bounded streaming crawl
crawl_stream() and batch_crawl_stream() previously accumulated every CrawlEvent::Page in memory. They now yield each page and drop it immediately. Based on internal measurements documented in the changelog, peak working-set drops from approximately 2.5 GB to approximately 20 MB on large crawls. The batch crawl() API, which returns all pages at once, is unchanged.
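The difference between the two styles is the usual one between materializing a list and consuming a generator. The sketch below uses a stand-in fetch_pages generator rather than the real crawler; all three function names are hypothetical.

```python
from typing import Iterator, List


def fetch_pages(n: int) -> Iterator[str]:
    """Stand-in for a crawl: lazily produces n page bodies."""
    for i in range(n):
        yield f"<html>page {i}</html>"


def batch_crawl(n: int) -> List[str]:
    # Batch style: every page is held in memory before the caller sees any.
    return list(fetch_pages(n))


def stream_word_count(n: int) -> int:
    # Streaming style: each page is processed, then becomes garbage.
    total = 0
    for page in fetch_pages(n):
        total += len(page.split())
    return total


print(len(batch_crawl(5)), stream_word_count(5))
```

In the streaming version only one page is referenced at a time, so memory use depends on page size rather than page count.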
MCP server and full CLI parity
The CLI exposes batch-scrape, batch-crawl, download, citations, and version, 1:1 with the core and MCP surfaces. The MCP server serves tools over both stdio and rmcp Streamable HTTP at /mcp. HTTP transport requires the binary to be compiled with the api and mcp Cargo features; the release CLI binary includes both. Each tool carries read_only/destructive/open_world safety annotations for agent orchestration frameworks that need to reason about side effects before calling tools.
Public substrate parsers
crawlberg::robots and crawlberg::sitemap are now public modules, usable without spinning up the full crawl engine. parse_robots_txt, is_path_allowed, RobotsRules, parse_sitemap_xml, parse_sitemap_index, and is_sitemap_index are all available standalone, which is useful for robots/sitemap preprocessing in pipelines that manage their own fetch layer.
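The signatures of parse_robots_txt and is_path_allowed are not shown in this post, so the sketch below uses Python's standard urllib.robotparser as a stand-in to illustrate the same preprocessing step: parse a robots.txt body you fetched yourself, then ask whether a path is allowed. The user-agent string and the sample rules are made up for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

# Parse rules from text already in hand; no network fetch is involved.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("my-crawler", "https://example.com/docs/")
denied = parser.can_fetch("my-crawler", "https://example.com/private/x")

# Crawl-delay is expressed in seconds; convert for a millisecond setting.
delay_ms = int(parser.crawl_delay("my-crawler") * 1000)

print(allowed, denied, delay_ms)
```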
Deep Technical Highlights
Escalation budget injection
EscalationBudget is a user-injectable trait. You can implement per-domain, per-hour browser budget caps, or tie escalation policy to real-time proxy cost signals. The built-in EwmaDomainState is designed for zero-configuration deployment; the trait interface is designed for when you have stronger opinions.
Lock-free corpus hot-reload
The WAF automaton lives under an ArcSwap. Readers take a guard for the duration of a single classification call (on the order of microseconds) and never contend with the writer. The writer side compiles a new automaton (tens of milliseconds for a large corpus) and swaps it in a single atomic store. In-flight requests complete against the old automaton; new requests use the new one immediately. Readers never block on corpus updates.
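A rough Python analogue of the snapshot-and-swap pattern is shown below. It is only a sketch: FingerprintCorpus is a hypothetical class, plain substring matching stands in for the Aho-Corasick automaton, and rebinding an attribute stands in for the atomic store that ArcSwap performs in Rust.

```python
import threading
from typing import FrozenSet, Iterable


class FingerprintCorpus:
    """Readers take a snapshot reference; the writer replaces it wholesale."""

    def __init__(self, patterns: Iterable[str]) -> None:
        self._current: FrozenSet[str] = frozenset(patterns)
        self._write_lock = threading.Lock()  # serializes writers only

    def classify(self, body: str) -> bool:
        snapshot = self._current  # one reference read, no lock taken
        return any(pattern in body for pattern in snapshot)

    def reload(self, patterns: Iterable[str]) -> None:
        compiled = frozenset(patterns)  # build the new corpus off to the side
        with self._write_lock:
            self._current = compiled  # swap by rebinding the reference


corpus = FingerprintCorpus(["cf-chl"])
print(corpus.classify("<div id='cf-chl-widget'>"))
corpus.reload(["akamai-bm"])
print(corpus.classify("<div id='cf-chl-widget'>"))
```

A classification that started before reload() keeps using the snapshot it already holds, which mirrors the "in-flight requests complete against the old automaton" behavior.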
SSRF at every redirect hop
URL validation at the call site alone is insufficient: a hostname can pass the initial check and then resolve to a private address once a short DNS TTL expires (DNS rebinding). Crawlberg re-resolves and re-validates at every redirect, bounded by ssrf.max_redirects (default 5). The SsrfPolicy::from_env serde default means a CrawlConfig deserialized from JSON automatically honors the environment variable, which matters for container deployments where env vars are the primary configuration channel.
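The per-hop check can be sketched as a loop that resolves and validates before every fetch. Everything here is illustrative: follow_redirects, SsrfViolation, and the injected resolve and fetch callables are hypothetical, and the policy is simplified to the classification flags in Python's ipaddress module rather than the exact ranges Crawlberg blocks.

```python
import ipaddress
from typing import Callable, List, Optional
from urllib.parse import urlsplit


class SsrfViolation(Exception):
    """Raised when a hop resolves to a blocked address or the chain is too long."""


def is_public(ip_text: str) -> bool:
    # Simplified policy built on stdlib classification flags.
    ip = ipaddress.ip_address(ip_text)
    return not (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_multicast or ip.is_unspecified)


def follow_redirects(
    url: str,
    resolve: Callable[[str], List[str]],
    fetch: Callable[[str], Optional[str]],
    max_redirects: int = 5,
) -> str:
    """Validate every hop. `fetch` returns the redirect target, or None when done."""
    for _ in range(max_redirects + 1):  # initial request plus allowed redirects
        host = urlsplit(url).hostname or ""
        ips = resolve(host)
        # Every resolved address must pass, not just the first one.
        if not ips or not all(is_public(ip) for ip in ips):
            raise SsrfViolation(f"{url} resolves to a blocked address")
        next_url = fetch(url)
        if next_url is None:
            return url
        url = next_url
    raise SsrfViolation("too many redirects")
```

Passing the resolver and fetcher in as callables keeps the sketch testable without network access; a public URL that redirects to a metadata address is rejected on the second hop.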
Browser pool lifecycle
BrowserPool is public. Construct and warm(n) the pool at startup; pass it via CrawlEngineBuilder::with_browser_pool(). Browser instances are reused across crawl jobs rather than spawned per escalation event. CrawlEngineHandle::from_engine() produces a cloneable handle, so multiple async tasks can share a single engine and pool without additional coordination.
Asset downloads through the SSRF filter
Before this release, download_documents was honored only by single-page scrape(); the crawl loop fetched, flagged, and discarded the bytes. Downloads now route through http_fetch, the same transport as page fetches, so every file download is subject to the SSRF policy and per-domain rate limiting.
Performance Implications
The most directly measurable change is streaming memory: ~2.5 GB → ~20 MB peak working-set on large crawls (figures from the changelog; no external benchmark suite has been published for this release). The practical implication is that crawl corpus size is no longer bounded by available RAM when using crawl_stream.
Throughput: the tiered dispatch model adds a small latency overhead for requests that escalate (one additional HTTP probe before browser spin-up). Domains that respond normally to plain HTTP never pay this cost. EWMA per-domain state promotes well-behaved domains to start at the HTTP tier, avoiding unnecessary bypass or browser escalation for clean domains.
The ArcSwap-backed corpus reload is lock-free from the reader's perspective, so fingerprint corpus updates do not introduce latency spikes in production.
No benchmark numbers for throughput, requests-per-second, or latency percentiles are published in this release. Teams evaluating Crawlberg for high-throughput workloads should run their own benchmarks against the stable 1.0.0 surface; the Criterion benchmarks in the repository cover the WAF subsystem and are a starting point for extending coverage.
Language Bindings Spotlight
All 14 bindings are generated from the same Rust core by alef and contain no per-language extraction logic. The code snippets below are illustrative; check the per-language READMEs for exact API signatures.
Python
import asyncio
from crawlberg import CrawlConfig, scrape, crawl_stream
config = CrawlConfig(max_depth=3, max_pages=500, concurrency=20)
async def main():
    # Single-page extraction
    result = await scrape("https://example.com", config=config)
    print(result.markdown)
    # Memory-bounded streaming crawl
    async for event in crawl_stream("https://example.com", config=config):
        if event.page:
            print(event.page.url, event.page.title)
asyncio.run(main())
pip install crawlberg
JavaScript / Node.js
import { scrape, crawlStream, CrawlConfig } from "@xberg-io/crawlberg";
const config = new CrawlConfig({ maxDepth: 3, maxPages: 500, concurrency: 20 });
const result = await scrape("https://example.com", { config });
console.log(result.markdown);
for await (const event of crawlStream("https://example.com", { config })) {
  if (event.page) console.log(event.page.url, event.page.title);
}
npm install @xberg-io/crawlberg
PHP
use XbergIo\Crawlberg\CrawlConfig;
use XbergIo\Crawlberg\Crawlberg;
$config = (new CrawlConfig())
    ->withMaxDepth(3)
    ->withMaxPages(500)
    ->withConcurrency(20);
$result = Crawlberg::scrape('https://example.com', $config);
echo $result->markdown;
composer require xberg-io/crawlberg
PHP 8.2, 8.3, and 8.4 are supported; precompiled NTS extensions ship for Linux (glibc, aarch64/x86_64), macOS (ARM64/x86_64), and Windows (VS16/VS17 x86_64).
Breaking Changes / Compatibility Notes
v1.0.0 is a breaking release for users of pre-release kreuzcrawl / kreuzberg-namespaced packages. For users already on the crawlberg name from 0.x pre-releases, the behavioral breaking changes are:
| Area | Old behavior | New behavior | Action required |
|---|---|---|---|
| Package identifiers | kreuzcrawl everywhere | crawlberg / @xberg-io/crawlberg etc. | Update dependency declarations in all manifests |
| Env vars | KREUZBERG_* | CRAWLBERG_* | Update shell configs, CI env blocks, K8s Secrets |
| C FFI symbols | kcrawl_* | cberg_* | Recompile; update header includes and linker references |
| Go module path | github.com/kreuzberg-dev/kreuzcrawl/... | github.com/xberg-io/crawlberg/packages/go | go get new path; update all import statements |
| CrawlError::WafBlocked | Unit variant | Struct variant { vendor, message } | Update match arms to destructure |
| NetworkErrorKind | Exhaustive enum | #[non_exhaustive] applied | Add wildcard _ arms to exhaustive matches; recompile |
| CrawlError / dispatch enums | Exhaustive enums | #[non_exhaustive] applied | Add wildcard _ arms to exhaustive matches; recompile |
| SimpleRetryPolicy | max_retries=3 → 2 actual retries (off-by-one) | max_retries=3 → 3 actual retries (fixed) | Audit retry budgets if behavior depended on the old count |
| DomainStatePort | Mutation model | Observation model (recommend/observe) | Update trait implementations if you implemented this trait |
| SSRF policy | Disabled by default | Enabled by default | Add CRAWLBERG_ALLOW_PRIVATE_NETWORK=1 for internal crawls |
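The retry row is easiest to audit by counting calls. The sketch below assumes "retries" means attempts made after the initial one, so max_retries=3 yields four calls in total under the fixed behavior (the old off-by-one would have yielded three). run_with_retries is a hypothetical helper, not SimpleRetryPolicy itself, and it omits backoff.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_with_retries(operation: Callable[[], T], max_retries: int) -> Tuple[T, int]:
    """Call `operation` once, then retry up to `max_retries` more times.

    Returns the result together with the number of attempts used.
    """
    attempts = 0
    last_error = None
    for _ in range(1 + max_retries):  # 1 initial attempt + N retries
        attempts += 1
        try:
            return operation(), attempts
        except Exception as exc:  # broad catch keeps the sketch short
            last_error = exc
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

If your backoff schedule or alerting thresholds were tuned against the old count, the extra attempt changes both the worst-case latency and the number of requests a failing target receives.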
Upgrade Guide
Step-by-step for existing kreuzcrawl users
1. Update package declarations
# Python
pip install crawlberg # replaces kreuzcrawl
# Node.js
npm install @xberg-io/crawlberg # replaces @kreuzberg/crawl
# PHP
composer require xberg-io/crawlberg # replaces kreuzberg/kreuzcrawl
# Go
go get github.com/xberg-io/crawlberg/packages/go@v1.0.0
# Rust
cargo add crawlberg@1.0.0 # replaces kreuzcrawl
# C# - verify the exact package ID in the C# README before running
dotnet add package XbergIo.Crawlberg
2. Update environment variables and configuration
# Old
KREUZBERG_ALLOW_PRIVATE_NETWORK=1
# New
CRAWLBERG_ALLOW_PRIVATE_NETWORK=1
3. Audit SSRF settings before first run
If you crawl internal networks (CI test targets, internal APIs, localhost), set CRAWLBERG_ALLOW_PRIVATE_NETWORK=1 before upgrading. Without it, requests to RFC 1918 and loopback addresses will fail with CrawlError::SsrfPolicyViolation.
4. Update C FFI call sites (if applicable)
Replace all kcrawl_ symbol references with cberg_. Regenerate cbindgen headers.
5. Fix CrawlError::WafBlocked match arms
// Old - unit variant
CrawlError::WafBlocked => { /* ... */ }
// New - struct variant
CrawlError::WafBlocked { vendor, message } => {
    eprintln!("WAF block: {vendor}: {message}");
}
6. Add wildcard arms for #[non_exhaustive] enums
CrawlError, NetworkErrorKind, and all dispatch enums (EscalationReason, EscalationStrategy, etc.) are now #[non_exhaustive]. Any exhaustive match on these types will fail to compile until a wildcard _ => { ... } arm is added.
Verification checklist
- [ ] crawlberg --version prints 1.0.0
- [ ] scrape() returns a result for a known public URL
- [ ] A streaming crawl over a multi-page site completes without OOM
- [ ] CrawlError::WafBlocked match arms compile (struct variant)
- [ ] All CRAWLBERG_* env vars are present in CI/CD
- [ ] No residual KREUZBERG_* vars shadow the new names in your process environment
- [ ] C FFI header compilation succeeds with cberg_ symbol names
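The two environment-variable items in the checklist are easy to automate. The following is a small sketch of such a check; find_stale_env_vars is a hypothetical helper, and it assumes the rename is a pure prefix swap from KREUZBERG_ to CRAWLBERG_, as the migration table suggests.

```python
import os
from typing import Dict, List, Mapping

OLD_PREFIX = "KREUZBERG_"
NEW_PREFIX = "CRAWLBERG_"


def find_stale_env_vars(env: Mapping[str, str]) -> Dict[str, List[str]]:
    """Report old-prefix variables, split by whether a renamed twin exists."""
    report: Dict[str, List[str]] = {"unmigrated": [], "shadowing": []}
    for name in sorted(env):
        if not name.startswith(OLD_PREFIX):
            continue
        renamed = NEW_PREFIX + name[len(OLD_PREFIX):]
        # "shadowing": both names are set; "unmigrated": only the old one is.
        key = "shadowing" if renamed in env else "unmigrated"
        report[key].append(name)
    return report


if __name__ == "__main__":
    print(find_stale_env_vars(os.environ))
```

Running it in CI and failing the job when either list is non-empty catches forgotten variables before the first deploy.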
Rollback
Previous kreuzcrawl packages remain published at their last version. Pin your dependency to the last kreuzcrawl version and revert env vars if you need to roll back while investigating issues.
Operational Guidance for Production
Tuning recommendations
- Concurrency: Start at 10–20 per domain. The global concurrency cap and per-domain rate limiter are enforced independently; the global cap prevents resource exhaustion, the per-domain limiter prevents hammering individual targets.
- Depth and page limits: Set max_depth and max_pages conservatively. Even with crawl_stream's bounded memory, an uncapped frontier can grow large on deeply-linked sites.
- Retry budget: max_retries is now exact (off-by-one fixed). If your configuration relied on the old count, validate your retry/backoff math before deploying.
- Browser pool pre-warming: Call BrowserPool::warm(n) at startup. Lazy browser spin-up adds significant latency to the first escalated request per domain; pre-warming eliminates that spike.
- Proxy rotation: Implement ProxyProvider for production anti-blocking workloads. The trait is async; you can call an external proxy API per request without blocking the crawl loop.
Observability checklist
- Wire crawlberg_waf_fingerprint_matches_total and crawlberg_escalations_total into your metrics system. These counters are available via the OpenTelemetry integration; enabling OTLP export requires a Cargo feature, so check the crate's Cargo.toml for the exact feature name before adding it to your dependency declaration.
- Set RUST_LOG=crawlberg=info for structured tracing output in production; =debug for request-level detail.
- Treat CrawlError::SsrfPolicyViolation as a security event; log the violating URL and source.
- Track the ratio of HTTP-tier successes to browser-tier escalations per domain. A sustained high escalation ratio for a domain you expect to be cooperative signals a misconfigured fingerprint corpus or an actual WAF rollout.
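The per-domain escalation ratio from the last item can be computed from any event log that records which tier served each request. The sketch below assumes a simple (domain, tier) event shape with tier values "http" and "browser"; that shape and the escalation_ratios name are hypothetical, not a Crawlberg export format.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def escalation_ratios(events: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return the share of requests per domain that were served by the browser tier."""
    totals: Counter = Counter()
    escalated: Counter = Counter()
    for domain, tier in events:
        totals[domain] += 1
        if tier == "browser":
            escalated[domain] += 1
    return {domain: escalated[domain] / totals[domain] for domain in totals}


sample = [
    ("a.example", "http"),
    ("a.example", "browser"),
    ("b.example", "http"),
]
print(escalation_ratios(sample))
```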
Failure modes and mitigations
| Mode | Signal | Mitigation |
|---|---|---|
| Rate-limited by target | HTTP 429, escalation signal indicating rate limiting (check EscalationReason type docs for the exact variant) | Increase rate_limit_delay_ms; reduce concurrency |
| WAF block loop | crawlberg_escalations_total elevated for one domain | Inspect fingerprint corpus; tune EscalationBudget |
| OOM on large crawl | RSS growing unboundedly | Confirm you are using crawl_stream, not crawl |
| SSRF violations in CI | SsrfPolicyViolation errors on test targets | Add CRAWLBERG_ALLOW_PRIVATE_NETWORK=1 to CI environment |
| Browser pool exhausted | Slow escalation, request queue buildup | Increase pool size or reduce browser-tier concurrency |
| MCP HTTP endpoint not responding | No tools returned from /mcp | Verify binary was built with the api and mcp Cargo features |
| Swift package not resolving | SwiftPM branch not found error | Upgrade to 1.0.0; the release/swift/1.0.0 branch is now correctly created |
Security and Responsible Crawling
Crawlberg's default-on SSRF defense closes the most common server-side request forgery vector for multi-tenant deployments, but responsible crawling requires attention beyond internal network safety:
- robots.txt: The engine fetches and respects robots.txt automatically. If a site specifies a Crawl-delay, set rate_limit_delay_ms to at least that value.
- Rate limiting: As a general starting point (not a framework default), 1,000–2,000 ms between requests per domain avoids hammering most public sites. Aggressive crawling harms target infrastructure regardless of technical capability.
- User-Agent: Set a descriptive User-Agent that identifies your crawler and includes a contact URL or email address. This gives site operators a channel to reach you.
- Terms of service: SSRF defense and robots.txt compliance are technical safeguards, not legal authorization. Review the ToS of any site you crawl at scale.
- Data retention: Crawled content may contain personal data. Apply your jurisdiction's retention and deletion requirements.
Contributor and Ecosystem Notes
Crawlberg's 14 language bindings are generated by alef (pinned at 0.26.6 in this release). Contributing to a binding means editing alef templates, not the generated crates directly.
Areas where the community can contribute:
- WAF fingerprint corpus: rules/waf_fingerprints.toml benefits from real-world signal data. PRs adding fingerprints with test cases and site-class annotations are a concrete, low-barrier contribution.
- Per-language ergonomics: The generated APIs are consistent but conservative. Per-language maintainers are welcome to propose binding-layer ergonomic improvements that stay within the generated API contract.
- Benchmarks: Criterion benchmarks for the WAF subsystem ship in the repo. Throughput and latency benchmarks for the HTTP and streaming layers are an open gap.
- MCP integrations: The MCP server opens Crawlberg to any agent framework that speaks Model Context Protocol. Reference integration guides for popular frameworks (LangChain, CrewAI, Pydantic AI) are high-value additions.
- Docs: docs.crawlberg.xberg.io is built from docs/. Use-case guides (data pipelines, AI agent integrations, e-commerce monitoring, academic web archiving) are welcome.
FAQ
Q: Is this a drop-in replacement for kreuzcrawl?
Functionally yes; the API shape is identical. The package names, env vars, C FFI symbols, and Go module path have changed. Follow the upgrade guide; the migration is mechanical.
Q: Does the SSRF defense break internal test environments?
It will if your tests crawl localhost or RFC 1918 addresses. Set CRAWLBERG_ALLOW_PRIVATE_NETWORK=1 in your test environment, or call CrawlConfig::allow_private_networks(true) in test setup code.
Q: When should I use crawl_stream instead of crawl?
Use crawl_stream for any crawl larger than a few hundred pages. In the changelog's measurements it keeps peak working-set at roughly 20 MB on large crawls instead of growing with corpus size. Use the batch crawl() only when you need the full result set in memory at once, which is uncommon in production pipelines.
Q: Is the 1.0.0 API stable across all 14 bindings?
The Rust crate, CLI, and MCP tool definitions are stable. For Elixir specifically: the repository's Quick Start currently shows {:crawlberg, "~> 0.3"} — verify the current published version at hex.pm/packages/crawlberg before pinning.
Q: Does tiered dispatch slow down crawls that don't need browser rendering?
No. Domains that respond normally at the HTTP tier never escalate. EWMA per-domain state promotes well-behaved domains to start at the HTTP tier, so they avoid bypass and browser probes entirely after the initial warm-up within a session.
Learn more
- Release: github.com/xberg-io/crawlberg/releases/tag/v1.0.0
- Repository: github.com/xberg-io/crawlberg
- Documentation: docs.crawlberg.xberg.io
- Getting started: pip install crawlberg · npm install @xberg-io/crawlberg · cargo add crawlberg
- Discord: discord.gg/xt9WY3GnKR