DEV Community: Kun Shen

How to Build a Crawl Budget That Keeps AI Agents Fast and Predictable

Kun Shen — Sun, 19 Jul 2026 02:52:52 +0000

AI agents often begin with a deceptively simple web-access loop: take a URL, fetch it, extract text, and pass the result to a model. That loop works in a demo. In production, it can become a source of latency spikes, runaway costs, repeated requests, and inconsistent evidence.

A crawl budget is the control system that keeps this work predictable. It is more than a request limit. A useful budget decides which pages deserve attention, how much effort each page may consume, and when the agent should stop.

Start with the job, not the crawler

The correct budget depends on the agent's task. A monitoring agent may revisit a small set of pages on a schedule. A research agent may explore many domains once. A shopping agent may need current prices but can ignore most navigation pages.

Write the task as a small contract before choosing limits:

what evidence must be collected;
how fresh the evidence needs to be;
how many independent sources are required;
the maximum acceptable latency;
the maximum cost per completed task.

This contract prevents the crawler from treating every discovered URL as equally valuable.

Give every request an expected value

A URL should enter the queue with a reason. Useful signals include its relationship to the query, the authority of its host, its distance from a known source, its content type, and the chance that it contains new information.

A simple priority score can combine those signals: priority = (relevance × freshness need × source value) / expected cost.

The formula does not need to be mathematically perfect. Its purpose is to make tradeoffs visible. A product specification linked from a manufacturer's page should normally outrank a tag archive discovered five clicks away.

Expected cost should include more than bandwidth. JavaScript rendering consumes more time and compute than a direct HTML fetch. A screenshot adds storage and downstream vision cost. Retries also consume the budget, even when they produce no content.
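As a minimal sketch of the score described above, the snippet below ranks three candidate URLs. The `Candidate` class, its field names, the 0..1 scales, and the cost units are illustrative assumptions, not part of any particular crawler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical queue entry; all fields are illustrative assumptions."""
    url: str
    relevance: float       # 0..1, how well the page matches the query
    freshness_need: float  # 0..1, how much the task depends on current data
    source_value: float    # 0..1, authority or closeness to a known source
    expected_cost: float   # arbitrary units covering fetch, render, and retries


def priority(c: Candidate) -> float:
    """priority = relevance * freshness need * source value / expected cost."""
    # Guard against a zero cost estimate so the division is always defined.
    return (c.relevance * c.freshness_need * c.source_value) / max(c.expected_cost, 1e-9)


spec = Candidate("https://example.com/product/spec", 0.9, 0.5, 0.9, expected_cost=1.0)
rendered_spec = Candidate("https://example.com/app/spec", 0.9, 0.5, 0.9, expected_cost=4.0)
tag_archive = Candidate("https://example.com/tag/misc", 0.2, 0.5, 0.3, expected_cost=1.0)

# Highest priority first: the cheap specification page outranks the same
# content behind a render, and both outrank the low-value tag archive.
queue = sorted([tag_archive, rendered_spec, spec], key=priority, reverse=True)
```

Dividing by expected cost is what lets a rendered page fall behind an equally relevant page that a plain fetch can read.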

Use a staged access strategy

The cheapest successful method should win. Start with a normal fetch and examine the result. Escalate to rendering only when the response lacks the content that should be present, relies on client-side navigation, or contains an application shell instead of the requested data.
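One way to sketch that escalation rule is shown below. It assumes that "lacks the content that should be present" can be approximated by counting visible text after scripts and tags are removed. The function names, the 200-character threshold, and the injected `fetch` and `render` callables are hypothetical; a real check would probably also look for task-specific markers.

```python
import re
from typing import Callable, Tuple


def visible_text(html: str) -> str:
    """Crude text extraction: drop script/style bodies, then strip tags."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return " ".join(text.split())


def needs_render(html: str, min_chars: int = 200) -> bool:
    """Assumed heuristic: an application shell has almost no visible text."""
    return len(visible_text(html)) < min_chars


def staged_fetch(
    url: str,
    fetch: Callable[[str], str],
    render: Callable[[str], str],
    min_chars: int = 200,
) -> Tuple[str, str]:
    """Try the cheap method first; escalate only when the result looks empty."""
    html = fetch(url)
    if not needs_render(html, min_chars):
        return "fetch", html
    return "render", render(url)
```

Passing `fetch` and `render` in as callables keeps the decision logic testable without network access.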

Search is often a better first step than blind crawling. A targeted search can identify a few relevant pages before the agent spends budget extracting them. For focused discovery, AnyCrawler's search-page endpoint is one example of a workflow that combines result discovery with page-level access.

Screenshots should be deliberate. They are valuable when layout, charts, canvas elements, or visual state are evidence. They should not be the default representation of a text article.

Separate task, host, and page budgets

One global limit is too coarse. Use three layers.

A task budget limits total requests, rendered pages, bytes, elapsed time, and retries for one user goal. A host budget prevents one domain from consuming the entire task. A page budget caps the work spent on a single stubborn URL.

Host-level controls also improve politeness. Limit concurrency per host, respect crawl directives, and add delays when a server returns rate-limit or overload responses. Backoff should consume elapsed-time budget so that the agent cannot wait forever.

Page budgets should define a clear escalation ceiling. For example, allow one fetch, one render attempt if justified, and one retry for a transient failure. Authentication walls, persistent access denials, and repeated empty responses should become explicit outcomes rather than infinite loops.
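The three layers can be sketched as a single counter object that names the layer that blocked a request. The class name, the default limits, and the outcome strings are assumptions chosen for illustration. Only request counts are tracked here; bytes, elapsed time, and render counts would follow the same pattern.

```python
from collections import Counter
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class CrawlBudget:
    """Hypothetical three-layer budget; limits are placeholder values."""
    max_task_requests: int = 30
    max_host_requests: int = 8
    max_page_attempts: int = 3
    task_requests: int = 0
    host_requests: Counter = field(default_factory=Counter)
    page_attempts: Counter = field(default_factory=Counter)

    def try_spend(self, url: str) -> str:
        """Record one request and return "ok", or name the exhausted layer."""
        host = urlsplit(url).hostname or ""
        if self.task_requests >= self.max_task_requests:
            return "task_budget_exhausted"
        if self.host_requests[host] >= self.max_host_requests:
            return "host_budget_exhausted"
        if self.page_attempts[url] >= self.max_page_attempts:
            return "page_budget_exhausted"
        self.task_requests += 1
        self.host_requests[host] += 1
        self.page_attempts[url] += 1
        return "ok"
```

Returning the exhausted layer as a string, rather than raising, turns "budget exhausted" into an explicit outcome that can be logged next to the URL.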

Deduplicate before spending

Agents frequently encounter the same content through tracking parameters, alternate paths, print views, and redirects. Normalize URLs before enqueueing them. Remove known tracking parameters, resolve relative links, and store the final URL after redirects.
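A minimal normalization sketch using Python's standard `urllib.parse` follows. The tracking-parameter list is a small illustrative sample, and sorting query parameters is an assumption that holds for most sites but not for ones where parameter order matters. Storing the final URL after redirects requires an actual request and is not shown.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

# Illustrative sample only; real deployments maintain a longer list.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}


def normalize_url(base: str, link: str) -> str:
    """Resolve a link against its page and drop known tracking parameters."""
    parts = urlsplit(urljoin(base, link))
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_KEYS and not key.startswith(TRACKING_PREFIXES)
    ]
    query.sort()  # assumption: parameter order carries no meaning
    path = parts.path or "/"
    # The fragment is dropped because it never reaches the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), ""))
```

With this sketch, `normalize_url("https://Example.com/docs/", "../pricing?utm_source=x&b=2&a=1#top")` and `normalize_url("https://example.com", "https://example.com/pricing?b=2&a=1&fbclid=zz")` produce the same queue key.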

Content fingerprints catch duplicates that URL rules miss. A lightweight hash of normalized main text can prevent the same syndicated article from being processed repeatedly. Keep the source URLs even when content is duplicated; provenance still matters.
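The idea can be sketched with a SHA-256 hash over whitespace-collapsed, lowercased text. This catches only exact matches after that normalization; near-duplicate detection would need a different technique. The `SeenContent` class and its method names are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(main_text: str) -> str:
    """Hash of the main text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", main_text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SeenContent:
    """Hypothetical registry mapping each fingerprint to every URL that served it."""

    def __init__(self) -> None:
        self.sources = defaultdict(list)

    def register(self, url: str, main_text: str) -> bool:
        """Record the URL for provenance; return True only for new content."""
        key = fingerprint(main_text)
        is_new = key not in self.sources
        self.sources[key].append(url)
        return is_new
```

The URL is appended even when the content is a duplicate, so the provenance of a syndicated article is preserved while the extraction work is done once.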

Caching should reflect freshness requirements. Stable documentation can be reused longer than a live price or breaking-news page. Record the retrieval time and cache policy next to the extracted evidence so the agent can decide whether reuse is acceptable.
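A small sketch of storing the retrieval time and cache policy beside the evidence is shown below. The `CachedEvidence` record and its `max_age` field are assumed names, and the reuse rule (the stricter of the stored policy and the task's requirement) is one reasonable choice rather than the only one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class CachedEvidence:
    """Hypothetical cache entry that carries its own freshness policy."""
    url: str
    text: str
    retrieved_at: datetime
    max_age: timedelta  # policy chosen when the page was stored

    def is_reusable(self, now: datetime, required_freshness: Optional[timedelta] = None) -> bool:
        """Reuse only if the entry satisfies both its policy and the task's need."""
        limit = self.max_age
        if required_freshness is not None:
            limit = min(limit, required_freshness)
        return now - self.retrieved_at <= limit
```

Stable documentation might be stored with a `max_age` of several days, while a live price would get minutes; a task that needs fresher data can still tighten the limit per call.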

Make stopping a first-class decision

A good agent stops because it has enough evidence, not merely because it has run out of pages or budget. Define completion signals such as:

the required facts are supported by two independent sources;
new pages have stopped adding unique claims;
remaining queue items fall below a value threshold;
the time or cost reserve is needed for synthesis and verification.

Reserve part of the total budget for verification. Discovering ten pages is not useful if no capacity remains to check their claims, dates, and canonical sources.
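The completion signals and the verification reserve can be combined into one function that returns a named reason to stop, or `None` to continue. The `RunState` fields, the thresholds, and the order in which the signals are checked are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class RunState:
    """Hypothetical snapshot of a task in progress."""
    sources_per_fact: Dict[str, Set[str]]  # required fact -> independent source domains
    recent_new_claims: List[int]           # unique claims added by each recent page
    best_queued_priority: float            # highest priority still in the queue
    budget_remaining: float
    verification_reserve: float            # budget held back for checking claims


def stop_reason(
    state: RunState,
    min_sources: int = 2,
    stall_window: int = 3,
    min_priority: float = 0.05,
) -> Optional[str]:
    """Return why the agent should stop crawling, or None to keep going."""
    if state.budget_remaining <= state.verification_reserve:
        return "reserve_needed_for_verification"
    facts = state.sources_per_fact
    if facts and all(len(domains) >= min_sources for domains in facts.values()):
        return "evidence_sufficient"
    recent = state.recent_new_claims[-stall_window:]
    if len(recent) == stall_window and sum(recent) == 0:
        return "no_new_claims"
    if state.best_queued_priority < min_priority:
        return "queue_below_value_threshold"
    return None
```

Checking the reserve first means the crawl can never spend the capacity that synthesis and verification depend on.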

Measure outcomes, not just requests

Request counts alone cannot reveal whether a budget works. Track useful pages per task, unique evidence items, duplicate rate, render escalation rate, median and tail latency, bytes transferred, and cost per accepted source.

Also log why pages were skipped or stopped. Reasons such as low relevance, duplicate content, access denied, budget exhausted, and stale cache make later tuning possible.
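A sketch of recording skip reasons alongside a few of the outcome metrics follows. The `SkipReason` values mirror the reasons listed above; the `TaskMetrics` class and the choice to express both rates per fetched page are assumptions.

```python
from collections import Counter
from enum import Enum


class SkipReason(str, Enum):
    LOW_RELEVANCE = "low_relevance"
    DUPLICATE_CONTENT = "duplicate_content"
    ACCESS_DENIED = "access_denied"
    BUDGET_EXHAUSTED = "budget_exhausted"
    STALE_CACHE = "stale_cache"


class TaskMetrics:
    """Hypothetical per-task counters; rates are computed per fetched page."""

    def __init__(self) -> None:
        self.fetched = 0
        self.rendered = 0
        self.accepted = 0
        self.skips: Counter = Counter()

    def record_page(self, rendered: bool = False, accepted: bool = False) -> None:
        self.fetched += 1
        self.rendered += int(rendered)
        self.accepted += int(accepted)

    def record_skip(self, reason: SkipReason) -> None:
        self.skips[reason.value] += 1

    def summary(self) -> dict:
        pages = self.fetched or 1  # avoid dividing by zero on an empty task
        return {
            "useful_pages": self.accepted,
            "render_escalation_rate": self.rendered / pages,
            "duplicate_rate": self.skips[SkipReason.DUPLICATE_CONTENT.value] / pages,
            "skips": dict(self.skips),
        }
```

Using an enum for reasons keeps the log vocabulary closed, which makes later aggregation by task type straightforward.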

Review failures by task type. If research tasks often run out of render budget, the discovery stage may be selecting too many application pages. If monitoring tasks repeatedly fetch unchanged documents, caching or conditional requests need improvement.

A practical default policy

A reasonable starting policy is conservative: search first, fetch selected pages, render only on evidence of client-side content, and capture screenshots only for visual claims. Cap per-host concurrency, normalize and deduplicate URLs, reserve verification capacity, and stop when evidence coverage is sufficient.

The exact numbers will change with the product and workload. The structure should remain stable. A crawl budget turns web access from an open-ended exploration into an accountable resource allocation process. That makes agents faster, cheaper, easier to debug, and more respectful of the sites they depend on.

Model an Evidence Chain, Not a Bag of Citations

Kun Shen — Sat, 18 Jul 2026 04:26:39 +0000

AI research products often display citations, but a row of links at the bottom of an answer does not tell you how the answer was built. A citation can be relevant to the topic without supporting the sentence beside it. Several links can repeat the same underlying source. A model can also drop important uncertainty while turning notes into fluent prose.

The data model has to preserve more than URLs. It should preserve the path from the original question to research queries, retrieved material, evidence grouped by claim, and the final report. We call that path an evidence chain.

This article focuses on the database and pipeline boundaries that make such a chain reconstructable.

The common shortcut: one giant JSON result

The fastest implementation is usually a job table with an input question and one JSON column containing everything else. It works until the product needs to answer operational questions:

Which research query found this source?
Which source supported this claim?
Did two citations come from the same domain?
Was this page reused from an earlier crawl?
Which model stage changed the wording?
Can we regenerate the summary without repeating the crawl?

A single blob makes these questions expensive and fragile. Every query becomes a custom JSON traversal, and relationships that should be enforced by keys exist only by convention.

The better approach is to keep the original run as the parent while giving important artifacts their own records.

1. Start with a search run

A search run represents one research question and its lifecycle. It owns the status, timestamps, final report, content metadata, and publication decision.

The run should not pretend that generation is atomic. A real research workflow may expand keywords, search multiple providers, crawl pages, extract page-level facts, aggregate evidence, synthesize a report, classify the content, and calculate an indexability result. Those stages should all reference the same run ID.

This parent key is what lets an operator reconstruct one execution without correlating timestamps across unrelated log streams.

2. Store research queries in order

Generated research queries deserve their own ordered records. Order matters because it shows the strategy the system attempted, and a stable index lets a retry address the same record instead of creating a duplicate.

A minimal query record needs the run ID, a query index, and the query text. You may also want the prompt version that produced it, the provider used, and whether the query was executed or skipped.
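As a sketch of the run and query records, the example below uses Python's built-in `sqlite3` module with an in-memory database. The table names, column names, and sample queries are assumptions for illustration; the article does not specify a schema or a database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE search_runs (
    id INTEGER PRIMARY KEY,
    question TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'
);
CREATE TABLE research_queries (
    run_id INTEGER NOT NULL REFERENCES search_runs(id),
    query_index INTEGER NOT NULL,
    query_text TEXT NOT NULL,
    prompt_version TEXT,
    provider TEXT,
    executed INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (run_id, query_index)
);
""")


def save_queries(conn: sqlite3.Connection, run_id: int, queries: list) -> None:
    """Insert queries by position; a retry with the same list adds no rows."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO research_queries (run_id, query_index, query_text) "
            "VALUES (?, ?, ?)",
            [(run_id, index, text) for index, text in enumerate(queries)],
        )


def load_queries(conn: sqlite3.Connection, run_id: int) -> list:
    """Return query texts in the order the system generated them."""
    rows = conn.execute(
        "SELECT query_text FROM research_queries WHERE run_id = ? ORDER BY query_index",
        (run_id,),
    )
    return [row[0] for row in rows]


run_id = conn.execute(
    "INSERT INTO search_runs (question) VALUES (?)", ("How do crawl budgets work?",)
).lastrowid
generated = ["crawl budget definition", "crawl budget for AI agents"]
save_queries(conn, run_id, generated)
save_queries(conn, run_id, generated)  # simulated retry
```

The composite primary key on `(run_id, query_index)` is what makes the retry a no-op rather than a source of duplicate rows.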

Keeping queries separate makes it possible to evaluate query quality independently from answer quality. If a report misses an important perspective, you can tell whether the gap originated in query generation or later retrieval.

3. Treat crawled pages as artifacts

Search results and crawled pages are inputs, not yet evidence. A page artifact should record the run, artifact type, query or keyword, URL, retrieval metadata, and the raw or normalized payload needed by later stages.

That distinction prevents a dangerous shortcut: assuming that every retrieved page supports the answer. Most search results are candidates. Some are duplicates, some only mention the topic, and some contradict the emerging conclusion.

Page artifacts also create a clean caching boundary. A system can reuse a recent crawl for the same URL while still running fresh evidence extraction for a new question. Retrieval freshness and claim relevance are different concerns and should not share one cache key.

4. Model evidence around claims

Evidence becomes useful when it is connected to a claim. An evidence record can contain:

the parent run ID
a stable evidence index
the claim it supports or challenges
the research keyword or query
source title, summary, and URL
normalized domain
an explanation of relevance
nested source details when several passages support one item

The claim field is the critical part. It gives reviewers a unit they can inspect. Instead of asking whether a source is “about the topic,” they can ask whether the source supports the specific statement that will appear in the report.

This structure also enables domain-diversity checks and source counts without parsing rendered Markdown. Quality rules should operate on evidence records before the final page exists.
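A minimal sketch of a claim-centred evidence record and a domain-diversity check over it is shown below. The `Evidence` fields follow the list above, but the `stance` field, the function names, and the two-domain threshold are assumptions. Domain normalization here only lowercases the host and strips a leading `www.`; grouping by registrable domain would require a public-suffix list.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence record tied to one claim."""
    run_id: int
    evidence_index: int
    claim: str
    stance: str  # "supports" or "challenges" (assumed vocabulary)
    query: str
    source_title: str
    source_url: str
    relevance: str

    @property
    def domain(self) -> str:
        host = (urlsplit(self.source_url).hostname or "").lower()
        return host[4:] if host.startswith("www.") else host


def supporting_domains(items: Iterable[Evidence]) -> Dict[str, Set[str]]:
    """Map each claim to the distinct domains that support it."""
    result: Dict[str, Set[str]] = {}
    for item in items:
        if item.stance == "supports":
            result.setdefault(item.claim, set()).add(item.domain)
    return result


def under_supported(items: List[Evidence], min_domains: int = 2) -> List[str]:
    """Claims backed by fewer than min_domains distinct supporting domains."""
    supported = supporting_domains(items)
    claims = {item.claim for item in items}
    return sorted(c for c in claims if len(supported.get(c, set())) < min_domains)
```

Because the check reads structured records, it can run before any report text exists, which is the point made in the paragraph above.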

5. Keep model-call provenance beside content provenance

Content provenance explains where facts came from. Model-call provenance explains how the system transformed those facts. Both are required for reproducibility.

For each stage, retain request messages, structured inputs, the raw provider response, parsed output, model and reasoning configuration, token usage, latency, prompt version, and errors. Link every call to the same run.

With that relationship, a reviewer can move in both directions:

from a sentence in the report to its claim and sources
from a suspicious source to the extraction and synthesis calls that consumed it

This is much more actionable than a generic “generated by AI” label.

6. Save the final report, but do not make it the source of truth

The final report is a presentation artifact. It may be Markdown or HTML, and it may include inline links for readers. But source counts, domain counts, confidence, and evidence sufficiency should come from structured data, not from scraping the report after generation.

That separation has a practical benefit: presentation can change without destroying provenance. You can redesign citation components, generate mobile summaries, or expose an API while keeping the same underlying evidence relationships.

7. Make insufficiency a valid outcome

An evidence chain should be able to end without a publishable answer. If retrieval produces too few sources, domains are not independent, or the evidence conflicts, the run can complete with an insufficiency flag and clear reasons.

This is not a pipeline failure. It is a research result. Treating insufficiency as data prevents the system from filling gaps with more confident prose simply to satisfy a success state.
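One way to sketch insufficiency as a normal outcome is to keep the run status and the publication decision in separate fields. The `RunOutcome` class, the reason strings, and the thresholds below are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RunOutcome:
    """Hypothetical result record: completion and publishability are separate."""
    status: str        # "completed" even when evidence is insufficient
    publishable: bool
    insufficiency_reasons: List[str] = field(default_factory=list)


def evaluate_run(
    source_count: int,
    independent_domains: int,
    has_conflict: bool,
    min_sources: int = 3,
    min_domains: int = 2,
) -> RunOutcome:
    """Complete the run either way; record why it cannot be published."""
    reasons: List[str] = []
    if source_count < min_sources:
        reasons.append("too_few_sources")
    if independent_domains < min_domains:
        reasons.append("domains_not_independent")
    if has_conflict:
        reasons.append("conflicting_evidence")
    return RunOutcome("completed", publishable=not reasons, insufficiency_reasons=reasons)
```

A failed status would then be reserved for genuine pipeline errors such as a crashed stage, which keeps error dashboards free of legitimate "not enough evidence" results.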

8. Expose enough of the chain to readers

Internal logs can contain sensitive or operational details, so they should not be dumped into a public page. Readers still benefit from a carefully selected public surface:

visible source links
source titles and domains
claim-oriented evidence groups
correction and attribution channels
methodology and editorial policy pages

The public Omniracle methodology describes how signal discovery, research, evidence, model provenance, and publication gates fit together. The recommended reports show the reader-facing side of that architecture.

A useful mental model

Think of the pipeline as a directed chain:

question → research queries → page artifacts → claim evidence → synthesis calls → final report → publication decision

Each arrow should be represented by a durable relationship, not inferred later from similar text. When a result is challenged, the system can then answer the most important engineering question: not merely “Which links were shown?” but “How did this claim travel from source material into the published answer?”

That is the difference between citation decoration and an auditable research system.

Building Reliable Web Access for AI Agents: Search, Crawl, Markdown, and Screenshots

Kun Shen — Mon, 15 Jun 2026 16:54:41 +0000

AI agents are only as useful as the context they can reach. For many product, research, support, and competitive-intelligence workflows, that context lives on public websites: documentation pages, changelogs, pricing pages, articles, search results, screenshots, and long-tail reference content.

The hard part is not simply "scraping a page." The hard part is giving an agent a repeatable web access layer that can:

search for candidate sources,
fetch static pages cheaply,
render JavaScript-heavy pages when needed,
convert pages into clean markdown,
capture screenshot evidence,
retry safely when upstream sites are slow,
and avoid flooding the model context with irrelevant HTML.

This is where a web scraping API or crawler API becomes more useful than ad hoc browser scripts.

A practical pattern for agent web access

For most AI agent workflows, I like to split web access into four steps.

1. Search first, crawl second

Agents often do better when they first discover likely sources instead of starting with one URL. A search API for AI agents can return public web, news, image, video, or scholar results. The agent can then choose the highest-signal pages to read.

This reduces unnecessary crawling and gives the model a better source set.

2. Use fetch before render

Many pages do not need a headless browser. Documentation, blog posts, landing pages, legal pages, and static HTML often contain the useful content in the initial response.

For those pages, a fetch-based web data extraction API is usually faster, cheaper, and more reliable.

Use browser rendering only when the page depends on client-side JavaScript, hydration, or late network calls.

3. Convert pages to markdown

Raw HTML is noisy. Agents usually need a compact representation:

page title,
main content,
links,
metadata,
selected media,
and readable markdown.

Website-to-markdown conversion is a simple change that often improves answer quality because the model sees content instead of layout scaffolding.

4. Capture screenshots when trust matters

Text extraction is enough for many tasks, but not all of them. When an agent is checking visual layout, pricing evidence, legal copy, product UI, or compliance-sensitive content, a screenshot API gives a durable record of what the page looked like.

Where AnyCrawler fits

I have been testing AnyCrawler as an agent-facing web access layer. It combines public search, page crawling, markdown extraction, browser rendering, and screenshots behind API endpoints that are easier for agents to call than a full browser automation stack.

The useful part is the routing model:

use fetch crawling for static or content-first pages,
use render crawling for JavaScript-heavy pages,
use screenshots when visual evidence matters,
and use search before crawling when the source URL is not already known.

There is also an open skill package for agent runtimes here:

https://github.com/AnyCrawler-com/AnyCrawler-Skill

Design advice

If you are adding web access to an AI agent, avoid making the browser the first tool for every task. A better default is:

Search if the source is unknown.
Fetch the page if content is likely available in HTML.
Render only when fetch is incomplete.
Convert the result to markdown.
Capture screenshots only when the task needs visual proof.

That structure keeps workflows faster, less expensive, and easier to debug.
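The default order above can be sketched as a small routing function. The tool names in the `tools` dictionary (`search`, `fetch`, `is_complete`, `render`, `to_markdown`, `screenshot`) are hypothetical callables supplied by the caller; they are not AnyCrawler's API or any other product's.

```python
from typing import Callable, Dict, Optional


def read_page(
    query: str,
    url: Optional[str],
    tools: Dict[str, Callable],
    needs_visual_proof: bool = False,
) -> dict:
    """Apply the default order: search, fetch, render if needed, markdown, screenshot."""
    trace = []
    if url is None:
        url = tools["search"](query)  # assumed to return the best candidate URL
        trace.append("search")
    html = tools["fetch"](url)
    trace.append("fetch")
    if not tools["is_complete"](html):
        html = tools["render"](url)   # escalate only when fetch is incomplete
        trace.append("render")
    markdown = tools["to_markdown"](html)
    trace.append("to_markdown")
    screenshot = None
    if needs_visual_proof:
        screenshot = tools["screenshot"](url)
        trace.append("screenshot")
    return {"url": url, "markdown": markdown, "screenshot": screenshot, "trace": trace}
```

Returning the `trace` makes each run's path visible, which helps when checking how often tasks escalate to rendering.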