Reliable Web-Connected AI Agents Start at the Fetch Layer

#ai #webscraping #automation #architecture

A web-connected agent usually fails before the model gets a chance to reason. The request hits a 403. The page hydrates after your scraper reads it. A selector returns null because the site shipped an A/B test. Then the agent tries to continue with partial data and you get a confident answer built on garbage.

The fix is not a better prompt. Treat web access as an unreliable infrastructure dependency and put a hard boundary between fetching data and reasoning over it.

Classify failures before retrying

Many agent loops retry blindly. They resend the same request after every failure, which wastes time and makes rate limiting worse.

Classify the failure first:

502, 503, 504, network reset: probably transient, retry with backoff
429: rate limited, respect Retry-After if present
401, 403: credentials, permissions, or bot detection; do not blindly retry
malformed extraction: page shape changed or JavaScript did not finish rendering
repeated timeout: change the fetch configuration (timeout, proxy, fetch mode), not the prompt

Here is a small TypeScript example. It is intentionally boring because the boring version is easier to debug at 3 a.m.

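type FetchMode = "http" | "browser";

type FetchResult = {
  ok: boolean;
  status?: number;
  html?: string;
  error?: string;
  retryAfterMs?: number; // from the Retry-After header, when the server sent one
  mode: FetchMode;
};

// Browser fallback (Playwright, Browserbase, etc.), implemented elsewhere.
declare function runBrowserFetch(url: string): Promise<FetchResult>;

const sleep = (ms: number) => new Promise<void>(r => setTimeout(r, ms));

// Retry-After is either a number of seconds or an HTTP date.
function parseRetryAfter(value: string | null): number | undefined {
  if (!value) return undefined;
  const seconds = Number(value);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
  const at = Date.parse(value);
  return Number.isNaN(at) ? undefined : Math.max(0, at - Date.now());
}

// Transient: gateway errors, or no status at all (network reset, timeout).
function isTransient(result: FetchResult): boolean {
  const { status } = result;
  return status === undefined || status === 502 || status === 503 || status === 504;
}

// Extraction errors are expected from a validation step that sets `error`.
function shouldEscalateToBrowser(result: FetchResult): boolean {
  const error = result.error ?? "";
  return (
    result.status === 403 ||
    error.includes("selector returned null") ||
    error.includes("missing required field")
  );
}

async function fetchHttp(url: string): Promise<FetchResult> {
  try {
    const res = await fetch(url, {
      headers: { "user-agent": "Mozilla/5.0" },
      signal: AbortSignal.timeout(10_000), // fail fast instead of hanging
    });
    const html = await res.text();
    const retryAfterMs = parseRetryAfter(res.headers.get("retry-after"));
    return { ok: res.ok, status: res.status, html, retryAfterMs, mode: "http" };
  } catch (err) {
    return { ok: false, error: String(err), mode: "http" };
  }
}

async function fetchWithPolicy(url: string): Promise<FetchResult> {
  for (let attempt = 0; attempt < 3; attempt++) {
    const result = await fetchHttp(url);

    if (result.ok) return result;

    // Check escalation first so extraction errors are not treated as network errors.
    if (shouldEscalateToBrowser(result)) {
      return runBrowserFetch(url);
    }

    if (result.status === 429) {
      // Respect Retry-After if present, otherwise back off linearly.
      await sleep(result.retryAfterMs ?? 2_000 * (attempt + 1));
      continue;
    }

    if (isTransient(result)) {
      await sleep(1_000 * 2 ** attempt);
      continue;
    }

    return result; // 401, 404, and other failures that a retry will not fix
  }

  return { ok: false, error: "retry budget exhausted", mode: "http" };
}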

The important part is not the exact code. It is the policy: retries are bounded, errors have categories, and browser automation is an escalation path rather than the default.
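To make "errors have categories" concrete, the classification can live in one pure function that the retry loop, the logger, and the dashboards all share. This is a minimal sketch, not a definitive taxonomy: the category names and the `FailureInput` shape are hypothetical, and the error strings are the same illustrative ones used in this article.

```typescript
// Hypothetical category names; the article lists the categories but not these identifiers.
type FailureCategory =
  | "transient"
  | "rate_limited"
  | "blocked_or_unauthorized"
  | "extraction"
  | "unknown";

type FailureInput = {
  status?: number; // HTTP status, absent when the request never completed
  error?: string; // network or extraction error message, if any
};

function classifyFailure({ status, error }: FailureInput): FailureCategory {
  if (status === 429) return "rate_limited";
  if (status === 401 || status === 403) return "blocked_or_unauthorized";
  if (status === 502 || status === 503 || status === 504) return "transient";

  const message = error ?? "";
  if (
    message.includes("selector returned null") ||
    message.includes("missing required field")
  ) {
    return "extraction";
  }

  // An error with no status means the request failed at the network level.
  if (status === undefined && message !== "") return "transient";
  return "unknown";
}
```

Keeping this as a pure function means it can be unit tested against recorded failures without touching the network.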

Wire, Anakin's web data layer for agents, is useful here because it keeps API-style extraction, browser fallback, and structured output outside the model loop.

Use HTTP first, browser automation only when needed

Browser automation solves real problems. It executes JavaScript, preserves cookies, handles client-side routing, and looks more like an actual user session than a raw HTTP client.

It also costs more and runs slower. You pay for browser startup, rendering, hydration waits, proxy routing, and heavier infrastructure. If you run browser-first across a large crawl, the cost difference compounds quickly.

Start with plain HTTP when the page is server-rendered or exposes structured data:

curl -s https://example.com/products/123 \
  -H 'user-agent: Mozilla/5.0' \
  | pup 'h1 text{}'

Move to browser mode when you see symptoms like these:

the HTML contains only <div id="root"></div> and script tags
required data appears in DevTools after hydration, but not in curl
HTTP requests return 403 Forbidden while a browser session succeeds
content depends on cookies, localStorage, or a multi-step login
extraction fails intermittently because modals or region prompts appear
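The first symptom in that list can be checked automatically before paying for a browser session. The sketch below is a rough heuristic, not a reliable detector: the function name `looksLikeEmptyShell` and the 200-character threshold are assumptions, and regex-based tag stripping is only good enough for a quick signal.

```typescript
// Heuristic: does this HTML look like an empty client-rendered shell?
// The default threshold of 200 characters is an assumption to tune per site.
function looksLikeEmptyShell(html: string, minTextLength = 200): boolean {
  // Prefer the <body> contents; fall back to the whole document.
  const bodyMatch = html.match(/<body[^>]*>([\s\S]*)<\/body>/i);
  const body = bodyMatch ? bodyMatch[1] : html;

  // Drop non-visible blocks, then all remaining tags, then collapse whitespace.
  const visibleText = body
    .replace(/<script[\s\S]*?<\/script>/gi, " ")
    .replace(/<style[\s\S]*?<\/style>/gi, " ")
    .replace(/<noscript[\s\S]*?<\/noscript>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();

  return visibleText.length < minTextLength;
}
```

A result of `true` is a hint to try browser mode, not proof that the page needs one. Short server-rendered pages will also trip the threshold.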

A good production pipeline can use both. HTTP handles the cheap path. Browser sessions handle the pages that actually need a browser. Wire routes between these modes per action, so the agent does not need to embed that decision logic itself.

Log data-layer failures separately from model failures

If an agent returns a bad answer, you need to know whether the model reasoned badly or the fetch layer handed it bad input.

Log at least these fields per job:

{
  "job_id": "price-check-8271",
  "url": "https://example.com/item/123",
  "fetch_mode": "http",
  "status": 403,
  "attempts": 2,
  "latency_ms": 4810,
  "extraction_valid": false,
  "error": "missing required field: price"
}

Track success rate, p95 latency, block rate, and cost per completed job. Average latency hides retry cascades. Total spend hides waste. A rising p95 with stable success rate often means your fallback path is working but getting more expensive. A rising block rate means you should inspect fingerprints, request pacing, or proxy routing before touching prompts.
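Those four metrics can be computed directly from the per-job log records. The following is a sketch under stated assumptions: `cost_usd` is a hypothetical field that is not in the log example, "blocked" is taken to mean a final status of 403 or 429, and p95 uses the nearest-rank method.

```typescript
type JobLog = {
  job_id: string;
  fetch_mode: "http" | "browser";
  status?: number;
  attempts: number;
  latency_ms: number;
  extraction_valid: boolean;
  cost_usd: number; // hypothetical field, not part of the log example
};

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function summarize(jobs: JobLog[]) {
  const completed = jobs.filter(j => j.extraction_valid);
  // Assumption: a job counts as blocked when its final status is 403 or 429.
  const blocked = jobs.filter(j => j.status === 403 || j.status === 429);
  const totalCost = jobs.reduce((sum, j) => sum + j.cost_usd, 0);

  return {
    successRate: jobs.length ? completed.length / jobs.length : 0,
    p95LatencyMs: percentile(jobs.map(j => j.latency_ms), 95),
    blockRate: jobs.length ? blocked.length / jobs.length : 0,
    // Failed jobs still cost money, so divide total spend by completed jobs only.
    costPerCompletedJob: completed.length ? totalCost / completed.length : Infinity,
  };
}
```

Dividing total spend by completed jobs rather than by all jobs is what exposes waste: retries and fallbacks on jobs that never succeed push the number up.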

Also store enough response data to replay failures. If a selector starts returning null, you want the actual HTML that caused it. Otherwise you are debugging a memory of a page that no longer exists.

Keep credentials out of the agent’s reach

Authenticated browsing creates a different class of risk. The page can contain untrusted instructions, hidden text, or injected markup. If the agent has direct access to long-lived credentials, prompt injection can become credential theft.

Use scoped credentials and short session lifetimes:

never put API keys in prompts or tool descriptions
keep cookies in an isolated browser context per job or tenant
redact secrets from logs before storing request traces
rotate credentials used by automation separately from human accounts
destroy sessions after sensitive workflows unless reuse is required
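The log-redaction item is easy to get wrong by hand, so it helps to route every stored trace through one function. This is a minimal sketch: the `RequestTrace` shape, the header list, and the query-parameter pattern are all assumptions to adapt to your stack, and it does not inspect request or response bodies.

```typescript
// Header names treated as secret. The list is an assumption; extend it as needed.
const SENSITIVE_HEADERS = new Set([
  "authorization",
  "proxy-authorization",
  "cookie",
  "set-cookie",
  "x-api-key",
]);

// Query parameters whose names contain one of these words get their values masked.
const SENSITIVE_PARAM = /([?&][^=&#]*(?:token|key|secret|password)[^=&#]*=)[^&#]*/gi;

type RequestTrace = {
  url: string;
  headers: Record<string, string>;
};

// Returns a copy that is safe to store; the original trace is not modified.
function redactTrace(trace: RequestTrace): RequestTrace {
  const headers: Record<string, string> = {};
  for (const [name, value] of Object.entries(trace.headers)) {
    headers[name] = SENSITIVE_HEADERS.has(name.toLowerCase()) ? "[REDACTED]" : value;
  }
  const url = trace.url.replace(SENSITIVE_PARAM, "$1[REDACTED]");
  return { url, headers };
}
```

Redacting at write time, rather than when someone reads the logs, means a leaked log store does not also leak credentials.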

For Playwright-style automation, that usually means creating a fresh context and closing it explicitly:

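// Assumes `browser` is an already-launched Playwright Browser instance.
const context = await browser.newContext({ storageState: "auth-state.json" });
try {
  const page = await context.newPage();
  await page.goto("https://example.com/account/orders");
  // extract only the fields the agent needs
} finally {
  await context.close(); // discard cookies and storage even if extraction throws
}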

Persistent sessions are faster, but they expand the blast radius when something goes wrong. Use them where the performance gain is worth the security tradeoff.

A practical next step: pick one existing web-connected workflow, add explicit failure classification and fetch-mode logging, then review the next 100 failures before changing the prompt.
