<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Spinov </title>
    <description>The latest articles on DEV Community by Alex Spinov  (@0012303).</description>
    <link>https://dev.to/0012303</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3831260%2F88c2b1ec-9abb-44c0-a6b8-774b9f415fce.PNG</url>
      <title>DEV Community: Alex Spinov </title>
      <link>https://dev.to/0012303</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/0012303"/>
    <language>en</language>
    <item>
      <title>Your AI Agent Re-Reads Every Page It Already Saw. I Measured the 8x Context Tax</title>
      <dc:creator>Alex Spinov </dc:creator>
      <pubDate>Sat, 13 Jun 2026 18:16:08 +0000</pubDate>
      <link>https://dev.to/0012303/your-ai-agent-re-reads-every-page-it-already-saw-i-measured-the-8x-context-tax-38kg</link>
      <guid>https://dev.to/0012303/your-ai-agent-re-reads-every-page-it-already-saw-i-measured-the-8x-context-tax-38kg</guid>
      <description>&lt;p&gt;Turn 1 cost about 300 input tokens. Turn 20 cost 7,000. Same agent, same kind of page, 20 times more expensive for the last step than the first. Nothing was broken. The agent gave the right answer the whole way. It just kept paying for every page it had already read.&lt;/p&gt;

&lt;p&gt;If you run a ReAct loop, a LangChain agent, or your own &lt;code&gt;while&lt;/code&gt; loop that keeps the full transcript in &lt;code&gt;messages&lt;/code&gt;, you are probably paying this. Here is the cumulative billed-input number, measured the same way for both strategies, plus the one counter-argument (prompt caching) that an honest version of this post has to address. There is a 40-line file at the end you can run in five seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the context tax in an agent loop?
&lt;/h2&gt;

&lt;p&gt;A naive agent keeps every fetched page in its message history. On turn k it re-sends pages 1 through k as billed input. Walk 20 different pages and the model gets charged for the first page 20 times. The sum is quadratic. A budget layer keeps a bounded window (the current page plus one short rolling summary), so the same walk stays linear. Each page is fine. The repetition is the tax.&lt;/p&gt;

&lt;p&gt;I want to be careful here, because the last draft of this post was wrong. So let me show the number first, then the catch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers (run it yourself)
&lt;/h2&gt;

&lt;p&gt;This is the real stdout of &lt;code&gt;agent_context_budget.py&lt;/code&gt;. No network, no I/O, deterministic. The token count is a &lt;code&gt;len // 4&lt;/code&gt; proxy, not a tokenizer, and I will defend that choice below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agent context tax: cumulative billed INPUT tokens (proxy, synthetic)
page proxy ~= 322 tokens, modeled on real page sizes (not measured here)

--- one session, 20 different growing pages ---
turn |  naive billed |  budget billed
   1 |           322 |            402
   2 |           966 |            804
   5 |         4,830 |          2,010
  10 |        17,774 |          4,084
  15 |        39,984 |          6,414
  20 |        71,844 |          8,744

N=20: naive=71,844  budget=8,744  -&amp;gt; 8.2x, -88%
with ideal prompt caching: naive=15,400  -&amp;gt; 1.8x (raw was 8.2x)

how the gap grows with N (same meter, both cumulative):
   N |     naive |  budget |  raw x | cached x
  10 |    17,774 |   4,084 |   4.4x |     1.4x
  20 |    71,844 |   8,744 |   8.2x |     1.8x
  30 |   164,514 |  13,404 |  12.3x |     2.2x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The headline: 20 pages, 8.2x more billed input for the naive loop, summed across the session. Drop the same agent down to a bounded window and you spend 88% less.&lt;/p&gt;

&lt;p&gt;Look at turn 1, though. Budget costs 402, naive costs 322. The window layer is &lt;em&gt;more&lt;/em&gt; expensive on a single turn, because the rolling summary is not free. The win only shows up once the transcript starts repeating. If your agent does two fetches and stops, skip all of this. The tax is a distance problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the naive loop is quadratic (this time it is actually modeled)
&lt;/h2&gt;

&lt;p&gt;Turn k re-sends pages 1..k. So billed input per turn grows by about a page each turn (322, 644, 966, and so on). The running total is roughly &lt;code&gt;page * (1 + 2 + ... + N)&lt;/code&gt;, the sum of an arithmetic series. That is O(N squared). The budget loop sends one page plus a fixed summary every turn, so its total is about &lt;code&gt;(page + 80) * N&lt;/code&gt;. Linear. (The pages are not perfectly equal in my fixture, so the printed naive total runs a little above the round-number formula. The shape is what matters, and the printed numbers are what you reproduce.)&lt;/p&gt;

&lt;p&gt;That is the whole shape of it. Roughly the same page size, the same agent, one strategy keeps re-serializing history and one does not. The &lt;code&gt;raw x&lt;/code&gt; column climbs 4.4, 8.2, 12.3 because the gap between a parabola and a line widens with N. It is not a constant multiplier. It is a tax that gets worse the longer your agent works.&lt;/p&gt;

&lt;p&gt;I am stressing this because my first version of this article claimed "quadratic" in prose while the code was actually linear (one page re-sent N times has no curve). A reviewer caught it. This version walks N &lt;em&gt;different&lt;/em&gt; growing pages, so the quadratic is in the code, not in the adjective.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest part: doesn't prompt caching fix this?
&lt;/h2&gt;

&lt;p&gt;Yes, someone is already typing it. "Anthropic and others cache the prompt prefix, so the repeated transcript is billed at roughly a tenth. Your naive loop is not 8x, it's barely worse." Fair. If I hid that, this post would deserve the same fate as the last one.&lt;/p&gt;

&lt;p&gt;So I modeled it. &lt;code&gt;run_naive_cached&lt;/code&gt; bills the already-cached tail at 0.1x (cache read) and writes each new page once at 1.25x (cache write), which is roughly Anthropic's published cache pricing shape. The result is the &lt;code&gt;cached x&lt;/code&gt; column above:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;N pages&lt;/th&gt;
&lt;th&gt;naive raw&lt;/th&gt;
&lt;th&gt;naive cached&lt;/th&gt;
&lt;th&gt;budget&lt;/th&gt;
&lt;th&gt;cached vs budget&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;17,774&lt;/td&gt;
&lt;td&gt;5,554&lt;/td&gt;
&lt;td&gt;4,084&lt;/td&gt;
&lt;td&gt;1.4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;71,844&lt;/td&gt;
&lt;td&gt;15,400&lt;/td&gt;
&lt;td&gt;8,744&lt;/td&gt;
&lt;td&gt;1.8x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;164,514&lt;/td&gt;
&lt;td&gt;29,106&lt;/td&gt;
&lt;td&gt;13,404&lt;/td&gt;
&lt;td&gt;2.2x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Caching cuts the gap hard. It does not close it. Even with a &lt;em&gt;perfect&lt;/em&gt; cache, the naive loop still re-reads the whole growing tail every turn, and that cheap-but-not-free read still adds up faster than a bounded window. At 20 pages it is 1.8x. The snowball survives the best case you can give it.&lt;/p&gt;

&lt;p&gt;And the best case is generous. Anthropic's prompt cache has a 5-minute TTL. An agent loop with slow tool steps (a fetch, a parse, a model call, another fetch) can easily blow past that between turns, and then the tail falls out of cache and you pay closer to the raw 8.2x. So the real-world number for a tool-heavy agent sits &lt;em&gt;between&lt;/em&gt; 1.8x and 8.2x, leaning toward raw the slower your steps are. I will not pretend to know exactly where your loop lands. That depends on your TTL luck and your step latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where our production data actually comes in
&lt;/h2&gt;

&lt;p&gt;I do not have a clean public benchmark of "agent transcript growth across 20 fetches," so I am not going to fake one. What I do have is real page sizes. Across roughly 2,190 production runs on our scrapers, listing and review page bodies after cleaning tend to land in the few-hundred-token range, which is where I set the fixture's page size. The Trustpilot scraper alone has 962 runs, so "a review page is about this big" is something I have actually watched, not guessed.&lt;/p&gt;

&lt;p&gt;That is the only place real data touches this post: the &lt;em&gt;size&lt;/em&gt; of a page, used as a parameter. I did not measure the multiplier 962 times. The multiplier comes from the model. Be suspicious of anyone who blurs those two.&lt;/p&gt;

&lt;h2&gt;
  
  
  About that proxy token count
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;toks(text)&lt;/code&gt; is &lt;code&gt;len(text) // 4&lt;/code&gt;. Not tiktoken. I did this on purpose, and not to be lazy.&lt;/p&gt;

&lt;p&gt;If I pinned a real tokenizer, the exact stdout would shift every time the library version moved, and you could not reproduce my MD5. The number you care about is a &lt;em&gt;ratio&lt;/em&gt; (naive over budget), and a constant proxy cancels out of a ratio. Whether a page is 322 proxy-tokens or 410 real ones, naive-over-budget barely moves. What I lose is the right to say "your invoice will read exactly $X." What I keep is a result you can rerun and get bit-for-bit identical. For a structural argument about O(N squared) versus O(N), that is the correct trade.&lt;/p&gt;

&lt;p&gt;I ran it twice and diffed the output. Same MD5 both times. If you run the file at the bottom, you should get the same stdout I pasted above.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix is smaller than the problem
&lt;/h2&gt;

&lt;p&gt;The budget layer is three ideas, none clever:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Send the current page, not the whole transcript.&lt;/li&gt;
&lt;li&gt;Keep one short rolling summary of what came before, capped at a fixed size.&lt;/li&gt;
&lt;li&gt;Never let the window grow with the turn count.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is it. The &lt;code&gt;run_budget&lt;/code&gt; function is four lines. The hard part is not the code, it is noticing the bill in the first place, because every quality metric stays green while the cost curve bends upward in the background.&lt;/p&gt;

&lt;p&gt;This is the failure class I keep running into with agents: the thing works. Every page is a clean 200. Every answer is correct. The eval dashboard is green. And the only signal that something is wrong is a token bill that grows faster than the task list. It is not a logic bug you can catch in a test. It is a quiet tax on the loop, and quiet taxes are the ones you pay longest, because nothing ever pages you about them.&lt;/p&gt;

&lt;p&gt;If you want to see it in your own agent, open the token log and plot billed input per turn. A flat-ish line means you have a window. A line that climbs every turn means you are paying the tax, and the slope is your N.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this is wrong
&lt;/h2&gt;

&lt;p&gt;I would rather you trust the limits than the headline.&lt;/p&gt;

&lt;p&gt;The rolling summary loses detail. An 80-token summary of fifteen pages is lossy, and if the agent needs an exact fact from page 3 on turn 18, a naive loop still has it verbatim and the budget loop might not. That is a real correctness cost, not a free lunch. The honest framing is a trade: you spend less, you remember less precisely.&lt;/p&gt;

&lt;p&gt;The budget can drop a page the agent later wants. My window assumes the current sub-task only needs the current page plus a gist. Some tasks genuinely need to cross-reference page 2 and page 19. For those, a fixed window is the wrong tool and you want retrieval, not truncation.&lt;/p&gt;

&lt;p&gt;And the numbers are mine. N=20, that page size, that summary size, ideal-or-zero caching. Your agent, your pages, your TTL will give a different multiplier. The shape (quadratic versus linear) holds. The exact 8.2x does not transfer. Treat it as the direction, not the destination.&lt;/p&gt;

&lt;p&gt;This is the third in a small line of notes about giving an agent a web tool that does not quietly hurt you: one gave the agent a fetch tool, one taught it not to trust a 200 OK that was garbage, and this one is about not drowning the context in pages it already read. Each one is a &lt;code&gt;curl&lt;/code&gt; that returns 200 and a cost you only see later.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by Aleksej Spinov. Numbers in this post are the output of a deterministic synthetic model (&lt;code&gt;agent_context_budget.py&lt;/code&gt;, included in full below); the only real-world input is page size, modeled on roughly 2,190 production scraper runs. AI-assisted drafting, human-reviewed and run before publishing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for the next set of numbers from production runs, and tell me in the comments: what does billed-input-per-turn look like in your own agent log? I read every reply. 👇&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The full file
&lt;/h2&gt;

&lt;p&gt;This is &lt;code&gt;agent_context_budget.py&lt;/code&gt; exactly as I ran it. Pure stdlib, no dependencies, no network. Save it, run &lt;code&gt;python3 agent_context_budget.py&lt;/code&gt;, and you should get the stdout pasted near the top of this post.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;agent_context_budget.py - what a naive agent loop pays vs a windowed one.

The fetch already works (#18) and each page is real content (#19). The problem
is the LOOP. A naive ReAct/while agent keeps the FULL transcript in `messages`,
so on turn k it re-sends pages 1..k as billed input. Walk N DIFFERENT pages and
that sum is O(N^2). A budget layer keeps a bounded window - the current page
plus one rolling summary of the past - and the same walk costs O(N).

SAME METER FOR BOTH SIDES. Every function below returns the cumulative billed
INPUT tokens summed over the whole session. Not marginal-on-one-side,
cumulative-on-the-other (that mistake is exactly what made an earlier draft of
this post wrong). Both strategies are charged for everything they actually send
to the model on every turn.

Three strategies over ONE agent log of N different, growing pages:
  run_naive(pages)        - full transcript re-sent every turn          -&amp;gt; O(N^2)
  run_budget(pages)       - current page + one rolling summary          -&amp;gt; O(N)
  run_naive_cached(pages) - naive WITH an ideal prompt cache on the tail

Pure functions. No network, no I/O, deterministic: same fixtures in, same stdout
out (stable MD5). Token count is a deterministic len(text)//4 PROXY, not a real
tokenizer - on purpose, so anyone re-running gets the identical MD5 without
pinning a tiktoken version. The reported numbers are RATIOS, so the proxy
constant cancels out. It is a model, not your invoice.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;SUMMARY_TOKENS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;          &lt;span class="c1"&gt;# one rolling summary of everything seen so far
&lt;/span&gt;&lt;span class="n"&gt;CACHE_READ&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;             &lt;span class="c1"&gt;# ideal prompt cache: cached tail billed at ~0.1x
&lt;/span&gt;&lt;span class="n"&gt;CACHE_WRITE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.25&lt;/span&gt;           &lt;span class="c1"&gt;# writing the new tail to cache costs ~1.25x once
&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;toks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deterministic token PROXY: ~4 chars = 1 token. No tokenizer dependency.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_naive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Full transcript re-sent every turn. billed(k) = sum(tokens 1..k). O(N^2).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;toks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# this page joins the transcript...
&lt;/span&gt;        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt;              &lt;span class="c1"&gt;# ...and the WHOLE transcript is re-sent
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_budget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bounded window: current page + one rolling summary. billed is O(N).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;toks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;SUMMARY_TOKENS&lt;/span&gt;   &lt;span class="c1"&gt;# current page + fixed summary
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_naive_cached&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Naive with an IDEAL prompt cache: old tail billed at CACHE_READ, new at write.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cached_tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;toks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# re-read the whole already-cached tail cheap, then write this page once
&lt;/span&gt;        &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;cached_tail&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;CACHE_READ&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;CACHE_WRITE&lt;/span&gt;
        &lt;span class="n"&gt;cached_tail&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# ---- fixture (SYNTHETIC, labelled) ------------------------------------------
# N different pages an agent fetches across one task. Page size (~320 token
# proxy) is modeled on real listing/review page bodies from our production logs
# (the Trustpilot scraper alone has 962 runs), NOT measured per page here. A
# heavier page only widens the gap. Pages differ so the transcript keeps growing.
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_pages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PAGE &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;row&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent context tax: cumulative billed INPUT tokens (proxy, synthetic)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page proxy ~= %d tokens, modeled on real page sizes (not measured here)&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nf"&gt;toks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;make_pages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

    &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
    &lt;span class="n"&gt;pages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_pages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--- one session, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; different growing pages ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;turn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;naive billed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;budget billed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;nb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;nb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_naive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;tb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_budget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tb&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;naive&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_naive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_budget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_naive_cached&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;N=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: naive=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;naive&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  budget=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  -&amp;gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;naive&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;x, -&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;naive&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;with ideal prompt caching: naive=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  -&amp;gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;x (raw was &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;naive&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;x)&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;how the gap grows with N (same meter, both cumulative):&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;N&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;naive&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;budget&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw x&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cached x&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;ps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_pages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;nv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_naive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ps&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;run_budget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ps&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;run_naive_cached&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;nv&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;nv&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mf"&gt;5.1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;x | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cd&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;bd&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mf"&gt;7.1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>python</category>
    </item>
    <item>
      <title>Your AI Agent Trusts a 200 OK. I Logged How Often the Page Was Garbage</title>
      <dc:creator>Alex Spinov </dc:creator>
      <pubDate>Fri, 12 Jun 2026 18:09:10 +0000</pubDate>
      <link>https://dev.to/0012303/your-ai-agent-trusts-a-200-ok-i-logged-how-often-the-page-was-garbage-cfi</link>
      <guid>https://dev.to/0012303/your-ai-agent-trusts-a-200-ok-i-logged-how-often-the-page-was-garbage-cfi</guid>
      <description>&lt;p&gt;Yesterday I handed an agent a &lt;code&gt;web_fetch&lt;/code&gt; tool. It fetched a page, got back a 200 and a screenful of text, and confidently built a plan on it. The text was a Cloudflare "Just a moment..." screen. The agent never noticed.&lt;/p&gt;

&lt;p&gt;That's the failure I want to fix today. Not the fetch. The &lt;em&gt;trust&lt;/em&gt;. Your tool returns &lt;code&gt;status=200&lt;/code&gt; and a non-empty string, and your agent treats that string as "what the page said." Most of the time it is. Sometimes it's a challenge wall, an empty shell, or a body cut off mid-stream, and the agent reasons on garbage with full confidence and zero error.&lt;/p&gt;

&lt;p&gt;So here's a 40-line gate that sits between the fetch tool and the agent and answers one question: &lt;em&gt;is this blob usable as content at all?&lt;/em&gt; It tags every fetch &lt;code&gt;OK / BLOCKED / EMPTY_SHELL / TRUNCATED&lt;/code&gt; before your model ever reads it. Real, deterministic output below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick answer:&lt;/strong&gt; A web-fetch tool's &lt;code&gt;200 OK + non-empty body&lt;/code&gt; does not mean "usable content." That body can be an anti-bot challenge, an empty JS shell, an "access denied" notice, or a truncated stream, all served as 200. &lt;code&gt;sanity_check(text, url, status)&lt;/code&gt; runs zero network calls and tags the blob &lt;code&gt;OK / BLOCKED / EMPTY_SHELL / TRUNCATED&lt;/code&gt; before your model reads it. Garbage becomes an explicit signal, not a silent input to reasoning.&lt;/p&gt;

&lt;p&gt;This is for anyone building agents with web access (LangChain, a ReAct loop, an MCP tool, your own &lt;code&gt;while&lt;/code&gt; loop) who has watched the model be &lt;em&gt;confidently wrong&lt;/em&gt; and couldn't tell why. The why is often this: the page lied with a 200, and nothing checked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The artifact first: four verdicts on six blobs
&lt;/h2&gt;

&lt;p&gt;Here's the real output, copy-pasted, not cleaned up. The gate ran against six fixtures: one real captured page (&lt;code&gt;example.com&lt;/code&gt;) and five synthetic bodies I hand-wrote to reproduce failure classes I've hit in production. The synthetic ones are labeled &lt;code&gt;synthetic&lt;/code&gt; so you don't mistake them for live pulls.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VERDICT      KIND       URL
----------------------------------------------------------------
OK           real       https://example.com
                        → len=353B ratio=0.57
BLOCKED      synthetic  https://shop.example/product/991
                        → soft-block marker 'just a moment...' (status=200)
BLOCKED      synthetic  https://api.example/v2/orders
                        → soft-block marker 'access denied' (status=200)
EMPTY_SHELL  synthetic  https://app.example/dashboard
                        → visible≈0B ratio=0.00 (markup, no content)
TRUNCATED    synthetic  https://blog.example/crawl-bill
                        → no &amp;lt;/html&amp;gt; / mid-tag end …'were blunt: most of the spend was on &amp;lt;sp'
EMPTY_SHELL  synthetic  https://example.com/empty
                        → empty body (status=200)
----------------------------------------------------------------
1 of 6 blobs were usable content  ::  {'OK': 1, 'BLOCKED': 2, 'EMPTY_SHELL': 2, 'TRUNCATED': 1}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every one of those six rows was a &lt;code&gt;200&lt;/code&gt;. One was content. The other five were the kinds of &lt;code&gt;200&lt;/code&gt; that look fine to a fetch tool and ruin a downstream plan. The gate gives each a name your agent can branch on instead of a string it has to believe.&lt;/p&gt;

&lt;p&gt;I ran this on Python 3.13.5. No third-party imports, no network. Run &lt;code&gt;python3 fetch_sanity.py&lt;/code&gt; and you get the same bytes. I checked: two consecutive runs hash to the same &lt;code&gt;md5&lt;/code&gt;. That matters, and I'll come back to why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "200 and not empty" is the wrong success check
&lt;/h2&gt;

&lt;p&gt;Tool contracts lie by omission. A fetch tool's idea of success is usually two things: the HTTP status was 2xx, and the body wasn't empty. Both can be true while the body is useless.&lt;/p&gt;

&lt;p&gt;I run scrapers in production: roughly &lt;strong&gt;2,190 runs&lt;/strong&gt; across 32 published actors, the Trustpilot one alone at &lt;strong&gt;962 runs&lt;/strong&gt;. The failure that cost me the most debugging time wasn't a 500 or a timeout. Those are loud; you catch them. It was the &lt;code&gt;200&lt;/code&gt; that came back with a body that wasn't the page. Four shapes show up again and again:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The challenge wall.&lt;/strong&gt; Cloudflare's "Just a moment...", Akamai's "Access Denied", a generic "verify you are human", served with status 200, not 403. The bytes are real HTML. They're just not &lt;em&gt;your&lt;/em&gt; page.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The empty shell.&lt;/strong&gt; A single-page app ships &lt;code&gt;&amp;lt;div id="root"&amp;gt;&amp;lt;/div&amp;gt;&lt;/code&gt; and three script tags. The data renders in a browser. Your raw fetch got the skeleton. (Predicting &lt;em&gt;that&lt;/em&gt; before you fetch is its own post; see below.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The truncated body.&lt;/strong&gt; A size cap, a dropped connection, or a slow stream cut off, and you got the first 8 KB of a 40 KB page. Looks like a page. Ends mid-sentence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The literal empty 200.&lt;/strong&gt; Some servers answer a blocked or rate-limited request with &lt;code&gt;200&lt;/code&gt; and a body of &lt;code&gt;""&lt;/code&gt;. I once watched a scraper return empty arrays for days because nobody raised on a soft-blocked request that came back 200-and-nothing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the agentic twist that makes this worse than in a plain scraper. A scraper that gets garbage produces a bad row, and a human eventually eyeballs the table. An &lt;em&gt;agent&lt;/em&gt; that gets garbage feeds it straight into its own next decision (calls another tool, writes a summary, answers the user) with no human in the loop. The silent failure compounds. There's no exception to catch and no retry to trigger, because as far as every layer can tell, the fetch &lt;em&gt;succeeded&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The fix is not smarter prompting. It's a checkpoint that turns a silent 200 into an explicit verdict before the model reasons.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the gate actually checks
&lt;/h2&gt;

&lt;p&gt;The gate is a heuristic, and I'd rather hand you a blunt one I trust than a clever one I can't reason about. It checks three things, in order, and stops at the first hit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Soft-block markers.&lt;/strong&gt; A short list of strings that mean "this is a wall, not a page": &lt;code&gt;just a moment...&lt;/code&gt;, &lt;code&gt;enable javascript and cookies to continue&lt;/code&gt;, &lt;code&gt;attention required&lt;/code&gt;, &lt;code&gt;access denied&lt;/code&gt;, &lt;code&gt;verify you are human&lt;/code&gt;, &lt;code&gt;cf-ray&lt;/code&gt;, the captcha vendors. Match any (case-insensitive) and the verdict is &lt;code&gt;BLOCKED&lt;/code&gt;, even at status 200, &lt;em&gt;especially&lt;/em&gt; at status 200. These are strings I've watched come back from real targets dressed as success.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_BLOCK_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;low&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCKED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;soft-block marker &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt; (status=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Visible-text-to-markup ratio.&lt;/strong&gt; Strip &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;style&amp;gt;&lt;/code&gt;, strip the remaining tags, measure what readable text is left versus the size of the whole blob. A real article is mostly words. An empty shell is mostly markup, a ratio near zero. The verdict is &lt;code&gt;EMPTY_SHELL&lt;/code&gt; in two cases: the body is literally empty, or it's markup with under ~200 bytes of visible text &lt;em&gt;and&lt;/em&gt; a ratio under &lt;code&gt;0.10&lt;/code&gt;. Note the second clause: a short blob that's almost all readable text (a terse JSON reply, a one-line message) has a high ratio, so it stays &lt;code&gt;OK&lt;/code&gt;, not &lt;code&gt;EMPTY_SHELL&lt;/code&gt;. The shell verdict is specifically for "lots of markup, almost no words."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Truncation.&lt;/strong&gt; If the body opened an &lt;code&gt;&amp;lt;html&amp;gt;&lt;/code&gt; tree and never closed it, or ends mid-tag, it got cut off. Verdict &lt;code&gt;TRUNCATED&lt;/code&gt;. The reason string even echoes the last 40 characters so you can see &lt;em&gt;where&lt;/em&gt; it stopped. &lt;code&gt;…'most of the spend was on &amp;lt;sp'&lt;/code&gt; is a body that died inside a &lt;code&gt;&amp;lt;span&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Nothing tripped? &lt;code&gt;OK&lt;/code&gt;. The agent may proceed.&lt;/p&gt;

&lt;p&gt;That's the whole decision surface. Notice what it is &lt;em&gt;not&lt;/em&gt; doing. It does not validate any field's value: no checksums, no ranges, no cross-field logic. It does not compare this fetch to a previous one or track a schema over time. It does not decide whether you needed a browser. It answers exactly one question, &lt;em&gt;is this a page at all?&lt;/em&gt;, and gets out of the way. The narrowness is the feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thresholds are deliberately blunt
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;200&lt;/code&gt; bytes of visible text. A ratio under &lt;code&gt;0.10&lt;/code&gt;. These aren't fitted to anything. They're "is there clearly almost no content here," with everything above left as &lt;code&gt;OK&lt;/code&gt;. Tune them to your traffic: a site that ships a thin-but-real &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt; and a 150-byte summary will trip &lt;code&gt;EMPTY_SHELL&lt;/code&gt; at these numbers, which might be a false alarm for you. Raise the floor. The point isn't my constants. It's that "is this usable content" is answerable from bytes you already have, before the model spends a token on them.&lt;/p&gt;

&lt;p&gt;And the soft-block list is a &lt;em&gt;denylist&lt;/em&gt;, so it's never complete. A challenge page with wording I haven't seen sails through as &lt;code&gt;OK&lt;/code&gt;. The list catches the vendors I've met across the fleet; it won't catch a custom wall some site rolls tomorrow. More on that in the failure modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wiring it into an agent loop
&lt;/h2&gt;

&lt;p&gt;The gate is a pure function, so it drops in wherever your tool returns. The pattern: fetch, gate, branch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_web_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# your existing tool
&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sanity_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;observation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;                    &lt;span class="c1"&gt;# let the agent reason on it
&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;observation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[fetch unusable: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="c1"&gt;# tell the agent the truth
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second branch is the whole point. Instead of handing the model a challenge page and hoping it notices, you hand it &lt;code&gt;[fetch unusable: BLOCKED] soft-block marker 'just a moment...'&lt;/code&gt;. Now the model knows the observation failed and can do something sane: try a different source, escalate to a browser-based fetch, or tell the user it couldn't read the page, instead of confidently summarizing a captcha screen.&lt;/p&gt;

&lt;p&gt;That's the move: convert a silent success into a spoken failure. Models are good at reacting to an error they can see. They're terrible at noticing one nobody told them about.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this lives in the chain (and the two siblings)
&lt;/h2&gt;

&lt;p&gt;This gate is one checkpoint in a longer pipeline, and it's easy to confuse with neighbors, so here are the seams.&lt;/p&gt;

&lt;p&gt;If you want to predict whether a page will even come back as a shell &lt;em&gt;before&lt;/em&gt; you fetch it, that's a different tool that reads the raw response shape. I wrote &lt;a href="https://blog.spinov.online/blog/does-this-page-need-a-browser/" rel="noopener noreferrer"&gt;a 30-line probe that tells you if a page needs a browser&lt;/a&gt;. That one runs &lt;em&gt;before&lt;/em&gt; the fetch. This one runs &lt;em&gt;after&lt;/em&gt;: it doesn't switch renderers or decide how to fetch, it just flags the agent that the blob it got back is a skeleton.&lt;/p&gt;

&lt;p&gt;And the fetch tool itself, the thing that produced the 200, is the &lt;a href="https://blog.spinov.online/blog/mcp-server-web-fetch-tool-for-ai-agents/" rel="noopener noreferrer"&gt;60-line MCP &lt;code&gt;web_fetch&lt;/code&gt; server&lt;/a&gt; I built yesterday. That post ends with an honest warning: it "does not beat anti-bot systems" and returns "nothing useful" on a challenged page. This gate is the answer to &lt;em&gt;what do you do with that nothing.&lt;/em&gt; You gave the agent eyes; now you teach it not to trust a forgery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this gate is wrong
&lt;/h2&gt;

&lt;p&gt;It's a heuristic. I'd rather tell you the misses than let you find them in an agent that's already in front of a user.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A challenge page in wording I haven't met.&lt;/strong&gt; The denylist catches Cloudflare, Akamai, the common captchas. A site that rolls its own "please hold" page with novel text will pass as &lt;code&gt;OK&lt;/code&gt;. There's no clean way around this short of a model-based classifier, which costs tokens on every fetch: the opposite of the point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A short page that is genuinely the content.&lt;/strong&gt; A 120-byte API error rendered as JSON, a terse status page, a stub doc: these can trip &lt;code&gt;EMPTY_SHELL&lt;/code&gt; when they're exactly what you asked for. The ratio test can't tell "empty shell" from "small real page." Tune the floor, or skip the ratio check for endpoints you know return short bodies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A truncated body that happens to close its tags.&lt;/strong&gt; If the cut-off landed right after a &lt;code&gt;&amp;lt;/html&amp;gt;&lt;/code&gt; (rare, but possible with a buffered proxy) the truncation check misses it. Length-versus-&lt;code&gt;Content-Length&lt;/code&gt; would catch that, but my fetch tool doesn't always have a reliable &lt;code&gt;Content-Length&lt;/code&gt;, so I left it out rather than half-implement it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A truncated body with no markup at all.&lt;/strong&gt; The truncation check only fires on HTML (it looks for an unclosed &lt;code&gt;&amp;lt;html&amp;gt;&lt;/code&gt; tree or a mid-tag cut). A JSON or plain-text response that got chopped mid-array has no tags to read, so it sails through as &lt;code&gt;OK&lt;/code&gt;. For JSON endpoints, pair this with a &lt;code&gt;json.loads&lt;/code&gt; in a &lt;code&gt;try&lt;/code&gt; and treat a parse failure as &lt;code&gt;TRUNCATED&lt;/code&gt; yourself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A blocked page with a 403.&lt;/strong&gt; This gate is for the &lt;em&gt;200-shaped&lt;/em&gt; lie. If your fetch tool already raises on 4xx/5xx (mine does), those never reach here, which is correct. The gate exists for the failures your status check waves through.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point is the design line. The gate doesn't replace your status handling. It catches the class your status handling is structurally blind to: success codes carrying non-content.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "no network" isn't a footnote
&lt;/h2&gt;

&lt;p&gt;The gate makes zero network calls on purpose. &lt;code&gt;sanity_check(text, url, status)&lt;/code&gt; is a pure function of its inputs: same blob in, same verdict out. That buys three things. Tests pin a fixture to a verdict and never flake. The output above is reproducible, byte for byte (I checked the &lt;code&gt;md5&lt;/code&gt;). And a gate that called out to a live anti-bot site to "confirm" a block would add latency, egress, and a second thing that can fail. The blob already arrived. Everything we need to judge it is in the bytes.&lt;/p&gt;

&lt;p&gt;Same discipline as every checkpoint I ship: the browser probe, the schema canary, the field sanity checks are all pure functions over data you already have. It's what lets the next person re-run them and get my exact result instead of taking my word.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do on Monday
&lt;/h2&gt;

&lt;p&gt;Put the gate right after your fetch tool returns, before the result becomes an observation.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Gate every fetch, not just the ones that look wrong.&lt;/strong&gt; The whole problem is that the bad ones look fine. A 200 with a screen of HTML is exactly what a challenge wall looks like.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feed the verdict to the agent, not just your logs.&lt;/strong&gt; &lt;code&gt;[fetch unusable: BLOCKED]&lt;/code&gt; in the observation lets the model route around the failure. A line in your logs that the model never sees does not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tune the soft-block list to your own traffic.&lt;/strong&gt; Watch what your targets actually send back as a 200 for a week, and add the strings you see. The list in the file is my fleet's, not yours.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I'll be straight about the limit: this catches the &lt;em&gt;common&lt;/em&gt; shapes of a 200 that isn't content. It will not catch a clever, novel wall, and it can't tell a tiny real page from an empty one without your help on the threshold. It's a floor, not a fortress. But most agent loops I've seen don't even have the floor. They hand the model the string and hope.&lt;/p&gt;

&lt;p&gt;Here's the open question I haven't solved cleanly. The &lt;code&gt;EMPTY_SHELL&lt;/code&gt; check and the soft-block list are both &lt;em&gt;content-shape&lt;/em&gt; signals: they look at the blob in isolation. But the strongest signal that a fetch failed is often &lt;em&gt;relative&lt;/em&gt;. This page is 200 bytes when the same URL gave 40 KB yesterday, or every URL on this host suddenly returns the same challenge string. That's drift across fetches, and a pure per-blob function can't see it without state. If you've found a cheap way to fold "this looks wrong &lt;em&gt;compared to last time&lt;/em&gt;" into a per-call gate without dragging a database into your agent loop, I genuinely want to see it.&lt;/p&gt;

&lt;p&gt;What's the worst &lt;code&gt;200 OK&lt;/code&gt; your agent ever believed, and what tipped you off that the page was garbage? 👇&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Follow for the next checkpoint from our production runs. I read every comment.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Full script (&lt;code&gt;fetch_sanity.py&lt;/code&gt;, stdlib only, no network, the exact file I ran):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;fetch_sanity.py — one gate between a web-fetch tool and your agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s reasoning.

A fetch tool can return HTTP 200 and a non-empty body that is still NOT content:
an anti-bot challenge page, an empty JS shell, an &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access denied&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; notice, or a
body that got cut off mid-stream. The agent treats all of it as &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the page said
this&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; and plans on garbage. This function answers ONE question before reasoning:
is the returned blob usable as content at all?

    sanity_check(text, url, status) -&amp;gt; (verdict, reason)
    verdict in {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCKED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMPTY_SHELL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TRUNCATED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}

Pure function. No network, no I/O, deterministic: same (text, url, status) in,
same verdict out. Run it, diff it, trust it. It is a heuristic, not an oracle —
the &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Where it&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s wrong&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; section of the post is honest about the misses.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="c1"&gt;# Soft-block markers: a 200 body that is really a challenge / denial wall.
# Lowercased substring/regex match against the body. Each one is a real string
# I have seen come back with status 200 instead of 403.
&lt;/span&gt;&lt;span class="n"&gt;BLOCK_MARKERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;just a moment\.\.\.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# Cloudflare interstitial
&lt;/span&gt;    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enable javascript and cookies to continue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attention required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Cloudflare block page title
&lt;/span&gt;    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access denied&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verify you are (?:a )?human&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;are you a robot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complete the security check&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cf-ray&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                       &lt;span class="c1"&gt;# Cloudflare ray id leaks into the body
&lt;/span&gt;    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;px-captcha|hcaptcha|g-recaptcha|/recaptcha/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request unsuccessful\. incapsula&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;_BLOCK_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BLOCK_MARKERS&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;_TAG_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?is)&amp;lt;(script|style)\b.*?&amp;lt;/\1&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;_ANYTAG_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?s)&amp;lt;[^&amp;gt;]+&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_visible_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fraction of the blob that is visible text after stripping script/style/tags.
    A real article is mostly words; an empty JS shell is mostly markup.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="n"&gt;stripped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_TAG_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;stripped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_ANYTAG_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stripped&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;visible&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\s+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stripped&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;visible&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sanity_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return (verdict, reason) for one fetched blob. No network calls.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="n"&gt;low&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# 1) BLOCKED — a soft-block / challenge / denial wall served as 200.
&lt;/span&gt;    &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_BLOCK_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;low&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCKED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;soft-block marker &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt; (status=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# 2) EMPTY_SHELL — almost nothing to read. Either a literal empty body
&lt;/span&gt;    &lt;span class="c1"&gt;#    (200 + ""), or markup with the content rendered client-side: a raw
&lt;/span&gt;    &lt;span class="c1"&gt;#    fetch handed the agent a skeleton, not a page.
&lt;/span&gt;    &lt;span class="n"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_visible_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;has_markup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;
    &lt;span class="n"&gt;visible_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;visible_len&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMPTY_SHELL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;empty body (status=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;has_markup&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMPTY_SHELL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;visible≈&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;visible_len&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;B ratio=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ratio&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (markup, no content)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# 3) TRUNCATED — body cut off mid-stream: opened an HTML tree but never
&lt;/span&gt;    &lt;span class="c1"&gt;#    closed it, or ends mid-tag / mid-word with no terminal punctuation.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;has_markup&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;opened_html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;low&lt;/span&gt;
        &lt;span class="n"&gt;closed_html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;low&lt;/span&gt;
        &lt;span class="n"&gt;ends_mid_tag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;[a-z][^&amp;gt;]*$&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;opened_html&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;closed_html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;ends_mid_tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;:].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TRUNCATED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no &amp;lt;/html&amp;gt; / mid-tag end …&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# 4) OK — nothing tripped. The agent may reason on this.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;len=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;B ratio=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ratio&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="c1"&gt;# ---------------------------------------------------------------------------
# Fixtures. One REAL captured body (example.com, fetched once and pasted in as a
# byte-for-byte string so this stays offline/deterministic) + five SYNTHETIC
# bodies hand-written to reproduce failure classes I have hit in production.
# Synthetic ones are labeled (synthetic) so nobody mistakes them for a live pull.
# ---------------------------------------------------------------------------
&lt;/span&gt;
&lt;span class="c1"&gt;# Real: the actual body of https://example.com (RFC-style sample page, public,
# unchanging). Captured once, hardcoded so the gate needs no network.
&lt;/span&gt;&lt;span class="n"&gt;EXAMPLE_COM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;!doctype html&amp;gt;&amp;lt;html&amp;gt;&amp;lt;head&amp;gt;&amp;lt;title&amp;gt;Example Domain&amp;lt;/title&amp;gt;&amp;lt;/head&amp;gt;&amp;lt;body&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;div&amp;gt;&amp;lt;h1&amp;gt;Example Domain&amp;lt;/h1&amp;gt;&amp;lt;p&amp;gt;This domain is for use in illustrative &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;examples in documents. You may use this domain in literature without prior &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coordination or asking for permission.&amp;lt;/p&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;p&amp;gt;&amp;lt;a href=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;https://www.iana.org/domains/example&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;More information...&amp;lt;/a&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/p&amp;gt;&amp;lt;/div&amp;gt;&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Synthetic: a Cloudflare "Just a moment..." interstitial served with status 200.
&lt;/span&gt;&lt;span class="n"&gt;CF_CHALLENGE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;!DOCTYPE html&amp;gt;&amp;lt;html lang=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;en-US&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&amp;lt;head&amp;gt;&amp;lt;title&amp;gt;Just a moment...&amp;lt;/title&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;meta http-equiv=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;refresh&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt; content=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;390&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&amp;lt;/head&amp;gt;&amp;lt;body&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;div class=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;main-wrapper&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&amp;lt;h1&amp;gt;example.com&amp;lt;/h1&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;h2&amp;gt;Checking if the site connection is secure&amp;lt;/h2&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;p&amp;gt;example.com needs to review the security of your connection before &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proceeding.&amp;lt;/p&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;!-- cf-ray: 8e2a1f0c9d4e7b21-FRA --&amp;gt;&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Synthetic: an "Access Denied" wall (Akamai-style) returned as 200.
&lt;/span&gt;&lt;span class="n"&gt;ACCESS_DENIED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;html&amp;gt;&amp;lt;head&amp;gt;&amp;lt;title&amp;gt;Access Denied&amp;lt;/title&amp;gt;&amp;lt;/head&amp;gt;&amp;lt;body&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;h1&amp;gt;Access Denied&amp;lt;/h1&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;p&amp;gt;You don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have permission to access this resource.&amp;lt;/p&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;p&amp;gt;Reference #18.abcd1234.1718200000&amp;lt;/p&amp;gt;&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Synthetic: an empty SPA shell — all markup, the content arrives via JS.
&lt;/span&gt;&lt;span class="n"&gt;JS_SHELL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;!doctype html&amp;gt;&amp;lt;html&amp;gt;&amp;lt;head&amp;gt;&amp;lt;meta charset=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;link rel=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;stylesheet&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt; href=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;/static/app.css&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&amp;lt;/head&amp;gt;&amp;lt;body&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;div id=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;script src=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;/static/runtime.js&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;script src=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;/static/vendor.js&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;script src=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;/static/main.js&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&amp;lt;/script&amp;gt;&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Synthetic: a real article body that got cut off mid-stream (size cap / dropped
# connection). Opens &amp;lt;html&amp;gt;, never closes it, ends mid-tag.
&lt;/span&gt;&lt;span class="n"&gt;TRUNCATED_BODY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;!doctype html&amp;gt;&amp;lt;html&amp;gt;&amp;lt;head&amp;gt;&amp;lt;title&amp;gt;How we cut the crawl bill&amp;lt;/title&amp;gt;&amp;lt;/head&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;body&amp;gt;&amp;lt;article&amp;gt;&amp;lt;h1&amp;gt;How we cut the crawl bill 82%&amp;lt;/h1&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;p&amp;gt;We started by measuring the per-run cost of a headless browser across &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;every target in the fleet. The first surprise was how often we paid for &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chrome on pages that answered a plain GET in eighty milliseconds. The second &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;was the cost of the pages that came back empty. We logged each run and &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tallied the verdicts, and the numbers were blunt: most of the spend was on &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;sp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;FIXTURES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 &lt;span class="n"&gt;EXAMPLE_COM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;real&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://shop.example/product/991&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="n"&gt;CF_CHALLENGE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;synthetic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.example/v2/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="n"&gt;ACCESS_DENIED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;synthetic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://app.example/dashboard&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="n"&gt;JS_SHELL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;synthetic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://blog.example/crawl-bill&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="n"&gt;TRUNCATED_BODY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;synthetic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/empty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;synthetic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tally&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;VERDICT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;KIND&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;FIXTURES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sanity_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tally&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tally&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; → &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;usable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tally&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FIXTURES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;usable&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; blobs were usable content  ::  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tally&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Written by Aleksey Spinov. I run scrapers in production (2,190 runs across 32 published actors, the Trustpilot one at 962) and write up the failures the tutorials skip. The gate, the fixtures, and every verdict above were produced and verified by me on Python 3.13.5; the output shown is the real run, not a mock-up.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AI disclosure: drafted with AI assistance. The code was run locally (stdlib only, no third-party deps, no network); the stdout in this post is the actual output, and the synthetic fixtures are labeled as synthetic throughout.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>11 Free AI APIs You Can Use Without Paying OpenAI (2026 Update)</title>
      <dc:creator>Alex Spinov </dc:creator>
      <pubDate>Fri, 12 Jun 2026 06:56:28 +0000</pubDate>
      <link>https://dev.to/0012303/11-free-ai-apis-you-can-use-without-paying-openai-2026-update-422h</link>
      <guid>https://dev.to/0012303/11-free-ai-apis-you-can-use-without-paying-openai-2026-update-422h</guid>
      <description>&lt;p&gt;You don't need an OpenAI bill to build with LLMs in 2026. There are still eleven providers with a genuinely free tier — real models, real endpoints, no credit card on most — and I pulled the &lt;strong&gt;current&lt;/strong&gt; limits, because half the listicles out there are quoting 2024 numbers that have since been cut.&lt;/p&gt;

&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;p&gt;Below are 11 LLM APIs with a free tier that still works in June 2026 — each with what's free, what it's best for, and the catch. &lt;strong&gt;Most&lt;/strong&gt; are OpenAI-compatible, so you swap three things (base URL, key, model) and your existing code runs against any of them. One honest warning up front: free limits are moving fast this year (Google in particular has tightened them), so treat every number here as "check the provider's page before you depend on it."&lt;/p&gt;

&lt;h2&gt;
  
  
  The list at a glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;API&lt;/th&gt;
&lt;th&gt;What's free (June 2026)&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Google Gemini&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Flash, large context, no card — but limits cut in 2026, check AI Studio&lt;/td&gt;
&lt;td&gt;Big context, broad capability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Groq&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Llama 3.3 70B: ~30 RPM, ~1,000 req/day&lt;/td&gt;
&lt;td&gt;Fast short calls (LPU)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cerebras&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~1M tokens/day, 30 RPM, 8K-ctx cap, no card&lt;/td&gt;
&lt;td&gt;Very high throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;OpenRouter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;25+ &lt;code&gt;:free&lt;/code&gt; models, ~20 RPM, ~50/day&lt;/td&gt;
&lt;td&gt;Model variety, one endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GitHub Models&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OpenAI/other models for devs, tier limits&lt;/td&gt;
&lt;td&gt;Devs already on GitHub&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cloudflare Workers AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10,000 neurons/day at the edge&lt;/td&gt;
&lt;td&gt;Edge / serverless apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Mistral&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Experiment tier: all models, ~1B tok/mo, no card&lt;/td&gt;
&lt;td&gt;EU-hosted prototyping&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;SambaNova Cloud&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast Llama inference, no card&lt;/td&gt;
&lt;td&gt;Fast long-context calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Hugging Face&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Serverless Inference, many models&lt;/td&gt;
&lt;td&gt;Open models beyond chat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Cohere&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free trial key (rate-limited)&lt;/td&gt;
&lt;td&gt;RAG: embed + rerank&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;NVIDIA build&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free credits on hosted models&lt;/td&gt;
&lt;td&gt;Trying many models fast&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  One code pattern for most of them
&lt;/h2&gt;

&lt;p&gt;Most of these speak the &lt;strong&gt;OpenAI Chat Completions&lt;/strong&gt; format — Groq, Cerebras, OpenRouter, Mistral, SambaNova, NVIDIA and GitHub Models, and Gemini/Cloudflare/Hugging Face now expose OpenAI-compatible endpoints too. So you don't learn a dozen SDKs — you point the OpenAI client at a different base URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="c1"&gt;# swap these three lines per provider; everything else stays the same
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.groq.com/openai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# provider endpoint
&lt;/span&gt;    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_FREE_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                      &lt;span class="c1"&gt;# from the provider's console
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama-3.3-70b-versatile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;# provider's model id
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Say hi in 5 words.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To move from Groq to Cerebras, change &lt;code&gt;base_url&lt;/code&gt; to &lt;code&gt;https://api.cerebras.ai/v1&lt;/code&gt; and the model id. That's the whole migration — and it's also how you build a fallback chain: when one free tier rate-limits you, route to the next. (Cohere is the main exception — it has its own API for embed/rerank.)&lt;/p&gt;




&lt;h2&gt;
  
  
  The 11, with the real details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Google Gemini (AI Studio)
&lt;/h3&gt;

&lt;p&gt;Still a strong free option — &lt;strong&gt;Gemini 2.5 Flash&lt;/strong&gt;, a large context window, and no credit card. The big 2026 caveat: &lt;strong&gt;Google has tightened the free limits&lt;/strong&gt;, and the real cap is now "whatever AI Studio shows for your project" rather than a fixed public number (reports range widely, and extra keys don't add quota). Key from &lt;code&gt;aistudio.google.com&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Catch:&lt;/strong&gt; free-tier requests may be used to improve Google's models — keep proprietary data off it, and verify your live limit in AI Studio.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Groq
&lt;/h3&gt;

&lt;p&gt;Groq runs models on custom &lt;strong&gt;LPU&lt;/strong&gt; hardware and is one of the fastest free options for short calls. Published free limits for &lt;code&gt;llama-3.3-70b-versatile&lt;/code&gt; are around &lt;strong&gt;30 RPM and 1,000 requests/day&lt;/strong&gt; with a per-minute token cap. OpenAI-compatible. Key from &lt;code&gt;console.groq.com&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Catch:&lt;/strong&gt; the per-minute token cap bites on long prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Cerebras
&lt;/h3&gt;

&lt;p&gt;Cerebras is built for speed and has one of the most generous free volumes: &lt;strong&gt;~1,000,000 tokens/day&lt;/strong&gt;, &lt;strong&gt;30 RPM&lt;/strong&gt;, no card — across models including Llama 3.3 70B, Qwen3, and GPT-OSS 120B. Throughput is very high (multiple thousand tokens/sec on smaller models). OpenAI-compatible at &lt;code&gt;api.cerebras.ai/v1&lt;/code&gt;. Key from &lt;code&gt;cloud.cerebras.ai&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Catch:&lt;/strong&gt; a free-tier &lt;strong&gt;context cap (~8K tokens)&lt;/strong&gt; across models — fine for chat, tight for long documents.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. OpenRouter
&lt;/h3&gt;

&lt;p&gt;One endpoint, &lt;strong&gt;25+ models whose id ends in &lt;code&gt;:free&lt;/code&gt;&lt;/strong&gt; (Llama, DeepSeek, Qwen and more). Free limits are modest — roughly &lt;strong&gt;20 RPM and ~50 requests/day&lt;/strong&gt; on free models; adding ~$10 of credit once raises the free-model daily cap substantially. Endpoint &lt;code&gt;openrouter.ai/api/v1&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Catch:&lt;/strong&gt; free models get added and removed — pin the id and watch the changelog.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. GitHub Models
&lt;/h3&gt;

&lt;p&gt;If you have a GitHub account, you have free access to a rotating catalog of models (OpenAI's GPT family and others) for development. Limits depend on the model tier and your account. Auth with a GitHub token.&lt;br&gt;
&lt;strong&gt;Catch:&lt;/strong&gt; it's meant for dev/prototyping, not production traffic; the catalog and limits change.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Cloudflare Workers AI
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;10,000 neurons/day&lt;/strong&gt; of free inference running &lt;strong&gt;at the edge&lt;/strong&gt; — great when your app already lives on Workers/Pages. Call models like &lt;code&gt;@cf/meta/llama-3.1-8b-instruct&lt;/code&gt;; an OpenAI-compatible endpoint is available too.&lt;br&gt;
&lt;strong&gt;Catch:&lt;/strong&gt; "neurons" is its own unit — a heavy model burns the daily budget faster than a small one.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Mistral (La Plateforme)
&lt;/h3&gt;

&lt;p&gt;Mistral's &lt;strong&gt;Experiment tier&lt;/strong&gt; gives free, rate-limited access to its models (including larger ones and Codestral) for prototyping — no credit card, just a verified phone number, with monthly token quotas that are generous for development. Key from &lt;code&gt;console.mistral.ai&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Catch:&lt;/strong&gt; it's an experimentation tier — production is pay-as-you-go per token.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. SambaNova Cloud
&lt;/h3&gt;

&lt;p&gt;Free, &lt;strong&gt;fast&lt;/strong&gt; Llama inference with no credit card — strong on longer-context calls. OpenAI-compatible. Key from &lt;code&gt;cloud.sambanova.ai&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Catch:&lt;/strong&gt; which models are available shifts; check the catalog.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Hugging Face
&lt;/h3&gt;

&lt;p&gt;The free &lt;strong&gt;serverless Inference&lt;/strong&gt; option lets you call many open models — not just chat, but embeddings, vision, audio, classification. Token from &lt;code&gt;huggingface.co&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Catch:&lt;/strong&gt; cold starts and per-model limits; not built for steady high QPS.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. Cohere
&lt;/h3&gt;

&lt;p&gt;A free, rate-limited trial key for &lt;code&gt;command&lt;/code&gt;, &lt;code&gt;embed&lt;/code&gt;, and &lt;code&gt;rerank&lt;/code&gt;. It's here for one specific reason: &lt;strong&gt;Cohere's embed + rerank make a genuinely good free RAG backbone&lt;/strong&gt;, not just another chat endpoint. Key from &lt;code&gt;dashboard.cohere.com&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Catch:&lt;/strong&gt; trial-tier limits are modest — fine for building, not for serving users.&lt;/p&gt;

&lt;h3&gt;
  
  
  11. NVIDIA (build.nvidia.com)
&lt;/h3&gt;

&lt;p&gt;Free &lt;strong&gt;credits&lt;/strong&gt; to call a large catalog of hosted models through an OpenAI-compatible endpoint (&lt;code&gt;integrate.api.nvidia.com/v1&lt;/code&gt;) — a quick way to try many models without a dozen separate signups.&lt;br&gt;
&lt;strong&gt;Catch:&lt;/strong&gt; it's credits, not a permanent quota — they run out.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to actually use these (without lying to yourself about limits)
&lt;/h2&gt;

&lt;p&gt;The free-API market splits into three buckets, and mixing them up is how people get a nasty surprise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free-quota-style tiers&lt;/strong&gt; — Groq, Cerebras, Cloudflare, OpenRouter &lt;code&gt;:free&lt;/code&gt;. The more durable options — but still verify before you build on them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small monthly credits&lt;/strong&gt; — NVIDIA and similar. Good for trials, not a backend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Signup trials&lt;/strong&gt; — short-lived. Don't architect around them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical move: treat free tiers as &lt;strong&gt;routing lanes, not one backend.&lt;/strong&gt; Send fast short calls to Groq/Cerebras, big-context jobs to Gemini, RAG embeddings to Cohere, and let a fallback chain (that one OpenAI-compatible client above) hop to the next lane when one rate-limits you.&lt;/p&gt;

&lt;p&gt;Two honest caveats for all of them: &lt;strong&gt;some free tiers may use your inputs to improve their models&lt;/strong&gt; — check each provider's policy and keep proprietary data off them; and these numbers move — what's generous today can be cut next quarter (Google already did in 2026), so verify before you commit.&lt;/p&gt;




&lt;p&gt;Which of these are you actually running in production vs just testing? And did I miss a free tier that's been carrying your side projects? Drop it 👇 — I'll re-test and fold it into the next update.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Give Your AI Agent a Web-Fetch Tool: a 60-Line MCP Server (Free, Self-Hosted)</title>
      <dc:creator>Alex Spinov </dc:creator>
      <pubDate>Thu, 11 Jun 2026 18:13:53 +0000</pubDate>
      <link>https://dev.to/0012303/give-your-ai-agent-a-web-fetch-tool-a-60-line-mcp-server-free-self-hosted-23g4</link>
      <guid>https://dev.to/0012303/give-your-ai-agent-a-web-fetch-tool-a-60-line-mcp-server-free-self-hosted-23g4</guid>
      <description>&lt;p&gt;Every MCP web-access tutorial I read this month pointed at a paid API.&lt;/p&gt;

&lt;p&gt;You don't need one. To let an AI agent read a public web page, sixty lines on the official MCP Python SDK give you a self-hosted &lt;code&gt;web_fetch&lt;/code&gt; tool — running on your machine, no key, no per-call bill.&lt;/p&gt;

&lt;p&gt;I built it, ran it, and pasted the real terminal output below. The catch isn't the wiring (that part is easy). It's the four defaults the tutorials leave out — the ones that turn a toy into something you'd actually point an agent at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick answer:&lt;/strong&gt; A Model Context Protocol (MCP) server exposes tools an LLM agent can call. With &lt;code&gt;pip install mcp&lt;/code&gt;, one &lt;code&gt;@mcp.tool()&lt;/code&gt; function, and &lt;code&gt;mcp.run()&lt;/code&gt;, you get a working &lt;code&gt;web_fetch(url) -&amp;gt; clean text&lt;/code&gt; tool over stdio in ~60 lines. Self-hosted, free, and returning text instead of raw HTML. The work is in the guardrails: timeout, size cap, and an SSRF check.&lt;/p&gt;

&lt;p&gt;This is for anyone building agents or RAG who keeps hitting "give your model live web access — here's our API." If your target is docs, articles, RSS, or JSON endpoints that answer a plain GET, you don't have to pay for that.&lt;/p&gt;

&lt;h2&gt;
  
  
  The artifact first: what the agent actually receives
&lt;/h2&gt;

&lt;p&gt;Here's the real round-trip (terminal output, reformatted for readability — the raw &lt;code&gt;print()&lt;/code&gt; repr is denser). I started the server, connected an MCP client over stdio, asked it to list tools, then called &lt;code&gt;web_fetch&lt;/code&gt; on &lt;code&gt;example.com&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== TOOLS THE AGENT SEES ===
- web_fetch: Fetch a public web page and return clean readable text (no raw HTML).

=== call_tool web_fetch('https://example.com') ===
isError: False
Example Domain Example Domain This domain is for use in documentation examples
without needing permission. Avoid use in operations. Learn more
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The agent asked for a URL and got back readable prose — not a wall of &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; soup. No tags, no &lt;code&gt;&amp;lt;style&amp;gt;&lt;/code&gt; block, no nav chrome.&lt;/p&gt;

&lt;p&gt;Stack I ran this on: &lt;code&gt;mcp&lt;/code&gt; &lt;strong&gt;1.27.2&lt;/strong&gt; (&lt;code&gt;pip show mcp&lt;/code&gt;, installed 2026-06-11), &lt;code&gt;httpx&lt;/code&gt; 0.28.1, Python 3.13.5. The MCP SDK API moves between versions, so I'll flag the parts that matter as we go.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why free / self-hosted, and why I'm writing this
&lt;/h2&gt;

&lt;p&gt;Here's the context. Pierluigi Vinciguerra runs The Web Scraping Club — one of the most-read voices in this niche. On 2026-06-07 he published a walkthrough titled, roughly, &lt;em&gt;how to give Claude real-time web access with the Decodo MCP&lt;/em&gt;. Good post. But Decodo is a &lt;strong&gt;paid&lt;/strong&gt; service. The same week, an HN front-pager pitched a "Bot Browser" MCP server that "saves 90% of tokens." The demand is obvious. The default answer everyone reaches for is a vendor.&lt;/p&gt;

&lt;p&gt;For a big chunk of cases, that's overkill.&lt;/p&gt;

&lt;p&gt;"Self-hosted" means the server runs as a local process you started. The traffic goes out from your IP, the rate limits are your own, and no third party logs which URLs your agent reads. For internal docs, public APIs, blog posts, changelogs, RSS — a plain GET is all you need, and a vendor in the middle is a cost and a dependency you didn't have to take on.&lt;/p&gt;

&lt;p&gt;I'll be straight about the boundary, because this is where honesty matters: &lt;strong&gt;this server does not beat anti-bot systems.&lt;/strong&gt; No headless browser, no fingerprint rotation, no JavaScript execution. Hit a Cloudflare-challenged or JS-rendered page and it returns nothing useful. That's a different tool for a different day. What this &lt;em&gt;does&lt;/em&gt; cover is the long, boring, very common tail of sites that just answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The server, line by line
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;pip install mcp httpx&lt;/code&gt; and you're ready. The whole thing is one file.&lt;/p&gt;

&lt;p&gt;The shape: create a &lt;code&gt;FastMCP&lt;/code&gt; instance, decorate a function with &lt;code&gt;@mcp.tool()&lt;/code&gt;, and the SDK turns the function's signature and docstring into a tool schema the agent can discover. Run it over stdio with &lt;code&gt;mcp.run()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# server.py — runnable local:  pip install mcp httpx  →  python server.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ipaddress&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urlparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;

&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web-fetch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;MAX_CHARS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8000&lt;/span&gt;  &lt;span class="c1"&gt;# guardrail: don't blow up the agent's context window
&lt;/span&gt;
&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;web_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch a public web page and return clean readable text (no raw HTML).
    Use this when you need the current contents of a URL.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;_is_public_http_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
               &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent-web-fetch/0.1 (+https://blog.spinov.online; contact you@example.com)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;15.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;follow_redirects&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;              &lt;span class="c1"&gt;# 4xx/5xx -&amp;gt; raise, agent sees a real error
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_html_to_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MAX_CHARS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;MAX_CHARS&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;[truncated at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MAX_CHARS&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chars]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# stdio transport by default
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two helpers (&lt;code&gt;_is_public_http_url&lt;/code&gt;, &lt;code&gt;_html_to_text&lt;/code&gt;) and imports round it out to exactly 60 lines. The full file is at the end. The docstring on &lt;code&gt;web_fetch&lt;/code&gt; is not a comment — it becomes the tool description the model reads when deciding whether to call it, which is why the &lt;code&gt;list_tools&lt;/code&gt; output above echoes that first line back. Write it for the agent, not for you.&lt;/p&gt;

&lt;p&gt;A version note, because this bites people: I ran this on &lt;code&gt;mcp&lt;/code&gt; 1.27.2. On that version &lt;code&gt;from mcp.server.fastmcp import FastMCP&lt;/code&gt;, the &lt;code&gt;@mcp.tool()&lt;/code&gt; decorator, and &lt;code&gt;mcp.run()&lt;/code&gt; all exist and behave as shown. The low-level &lt;code&gt;Server&lt;/code&gt; API and some helper signatures have shifted across releases — if you're on something older or newer, check &lt;code&gt;pip show mcp&lt;/code&gt; and lean on the declarative &lt;code&gt;FastMCP&lt;/code&gt; path. It's the most stable surface between versions, and it's what keeps this under sixty lines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Talking to it from a client
&lt;/h3&gt;

&lt;p&gt;To prove the agent can actually reach the tool, I connected over stdio with the SDK's own client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# client.py — runnable local
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.client.session&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ClientSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.client.stdio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StdioServerParameters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stdio_client&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StdioServerParameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;server.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;stdio_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nf"&gt;as &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_tools&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;                       &lt;span class="c1"&gt;# tool is discoverable
&lt;/span&gt;            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_fetch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                          &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One snag worth saving you: I first wrote &lt;code&gt;command="python"&lt;/code&gt;, and on a clean venv that raised &lt;code&gt;FileNotFoundError: 'python'&lt;/code&gt; — the binary wasn't on PATH, only &lt;code&gt;python3&lt;/code&gt; was. &lt;code&gt;sys.executable&lt;/code&gt; points at the interpreter already running, so it just works. Small thing, ten minutes lost. Prefer it over a bare &lt;code&gt;"python"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Don't want to write a client? &lt;code&gt;npx @modelcontextprotocol/inspector python server.py&lt;/code&gt; gives you a UI to poke the tool, or you register &lt;code&gt;server.py&lt;/code&gt; as a local MCP server in Claude Desktop / Claude Code and call it from there.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four defaults that aren't decoration
&lt;/h2&gt;

&lt;p&gt;This is the part I actually care about, and the reason I bothered writing instead of just linking the quickstart.&lt;/p&gt;

&lt;p&gt;The naive version of this tool is three lines: &lt;code&gt;httpx.get(url).text&lt;/code&gt;, return it, done. It demos fine. Then you point a real agent at the open web and it falls over in ways the quickstart never warned you about. These four defaults come straight off our scraping fleet — across roughly &lt;strong&gt;2,190 production runs&lt;/strong&gt; on 32 published actors (our Trustpilot scraper alone has logged &lt;strong&gt;962 runs&lt;/strong&gt;), and every one of these earned its place by causing pain when it was missing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. A polite, contactable User-Agent.&lt;/strong&gt; The default &lt;code&gt;httpx&lt;/code&gt;/&lt;code&gt;python-requests&lt;/code&gt; UA is a fast way to get silently throttled or blocked. A UA that says who you are and how to reach you is the cheapest goodwill there is — and on a few of our actors it was the single line that flipped a site from 403 to 200. Feature → so what: your agent stops getting ghosted by servers that block unknown bots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. &lt;code&gt;timeout&lt;/code&gt; + &lt;code&gt;raise_for_status()&lt;/code&gt;.&lt;/strong&gt; An agent that hangs on a slow server is worse than one that errors — it freezes the whole tool call with no signal. A 15-second timeout plus raising on 4xx/5xx means a bad URL surfaces as a real error the agent can react to, instead of an empty string it confidently treats as "the page said nothing." Silent garbage is the expensive failure mode; I've watched a scraper return empty arrays for days because nobody raised on a 429.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. A size cap.&lt;/strong&gt; &lt;code&gt;MAX_CHARS = 8000&lt;/code&gt; is a guardrail against your agent's context window, not the network. Some pages are enormous. Without a cap, one fetch of a bloated page can eat half your context and your budget with it. Truncating with a visible marker is honest and bounded. Tune the number to your model; the principle doesn't change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Clean text instead of raw HTML.&lt;/strong&gt; The tool strips tags and returns prose. On &lt;code&gt;example.com&lt;/code&gt; — a trivial page — that took the payload from 559 raw HTML characters down to 142 of clean text, about 3.9× smaller (one page; real sites skew far higher because of nav, scripts, and inline styling). Why it matters for your token bill is its own rabbit hole, and I already measured it on real pages in &lt;a href="https://blog.spinov.online/blog/raw-html-is-a-token-tax-i-measured-it/" rel="noopener noreferrer"&gt;Raw HTML Is a Token Tax — I Measured It&lt;/a&gt;. Short version: agents pay for every HTML character they're handed and read almost none of it. Hand them text.&lt;/p&gt;

&lt;h2&gt;
  
  
  The guardrail people skip: SSRF
&lt;/h2&gt;

&lt;p&gt;One more, and it's the one I'd flag in code review. A &lt;code&gt;web_fetch&lt;/code&gt; tool an LLM controls is a request your model can aim &lt;em&gt;anywhere&lt;/em&gt; — including &lt;code&gt;http://169.254.169.254/&lt;/code&gt;, the cloud metadata endpoint that leaks credentials, or &lt;code&gt;http://localhost:6379&lt;/code&gt; to poke your Redis. That's Server-Side Request Forgery, and an agent can be talked into it by a malicious page telling it to "fetch this URL."&lt;/p&gt;

&lt;p&gt;So before any request, the server resolves the host and refuses private, loopback, link-local, and reserved addresses. Real output, same run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== call_tool web_fetch('http://169.254.169.254/') ===
isError: True
Error executing tool web_fetch: refusing to fetch a private/internal address
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the metadata IP getting turned away. The check is ~10 lines and it's not optional if the URL can come from a model. It is &lt;em&gt;not&lt;/em&gt; a complete SSRF defense — DNS rebinding and redirect-to-internal are still live concerns, and &lt;code&gt;follow_redirects=True&lt;/code&gt; means you'd want to re-check the final hop in anything serious. But refusing the obvious internal targets is the floor, and most toy fetch tools don't even have that.&lt;/p&gt;

&lt;p&gt;And the de-tagger is honest about being a toy. The regex &lt;code&gt;_html_to_text&lt;/code&gt; is fine for a demo; it is not a real content extractor. For production, swap it for &lt;a href="https://trafilatura.readthedocs.io/" rel="noopener noreferrer"&gt;trafilatura&lt;/a&gt; or readability-lxml, which actually find the article body and drop boilerplate. I left the regex in so the file stays one dependency and sixty lines — but I'm telling you it's the first thing to replace.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you've got, and the honest edges
&lt;/h2&gt;

&lt;p&gt;Sixty lines, &lt;code&gt;pip install mcp httpx&lt;/code&gt;, and your agent has a &lt;code&gt;web_fetch&lt;/code&gt; tool that returns readable text, identifies itself politely, won't hang, won't flood your context, and refuses a direct metadata-IP URL. Free. On your machine. No vendor.&lt;/p&gt;

&lt;p&gt;Where it stops, plainly: no JavaScript, no anti-bot evasion, no proxies, a toy extractor by default, and an SSRF guard that's a floor, not a fortress. For the docs/API/article tail, that's plenty. For Cloudflare-walled or JS-heavy targets, you're in headless-browser territory — a separate post.&lt;/p&gt;

&lt;p&gt;The source for &lt;code&gt;server.py&lt;/code&gt; (exactly 60 lines, the file I ran) is below, verified against &lt;code&gt;mcp&lt;/code&gt; 1.27.2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# server.py — a minimal MCP server that gives an AI agent ONE tool: web_fetch.
# runnable local:  pip install mcp httpx  →  python server.py
# The defaults here (UA, timeout, redirects, raise_for_status, size cap) are the
# same ones I run across our scraping fleet — not decoration, they stop real pain.
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ipaddress&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urlparse&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;

&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web-fetch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;MAX_CHARS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8000&lt;/span&gt;  &lt;span class="c1"&gt;# guardrail: don't blow up the agent's context window
&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_is_public_http_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Reject non-http(s) and private/loopback targets (a small SSRF guard).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scheme&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;only http/https URLs are allowed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hostname&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ipaddress&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gethostbyname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gaierror&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cannot resolve host: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_private&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_loopback&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_link_local&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_reserved&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refusing to fetch a private/internal address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_html_to_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Toy de-tag: good enough for a demo, NOT a real extractor.
    For production use trafilatura or readability-lxml instead.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;(script|style|noscript)[^&amp;gt;]*&amp;gt;.*?&amp;lt;/\1&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;[^&amp;gt;]+&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\s+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;web_fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch a public web page and return clean readable text (no raw HTML).
    Use this when you need the current contents of a URL.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;_is_public_http_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;# Identify yourself. A polite, contactable UA is the cheapest way to
&lt;/span&gt;        &lt;span class="c1"&gt;# not get silently blocked — and it's the right thing to do.
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent-web-fetch/0.1 (+https://blog.spinov.online; contact you@example.com)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;15.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;follow_redirects&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# 4xx/5xx -&amp;gt; raise, so the agent sees a real error
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_html_to_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MAX_CHARS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;MAX_CHARS&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;[truncated at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MAX_CHARS&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chars]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# stdio transport by default
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I keep coming back to one design question and don't have a clean answer: when you hand an agent a tool, how much should the &lt;em&gt;tool&lt;/em&gt; enforce versus how much you trust the model to behave? I put the SSRF check and size cap in the tool because I don't trust prompt-level rules to hold under adversarial input. But that bloats every tool with guardrail code. Where do you draw that line — in the tool, in a sandbox around it, or in the agent's policy?&lt;/p&gt;

&lt;p&gt;What's the first tool you'd hand your agent — and what guardrail would you refuse to ship without? 👇&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Follow for the next teardown from our production runs. I read every comment.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Written with AI assistance; all code was run and every output above is real terminal output, not generated. Tested on &lt;code&gt;mcp&lt;/code&gt; 1.27.2 / Python 3.13.5 on 2026-06-11. Source: the official MCP docs at &lt;a href="https://modelcontextprotocol.io/docs/develop/build-server" rel="noopener noreferrer"&gt;modelcontextprotocol.io&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>ai</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Your Scraper Re-Downloads Everything. Most Didn't Change.</title>
      <dc:creator>Alex Spinov </dc:creator>
      <pubDate>Tue, 09 Jun 2026 18:11:57 +0000</pubDate>
      <link>https://dev.to/0012303/your-scraper-re-downloads-everything-most-didnt-change-1chd</link>
      <guid>https://dev.to/0012303/your-scraper-re-downloads-everything-most-didnt-change-1chd</guid>
      <description>&lt;p&gt;Your scheduled scraper re-downloaded the whole corpus last night. A few thousand records. About forty of them actually changed since the run before.&lt;/p&gt;

&lt;p&gt;It downloaded all of them anyway, because it had no idea which forty.&lt;/p&gt;

&lt;p&gt;That's the failure I want to talk about. Not a block, not a crash, not bad data. A scraper that works perfectly and still does an enormous amount of work it didn't need to — because it decides &lt;em&gt;what to fetch&lt;/em&gt; after fetching, instead of before.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A full re-scrape is the dumb default. On a scheduled re-run you pay to re-download records that didn't change since last time, and the scraper can't skip them because it learns "did this change" only &lt;em&gt;after&lt;/em&gt; the request.&lt;/li&gt;
&lt;li&gt;The cheap re-scrape isn't "download faster." It's "don't download what didn't change" — and you decide that from a manifest of what you knew last run, before the first request goes out.&lt;/li&gt;
&lt;li&gt;Three levers, in priority order: a trustworthy validator (ETag / Last-Modified) → &lt;code&gt;CONDITIONAL&lt;/code&gt; (a 304 transfers zero body); no validator but a stored content hash → &lt;code&gt;FETCH&lt;/code&gt; then compare; new URL → &lt;code&gt;FETCH&lt;/code&gt;. RFC 7232 is the whole spec.&lt;/li&gt;
&lt;li&gt;The trap I hit in production: a weak, per-request-rotating ETag never returns 304. It fakes a 200 every time, so you "saved" nothing and trusted a validator that lies. The planner has to flag it and fall back to content-hash.&lt;/li&gt;
&lt;li&gt;The savings number below is a deterministic synthetic manifest you run yourself — not a measurement of any real site. What's real is the exposure: 2,190 production runs across 32 actors, one Trustpilot scraper at 962.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why I get to talk about re-scrape cost
&lt;/h2&gt;

&lt;p&gt;I run production scrapers. Thirty-two published actors, 2,190 runs logged in production as of this week — that's the live counter on my own Apify dashboard, not a rounded brag. One of them, a Trustpilot review scraper, has run 962 times.&lt;/p&gt;

&lt;p&gt;That 962 is the relevant number here. A review scraper isn't a one-shot job. It runs on a schedule, against the same companies, over and over — which means it re-visits the same pages it already saw last week, and the week before. Most of those pages have one new review, or none. Re-pulling the unchanged bulk on every scheduled run is, in my experience, the quiet majority of the work a long-lived scraper does. Not the failures. The redundant success.&lt;/p&gt;

&lt;p&gt;Now the honest part. I do &lt;strong&gt;not&lt;/strong&gt; have a clean, published figure for the compute-units, proxy-GB, or wall-clock time of a full re-crawl versus a delta crawl on our real corpus. That number is n/d, and I'm not going to invent one to make a point. What I can give you is the &lt;em&gt;mechanism&lt;/em&gt;, on a manifest you can run in two seconds and get the exact count I did. The 2,190 / 962 is the real part — it's the reason I think about this at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is not the other failures in the series
&lt;/h2&gt;

&lt;p&gt;These failures rhyme, and the fixes don't. Draw the boundary hard.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Not this&lt;/th&gt;
&lt;th&gt;That post is about&lt;/th&gt;
&lt;th&gt;This post is about&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://blog.spinov.online/blog/raw-html-is-a-token-tax-i-measured-it/" rel="noopener noreferrer"&gt;Raw HTML token tax&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;the cost of &lt;strong&gt;one&lt;/strong&gt; fetch (raw HTML → markdown tokens, a polite conditional GET)&lt;/td&gt;
&lt;td&gt;how many fetches to make &lt;strong&gt;at all&lt;/strong&gt; on a re-run — work at the level of the whole record set&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://blog.spinov.online/blog/scraping-text-is-the-easy-10-percent-dedup-and-decay/" rel="noopener noreferrer"&gt;Corpus near-duplicates&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;removing duplicates &lt;strong&gt;inside&lt;/strong&gt; data you already collected&lt;/td&gt;
&lt;td&gt;not re-collecting what didn't change, &lt;strong&gt;before&lt;/strong&gt; collection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://blog.spinov.online/blog/your-scraper-died-at-row-12000/" rel="noopener noreferrer"&gt;Resume a dead run&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;finishing an &lt;strong&gt;interrupted&lt;/strong&gt; run without re-doing work&lt;/td&gt;
&lt;td&gt;deliberately not re-pulling unchanged records on a &lt;strong&gt;successful, scheduled&lt;/strong&gt; re-run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://blog.spinov.online/blog/your-scraper-passes-every-run-its-still-rotting/" rel="noopener noreferrer"&gt;Yield decay over time&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;detecting&lt;/strong&gt; that output is silently rotting vs a baseline&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;deciding&lt;/strong&gt; what to re-collect to keep the corpus fresh cheaply&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://blog.spinov.online/blog/your-scraper-got-clean-data-the-site-lied/" rel="noopener noreferrer"&gt;Poisoned data&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;values that are valid in form but &lt;strong&gt;false&lt;/strong&gt; in fact&lt;/td&gt;
&lt;td&gt;the &lt;em&gt;volume&lt;/em&gt; of re-collection — nothing about trusting the content&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The thin line is with the token-tax post. A conditional GET shows up there too, as politeness on a single request. Here a conditional GET is just &lt;em&gt;one of three levers&lt;/em&gt; in a plan over the whole set, and it's not even the center of gravity — the center is &lt;code&gt;SKIP&lt;/code&gt; by manifest. The moment a re-scrape post starts reading like "how to send one conditional GET," it has drifted into the token-tax post. The question that keeps it here: &lt;em&gt;for N records on a scheduled re-run, how many do I touch?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;New axis: &lt;strong&gt;work&lt;/strong&gt;. Not the cost of one fetch, not deduping, not resume, not detection. The size of the job on a planned repeat.&lt;/p&gt;

&lt;h2&gt;
  
  
  The decision belongs before the request, not after
&lt;/h2&gt;

&lt;p&gt;Here's the whole reframe. A naive scraper's loop is: fetch the page, then notice it's identical to last time. The noticing is too late — the bytes already crossed the wire, the proxy already burned, the parser already ran.&lt;/p&gt;

&lt;p&gt;A planner inverts it. Before any request, it reads last run's manifest — one small row per record with what you knew: &lt;code&gt;{url, etag, last_modified, content_hash}&lt;/code&gt; — and assigns a plan. RFC 7232 gives you two of the three levers for free, and they've been in HTTP for years; almost nobody uses them on the re-scrape path.&lt;/p&gt;

&lt;p&gt;The priority order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Trustworthy validator → &lt;code&gt;CONDITIONAL&lt;/code&gt;.&lt;/strong&gt; If last run gave you an ETag or a Last-Modified, send &lt;code&gt;If-None-Match&lt;/code&gt; / &lt;code&gt;If-Modified-Since&lt;/code&gt;. If the page is unchanged the server answers &lt;code&gt;304 Not Modified&lt;/code&gt; — status line, no body. You confirmed "nothing changed" for the cost of a header round-trip, not a full download.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No validator, but a stored content hash → &lt;code&gt;FETCH&lt;/code&gt;, then compare.&lt;/strong&gt; Some servers give you nothing to precondition on. You still hold last run's body hash. Fetch, hash the new body, and if it matches, &lt;em&gt;stop&lt;/em&gt; — don't re-parse, don't re-write downstream, don't re-embed. You paid for the bytes but skipped everything after.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New URL → &lt;code&gt;FETCH&lt;/code&gt;.&lt;/strong&gt; Not in the manifest, so there's nothing to compare. Pull it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The planner's only job is to assign one of &lt;code&gt;FETCH&lt;/code&gt; / &lt;code&gt;SKIP&lt;/code&gt; / &lt;code&gt;CONDITIONAL&lt;/code&gt; to every record, from the manifest alone, before the loop starts. That plan is the artifact. Everything downstream just executes it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trap: a validator that lies
&lt;/h2&gt;

&lt;p&gt;I trusted ETags completely until one source taught me not to.&lt;/p&gt;

&lt;p&gt;A conditional GET assumes the validator is &lt;em&gt;stable&lt;/em&gt;: same content, same ETag, so an unchanged page returns 304. But RFC 7232 §2.1 explicitly allows &lt;strong&gt;weak&lt;/strong&gt; validators — metadata "that might not change for every change to the representation data … or a desire of the resource owner to group representations by some self-determined set of equivalency" (&lt;a href="https://www.rfc-editor.org/rfc/rfc7232#section-2.1" rel="noopener noreferrer"&gt;RFC 7232 §2.1&lt;/a&gt;). A weak ETag is written &lt;code&gt;W/"..."&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What bit me was worse than weak — it rotated. The server emitted a different ETag on &lt;em&gt;every response&lt;/em&gt; for the same unchanged page. So my &lt;code&gt;If-None-Match&lt;/code&gt; never matched, the server never returned 304, and every conditional request came back as a fresh 200 with a full body. I'd "optimized" the re-scrape and saved exactly nothing on that source, while believing I had. The data was fine. The plan was a lie.&lt;/p&gt;

&lt;p&gt;The fix is to treat the validator as untrustworthy and downgrade: when an ETag is weak or known to rotate, don't plan &lt;code&gt;CONDITIONAL&lt;/code&gt;, plan &lt;code&gt;FETCH&lt;/code&gt; and compare the content hash after. You lose the 304 savings on that one source, but you stop trusting a number that can't be trusted. The planner below carries that downgrade as an explicit branch — it's the production detail that turns a tutorial into something that survives contact with a real site.&lt;/p&gt;

&lt;h2&gt;
  
  
  The planner, in pieces
&lt;/h2&gt;

&lt;p&gt;Pure stdlib, no network, no browser, no keys, no &lt;code&gt;random&lt;/code&gt;. A deterministic synthetic manifest stands in for last run's stored state, so you get the exact output I did. The transport is irrelevant to the mechanism — the planning is just a decision over a table.&lt;/p&gt;

&lt;p&gt;First, the decision for a single record. This is the entire idea:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plan_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return (plan, note) for one manifest record, BEFORE any request.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;etag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;etag_weak&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# A weak/rotating ETag never produces a 304. Don't trust it.
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FETCH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;untrustworthy_validator -&amp;gt; hash-compare&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;etag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_modified&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CONDITIONAL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;send If-None-Match / If-Modified-Since&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FETCH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no validator -&amp;gt; compare content_hash after fetch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FETCH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no prior knowledge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read the order. The weak-ETag downgrade is &lt;em&gt;first&lt;/em&gt;, on purpose — a record can have an ETag and still be untrustworthy, and if you check "has an ETag" before "is the ETag weak," you plan &lt;code&gt;CONDITIONAL&lt;/code&gt; on a validator that lies. Order is the bug surface here.&lt;/p&gt;

&lt;p&gt;Then the simulation that proves the savings. It does not hit a network — it models each server's outcome from a fixed rule so the count is reproducible. Unchanged + &lt;code&gt;CONDITIONAL&lt;/code&gt; → a 304 (no body). Anything &lt;code&gt;FETCH&lt;/code&gt; → a body. Only two records actually changed since last run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CHANGED_THIS_RUN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;  &lt;span class="c1"&gt;# the few records whose body really changed
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://shop.example.com/p/1002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://shop.example.com/p/1006&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;simulate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;bodies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;          &lt;span class="c1"&gt;# full bodies transferred
&lt;/span&gt;    &lt;span class="n"&gt;not_modified&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;    &lt;span class="c1"&gt;# 304s — zero body
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;changed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;CHANGED_THIS_RUN&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CONDITIONAL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;changed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;bodies&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;                 &lt;span class="c1"&gt;# 200 + new body
&lt;/span&gt;            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;not_modified&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;           &lt;span class="c1"&gt;# 304, zero body
&lt;/span&gt;        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                               &lt;span class="c1"&gt;# FETCH (incl. weak-ETag fallback, new urls)
&lt;/span&gt;            &lt;span class="n"&gt;bodies&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bodies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;not_modified&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The baseline it compares against is the dumb default: a full re-scrape downloads every record's body, every run. So &lt;code&gt;fetches_saved = total − bodies_transferred&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The live run
&lt;/h2&gt;

&lt;p&gt;Twelve records in scope: ten carried over from last run's manifest, two new this run. Seven have a trustworthy validator and get planned &lt;code&gt;CONDITIONAL&lt;/code&gt;. Five get &lt;code&gt;FETCH&lt;/code&gt; — two with no validator at all, one new-URL pair, and the one weak-ETag trap that got downgraded out of &lt;code&gt;CONDITIONAL&lt;/code&gt;. Run the script and you get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== RE-SCRAPE PLANNER (deterministic synthetic manifest, not a real site) ===
records in scope          : 12  (10 from manifest + 2 new)
plan decided BEFORE any request:
  CONDITIONAL             : 7  (If-None-Match / If-Modified-Since)
  FETCH                   : 5
    of which weak-ETag fallback: 1  (untrustworthy validator -&amp;gt; hash-compare)
--------------------------------------------------------
simulated run outcomes:
  304 not-modified (no body): 5
  bodies transferred        : 7
--------------------------------------------------------
naive full re-scrape bodies : 12
planner bodies transferred  : 7
fetches_saved               : 5  (5/12 bodies not re-downloaded)
--------------------------------------------------------
per-record plan:
  1001  CONDITIONAL 304 (unchanged, no body)     send If-None-Match / If-Modified-Since
  1002  CONDITIONAL 200 (changed, body transferred) send If-None-Match / If-Modified-Since
  1003  CONDITIONAL 304 (unchanged, no body)     send If-None-Match / If-Modified-Since
  1004  CONDITIONAL 304 (unchanged, no body)     send If-None-Match / If-Modified-Since
  1005  CONDITIONAL 304 (unchanged, no body)     send If-None-Match / If-Modified-Since
  1006  CONDITIONAL 200 (changed, body transferred) send If-None-Match / If-Modified-Since
  1007  CONDITIONAL 304 (unchanged, no body)     send If-None-Match / If-Modified-Since
  1008  FETCH       200 (fetched body)           no validator -&amp;gt; compare content_hash after fetch
  1009  FETCH       200 (fetched body)           no validator -&amp;gt; compare content_hash after fetch
  1010  FETCH       200 (fetched body)           untrustworthy_validator -&amp;gt; hash-compare *TRAP*
  1011  FETCH       200 (fetched body)           new url (not in manifest)
  1012  FETCH       200 (fetched body)           new url (not in manifest)
========================================================
verdict: planned 7 conditional + 5 fetch; transferred 7 bodies vs 12 naive (5 saved).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read what that output is saying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CONDITIONAL: 7&lt;/code&gt; and &lt;code&gt;304 not-modified: 5&lt;/code&gt;. Seven records were checked with a header round-trip; five of them came back 304 — confirmed unchanged, zero body transferred. Those five are the win. The naive scraper would have downloaded all five in full.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;200 (changed, body transferred)&lt;/code&gt; on 1002 and 1006. The two records that actually changed still get their new body — a conditional GET costs nothing when the page &lt;em&gt;did&lt;/em&gt; change; you just get a normal 200. The plan never hides a real update.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;1010 *TRAP*&lt;/code&gt;. The weak-ETag record did &lt;strong&gt;not&lt;/strong&gt; stay &lt;code&gt;CONDITIONAL&lt;/code&gt;. The planner downgraded it to &lt;code&gt;FETCH&lt;/code&gt; with &lt;code&gt;untrustworthy_validator -&amp;gt; hash-compare&lt;/code&gt;, so it transfers the body and compares the hash — instead of trusting a rotating ETag that would have faked a 200 forever.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fetches_saved: 5&lt;/code&gt;. Five of twelve bodies not re-downloaded, decided entirely before the first request. On a real corpus the unchanged share is usually far higher than 5/12 — but that's the n/d number I won't fabricate. The 5/12 here is what you can reproduce.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where this breaks (and I'm not overselling it)
&lt;/h2&gt;

&lt;p&gt;A re-scrape planner is a work-reducer, not magic. The limits are the point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content-hash savings happen after the bytes, not before.&lt;/strong&gt; The &lt;code&gt;CONDITIONAL&lt;/code&gt; lever skips the download. The hash-compare lever only skips &lt;em&gt;parsing and downstream work&lt;/em&gt; — you still pay for the body. On a source with no validators, you cut CPU and write amplification, not bandwidth. That's a real win, but a smaller one, and worth being honest about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A moving timestamp in the body breaks naive hashing.&lt;/strong&gt; If the page embeds a "last viewed" or a server time, the body hash changes on every fetch even when the data didn't. You end up hashing noise. The workaround is to hash a normalized projection — strip the volatile fields first — which is fine until the timestamp &lt;em&gt;is&lt;/em&gt; the data you came for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weak ETags are one liar; there are others.&lt;/strong&gt; A server can return a stable strong ETag and still serve changed content (broken cache), or change content without touching Last-Modified. The downgrade catches the rotating case I hit. It does not make validators trustworthy in general — it makes you stop assuming they are.&lt;/p&gt;

&lt;p&gt;So treat the planner as what it is: a cheap way to turn "re-download everything" into "touch what plausibly changed, confirm the rest with a header." It won't make a re-scrape free. It makes the dumb default stop being the default.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do Monday
&lt;/h2&gt;

&lt;p&gt;Three moves, smallest first:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Persist a manifest.&lt;/strong&gt; One row per record: &lt;code&gt;url&lt;/code&gt;, the ETag and Last-Modified the server gave you, and a hash of the body you stored. It's a tiny JSON or SQLite table. If you don't keep it, every run starts blind and a full re-scrape is your &lt;em&gt;only&lt;/em&gt; option — the manifest is what makes a plan possible at all.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Send conditional requests on the re-run path.&lt;/strong&gt; &lt;code&gt;If-None-Match&lt;/code&gt; from the stored ETag, &lt;code&gt;If-Modified-Since&lt;/code&gt; from the stored Last-Modified. Honor a 304 as "unchanged, skip." This is in &lt;code&gt;requests&lt;/code&gt; and &lt;code&gt;httpx&lt;/code&gt; today; it's a few lines, and it's the one change that pays back the most.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distrust weak and rotating validators.&lt;/strong&gt; If an ETag starts with &lt;code&gt;W/&lt;/code&gt;, or you see it change across two fetches of an unchanged page, downgrade that source to fetch-and-hash. Log the downgrade so you know which sources you can't precondition — a plan that knows what it can't trust beats one that trusts a liar.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You don't need a crawl framework rewrite. You need to stop deciding what to fetch after you've already fetched it.&lt;/p&gt;




&lt;p&gt;One thing I haven't solved cleanly: how do you decide a record changed when the source gives you no validator &lt;em&gt;and&lt;/em&gt; the body carries a moving timestamp inside it? I hash with the timestamp stripped — but that breaks the moment the timestamp &lt;em&gt;is&lt;/em&gt; the data I came to collect, and I don't have a general rule for telling those two cases apart automatically. If you've got a heuristic that holds up in production, I want to hear it. I read every comment.&lt;/p&gt;

&lt;p&gt;Follow for the next numbers from the run log. And tell me: what's the worst re-scrape waste you've found in your own pipeline — the job that was re-downloading the most for the least?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by Aleksei Spinov — I run production scrapers (2,190 runs across 32 actors; one Trustpilot scraper at 962). Proof: &lt;a href="https://blog.spinov.online" rel="noopener noreferrer"&gt;blog.spinov.online&lt;/a&gt; and my &lt;a href="https://apify.com/knotless_cadence" rel="noopener noreferrer"&gt;Apify profile&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AI disclosure: drafted with AI assistance, then edited, fact-checked, and the code run and verified by me. The manifest is synthetic and deterministic; the output above is real stdout from executing the script.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>dataengineering</category>
      <category>performance</category>
    </item>
    <item>
      <title>Your AI Agent Is Paying for HTML It Never Reads — I Measured the 7x Token Tax</title>
      <dc:creator>Alex Spinov </dc:creator>
      <pubDate>Tue, 09 Jun 2026 07:58:48 +0000</pubDate>
      <link>https://dev.to/0012303/your-ai-agent-is-paying-for-html-it-never-reads-i-measured-the-7x-token-tax-4cf8</link>
      <guid>https://dev.to/0012303/your-ai-agent-is-paying-for-html-it-never-reads-i-measured-the-7x-token-tax-4cf8</guid>
      <description>&lt;p&gt;I gave an agent a &lt;code&gt;fetch_page&lt;/code&gt; tool, asked it to read one Wikipedia article, and watched that single page cost &lt;strong&gt;48,703 tokens&lt;/strong&gt; before the model produced a word. The readable text on that page is about 7,300 tokens. I was paying for ~41,000 tokens of &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt;, inline CSS, and analytics scripts that never help the model answer anything.&lt;/p&gt;

&lt;p&gt;That's the token tax on agent web access, and almost nobody measures it. Here's the number, the 40-line fix, and the honest part — where it doesn't matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;p&gt;When your agent "reads a page", it usually gets raw HTML pasted into the prompt. On three pages I tested, &lt;strong&gt;85–86% of those tokens were markup the model doesn't need to read for meaning.&lt;/strong&gt; Strip the page to text first and the token bill drops ~7×. The fix is the standard library plus a tokenizer — no API, no paid service.&lt;/p&gt;

&lt;h2&gt;
  
  
  The measurement
&lt;/h2&gt;

&lt;p&gt;Counted with &lt;code&gt;o200k_base&lt;/code&gt; (the tokenizer GPT-4o uses), three live pages of different sizes, raw HTML vs text-only. Measured 2026-06-09 — these are live pages, so your exact numbers will differ:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Page&lt;/th&gt;
&lt;th&gt;Raw HTML&lt;/th&gt;
&lt;th&gt;Text for the agent&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Wikipedia: Web scraping (165 KB)&lt;/td&gt;
&lt;td&gt;48,703 tok&lt;/td&gt;
&lt;td&gt;7,280 tok&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;6.7×&lt;/strong&gt; (85% less)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wikipedia: Large language model (686 KB)&lt;/td&gt;
&lt;td&gt;221,622 tok&lt;/td&gt;
&lt;td&gt;30,988 tok&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;7.2×&lt;/strong&gt; (86% less)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;example.com (528 B, control)&lt;/td&gt;
&lt;td&gt;152 tok&lt;/td&gt;
&lt;td&gt;22 tok&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;6.9×&lt;/strong&gt; (86% less)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three pages is a small sample, not a benchmark. But the ratio barely moved between a 528-byte page and a 686 KB one, which is the interesting part: the markup overhead is roughly proportional, so on the pages I tested the tax shows up everywhere, not just on the big ones.&lt;/p&gt;

&lt;p&gt;At GPT-4o input pricing ($2.50 / 1M tokens, OpenAI, checked June 2026), the LLM page alone is &lt;strong&gt;$0.55 raw vs $0.078 clean&lt;/strong&gt; per read. One read. An agent that crawls 200 pages in a loop turns that into real money — and worse, it fills the context window with noise that pushes out the tokens you actually want the model reasoning over.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix (stdlib + tiktoken)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ssl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;html.parser&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HTMLParser&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;

&lt;span class="n"&gt;UA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/148.0.0.0 Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;SKIP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;script&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;style&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;head&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;noscript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;svg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;template&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TextOnly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HTMLParser&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skipping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_starttag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;SKIP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skipping&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_endtag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;SKIP&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skipping&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skipping&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;skipping&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UA&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TextOnly&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;feed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;enc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encoding_for_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# o200k_base
&lt;/span&gt;&lt;span class="n"&gt;raw_tok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clean_tok&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;clean_tok&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;raw_tok&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;        &lt;span class="c1"&gt;# prove the work was done
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;raw_tok&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -&amp;gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;clean_tok&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tokens  (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;raw_tok&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;clean_tok&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;x less)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a &lt;code&gt;runnable local&lt;/code&gt; excerpt. &lt;code&gt;pip install -U tiktoken&lt;/code&gt; (you need a recent version for &lt;code&gt;o200k_base&lt;/code&gt;), then &lt;code&gt;python clean.py https://en.wikipedia.org/wiki/Web_scraping&lt;/code&gt;. Output on my machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;48,703 -&amp;gt; 7,280 tokens  (6.7x less)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full script (with the TLS handling and sanity check below) is in the repo. It's the standard library doing the work &lt;code&gt;HTMLParser&lt;/code&gt; was built for, plus &lt;code&gt;tiktoken&lt;/code&gt; so you count in the model's units, not characters. No &lt;code&gt;requests&lt;/code&gt;, no readability library, no service.&lt;/p&gt;

&lt;p&gt;One honest detail from the sanity print: the extracted text starts with &lt;code&gt;Jump to content Main menu Main menu ... Navigation&lt;/code&gt;. This is a sanitizer, not a main-content reader — it keeps nav and footer text (more on that below).&lt;/p&gt;

&lt;h2&gt;
  
  
  One gotcha that cost me ten minutes
&lt;/h2&gt;

&lt;p&gt;I run behind a VPN, and the first fetch died with &lt;code&gt;CERTIFICATE_VERIFY_FAILED&lt;/code&gt;. The VPN was intercepting TLS, so the system couldn't chain to a trusted root. urllib hides this: it wraps the &lt;code&gt;ssl.SSLError&lt;/code&gt; inside a &lt;code&gt;urllib.error.URLError&lt;/code&gt;, so a naive &lt;code&gt;except ssl.SSLError&lt;/code&gt; never fires. You catch &lt;code&gt;URLError&lt;/code&gt; and look at &lt;code&gt;e.reason&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;URLError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ssl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SSLError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TLS failed. If you trust this proxy, re-run with --insecure.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script &lt;strong&gt;fails closed&lt;/strong&gt; — it won't silently disable verification. Measuring a page over an untrusted MITM proxy is meaningless (you'd be tokenizing whatever the proxy injected), so turning off TLS is an explicit &lt;code&gt;--insecure&lt;/code&gt; flag, not a quiet fallback.&lt;/p&gt;

&lt;h2&gt;
  
  
  When this is NOT worth it
&lt;/h2&gt;

&lt;p&gt;I'd be lying if I said "always strip HTML." It isn't free:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You lose structure.&lt;/strong&gt; Tables, link targets, &lt;code&gt;alt&lt;/code&gt;/&lt;code&gt;title&lt;/code&gt;, and &lt;code&gt;&amp;lt;code&amp;gt;&lt;/code&gt; boundaries flatten into text. If the agent's job is "extract every row of this table" or "follow these links," text-only throws away the signal. Hand it Markdown for those.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;html.parser&lt;/code&gt; is not a browser.&lt;/strong&gt; JS-rendered pages return a near-empty shell — this strips what the server sent, not what a browser paints. SPA targets still need a headless browser first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It's a sanitizer, not a reader.&lt;/strong&gt; It keeps menu and footer text (that &lt;code&gt;Jump to content / Main menu&lt;/code&gt; above). A readability pass cuts further, at the cost of a dependency and occasionally eating real content. For an agent, "too much text" is cheap; "silently dropped the answer" is expensive — so I over-keep.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The 7× is against raw HTML&lt;/strong&gt;, the naive default — not against Markdown or a readability pass, which also cut tokens. If you already feed Markdown, your win is smaller.&lt;/li&gt;
&lt;li&gt;Numbers depend on the live page, your User-Agent, and locale. Re-measure on your own targets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the honest rule: strip to text when the agent is &lt;strong&gt;reading for meaning&lt;/strong&gt; (RAG ingestion, summarization, Q&amp;amp;A). Keep structure when it's &lt;strong&gt;extracting specific fields&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I bothered
&lt;/h2&gt;

&lt;p&gt;I run production scrapers, and the lesson that transfers to agents is the same one that bit me on data pipelines: the cost isn't in the request, it's in what you carry forward. An agent that pastes raw HTML into every step pays the tax on every step, and the context bloat quietly degrades the reasoning you're paying for twice.&lt;/p&gt;

&lt;p&gt;40 lines. One &lt;code&gt;pip install&lt;/code&gt;. ~7× fewer tokens on the pages I tested.&lt;/p&gt;

&lt;p&gt;How are you feeding pages to your agents right now — raw HTML, Markdown, or a readability pass? And has anyone measured the token difference on their own targets? Drop your numbers 👇&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>webscraping</category>
      <category>llm</category>
    </item>
    <item>
      <title>Your Scraper Got Clean Data. The Site Lied to It.</title>
      <dc:creator>Alex Spinov </dc:creator>
      <pubDate>Mon, 08 Jun 2026 18:14:03 +0000</pubDate>
      <link>https://dev.to/0012303/your-scraper-got-clean-data-the-site-lied-to-it-427k</link>
      <guid>https://dev.to/0012303/your-scraper-got-clean-data-the-site-lied-to-it-427k</guid>
      <description>&lt;p&gt;Your scraper ran clean. HTTP 200 on every request. The schema validated. Every price sat in a sane range, every date was in the past, every ISBN had thirteen digits. Zero errors, zero retries. You shipped the dataset.&lt;/p&gt;

&lt;p&gt;And every value in it was a lie the site fed you on purpose, because it knew you were a bot.&lt;/p&gt;

&lt;p&gt;That's the failure I want to talk about. Not a block. Not a captcha. Not a crash. The site looked at your traffic, decided not to fight you, and instead handed back a 200 full of plausible garbage — values engineered to pass every check you have. This is the one corner of data quality where &lt;em&gt;valid&lt;/em&gt; and &lt;em&gt;true&lt;/em&gt; come apart, and almost every detector people write only measures the first one.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A site that detects a scraper can serve a 200 with a flawless schema and values that pass every sanity rule — and are deliberately fabricated. This is documented, shipping anti-bot behavior, not a hypothetical.&lt;/li&gt;
&lt;li&gt;Status codes and sanity checks can't catch it. They answer "is my pipeline correct?" The poisoned row's question is "is the source telling the truth?" No range check answers that.&lt;/li&gt;
&lt;li&gt;The fix is grounding: check each row against an &lt;em&gt;independent invariant&lt;/em&gt; the source can't fake by making the value look plausible — an ISBN-13 checksum, &lt;code&gt;price * qty == line_total&lt;/code&gt;, a real second origin.&lt;/li&gt;
&lt;li&gt;The trap I didn't expect: naive cross-source consensus gets fooled too. "Three sources agree" means nothing if all three are mirrors of one poisoned first-party page. Independence is the signal, not the vote count.&lt;/li&gt;
&lt;li&gt;The numbers in the demo below are a deterministic synthetic dataset, not a measurement of any real site. What's real is the volume that earns me the right to talk about plausible-but-false data: 2,190 production runs across 32 actors, one Trustpilot scraper at 962.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why I get to talk about plausible lies
&lt;/h2&gt;

&lt;p&gt;I run production scrapers. Thirty-two published actors, 2,190 runs logged in production as of this week (my own Apify dashboard — that's the live counter, not a rounded brag). One of them, a Trustpilot review scraper, has run 962 times against the same kind of source.&lt;/p&gt;

&lt;p&gt;That last one matters more than the total, and here's why. Reviews are the textbook class of data where &lt;em&gt;plausible&lt;/em&gt; and &lt;em&gt;true&lt;/em&gt; are different problems. A fake review and a real review look identical at the schema level: both have a star rating in 1–5, a date, a body in fluent English, an author handle. Sanity passes on both. The entire job of scraping that surface well is knowing that a clean, well-formed, perfectly typed record can still be fabricated. After 962 runs hitting that wall, "the value looks right and is still false" isn't a thought experiment to me. It's the default assumption.&lt;/p&gt;

&lt;p&gt;Now let me be honest about what I don't have. I do not have a clean, published figure for how many poisoned rows we've actually caught in the wild, or what share of any specific site's responses are fake. That number is n/d, and I'm not going to invent it. The 2,190 / 962 is the part that's real — it's the exposure that makes the failure class familiar. The detection mechanism below, I'll show you on a dataset you can run yourself in two seconds, so nothing rests on a number you can't check.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is not the other failures in the series
&lt;/h2&gt;

&lt;p&gt;If you've read the rest of this series, draw the boundary hard, because these failures rhyme and the fixes don't.&lt;/p&gt;

&lt;p&gt;This is not a broken status code. &lt;a href="https://blog.spinov.online/blog/http-200-is-a-lie-schema-canary/" rel="noopener noreferrer"&gt;HTTP 200 lying about the &lt;em&gt;shape&lt;/em&gt; of a response is the schema canary&lt;/a&gt; — that catches a 200 whose structure has drifted. Here the shape is flawless. The canary would report HEALTHY.&lt;/p&gt;

&lt;p&gt;This is not &lt;a href="https://blog.spinov.online/blog/your-scraper-returned-a-clean-row-it-was-wrong/" rel="noopener noreferrer"&gt;a field that violates a sanity rule&lt;/a&gt; — a price of $0, a date in the future, a language that doesn't match the country. That post catches values that look &lt;em&gt;wrong&lt;/em&gt;. This is the opposite: every field looks &lt;em&gt;right&lt;/em&gt;, passes every one of those sanity checks, and the value is still a deliberate lie the source served &lt;em&gt;because&lt;/em&gt; it made you as a bot.&lt;/p&gt;

&lt;p&gt;It isn't &lt;a href="https://blog.spinov.online/blog/your-scraper-died-at-row-12000/" rel="noopener noreferrer"&gt;a crash you resume from&lt;/a&gt; — there's no crash, the job exits 0. The question those posts answer is "is my pipeline correct?" This one asks something none of them touch: "is the source telling the truth?" And no status code, no schema validator, no range check will ever answer that.&lt;/p&gt;

&lt;p&gt;New axis: &lt;strong&gt;trust&lt;/strong&gt;. Not the shape of a response, not a field's plausibility, not a crash. Whether the values themselves are real.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is shipping, not a thought experiment
&lt;/h2&gt;

&lt;p&gt;The reason this isn't paranoia: the anti-bot industry already does it on purpose. The polite framing is "tarpit" or "decoy content" — when a site fingerprints a crawler, instead of returning a 403 it returns a 200 full of generated content that wastes your time and pollutes your dataset. Cloudflare shipped a feature in 2025 that feeds suspected AI crawlers a maze of AI-generated decoy pages on purpose; their own writeup frames it as serving believable-but-irrelevant data instead of blocking (&lt;a href="https://blog.cloudflare.com/ai-labyrinth/" rel="noopener noreferrer"&gt;Cloudflare, AI Labyrinth, 2025&lt;/a&gt;). Bruce Schneier has been cataloguing the broader version — deliberate data poisoning aimed at scrapers and the models trained on them (&lt;a href="https://www.schneier.com/blog/archives/2025/03/ai-data-poisoning.html" rel="noopener noreferrer"&gt;Schneier on Security, 2025&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;So the threat model flipped. The old story was "they'll block you." The new story is "they'll let you in and lie to you," because a poisoned dataset that you trust is worth more to them than a blocked request you'll just retry from another IP.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why status and sanity are silent by design
&lt;/h2&gt;

&lt;p&gt;Here's the uncomfortable part. The poisoned row passes your checks not because your checks are bad, but because they were built to answer a different question.&lt;/p&gt;

&lt;p&gt;A status code answers: did the transport succeed? 200 — yes. A schema canary answers: is the structure intact? All keys present, types correct, no nulls — yes. A sanity rule answers: is this value inside the range a real value would fall in? Price is $44.99, qty is 1, date is last month — yes, yes, yes.&lt;/p&gt;

&lt;p&gt;Every one of those is a question about &lt;em&gt;form&lt;/em&gt;. None of them is a question about &lt;em&gt;correspondence to reality&lt;/em&gt;. And an adversary who controls the response can satisfy every form check trivially — that's the whole point of serving a decoy instead of a block. They're not sending you malformed junk that trips a validator. They're sending you a beautifully formed lie.&lt;/p&gt;

&lt;p&gt;Sanity catches accidents: the source glitched, a field came back null, a scraper bug doubled a value. It does not catch adversarial fabrication, because fabrication is &lt;em&gt;designed&lt;/em&gt; to look sane. Validity is not truth. You can't range-check your way to trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually catches it: grounding to an invariant
&lt;/h2&gt;

&lt;p&gt;So if you can't trust the value, what &lt;em&gt;can&lt;/em&gt; you trust? An invariant the source can't satisfy just by making the number look plausible — something that has to be &lt;em&gt;computed against an independent reference&lt;/em&gt; and will break if the value was made up.&lt;/p&gt;

&lt;p&gt;Three kinds are cheap and shockingly effective:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A self-checking identifier.&lt;/strong&gt; An ISBN-13 isn't just thirteen digits. Its last digit is a checksum over the first twelve. A fabricated ISBN that "looks real" almost never satisfies the checksum.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A cross-field arithmetic invariant.&lt;/strong&gt; &lt;code&gt;price * qty == line_total&lt;/code&gt;. Each field can be individually plausible and the math can still be incoherent — which is exactly what happens when values get swapped around.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real independent corroboration.&lt;/strong&gt; Not "how many sources cite this," but "how many &lt;em&gt;distinct origins&lt;/em&gt; do." Three citations that all resolve to the same first-party domain are one source wearing three hats.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's the whole probe. Pure stdlib, no network, no browser, no keys, no &lt;code&gt;random&lt;/code&gt; — a deterministic synthetic dataset stands in for the collected rows so you get the exact output I did. The "collection" is a hardcoded list because the mechanism — grounding to an invariant — doesn't depend on the transport one bit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;isbn13_checksum_ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;isbn13&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;An ISBN-13 is valid only if sum(d_i * w_i) % 10 == 0, weights alternating
    1, 3. Sanity asks &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is it 13 digits&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. This asks &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is it a REAL ISBN&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;digits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;isbn13&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;arithmetic_invariant_ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;price * qty must equal line_total. Each field can be individually plausible
    and still violate this — incoherent values don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t survive the cross-field math.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;line_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;tol&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;registrable_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Crude eTLD+1: last two labels. &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m.vendor-a.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; -&amp;gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;vendor-a.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.
    In production use the public suffix list.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;independent_corroboration_ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_independent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""'&lt;/span&gt;&lt;span class="s"&gt;3 sources agree&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; means nothing if all 3 are mirrors of one origin.
    Count DISTINCT registrable domains, not citations.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;distinct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;registrable_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sources_citing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distinct&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;min_independent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each row gets graded &lt;code&gt;TRUSTED&lt;/code&gt; or &lt;code&gt;POISONED(reason)&lt;/code&gt;. Note what the checks do &lt;em&gt;not&lt;/em&gt; need: they never need to know which field is the lie, or what the "correct" value was. They only need the invariant to hold. That's the property that makes grounding work where sanity can't — it doesn't model the truth, it models a &lt;em&gt;constraint&lt;/em&gt; the truth obeys and a lie usually breaks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The twist: naive consensus gets fooled too
&lt;/h2&gt;

&lt;p&gt;The obvious upgrade, the one most people reach for, is consensus. "Don't trust one source. Cross-check three. If they agree, it's true." It feels bulletproof.&lt;/p&gt;

&lt;p&gt;It isn't, and this is the part I want you to take away even if you forget the rest. I ran a row into the probe that &lt;em&gt;three sources corroborate&lt;/em&gt; — and it's still poisoned. Because all three "sources" resolve to the same registrable domain: &lt;code&gt;store.vendor-a.com&lt;/code&gt;, &lt;code&gt;m.vendor-a.com&lt;/code&gt;, &lt;code&gt;cdn.vendor-a.com&lt;/code&gt;. One poisoned first-party page, mirrored across a store front, a mobile host, and a CDN. A naive consensus check counts three agreements and votes confidently &lt;em&gt;for the lie&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Consensus measures agreement. Agreement is not independence. If everyone is quoting the same poisoned origin, unanimity is exactly what you'd expect — and exactly what you should distrust. The fix is to count distinct &lt;em&gt;origins&lt;/em&gt;, not distinct &lt;em&gt;URLs&lt;/em&gt;: collapse each citation to its registrable domain and require at least two that don't trace back to the same place. That's the difference between "three sources said so" and "three independent sources said so," and adversaries live in the gap between those two sentences.&lt;/p&gt;

&lt;h2&gt;
  
  
  The live run
&lt;/h2&gt;

&lt;p&gt;Twelve "collected" rows. All twelve pass the schema canary (#5) and all twelve pass field sanity (#7) — by construction, so the old detectors stay silent. Three are poisoned, each caught by exactly one grounding rule. Run the script and you get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== POISON CHECK (deterministic synthetic dataset, not a measurement of a real site) ===
records collected        : 12
passed schema canary (#5): 12   &amp;lt;- shape is perfect on ALL
passed field sanity (#7) : 12   &amp;lt;- every value looks RIGHT; sanity is silent
--------------------------------------------------------
grounding checks (truth, not validity):
  ISBN-13 checksum        : 1 failed
  price*qty == line_total : 1 failed
  independent corroborat. : 1 failed (3 "sources" all point to one first-party)
--------------------------------------------------------
TRUSTED                  : 9
POISONED                 : 3
  - sku DEMO-0007  reason=isbn13_checksum
  - sku DEMO-0009  reason=arithmetic_invariant
  - sku DEMO-0011  reason=false_consensus_single_origin
========================================================
verdict: 3 clean, well-formed, sanity-passing rows are fabricated.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read what that output is actually saying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;passed schema canary 12&lt;/code&gt; and &lt;code&gt;passed field sanity 12&lt;/code&gt;. All twelve clear every check from the earlier posts in this series. The schema canary and the sanity validator are structurally blind here — not broken, just answering a different question.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ISBN-13 checksum: 1 failed&lt;/code&gt;. DEMO-0007 has a thirteen-digit ISBN with a valid 978 prefix. Sanity loves it. The check digit doesn't satisfy the weighted-mod-10 rule, so the number was made up.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;price*qty == line_total: 1 failed&lt;/code&gt;. DEMO-0009 has a sane price, a positive integer qty, and a positive total — each field passes on its own. &lt;code&gt;34.99 * 3&lt;/code&gt; is &lt;code&gt;104.97&lt;/code&gt;, not the &lt;code&gt;79.99&lt;/code&gt; in the row. The values are individually plausible and jointly incoherent.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;independent corroborat: 1 failed&lt;/code&gt;. DEMO-0011 is the consensus trap: three citations, one origin. The vote says trust it. Independence says don't.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;verdict: 3 ... fabricated&lt;/code&gt; among nine clean rows. The poison is invisible to status, shape, and sanity, and visible only to grounding.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where this breaks (and I'm not going to oversell it)
&lt;/h2&gt;

&lt;p&gt;Grounding is not a lie detector. It's narrower than that, and the limits matter more than the wins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need an invariant, and not all data has one.&lt;/strong&gt; A book has an ISBN checksum. A line item has arithmetic. A free-text Trustpilot review has &lt;em&gt;neither&lt;/em&gt;. There is no checksum on "the food was cold and the staff were rude." For pure text with no internal constraint and no independent anchor, grounding has nothing to grab — and that's exactly the surface where poisoning is easiest. I can't hand you a 30-line probe for that. I'd be lying if I said I could.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-source corroboration needs sources that are genuinely separate.&lt;/strong&gt; My domain collapse catches the lazy mirror case. It does not catch a determined adversary who plants the same lie across genuinely distinct domains, or a real ecosystem where everyone honestly syndicates from one upstream feed. Independence is a spectrum, and registrable-domain is a crude proxy for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A perfectly consistent fabrication beats this.&lt;/strong&gt; Grounding catches &lt;em&gt;incoherent&lt;/em&gt; lies — a value that breaks a constraint. If the adversary fabricates the ISBN &lt;em&gt;and&lt;/em&gt; recomputes a valid checksum, &lt;em&gt;and&lt;/em&gt; makes the arithmetic close, the invariant holds and the probe says TRUSTED. This raises the cost of the lie, which is the realistic goal. It does not make lying impossible.&lt;/p&gt;

&lt;p&gt;So treat it as what it is: a cheap filter that catches the common, lazy poisoning and forces an adversary to do real, coordinated work to get past it. That's a good trade for 30 lines. It is not trust by itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do Monday
&lt;/h2&gt;

&lt;p&gt;Three moves, smallest first:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Add every invariant your data already carries.&lt;/strong&gt; Checksummed identifiers (ISBN, IBAN, EAN, VAT numbers), cross-field arithmetic (&lt;code&gt;price * qty == total&lt;/code&gt;, percentages summing to 100), referential constraints (a foreign key that resolves). These are free — the constraint already exists in the domain; you just have to check it. Most pipelines never do.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Count distinct origins, not citations.&lt;/strong&gt; If you corroborate across sources, collapse each to its registrable domain before you count. "Three sources" that share a domain is one source. The public suffix list does this properly; the two-label hack in the demo is the starter version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Know which of your columns have no anchor, and flag them.&lt;/strong&gt; The honest output isn't only "this row is poisoned." It's also "this field has no invariant I can ground it against, so I can't vouch for it." A pipeline that knows what it can't verify is worth more than one that pretends everything is fine.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You don't need a fraud-detection platform for the first pass. You need to use the constraints your own data already ships with — most of us collect them and then never check them.&lt;/p&gt;




&lt;p&gt;One thing I haven't solved: how do you ground free text at all? Reviews, descriptions, comments — the highest-value, most-poisoned surface — have no checksum and often no independent anchor. The only handles I've found are weak: distribution drift across a corpus, stylometric oddities, timing clusters. None of them work on a single row, and all of them are gameable. If you've actually caught a fabricated &lt;em&gt;free-text&lt;/em&gt; record in production — not a malformed one, a fabricated one — I want to know what signal you used. I read every comment.&lt;/p&gt;

&lt;p&gt;Follow for the next numbers from the run log. And tell me: what's the most convincing fake data a source ever fed your scraper — the one that passed every check you had?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by Aleksei Spinov — I run production scrapers (2,190 runs across 32 actors; one Trustpilot scraper at 962). Proof: &lt;a href="https://blog.spinov.online" rel="noopener noreferrer"&gt;blog.spinov.online&lt;/a&gt; and my &lt;a href="https://apify.com/knotless_cadence" rel="noopener noreferrer"&gt;Apify profile&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AI disclosure: drafted with AI assistance, then edited, fact-checked, and the code run and verified by me. The dataset is synthetic and deterministic; the output above is real stdout from executing the script.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>dataengineering</category>
      <category>dataquality</category>
    </item>
    <item>
      <title>Your Scraper Passes Every Run. It's Still Rotting.</title>
      <dc:creator>Alex Spinov </dc:creator>
      <pubDate>Sun, 07 Jun 2026 17:44:25 +0000</pubDate>
      <link>https://dev.to/0012303/your-scraper-passes-every-run-its-still-rotting-1ikj</link>
      <guid>https://dev.to/0012303/your-scraper-passes-every-run-its-still-rotting-1ikj</guid>
      <description>&lt;p&gt;A scraper run finished green. Exit 0. Schema valid. Row count looked normal. So did the one before it, and the forty before that.&lt;/p&gt;

&lt;p&gt;Then one afternoon you glance at a number you don't usually look at — total rows this month vs the same source last month — and you're collecting noticeably less than you used to. No errors. No traceback. No alert fired, because nothing was ever wrong with any single run.&lt;/p&gt;

&lt;p&gt;That's the failure I want to talk about. Not a crash. A slow rot, measured across runs, that every single-run check on earth is blind to.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A scraper can pass &lt;em&gt;every&lt;/em&gt; per-run gate (exit 0, schema ok, count plausible) while its rolling yield slides down for weeks.&lt;/li&gt;
&lt;li&gt;Single-run checks can't see it. The signal only exists when you compare today's run to your &lt;em&gt;own&lt;/em&gt; past, not to a declared total.&lt;/li&gt;
&lt;li&gt;The obvious detector — "median of the last K runs" — is a boiling-frog trap. On a slow drift the baseline sinks &lt;em&gt;with&lt;/em&gt; the signal, so it never fires. I ran it. Zero warnings while yield dropped 25%.&lt;/li&gt;
&lt;li&gt;Fix: a &lt;strong&gt;lagged&lt;/strong&gt; baseline. Compare today against the median of runs from K..2K runs ago — your settled past, not the part already eaten by decay. ~20 lines over your run log. No Grafana, no SRE.&lt;/li&gt;
&lt;li&gt;Numbers below are a deterministic synthetic run log, not a claim about our real slide. What's real is the volume that makes such a curve observable: 2,190 production runs, one Trustpilot scraper alone at 962.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why I trust this surface exists
&lt;/h2&gt;

&lt;p&gt;I've shipped 32 scrapers and they've logged 2,190 runs in production. One of them, a Trustpilot review scraper, has run 962 times against the same source. That's the thing most scraping tutorials don't have: a &lt;em&gt;long line&lt;/em&gt; of runs hitting one place over real calendar time.&lt;/p&gt;

&lt;p&gt;When you have 962 runs of one source, "yield per run" stops being a single number and becomes a curve. And a curve has a shape. Most of the time the shape is flat and boring. Sometimes it tilts down so gently that no individual run ever looks off — and that's exactly the case nobody writes about, because you only see it if you have the history.&lt;/p&gt;

&lt;p&gt;To be honest about the limits up front: I don't have a clean, published figure for how far our &lt;em&gt;real&lt;/em&gt; yield slid on any specific site, so I'm not going to invent one. That number is n/d. What I can do is show you the mechanism on a deterministic run log you can execute yourself, and tell you the detector I'd reach for. The 2,190 / 962 is the part that's real — it's the volume that makes the curve visible at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is not the other failures
&lt;/h2&gt;

&lt;p&gt;If you've read the rest of this series, draw the boundary clearly, because the failures rhyme and the fixes don't.&lt;/p&gt;

&lt;p&gt;This isn't a bad status code — &lt;a href="https://blog.spinov.online/blog/http-200-is-a-lie-schema-canary/" rel="noopener noreferrer"&gt;HTTP 200 lying about a broken response shape is the schema canary&lt;/a&gt;. This isn't a &lt;a href="https://blog.spinov.online/blog/your-scraper-returned-a-clean-row-it-was-wrong/" rel="noopener noreferrer"&gt;wrong field value inside a row that's otherwise valid&lt;/a&gt;. It isn't &lt;a href="https://blog.spinov.online/blog/you-pay-for-the-bandwidth-that-returns-nothing/" rel="noopener noreferrer"&gt;bytes you paid for that came back empty&lt;/a&gt;, and it isn't &lt;a href="https://blog.spinov.online/blog/your-scraper-died-at-row-12000/" rel="noopener noreferrer"&gt;a crash you resume from&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Every run here is green. Exit 0, schema valid, row count plausible &lt;em&gt;for a single run&lt;/em&gt;. There is no declared total to check against — you don't know how many rows "should" come back. The decay only exists when you line today's run up against your own past. The rolling yield has been drifting down for weeks, and nothing ever threw.&lt;/p&gt;

&lt;p&gt;New axis: &lt;strong&gt;time&lt;/strong&gt;. Not the shape of one response, not a field, not bytes, not a crash. The trend of one source against its own history.&lt;/p&gt;

&lt;h2&gt;
  
  
  How a healthy run can be a rotting series
&lt;/h2&gt;

&lt;p&gt;Here's the mechanism, because it's quieter than it sounds.&lt;/p&gt;

&lt;p&gt;Say your scraper walks 80 pages of a source each run. When the source is healthy, each page hands back about 48 rows — call it a full page minus a little. So a run pulls roughly 3,800 rows. Plausible. Your per-run sanity gate says "fail if rows &amp;lt; 3,000," and it never trips.&lt;/p&gt;

&lt;p&gt;Now the source starts thinning. Maybe a soft rate-throttle kicks in and pages return fewer items. Maybe the result set genuinely shrinks. Maybe an A/B test on their side trims the page size. Whatever the cause, each run quietly returns a hair less than the last — say 0.9 fewer rows per page. One run that goes from 48 to 47.1 rows/page looks identical to the one before. Nobody blinks.&lt;/p&gt;

&lt;p&gt;Roll that forward. Twenty runs later you're at 30 rows/page. The run still walks 80 pages. Still exits 0. Still has a valid schema on every row. Still clears &lt;code&gt;rows &amp;lt; 3000&lt;/code&gt; — barely. Your per-run gate has no memory, so it can't tell that 2,400 rows used to be 3,800. The frog has been boiling the whole time and the thermometer only ever read "alive."&lt;/p&gt;

&lt;p&gt;That's the trap of single-run validation: every check answers "is this run OK &lt;em&gt;by itself&lt;/em&gt;?" None of them answers "is this run OK &lt;em&gt;compared to what this source used to give me&lt;/em&gt;?"&lt;/p&gt;

&lt;h2&gt;
  
  
  The detector: ~20 lines over your run log
&lt;/h2&gt;

&lt;p&gt;You don't need a metrics stack for this. You need three things: log the yield of every run, keep the log, and run a baseline check over it.&lt;/p&gt;

&lt;p&gt;Here's the whole probe. Pure stdlib, no network, no browser, no paid API — a deterministic synthetic run log stands in for the source so you can run it in seconds and get the exact output I did.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;

&lt;span class="c1"&gt;# --- knobs ---
&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;              &lt;span class="c1"&gt;# baseline window size (median of K runs)
&lt;/span&gt;&lt;span class="n"&gt;GAP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;            &lt;span class="c1"&gt;# how far back the window sits (LAGGED, not trailing)
&lt;/span&gt;&lt;span class="n"&gt;THRESHOLD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;   &lt;span class="c1"&gt;# WARN if today's yield is &amp;gt;15% below the lagged baseline
&lt;/span&gt;&lt;span class="n"&gt;MIN_HISTORY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;GAP&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;synth_run_log&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Deterministic log of one source over 60 runs. Every run is &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;green&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:
    exit 0, schema valid, plausible single-run row count. Yield slowly decays
    after run 40 (a soft throttle / thinning source) with small jitter.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;runs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;base_yield&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;48.0&lt;/span&gt;
    &lt;span class="n"&gt;jitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;run_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;decay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt; &lt;span class="nf"&gt;else &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;
        &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_yield&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;decay&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;jitter&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jitter&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="n"&gt;pages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
        &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema_ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yield_per_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;yield_decay_probe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GAP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;For each run, compare its yield to the median of a LAGGED window:
    runs [idx-k-gap : idx-gap]. The gap is what defeats the boiling-frog trap.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;first_warn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;MIN_HISTORY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verdict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BUILDING_BASELINE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;baseline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;drop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yield_per_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;gap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;gap&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
        &lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;median&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;drop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yield_per_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;baseline&lt;/span&gt;
        &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;baseline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;drop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verdict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DECAY_WARN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;drop&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;drop&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;first_warn&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;first_warn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;first_warn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The heart of it is one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yield_per_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;gap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;gap&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Today's yield isn't compared to the last 7 runs. It's compared to 7 runs that ended &lt;em&gt;14 to 8 runs ago&lt;/em&gt; — your settled past. I'll explain in a second why that gap is the whole game.&lt;/p&gt;

&lt;p&gt;Run it and you get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== YIELD DECAY PROBE ===
runs in log              : 60
every run exit 0         : True
every run schema ok      : True
baseline window (K)      : 7 runs, lagged by 7 (settled past)
warn threshold           : 15% below baseline
min single-run rows      : 2384
max single-run rows      : 3880
--------------------------------------------------------
FIRST DECAY WARN         : run 48
  baseline yield/page    : 47.8
  this run  yield/page   : 40.3
  drop vs baseline       : 15%
  this run exit code     : 0  (GREEN)
  this run rows          : 3224  (plausible)
--------------------------------------------------------
latest run (run 60)         : yield 29.8/page, rows 2384, exit 0
latest verdict           : DECAY_WARN (drop 25% vs baseline 40.2)
========================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read what that output is actually saying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;every run exit 0 = True&lt;/code&gt;, &lt;code&gt;every run schema ok = True&lt;/code&gt;. All 60 runs pass every single-run check. There is nothing here a status code or schema validator would catch.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;min single-run rows : 2384&lt;/code&gt;. Even the worst run pulled 2,384 rows. A &lt;code&gt;rows &amp;lt; 3000&lt;/code&gt; gate would have &lt;em&gt;passed&lt;/em&gt; the early decay and only barely caught the late stuff. A &lt;code&gt;rows &amp;lt; 2000&lt;/code&gt; gate never trips at all.&lt;/li&gt;
&lt;li&gt;The decay starts at run 41. The probe's &lt;strong&gt;first warning fires at run 48&lt;/strong&gt; — about halfway down, while the run is fully green: exit 0, 3,224 rows, perfectly plausible. You get the flag weeks before this becomes "we're collecting half of what we used to."&lt;/li&gt;
&lt;li&gt;By run 60, yield is down 25% and rows have slid to 2,384. Without the probe, that's still a green run. With it, you'd have known seven runs in.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The part that surprised me: the obvious detector is silent
&lt;/h2&gt;

&lt;p&gt;The first version I'd reach for, and the version most people write, uses a &lt;em&gt;trailing&lt;/em&gt; median. Baseline = median of the last K runs. No gap. It feels right.&lt;/p&gt;

&lt;p&gt;It's a boiling-frog trap, and I mean that literally. On a slow drift the baseline sinks at the same rate as the signal. Every step from one run to the next is within threshold, because the thing you're comparing against already moved down too. The detector congratulates itself the whole way down.&lt;/p&gt;

&lt;p&gt;I didn't take that on faith. I ran the same log through a trailing median (GAP=0):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAIVE trailing-median (GAP=0):
  warns fired         : 0
  first warn at run   : None
  latest run yield    : 29.8 (started at 48.0, now down 25%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zero warnings. Yield fell from 48.0 to 29.8 — a quarter gone — and the "obvious" detector never said a word. It would have caught a cliff. It is structurally blind to a slide.&lt;/p&gt;

&lt;p&gt;The fix is the lag. Compare today not against the recent past (which the decay has already infected) but against a &lt;em&gt;settled&lt;/em&gt; window further back — runs K..2K ago. That window remembers what healthy looked like. The drop is measured against memory, not against the slowly-poisoned present. Same probe, GAP=7, and it fires at run 48.&lt;/p&gt;

&lt;p&gt;If you take one thing from this post: &lt;strong&gt;a trend detector whose baseline includes recent data can't see a slow trend.&lt;/strong&gt; Make the baseline lag.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this does NOT catch (and where it cries wolf)
&lt;/h2&gt;

&lt;p&gt;I'm not going to oversell 20 lines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It needs runs.&lt;/strong&gt; With &lt;code&gt;K + GAP = 14&lt;/code&gt;, the probe says &lt;code&gt;BUILDING_BASELINE&lt;/code&gt; until you have enough history. Brand-new scraper, sparse schedule — no signal yet. This is a tool for sources you hit repeatedly, which is exactly where slow decay hides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A genuine cliff also trips it — correctly, but you'll want context.&lt;/strong&gt; If a source legitimately halves overnight (they really did remove half the listings), the probe fires. That's not a false positive, but it's not decay either; it's a step change. The probe tells you &lt;em&gt;something moved&lt;/em&gt;, not &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Seasonality and legitimate shrinkage will cry wolf.&lt;/strong&gt; A source that's genuinely quieter on weekends, or a category that's actually emptying out, will look like decay. The probe has no idea your source is supposed to shrink. You'll get warnings you have to read and dismiss. A single global threshold is blunt; per-source thresholds are better, and I haven't built the per-source version into these 20 lines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It assumes yield is comparable run to run.&lt;/strong&gt; If your page budget changes between runs, normalize on rows-per-page (as the probe does), not raw rows. If even the per-page meaning drifts, you need a smarter denominator than I've shown here.&lt;/p&gt;

&lt;p&gt;So: it's a smoke alarm, not a diagnosis. It earns its 20 lines by catching the one failure that every green log hides — and it will occasionally beep at burnt toast.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to actually do Monday
&lt;/h2&gt;

&lt;p&gt;Three changes, smallest first:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Log yield per run.&lt;/strong&gt; Not only the exit code. One number (rows, or rows-per-page if your budget varies) written to a durable run log. If you're not logging it, you can't see the curve, full stop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert on trend, not on an absolute floor.&lt;/strong&gt; A &lt;code&gt;rows &amp;lt; N&lt;/code&gt; gate is a tripwire at one height; decay walks under it. Compare each run to a &lt;em&gt;lagged&lt;/em&gt; baseline of its own source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make the baseline lag.&lt;/strong&gt; Trailing windows go blind to slow drift. Median of runs K..2K ago. That's the difference between "no decay detected" and "warn at run 48."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You don't need a metrics platform to start. You need the run log you probably already half-have, plus this probe reading it. Grafana is great once you've decided what to watch. This tells you &lt;em&gt;what to watch&lt;/em&gt; before you've stood anything up.&lt;/p&gt;




&lt;p&gt;One open question I haven't settled: across 962 runs of one source, how much of the yield wobble is the source genuinely changing vs. our own throttling/proxy behavior leaking into the curve? I can see the curve move; cleanly attributing each dip is harder than I'd like. If you've separated "the source changed" from "my client changed" in a long run history, I'd genuinely like to hear how — I read every comment.&lt;/p&gt;

&lt;p&gt;Follow for the next numbers from the run log. And tell me the slowest, sneakiest scraper decay you've watched happen — the one no alert ever caught.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by Aleksei Spinov — I run production scrapers (2,190 runs across 32 actors; one Trustpilot scraper at 962). Proof: &lt;a href="https://blog.spinov.online" rel="noopener noreferrer"&gt;blog.spinov.online&lt;/a&gt; and my &lt;a href="https://apify.com/knotless_cadence" rel="noopener noreferrer"&gt;Apify profile&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AI disclosure: drafted with AI assistance, then edited, fact-checked, and the code run and verified by me. The run log is synthetic and deterministic; the output above is real stdout from executing the script.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>dataengineering</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Your Scraper Collected 50 Rows. There Were 4,000.</title>
      <dc:creator>Alex Spinov </dc:creator>
      <pubDate>Sat, 06 Jun 2026 18:12:15 +0000</pubDate>
      <link>https://dev.to/0012303/your-scraper-collected-50-rows-there-were-4000-5bo4</link>
      <guid>https://dev.to/0012303/your-scraper-collected-50-rows-there-were-4000-5bo4</guid>
      <description>&lt;p&gt;A scraper can pass every check you wrote and still be wrong about the one thing you actually care about: how much it collected.&lt;/p&gt;

&lt;p&gt;No exception. No 500. No broken row. Exit code 0, logs green, every field valid. And the set on disk is a quarter of what the site actually has. I have run scrapers in production enough times to stop trusting a green run on its own, and this is the failure that taught me to count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A paginated source can serve fewer rows than it claims and never throw — page caps, hidden offset limits, infinite scroll that "ends" early.&lt;/li&gt;
&lt;li&gt;Your status check (200), schema check (valid row), and byte check (you got data) all pass. None of them counts records.&lt;/li&gt;
&lt;li&gt;The tell: declared total vs unique ids collected. Or, when there's no declared total, the page that quietly repeats an earlier page.&lt;/li&gt;
&lt;li&gt;Below is a 40-line probe you can run right now. On a source that caps at 1,500 of a declared 4,000, it returned &lt;code&gt;VERDICT: INCOMPLETE (missing 2500 rows)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;This is a &lt;em&gt;completeness&lt;/em&gt; check, not a correctness check. Different layer, different bug.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What actually goes wrong
&lt;/h2&gt;

&lt;p&gt;You write the loop everyone writes. Walk &lt;code&gt;?page=1&lt;/code&gt;, &lt;code&gt;?page=2&lt;/code&gt;, keep going until a page comes back empty. Stop. Save. Done.&lt;/p&gt;

&lt;p&gt;The source has other plans. It says it has 4,000 records — the count is right there in the envelope, or in a "Showing 4,000 results" line in the HTML. But it only ever hands out real data for the first 30 pages. Page 31 doesn't error. It doesn't return empty either. It returns page 1 again. Still HTTP 200. Still 50 valid rows. Your loop has no reason to stop, so it grinds on until its own page budget runs out, collects a pile of rows, and exits clean.&lt;/p&gt;

&lt;p&gt;You now have 5,000 rows in hand and feel great about it. Looks like plenty. The catch: only 1,500 are unique. The page cap fed you the same first page over and over, and those duplicates &lt;em&gt;hid&lt;/em&gt; the shortfall behind a big-looking row count. That is the exact shape of "50 rows passed every check while 4,000 existed" — the scraper saw a lot of rows and trusted the volume.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is a completeness check, not a correctness check
&lt;/h2&gt;

&lt;p&gt;Quick scope, because this lands next to three failures I've written about and it is none of them. A bad status code is the schema canary, where &lt;a href="https://blog.spinov.online/blog/http-200-is-a-lie-schema-canary/" rel="noopener noreferrer"&gt;HTTP 200 lies&lt;/a&gt; and the body is junk. A wrong field inside a valid row is &lt;a href="https://blog.spinov.online/blog/your-scraper-returned-a-clean-row-it-was-wrong/" rel="noopener noreferrer"&gt;a clean row that's still wrong&lt;/a&gt;, a different problem with its own fix. And &lt;a href="https://blog.spinov.online/blog/you-pay-for-the-bandwidth-that-returns-nothing/" rel="noopener noreferrer"&gt;bytes you paid for that returned nothing&lt;/a&gt; is a cost problem; this is a count problem. Here the run is green and every row is correct. What's wrong is the &lt;em&gt;number of rows&lt;/em&gt;: you collected fewer than exist, and nothing threw. This check lives between your scraper and the source's own claim about how many records there are. It is not about resume, crashes, ETags, 304s, or whether the data went stale. Just one question: did you get all of it.&lt;/p&gt;

&lt;p&gt;That distinction matters because the tools that catch the other three are blind here. A status check sees 200 and is happy. A schema check sees a valid row and is happy. A byte counter sees data flowing and is happy. None of them ever asks "is this &lt;em&gt;all&lt;/em&gt; of it." That question needs its own line of code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I keep meeting this
&lt;/h2&gt;

&lt;p&gt;Listing sources. Anything paginated where the platform decides how deep you're allowed to go. The scraper I've leaned on most for this — a Trustpilot review collector — has 962 production runs behind it, and reviews are paginated to the bone. "Showing N of M," page after page, with the platform free to stop serving real pages whenever it wants. That's the genre where the declared count and the collected count drift apart, and where a green run means almost nothing on its own.&lt;/p&gt;

&lt;p&gt;I want to be precise about what I'm claiming, because the cheap version of this post would inflate it. I am not going to tell you "page caps cost me X rows on site Y" — I don't keep a clean tally of how many runs hit a silent cap specifically, so I won't invent one. What I'll stand behind: across 2,190 production runs, the failure that scared me most wasn't the loud one. The loud ones page you. This one ships a confident, half-empty dataset into something downstream and waits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The probe
&lt;/h2&gt;

&lt;p&gt;Here's the whole thing. Pure stdlib, no network, no browser. The mock source lies the way real ones do, so you can watch the probe catch it before you wire it to your own fetch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;

&lt;span class="n"&gt;PAGE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="n"&gt;DECLARED_TOTAL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4000&lt;/span&gt;          &lt;span class="c1"&gt;# what the envelope claims exists
&lt;/span&gt;&lt;span class="n"&gt;HIDDEN_PAGE_CAP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;           &lt;span class="c1"&gt;# server silently refuses real data past this page
&lt;/span&gt;&lt;span class="n"&gt;PAGE_BUDGET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;              &lt;span class="c1"&gt;# every real scraper has a safety budget; so do we
# 30 pages * 50 = 1,500 reachable rows out of a declared 4,000
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;mock_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;One page, 1-based. The bug: any page past the cap serves page 1 again,
    still HTTP 200 with a valid envelope. No error, no empty page.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;served&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;HIDDEN_PAGE_CAP&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="c1"&gt;# &amp;lt;-- the silent cap
&lt;/span&gt;    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;served&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;PAGE_SIZE&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PAGE_SIZE&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DECLARED_TOTAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;page_fingerprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_naive&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Walk pages until one looks empty. It never looks empty here, so we
    stop on the page budget and exit clean -- like real code does.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;collected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first_fp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cap_at_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;PAGE_BUDGET&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mock_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="n"&gt;fp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;page_fingerprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;first_fp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fp&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;fp&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;first_fp&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;cap_at_page&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;cap_at_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;       &lt;span class="c1"&gt;# page K repeats page 1 -&amp;gt; cap is K-1
&lt;/span&gt;        &lt;span class="n"&gt;collected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;collected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first_fp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cap_at_page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two checks do the work, and they cover the two cases you actually meet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path A — you have a declared total.&lt;/strong&gt; Compare it to your &lt;em&gt;unique&lt;/em&gt; ids, not your raw count. Raw count is the thing the duplicates inflate; unique ids is the thing that tells the truth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path B — there is no declared total.&lt;/strong&gt; Plenty of sources don't give you one. Then the anchor is the fingerprint: the page that repeats an earlier page is exactly where the source quietly looped you. No &lt;code&gt;total&lt;/code&gt; needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;collected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;first_fp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cap_at_page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pages_walked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;scrape_naive&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;unique_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;collected&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;declared&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DECLARED_TOTAL&lt;/span&gt;
    &lt;span class="n"&gt;completeness&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;unique_ids&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;declared&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;declared&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=== COMPLETENESS PROBE ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;declared total (envelope) : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;declared&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows collected (raw)      : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unique ids collected      : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;unique_ids&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pages walked              : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pages_walked&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page-1 fingerprint        : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;first_fp&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cap_at_page&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cap_at_page&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; repeats page 1 -&amp;gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
              &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SILENT PAGE CAP at page &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cap_at_page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INCOMPLETE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;unique_ids&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;declared&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completeness ratio        : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;unique_ids&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;declared&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;completeness&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VERDICT                   : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (missing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;declared&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;unique_ids&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it. This is the captured output from my machine, Python 3.13.5, no edits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== COMPLETENESS PROBE ===
declared total (envelope) : 4000
rows collected (raw)      : 5000
unique ids collected      : 1500
pages walked              : 100
page-1 fingerprint        : 323c5cd0274b
page 31 repeats page 1 -&amp;gt; SILENT PAGE CAP at page 30
completeness ratio        : 1500/4000 = 0.375
VERDICT                   : INCOMPLETE (missing 2500 rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Read it line by line
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;rows collected (raw) : 5000&lt;/code&gt; is the trap. Five thousand rows feels like a win. It's the number a naive run brags about.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;unique ids collected : 1500&lt;/code&gt; is the truth. The page cap fed back page 1 from page 31 onward, so 3,500 of those 5,000 rows are duplicates. Strip them and you have 1,500.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;page 31 repeats page 1 -&amp;gt; SILENT PAGE CAP at page 30&lt;/code&gt; is the second detector earning its place. It found the cap &lt;em&gt;without&lt;/em&gt; trusting the declared total at all — useful for every source that won't tell you how many records it has.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;completeness ratio : 1500/4000 = 0.375&lt;/code&gt; is the headline. You collected 37.5% of what the source itself says exists. Three-eighths.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;VERDICT : INCOMPLETE (missing 2500 rows)&lt;/code&gt; is the one boolean you bolt onto your run today. Green exit code, INCOMPLETE verdict. Those two are allowed to disagree, and when they do, the verdict is right.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do with this on Monday
&lt;/h2&gt;

&lt;p&gt;Add the unique-id-vs-declared check to your pipeline and fail the run loud when the ratio drops below whatever floor you trust. I'd start strict — anything under 0.95 gets a human — and loosen it once you know a given source's normal drift.&lt;/p&gt;

&lt;p&gt;If the source gives no total, keep the fingerprint check. The page that repeats an earlier page is a free signal that the source stopped serving you real data. Cheap to compute, hard to fake.&lt;/p&gt;

&lt;p&gt;And stop reporting raw row count as success. Report unique ids against the declared total, or against your own previous high-water mark for that source. Raw count is the number that lies to you the most cheerfully.&lt;/p&gt;

&lt;p&gt;One thing I'm still unsure about, and I'll say so plainly: the fingerprint trick assumes the source repeats a &lt;em&gt;whole prior page&lt;/em&gt;. Some caps don't loop — they just return a final partial page and stop, or shuffle order so no two pages match exactly. I haven't found one clean detector that covers every flavor of silent cutoff. If you've hit a cap shape that slips past both the unique-id check and the page-repeat check, that's the case I most want to hear about.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by Alexey Spinov. I run production scrapers — 2,190 runs across 32 published actors, the Trustpilot collector alone at 962 — and I write up the failures that a green run hides. This post was drafted with AI assistance and edited, fact-checked, and run by me; the probe output above is captured from a real run on my machine, not generated.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for the next batch of numbers from real runs. And tell me in the comments: what's the worst silently-incomplete dataset you've shipped before you noticed? I read every one.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>dataengineering</category>
      <category>pagination</category>
    </item>
    <item>
      <title>Your Scraper Died at Row 12,000. The Rerun Pattern.</title>
      <dc:creator>Alex Spinov </dc:creator>
      <pubDate>Fri, 05 Jun 2026 18:18:31 +0000</pubDate>
      <link>https://dev.to/0012303/your-scraper-died-at-row-12000-the-rerun-pattern-3c6d</link>
      <guid>https://dev.to/0012303/your-scraper-died-at-row-12000-the-rerun-pattern-3c6d</guid>
      <description>&lt;p&gt;My scraper died at row 12,000 of 50,000, three hours in. The crash itself was cheap. A process gets OOM-killed, a quota trips, a machine reboots, it happens. The expensive part came next: I re-ran it. From zero. And paid, in time and in requests, for the 11,999 rows I already had sitting on disk.&lt;/p&gt;

&lt;p&gt;That second bill is the one nobody writes code for. This post is the code. It's about 40 lines of stdlib Python that let a crashed job pick up where it died, fetching only the missing rows and writing zero duplicates, plus the real captured output of a run that crashes and a rerun that finishes it cleanly.&lt;/p&gt;

&lt;p&gt;To be clear about scope: this is the run &lt;em&gt;after&lt;/em&gt; the crash — how to restart a long job so it finishes the work it lost without re-fetching what it already pulled and without writing a row twice. It is not retry/backoff inside a single request (that's &lt;a href="https://blog.spinov.online/blog/framework-isnt-what-breaks-your-scraper/" rel="noopener noreferrer"&gt;a different post of mine&lt;/a&gt;), not schema-drift detection (&lt;a href="https://blog.spinov.online/blog/http-200-is-a-lie-schema-canary/" rel="noopener noreferrer"&gt;the post where I said "a crash loses the run"&lt;/a&gt; — this is the part where you get the run back), not a budget kill-switch that stops a runaway, and explicitly not conditional-GET / ETag / "skip unchanged pages" — that's freshness, a separate question entirely. Just: your job died mid-way, the clock and the bill are still running, how do you resume cheap and clean.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A long scrape that dies at hour 3 of 4 didn't lose one request. It lost the whole run. Retry doesn't help here; resume does.&lt;/li&gt;
&lt;li&gt;The fix is three small things: a stable idempotency key per item, a checkpoint cursor written atomically to disk, and an upsert instead of a blind append.&lt;/li&gt;
&lt;li&gt;I ran a 5,000-row local job, killed it at row 3,000, and reran it. The rerun fetched only the missing &lt;strong&gt;2,000&lt;/strong&gt; and wrote &lt;strong&gt;zero&lt;/strong&gt; duplicates. Final output: 5,000 rows, 5,000 unique. Real captured output below.&lt;/li&gt;
&lt;li&gt;Across 2,190 production runs (962 on a single Trustpilot source), long jobs &lt;em&gt;do&lt;/em&gt; die mid-way. The cost that bites isn't the crash — it's paying to re-collect everything you already had.&lt;/li&gt;
&lt;li&gt;It's stdlib Python. No DB, no framework, no paid API. You can reproduce both runs in about five seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Retry is the wrong layer
&lt;/h2&gt;

&lt;p&gt;Retry fixes a request. It does nothing for a job. That's the whole confusion.&lt;/p&gt;

&lt;p&gt;When a scraper flakes, the reflex advice is "add retries with backoff." Good advice, for the layer it lives at. A 429, a connection reset, a slow socket: retry the &lt;em&gt;request&lt;/em&gt; a few times with jitter and most transient failures evaporate. I'm a believer; I wrote a whole post on closing resources and retrying inside a run.&lt;/p&gt;

&lt;p&gt;But think about what actually happened at row 12,000. The process is gone. The Python interpreter that held your in-memory list of results, your retry counter, your &lt;code&gt;for&lt;/code&gt; loop: all of it, evaporated. There is no request to retry, because there is no process left to retry it in. Retry operates &lt;em&gt;inside&lt;/em&gt; a run. The thing that died is the run.&lt;/p&gt;

&lt;p&gt;So the recovery layer isn't the request. It's the job. And the canonical "just add retries" advice quietly skips that level, because at the request layer everything looks handled.&lt;/p&gt;

&lt;p&gt;I bumped into this exact gap once and walked straight past it. In an earlier post about silent schema drift, I argued for making a data-shape check non-fatal, and the reason I gave was: &lt;em&gt;"a crash loses the run. If you blow up on record 12,000 of 50,000, you've thrown away the 11,999 good records you already pulled."&lt;/em&gt; True. But I used it only as an argument to not crash &lt;em&gt;this&lt;/em&gt; run. I never said what happens on the &lt;em&gt;next&lt;/em&gt; run after a crash you didn't prevent. This post is that next run.&lt;/p&gt;

&lt;h2&gt;
  
  
  What makes a rerun cheap?
&lt;/h2&gt;

&lt;p&gt;A rerun is cheap when it does only the work the first run didn't finish. To get there you need three things, and none of them is fancy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. A stable idempotency key, per item.&lt;/strong&gt; Not the row number. The row number is a lie the moment you skip something: skip 3,000 rows and item 3,001 is now "row 1" of the rerun. Key off something the &lt;em&gt;source&lt;/em&gt; gives you: an id, a URL, a SKU. Mine is &lt;code&gt;(source, item_id)&lt;/code&gt;. So that the second run can ask "do I already have this exact item?" and get a truthful yes/no regardless of order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. A checkpoint cursor, written atomically.&lt;/strong&gt; You want to flush progress to disk as you go, not hold it in memory where a crash takes it with you. The subtle part: writing the cursor itself can crash &lt;em&gt;mid-write&lt;/em&gt;, leaving you with a truncated, useless file. The fix is write-to-temp-then-rename: &lt;code&gt;os.replace()&lt;/code&gt; is atomic, so the cursor on disk is always either the old complete value or the new complete value, never a half-written one. So that even a crash &lt;em&gt;during a checkpoint&lt;/em&gt; can't corrupt your recovery state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. An upsert, not a blind append.&lt;/strong&gt; A naive scraper opens its output and appends every row it scrapes. Rerun it and you get every row twice. The pattern instead reads which keys are already written, and skips them. So that the corpus stays clean no matter how many times the job restarts.&lt;/p&gt;

&lt;p&gt;That's it. Stable key, atomic cursor, skip-what-you-have. The cleverness is in &lt;em&gt;not&lt;/em&gt; being clever.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern
&lt;/h2&gt;

&lt;p&gt;Pure stdlib. The "scrape" here is a deterministic generator of 5,000 items instead of a network call — on purpose, so you can run it yourself in seconds with no proxies, no keys, no target site. The mechanic being demonstrated (key + checkpoint + upsert + delta rerun) doesn't depend on the transport; swap &lt;code&gt;work_items()&lt;/code&gt; for your real fetch and the recovery logic is unchanged.&lt;/p&gt;

&lt;p&gt;The output file is line-per-record JSON. That choice matters: each row is durably appended on its own line, so a crash costs you at most one half-written final line — and the loader below skips exactly that.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;TOTAL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;
&lt;span class="n"&gt;CRASH_AT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;
&lt;span class="n"&gt;OUT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrape_output.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;CURSOR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cursor.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;work_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# The "scrape". Stable per-item id, NOT a row counter.
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TOTAL&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;demo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;idem_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;item_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="c1"&gt;# stable across reruns
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_done_keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Rebuild what's already written. The output FILE is the source of truth.
&lt;/span&gt;    &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;rec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;   &lt;span class="c1"&gt;# half-written final line from the crash — skip it
&lt;/span&gt;            &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;item_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_index&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Atomic: temp file + os.replace. A crash mid-write can't truncate it.
&lt;/span&gt;    &lt;span class="n"&gt;tmp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.tmp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;last_index&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fileno&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The loop ties them together. On a resume, it loads the done-keys first, then walks the same item stream and skips anything already on disk — the upsert — appending only the delta and checkpointing the cursor as it goes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resume&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_done_keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;resume&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;fetched_this_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;duplicates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;work_items&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;idem_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;               &lt;span class="c1"&gt;# already have it -&amp;gt; skip (upsert)
&lt;/span&gt;                &lt;span class="n"&gt;duplicates&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;            &lt;span class="c1"&gt;# a blind-append script re-writes here
&lt;/span&gt;                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;resume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;CRASH_AT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simulated crash at index &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()})&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;fetched_this_run&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CURSOR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the &lt;code&gt;duplicates += 1&lt;/code&gt; line. On the first run it never fires. On the rerun it fires once for every item already on disk — that counter &lt;em&gt;is&lt;/em&gt; the proof that a blind-append version of this script would have written those rows a second time, and this one didn't.&lt;/p&gt;

&lt;p&gt;The full runnable file (with the cursor reader, the summary print, and the &lt;code&gt;--resume&lt;/code&gt; flag) is at the bottom.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run it: crash, then resume
&lt;/h2&gt;

&lt;p&gt;Two commands. First run starts fresh and dies at index 3,000. Second run resumes. Here's the actual terminal, copy-pasted, not cleaned up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;########## RUN 1 (fresh, will crash) ##########
Traceback (most recent call last):
  File "resume_demo.py", line 126, in &amp;lt;module&amp;gt;
    run(resume=a.resume)
  File "resume_demo.py", line 99, in run
    raise RuntimeError(f"simulated crash at index {idx}")
RuntimeError: simulated crash at index 3000
exit code: 1

########## state on disk after crash ##########
rows in output:     3000
cursor: {"last_index": 2500}

########## RUN 2 (--resume) ##########
=== RUN 2 (--resume) summary ===
resumed from cursor index : 2500
items already on disk      : 3000
fetched this run           : 2000
duplicate writes avoided   : 3000
final rows in output       : 5000
exit code: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the independent check on the file afterward, because a summary that prints its own numbers is a summary you shouldn't trust:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;total lines in output: 5000
unique item_ids: 5000
duplicate lines (should be 0): 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read the second run's numbers. &lt;code&gt;fetched this run: 2000&lt;/code&gt;, not 5,000. The rerun touched only the rows the crash lost. &lt;code&gt;duplicate writes avoided: 3000&lt;/code&gt; means every row that was already on disk got skipped instead of re-written. &lt;code&gt;final rows: 5000&lt;/code&gt;, and &lt;code&gt;unique item_ids: 5000&lt;/code&gt; from the independent count, so the job is genuinely complete with nothing doubled. The crash cost me the last 2,000 rows and exactly nothing else.&lt;/p&gt;

&lt;p&gt;One honest wrinkle, because I'd rather point it out than have you spot it. The crash happened at index 3,000, but the cursor on disk said &lt;code&gt;2500&lt;/code&gt;. They disagree by 500. That's not a bug, it's the design. The cursor is checkpointed every 500 rows, but every &lt;em&gt;row&lt;/em&gt; is flushed to the output file the instant it's written. So the output file, not the cursor, is the real source of truth: &lt;code&gt;load_done_keys()&lt;/code&gt; rebuilds progress from the 3,000 rows actually on disk, and the cursor is just a cheap hint. If I trusted the cursor alone I'd have re-fetched 500 rows I already had. Trusting the durable output instead, I re-fetched zero. Pick the more durable record as your truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this costs you in production
&lt;/h2&gt;

&lt;p&gt;The reason I care about this isn't the demo. It's that long jobs really do die, and I have the run counter to say so.&lt;/p&gt;

&lt;p&gt;I run scrapers in production: 2,190 runs across 32 published actors, with one Trustpilot review scraper at 962 runs by itself (that's a lifetime run meter on my Apify profile, &lt;code&gt;knotless_cadence&lt;/code&gt;, as of mid-2026; not a controlled study, just a long-running counter). When you run one source nearly a thousand times, you stop asking &lt;em&gt;if&lt;/em&gt; a multi-hour job will get interrupted and start assuming it will. OOM on a big page batch, a proxy pool hiccup, a quota wall, a deploy that restarts the worker: any of them ends the process, and the process is the run.&lt;/p&gt;

&lt;p&gt;Here's the cost math, on the numbers from my opening. A 50,000-row job that dies at row 12,000. Rerun-from-zero re-pays for all 50,000, and 24% of that work (the 12,000 you'd already done) is pure waste. But flip the crash point. Most jobs die &lt;em&gt;late&lt;/em&gt;, not early, because the longer they run the more chances they have to hit something. A job that dies at row 40,000 of 50,000 and reruns from zero re-collects 40,000 rows you already had: you pay 80% of the bill a second time to recover the last 20%. Resume-the-delta pays for 10,000. That's the whole pitch: the later the crash, the more brutal the rerun-from-zero penalty, and the more a stable key plus a durable output saves you.&lt;/p&gt;

&lt;p&gt;You might reasonably say: 2,000 rows on one laptop isn't a distributed production crawl. Fair. It isn't. The mechanic is identical, but a single-machine flat file is the &lt;em&gt;simplest&lt;/em&gt; place it lives, not the only one — and the next section is exactly where this version stops being enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this breaks
&lt;/h2&gt;

&lt;p&gt;I'd rather hand you the failure modes than let you find them at row 12,000 of your own job.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The source gives you no stable id.&lt;/strong&gt; The whole pattern hangs off the key. If items have no id, URL, or natural unique field, you're stuck choosing a worse one: content hash (breaks the instant the content legitimately changes), position (breaks the instant you skip), or fuzzy match (breaks in ways you won't notice for weeks). This is the genuinely hard part, and I haven't solved it cleanly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One machine, one file.&lt;/strong&gt; Reading done-keys from a local file is fine for a single worker. Run the same job across a pool of workers and a flat file becomes a race: two workers can both read "not done," both fetch, both write. At that point the done-keys set has to live somewhere shared and atomic — Redis &lt;code&gt;SETNX&lt;/code&gt;, a unique constraint in Postgres, an &lt;code&gt;INSERT ... ON CONFLICT DO NOTHING&lt;/code&gt;. Same idea, different home for the key.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flat-file upsert is not a database.&lt;/strong&gt; &lt;code&gt;load_done_keys()&lt;/code&gt; reads the whole output to rebuild the set. At a few thousand rows that's instant. At tens of millions it's a startup cost you'll feel, and the right move is a real keyed store, where "have I seen this key" is an index lookup, not a file scan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It doesn't help if the &lt;em&gt;source&lt;/em&gt; died, not your job.&lt;/strong&gt; Resume assumes the data is still there to re-fetch. If the site is down, rate-limiting you to a crawl, or has removed the rows since your first pass, resuming cleanly still gets you an incomplete corpus. The pattern recovers &lt;em&gt;your&lt;/em&gt; failure, not the world's.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last one is the boundary I keep relearning. A clean resume is a promise about your bookkeeping, not about the source.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd change on Monday
&lt;/h2&gt;

&lt;p&gt;If you run anything that takes more than a few minutes, do these three before the next long job:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick the idempotency key before you write the scraper.&lt;/strong&gt; It's the one decision the whole recovery story depends on, and it's free to get right up front and expensive to retrofit. If the source has a stable id, use it. If it doesn't, that's a design problem to solve now, not at row 12,000.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make your output durable per-row and treat it as the source of truth.&lt;/strong&gt; Append line-per-record and flush. Then "what have I already done" is a question you answer from disk, not from a process that might not exist anymore. The cursor is a hint; the output is the truth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make the rerun the default way you finish a job, not the emergency.&lt;/strong&gt; A job you can stop and resume at any row is a job you can also run in cheap chunks, pause for a deploy, or split across a maintenance window. Resume isn't just crash insurance — it's what makes a long job something you can actually operate.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Open question I haven't solved cleanly, and I'd genuinely like your answer: what's your idempotency key when the source gives you &lt;em&gt;no&lt;/em&gt; stable id? Content hash, scroll position, fuzzy match on a few fields — every option I've tried has a failure mode that shows up weeks later in a way that's painful to debug. What do you actually use in prod?&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Full script (&lt;code&gt;resume_demo.py&lt;/code&gt;, stdlib only — run &lt;code&gt;python3 resume_demo.py&lt;/code&gt; then &lt;code&gt;python3 resume_demo.py --resume&lt;/code&gt;):&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;resume_demo.py — resume a crashed job WITHOUT re-fetching or double-writing.

The &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrape&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; is a deterministic local generator (no network, no browser, no paid
API) — the mechanic shown (idempotency key + atomic checkpoint + upsert +
delta-only rerun) does not depend on the transport, so anyone can reproduce this.

  python3 resume_demo.py            # run 1: crashes at CRASH_AT
  python3 resume_demo.py --resume   # run 2: finishes only the missing delta
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;

&lt;span class="n"&gt;TOTAL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;                     &lt;span class="c1"&gt;# items in the whole job
&lt;/span&gt;&lt;span class="n"&gt;CRASH_AT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;                  &lt;span class="c1"&gt;# run 1 dies right before processing this index
&lt;/span&gt;&lt;span class="n"&gt;CHECKPOINT_EVERY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;           &lt;span class="c1"&gt;# flush the cursor to disk this often
&lt;/span&gt;&lt;span class="n"&gt;OUT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrape_output.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;      &lt;span class="c1"&gt;# durable, line-per-record output
&lt;/span&gt;&lt;span class="n"&gt;CURSOR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cursor.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;           &lt;span class="c1"&gt;# last-known-good progress marker
&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;work_items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scrape&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Yields 5,000 records with a STABLE per-item id.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TOTAL&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;demo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;idem_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Idempotency key = (source, item_id). Stable across reruns, not position.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;item_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_done_keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Rebuild the set of keys already written. The output file is the truth.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;rec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;  &lt;span class="c1"&gt;# half-written final line from the crash — skip it
&lt;/span&gt;            &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;item_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cursor_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_index&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Atomic write: temp file + os.replace. Crash mid-write can&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t truncate it.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;tmp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cursor_path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.tmp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;last_index&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fileno&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cursor_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_cursor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cursor_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cursor_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cursor_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;The &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;expensive&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; step. In prod this is the network fetch you pay for.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resume&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_done_keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;resume&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;resume&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CURSOR&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;resumed_from&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_cursor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CURSOR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;resume&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="n"&gt;fetched_this_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;duplicates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;work_items&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;idem_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;           &lt;span class="c1"&gt;# idempotency: already have it, skip
&lt;/span&gt;                &lt;span class="n"&gt;duplicates&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;        &lt;span class="c1"&gt;# a blind-append script would re-write here
&lt;/span&gt;                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;resume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;CRASH_AT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simulated crash at index &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;rec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;process_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;fetched_this_run&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;CHECKPOINT_EVERY&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CURSOR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CURSOR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TOTAL&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;final_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;load_done_keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUT&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RUN 2 (--resume)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;resume&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RUN 1 (fresh)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=== &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; summary ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resumed from cursor index : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resumed_from&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items already on disk      : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;fetched_this_run&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fetched this run           : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fetched_this_run&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duplicate writes avoided   : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;duplicates&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final rows in output       : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;final_rows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--resume&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;store_true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;continue an existing output instead of starting fresh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resume&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resume&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Written by Aleksey Spinov. I write up the cost and failure math from real production scraping — 2,190 runs and counting. Follow for the next one, and tell me your idempotency key for a source with no stable id — I read every comment.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AI disclosure: drafted with AI assistance; the pattern, the script, and every number in this post were produced and verified by me. The Python here was run locally (stdlib, no third-party deps); the crash, the resume, and the independent file check shown are the real output, not a mock-up.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>dataengineering</category>
      <category>reliability</category>
    </item>
    <item>
      <title>A 30-Line Probe That Tells You If a Page Needs a Browser</title>
      <dc:creator>Alex Spinov </dc:creator>
      <pubDate>Fri, 05 Jun 2026 01:03:40 +0000</pubDate>
      <link>https://dev.to/0012303/a-30-line-probe-that-tells-you-if-a-page-needs-a-browser-1pj</link>
      <guid>https://dev.to/0012303/a-30-line-probe-that-tells-you-if-a-page-needs-a-browser-1pj</guid>
      <description>&lt;p&gt;Half the "you don't need a browser" takes on my feed this week are right. None of them tell you how to check. They tell you headless Chrome is expensive — true — and then leave you exactly where you started: guessing, per target, whether you can skip it.&lt;/p&gt;

&lt;p&gt;You don't have to guess. Whether a page needs a browser is a question you can answer from the raw HTTP response, before you launch anything. Here's a 30-line probe that does it, and the real output from running it on ten named public URLs.&lt;/p&gt;

&lt;p&gt;To be clear about scope: this is the decision you make &lt;em&gt;before&lt;/em&gt; you start a job — &lt;em&gt;should I launch Chrome at all for this URL&lt;/em&gt; — not how to survive headless once it's running, not how much the raw HTML costs you in LLM tokens, not the proxy bandwidth bill. Just: browser, or no browser, on this target.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A page needs a browser only if the data you want isn't in the raw HTTP HTML. You can test that with &lt;code&gt;urllib&lt;/code&gt;, no headless required.&lt;/li&gt;
&lt;li&gt;The probe reads three cheap signals from the raw response — visible-text size, an embedded JSON/hydration blob, and whether your target text literally appears — and votes &lt;code&gt;NO_BROWSER&lt;/code&gt; / &lt;code&gt;JS_REQUIRED&lt;/code&gt; / &lt;code&gt;MAYBE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;I ran it on 10 named public URLs. &lt;strong&gt;6 of 10 returned their data without a browser.&lt;/strong&gt; Two genuinely needed JS, two were borderline (&lt;code&gt;MAYBE&lt;/code&gt; — the probe says so on purpose).&lt;/li&gt;
&lt;li&gt;It's a heuristic. It will be wrong on scroll-loaded content, data behind auth, and anti-bot walls — the post is honest about exactly where.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why "launch Chrome just in case" is a tax, not caution
&lt;/h2&gt;

&lt;p&gt;The default a lot of scrapers reach for is browser-by-default: put Playwright or headless Chrome in front of every target because "it's more reliable." It feels safe. It is not free.&lt;/p&gt;

&lt;p&gt;I run scrapers in production — 2,190 runs across 32 published actors, the Trustpilot one alone at 962 runs. Here's the part that doesn't show up in any tutorial: a headless instance costs memory, CPU, and cold-start time &lt;em&gt;per run&lt;/em&gt;. Multiply that by hundreds of runs and the "just in case" browser is a standing line item — paid on every page, including the pages that would have handed you the data over plain HTTP in 80 milliseconds.&lt;/p&gt;

&lt;p&gt;So the default is backwards. It should be HTTP-first, browser-on-fallback. And the thing that decides which path a URL takes shouldn't be a vibe. It should be a measurement.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually tells you a page needs a browser
&lt;/h2&gt;

&lt;p&gt;A page needs a browser when the data you want exists only &lt;em&gt;after&lt;/em&gt; JavaScript runs. That's it. So the probe asks the inverse question of the raw HTTP HTML: is the data already here?&lt;/p&gt;

&lt;p&gt;Three cheap signals, all readable from the bytes &lt;code&gt;urllib&lt;/code&gt; gives you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Visible-text size.&lt;/strong&gt; Strip &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;style&amp;gt;&lt;/code&gt;, strip tags, measure what's left. A real article leaves tens of kilobytes of text. An empty SPA shell leaves almost nothing — the body is a &lt;code&gt;&amp;lt;div id="root"&amp;gt;&lt;/code&gt; and a bundle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An embedded data blob.&lt;/strong&gt; Lots of "JS-heavy" sites actually ship their data inside the first HTML response as JSON: &lt;code&gt;__NEXT_DATA__&lt;/code&gt;, &lt;code&gt;__NUXT__&lt;/code&gt;, &lt;code&gt;window.__INITIAL_STATE__&lt;/code&gt;, or a &lt;code&gt;&amp;lt;script type="application/ld+json"&amp;gt;&lt;/code&gt;. If that blob is there, you don't need a browser — you need a JSON parser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The needle.&lt;/strong&gt; If you know the exact text you're after (a price, a review snippet, a name), the cleanest test is: does that string appear in the raw HTML at all? Present → no browser. Absent → the browser is rendering it in.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The probe
&lt;/h2&gt;

&lt;p&gt;Pure stdlib. No &lt;code&gt;requests&lt;/code&gt;, no Selenium, no Playwright — the whole point is to decide &lt;em&gt;before&lt;/em&gt; a browser exists. If you can run &lt;code&gt;python3&lt;/code&gt;, you can run this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gzip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.request&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urlopen&lt;/span&gt;

&lt;span class="n"&gt;UA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 needs_a_browser/1.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;HYDRATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__NEXT_DATA__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__NUXT__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__INITIAL_STATE__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__APOLLO_STATE__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;window.__data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/ld+json&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept-Encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gzip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gzip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gzip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decompress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;visible_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?is)&amp;lt;(script|style)\b.*?&amp;lt;/\1&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# drop scripts
&lt;/span&gt;    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?s)&amp;lt;[^&amp;gt;]+&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                        &lt;span class="c1"&gt;# drop tags
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\s+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;probe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;needle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;visible_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;has_blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;HYDRATION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;needle_hit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;needle&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;needle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;needle&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NO_BROWSER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;needle_hit&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JS_REQUIRED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;text_len&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;has_blob&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JS_REQUIRED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;            &lt;span class="c1"&gt;# empty shell, nothing to parse
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;text_len&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;has_blob&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NO_BROWSER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;             &lt;span class="c1"&gt;# data already in the raw HTML
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MAYBE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;                  &lt;span class="c1"&gt;# borderline — the probe says so
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text_len&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;B blob=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;has_blob&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; needle=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;needle_hit&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the logic. The &lt;code&gt;__main__&lt;/code&gt; block just loops over the URLs you pass, tallies the verdicts, and prints &lt;code&gt;X of N pages returned data without a browser&lt;/code&gt;. Full file at the end.&lt;/p&gt;

&lt;p&gt;The thresholds (&lt;code&gt;500&lt;/code&gt;, &lt;code&gt;2000&lt;/code&gt;) are deliberately blunt. They're not a model fitted to anything — they're "is there clearly nothing here" and "is there clearly a lot here," with an honest gap in the middle called &lt;code&gt;MAYBE&lt;/code&gt;. You can tune them. The point isn't the constants, it's that the question is answerable from bytes you already have.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real run
&lt;/h2&gt;

&lt;p&gt;I pointed it at ten public URLs. Mix of static content, a forum, a couple of deliberate JS controls, and two versions of the same site so you can see the probe flip. Here's the actual output, copy-pasted, not cleaned up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NO_BROWSER   https://en.wikipedia.org/wiki/Web_scraping
             text=30581B blob=True needle=False
NO_BROWSER   https://news.ycombinator.com/
             text=4050B blob=False needle=False
NO_BROWSER   https://old.reddit.com/r/webscraping/
             text=7185B blob=False needle=False
JS_REQUIRED  https://www.reddit.com/r/webscraping/
             text=37B blob=False needle=False
MAYBE        https://quotes.toscrape.com/
             text=1745B blob=False needle=False
JS_REQUIRED  https://quotes.toscrape.com/js/
             text=98B blob=False needle=False
NO_BROWSER   https://www.python.org/
             text=7064B blob=True needle=False
NO_BROWSER   https://github.com/scrapy/scrapy
             text=5204B blob=True needle=False
MAYBE        https://books.toscrape.com/
             text=1883B blob=False needle=False
NO_BROWSER   https://httpbin.org/html
             text=3596B blob=False needle=False

6 of 10 pages returned data without a browser  ::  {'NO_BROWSER': 6, 'JS_REQUIRED': 2, 'MAYBE': 2}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six of ten. Most of them didn't need a browser at all. (Re-run it and a live page like the HN front page will report a slightly different byte count — the story list changes — but the verdict holds.)&lt;/p&gt;

&lt;p&gt;The two that did are the interesting ones, and they're the same site twice. &lt;code&gt;www.reddit.com/r/webscraping/&lt;/code&gt; came back with &lt;strong&gt;37 bytes&lt;/strong&gt; of visible text — a shell. &lt;code&gt;old.reddit.com/r/webscraping/&lt;/code&gt; came back with &lt;strong&gt;7,185 bytes&lt;/strong&gt; of real post titles. Same content, same subreddit; the new front-end renders client-side, the old one ships HTML. If your target is new Reddit, you need a browser or the JSON API. If it's old Reddit, you'd be launching Chrome to read text that was already sitting in the response. That single row is the whole argument for measuring instead of guessing.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;quotes.toscrape.com&lt;/code&gt; pair is the cleanest controlled version of the same thing: &lt;code&gt;/&lt;/code&gt; ships the quotes as HTML, &lt;code&gt;/js/&lt;/code&gt; builds them in the browser. I'll come back to why &lt;code&gt;/&lt;/code&gt; showed up as &lt;code&gt;MAYBE&lt;/code&gt; here and not &lt;code&gt;NO_BROWSER&lt;/code&gt; — it's the honest edge of this thing.&lt;/p&gt;

&lt;p&gt;And Wikipedia, python.org, the Scrapy GitHub page — all &lt;code&gt;blob=True&lt;/code&gt;. They look JavaScript-heavy in a browser, but the data is right there in the first response as JSON-LD or &lt;code&gt;__NEXT_DATA__&lt;/code&gt;. Launching a browser for those is pure overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  When you know what you're looking for, ask directly
&lt;/h2&gt;

&lt;p&gt;The structural signals (text size, blob) are a guess about whether &lt;em&gt;any&lt;/em&gt; useful data is present. If you know the &lt;em&gt;specific&lt;/em&gt; thing you want, skip the guess. Run it with &lt;code&gt;--needle&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 needs_a_browser.py &lt;span class="s2"&gt;"https://quotes.toscrape.com/"&lt;/span&gt; &lt;span class="nt"&gt;--needle&lt;/span&gt; &lt;span class="s2"&gt;"Einstein"&lt;/span&gt;
&lt;span class="go"&gt;NO_BROWSER   https://quotes.toscrape.com/
             text=1745B blob=False needle=True

&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python3 needs_a_browser.py &lt;span class="s2"&gt;"https://quotes.toscrape.com/js/"&lt;/span&gt; &lt;span class="nt"&gt;--needle&lt;/span&gt; &lt;span class="s2"&gt;"Einstein"&lt;/span&gt;
&lt;span class="go"&gt;JS_REQUIRED  https://quotes.toscrape.com/js/
             text=98B blob=False needle=False
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same site, opposite verdicts, and now there's no ambiguity: the word "Einstein" is literally in the raw HTML of &lt;code&gt;/&lt;/code&gt;, and literally absent from &lt;code&gt;/js/&lt;/code&gt; until JavaScript runs. Notice &lt;code&gt;/&lt;/code&gt; was a &lt;code&gt;MAYBE&lt;/code&gt; on structure alone (1,745 bytes — right in the borderline band) but a confident &lt;code&gt;NO_BROWSER&lt;/code&gt; once I asked about the actual data. That's the lesson: a needle beats a heuristic. When you can name the field you're scraping, test for the field.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this probe is wrong
&lt;/h2&gt;

&lt;p&gt;It's a heuristic. I'd rather tell you its failure modes than let you find them in production.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scroll / lazy-loaded content.&lt;/strong&gt; A page can ship a fat, healthy HTML head and still load the rows you want on scroll via XHR. The probe sees a big page, votes &lt;code&gt;NO_BROWSER&lt;/code&gt;, and misses that &lt;em&gt;your specific rows&lt;/em&gt; arrive later. The needle catches this; the structural signals don't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data behind auth or interaction.&lt;/strong&gt; If the content only appears after a login or a click, an unauthenticated GET can't see it. The probe will read the logged-out shell.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-bot walls.&lt;/strong&gt; Some targets don't even let a plain &lt;code&gt;urllib&lt;/code&gt; request finish. When I pointed this same probe at a Trustpilot review page from a datacenter IP, it didn't return &lt;code&gt;NO_BROWSER&lt;/code&gt; or &lt;code&gt;JS_REQUIRED&lt;/code&gt; — it threw an &lt;code&gt;ssl handshake timed out&lt;/code&gt;, twice, repeatably. The connection got cut at the TLS layer before any HTML came back. That's not a failure of the probe; it's the probe telling you something true. This target won't talk to a bare HTTP client. You're going to a real client (browser and/or residential proxy) regardless of what the HTML would have said — which, for a scraper I've run 962 times in production, is a useful thing to learn in one second instead of one debugging session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The &lt;code&gt;MAYBE&lt;/code&gt; band is real.&lt;/strong&gt; A page with ~1.5 KB of text and no blob is genuinely ambiguous from bytes alone. The probe doesn't fake confidence there. Treat &lt;code&gt;MAYBE&lt;/code&gt; as "fetch one sample with a browser, look, then decide for the batch" — not as a verdict.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last one is the design choice I care about most. A probe that always answers yes or no is lying part of the time. This one tells you when it doesn't know.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd change on Monday
&lt;/h2&gt;

&lt;p&gt;Flip the default. Don't reach for the browser first; reach for the probe first.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Probe the target before you write the scraper.&lt;/strong&gt; One run tells you which transport you're building for. It's the cheapest decision in the whole job, and you make it before you've written a line of extraction code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefer the needle to the structural guess.&lt;/strong&gt; If you know the price, the review text, the SKU you're after, test for &lt;em&gt;that&lt;/em&gt;. The structural signals are a fallback for when you don't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route &lt;code&gt;MAYBE&lt;/code&gt; and &lt;code&gt;ERROR&lt;/code&gt; to a human-eyeballed sample, not to a blanket "use Chrome."&lt;/strong&gt; Launching a browser on every ambiguous URL is just the browser-by-default tax wearing a disguise.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I'll be straight about the limit one more time: 6-of-10 is the result on &lt;em&gt;these ten URLs&lt;/em&gt;, not a law about the web. Point the probe at your own targets and you'll get your own number — that's the entire idea. The value isn't my six. It's that you can compute yours in the time it takes to read this paragraph, instead of paying for a Chrome instance on every page that never needed one.&lt;/p&gt;

&lt;p&gt;Here's the open question I haven't solved cleanly: the scroll/lazy-load case. The needle catches it &lt;em&gt;if&lt;/em&gt; I know a value that's only on a later page of results — but for an open-ended crawl where I don't yet know what's there, structural signals can't distinguish "all the data is here" from "the first screen is here and the rest is one XHR away." If you've found a cheap, no-browser way to detect lazy-loading from the raw response, I'd genuinely like to see it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Full script (&lt;code&gt;needs_a_browser.py&lt;/code&gt;, stdlib only):&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;needs_a_browser.py — decide if a page needs a browser BEFORE you launch one.
Usage: python3 needs_a_browser.py URL [URL ...] --needle &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gzip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.request&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urlopen&lt;/span&gt;

&lt;span class="n"&gt;UA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 needs_a_browser/1.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;HYDRATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__NEXT_DATA__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__NUXT__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__INITIAL_STATE__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__APOLLO_STATE__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;window.__data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/ld+json&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept-Encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gzip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gzip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gzip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decompress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;visible_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?is)&amp;lt;(script|style)\b.*?&amp;lt;/\1&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?s)&amp;lt;[^&amp;gt;]+&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\s+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;probe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;needle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;visible_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;has_blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;HYDRATION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;needle_hit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;needle&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;needle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;needle&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NO_BROWSER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;needle_hit&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JS_REQUIRED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;text_len&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;has_blob&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JS_REQUIRED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;text_len&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;has_blob&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NO_BROWSER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;verdict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MAYBE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text_len&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;B blob=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;has_blob&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; needle=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;needle_hit&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;urls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--needle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;tally&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;why&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;probe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;needle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;why&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ERROR&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;tally&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tally&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;             &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;why&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tally&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NO_BROWSER&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; pages returned data without a browser  ::  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tally&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Written by Aleksey Spinov. I write up the cost and failure math from real production scraping — 2,190 runs and counting. Follow for the next one, and if you've got a clean way to detect lazy-loaded data without a browser, drop it in the comments — I read every one.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AI disclosure: drafted with AI assistance; the probe, the URL list, and every verdict in this post were produced and verified by me. The Python here was run locally (stdlib, no third-party deps); the output shown is the real run, not a mock-up.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>performance</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>You Pay for the Bandwidth That Returns Nothing</title>
      <dc:creator>Alex Spinov </dc:creator>
      <pubDate>Thu, 04 Jun 2026 02:05:22 +0000</pubDate>
      <link>https://dev.to/0012303/you-pay-for-the-bandwidth-that-returns-nothing-2akf</link>
      <guid>https://dev.to/0012303/you-pay-for-the-bandwidth-that-returns-nothing-2akf</guid>
      <description>&lt;p&gt;A proxy invoice that says &lt;code&gt;24.79 GB · $198.28&lt;/code&gt; reads like you bought 24.79 GB of data. You didn't. You bought 24.79 GB of &lt;em&gt;traffic&lt;/em&gt;. Some of it came back with rows. Some came back with a block page, a 404, a CAPTCHA challenge, or a retry of a page that already failed. The meter doesn't care which. It counts the bytes that left the proxy, and it bills all of them at the same rate.&lt;/p&gt;

&lt;p&gt;That gap, between bytes you paid for and rows you got back, is where money quietly leaves a healthy run. Not a runaway loop. Not an outage. A run that finished, looked fine in the dashboard, and still spent a third to a half of its bandwidth on responses that returned nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-GB billing charges for failed requests, retries, and asset loads — not just rows. ("You pay for bandwidth consumed, whether requests succeed or fail." — Titan Network, 13 Apr 2026.)&lt;/li&gt;
&lt;li&gt;In a model of a 100k-row job on a protected target, a low-success datacenter config spent &lt;strong&gt;53% of its bytes returning zero rows&lt;/strong&gt;; a high-success residential config spent &lt;strong&gt;3%&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$/GB&lt;/code&gt; is not cost per row. The cheaper-per-GB pool was cheaper per row &lt;em&gt;here&lt;/em&gt; — but the winner &lt;strong&gt;flips&lt;/strong&gt; once success drops below ~9%.&lt;/li&gt;
&lt;li&gt;I don't have a dollar billing ledger. The numbers below are a model on published proxy prices. Run it with your own success rate and price.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What the meter actually counts
&lt;/h2&gt;

&lt;p&gt;I run scrapers in production — 2,190 runs across 32 published actors, the Trustpilot one alone at 962 runs. That's the part I can say with a straight face: I've watched a lot of real traffic. What I &lt;em&gt;don't&lt;/em&gt; have is a per-run dollar ledger that itemizes every gigabyte. So I'm not going to paste an invoice I don't hold and call it data.&lt;/p&gt;

&lt;p&gt;Here's what I can say from watching those logs. The bytes that return nothing aren't tail noise. They're a structural line item. Three things feed it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failed responses.&lt;/strong&gt; A request that gets a 403, a challenge page, or an empty card still pulled bytes over the wire. Usually smaller than a real page. A block page isn't heavy. But it isn't free either, and at scale there are a lot of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retries.&lt;/strong&gt; Every failed request you re-attempt spends bandwidth again, and the retry often fails again. This is the multiplier most people forget. Titan Network put a number on it: moving success rate from 60% to 95% cuts your total request count by about 63%, because you stop re-issuing the misses ("Web Scraping Cost at Scale," Titan Network, 13 Apr 2026).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Asset and redirect tax.&lt;/strong&gt; A browser-driven load on a "healthy" page pulls more than the HTML — assets, redirects, sometimes a login bounce. Even your &lt;em&gt;successful&lt;/em&gt; traffic carries weight that never becomes a row.&lt;/p&gt;

&lt;p&gt;None of that shows up as a problem. The run succeeds. The dashboard is green. The bill is just… higher than the rows would suggest.&lt;/p&gt;

&lt;h2&gt;
  
  
  A model, not a bill
&lt;/h2&gt;

&lt;p&gt;So I wrote the smallest thing that makes the gap visible. It's stdlib Python, no network, no keys. It takes a job (how many rows you want), a success rate, average response sizes, a retry policy, and a &lt;code&gt;$/GB&lt;/code&gt; price — and it tells you what you actually pay &lt;em&gt;per collected row&lt;/em&gt;, versus the naive number you'd get if only the row-returning bytes were billed.&lt;/p&gt;

&lt;p&gt;The dollar prices are placeholders. I marked them as illustrative in the code and I'll mark them again here: &lt;strong&gt;$8/GB&lt;/strong&gt; is Titan Network's stated average for residential; &lt;strong&gt;$1.20/GB&lt;/strong&gt; stands in for a cheap datacenter-style pool. Residential in 2026 runs roughly &lt;strong&gt;$2–$15/GB&lt;/strong&gt;, with $8 landing in the mid-to-premium band (triangulated across Proxyway's 2026 tests, aimultiple's pricing comparison, and Titan's own figures). Swap in yours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RunConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;target_rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;       &lt;span class="c1"&gt;# rows you actually want
&lt;/span&gt;    &lt;span class="n"&gt;success_rate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;    &lt;span class="c1"&gt;# fraction of requests that return a usable row
&lt;/span&gt;    &lt;span class="n"&gt;row_resp_kb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;     &lt;span class="c1"&gt;# avg KB of a request that returned a row
&lt;/span&gt;    &lt;span class="n"&gt;fail_resp_kb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;    &lt;span class="c1"&gt;# avg KB of a request that returned no row
&lt;/span&gt;    &lt;span class="n"&gt;asset_overhead&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;  &lt;span class="c1"&gt;# extra byte fraction from assets/redirects
&lt;/span&gt;    &lt;span class="n"&gt;retries_per_fail&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;price_per_gb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;    &lt;span class="c1"&gt;# ILLUSTRATIVE — set yours
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;requests_for_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_rows&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success_rate&lt;/span&gt;
    &lt;span class="n"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests_for_rows&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_rows&lt;/span&gt;
    &lt;span class="n"&gt;retries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retries_per_fail&lt;/span&gt;
    &lt;span class="n"&gt;KB_PER_GB&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;

    &lt;span class="n"&gt;row_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_rows&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;row_resp_kb&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;asset_overhead&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fail_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fail_resp_kb&lt;/span&gt;

    &lt;span class="n"&gt;total_gb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row_bytes&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fail_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;KB_PER_GB&lt;/span&gt;
    &lt;span class="n"&gt;returned_gb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row_bytes&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;KB_PER_GB&lt;/span&gt;
    &lt;span class="n"&gt;total_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total_gb&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price_per_gb&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wasted_share&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_gb&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;returned_gb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;paid_for_per_returned_gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total_gb&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;returned_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total_cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effective_cost_per_row&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total_cost&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two configs, same job: collect 100,000 rows from a protected target. One cheap datacenter pool that gets blocked a lot. One pricey residential pool that gets through.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cheap_dc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RunConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datacenter (cheap/GB)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;180&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pricey_res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RunConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;residential (pricey/GB)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;180&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8.00&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--- datacenter pool (cheap per GB) ---
  success rate           : 35%
  price (illustrative)   : $1.20/GB
  bandwidth billed       : 50.60 GB
  ... returned rows      : 24.03 GB
  ... returned NOTHING   : 26.57 GB  (53% of the bill)
  paid-for per 1GB data  : 2.11x
  total cost             : $60.72
  naive  cost/row        : $0.288 per 1,000 rows
  EFFECTIVE cost/row     : $0.607 per 1,000 rows

--- residential pool (pricey per GB) ---
  success rate           : 95%
  price (illustrative)   : $8.00/GB
  bandwidth billed       : 24.79 GB
  ... returned rows      : 24.03 GB
  ... returned NOTHING   : 0.75 GB  (3% of the bill)
  paid-for per 1GB data  : 1.03x
  total cost             : $198.28
  naive  cost/row        : $1.923 per 1,000 rows
  EFFECTIVE cost/row     : $1.983 per 1,000 rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look at the datacenter run. To collect 24 GB of rows it billed &lt;strong&gt;50.6 GB&lt;/strong&gt;, so it paid for &lt;strong&gt;2.11×&lt;/strong&gt; the data it kept. More than half the invoice, &lt;strong&gt;53%&lt;/strong&gt;, returned nothing. The residential run paid for 1.03×: almost everything it bought, it kept.&lt;/p&gt;

&lt;p&gt;That's the whole point in two numbers. Same job, same row sizes. One config converts bandwidth into rows; the other converts about half of it into block pages and retries you still pay for.&lt;/p&gt;

&lt;h2&gt;
  
  
  So the cheap proxy is the trap, right?
&lt;/h2&gt;

&lt;p&gt;No. And this is where I almost wrote the wrong article.&lt;/p&gt;

&lt;p&gt;My first instinct was the clean contrarian line: &lt;em&gt;cheap-per-GB is actually more expensive per row.&lt;/em&gt; But the model wouldn't cooperate. At these numbers the cheap datacenter pool costs &lt;strong&gt;$0.607 per 1,000 rows&lt;/strong&gt; and the pricey residential costs &lt;strong&gt;$1.983&lt;/strong&gt; — the datacenter is &lt;em&gt;31% the per-row cost&lt;/em&gt;. The 6.7× price gap ($1.20 vs $8.00) is just bigger than its waste penalty. The cheap pool wins here, even bleeding 53% of its bytes.&lt;/p&gt;

&lt;p&gt;So the honest claim isn't "cheap is a trap." It's narrower and more useful: &lt;strong&gt;&lt;code&gt;$/GB&lt;/code&gt; and cost-per-row are different numbers, and which proxy is cheaper depends on how hard the target fights back.&lt;/strong&gt; The waste fraction is a &lt;em&gt;lever on price&lt;/em&gt;, not a verdict.&lt;/p&gt;

&lt;p&gt;To find where it flips, I held residential at 95% and dropped the datacenter success rate — the way a target gets harder when it tightens its anti-bot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flip point — datacenter success rate falling on a harder target:
  dc success   35% : 53% of bytes return nothing, $0.607/1k rows -&amp;gt; cheaper: datacenter
  dc success   20% : 70% of bytes return nothing, $0.975/1k rows -&amp;gt; cheaper: datacenter
  dc success   12% : 81% of bytes return nothing, $1.547/1k rows -&amp;gt; cheaper: datacenter
  dc success    9% : 86% of bytes return nothing, $2.024/1k rows -&amp;gt; cheaper: RESIDENTIAL  &amp;lt;-- flip
  dc success    8% : 87% of bytes return nothing, $2.262/1k rows -&amp;gt; cheaper: RESIDENTIAL
  dc success    5% : 92% of bytes return nothing, $3.550/1k rows -&amp;gt; cheaper: RESIDENTIAL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's the flip, around &lt;strong&gt;9% success&lt;/strong&gt;. Below it, the cheap pool is wasting so much bandwidth (86% of bytes returning nothing) that even at one-sixth the price it loses on a per-row basis. Above it, cheap wins.&lt;/p&gt;

&lt;p&gt;So "the expensive proxy is cheaper" is a &lt;em&gt;regime&lt;/em&gt;, not a law. It's true on the targets that beat your cheap pool into the single digits. It's false on the targets your cheap pool handles fine. The only way to know which target you're on is to measure your own success rate and put it in the model — not to pick a proxy by its sticker price per GB.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd change on Monday
&lt;/h2&gt;

&lt;p&gt;Stop pricing proxies by &lt;code&gt;$/GB&lt;/code&gt; in isolation. That number is the cost of the &lt;em&gt;traffic&lt;/em&gt;, and you don't want traffic. You want rows.&lt;/p&gt;

&lt;p&gt;Three things that move the per-row number more than the sticker price:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Log success rate per target, not globally.&lt;/strong&gt; A 90% average can hide a target sitting at 12%, and that target is eating your bill. The flip lives in the per-target number.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cap retries per failed request, and watch the multiplier.&lt;/strong&gt; At 60% success you're issuing ~1.7 requests per row before retries; the retries pile on top. Re-issuing a request that fails the same way twice is just buying the same block page again.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run the model before you switch pools.&lt;/strong&gt; A "cheaper" pool that drops your success rate can cost more per row. A "pricey" pool that lifts it can cost less. You can't tell from the price tag.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I'll repeat the limit because it matters: this is a model on published prices, not a measured invoice. I don't have a per-run dollar ledger to show you. What I do have is the shape of the traffic from a lot of production runs — the part that returns nothing is real and it's structural — and a 60-line script that turns your own success rate into a per-row cost. The dollars are yours to fill in.&lt;/p&gt;

&lt;p&gt;The honest open question for me: I've been treating &lt;code&gt;fail_resp_kb&lt;/code&gt; (the size of a block/challenge response) as a flat 60 KB. On JS-challenge targets a "failed" attempt can pull a full interactive challenge page — heavier than the real data page. If your failures are &lt;em&gt;bigger&lt;/em&gt; than your successes, the waste fraction climbs faster than this model shows. I haven't pinned that distribution down per target yet. If you've measured the byte size of your failures versus your successes, I'd genuinely like to see the numbers.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by Aleksey Spinov. I write up the cost and failure math from real production scraping — 2,190 runs and counting. Follow for the next one, and if you've metered the bytes a failed request actually costs you, drop the number in the comments — I read every one.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AI disclosure: drafted with AI assistance; all numbers, the model, and its output were produced and verified by me. The Python in this post was run locally (stdlib, no network); the output shown is the real run, not a mock-up.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>dataengineering</category>
      <category>proxies</category>
    </item>
  </channel>
</rss>
