<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: SIÁN Agency</title>
    <description>The latest articles on DEV Community by SIÁN Agency (@sian-agency).</description>
    <link>https://dev.to/sian-agency</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3854792%2Fcb57fd08-1d47-4084-97aa-8c4879d72af0.png</url>
      <title>DEV Community: SIÁN Agency</title>
      <link>https://dev.to/sian-agency</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sian-agency"/>
    <language>en</language>
    <item>
      <title>Schema Drift Is the Silent Killer. Here's What to Log So You Actually Catch It.</title>
      <dc:creator>SIÁN Agency</dc:creator>
      <pubDate>Tue, 02 Jun 2026 09:00:00 +0000</pubDate>
      <link>https://dev.to/sian-agency/schema-drift-is-the-silent-killer-heres-what-to-log-so-you-actually-catch-it-15dm</link>
      <guid>https://dev.to/sian-agency/schema-drift-is-the-silent-killer-heres-what-to-log-so-you-actually-catch-it-15dm</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — Most scraper "bugs" aren't bugs. They're the source site changing its data shape underneath you while your selectors and your code keep returning success. This is schema drift, and you cannot prevent it. You can only detect it. The detection has to be designed in. Here's how we do it.&lt;/p&gt;

&lt;p&gt;I have a low opinion of any scraper that does not log a per-field availability rate. It's the single most useful number you can produce, and almost nobody produces it.&lt;/p&gt;

&lt;p&gt;The premise: every record you scrape has a set of expected fields. After every run, you compute, for each field, the percentage of records that had a non-null value for it. You log that number. You alarm on it.&lt;/p&gt;

&lt;p&gt;That's it. That's the whole technique.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;A scraper has three failure modes you actually care about:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Total failure&lt;/strong&gt; — the run errors out, you get a stack trace, you fix it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial failure&lt;/strong&gt; — some URLs fail, you log them, you retry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema drift&lt;/strong&gt; — every URL "succeeds," every record looks fine, but a field has silently gone from 98% present to 30% present.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first two are loud. The third is silent. Schema drift is what produces "the dashboard looks weird" support tickets a week after the cause.&lt;/p&gt;

&lt;p&gt;Real example, from our Sephora product info actor: in March, the site moved the "ingredients" field from a top-level dropdown into a tab inside a modal. Our existing selector still found &lt;em&gt;something&lt;/em&gt; on the page (a placeholder div), and our code happily wrote &lt;code&gt;ingredients=""&lt;/code&gt; to the dataset. No error, no alarm. The CSV still had an ingredients column; the values were just empty for new products. A customer detected it eight days later while trying to filter by allergen.&lt;/p&gt;

&lt;p&gt;If we had been logging field availability, we would have seen the ingredient field drop from 96% present to 11% present in a single deploy and caught it inside an hour.&lt;/p&gt;

&lt;h2&gt;
  
  
  The teardown of why this gets missed
&lt;/h2&gt;

&lt;p&gt;Most scrapers track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rows extracted per run.&lt;/li&gt;
&lt;li&gt;Errors per run.&lt;/li&gt;
&lt;li&gt;Run duration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of those move when schema drift happens. The row count is the same. The error rate is zero. The run duration is the same. You have to be looking at field-level data to see it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The replacement pattern
&lt;/h2&gt;

&lt;p&gt;After every run, compute and log this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Counter&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;field_availability&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected_fields&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Returns the % of records where each field is non-null.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
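    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# empty run: report 0% instead of dividing by zero
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;expected_fields&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;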
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;expected_fields&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
                &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;expected_fields&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At the end of the run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;availability&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;field_availability&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EXPECTED_FIELDS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;field_availability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extra&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Alarm on regression vs last run. Assumes the Apify Python SDK (from apify import Actor).
&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_field_availability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pct&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pct&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pct&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# 10-point drop is suspicious
&lt;/span&gt;        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;availability regression: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;% → &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pct&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_field_availability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;availability&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One info line per run, plus one warning per regressed field. Persistent state across runs. An alarm when any field drops more than 10 percentage points.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to monitor specifically
&lt;/h2&gt;

&lt;p&gt;Field availability is the one that catches the most. Two more I find pay for themselves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Value distribution shift.&lt;/strong&gt; For numeric fields (price, rating, count), log the median and p95. If price suddenly goes from "median ~$30" to "median 0.0" you have a parser bug, not just availability drift.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selector hit count.&lt;/strong&gt; When you fall back from primary to secondary selector, log it. If your fallback rate goes from 1% to 40%, the primary selector is on its way out — you have a week or so before it goes to zero.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our experience, these three together (availability, distribution, fallback rate) catch ~90% of schema drift before it produces customer-visible bugs.&lt;/p&gt;
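
&lt;p&gt;The distribution and fallback-rate signals are about as cheap to compute as availability. A minimal sketch, assuming numeric values arrive as a plain list and that your extractor already counts which selector produced each record. &lt;code&gt;numeric_summary&lt;/code&gt;, &lt;code&gt;fallback_rate&lt;/code&gt; and the &lt;code&gt;"primary"&lt;/code&gt; key are hypothetical names chosen for illustration:&lt;/p&gt;

```python
from statistics import median, quantiles


def numeric_summary(values):
    """Median and p95 of the numeric values; None entries are skipped."""
    nums = sorted(v for v in values if isinstance(v, (int, float)))
    if len(nums) == 0:
        return {"median": None, "p95": None}
    if len(nums) == 1:
        return {"median": nums[0], "p95": nums[0]}
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(nums, n=20, method="inclusive")[-1]
    return {"median": median(nums), "p95": p95}


def fallback_rate(selector_hits):
    """Share (in %) of records extracted via anything but the primary selector."""
    total = sum(selector_hits.values())
    if total == 0:
        return 0.0
    fallbacks = total - selector_hits.get("primary", 0)
    return round(fallbacks / total * 100, 1)


# Illustrative data only: one missing price and one outlier.
prices = [29.0, 31.5, 30.0, None, 28.0, 120.0]
print(numeric_summary(prices))
print(fallback_rate({"primary": 60, "secondary": 40}))
```

&lt;p&gt;Log both next to the availability numbers and compare them with the previous run in the same way. The median barely moves when a single outlier appears, which is why it pairs well with p95.&lt;/p&gt;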

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs11vuw7ui4vby9w0yx0a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs11vuw7ui4vby9w0yx0a.png" alt="Fig. 1 — Field availability across runs. The drop on day 4 is schema drift, not a bug." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;

&lt;p&gt;We added per-field availability logging across the Sephora actor portfolio in February. In the four months since:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6 schema-drift incidents caught and fixed within 48 hours of the source-site change.&lt;/li&gt;
&lt;li&gt;Mean detection lag went from "a customer noticed" (~7 days) to "the alarm fired" (~12 hours, the gap being our run cadence).&lt;/li&gt;
&lt;li&gt;One incident where the field availability &lt;em&gt;dropped&lt;/em&gt; in a way that &lt;em&gt;was&lt;/em&gt; expected (Sephora removed a field site-wide); we acknowledged and updated the schema. Net cost: 20 minutes, including writing the postmortem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cost: about 30 lines of code per actor, run-time overhead measured in milliseconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  When this is wrong
&lt;/h2&gt;

&lt;p&gt;An absolute availability target is a poor signal when your input is inherently heterogeneous. If you're scraping listings where some products have ingredients and most don't, "30% have ingredients" might be normal. The technique still works; you just compare &lt;em&gt;to the previous run&lt;/em&gt;, not to an absolute target. A 10-point drop is the alarm; the absolute number doesn't matter.&lt;/p&gt;

&lt;p&gt;If you're scraping a homogeneous catalogue (every product has a title and a price), absolute thresholds work fine. Title &amp;lt;99% present? Something is wrong.&lt;/p&gt;
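
&lt;p&gt;Both kinds of threshold can live in one check. The following is a sketch under assumptions, not our middleware: &lt;code&gt;availability_alerts&lt;/code&gt;, &lt;code&gt;max_drop&lt;/code&gt; and &lt;code&gt;floors&lt;/code&gt; are hypothetical names, and the percentages are made up.&lt;/p&gt;

```python
def availability_alerts(current, previous, max_drop=10.0, floors=None):
    """Return alert messages for relative drops and for absolute floors."""
    floors = floors or {}
    alerts = []
    for field, pct in current.items():
        prev_pct = previous.get(field)
        # Relative check: compare with the previous run, whatever the baseline is.
        if prev_pct is not None and prev_pct - pct > max_drop:
            alerts.append(f"{field}: dropped {prev_pct}% to {pct}%")
        # Absolute check: only for fields that every record should carry.
        floor = floors.get(field)
        if floor is not None and floor > pct:
            alerts.append(f"{field}: {pct}% is below the {floor}% floor")
    return alerts


previous = {"title": 100.0, "ingredients": 31.0}
current = {"title": 97.5, "ingredients": 12.0}
for alert in availability_alerts(current, previous, floors={"title": 99.0}):
    print(alert)
```

&lt;p&gt;Here the heterogeneous &lt;code&gt;ingredients&lt;/code&gt; field trips the relative check (31.0% to 12.0%), while &lt;code&gt;title&lt;/code&gt; trips only the absolute floor, since a 2.5-point slide is under the relative threshold.&lt;/p&gt;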

&lt;p&gt;We packaged the field-availability + distribution + fallback-rate triple into a small middleware that sits at the end of every actor we ship — first deployed on the &lt;a href="https://apify.com/sian.agency/best-sephora-product-information-extractor?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=jonas&amp;amp;utm_content=schema-drift-field-availability" rel="noopener noreferrer"&gt;Sephora product info actor&lt;/a&gt; and rolled out portfolio-wide. Three lines to wire up, alarms in your inbox the day a source site decides to change their schema.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which of the three signals is missing from your scraper right now?&lt;/strong&gt; Drop it in the comments — I'll show you the smallest version that works.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by &lt;strong&gt;Jonas Keller&lt;/strong&gt;, Senior Automation Architect at SIÁN Agency. Find more from Jonas on &lt;a href="https://dev.to/sian-agency"&gt;dev.to&lt;/a&gt;. For custom scraping or automation work, &lt;a href="https://sian.agency?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=jonas&amp;amp;utm_content=schema-drift-field-availability" rel="noopener noreferrer"&gt;hire SIÁN Agency&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>One Playwright Selector Trick Nobody Talks About: getByRole</title>
      <dc:creator>SIÁN Agency</dc:creator>
      <pubDate>Sun, 31 May 2026 08:30:00 +0000</pubDate>
      <link>https://dev.to/sian-agency/one-playwright-selector-trick-nobody-talks-about-getbyrole-100e</link>
      <guid>https://dev.to/sian-agency/one-playwright-selector-trick-nobody-talks-about-getbyrole-100e</guid>
      <description>&lt;p&gt;Everyone reaches for &lt;code&gt;page.locator(".some-class")&lt;/code&gt; first. They shouldn't.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;getByRole&lt;/code&gt; is the most stable selector in Playwright and almost nobody uses it for scraping. They think it's a testing-library thing. It's not. It's a way of asking the page "what is this element semantically" instead of "what classname does the design system happen to use this week."&lt;/p&gt;

&lt;p&gt;That distinction is what kept our Facebook video transcript actor running through three Facebook redesigns this past year.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-item checklist
&lt;/h2&gt;

&lt;p&gt;When does &lt;code&gt;getByRole&lt;/code&gt; work? When the site is built by people who care about accessibility. Which is: more sites than you think, especially big ones with legal requirements (US government, EU compliance, large e-commerce).&lt;/p&gt;

&lt;p&gt;Check before you skip it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Open the accessibility tree&lt;/strong&gt; in Chrome DevTools (Elements → Accessibility tab). If your target element shows a role and an accessible name, &lt;code&gt;getByRole&lt;/code&gt; will find it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Buttons and headings are nearly always tagged correctly.&lt;/strong&gt; Even sloppy sites give you &lt;code&gt;role="button"&lt;/code&gt; and proper heading levels because the design system enforced it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forms expose &lt;code&gt;label&lt;/code&gt; even when the visual design hides it.&lt;/strong&gt; &lt;code&gt;getByLabel("Email")&lt;/code&gt; works on inputs that don't visibly show "Email" anywhere.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The trick
&lt;/h2&gt;

&lt;p&gt;Compare:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Class-name brittle&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;brittleFollowBtn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;._a9-_._a9-_2._a9-_8._a9-_z&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// getByRole — survives layout changes&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;followBtn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;button&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/follow/i&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first one breaks the day Facebook tweaks their CSS-in-JS hash. The second one keeps working until they remove the button entirely.&lt;/p&gt;

&lt;p&gt;Same for headings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// "Get the post title"&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;heading&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That works on every site that uses &lt;code&gt;&amp;lt;h1&amp;gt;&lt;/code&gt; correctly. Which is most of them, because a single descriptive &lt;code&gt;&amp;lt;h1&amp;gt;&lt;/code&gt; is standard SEO and accessibility practice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiolpm92cu0hrh01xce5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiolpm92cu0hrh01xce5t.png" alt="Fig. 1 — Selector stability over a 30-day window. getByRole survives layout churn." width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick case
&lt;/h2&gt;

&lt;p&gt;The Facebook transcript actor extracts video metadata from public posts. Facebook ships A/B tests constantly — class names change every couple of weeks. Selectors built on &lt;code&gt;_a9-_8&lt;/code&gt; chains broke regularly.&lt;/p&gt;

&lt;p&gt;I rewrote the extractor to use &lt;code&gt;getByRole&lt;/code&gt; for everything that had a meaningful role:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Author name → &lt;code&gt;getByRole('link', { name: /^[\w. ]+$/ })&lt;/code&gt; near the post header.&lt;/li&gt;
&lt;li&gt;Post text → no role, but &lt;code&gt;[data-ad-comet-preview="message"]&lt;/code&gt; (a &lt;code&gt;data-&lt;/code&gt; attribute, also stable).&lt;/li&gt;
&lt;li&gt;Video player → &lt;code&gt;getByRole('article')&lt;/code&gt; containing a &lt;code&gt;&amp;lt;video&amp;gt;&lt;/code&gt; element.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before: ~8 selector breakages per quarter. After: 1 in the last 6 months, and that one was a real structural change (Facebook moved to a new post type), not a class rename.&lt;/p&gt;
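
&lt;p&gt;The "role first, CSS class as a last resort" order can be made explicit and measurable. What follows is a library-agnostic sketch: &lt;code&gt;first_hit&lt;/code&gt; and the two stand-in finders are hypothetical, and in a real extractor each finder would wrap a Playwright locator lookup.&lt;/p&gt;

```python
from collections import Counter


def first_hit(strategies, hits):
    """Try (name, finder) pairs in order and return the first non-empty result.

    Each finder is a zero-argument callable that returns the extracted value,
    or None / "" when its selector matched nothing.
    """
    for name, finder in strategies:
        value = finder()
        if value not in (None, ""):
            hits[name] += 1  # record which strategy won, for the fallback rate
            return value
    hits["miss"] += 1
    return None


# Stand-ins for real lookups. With Playwright's Python API these could wrap
# page.get_by_role("heading", level=1) and a CSS-class locator instead.
def by_role():
    return None  # pretend the role-based lookup found nothing


def by_css_class():
    return "Post title"


hits = Counter()
title = first_hit([("role", by_role), ("css", by_css_class)], hits)
print(title, dict(hits))
```

&lt;p&gt;Keeping the counter per run tells you how often the brittle selector is doing the work, which is the number to watch before a redesign lands.&lt;/p&gt;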

&lt;h2&gt;
  
  
  The CTA you didn't ask for
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;getByRole&lt;/code&gt; is now the first thing every new actor we write tries — including the rebuild of the &lt;a href="https://apify.com/sian.agency/facebook-ai-transcript-extractor?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nova&amp;amp;utm_content=playwright-getbyrole-stable-selectors" rel="noopener noreferrer"&gt;Facebook AI Transcript Extractor&lt;/a&gt;. CSS-class selectors are reserved for the cases where the site's accessibility story is genuinely broken (rare in 2026 — most sites have been audited at least once).&lt;/p&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open your scraper. Run a search for &lt;code&gt;page.locator(&lt;/code&gt;&lt;/strong&gt; with a CSS class chain. &lt;strong&gt;How many can you replace with &lt;code&gt;getByRole&lt;/code&gt;?&lt;/strong&gt; Drop the count in the comments — I'll bet it's more than half.&lt;/p&gt;

&lt;p&gt;Agree, disagree, or have a site where &lt;code&gt;getByRole&lt;/code&gt; falls apart? Reply.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by &lt;strong&gt;Nova Chen&lt;/strong&gt;, Automation Dev Advocate at SIÁN Agency. Find more from Nova on &lt;a href="https://dev.to/sian-agency"&gt;dev.to&lt;/a&gt;. For custom scraping or automation work, &lt;a href="https://sian.agency?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nova&amp;amp;utm_content=playwright-getbyrole-stable-selectors" rel="noopener noreferrer"&gt;hire SIÁN Agency&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Scraping Without Tests Is Gambling. And the House Always Wins.</title>
      <dc:creator>SIÁN Agency</dc:creator>
      <pubDate>Fri, 29 May 2026 06:30:00 +0000</pubDate>
      <link>https://dev.to/sian-agency/scraping-without-tests-is-gambling-and-the-house-always-wins-3ilh</link>
      <guid>https://dev.to/sian-agency/scraping-without-tests-is-gambling-and-the-house-always-wins-3ilh</guid>
      <description>&lt;p&gt;Nobody writes tests for scrapers. I get it. The site changes, your tests break, you feel like you spent Tuesday writing tests &lt;em&gt;for the site you don't control&lt;/em&gt;. So you skip them.&lt;/p&gt;

&lt;p&gt;Then the site changes again. Your scraper silently returns empty rows. The dashboard goes blank. Your client texts at 11pm. You discover, in the cold light of debug, that this exact failure was deterministic and could have been caught in 30 seconds by a single fixture-based test.&lt;/p&gt;

&lt;p&gt;The house always wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-item checklist
&lt;/h2&gt;

&lt;p&gt;What scrapers actually need to test:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extraction against a frozen HTML fixture.&lt;/strong&gt; Save a copy of the page once. Run the parser against it. Assert the fields. This catches &lt;em&gt;your&lt;/em&gt; bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema validation against a live response.&lt;/strong&gt; Periodically (daily, weekly), hit one real URL and validate the output shape. This catches &lt;em&gt;their&lt;/em&gt; changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke test the full pipeline against a known-good URL.&lt;/strong&gt; End-to-end. One URL. Asserts that you get one row out, with the expected fields. This catches integration breakage.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You don't need a Jest config or a pytest empire. You need three test files.&lt;/p&gt;
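
&lt;p&gt;The schema check in item 2 does not need much machinery either. A stdlib-only sketch, assuming the record has already been fetched from the live URL; the field names are borrowed from the fixture test in this post, and &lt;code&gt;shape_errors&lt;/code&gt; is a hypothetical helper (a schema library such as Pydantic would be the fuller version):&lt;/p&gt;

```python
# Hypothetical expected shape: field name mapped to the Python type it must have.
EXPECTED_SHAPE = {"author": str, "likes": int, "text": str}


def shape_errors(record, expected=EXPECTED_SHAPE):
    """List every way a record deviates from the expected field names and types."""
    errors = []
    for field, expected_type in expected.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    for field in sorted(record.keys() - expected.keys()):
        errors.append(f"unexpected field: {field}")
    return errors


good = {"author": "@somecreator", "likes": 1247, "text": "Great video"}
drifted = {"author": "@somecreator", "likes": "1,247", "comment": "Great video"}
print(shape_errors(good))
print(shape_errors(drifted))
```

&lt;p&gt;An empty list means the live response still has the shape you expect. Anything else is worth failing the scheduled job over.&lt;/p&gt;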

&lt;h2&gt;
  
  
  The replacement: a fixture-first test in &amp;lt;10 lines
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# tests/test_extract.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;my_scraper.extract&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;extract_comment&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_youtube_comment_extraction&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tests/fixtures/youtube_comment_2026-04-01.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_comment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@somecreator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;likes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1247&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;great video&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here &lt;code&gt;extract_comment(html)&lt;/code&gt; is a pure function: give it HTML, get a dict back. No browser, no network. It runs in milliseconds, fits in a CI minute budget, and catches every regression in your &lt;em&gt;parsing&lt;/em&gt; code instantly.&lt;/p&gt;

&lt;p&gt;Save the fixture by literally hitting the URL once and writing the response to disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
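&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scripts/refresh_fixture.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;async_playwright&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;async_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.youtube.com/watch?v=...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tests/fixtures/youtube_comment_2026-04-01.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;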
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it once a quarter. If the test fails against the refreshed fixture, fix the extractor and commit both. That's the loop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F033yygnzdasl2ozna51f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F033yygnzdasl2ozna51f.png" alt="Fig. 1 — Mean time to detect a scraper bug. Tests collapse the gap." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick case
&lt;/h2&gt;

&lt;p&gt;On our YouTube comments scraper, fixture-based tests caught &lt;strong&gt;three&lt;/strong&gt; parsing regressions before they ever reached production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A change to the &lt;code&gt;likeCount&lt;/code&gt; field: a rename plus a thousand-separator format change.&lt;/li&gt;
&lt;li&gt;A new "pinned" badge that broke our author-name selector.&lt;/li&gt;
&lt;li&gt;A timestamp format change from "2 days ago" to "2d".&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three would have shipped silently. The cron would still run. The CSV would still write. The fields would just be wrong or empty. Instead, the test failed in CI on the PR that introduced the change, fifteen minutes after the fixture was last refreshed.&lt;/p&gt;

&lt;p&gt;The cost of writing the test the first time: 20 minutes. The cost of the bugs it caught, if shipped: at minimum a refund and an apology each.&lt;/p&gt;
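&lt;p&gt;The like-count and timestamp regressions above both come down to a parser that quietly accepts one format and returns garbage for another. A minimal sketch of parsers that handle both shapes and fail loudly on anything else follows. The function names, the unit table, and the accepted formats are assumptions for illustration, not the scraper's actual code.&lt;/p&gt;

```python
import re

# Hypothetical unit table: only the units this sketch chooses to accept.
_UNIT_SECONDS = {
    "s": 1, "second": 1,
    "minute": 60,
    "h": 3600, "hour": 3600,
    "d": 86400, "day": 86400,
    "w": 604800, "week": 604800,
}

# Matches "2 days ago", "2 days", and the compact "2d".
_AGE_RE = re.compile(r"^\s*(\d+)\s*([a-z]+?)s?(?:\s+ago)?\s*$", re.IGNORECASE)


def parse_age_seconds(text: str) -> int:
    """Convert a relative timestamp into seconds, or raise ValueError."""
    match = _AGE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognised age format: {text!r}")
    amount = int(match.group(1))
    unit = match.group(2).lower()
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"unrecognised age unit: {text!r}")
    return amount * _UNIT_SECONDS[unit]


def parse_like_count(text: str) -> int:
    """Parse an integer count, ignoring thousand separators.

    Assumption: counts never carry decimals, so "," "." and whitespace
    are all treated as separators. Abbreviations such as "1.2K" raise.
    """
    cleaned = re.sub(r"[,.\s]", "", text)
    if not cleaned.isdigit():
        raise ValueError(f"unrecognised count format: {text!r}")
    return int(cleaned)
```

&lt;p&gt;The useful property is the &lt;code&gt;ValueError&lt;/code&gt;: when the site ships a format the parser has never seen, the fixture test fails instead of writing an empty or wrong field to the CSV.&lt;/p&gt;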

&lt;h2&gt;
  
  
  The CTA you didn't ask for
&lt;/h2&gt;

&lt;p&gt;Every actor we ship now starts with three test files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;tests/test_extract.py&lt;/code&gt; — fixture-based unit tests for parsing.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_schema.py&lt;/code&gt; — Pydantic / Zod schema check on a live URL, run on a schedule.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_smoke.py&lt;/code&gt; — single-URL end-to-end check on every deploy.&lt;/li&gt;
&lt;/ul&gt;
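&lt;p&gt;For readers without Pydantic installed, here is a rough standard-library stand-in for the kind of check &lt;code&gt;tests/test_schema.py&lt;/code&gt; performs. The field names and the &lt;code&gt;schema_errors&lt;/code&gt; helper are hypothetical; a real schema library gives you coercion and much better error messages.&lt;/p&gt;

```python
# Hypothetical expected shape of one scraped comment record.
EXPECTED_FIELDS = {
    "author": str,
    "text": str,
    "like_count": int,
}


def schema_errors(record: dict) -> list[str]:
    """Return one message per field that is missing or has the wrong type."""
    errors = []
    for name, expected_type in EXPECTED_FIELDS.items():
        value = record.get(name)
        if value is None:
            errors.append(f"{name}: missing")
        elif not isinstance(value, expected_type):
            errors.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return errors
```

&lt;p&gt;A scheduled run would scrape one known URL, pass each record through &lt;code&gt;schema_errors&lt;/code&gt;, and fail if any list comes back non-empty.&lt;/p&gt;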

&lt;p&gt;It's the most boring testing pyramid you've ever seen and it has paid for itself an embarrassing number of times — the &lt;a href="https://apify.com/sian.agency/cheapest-youtube-comments-scraper?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nova&amp;amp;utm_content=scraping-without-tests-is-gambling" rel="noopener noreferrer"&gt;YouTube comments scraper&lt;/a&gt; is where it caught the most regressions in 2026.&lt;/p&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open your scraper. Do you have a &lt;code&gt;tests/&lt;/code&gt; folder?&lt;/strong&gt; Drop "yes" or "no" in the comments. If "no" — what's stopping you?&lt;/p&gt;

&lt;p&gt;Agree, disagree, or have a fixture strategy that actually works? Reply.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by &lt;strong&gt;Nova Chen&lt;/strong&gt;, Automation Dev Advocate at SIÁN Agency. Find more from Nova on &lt;a href="https://dev.to/sian-agency"&gt;dev.to&lt;/a&gt;. For custom scraping or automation work, &lt;a href="https://sian.agency?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nova&amp;amp;utm_content=scraping-without-tests-is-gambling" rel="noopener noreferrer"&gt;hire SIÁN Agency&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>automation</category>
      <category>softwareengineering</category>
      <category>testing</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Why Your Requests + BeautifulSoup Stack Will Fail in Production</title>
      <dc:creator>SIÁN Agency</dc:creator>
      <pubDate>Tue, 26 May 2026 08:30:00 +0000</pubDate>
      <link>https://dev.to/sian-agency/why-your-requests-beautifulsoup-stack-will-fail-in-production-34kl</link>
      <guid>https://dev.to/sian-agency/why-your-requests-beautifulsoup-stack-will-fail-in-production-34kl</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — &lt;code&gt;requests&lt;/code&gt; plus &lt;code&gt;BeautifulSoup&lt;/code&gt; is the right tool for tutorials, side projects, and one-off audits. It is the wrong tool for any scraper that has to run unsupervised, longer than a quarter, against a site that has even basic bot defenses. I've watched a dozen teams discover this the expensive way. Here's the diagnosis and the replacement.&lt;/p&gt;

&lt;p&gt;I'm not anti-&lt;code&gt;requests&lt;/code&gt;. The library is fast, predictable, and elegant. For 30% of scraping tasks it's still what I reach for first. The problem is that the &lt;em&gt;rest&lt;/em&gt; of the scraping pipeline — JavaScript-rendered content, fingerprinting checks, modern auth flows, lazy loading — silently breaks the assumptions &lt;code&gt;requests&lt;/code&gt; is built on.&lt;/p&gt;

&lt;p&gt;Most teams discover this in stages. Here's the timeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Month 1 — "It works"
&lt;/h2&gt;

&lt;p&gt;You write the first version. &lt;code&gt;requests.get(url)&lt;/code&gt; returns 200, &lt;code&gt;BeautifulSoup&lt;/code&gt; parses the response, you find your selectors, you ship. Tests pass against the small URL set you tested with. Lunch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Month 2 — "Some pages return empty"
&lt;/h2&gt;

&lt;p&gt;You notice maybe 5% of pages return rows where half the fields are &lt;code&gt;None&lt;/code&gt;. You add a check, log the URL, retry. The retry sometimes works.&lt;/p&gt;

&lt;p&gt;What's actually happening: those pages render their data in JavaScript after the initial response. &lt;code&gt;requests&lt;/code&gt; got the HTML skeleton. The data was never in it. The retries that "work" are coincidence — sometimes the cached page has stale rendered data; sometimes a CDN ships a different variant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Month 3 — "We're getting 403s"
&lt;/h2&gt;

&lt;p&gt;The target site rolls out a fingerprinting check. &lt;code&gt;requests&lt;/code&gt; sends a default User-Agent that screams &lt;code&gt;python-requests/2.31.0&lt;/code&gt;. You add headers. It works for two days. Then they tighten the check: now they look at the TLS fingerprint, not just the User-Agent. &lt;code&gt;requests&lt;/code&gt; negotiates TLS through Python's &lt;code&gt;ssl&lt;/code&gt; module (OpenSSL), and its handshake looks different from any real browser's. The block returns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Month 4 — "We need a session, but it's stateful"
&lt;/h2&gt;

&lt;p&gt;Login flow now requires a CSRF token, which is rendered in JavaScript, which &lt;code&gt;requests&lt;/code&gt; can't run. You spend two days reverse-engineering the login flow, find the API endpoint behind it, hit that directly. Works for six weeks. They rotate the auth scheme.&lt;/p&gt;

&lt;h2&gt;
  
  
  Month 5 — "Let's just use Playwright"
&lt;/h2&gt;

&lt;p&gt;You finally migrate. Most of the team is annoyed because the rewrite took longer than they wanted. The team that does it later is annoyed for the same reason.&lt;/p&gt;

&lt;h2&gt;
  
  
  The teardown
&lt;/h2&gt;

&lt;p&gt;The fundamental issue: &lt;code&gt;requests&lt;/code&gt; is an HTTP client. Modern websites are browser applications. The thing you're scraping is the &lt;em&gt;output of running JavaScript&lt;/em&gt;, not a static document. You can fight that for a while — by reverse-engineering APIs, faking TLS fingerprints, hand-rolling JS interpreters — but you're paying interest on a debt you took on the day you reached for &lt;code&gt;requests&lt;/code&gt; instead of a real browser.&lt;/p&gt;

&lt;p&gt;Specific failure modes you're going to hit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JavaScript-rendered content.&lt;/strong&gt; The HTML you fetch contains &lt;code&gt;&amp;lt;div id="root"&amp;gt;&amp;lt;/div&amp;gt;&lt;/code&gt; and not much else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLS fingerprinting.&lt;/strong&gt; &lt;code&gt;requests&lt;/code&gt; looks like Python; real browsers look like Chrome/Firefox. Block lists distinguish them easily.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lazy-loading.&lt;/strong&gt; Data appears in the DOM only after scroll, click, or visibility events. Static fetch never triggers them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modern auth.&lt;/strong&gt; OAuth, CSRF tokens injected via JS, cookie-based session validation that requires running scripts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-automation challenges.&lt;/strong&gt; Cloudflare, PerimeterX, DataDome — all rely on running JavaScript to validate the client.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;requests&lt;/code&gt; answers none of these. Playwright (or Puppeteer) answers all of them, because Playwright &lt;em&gt;is&lt;/em&gt; a browser.&lt;/p&gt;

&lt;h2&gt;
  
  
  The replacement pattern
&lt;/h2&gt;

&lt;p&gt;Skip the year of pain. Start with Playwright. Use &lt;code&gt;requests&lt;/code&gt; only when you've measured that the data is in the static HTML and the site has no fingerprinting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;async_playwright&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;async_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;user_agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (...)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;viewport&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;width&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1920&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;height&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1080&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Block heavy resources for speed.
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**/*.{png,jpg,jpeg,gif,svg,woff,woff2}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_until&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domcontentloaded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Wait for the *data* to appear, not just the document.
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[data-product-id]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;extract_fields&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five things &lt;code&gt;requests&lt;/code&gt; can't give you that Playwright does for free:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;JavaScript execution — your selectors target rendered DOM, not the source.&lt;/li&gt;
&lt;li&gt;Realistic TLS fingerprint — Chromium does this for you.&lt;/li&gt;
&lt;li&gt;Cookie/session handling that matches a real browser.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;wait_for_selector&lt;/code&gt; — semantic waits instead of &lt;code&gt;time.sleep&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Routing controls — block what you don't need, accelerate what you do.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  When &lt;code&gt;requests&lt;/code&gt; is still right
&lt;/h2&gt;

&lt;p&gt;Static documentation sites. Open RSS/Atom feeds. JSON APIs that don't require login. PDFs and CSVs hosted on S3. Anything where you've actually fetched the URL, looked at the response body, and confirmed your data is in it.&lt;/p&gt;

&lt;p&gt;That's a real category. Just don't assume the &lt;em&gt;next&lt;/em&gt; site you scrape will fall into it.&lt;/p&gt;
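&lt;p&gt;"Confirmed your data is in it" can be made mechanical. A minimal sketch, using only the standard library: fetch the raw body once, then list which values you can see in the browser are absent from it. &lt;code&gt;fetch_body&lt;/code&gt; and &lt;code&gt;missing_markers&lt;/code&gt; are hypothetical helper names, and a plain substring match is a deliberately crude assumption.&lt;/p&gt;

```python
import urllib.request


def fetch_body(url: str, timeout: float = 10.0) -> str:
    """Fetch the raw response body with no JavaScript execution."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def missing_markers(body: str, expected: list[str]) -> list[str]:
    """Return the expected values that do not appear in the static body."""
    return [value for value in expected if value not in body]
```

&lt;p&gt;Pick a few values you can see on the rendered page (a product name, a price) and pass them as &lt;code&gt;expected&lt;/code&gt;. An empty result suggests a static fetch is enough for that URL; anything else means the data arrives later, via JavaScript.&lt;/p&gt;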

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ojfcfeip0812d5iidnz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ojfcfeip0812d5iidnz.png" alt="Fig. 1 — Failure modes by stack. requests+BS4 hits four walls a real browser doesn't." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;

&lt;p&gt;Across our actor portfolio, the migration ratio settled around 80/20 — Playwright for 80% of jobs, &lt;code&gt;requests&lt;/code&gt; for the 20% where the data is genuinely static. The 80% includes our entire &lt;a href="https://apify.com/sian.agency/best-sephora-product-catalog-extractor?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=jonas&amp;amp;utm_content=requests-beautifulsoup-fails-production" rel="noopener noreferrer"&gt;Sephora catalog pipeline&lt;/a&gt;, which spent its first version as a &lt;code&gt;requests + BeautifulSoup&lt;/code&gt; script and never made it past month 2. The Playwright rewrite has been running unsupervised for 14 months.&lt;/p&gt;

&lt;p&gt;If your scraper is currently 100% &lt;code&gt;requests&lt;/code&gt;, your sample size isn't "this works fine." Your sample size is "the sites I've scraped so far happen to have static HTML."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which of the five failure modes have you shipped to production?&lt;/strong&gt; Drop the symptom in the comments — I'll point at the fix.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by &lt;strong&gt;Jonas Keller&lt;/strong&gt;, Senior Automation Architect at SIÁN Agency. Find more from Jonas on &lt;a href="https://dev.to/sian-agency"&gt;dev.to&lt;/a&gt;. For custom scraping or automation work, &lt;a href="https://sian.agency?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=jonas&amp;amp;utm_content=requests-beautifulsoup-fails-production" rel="noopener noreferrer"&gt;hire SIÁN Agency&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>automation</category>
      <category>python</category>
      <category>softwareengineering</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Stop Fighting the DOM. Selector-First Thinking Will Save Your Scraper.</title>
      <dc:creator>SIÁN Agency</dc:creator>
      <pubDate>Sun, 24 May 2026 09:00:00 +0000</pubDate>
      <link>https://dev.to/sian-agency/stop-fighting-the-dom-selector-first-thinking-will-save-your-scraper-2bp9</link>
      <guid>https://dev.to/sian-agency/stop-fighting-the-dom-selector-first-thinking-will-save-your-scraper-2bp9</guid>
      <description>&lt;p&gt;Most broken scrapers I see have the same shape: someone wrote the extraction logic &lt;em&gt;first&lt;/em&gt; and the selectors &lt;em&gt;second&lt;/em&gt;. The selectors were an afterthought — whatever worked in DevTools at 2am.&lt;/p&gt;

&lt;p&gt;That's backwards. Selectors are the contract between your code and the page. Get them wrong and the rest of your scraper is irrelevant.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mindset shift
&lt;/h2&gt;

&lt;p&gt;Selector-first thinking means: before you write a single line of extraction code, you decide &lt;em&gt;how the data is identified&lt;/em&gt;. Not "how do I get the price?" but "what does the page tell me, programmatically, that this thing is a price?"&lt;/p&gt;

&lt;p&gt;Three answers, in order of preference:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Semantics&lt;/strong&gt; — &lt;code&gt;getByRole&lt;/code&gt;, &lt;code&gt;getByLabel&lt;/code&gt;, &lt;code&gt;getByText&lt;/code&gt;. These mirror what an accessibility tree exposes. They survive design changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data attributes&lt;/strong&gt; — &lt;code&gt;data-testid&lt;/code&gt;, &lt;code&gt;data-product-id&lt;/code&gt;, &lt;code&gt;itemprop&lt;/code&gt;. Devs often add these for their own tests; you get to free-ride.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured data&lt;/strong&gt; — JSON-LD, microdata, OpenGraph. The page is already telling Google what's a price; let it tell you too.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;CSS classes are last resort. Class names are styling, not identity. They change when the design changes. They're the equivalent of asking for "the third button from the top" — works until someone rearranges the menu.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-item checklist
&lt;/h2&gt;

&lt;p&gt;Before you write a selector:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Open the accessibility tree&lt;/strong&gt; in DevTools (Chrome: Elements → Accessibility tab). If the data has a role and an accessible name, use &lt;code&gt;getByRole&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search the page source for &lt;code&gt;application/ld+json&lt;/code&gt;.&lt;/strong&gt; If it's there and contains your fields, parse it directly. No DOM walking needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Look for &lt;code&gt;data-*&lt;/code&gt; attributes near the data.&lt;/strong&gt; Devs leave testing hooks everywhere. Use theirs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If none of those work, &lt;em&gt;then&lt;/em&gt; fall back to CSS or XPath. And when you do, anchor to something stable — a parent landmark, an aria-label, a &lt;code&gt;data-&lt;/code&gt; attribute — not just a class chain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10-line replacement
&lt;/h2&gt;

&lt;p&gt;Here's the priority I use in every new actor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;extractPrice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// 1. Structured data first.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ld&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;script[type="application/ld+json"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                       &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;textContent&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ld&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;offers&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;offers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// 2. Semantic selectors.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;priceByLabel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getByLabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^price$/i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;priceByLabel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;priceByLabel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;textContent&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// 3. Data attributes.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;priceByData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="price"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;priceByData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;priceByData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;textContent&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// 4. Last resort: CSS class. Logged loudly so we know we're in fallback.&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Falling back to CSS selector — selector audit needed.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.price-tag&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;textContent&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the warn() in the fallback path. When that warning starts appearing in your logs, it means the site changed its higher-priority signals and you're one design refresh away from breakage. Fix it before it breaks, not after.&lt;/p&gt;
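&lt;p&gt;One caveat on the structured-data step: JSON-LD in the wild is not always a single object with &lt;code&gt;offers.price&lt;/code&gt; at the top. The payload can be an array, can nest its entities under &lt;code&gt;@graph&lt;/code&gt;, and &lt;code&gt;offers&lt;/code&gt; can itself be a list. Here is a hedged sketch of a more tolerant lookup, written in Python and assuming the script text has already been read from the page. &lt;code&gt;price_from_json_ld&lt;/code&gt; is a hypothetical helper, not the util mentioned in this post.&lt;/p&gt;

```python
import json
from typing import Any, Iterator, Optional


def _iter_nodes(payload: Any) -> Iterator[dict]:
    """Yield every dict in a JSON-LD payload, descending into lists and @graph."""
    if isinstance(payload, list):
        for item in payload:
            yield from _iter_nodes(item)
    elif isinstance(payload, dict):
        yield payload
        yield from _iter_nodes(payload.get("@graph", []))


def price_from_json_ld(raw: Optional[str]) -> Optional[str]:
    """Return the first offer price found, or None so the caller can fall back."""
    if not raw:
        return None
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    for node in _iter_nodes(payload):
        # "offers" may be absent, a single object, or a list of objects.
        for offer in _iter_nodes(node.get("offers")):
            price = offer.get("price")
            if price is not None:
                return str(price)
    return None
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; on malformed JSON, instead of raising, keeps the ladder intact: a broken script tag drops you to the next rung rather than killing the run.&lt;/p&gt;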

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrzau5rmfeyc65gx9tm3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrzau5rmfeyc65gx9tm3.png" alt="Fig. 1 — Selector-priority ladder. Top is most stable. Bottom is most fragile." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick case
&lt;/h2&gt;

&lt;p&gt;On our Idealista actor, the priority order above turned a "fix the selector every 6 weeks" routine into a "fix the selector twice a year" routine. The JSON-LD path catches 95% of listings without ever touching the DOM. The accessibility-role fallback catches another 4%. The CSS fallback fires on edge-case property types and tells us when a new layout has shipped — usually a week before any of our other monitoring would have noticed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CTA you didn't ask for
&lt;/h2&gt;

&lt;p&gt;This selector ladder is the second thing every actor we ship gets, right after the request blocking from last week's post — see it in action in the &lt;a href="https://apify.com/sian.agency/smart-idealista-scraper?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nova&amp;amp;utm_content=selector-first-thinking-stop-fighting-dom" rel="noopener noreferrer"&gt;Idealista actor&lt;/a&gt;. It's so consistent we made it a util.&lt;/p&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open your scraper's selector code right now.&lt;/strong&gt; Count how many class-name chains you have versus semantic / structured-data lookups. Drop the ratio in the comments. Bonus points for the longest CSS chain — I bet someone has &lt;code&gt;.product-grid &amp;gt; .item:nth-child(3) &amp;gt; .price &amp;gt; span &amp;gt; strong&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Agree, disagree, or have a site that genuinely needs CSS chains? Reply.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by &lt;strong&gt;Nova Chen&lt;/strong&gt;, Automation Dev Advocate at SIÁN Agency. Find more from Nova on &lt;a href="https://dev.to/sian-agency"&gt;dev.to&lt;/a&gt;. For custom scraping or automation work, &lt;a href="https://sian.agency?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nova&amp;amp;utm_content=selector-first-thinking-stop-fighting-dom" rel="noopener noreferrer"&gt;hire SIÁN Agency&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>programming</category>
      <category>webdev</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>A 10-Line Playwright Trick That Saved Me Hours on Every Sephora Run</title>
      <dc:creator>SIÁN Agency</dc:creator>
      <pubDate>Fri, 22 May 2026 12:30:00 +0000</pubDate>
      <link>https://dev.to/sian-agency/a-10-line-playwright-trick-that-saved-me-hours-on-every-sephora-run-512a</link>
      <guid>https://dev.to/sian-agency/a-10-line-playwright-trick-that-saved-me-hours-on-every-sephora-run-512a</guid>
      <description>&lt;p&gt;Most Playwright tutorials teach you to scrape a single page. Real scrapers need to scrape thousands. The thing that kills you isn't the selector — it's everything Playwright does &lt;em&gt;before&lt;/em&gt; it touches the selector.&lt;/p&gt;

&lt;p&gt;By default, Playwright loads a page like a human visiting a website. It downloads CSS, fonts, analytics scripts, A/B testing pixels, hero images, lazy-loaded carousels, and three different chat widgets. On a product catalog page, that's 4–6 MB of stuff you don't need. Times 10,000 pages, that's the difference between a 20-minute run and a 3-hour run.&lt;/p&gt;

&lt;p&gt;Here's the 10-line route handler I drop into every actor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;BLOCKED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;image&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;media&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;font&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;stylesheet&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;**/*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;route&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;resourceType&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;url&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;BLOCKED&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/google-analytics|doubleclick|hotjar|segment|gtm/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Two lists: resource types you don't need, and tracking domains you definitely don't need.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-item checklist before you ship this
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Test that your data is still there.&lt;/strong&gt; Some sites lazy-load product info into image &lt;code&gt;data-&lt;/code&gt; attributes. Aborting images can sometimes break extraction. Run with and without the route handler and diff the output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't block scripts.&lt;/strong&gt; Modern sites build the DOM with JS, so aborting scripts will often give you an empty page. (CSS and fonts are usually safe to block: Playwright doesn't need them to resolve selectors, though visibility checks can behave differently without stylesheets.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch for sites that detect this.&lt;/strong&gt; Some bot-detection scripts check whether you fetched the analytics pixel. If your success rate drops after enabling this, allow the analytics domains back through.&lt;/li&gt;
&lt;/ol&gt;
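
&lt;p&gt;For item 1, the with/without comparison can be as small as a field-by-field diff of two runs. A minimal sketch in Python, assuming each run produces a list of dicts keyed by a stable identifier; the &lt;code&gt;diff_runs&lt;/code&gt; helper, the &lt;code&gt;sku&lt;/code&gt; key, and the sample records are hypothetical and only there for illustration:&lt;/p&gt;

```python
def diff_runs(baseline, blocked, key="sku"):
    """Return {record_key: [field, ...]} for every field whose value changed or vanished."""
    blocked_by_key = {row[key]: row for row in blocked}
    changes = {}
    for row in baseline:
        other = blocked_by_key.get(row[key])
        if other is None:
            # The whole record disappeared once resources were blocked.
            changes[row[key]] = sorted(row)
            continue
        changed = sorted(f for f in row if row[f] != other.get(f))
        if changed:
            changes[row[key]] = changed
    return changes


# Hypothetical sample data: one run without the route handler, one with it.
baseline = [
    {"sku": "A1", "name": "Moisturizer", "price": "32.00"},
    {"sku": "B2", "name": "Serum", "price": "58.00"},
]
blocked = [
    {"sku": "A1", "name": "Moisturizer", "price": "32.00"},
    {"sku": "B2", "name": "Serum", "price": None},
]
print(diff_runs(baseline, blocked))  # {'B2': ['price']}
```

&lt;p&gt;An empty result means the block list did not change what you extracted. Anything else names the records and fields to investigate before shipping.&lt;/p&gt;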

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhxha1lwypf9284884fi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhxha1lwypf9284884fi.png" alt="Fig. 1 — Page weight before vs after the block list. Same DOM, less network." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick case
&lt;/h2&gt;

&lt;p&gt;On our Sephora product info actor, this single change cut average page load from 4.8s to 1.3s. Across a 5000-product catalog scrape run sequentially, that's the difference between roughly 6.7 hours and 1.8 hours of page loads. Same selectors, same data, same success rate. We just stopped downloading hero images of moisturizers we never look at.&lt;/p&gt;

&lt;p&gt;It also dropped our Apify compute units per run by ~60%, which directly affects what we charge customers. Faster scraper, lower cost, same output. The route handler now ships with the &lt;a href="https://apify.com/sian.agency/best-sephora-product-information-extractor?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nova&amp;amp;utm_content=10-line-playwright-resource-blocking" rel="noopener noreferrer"&gt;Sephora product info actor&lt;/a&gt; and every new scraper after it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CTA you didn't ask for
&lt;/h2&gt;

&lt;p&gt;This route handler ships with our starter actor template. New scrapers get it on day one. Old scrapers got it bolted on the first time we noticed a runtime over an hour.&lt;/p&gt;

&lt;p&gt;The pattern works on any browser-based scraper — Playwright, Puppeteer, Selenium with CDP. The shape is always: tell the browser what &lt;em&gt;not&lt;/em&gt; to load, before you tell it what to find.&lt;/p&gt;

&lt;p&gt;One quick note for the JS-heavy among you: the same pattern applies to Puppeteer's &lt;code&gt;page.setRequestInterception(true)&lt;/code&gt; — same idea, slightly different API. Same wins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drop your slowest scraper's runtime in the comments.&lt;/strong&gt; I'll guess what's eating your minutes. (Hint: it's probably hero images.)&lt;/p&gt;

&lt;p&gt;Agree, disagree, or have a site where blocking images breaks something subtle? Reply.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by &lt;strong&gt;Nova Chen&lt;/strong&gt;, Automation Dev Advocate at SIÁN Agency. Find more from Nova on &lt;a href="https://dev.to/sian-agency"&gt;dev.to&lt;/a&gt;. For custom scraping or automation work, &lt;a href="https://sian.agency?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nova&amp;amp;utm_content=10-line-playwright-resource-blocking" rel="noopener noreferrer"&gt;hire SIÁN Agency&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Stop Building Fragile Scrapers — Build Actors Instead</title>
      <dc:creator>SIÁN Agency</dc:creator>
      <pubDate>Mon, 18 May 2026 13:30:00 +0000</pubDate>
      <link>https://dev.to/sian-agency/stop-building-fragile-scrapers-build-actors-instead-2ifc</link>
      <guid>https://dev.to/sian-agency/stop-building-fragile-scrapers-build-actors-instead-2ifc</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — A "scraper" is a script that ran once. An "actor" is a unit of work with an input contract, an output schema, observability, and a billing model. Same code, completely different operational surface. We migrated our Bayut property pipeline from the first to the second this quarter and the support load dropped 70%.&lt;/p&gt;

&lt;p&gt;I get sent a lot of scraper repos to "review" — usually after they've broken in production. They look surprisingly similar:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One Python file, 300–600 lines.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;main()&lt;/code&gt; that loops over URLs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;requests.get()&lt;/code&gt; plus &lt;code&gt;BeautifulSoup&lt;/code&gt; plus a &lt;code&gt;try/except: pass&lt;/code&gt; that swallows everything.&lt;/li&gt;
&lt;li&gt;Output written to a CSV called &lt;code&gt;output.csv&lt;/code&gt; in the working directory.&lt;/li&gt;
&lt;li&gt;A cron job that triggers it nightly. Sometimes a Slack webhook on failure that stopped working six months ago.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what I call &lt;strong&gt;a script that ran once&lt;/strong&gt;. The fact that it ran in production doesn't make it production code.&lt;/p&gt;

&lt;p&gt;The teardown is always the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five failure modes you inherit when you ship a script
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No input contract.&lt;/strong&gt; The script reads URLs from a hardcoded list or a file path that only exists on your laptop. New requirement → edit the file → redeploy → hope.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No output schema.&lt;/strong&gt; Whatever fields happened to be present this run get written. When the source site adds a column, the CSV silently widens. When the source site removes a column, downstream breaks at parse time, three hops away from the cause.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No observability.&lt;/strong&gt; "Did it run last night?" is answered by SSH-ing to the box and &lt;code&gt;ls -la output.csv&lt;/code&gt;. Run history is the file's mtime. Failure mode is "the file is older than expected."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No retries with backoff.&lt;/strong&gt; A 503 from the target site at 02:14 kills the run. There is no second attempt. The next run is in 24 hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No billing surface.&lt;/strong&gt; The cost of running it is your time and your server. There is no per-unit price, so there is no signal that the unit economics are bad until you check the AWS bill.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A script is fine for "I need this data once." It is not fine for "we need this data nightly for the next two years." But teams keep shipping a one-off script to meet the nightly requirement.&lt;/p&gt;
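
&lt;p&gt;Failure mode 4 is the easiest to picture in code. Below is a minimal sketch of an in-process retry with exponential backoff, assuming the work is wrapped in a zero-argument callable. The names &lt;code&gt;retry_with_backoff&lt;/code&gt; and &lt;code&gt;flaky_fetch&lt;/code&gt; are hypothetical, and on an actor platform the runtime would normally own this logic rather than your code:&lt;/p&gt;

```python
import time


def retry_with_backoff(fn, attempts=4, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**n (capped), then try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the real error, do not swallow it.
            sleep(min(max_delay, base_delay * 2 ** attempt))


# Demo: a hypothetical fetch that fails twice with a 503, then succeeds.
calls = {"n": 0}
waits = []


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] != 3:
        raise RuntimeError("503 from target site")
    return "ok"


# Record the waits instead of sleeping so the demo finishes instantly.
result = retry_with_backoff(flaky_fetch, sleep=waits.append)
print(result, waits)  # ok [1.0, 2.0]
```

&lt;p&gt;The important contrast with &lt;code&gt;try/except: pass&lt;/code&gt; is the last attempt: the error is re-raised, so a run that cannot recover fails visibly instead of producing a stale file.&lt;/p&gt;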

&lt;h2&gt;
  
  
  What an actor is
&lt;/h2&gt;

&lt;p&gt;Strip the marketing word and an actor is just: a containerised job with a declared input schema, a declared output schema, and a runtime that handles scheduling, retries, logs, persistent storage, and billing. Apify is one implementation — there are others. The shape matters more than the vendor.&lt;/p&gt;

&lt;p&gt;When we rebuilt our Bayut property scraper as an actor, four things changed at the level of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 1. Input is validated against a schema before main() runs.&lt;/span&gt;
&lt;span class="c1"&gt;//    Bad input fails fast with a useful error, not silent miss.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getInput&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// INPUT_SCHEMA.json enforces shape&lt;/span&gt;

&lt;span class="c1"&gt;// 2. Output goes to a typed dataset. New fields require a schema&lt;/span&gt;
&lt;span class="c1"&gt;//    change — not a silent CSV widening.&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pushData&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="nx"&gt;listingId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;lng&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;scrapedAt&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// 3. Failures retry with backoff at the platform level.&lt;/span&gt;
&lt;span class="c1"&gt;//    Our code throws; the runtime decides what to do.&lt;/span&gt;
&lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ScrapeFailure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;listing-blocked&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// 4. Logs are structured, queryable, and indexed by run.&lt;/span&gt;
&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;rate-limit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;retryAfter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Same Playwright, same selectors, same scraping logic. The difference is that all the boring infrastructure — input validation, output typing, retries, logs, scheduling, billing — is no longer your problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnljgnw2tyox3h6lvdcm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnljgnw2tyox3h6lvdcm.png" alt="Fig. 1 — Concerns owned by the developer (script) vs. concerns owned by the runtime (actor)." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;

&lt;p&gt;For Bayut specifically, three months after the migration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mean time to detect a breakage&lt;/strong&gt; went from ~36 hours (next-day stakeholder complaint) to under 15 minutes (failed runs alert with the offending URL and HTTP status).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support tickets&lt;/strong&gt; dropped 70%. Most of the volume was "the data is missing" — invisible failures from the cron-script era. With per-run datasets, failed runs surface themselves.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost per 1000 listings&lt;/strong&gt; went &lt;em&gt;down&lt;/em&gt;, not up. Concurrency at the runtime level is cheaper than spinning up your own queue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The migration itself took about a week. Most of the time was not the scraping logic — that was already there. It was deciding what the input schema &lt;em&gt;should&lt;/em&gt; be, what the output schema &lt;em&gt;should&lt;/em&gt; be, and which fields were "nice to have" vs "the dataset is broken without this."&lt;/p&gt;

&lt;h2&gt;
  
  
  The replacement pattern
&lt;/h2&gt;

&lt;p&gt;If you're sitting on a script-shaped scraper right now, the migration order is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Write the input schema. Force every run to declare what it's scraping.&lt;/li&gt;
&lt;li&gt;Write the output schema. Force every row to validate before it gets persisted.&lt;/li&gt;
&lt;li&gt;Move retries from &lt;code&gt;try/except: pass&lt;/code&gt; to the runtime.&lt;/li&gt;
&lt;li&gt;Replace &lt;code&gt;print()&lt;/code&gt; with structured logs.&lt;/li&gt;
&lt;li&gt;Containerise. Whatever runs in &lt;code&gt;python main.py&lt;/code&gt; should run in &lt;code&gt;docker run&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Pick a runtime — Apify, your own k8s cron, whatever. The schema work is portable.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You do steps 1–5 inside your existing repo. You haven't committed to a vendor yet. By the time you reach step 6, the actor &lt;em&gt;exists&lt;/em&gt; — the runtime is just a deployment target.&lt;/p&gt;
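
&lt;p&gt;Steps 2 and 4 fit in a few dozen lines before any vendor is involved. The sketch below is one possible shape, not the code from our actor: &lt;code&gt;LISTING_SCHEMA&lt;/code&gt;, &lt;code&gt;validate_row&lt;/code&gt;, and &lt;code&gt;log_event&lt;/code&gt; are hypothetical names, and the field list simply mirrors the dataset example earlier in this post.&lt;/p&gt;

```python
import json

# Hypothetical output schema: field name mapped to (expected type, required?).
LISTING_SCHEMA = {
    "listingId": (str, True),
    "price": (float, True),
    "currency": (str, True),
    "address": (str, False),
}


def validate_row(row, schema=LISTING_SCHEMA):
    """Return a list of problems; an empty list means the row may be persisted."""
    problems = []
    for field, (expected, required) in schema.items():
        value = row.get(field)
        if value is None:
            if required:
                problems.append(f"missing required field: {field}")
        elif not isinstance(value, expected):
            problems.append(f"{field}: expected {expected.__name__}, got {type(value).__name__}")
    # Fields the schema never declared are rejected instead of silently widening the output.
    for field in sorted(set(row) - set(schema)):
        problems.append(f"undeclared field: {field}")
    return problems


def log_event(level, event, **fields):
    """Step 4: emit one JSON object per line instead of a bare print()."""
    line = json.dumps({"level": level, "event": event, **fields}, sort_keys=True)
    print(line)
    return line


# Hypothetical scraped row: price arrived as a string and an extra field crept in.
row = {"listingId": "L-1", "price": "1200000", "currency": "AED", "agent": "x"}
problems = validate_row(row)
if problems:
    log_event("error", "row-rejected", listingId=row["listingId"], problems=problems)
```

&lt;p&gt;Nothing here depends on a runtime. The same schema and log format carry over unchanged when you reach step 6.&lt;/p&gt;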

&lt;p&gt;We packaged this migration shape into a starter we use for every new client engagement — same six steps that produced the &lt;a href="https://apify.com/sian.agency/bayut-property-scraper?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=jonas&amp;amp;utm_content=scripts-vs-actors-build-actors-instead" rel="noopener noreferrer"&gt;Bayut property scraper&lt;/a&gt; above. Same six steps, every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which of the five failure modes is currently shipping in your stack?&lt;/strong&gt; Drop it in the comments — I'll point at the smallest change that fixes it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by &lt;strong&gt;Jonas Keller&lt;/strong&gt;, Senior Automation Architect at SIÁN Agency. Find more from Jonas on &lt;a href="https://dev.to/sian-agency"&gt;dev.to&lt;/a&gt;. For custom scraping or automation work, &lt;a href="https://sian.agency?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=jonas&amp;amp;utm_content=scripts-vs-actors-build-actors-instead" rel="noopener noreferrer"&gt;hire SIÁN Agency&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>automation</category>
      <category>python</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>If Your Scraper Uses Regex on HTML, You're Already Broken</title>
      <dc:creator>SIÁN Agency</dc:creator>
      <pubDate>Thu, 14 May 2026 14:30:00 +0000</pubDate>
      <link>https://dev.to/sian-agency/if-your-scraper-uses-regex-on-html-youre-already-broken-5d7h</link>
      <guid>https://dev.to/sian-agency/if-your-scraper-uses-regex-on-html-youre-already-broken-5d7h</guid>
      <description>&lt;p&gt;If your "scraper" is a &lt;code&gt;requests.get()&lt;/code&gt; followed by &lt;code&gt;re.findall(r'&amp;lt;div class=\"price\"&amp;gt;.*?&amp;lt;/div&amp;gt;', html)&lt;/code&gt;, I have bad news.&lt;/p&gt;

&lt;p&gt;You don't have a scraper. You have a layout sensor. The first time the dev team renames the class, adds a wrapper &lt;code&gt;&amp;lt;span&amp;gt;&lt;/code&gt;, or A/B tests a new pricing component, your pipeline goes silent. Not loud, not error-throwing — silent. Empty rows in the dataset. No alarm. You find out a week later when a stakeholder asks why the dashboard looks weird.&lt;/p&gt;

&lt;p&gt;I rebuilt our Idealista scraper this quarter and the regex stage was the thing I deleted first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-item checklist
&lt;/h2&gt;

&lt;p&gt;Before you write another &lt;code&gt;re.findall&lt;/code&gt; against HTML, check:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Is there a stable accessibility role or label?&lt;/strong&gt; (&lt;code&gt;getByRole('heading', { name: /price/i })&lt;/code&gt; — survives class renames.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the data actually in the rendered page, or is it injected via JSON?&lt;/strong&gt; (Often the JSON-LD &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; block has everything you need, no DOM walking.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does a schema violation fail loudly?&lt;/strong&gt; (If a field is missing, raise an error. Don't silently default to &lt;code&gt;None&lt;/code&gt;.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the answer to all three is no, you're not scraping. You're guessing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 10-line replacement
&lt;/h2&gt;

&lt;p&gt;Here's the pattern I keep copying into new actors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;async_playwright&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_listing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Pull JSON-LD first — it's the spec, not the styling.
&lt;/span&gt;    &lt;span class="n"&gt;ld_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;script[type=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/ld+json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text_content&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ld_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;offers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;currency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;offers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;priceCurrency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;streetAddress&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ten lines. No regex. No CSS class names. No &lt;code&gt;BeautifulSoup&lt;/code&gt; chain that breaks when someone wraps the price in a new &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Why this works: JSON-LD is what Idealista, Bayut, and most listing sites publish for Google. It's stable because it's a contract with search engines, not with your scraper. When the visual layout changes, the JSON-LD almost always doesn't.&lt;/p&gt;
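
&lt;p&gt;One assumption baked into the snippet above is that the first JSON-LD block is the listing. A page can carry several blocks (breadcrumbs, organization info), and a block can be a single object, a list, or an &lt;code&gt;@graph&lt;/code&gt; wrapper. A hedged sketch of selecting the node by &lt;code&gt;@type&lt;/code&gt; and failing loudly when it is absent; &lt;code&gt;pick_node&lt;/code&gt; and the sample payload are hypothetical:&lt;/p&gt;

```python
import json


def pick_node(raw_json_ld, wanted_type):
    """Return the first JSON-LD node whose @type matches, or raise if none does."""
    payload = json.loads(raw_json_ld)
    # A block may hold a single object, a list of objects, or an @graph wrapper.
    nodes = payload if isinstance(payload, list) else [payload]
    flat = []
    for node in nodes:
        if isinstance(node, dict):
            flat.extend(node.get("@graph", [node]))
    for node in flat:
        if not isinstance(node, dict):
            continue
        types = node.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if wanted_type in types:
            return node
    raise LookupError(f"no JSON-LD node of type {wanted_type!r}")


# Hypothetical payload: a breadcrumb block followed by the product itself.
raw = json.dumps([
    {"@type": "BreadcrumbList", "itemListElement": []},
    {"@type": "Product", "offers": {"price": "450000", "priceCurrency": "EUR"}},
])
product = pick_node(raw, "Product")
print(product["offers"]["price"])  # 450000
```

&lt;p&gt;The &lt;code&gt;LookupError&lt;/code&gt; is the point: if the site stops publishing the node you rely on, the run stops with a readable message instead of emitting empty rows.&lt;/p&gt;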

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0icit1xkyur4ymr12w00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0icit1xkyur4ymr12w00.png" alt="Fig. 1 — One pattern, three legitimate variants the regex doesn't match." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick case
&lt;/h2&gt;

&lt;p&gt;Our Idealista actor went from 4 selector-related breakages per month to zero in the quarter after I switched extraction to JSON-LD + accessibility selectors. The breakages we still see are real changes — new property types, new fields — and they fail loud now, with a schema validation error, instead of silently returning empty strings.&lt;/p&gt;

&lt;p&gt;That's the bar: when the site changes, your scraper either keeps working or throws an error you can read. "Returns empty rows" is not acceptable behaviour.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CTA you didn't ask for
&lt;/h2&gt;

&lt;p&gt;This pattern is now the default starter for every actor we ship — visible in the &lt;a href="https://apify.com/sian.agency/smart-idealista-scraper?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nova&amp;amp;utm_content=stop-using-regex-on-html" rel="noopener noreferrer"&gt;Idealista actor&lt;/a&gt;. Faster runs, fewer 3am Slack messages from clients asking why their CSV is half-empty. We turned the JSON-LD-first extractor into a reusable module that drops into any new actor in about a minute.&lt;/p&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open your scraper. Search for &lt;code&gt;re.findall&lt;/code&gt;, &lt;code&gt;re.search&lt;/code&gt;, or &lt;code&gt;BeautifulSoup&lt;/code&gt; chained more than two &lt;code&gt;.find()&lt;/code&gt; deep.&lt;/strong&gt; Drop the worst offender in the comments — I'll show you the JSON-LD or selector replacement.&lt;/p&gt;

&lt;p&gt;Agree, disagree, or got a site where this falls apart? Reply.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by &lt;strong&gt;Nova Chen&lt;/strong&gt;, Automation Dev Advocate at SIÁN Agency. Find more from Nova on &lt;a href="https://dev.to/sian-agency"&gt;dev.to&lt;/a&gt;. For custom scraping or automation work, &lt;a href="https://sian.agency?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nova&amp;amp;utm_content=stop-using-regex-on-html" rel="noopener noreferrer"&gt;hire SIÁN Agency&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Rate Limits Are a Feature, Not a Bug</title>
      <dc:creator>SIÁN Agency</dc:creator>
      <pubDate>Thu, 07 May 2026 05:39:33 +0000</pubDate>
      <link>https://dev.to/sian-agency/rate-limits-are-a-feature-not-a-bug-4lnm</link>
      <guid>https://dev.to/sian-agency/rate-limits-are-a-feature-not-a-bug-4lnm</guid>
      <description>&lt;p&gt;Most scraper "incidents" I'm pulled into start the same way: someone shows me a graph of 429 responses and asks how to make them go away. The honest answer — that nobody likes — is that &lt;strong&gt;the 429s are the well-behaved part of the system&lt;/strong&gt;. The rest is what's broken.&lt;/p&gt;

&lt;p&gt;I'm going to argue that rate limits are not your enemy. They're a contract. And scrapers that treat them like a contract — instead of an obstacle — are the only ones I trust to run unsupervised for more than a quarter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The teardown
&lt;/h2&gt;

&lt;p&gt;Three things teams typically do when they hit rate limits, in order of how bad they are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Add proxies.&lt;/strong&gt; "If they limit &lt;em&gt;me&lt;/em&gt;, I'll just &lt;em&gt;be more people&lt;/em&gt;." This works for about six weeks. Then the target site fingerprints your residential proxy pool and you're back to where you started, with a higher monthly bill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decrease delays.&lt;/strong&gt; "If we go faster, we'll finish before they notice." Faster only matters if the request budget exists. Going faster against a hard limit just stacks failures earlier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry harder.&lt;/strong&gt; Add exponential backoff with a 30-minute cap. Now your "1-hour scraper" is a 4-hour scraper that completes when the throttle window expires.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All three are forms of the same denial: refusing to accept that the source site is telling you the rate at which they're willing to serve you data. They are. You should listen.&lt;/p&gt;

&lt;h2&gt;
  
  
  What rate limits actually are
&lt;/h2&gt;

&lt;p&gt;A rate limit is the source-site engineer's way of saying: &lt;em&gt;here is the contract under which my system stays healthy&lt;/em&gt;. They published the rate (often: in headers) because they've measured what their infrastructure can serve before things degrade. When you exceed it, you don't just hurt yourself — you contribute to the conditions that get scrapers blocked entirely.&lt;/p&gt;

&lt;p&gt;There are three signals you should be reading from every response, not just the body:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Retry-After&lt;/code&gt; header.&lt;/strong&gt; This is the source telling you when it'll talk to you again, either as a number of seconds or as an HTTP date. Respect it literally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;X-RateLimit-Remaining&lt;/code&gt; (or equivalent).&lt;/strong&gt; Some sites publish their budget. Use it. Slow down before you hit zero.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status code distribution over time.&lt;/strong&gt; If your 200 rate is dropping while 429 rises, you're approaching a soft limit you can't see. Back off proactively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're not reading those, your scraper is operating blind against an opponent who is leaving the lights on for you.&lt;/p&gt;
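
&lt;p&gt;Reading the first signal correctly takes a little care, because &lt;code&gt;Retry-After&lt;/code&gt; may hold either a delay in seconds or an HTTP date. A small sketch using only the standard library; &lt;code&gt;retry_after_seconds&lt;/code&gt; is a hypothetical helper name, and treating an unparseable value as "no guidance" is an assumption you may want to change:&lt;/p&gt;

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value to a non-negative wait in seconds.

    Returns None when the header is absent or cannot be parsed.
    """
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        # Delay form, e.g. "120".
        return float(value)
    try:
        # Date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative sleep.
    return max(0.0, (when - now).total_seconds())


print(retry_after_seconds("120"))  # 120.0
```

&lt;p&gt;Feed the result straight into your sleep call when it is not &lt;code&gt;None&lt;/code&gt;, and fall back to your own backoff only when the server gave no guidance.&lt;/p&gt;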

&lt;h2&gt;
  
  
  The replacement pattern
&lt;/h2&gt;

&lt;p&gt;Here's the rate-aware request loop I drop into every actor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deque&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RateBudget&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Token bucket — refills at `rate` per second, max `burst` tokens.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;burst&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;burst&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;burst&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;burst&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_refill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;burst&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_refill&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_refill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;take&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;retry_after&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Retry-After&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;60&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retry_after&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things this does that "decrease the delay" doesn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Token bucket means the rate is global, not per-request.&lt;/strong&gt; Concurrency works without exceeding the contract.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Retry-After&lt;/code&gt; is honoured literally.&lt;/strong&gt; No exponential backoff guessing — the source already told you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No proxy rotation.&lt;/strong&gt; You don't need to be more people. You need to be one &lt;em&gt;well-behaved&lt;/em&gt; person.&lt;/li&gt;
&lt;/ul&gt;
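
&lt;p&gt;One caveat on honouring &lt;code&gt;Retry-After&lt;/code&gt; literally: the HTTP spec allows the header to be either a number of seconds or an HTTP-date, and a bare &lt;code&gt;int()&lt;/code&gt; raises on the date form. A minimal sketch of a parser that accepts both follows. The name &lt;code&gt;retry_after_seconds&lt;/code&gt; is hypothetical, and the 60-second default simply mirrors the fallback in the snippet above.&lt;/p&gt;

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, default=60.0, now=None):
    """Convert a Retry-After header value into a non-negative delay in seconds.

    Hypothetical helper: accepts the delta-seconds form ("120") and the
    HTTP-date form ("Wed, 21 Oct 2026 07:28:00 GMT"). Anything else falls
    back to `default`, which is an assumption, not a value from the source.
    """
    if value is None:
        return default
    value = value.strip()
    if value.isdecimal():
        # delta-seconds form
        return float(value)
    try:
        # HTTP-date form
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        # Treat a naive date as UTC so the subtraction below is valid.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative sleep.
    return max(0.0, (when - now).total_seconds())
```

&lt;p&gt;In &lt;code&gt;fetch&lt;/code&gt;, the result could be passed straight to &lt;code&gt;asyncio.sleep&lt;/code&gt; in place of the &lt;code&gt;int(...)&lt;/code&gt; call.&lt;/p&gt;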

&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufxyhfiqg2szelf6c9a3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufxyhfiqg2szelf6c9a3.png" alt="Fig. 2 — Aggressive vs polite request rate over time on the same workload. Same code, different contract with the source." width="800" height="597"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the two scrapers I migrated to this pattern this quarter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Idealista.&lt;/strong&gt; 429 rate dropped from 8% to 0.4%. Total run time went &lt;em&gt;up&lt;/em&gt; by 11% (from 47min to 52min average) — because we stopped hammering. Per-run cost went &lt;em&gt;down&lt;/em&gt; 38% — because we stopped paying for retries that were never going to succeed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sephora.&lt;/strong&gt; 429 rate from 15% to &amp;lt;1%. Run time about the same. Block rate (full IP block requiring rotation) went from "monthly" to "zero in the last 90 days." This one's the real win — we used to burn a residential proxy pool subscription. Now we don't need it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern that emerges every time: &lt;strong&gt;respecting the rate makes you slower per-request, but more reliable per-run, and significantly cheaper per-result.&lt;/strong&gt; The unit economics of a polite scraper beat the unit economics of an aggressive one. By a lot.&lt;/p&gt;

&lt;h2&gt;
  
  
  When it's wrong
&lt;/h2&gt;

&lt;p&gt;This is wrong if the source site doesn't publish a contract — no &lt;code&gt;Retry-After&lt;/code&gt;, no rate header, just blanket blocks. There you genuinely are guessing. But the guess should still bias toward "much slower than you think you need to be," not toward "more proxies." A token bucket at 1 req/sec is a fine starting point for an unknown site; you can ratchet up while watching error rates.&lt;/p&gt;
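
&lt;p&gt;"Ratchet up while watching error rates" can be sketched as a small additive-increase, multiplicative-decrease rule applied between runs. Everything numeric here is an assumption for illustration: the 1% tolerated 429 rate, the 0.1 req/sec step, and the floor and ceiling are placeholders to tune per site, and &lt;code&gt;next_rate&lt;/code&gt; is a hypothetical name, not part of the middleware described in this post.&lt;/p&gt;

```python
def next_rate(current_rate, requests, throttled,
              step=0.1, floor=0.2, ceiling=5.0, tolerated=0.01):
    """Return the req/sec rate to try on the next run.

    Hypothetical policy: halve the rate when more than `tolerated` of the
    requests came back throttled (429 or blocked), otherwise creep upward
    by a small fixed step. All thresholds are illustrative assumptions.
    """
    if requests == 0:
        # Nothing observed, so there is no evidence to change the rate.
        return current_rate
    error_rate = throttled / requests
    if error_rate > tolerated:
        # Back off hard: the source is signalling that this is too fast.
        return max(floor, current_rate / 2)
    # Clean run: probe slightly faster, but never past the ceiling.
    return min(ceiling, current_rate + step)
```

&lt;p&gt;Starting from 1 req/sec, a clean run would move the next run to about 1.1 req/sec, while a run with 5% throttled responses would drop it to 0.5.&lt;/p&gt;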

&lt;p&gt;This is also wrong if you have explicit business permission to scrape at higher rates — a partnership, an API key, a contract. Those are different relationships. The advice here is for scrapers running against the public web, where 429 is the only contract you have.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Stop thinking of rate limits as the cost of doing business. Start thinking of them as a free service the target site is providing you: telling you exactly how to stay welcome. Most blocked scrapers I see were blocked not because they "got caught" but because they ignored repeated, clearly articulated signals that they were being rude.&lt;/p&gt;

&lt;p&gt;We packaged the token bucket + &lt;code&gt;Retry-After&lt;/code&gt; honour into a small middleware that sits in front of every actor we ship — visible across our &lt;a href="https://apify.com/sian.agency?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=jonas&amp;amp;utm_content=rate-limits-are-a-feature" rel="noopener noreferrer"&gt;Apify portfolio&lt;/a&gt;. About 30 lines of code. It's the most boring reliability win I've shipped this year, and the most consistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which response header is your scraper currently ignoring?&lt;/strong&gt; Drop it in the comments — I'll show you what to do with it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by &lt;strong&gt;Jonas Keller&lt;/strong&gt;, Senior Automation Architect at SIÁN Agency. Find more from Jonas on &lt;a href="https://dev.to/sian-agency"&gt;dev.to&lt;/a&gt;. For custom scraping or automation work, &lt;a href="https://sian.agency?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=jonas&amp;amp;utm_content=rate-limits-are-a-feature" rel="noopener noreferrer"&gt;hire SIÁN Agency&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>api</category>
      <category>architecture</category>
      <category>softwareengineering</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Instagram Reel Transcripts in 5 Lines — and Word-Level Timestamps Are Free</title>
      <dc:creator>SIÁN Agency</dc:creator>
      <pubDate>Sat, 02 May 2026 04:54:03 +0000</pubDate>
      <link>https://dev.to/sian-agency/instagram-reel-transcripts-in-5-lines-and-word-level-timestamps-are-free-3d7a</link>
      <guid>https://dev.to/sian-agency/instagram-reel-transcripts-in-5-lines-and-word-level-timestamps-are-free-3d7a</guid>
      <description>&lt;p&gt;If you've ever priced Instagram transcription at scale, you already know the trap: per-video pricing on the SaaS tier, plus an upcharge for word-level timestamps. Run the math on 500 reels and you'll close the tab.&lt;/p&gt;

&lt;p&gt;I'm not going to talk you out of building your own pipeline. I'm just going to show you the five lines I run when I don't want to.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trap: per-URL pricing on transcript metadata
&lt;/h2&gt;

&lt;p&gt;Most Instagram transcription APIs in 2026 charge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A base rate per processed video.&lt;/li&gt;
&lt;li&gt;Sometimes a separate rate per minute of audio.&lt;/li&gt;
&lt;li&gt;An &lt;em&gt;additional&lt;/em&gt; fee to expose word-level timestamps (the thing you actually need if you're building captions, search, or any kind of clip editor).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That works for a single creator's library. It does not work for an agency processing client A's 200 reels, then client B's 1,000.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five-line replacement
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sian.agency/instagram-ai-transcript-unlimited&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bulkUrls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.instagram.com/reel/DG06PnPT9aT/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wordLevelTimestamps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;())[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three input fields you actually need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;instagramUrl&lt;/code&gt; (string) — single reel or video post. Pattern enforced; &lt;code&gt;/reels/&lt;/code&gt; auto-corrects to &lt;code&gt;/reel/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bulkUrls&lt;/code&gt; (array) — paste 1, paste 1,000. Bulk edit, .txt upload, manual list. Same input shape regardless of volume.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;wordLevelTimestamps&lt;/code&gt; (boolean, default &lt;code&gt;true&lt;/code&gt;) — get a per-word timestamp on every transcript. &lt;strong&gt;Free.&lt;/strong&gt; You don't pay extra for it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That third one is the point of this post. It's on by default. Most tools hide it behind a paywall. This one doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you can't transcribe
&lt;/h2&gt;

&lt;p&gt;Be honest about the constraints up front:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Image carousels&lt;/strong&gt; — no audio, nothing to transcribe.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Music-only videos&lt;/strong&gt; — no spoken audio, the transcript will be empty.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Private profiles&lt;/strong&gt; — Instagram doesn't expose private content to public-side requests, so the actor only handles public reels and posts.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're building a "scrape any Instagram URL" feature, you'll hit those edges. The actor returns a clear error per URL — handle it client-side and skip silently.&lt;/p&gt;
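
&lt;p&gt;A minimal sketch of that client-side handling, assuming each dataset item is a dict and that a failed URL carries a truthy &lt;code&gt;error&lt;/code&gt; key. That key name is my assumption for illustration, so check the actor's actual output before relying on it. Only &lt;code&gt;transcript&lt;/code&gt; comes from the snippet above, and &lt;code&gt;split_results&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
def split_results(items):
    """Separate usable transcript items from ones to skip.

    Assumption: a failed URL shows up as an item with a truthy "error"
    key (hypothetical field name). Items with an empty or missing
    "transcript", such as music-only videos, are skipped as well.
    """
    usable, skipped = [], []
    for item in items:
        if item.get("error") or not item.get("transcript"):
            skipped.append(item)
        else:
            usable.append(item)
    return usable, skipped
```

&lt;p&gt;Keeping the skipped items in their own list, rather than dropping them, leaves room to log a count per run even when the user-facing behaviour is to skip silently.&lt;/p&gt;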

&lt;h2&gt;
  
  
  Why "unlimited" is a real claim, not marketing
&lt;/h2&gt;

&lt;p&gt;The actor doesn't charge per validated URL. It charges for compute time per run. If you're processing 1,000 reels in one batch, that's one run. The pricing model rewards batching, which is what you want anyway — bulk is faster than serial because the runtime queue stays warm.&lt;/p&gt;

&lt;p&gt;I migrated an agency client's Instagram audit workflow last week. Old setup: a per-video API at $0.05 + $0.02 word-timestamp upcharge — $35 for 500 reels per audit. New setup: one bulk run, predictable monthly compute. Roughly 1/4 the cost at their volume, and the dataset shape is identical.&lt;/p&gt;
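
&lt;p&gt;The arithmetic behind those figures, worked in integer cents to avoid float rounding. The per-video prices and the 500-reel audit size come from the paragraph above. The last line only applies the "roughly 1/4" estimate to the old cost and is not a quoted price.&lt;/p&gt;

```python
BASE_CENTS = 5         # $0.05 per video on the old per-video API
TIMESTAMP_CENTS = 2    # $0.02 word-timestamp upcharge per video
REELS_PER_AUDIT = 500


def audit_cost_dollars(reels, base_cents=BASE_CENTS, extra_cents=TIMESTAMP_CENTS):
    """Cost in dollars of one audit under per-video pricing."""
    return reels * (base_cents + extra_cents) / 100


old_cost = audit_cost_dollars(REELS_PER_AUDIT)
print(old_cost)       # 35.0
print(old_cost / 4)   # 8.75, i.e. the "roughly 1/4" estimate per audit
```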

&lt;h2&gt;
  
  
  What to do next
&lt;/h2&gt;

&lt;p&gt;If you want to see what 30+ data points + word-level transcripts look like for your own client list, run it once: &lt;a href="https://apify.com/sian.agency/instagram-ai-transcript-unlimited?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nova&amp;amp;utm_content=instagram-reel-transcripts-word-timestamps" rel="noopener noreferrer"&gt;Instagram AI Transcript Unlimited&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Single-URL test costs less than a coffee. Bulk run is unlimited.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tell me where this breaks.&lt;/strong&gt; If you've found a public reel format the URL pattern misses, drop it in the comments. I'll get the maintainer to ship a fix in the next build.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by &lt;strong&gt;Nova Chen&lt;/strong&gt;, Automation Dev Advocate at SIÁN Agency. Find more from Nova on &lt;a href="https://dev.to/sian-agency"&gt;dev.to&lt;/a&gt;. For custom scraping or automation work, &lt;a href="https://sian.agency?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nova&amp;amp;utm_content=instagram-reel-transcripts-word-timestamps" rel="noopener noreferrer"&gt;hire SIÁN Agency&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>api</category>
      <category>automation</category>
      <category>socialmedia</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>I Stopped Writing TikTok Scrapers. Five Lines of Python Replaced Them.</title>
      <dc:creator>SIÁN Agency</dc:creator>
      <pubDate>Mon, 27 Apr 2026 13:34:57 +0000</pubDate>
      <link>https://dev.to/sian-agency/i-stopped-writing-tiktok-scrapers-five-lines-of-python-replaced-them-5824</link>
      <guid>https://dev.to/sian-agency/i-stopped-writing-tiktok-scrapers-five-lines-of-python-replaced-them-5824</guid>
      <description>&lt;p&gt;If your TikTok scraper still uses Playwright + custom selectors, this post will annoy you. Good. Read it anyway.&lt;/p&gt;

&lt;p&gt;I burned three weekends last quarter on a "minimal" TikTok scraper. Selector-first, headless, the works. Worked beautifully for nine days. Then TikTok shipped a layout change at 2am UTC and my fixtures became fiction.&lt;/p&gt;

&lt;p&gt;The honest answer most devs avoid: &lt;strong&gt;for known platforms with stable APIs around them, you should not be writing the scraper.&lt;/strong&gt; You should be calling someone's actor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stop owning the layer that breaks
&lt;/h2&gt;

&lt;p&gt;Three things break a TikTok scraper, and none of them are about your code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Layout drift.&lt;/strong&gt; Selectors are a liability the second TikTok touches the DOM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth + rate-limit games.&lt;/strong&gt; Cloudflare, fingerprinting, the whole party.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio extraction + transcription.&lt;/strong&gt; Even if you got the video, now you need Whisper, ffmpeg, a queue, and a dead body to bury when it OOMs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You're not getting paid to maintain that. You're getting paid to ship the thing on top of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What replaced 800 lines of Python for me
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ApifyClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_APIFY_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sian.agency/best-tiktok-ai-transcript-extractor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bulkUrls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.tiktok.com/@user/video/7565659068153531669&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;defaultDatasetId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;iterate_items&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole thing. Five lines. The actor's input schema has exactly two fields you need to know about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;tiktokUrl&lt;/code&gt; (string) — single video. Pass any URL format. Short links from &lt;code&gt;vm.tiktok.com&lt;/code&gt; get resolved. Mobile share URLs work.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bulkUrls&lt;/code&gt; (array) — paste 5, 50, or 500. Bulk edit, file upload, line-separated, comma-separated. It doesn't care.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the entire input surface. Two keys. No proxy config, no captcha settings, no "headless or headful" debate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you get back
&lt;/h2&gt;

&lt;p&gt;Per video, you get the AI transcript (99%+ accuracy claimed by the actor — empirically I see ~98% on English, lower on heavy slang) plus 45 metadata fields: views, likes, shares, creator stats, hashtags, music ID, location, content categories. The transcript ships with detected language and segment timing, so you can search inside videos like text.&lt;/p&gt;
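
&lt;p&gt;"Search inside videos like text" might look like the sketch below. The item shape is an assumption: I'm treating each dataset item as a dict with a &lt;code&gt;url&lt;/code&gt; and a &lt;code&gt;segments&lt;/code&gt; list whose entries have &lt;code&gt;start&lt;/code&gt; (seconds) and &lt;code&gt;text&lt;/code&gt;. Those field names are hypothetical, so map them to whatever the actor actually returns.&lt;/p&gt;

```python
def find_mentions(items, keyword):
    """Yield (url, start_seconds, text) for segments that mention keyword.

    Assumed, hypothetical item shape:
        {"url": str, "segments": [{"start": float, "text": str}, ...]}
    Matching is case-insensitive substring search, nothing smarter.
    """
    needle = keyword.casefold()
    for item in items:
        # Tolerate items with a missing or null segment list.
        for segment in item.get("segments") or []:
            text = segment.get("text", "")
            if needle in text.casefold():
                yield item.get("url"), segment.get("start"), text
```

&lt;p&gt;Because it is a generator, it can be fed the result of &lt;code&gt;iterate_items()&lt;/code&gt; directly without loading the whole dataset into memory first.&lt;/p&gt;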

&lt;p&gt;I rewrote a competitor-monitoring pipeline last month using this. Old stack: Playwright cluster + Whisper container + Redis + a cron + a Slack channel where I apologized weekly. New stack: a 60-line Python script and the actor. Same dataset, less surface area, no apologies.&lt;/p&gt;

&lt;h2&gt;
  
  
  The objection I keep getting
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"Why pay per run when I can self-host?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because your time isn't free, and you don't actually self-host — you self-rebuild every two weeks when something shifts. The actor charges per validated result. You only pay for the runs that gave you usable data. That's a different cost model than "compute hours your worker spent crashing."&lt;/p&gt;

&lt;p&gt;If your volume is genuinely huge, sure, build it. But "huge" is an engineering decision, not a default.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it on your own URL
&lt;/h2&gt;

&lt;p&gt;The free tier handles 5 videos per run, 8s delay between them. If you want to see the dataset shape for your own use case, drop a TikTok URL in and watch it run: &lt;a href="https://apify.com/sian.agency/best-tiktok-ai-transcript-extractor?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nova&amp;amp;utm_content=tiktok-transcripts-5-lines-python" rel="noopener noreferrer"&gt;TikTok AI Transcript Extractor on Apify&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Bulk mode is paid — unlimited per run, no delays, no per-video charges. Use it when you're past the experiment phase.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Disagree?&lt;/strong&gt; Drop the snippet you're using to scrape TikTok in the comments. I'll tell you which line is going to break first. Be specific — "I use Puppeteer" is not a snippet.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written by &lt;strong&gt;Nova Chen&lt;/strong&gt;, Automation Dev Advocate at SIÁN Agency. Find more from Nova on &lt;a href="https://dev.to/sian-agency"&gt;dev.to&lt;/a&gt;. For custom scraping or automation work, &lt;a href="https://sian.agency?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=nova&amp;amp;utm_content=tiktok-transcripts-5-lines-python" rel="noopener noreferrer"&gt;hire SIÁN Agency&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
