DEV Community: SIÁN Agency

Someone Ran My Scraper 1,251 Times and Paid Me Nothing

SIÁN Agency — Mon, 06 Jul 2026 15:10:15 +0000

Someone ran one of my public actors 1,251 times in four days. I earned roughly enough to cover a sandwich.

If your first instinct is "billing bug," you're going to lose this game. Mine wasn't broken. The runs were real, the results were real, and every single one was engineered to pay me nothing.

The spike you'll mistake for success

Here's what the dashboard showed: a wall of runs. More traffic in four days than the actor saw the entire previous month. Glance at the top-line chart and that's a growth story. You screenshot it. You feel good.

Then you look at who. In four days: 1,251 runs across 761 different accounts. Almost all on the free plan. Every request through the API, none through the UI. Around 30,000 results pulled.

That same actor had 13 real users the entire previous month.

Free traffic is not free to you

This is where the mental model breaks. A free-tier user isn't a marketing cost you eat for goodwill. On a scraper, every run I serve calls a paid data API upstream, one I pay for per request. So 30,000 free results is 30,000 units of my quota, spent, for a rounding error in revenue.

The attacker pays nothing. I pay the upstream. That's the entire play.

The two-fingerprint tell

761 accounts sounds like 761 people. It wasn't. When I grouped the runs by session fingerprint, nearly all of the 1,251 collapsed into two. One fingerprint owned 1,050 runs, the other 183: 1,233 runs between them.

That's not 761 users. That's one operator running a script that mints throwaway accounts. The account count is noise. The fingerprint count is the truth. The moment you see two, you stop investigating and start defending.
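
A minimal sketch of that grouping, assuming you can export run records that carry an account ID and some session fingerprint. The field names `account_id` and `fingerprint` are hypothetical, not a platform API.

```python
from collections import Counter

def summarize_runs(runs):
    """Count runs and distinct accounts per fingerprint, busiest fingerprint first."""
    runs_per_fp = Counter(r["fingerprint"] for r in runs)
    accounts_per_fp = {}
    for r in runs:
        accounts_per_fp.setdefault(r["fingerprint"], set()).add(r["account_id"])
    return [
        {"fingerprint": fp, "runs": n, "accounts": len(accounts_per_fp[fp])}
        for fp, n in runs_per_fp.most_common()
    ]

# Hypothetical export: four accounts, but only two fingerprints behind them.
runs = [
    {"account_id": "a1", "fingerprint": "fp-A"},
    {"account_id": "a2", "fingerprint": "fp-A"},
    {"account_id": "a3", "fingerprint": "fp-A"},
    {"account_id": "a4", "fingerprint": "fp-B"},
]
for row in summarize_runs(runs):
    print(row)
```

If the number of rows is tiny compared with the number of accounts, you are looking at a farm, not an audience.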

Why your billing looked fine the whole time

The trap: I checked the money first. Charged events matched the platform's own numbers to the cent. Everything reconciled. So for an hour I assumed the system was working as intended — because technically it was. Billing was flawless. The abuse lived one layer up, in who was allowed to trigger a paid run at all.

Reconciling your billing tells you nothing about whether you should have run the job.

The brake I should have shipped first

The fix is small and boring, which is how you know it's right. Gate the expensive work behind a paid-plan check that fails before the first upstream call:

// Reject free-tier runs before we spend a cent of upstream quota.
const isPaying =
  onPlatformPaidFlag === true ||        // a real paying user on the platform
  (!isOnPlatform && testOverride);      // local/dev smoke test only

if (!isPaying) {
  throw new Error("Paid plan required — no upstream call, no charge.");
}
// ...only paying runs reach the expensive part

Fast-failing here means the farm gets an immediate error and burns zero quota. It's reversible: one block, deleted the day the farm gives up. No refactor, no rewrite.

The one gotcha before you copy this

Don't gate blindly. The platform runs its own automated test of your public actor on a non-paying run to keep it visible in the store. Reject that too and your listing quietly goes empty and sinks in the rankings. Whitelist the platform's own test origin, then throw on everyone else.
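
One way to express the gate with the whitelist folded in, sketched in Python. All three flags are hypothetical values you would derive from your platform's run metadata; how the platform's own test run identifies itself is not covered here, so treat `is_platform_test` as an assumption.

```python
class PaidPlanRequired(Exception):
    """Raised before any upstream call is made."""

def assert_run_allowed(is_paying, is_platform_test=False, is_local_dev=False):
    """Fail fast unless the run is paid, the platform's own test, or local dev."""
    if is_paying or is_platform_test or is_local_dev:
        return
    raise PaidPlanRequired("Paid plan required: no upstream call, no charge.")

# The platform's health check passes even though it is not a paying run.
assert_run_allowed(is_paying=False, is_platform_test=True)

# Everyone else on the free tier is rejected before any quota is spent.
try:
    assert_run_allowed(is_paying=False)
except PaidPlanRequired as err:
    print(f"rejected: {err}")
```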

Do this today

Ship the brake before you need the forensics. I did it in the wrong order so you don't have to.

If you run public actors, go look at your last big traffic spike right now. Group it by fingerprint, not by account. Tell me what you find in the comments. And if you'd rather your scrapers shipped with the brake already wired in, that's the kind of thing we build at SIÁN Agency.

Written by Nova Chen, Automation Dev Advocate at SIÁN Agency.

Replayable Runs > Faster Runs. Stop Optimising for the Wrong Number.

SIÁN Agency — Mon, 06 Jul 2026 09:00:00 +0000

Most "we made it 3x faster" scraper posts miss the actual point. Speed is rarely the constraint. Replayability is.

If your scraper takes 4 hours to run and a single URL fails halfway through, can you re-run just that URL in 30 seconds? Or do you have to start the whole 4-hour job over?

If the answer is "start over," you don't have a scraper. You have a long-running prayer.

The 3-item checklist

A replayable run looks like this:

Inputs are explicit and persisted. Every URL/parameter the run is processing is written to a queue or dataset before it starts. You can re-read the input list later.
Outputs are addressable per input. You can ask "did URL X succeed?" and get a yes/no, not "well, the run finished, so probably."
Failures are first-class records. Failed inputs go to a separate dataset/queue with the error reason, ready to feed back into a retry run.

When all three hold, "rerun the failures" is a one-liner. When any of them is missing, recovery is manual archaeology.

The trick — input/output as separate datasets

Here's the shape:

import asyncio
from datetime import datetime, timezone

from apify import Actor
from apify.storages import Dataset, RequestQueue

# transcribe(url) is the actor's own coroutine, defined elsewhere.
# Requests are plain dicts here; if your SDK version uses Request objects,
# adapt add_request and request["url"] accordingly.

async def main():
    async with Actor:
        input_data = await Actor.get_input() or {}

        # 1. Push inputs into a queue. Idempotent: re-runs skip already-done items.
        queue = await RequestQueue.open(name="podcast-urls-2026-06")
        for url in input_data.get("urls", []):
            await queue.add_request({"url": url, "uniqueKey": url})

        # 2. Process the queue, splitting outputs into success and failure datasets.
        results = await Dataset.open(name="podcast-results-2026-06")
        failures = await Dataset.open(name="podcast-failures-2026-06")

        while (request := await queue.fetch_next_request()):
            try:
                record = await transcribe(request["url"])
                await results.push_data(record)
                await queue.mark_request_as_handled(request)
            except Exception as e:
                await failures.push_data({
                    "url": request["url"],
                    "error": str(e),
                    "failed_at": datetime.now(timezone.utc).isoformat(),
                })
                await queue.mark_request_as_handled(request)  # don't retry blindly

asyncio.run(main())

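Three storages: input queue, success dataset, failure dataset. The queue is keyed by URL, so adding the same URLs again is a no-op. The failure dataset is the input for the next retry run. Because failed URLs are also marked handled here, that retry run needs a fresh queue name, or the same dedup will skip them.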

Quick case

The podcast transcription actor used to be a 6-hour batch job. When a single episode failed (audio download timeout, transcription model glitch, anything), the recovery story was: "find the failed URL in the logs, hand-craft a one-URL run, hope the second try works."

After moving to the queue + split-dataset pattern:

Failed URLs are visible in a dedicated dataset, with the error and timestamp.
"Retry yesterday's failures" is one button: open the failures dataset, push its rows into a new run as input.
The original run's success dataset doesn't get re-processed — it just gets appended to.

What used to take 30 minutes of manual triage is now a 30-second action. Same scraper, same selectors, same model — different runtime structure.
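
A sketch of how a retry run's input might be derived from the two datasets, with plain lists standing in for dataset rows. The failure row shape follows the snippet above; that success records carry a `url` field is my assumption.

```python
def build_retry_input(failure_rows, success_rows):
    """Return failed URLs that have not succeeded since, de-duplicated, in order."""
    succeeded = {row["url"] for row in success_rows}
    retry, seen = [], set()
    for row in failure_rows:
        url = row["url"]
        if url not in succeeded and url not in seen:
            seen.add(url)
            retry.append(url)
    return {"urls": retry}

# Hypothetical rows: ep1 failed twice, ep2 failed once and later succeeded.
failures = [
    {"url": "https://example.com/ep1", "error": "timeout"},
    {"url": "https://example.com/ep2", "error": "model glitch"},
    {"url": "https://example.com/ep1", "error": "timeout"},
]
successes = [{"url": "https://example.com/ep2"}]
print(build_retry_input(failures, successes))
```

Only `ep1` ends up in the retry input: it failed and has no success record.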

The CTA you didn't ask for

The queue + success-dataset + failure-dataset pattern is the third thing every actor we ship gets, after request blocking and selector ladder — visible in the podcast transcription actor. (We have a starter template now. Same shape every time.)

So:

Open your scraper. If a single URL fails, what does recovery look like? If your answer takes more than one paragraph, drop it in the comments — I'll show you the smaller version.

Agree, disagree, or have a recovery story that doesn't need this? Reply.

Written by Nova Chen, Automation Dev Advocate at SIÁN Agency. Find more from Nova on dev.to. For custom scraping or automation work, hire SIÁN Agency.

I Rewrote Our Instagram Transcript Actor for Pay-Per-Event Pricing. The Economics Flipped.

SIÁN Agency — Thu, 25 Jun 2026 03:00:00 +0000

TL;DR — Moved an Instagram transcript actor from pay-per-result to pay-per-event billing. Three events, not one number. Margin held, retries stopped silently bleeding cash, and the actor is now honest about what it's actually charging for. If your scraper has a "credits" tab in the README, this is for you.

For a year I shipped scrapers the same way everyone does: one big knob — pay-per-result, $X per item, computed at the end of the run. It looked clean from the README. It was a mess underneath.

The actor would start, spin up a browser, hit Instagram, run into a transient block, retry, succeed on three out of ten URLs, and spit back a number. The user paid for three. We absorbed the cost of the seven retries, the cold start, and the GPU minutes the transcription model burned on partial audio. On a good week the unit economics worked. On a bad week — when Instagram changed something and our success rate dropped to 60% — we paid Apify and OpenAI for the privilege of running a free service.

That's the trap pay-per-result puts you in. Your price is fixed. Your cost isn't.

The teardown

Pay-per-result conflates three different things into one transaction:

Setup work — booting the actor, validating input, warming the browser. Happens once per run regardless of how many URLs you pass.
Per-item work — fetching the post, extracting media, calling the transcription model. Scales linearly with input.
Optional premium work — the fast-processing path that costs us more per item but the user explicitly asked for.

Charging one rate for "a result" forces you to subsidise items #1 and #3 out of the margin on item #2. When users bulk-submit URLs, item #1 amortises and you're fine. When they submit one URL at a time, you eat the setup cost on every run. When they enable fast processing on every call, you eat the premium delta on every call.

Apify's pay-per-event model lets you charge for each of these separately. So we did.

The replacement pattern

The new actor declares three events in actor.json:

"monetization": {
  "events": [
    { "name": "ActorRunStarted",            "price": 0.005 },
    { "name": "InstagramContentProcessed",  "price": 0.018 },
    { "name": "FastProcessingUpgrade",      "price": 0.002 }
  ]
}

Then in the actor body, you charge against those events at the moment the work is actually done:

import { Actor, Dataset, log } from 'apify';

await Actor.init();
const input = (await Actor.getInput()) ?? {};

// Setup fee: billed once per run, whatever happens next.
await Actor.charge({ eventName: 'ActorRunStarted' });

// processInstagramPost is the actor's own helper, defined elsewhere.
for (const url of input.bulkUrls ?? []) {
  try {
    const result = await processInstagramPost(url, input.fastProcessing);
    await Dataset.pushData(result);

    // Only charge per item on success.
    await Actor.charge({ eventName: 'InstagramContentProcessed' });

    if (input.fastProcessing) {
      await Actor.charge({ eventName: 'FastProcessingUpgrade' });
    }
  } catch (err) {
    // Failed items don't bill the user. They also don't bleed margin
    // because the run-started fee already covered the setup.
    log.warning(`Skipping ${url}: ${err.message}`);
  }
}

await Actor.exit();

Three lines of policy:

Run starts always bill. $0.005 covers boot. Doesn't matter if zero items succeed.
Per-item billing only fires after pushData. Failures are free for the user — and free of margin loss for us, because we already covered fixed cost.
Premium path bills on top. If the user opted into fast processing, that delta is charged separately and visibly.
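
The arithmetic behind that policy, as a small worked example. The prices are the three declared above plus the old flat $0.025 per result; the run shapes are made up for illustration.

```python
from decimal import Decimal

RUN_STARTED = Decimal("0.005")
ITEM_PROCESSED = Decimal("0.018")
FAST_UPGRADE = Decimal("0.002")
FLAT_PER_RESULT = Decimal("0.025")  # the previous pay-per-result price

def per_event_revenue(successes, fast=False):
    """Revenue for one run under the three-event model."""
    per_item = ITEM_PROCESSED + (FAST_UPGRADE if fast else Decimal("0"))
    return RUN_STARTED + successes * per_item

def flat_revenue(successes):
    """Revenue for one run under flat pay-per-result."""
    return successes * FLAT_PER_RESULT

# Illustrative runs: (label, successful items, fast processing?)
for label, ok, fast in [
    ("all failed", 0, False),
    ("single URL", 1, False),
    ("3 of 10, fast", 3, True),
]:
    print(label, per_event_revenue(ok, fast), flat_revenue(ok))
```

The floor shows up in the all-failed row: the flat model earns nothing there, while the per-event model still collects the start fee.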

Result

Three months in:

Margin per run stopped going negative on small-batch / high-failure runs. The run-started fee acts as a floor.
Failed-URL ratio dropped from 12% to 4% — not because we got better, but because we stopped hiding failures behind a flat result fee. Users started reporting bad URLs in support, instead of opening refund tickets.
Average revenue per user went up, not down, even though our headline price ($0.018/item) was lower than the previous flat $0.025. Setup fee + opt-in premium fee made up the difference.

Cleaner pricing, cleaner margin, cleaner conversation with users about what they're actually paying for.

If you're running an Apify actor on flat pay-per-result and your retry rate is anything above noise, you're subsidising the unreliable part of your stack. Move the line. Charge for what you do, not for what survives. The Instagram actor I rewrote with this model is live at Instagram AI Transcript Extractor — same shape applied across the rest of our actor portfolio over the last quarter.

What event are you not charging for that you should be? Drop the actor in the comments — I'll look at the schema.

Written by Jonas Keller, Senior Automation Architect at SIÁN Agency. Find more from Jonas on dev.to. For custom scraping or automation work, hire SIÁN Agency.

Migration Playbook: Cron Script → Actor. Six Steps, No Rewrites.

SIÁN Agency — Mon, 22 Jun 2026 02:30:00 +0000

TL;DR — Migrating a long-running cron-based scraper to an actor architecture does not require a rewrite. It requires six structural changes applied in order. Each one is independently shippable. Each one moves the scraper closer to a state where infrastructure is no longer your problem. We migrated our interview transcription pipeline this way over four iterations. Here's the order I'd run it again.

I've watched too many migrations fail because someone said "let's rewrite this as an actor" and treated it as a from-scratch project. They burn a sprint, miss edge cases the original handled, ship something that works in dev and breaks under real input, and end up reverting.

The pipeline doesn't need a rewrite. It needs surgery.

The six-step migration order

Each step is a PR. Each step ships independently. The cron job keeps running until step 6.

Step 1 — Extract input from the script body

Find the hardcoded list of URLs / config values / paths in your script. Move them to a JSON config file. The script reads from the config; the config is parameterised.

# Before:
URLS = ["https://...", "https://..."]
OUTPUT_PATH = "/var/data/output.csv"

# After:
import json, sys
config = json.load(sys.stdin)
urls = config["urls"]
output_path = config["output_path"]

The cron now does cat config.json | python script.py instead of python script.py. Behaviour identical. Surface area changed.

Why first: every later step depends on having a typed input. Doing this first means everything that follows operates on the same shape.
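
If you want the input to be typed in more than name, one option is to parse the config into a dataclass at the boundary. This is a sketch, not part of the original pipeline; `ScraperConfig` and `parse_config` are hypothetical names.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class ScraperConfig:
    urls: list
    output_path: str

def parse_config(raw: str) -> ScraperConfig:
    """Parse and validate the JSON config the cron pipes in on stdin."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("config must be a JSON object")
    urls = data.get("urls")
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise ValueError("'urls' must be a list of strings")
    output_path = data.get("output_path")
    if not isinstance(output_path, str) or not output_path:
        raise ValueError("'output_path' must be a non-empty string")
    return ScraperConfig(urls=urls, output_path=output_path)

config = parse_config('{"urls": ["https://example.com/a"], "output_path": "/tmp/out.csv"}')
print(config.urls, config.output_path)
```

In the script itself you would call `parse_config(sys.stdin.read())` in place of the bare `json.load`.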

Step 2 — Replace ad-hoc output with a structured dataset

Instead of writing rows directly to a CSV, push them to a function that wraps the persistence layer:

def push_record(record):
    # Today, this writes to a CSV.
    write_csv_row(output_path, record)

# Tomorrow, this writes to Apify Dataset, S3, BigQuery, whatever.

Same data shape, abstracted writer. The cron still produces a CSV. Step 6 swaps the writer.

Why second: schema changes are easier when there's one place that knows about the shape.
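
A sketch of what "abstracted writer" can mean in practice: two interchangeable writer classes behind one `push_record`. `CsvWriter`, `ListWriter`, and `make_push_record` are hypothetical names, and `ListWriter` only stands in for whatever backend step 6 brings.

```python
import csv
import io

class CsvWriter:
    """Today's behaviour: write rows to a CSV stream."""
    def __init__(self, stream, fieldnames):
        self._writer = csv.DictWriter(stream, fieldnames=fieldnames)
        self._writer.writeheader()

    def write(self, record):
        self._writer.writerow(record)

class ListWriter:
    """Stand-in for a future backend (dataset, S3, ...); collects rows in memory."""
    def __init__(self):
        self.rows = []

    def write(self, record):
        self.rows.append(record)

def make_push_record(writer):
    """Bind push_record to whichever writer the current step uses."""
    return writer.write

buffer = io.StringIO()
push_record = make_push_record(CsvWriter(buffer, ["url", "title"]))
push_record({"url": "https://example.com/a", "title": "A"})
print(buffer.getvalue())
```

The scraping code only ever calls `push_record(record)`, so swapping the backend is a one-line change where the writer is constructed.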

Step 3 — Replace bare `try/except` with structured failures

Audit every try/except. If it swallows the exception, replace with explicit logging and a failure record:

# Before:
for url in urls:
    try:
        push_record(process(url))
    except Exception:
        pass  # the failure vanishes without a trace

# After:
for url in urls:
    try:
        record = process(url)
    except Exception as e:
        push_failure({"url": url, "error": str(e), "type": type(e).__name__})
        continue
    push_record(record)

Now failures are first-class data. Same rows of work; the bad ones go to a different file (or dataset) instead of vanishing.

Why third: this is the step where you stop losing data silently. Every later step assumes failures are visible.
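
`push_failure` is not defined in the snippet above. A minimal version, assuming failures go to a JSON Lines file until step 6 swaps in a dataset; the file path and the added `failed_at` field are my assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

FAILURES_PATH = Path("failures.jsonl")  # hypothetical location

def push_failure(failure, path=FAILURES_PATH):
    """Append one failure record as a JSON line, stamped with the current UTC time."""
    row = {**failure, "failed_at": datetime.now(timezone.utc).isoformat()}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")

def load_failures(path=FAILURES_PATH):
    """Read failure records back, for example to build a retry run."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```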

Step 4 — Replace `print()` with structured logging

# Before:
print(f"Processing {url}")

# After:
import logging
log = logging.getLogger("transcribe")
log.info("processing", extra={"url": url, "stage": "transcribe"})

Use a logging library that supports structured fields. (Python: structlog, loguru, or logging with a JSON formatter.)

Why fourth: logs are what step 6 will be reading. They need shape.
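
With only the standard library, the JSON-formatter route might look like this. `JsonFormatter` is a hypothetical helper, not something `logging` ships; it copies whatever was passed via `extra` into the JSON line.

```python
import io
import json
import logging

# Attributes every LogRecord carries; anything else on a record came in via `extra`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {"message", "asctime"}

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record):
        payload = {"level": record.levelname, "logger": record.name, "event": record.getMessage()}
        payload.update({k: v for k, v in vars(record).items() if k not in _STANDARD_ATTRS})
        return json.dumps(payload, default=str)

stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("transcribe")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("processing", extra={"url": "https://example.com/a", "stage": "transcribe"})
print(stream.getvalue())
```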

Step 5 — Containerise

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "script.py"]

The cron now runs docker run my-scraper. Same input/output. Containerised.

Why fifth: containerisation is portable. Step 6 needs it; nothing earlier did.

Step 6 — Swap the runtime

Now you point the container at an actor runtime — Apify, Kubernetes CronJob, Cloud Run, whatever. The container is the same. The cron entry is gone. Scheduling, retries, logging, persistence are now provided by the runtime.

This is the step that takes a day. Steps 1–5 might take 2–3 days each. The point of doing them first is that step 6, the one most teams treat as the whole migration, is small and reversible by the time you reach it.

Why this order

Each step is independently valuable even if you stop. After step 1 you have a parameterised script — useful for ad-hoc runs. After step 2 you can change persistence. After step 3 you stop losing data. After step 4 you can debug remotely. After step 5 you can deploy anywhere. After step 6 you have an actor.

If a stakeholder asks why the migration is taking so long, you can point at the running improvements at any step. There is no "we're 60% done with the rewrite, it's not running yet" phase.

Result

The interview transcription actor went through this migration over four months, one step at a time, while running in production the entire time. Pre-migration: ad-hoc cron, 18% silent-failure rate, mean time to detect issues ~24 hours. Post-migration: actor with retries and structured logging, 1.2% failure rate (and the failures are visible), mean time to detect <30 minutes.

Total team-hours: roughly 60. Spread across four iterations. Compare to the rewrites I've seen go sideways: typically 80–120 hours and a stalled cutover.

When this is wrong

Two cases where a rewrite genuinely beats the migration playbook:

The original script is very small (under 100 lines, single function). At that scale, the migration steps cost as much as a rewrite, and the rewrite gives you a cleaner result.
The original is in a language your team doesn't maintain (a Perl script you inherited, a Bash pipeline). At some point the cost of step 1 alone exceeds the rewrite cost.

Otherwise: surgery, not rewrite. We packaged this six-step migration as a checklist we apply to every legacy scraper we inherit at the start of an engagement, the same shape we used to rebuild the interview transcription actor.

Where in the six steps is your current scraper? Drop the answer — I'll point at the next change that buys you the most.

Written by Jonas Keller, Senior Automation Architect at SIÁN Agency. Find more from Jonas on dev.to. For custom scraping or automation work, hire SIÁN Agency.

I Broke This Scraper on Purpose. Here's What Shipped to Production Unprotected.

SIÁN Agency — Thu, 11 Jun 2026 08:30:00 +0000

Best way to find out where your scraper is fragile? Break it. On purpose. In a controlled way, in a test environment, with a checklist of failure modes you actively try to inject.

This is chaos engineering for scrapers. Most teams don't do it because they're convinced their scraper "works." Then they discover what doesn't work the hard way, in production, on a Sunday.

I ran the exercise on our image metadata scraper last week. Here's what I broke and what I found.

The 3-item attack list

Three categories of injected failure that catch most fragility:

Network failure — slow responses, dropped connections, partial bodies, 5xx responses.
Content failure — malformed HTML, missing fields, unexpected types (string where number was expected).
Adversarial input — empty inputs, very large inputs, URLs that 404, URLs that redirect to login pages.

If your scraper survives all three, you have a real scraper. If it crashes or hangs on any of them, you've found a bug.

The trick — Playwright route handlers as fault injectors

Playwright's request routing isn't just for blocking ads. It's a controlled chaos primitive:

// Inject a 30% rate of 503 responses
await page.route('**/*', async (route) => {
  if (Math.random() < 0.3) {
    return route.fulfill({
      status: 503,
      body: 'Service Unavailable',
    });
  }
  return route.continue();
});

// Inject latency
await page.route('**/api/*', async (route) => {
  await new Promise(r => setTimeout(r, 5000));
  return route.continue();
});

// Inject malformed JSON
await page.route('**/metadata.json', async (route) => {
  return route.fulfill({
    status: 200,
    contentType: 'application/json',
    body: '{"title": "Test", "size": ',  // truncated JSON
  });
});

Now run your scraper. See what falls over.

What broke when I did this last week

Image metadata scraper, running against a fixture set of 100 URLs with the failure handlers above wired in:

503 injection at 30% → scraper hung on a single URL for 90 seconds before failing. Found: missing per-request timeout. Fix: 15-second hard timeout per page.
5-second latency injection → scraper completed but reported 0 results for affected URLs. Found: wait_for_selector had an implicit 5-second timeout that exactly matched the injected latency, so it failed silently. Fix: explicit timeout, longer than expected p99 page load.
Truncated JSON injection → uncaught JSONDecodeError, killed the entire run. Found: no try/except around the JSON parser. Fix: wrap in try/except, push to failures dataset (per last week's post).
Empty input array → scraper exited with code 0 and an empty dataset. Found: no validation of input shape. Fix: assert len(input.urls) > 0 at start.
404 URLs (mixed in with valid URLs) → scraper retried each three times before giving up, doubling run time. Found: 404 was being treated as transient, not permanent. Fix: 404 → push to failures immediately, no retry.
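
Two of the fixes in that list are small enough to sketch: treating 404 as permanent and refusing an empty input. The set of transient status codes below is my assumption, not a list from the scraper itself.

```python
# Assumption: these are worth retrying; everything else non-2xx fails immediately.
TRANSIENT = {408, 429, 500, 502, 503, 504}

def classify_status(status):
    """Return 'ok', 'retry', or 'fail' for an HTTP status code."""
    if 200 <= status < 300:
        return "ok"
    if status in TRANSIENT:
        return "retry"
    return "fail"  # 404 and friends go straight to the failures dataset

def validate_urls(urls):
    """Reject an empty or malformed input list before any work starts."""
    if not isinstance(urls, list) or not urls:
        raise ValueError("input must contain at least one URL")
    return urls

print(classify_status(503), classify_status(404))
```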

Five real bugs, found in 90 minutes. Every one of them would have eventually hit production. Two of them already had — the timeout one was the cause of a Slack alarm we got in March that we'd "fixed" by restarting the actor.

The CTA you didn't ask for

We now run a chaos test suite against every actor before it ships. Same five injections every time:

Random 503s at 30%.
Random 5s latency at 20%.
Malformed JSON on the data endpoint.
Empty input array.
50% invalid URLs in the input.

It takes 5 minutes to run, and it catches things real-traffic testing won't, because real traffic doesn't reliably produce the bad cases. The chaos suite is what caught the timeouts in the image metadata scraper before its first paying user noticed.
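
To keep a chaos run reproducible, it can help to decide up front which URLs get a fault instead of rolling the dice per request. This is a different approach from the per-request `Math.random()` shown earlier; `plan_faults` is a hypothetical helper and the seed is arbitrary.

```python
import random

def plan_faults(urls, rate, seed=0):
    """Pick a reproducible subset of URLs (a `rate` share of them) to receive a fault."""
    rng = random.Random(seed)
    count = round(len(urls) * rate)
    return set(rng.sample(urls, count))

urls = [f"https://example.com/item/{i}" for i in range(100)]
faulty = plan_faults(urls, rate=0.3, seed=42)
print(len(faulty))  # 30 of the 100 fixture URLs
```

With a fixed seed, a failing chaos run can be replayed against the same set of broken URLs after the fix.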

So:

Pick one of the five injections above. Run it against your scraper today. Drop what broke in the comments — I'll guess the failure mode if you give me one detail.

Agree, disagree, or have a chaos test that catches something subtler? Reply.

Written by Nova Chen, Automation Dev Advocate at SIÁN Agency. Find more from Nova on dev.to. For custom scraping or automation work, hire SIÁN Agency.

Schema Drift Is the Silent Killer. Here's What to Log So You Actually Catch It.

SIÁN Agency — Tue, 02 Jun 2026 09:00:00 +0000

TL;DR — Most scraper "bugs" aren't bugs. They're the source site changing its data shape underneath you while your selectors and your code keep returning success. This is schema drift, and you cannot prevent it. You can only detect it. The detection has to be designed in. Here's how we do it.

I have a low opinion of any scraper that does not log a per-field availability rate. It's the single most useful number you can produce, and almost nobody produces it.

The premise: every record you scrape has a set of expected fields. After every run, you compute, for each field, the percentage of records that had a non-null value for it. You log that number. You alarm on it.

That's it. That's the whole technique.

Why this matters

A scraper has three failure modes you actually care about:

Total failure — the run errors out, you get a stack trace, you fix it.
Partial failure — some URLs fail, you log them, you retry.
Schema drift — every URL "succeeds," every record looks fine, but a field has silently gone from 98% present to 30% present.

The first two are loud. The third is silent. Schema drift is what produces "the dashboard looks weird" support tickets a week after the cause.

Real example, from our Sephora product info actor: in March, the site moved the "ingredients" field from a top-level dropdown into a tab inside a modal. Our existing selector still found something on the page (a placeholder div), and our code happily wrote ingredients="" to the dataset. No error, no alarm. The CSV still had an ingredients column. The values were empty for new products. Detected eight days later by a customer who tried to filter by allergen.

If we had been logging field availability, we would have seen the ingredient field drop from 96% present to 11% present in a single deploy and caught it inside an hour.

The teardown of why this gets missed

Most scrapers track:

Rows extracted per run.
Errors per run.
Run duration.

None of those move when schema drift happens. The row count is the same. The error rate is zero. The run duration is the same. You have to be looking at field-level data to see it.

The replacement pattern

After every run, compute and log this:

from collections import Counter

def field_availability(records, expected_fields):
    """Returns the % of records where each field is non-empty (not None, "", or [])."""
    total = len(records)
    if total == 0:
        # No records at all: report 0% instead of dividing by zero.
        return {field: 0.0 for field in expected_fields}
    counts = Counter()
    for record in records:
        for field in expected_fields:
            if record.get(field) not in (None, "", []):
                counts[field] += 1
    return {field: round(counts[field] / total * 100, 1) for field in expected_fields}

At the end of the run:

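availability = field_availability(records, EXPECTED_FIELDS)
# Nest under one key: a scraped field called "name" or "message" would clash
# with built-in LogRecord attributes if passed as top-level extra keys.
log.info("field_availability", extra={"availability": availability})

# Alarm on regression vs last run.
# In the Apify Python SDK, key-value access goes through Actor.get_value / Actor.set_value.
prev = await Actor.get_value("last_field_availability") or {}
for field, pct in availability.items():
    delta = pct - prev.get(field, pct)
    if delta < -10:  # a drop of more than 10 points is suspicious
        log.warning(f"availability regression: {field} {prev[field]}% → {pct}%")
await Actor.set_value("last_field_availability", availability)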

Three log lines per run. Persistent state across runs. An alarm when any field drops more than 10 percentage points.

What to monitor specifically

Field availability is the one that catches the most. Two more I find pay for themselves:

Value distribution shift. For numeric fields (price, rating, count), log the median and p95. If price suddenly goes from "median ~$30" to "median 0.0" you have a parser bug, not just availability drift.
Selector hit count. When you fall back from primary to secondary selector, log it. If your fallback rate goes from 1% to 40%, the primary selector is on its way out — you have a week or so before it goes to zero.

These three together (availability, distribution, fallback rate) catch ~90% of schema drift before it produces customer-visible bugs.
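
The other two signals are just as cheap to compute. A sketch, assuming you collect the numeric values and the selector hit counts during the run; the function names are hypothetical.

```python
import statistics

def numeric_summary(values):
    """Median and nearest-rank p95 of a numeric field, ignoring missing values."""
    clean = sorted(v for v in values if v is not None)
    if not clean:
        return {"count": 0, "median": None, "p95": None}
    rank = (95 * len(clean) + 99) // 100  # ceil(0.95 * n) in integer arithmetic, 1-based
    return {"count": len(clean), "median": statistics.median(clean), "p95": clean[rank - 1]}

def fallback_rate(primary_hits, fallback_hits):
    """Share of extractions (in %) that needed the secondary selector."""
    total = primary_hits + fallback_hits
    return round(fallback_hits / total * 100, 1) if total else 0.0

print(numeric_summary([29.0, 31.5, None, 30.0]))
print(fallback_rate(primary_hits=60, fallback_hits=40))
```

Log both next to the availability numbers and compare them to the previous run in the same way.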

Result

We added per-field availability logging across the Sephora actor portfolio in February. In the four months since:

6 schema-drift incidents caught and fixed within 48 hours of the source-site change.
Mean detection lag went from "a customer noticed" (~7 days) to "the alarm fired" (~12 hours, the gap being our run cadence).
One incident where the field availability dropped in a way that was expected (Sephora removed a field site-wide); we acknowledged and updated the schema. Net cost: 20 minutes, including writing the postmortem.

The cost: about 30 lines of code per actor, run-time overhead measured in milliseconds.

When this is wrong

Field availability is a poor signal when your input is inherently heterogeneous. If you're scraping listings where some products have ingredients and most don't, "30% have ingredients" might be normal. The technique still works — you just compare to the previous run, not to an absolute target. A 10-point drop is the alarm; the absolute number doesn't matter.

If you're scraping a homogeneous catalogue (every product has a title and a price), absolute thresholds work fine. Title <99% present? Something is wrong.

We packaged the field-availability + distribution + fallback-rate triple into a small middleware that sits at the end of every actor we ship — first deployed on the Sephora product info actor and rolled out portfolio-wide. Three lines to wire up, alarms in your inbox the day a source site decides to change their schema.

Which of the three signals is missing from your scraper right now? Drop it in the comments — I'll show you the smallest version that works.

Written by Jonas Keller, Senior Automation Architect at SIÁN Agency. Find more from Jonas on dev.to. For custom scraping or automation work, hire SIÁN Agency.

One Playwright Selector Trick Nobody Talks About: getByRole

SIÁN Agency — Sun, 31 May 2026 08:30:00 +0000

Everyone reaches for page.locator(".some-class") first. They shouldn't.

getByRole is the most stable selector in Playwright and almost nobody uses it for scraping. They think it's a testing-library thing. It's not. It's a way of asking the page "what is this element semantically" instead of "what classname does the design system happen to use this week."

That distinction is what kept our Facebook video transcript actor running through three Facebook redesigns this past year.

The 3-item checklist

When does getByRole work? When the site is built by people who care about accessibility. Which is: more sites than you think, especially big ones with legal requirements (US government, EU compliance, large e-commerce).

Check before you skip it:

Open the accessibility tree in Chrome DevTools (Elements → Accessibility tab). If your target element shows a role and an accessible name, getByRole will find it.
Buttons and headings are nearly always tagged correctly. Even sloppy sites give you role="button" and proper heading levels because the design system enforced it.
Forms expose label even when the visual design hides it. getByLabel("Email") works on inputs that don't visibly show "Email" anywhere.

The trick

Compare:

// Class-name brittle
const followBtnByClass = page.locator('._a9-_._a9-_2._a9-_8._a9-_z');

// getByRole: survives layout changes
const followBtnByRole = page.getByRole('button', { name: /follow/i });

The first one breaks the day Facebook tweaks their CSS-in-JS hash. The second one keeps working until they remove the button entirely.

Same for headings:

// "Get the post title"
const title = page.getByRole('heading', { level: 1 });

That works on every site that uses <h1> correctly. Which is most of them, because a single descriptive <h1> is standard SEO and accessibility practice.

Quick case

The Facebook transcript actor extracts video metadata from public posts. Facebook ships A/B tests constantly — class names change every couple of weeks. Selectors built on _a9-_8 chains broke regularly.

I rewrote the extractor to use getByRole for everything that had a meaningful role:

Author name → getByRole('link', { name: /^[\w. ]+$/ }) near the post header.
Post text → no role, but [data-ad-comet-preview="message"] (a data- attribute, also stable).
Video player → getByRole('article') containing a <video> element.

Before: ~8 selector breakages per quarter. After: 1 in the last 6 months, and that one was a real structural change (Facebook moved to a new post type), not a class rename.

The CTA you didn't ask for

getByRole is now the first thing every new actor we write tries — including the rebuild of the Facebook AI Transcript Extractor. CSS-class selectors are reserved for the cases where the site's accessibility story is genuinely broken (rare in 2026 — most sites have been audited at least once).

So:

Open your scraper. Run a search for page.locator( with a CSS class chain. How many can you replace with getByRole? Drop the count in the comments — I'll bet it's more than half.
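
That search can be scripted. A rough sketch that flags `page.locator(...)` calls whose selector looks like a chain of CSS classes. Both regexes are heuristics of mine: they only see string literals and will miss selectors built from variables.

```python
import re

# Matches page.locator('<selector>') with a single- or double-quoted literal.
LOCATOR_CALL = re.compile(r"""page\.locator\(\s*(['"])(?P<selector>.*?)\1""")
# Starts with a class and contains only class, descendant, or child tokens.
CLASS_CHAIN = re.compile(r"\.[\w\-. >]+")

def find_class_locators(source):
    """Return selectors passed to page.locator() that look like CSS class chains."""
    return [
        m.group("selector")
        for m in LOCATOR_CALL.finditer(source)
        if CLASS_CHAIN.fullmatch(m.group("selector"))
    ]

sample = '''
const a = page.locator('._a9-_._a9-_2');
const b = page.locator('[data-ad-comet-preview="message"]');
const c = page.getByRole('button', { name: /follow/i });
'''
print(find_class_locators(sample))
```

Only the first line of the sample is flagged; the data-attribute locator and the `getByRole` call are left alone.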

Agree, disagree, or have a site where getByRole falls apart? Reply.

Written by Nova Chen, Automation Dev Advocate at SIÁN Agency. Find more from Nova on dev.to. For custom scraping or automation work, hire SIÁN Agency.

Scraping Without Tests Is Gambling. And the House Always Wins.

SIÁN Agency — Fri, 29 May 2026 06:30:00 +0000

Nobody writes tests for scrapers. I get it. The site changes, your tests break, you feel like you spent Tuesday writing tests for the site you don't control. So you skip them.

Then the site changes again. Your scraper silently returns empty rows. The dashboard goes blank. Your client texts at 11pm. You discover, in the cold light of debug, that this exact failure was deterministic and could have been caught in 30 seconds by a single fixture-based test.

The house always wins.

The 3-item checklist

What scrapers actually need to test:

Extraction against a frozen HTML fixture. Save a copy of the page once. Run the parser against it. Assert the fields. This catches your bugs.
Schema validation against a live response. Periodically (daily, weekly), hit one real URL and validate the output shape. This catches their changes.
Smoke test the full pipeline against a known-good URL. End-to-end. One URL. Asserts that you get one row out, with the expected fields. This catches integration breakage.

You don't need a Jest config or a pytest empire. You need three test files.

The replacement: a fixture-first test in <10 lines

# tests/test_extractor.py
from pathlib import Path
from my_scraper.extract import extract_comment

def test_youtube_comment_extraction():
    html = Path("tests/fixtures/youtube_comment_2026-04-01.html").read_text()
    result = extract_comment(html)
    assert result["author"] == "@somecreator"
    assert result["likes"] == 1247
    assert "great video" in result["text"].lower()

Then your extract_comment(html) is a pure function — give it HTML, get a dict back. No browser, no network. Runs in milliseconds. Survives a CI minute budget. Catches every regression in your parsing code instantly.
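To make the "pure function" shape concrete, here is a minimal sketch of what such an extractor could look like using only the standard library. The data-field attributes are hypothetical markup invented for this illustration (real YouTube HTML looks nothing like this), and the parser assumes each field sits in a leaf element with no nested tags.

```python
from html.parser import HTMLParser


class _CommentParser(HTMLParser):
    """Collects text from elements carrying a (hypothetical) data-field attribute."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        field = dict(attrs).get("data-field")
        if field in {"author", "likes", "text"}:
            self._current = field
            self.fields.setdefault(field, "")

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        # Assumption: fields are leaf elements, so any end tag closes the field.
        self._current = None


def extract_comment(html: str) -> dict:
    """Pure function: HTML string in, dict out. No browser, no network."""
    parser = _CommentParser()
    parser.feed(html)
    f = parser.fields
    return {
        "author": f.get("author", "").strip(),
        # Strip thousand separators before converting, e.g. "1,247" -> 1247.
        "likes": int(f.get("likes", "0").replace(",", "").strip() or 0),
        "text": f.get("text", "").strip(),
    }
```

Because the function never touches the network, the fixture test above can call it directly on the saved file's contents.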

Save the fixture by literally hitting the URL once and writing the response to disk:

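# scripts/refresh_fixture.py
import asyncio
from pathlib import Path
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://www.youtube.com/watch?v=...")
        # Write the rendered HTML to disk; this file is the frozen fixture.
        Path("tests/fixtures/youtube_comment_2026-04-01.html").write_text(
            await page.content(), encoding="utf-8"
        )

asyncio.run(main())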

Run it once a quarter. When the test starts failing, refresh the fixture, fix the extractor, commit both. That's the loop.

Quick case

On our YouTube comments scraper, fixture-based tests caught three parsing regressions before they ever reached production:

A field rename (likeCount moved to a different key) plus a thousand-separator format change.
A new "pinned" badge that broke our author-name selector.
A timestamp format change from "2 days ago" to "2d".

All three would have shipped silently. The cron would still run. The CSV would still write. The fields would just be wrong or empty. Instead, the test failed in CI on the PR that introduced the change, fifteen minutes after the fixture was last refreshed.

The cost of writing the test the first time: 20 minutes. The cost of the bugs it caught, if shipped: at minimum a refund and an apology each.

The CTA you didn't ask for

Every actor we ship now starts with three test files:

tests/test_extract.py — fixture-based unit tests for parsing.
tests/test_schema.py — Pydantic / Zod schema check on a live URL, run on a schedule.
tests/test_smoke.py — single-URL end-to-end check on every deploy.

It's the most boring testing pyramid you've ever seen and it has paid for itself an embarrassing number of times — the YouTube comments scraper is where it caught the most regressions in 2026.
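As a rough idea of what the schema check boils down to, here is a dependency-free sketch. It is a stand-in for a Pydantic or Zod model, not what our actors actually run; COMMENT_SCHEMA and schema_errors are hypothetical names, and the live fetch that produces the row is left out.

```python
# Hypothetical schema for one scraped comment: field name -> required type.
COMMENT_SCHEMA = {"author": str, "likes": int, "text": str}


def schema_errors(row: dict, schema: dict = COMMENT_SCHEMA) -> list:
    """Return a list of human-readable problems; an empty list means the row is valid."""
    errors = []
    for field, expected in schema.items():
        if field not in row:
            errors.append(f"missing field: {field}")
        elif not isinstance(row[field], expected):
            actual = type(row[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    # Fields the schema does not know about are drift too, so report them.
    for field in sorted(row.keys() - schema.keys()):
        errors.append(f"unexpected field: {field}")
    return errors
```

In a scheduled test you would scrape one known-good URL, pass the resulting row through schema_errors, and assert that the list comes back empty.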

So:

Open your scraper. Do you have a tests/ folder? Drop "yes" or "no" in the comments. If "no" — what's stopping you?

Agree, disagree, or have a fixture strategy that actually works? Reply.

Written by Nova Chen, Automation Dev Advocate at SIÁN Agency. Find more from Nova on dev.to. For custom scraping or automation work, hire SIÁN Agency.

Why Your Requests + BeautifulSoup Stack Will Fail in Production

SIÁN Agency — Tue, 26 May 2026 08:30:00 +0000

TL;DR — requests plus BeautifulSoup is the right tool for tutorials, side projects, and one-off audits. It is the wrong tool for any scraper that has to run unsupervised, longer than a quarter, against a site that has even basic bot defenses. I've watched a dozen teams discover this the expensive way. Here's the diagnosis and the replacement.

I'm not anti-requests. The library is fast, predictable, and elegant. For 30% of scraping tasks it's still what I reach for first. The problem is that the rest of the scraping pipeline — JavaScript-rendered content, fingerprinting checks, modern auth flows, lazy loading — silently breaks the assumptions requests is built on.

Most teams discover this in stages. Here's the timeline.

Month 1 — "It works"

You write the first version. requests.get(url) returns 200, BeautifulSoup parses the response, you find your selectors, you ship. Tests pass against the small URL set you tested with. Lunch.

Month 2 — "Some pages return empty"

You notice maybe 5% of pages return rows where half the fields are None. You add a check, log the URL, retry. The retry sometimes works.

What's actually happening: those pages render their data in JavaScript after the initial response. requests got the HTML skeleton. The data was never in it. The retries that "work" are coincidence — sometimes the cached page has stale rendered data; sometimes a CDN ships a different variant.

Month 3 — "We're getting 403s"

The target site rolled out a fingerprinting check. requests sends a default User-Agent that screams python-requests/2.31.0. You add headers. It works for two days. Then they tighten the check: now they look at the TLS fingerprint, not just the User-Agent. requests negotiates TLS through Python's ssl module (OpenSSL underneath), and that handshake looks nothing like any real browser's. The block returns.

Month 4 — "We need a session, but it's stateful"

Login flow now requires a CSRF token, which is rendered in JavaScript, which requests can't run. You spend two days reverse-engineering the login flow, find the API endpoint behind it, hit that directly. Works for six weeks. They rotate the auth scheme.

Month 5 — "Let's just use Playwright"

You finally migrate. Most of the team is annoyed because the rewrite took longer than they wanted. The team that does it later is annoyed for the same reason.

The teardown

The fundamental issue: requests is an HTTP client. Modern websites are browser applications. The thing you're scraping is the output of running JavaScript, not a static document. You can fight that for a while — by reverse-engineering APIs, faking TLS fingerprints, hand-rolling JS interpreters — but you're paying interest on a debt you took on the day you reached for requests instead of a real browser.

Specific failure modes you're going to hit:

JavaScript-rendered content. The HTML you fetch contains <div id="root"></div> and not much else.
TLS fingerprinting. requests looks like Python; real browsers look like Chrome/Firefox. Block lists distinguish them easily.
Lazy-loading. Data appears in the DOM only after scroll, click, or visibility events. Static fetch never triggers them.
Modern auth. OAuth, CSRF tokens injected via JS, cookie-based session validation that requires running scripts.
Anti-automation challenges. Cloudflare, PerimeterX, DataDome — all rely on running JavaScript to validate the client.

requests handles none of these on its own. Playwright (or Puppeteer) handles the first four out of the box, because Playwright is a browser, and it gives you a realistic starting point against the fifth.

The replacement pattern

Skip the year of pain. Start with Playwright. Use requests only when you've measured that the data is in the static HTML and the site has no fingerprinting:

from playwright.async_api import async_playwright

async def scrape(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        ctx = await browser.new_context(
            user_agent="Mozilla/5.0 (...)",
            viewport={"width": 1920, "height": 1080},
        )
        page = await ctx.new_page()

        # Block heavy resources for speed.
        await page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2}",
                         lambda r: r.abort())

        await page.goto(url, wait_until="domcontentloaded")
        # Wait for the *data* to appear, not just the document.
        await page.wait_for_selector('[data-product-id]', timeout=15_000)

        # extract_fields is your own parsing coroutine (not shown here).
        return await extract_fields(page)

Five things requests can't give you that Playwright does for free:

JavaScript execution — your selectors target rendered DOM, not the source.
Realistic TLS fingerprint — Chromium does this for you.
Cookie/session handling that matches a real browser.
wait_for_selector — semantic waits instead of time.sleep.
Routing controls — block what you don't need, accelerate what you do.

When requests is still right

Static documentation sites. Open RSS/Atom feeds. JSON APIs that don't require login. PDFs and CSVs hosted on S3. Anything where you've actually fetched the URL, looked at the response body, and confirmed your data is in it.

That's a real category. Just don't assume the next site you scrape will fall into it.
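The "fetch it, look at the body, confirm the data is there" check can be turned into a few lines you run before choosing a tool. This is a minimal sketch; missing_from_static_html is a hypothetical helper, and it assumes you already know a few values (a product name, a price) that you can see on the page in a real browser.

```python
def missing_from_static_html(html: str, expected_values: list) -> list:
    """Return the expected values that do NOT appear in the raw response body."""
    lowered = html.lower()
    return [value for value in expected_values if value.lower() not in lowered]


# A JavaScript shell: the values you see in the browser are not in the source.
shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
# A genuinely static page: the data is right there in the HTML.
static = "<html><body><h1>Blue Kettle</h1><span>19.99</span></body></html>"

print(missing_from_static_html(shell, ["Blue Kettle", "19.99"]))
print(missing_from_static_html(static, ["Blue Kettle", "19.99"]))
```

If the list comes back empty for a representative sample of URLs, a plain HTTP client is a reasonable choice. If it does not, the data is being rendered after the initial response.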

Result

Across our actor portfolio, the migration ratio settled around 80/20 — Playwright for 80% of jobs, requests for the 20% where the data is genuinely static. The 80% includes our entire Sephora catalog pipeline, which spent its first version as a requests + BeautifulSoup script and never made it past month 2. The Playwright rewrite has been running unsupervised for 14 months.

If your scraper is currently 100% requests, your sample size isn't "this works fine." Your sample size is "the sites I've scraped so far happen to have static HTML."

Which of the five failure modes have you shipped to production? Drop the symptom in the comments — I'll point at the fix.

Written by Jonas Keller, Senior Automation Architect at SIÁN Agency. Find more from Jonas on dev.to. For custom scraping or automation work, hire SIÁN Agency.

Stop Fighting the DOM. Selector-First Thinking Will Save Your Scraper.

SIÁN Agency — Sun, 24 May 2026 09:00:00 +0000

Most broken scrapers I see have the same shape: someone wrote the extraction logic first and the selectors second. The selectors were an afterthought — whatever worked in DevTools at 2am.

That's backwards. Selectors are the contract between your code and the page. Get them wrong and the rest of your scraper is irrelevant.

The mindset shift

Selector-first thinking means: before you write a single line of extraction code, you decide how the data is identified. Not "how do I get the price?" but "what does the page tell me, programmatically, that this thing is a price?"

Three answers, in order of preference:

Semantics — getByRole, getByLabel, getByText. These mirror what an accessibility tree exposes. They survive design changes.
Data attributes — data-testid, data-product-id, itemprop. Devs often add these for their own tests; you get to free-ride.
Structured data — JSON-LD, microdata, OpenGraph. The page is already telling Google what's a price; let it tell you too.

CSS classes are last resort. Class names are styling, not identity. They change when the design changes. They're the equivalent of asking for "the third button from the top" — works until someone rearranges the menu.

The 3-item checklist

Before you write a selector:

Open the accessibility tree in DevTools (Chrome: Elements → Accessibility tab). If the data has a role and an accessible name, use getByRole.
Search the page source for application/ld+json. If it's there and contains your fields, parse it directly. No DOM walking needed.
Look for data-* attributes near the data. Devs leave testing hooks everywhere. Use theirs.

If none of those work, then fall back to CSS or XPath. And when you do, anchor to something stable — a parent landmark, an aria-label, a data- attribute — not just a class chain.
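For the JSON-LD step in the checklist, parsing the structured data straight out of the page source can look roughly like this. It is a standard-library sketch rather than our production util: price_from_json_ld is a hypothetical name, and it assumes the price lives under an offers object (or the first entry of an offers list), which is common for schema.org Product markup but not guaranteed.

```python
import json
from html.parser import HTMLParser


class _JsonLdCollector(HTMLParser):
    """Gathers the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_ld = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_ld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld:
            self.blocks.append("".join(self._buf))
            self._in_ld = False


def _nodes(doc):
    """Yield every JSON-LD object: top-level, list members, and @graph entries."""
    if isinstance(doc, list):
        for item in doc:
            yield from _nodes(item)
    elif isinstance(doc, dict):
        yield doc
        yield from _nodes(doc.get("@graph", []))


def price_from_json_ld(html: str):
    """Return the first offers.price found as a string, or None if there is none."""
    collector = _JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Malformed block: skip it and keep looking.
        for node in _nodes(doc):
            offers = node.get("offers")
            if isinstance(offers, list):
                offers = offers[0] if offers else None
            if isinstance(offers, dict) and "price" in offers:
                return str(offers["price"])
    return None
```

Returning None instead of raising lets the caller fall through to the next rung of the ladder when a page carries no usable structured data.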

The 10-line replacement

Here's the priority I use in every new actor:

async function extractPrice(page) {
  // 1. Structured data first. Check count() before reading: calling
  //    textContent() on a locator that matches nothing waits, then throws.
  const ld = page.locator('script[type="application/ld+json"]');
  if (await ld.count()) {
    try {
      const data = JSON.parse((await ld.first().textContent()) ?? '{}');
      if (data?.offers?.price) return data.offers.price;
    } catch {
      // Malformed JSON-LD: fall through to the DOM lookups below.
    }
  }

  // 2. Semantic selectors. first() avoids a strict-mode error on multiple matches.
  const priceByLabel = page.getByLabel(/^price$/i);
  if (await priceByLabel.count()) return priceByLabel.first().textContent();

  // 3. Data attributes.
  const priceByData = page.locator('[data-testid="price"]');
  if (await priceByData.count()) return priceByData.first().textContent();

  // 4. Last resort: CSS class. Logged loudly so we know we're in fallback.
  console.warn('Falling back to CSS selector: selector audit needed.');
  return page.locator('.price-tag').first().textContent();
}

Notice the console.warn() in the fallback path. When that warning starts appearing in your logs, it means the site changed its higher-priority signals and you're one design refresh away from breakage. Fix it before it breaks, not after.

Quick case

On our Idealista actor, the priority order above turned a "fix the selector every 6 weeks" routine into a "fix the selector twice a year" routine. The JSON-LD path catches 95% of listings without ever touching the DOM. The accessibility-role fallback catches another 4%. The CSS fallback fires on edge-case property types and tells us when a new layout has shipped — usually a week before any of our other monitoring would have noticed.

The CTA you didn't ask for

This selector ladder is the second thing every actor we ship gets, right after the request blocking from last week's post — see it in action in the Idealista actor. It's so consistent we made it a util.

So:

Open your scraper's selector code right now. Count how many class-name chains you have versus semantic / structured-data lookups. Drop the ratio in the comments. Bonus points for the longest CSS chain — I bet someone has .product-grid > .item:nth-child(3) > .price > span > strong.

Agree, disagree, or have a site that genuinely needs CSS chains? Reply.

Written by Nova Chen, Automation Dev Advocate at SIÁN Agency. Find more from Nova on dev.to. For custom scraping or automation work, hire SIÁN Agency.

A 10-Line Playwright Trick That Saved Me Hours on Every Sephora Run

SIÁN Agency — Fri, 22 May 2026 12:30:00 +0000

Most Playwright tutorials teach you to scrape a single page. Real scrapers need to scrape thousands. The thing that kills you isn't the selector — it's everything Playwright does before it touches the selector.

By default, Playwright loads a page like a human visiting a website. It downloads CSS, fonts, analytics scripts, A/B testing pixels, hero images, lazy-loaded carousels, and three different chat widgets. On a product catalog page, that's 4–6 MB of stuff you don't need. Times 10,000 pages, that's the difference between a 20-minute run and a 3-hour run.

Here's the 10-line route handler I drop into every actor:

const BLOCKED = ['image', 'media', 'font', 'stylesheet'];

await context.route('**/*', (route) => {
  const type = route.request().resourceType();
  const url  = route.request().url();
  if (BLOCKED.includes(type)) return route.abort();
  if (/google-analytics|doubleclick|hotjar|segment|gtm/.test(url)) {
    return route.abort();
  }
  route.continue();
});

That's it. Two lists: resource types you don't need, and tracking domains you definitely don't need.
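If your actor is written in Python, the same decision can be pulled out into a pure function so the blocking rules are unit-testable without a browser. This is a sketch that mirrors the handler above; should_block, BLOCKED_TYPES and TRACKER_PATTERN are hypothetical names, and the regex is the same loose substring match, so it can over-block any URL that happens to contain "segment" or "gtm".

```python
import re

# Resource types the scraper never needs to download.
BLOCKED_TYPES = {"image", "media", "font", "stylesheet"}
# Same loose substring match as the JavaScript handler.
TRACKER_PATTERN = re.compile(r"google-analytics|doubleclick|hotjar|segment|gtm")


def should_block(resource_type: str, url: str) -> bool:
    """Return True when a request should be aborted rather than continued."""
    if resource_type in BLOCKED_TYPES:
        return True
    return bool(TRACKER_PATTERN.search(url))
```

In Playwright for Python, a route handler would read route.request.resource_type and route.request.url, then call route.abort() when should_block returns True and route.continue_() otherwise.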

The 3-item checklist before you ship this

Test that your data is still there. Some sites lazy-load product info into image data- attributes. Aborting images can sometimes break extraction. Run with and without the route handler and diff the output.
Don't block scripts. Modern sites build the DOM with JS. Aborting scripts will give you an empty page. (CSS and fonts are safe — Playwright doesn't need them to find selectors.)
Watch for sites that detect this. Some bot-detection scripts check whether you fetched the analytics pixel. If your success rate drops after enabling this, allow the analytics domains back through.

Quick case

On our Sephora product info actor, this single change cut average page load from 4.8s to 1.3s. Across a 5000-product catalog scrape, that's the difference between roughly 6.7 hours and 1.8 hours of page loads. Same selectors, same data, same success rate. We just stopped downloading hero images of moisturizers we never look at.

It also dropped our Apify compute units per run by ~60%, which directly affects what we charge customers. Faster scraper, lower cost, same output. The route handler now ships with the Sephora product info actor and every new scraper after it.

The CTA you didn't ask for

This route handler ships with our starter actor template. New scrapers get it on day one. Old scrapers got it bolted on the first time we noticed runtime > 1 hour.

The pattern works on any browser-based scraper — Playwright, Puppeteer, Selenium with CDP. The shape is always: tell the browser what not to load, before you tell it what to find.

One quick note for the JS-heavy among you: the same pattern applies to Puppeteer's page.setRequestInterception(true) — same idea, slightly different API. Same wins.

Drop your slowest scraper's runtime in the comments. I'll guess what's eating your minutes. (Hint: it's probably hero images.)

Agree, disagree, or have a site where blocking images breaks something subtle? Reply.

Written by Nova Chen, Automation Dev Advocate at SIÁN Agency. Find more from Nova on dev.to. For custom scraping or automation work, hire SIÁN Agency.

Stop Building Fragile Scrapers — Build Actors Instead

SIÁN Agency — Mon, 18 May 2026 13:30:00 +0000

TL;DR — A "scraper" is a script that ran once. An "actor" is a unit of work with an input contract, an output schema, observability, and a billing model. Same code, completely different operational surface. We migrated our Bayut property pipeline from the first to the second this quarter and the support load dropped 70%.

I get sent a lot of scraper repos to "review" — usually after they've broken in production. They look surprisingly similar:

One Python file, 300–600 lines.
A main() that loops over URLs.
requests.get() plus BeautifulSoup plus a try/except: pass that swallows everything.
Output written to a CSV called output.csv in the working directory.
A cron job that triggers it nightly. Sometimes a Slack webhook on failure that stopped working six months ago.

This is what I call a script that ran once. The fact that it ran in production doesn't make it production code.

The teardown is always the same.

The five failure modes you inherit when you ship a script

No input contract. The script reads URLs from a hardcoded list or a file path that only exists on your laptop. New requirement → edit the file → redeploy → hope.
No output schema. Whatever fields happened to be present this run get written. When the source site adds a column, the CSV silently widens. When the source site removes a column, downstream breaks at parse time, three hops away from the cause.
No observability. "Did it run last night?" is answered by SSH-ing to the box and ls -la output.csv. Run history is the file's mtime. Failure mode is "the file is older than expected."
No retries with backoff. A 503 from the target site at 02:14 kills the run. There is no second attempt. The next run is in 24 hours.
No billing surface. The cost of running it is your time and your server. There is no per-unit price, so there is no signal that the unit economics are bad until you check the AWS bill.

A script is fine for "I need this data once." It is not fine for "we need this data nightly for the next two years." But teams keep shipping a run-once script to meet the nightly-for-two-years requirement.
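Of the five, the missing retry is the cheapest to picture in code. Below is a minimal sketch of retry with exponential backoff, assuming a hypothetical fetch callable that raises on failure. An actor runtime would normally do this for you; the sleep parameter exists only so the behaviour can be tested without waiting.

```python
import time


def fetch_with_backoff(fetch, url, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            # A real version would catch only transient errors (timeouts, 5xx).
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the failure, never swallow it.
            sleep(base_delay * 2 ** attempt)
```

With the defaults, a target that returns a 503 at 02:14 gets retried after 1, 2 and 4 seconds before the run is declared failed, instead of waiting 24 hours for the next cron tick.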

What an actor is

Strip the marketing word and an actor is just: a containerised job with a declared input schema, a declared output schema, and a runtime that handles scheduling, retries, logs, persistent storage, and billing. Apify is one implementation — there are others. The shape matters more than the vendor.

When we rebuilt our Bayut property scraper as an actor, four things changed at the level of code:

// 1. Input is validated against a schema before main() runs.
//    Bad input fails fast with a useful error, not silent miss.
const input = await Actor.getInput(); // INPUT_SCHEMA.json enforces shape

// 2. Output goes to a typed dataset. New fields require a schema
//    change — not a silent CSV widening.
await Dataset.pushData({
  listingId, price, currency, address, lat, lng, scrapedAt
});

// 3. Failures retry with backoff at the platform level.
//    Our code throws; the runtime decides what to do.
throw new ScrapeFailure('listing-blocked', { url, status: 429 });

// 4. Logs are structured, queryable, and indexed by run.
log.warning('rate-limit', { url, retryAfter: 60 });

That's it. Same Playwright, same selectors, same scraping logic. The difference is that all the boring infrastructure — input validation, output typing, retries, logs, scheduling, billing — is no longer your problem.

Result

For Bayut specifically, three months after the migration:

Mean time to detect a breakage went from ~36 hours (next-day stakeholder complaint) to under 15 minutes (failed runs alert with the offending URL and HTTP status).
Support tickets dropped 70%. Most of the volume was "the data is missing" — invisible failures from the cron-script era. With per-run datasets, failed runs surface themselves.
Cost per 1000 listings went down, not up. Concurrency at the runtime level is cheaper than spinning up your own queue.

The migration itself took about a week. Most of the time was not the scraping logic — that was already there. It was deciding what the input schema should be, what the output schema should be, and which fields were "nice to have" vs "the dataset is broken without this."

The replacement pattern

If you're sitting on a script-shaped scraper right now, the migration order is:

Write the input schema. Force every run to declare what it's scraping.
Write the output schema. Force every row to validate before it gets persisted.
Move retries from try/except: pass to the runtime.
Replace print() with structured logs.
Containerise. Whatever runs in python main.py should run in docker run.
Pick a runtime — Apify, your own k8s cron, whatever. The schema work is portable.

You do steps 1–5 inside your existing repo. You haven't committed to a vendor yet. By the time you reach step 6, the actor exists — the runtime is just a deployment target.
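Step 1 is the one people tend to skip, so here is a rough sketch of an input contract that needs nothing beyond the standard library. ScrapeInput, parse_input and the two field names are hypothetical, chosen for this illustration; on Apify the equivalent lives in the input schema file instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapeInput:
    """Everything a run is allowed to vary. Hypothetical fields for illustration."""
    start_urls: list
    max_items: int = 100


def parse_input(raw: dict) -> ScrapeInput:
    """Validate raw input up front so bad runs fail fast with a useful error."""
    urls = raw.get("start_urls")
    if not isinstance(urls, list) or not urls:
        raise ValueError("start_urls must be a non-empty list")
    bad = [u for u in urls
           if not (isinstance(u, str) and u.startswith(("http://", "https://")))]
    if bad:
        raise ValueError(f"not valid http(s) URLs: {bad}")
    max_items = raw.get("max_items", 100)
    if not isinstance(max_items, int) or max_items < 1:
        raise ValueError("max_items must be a positive integer")
    return ScrapeInput(start_urls=list(urls), max_items=max_items)
```

Calling parse_input at the top of main() replaces the hardcoded URL list, and the same dict shape carries over unchanged to whichever runtime you pick in step 6.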

We packaged this migration shape into a starter we use for every new client engagement. It's the same six steps that produced the Bayut property scraper above, every time.

Which of the five failure modes is currently shipping in your stack? Drop it in the comments — I'll point at the smallest change that fixes it.

Written by Jonas Keller, Senior Automation Architect at SIÁN Agency. Find more from Jonas on dev.to. For custom scraping or automation work, hire SIÁN Agency.
