<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Souymodeep Banerjee</title>
    <description>The latest articles on DEV Community by Souymodeep Banerjee (@soumo_banerjee).</description>
    <link>https://dev.to/soumo_banerjee</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3812034%2F1a14a979-55f1-4a9e-b43d-d4d34a04272b.png</url>
      <title>DEV Community: Souymodeep Banerjee</title>
      <link>https://dev.to/soumo_banerjee</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/soumo_banerjee"/>
    <language>en</language>
    <item>
      <title>How I scrape and de-dupe Meta ads for 1000 brands</title>
      <dc:creator>Souymodeep Banerjee</dc:creator>
      <pubDate>Tue, 21 Apr 2026 17:29:20 +0000</pubDate>
      <link>https://dev.to/soumo_banerjee/how-i-scrape-and-de-dupe-meta-ads-for-1000-brands-4kca</link>
      <guid>https://dev.to/soumo_banerjee/how-i-scrape-and-de-dupe-meta-ads-for-1000-brands-4kca</guid>
      <description>&lt;p&gt;I run &lt;a href="https://brandmov.com" rel="noopener noreferrer"&gt;Brandmov&lt;/a&gt;, a tracker for what DTC brands are running as Meta ads. Behind it: a weekly pipeline, a lot of brand seeds, a table that grows over time.&lt;/p&gt;

&lt;p&gt;This isn't a recipe. No selectors, no endpoint names  those posts break in six weeks. These are the six problems you can't skip once your scraper becomes a scheduled job that has to survive for a year.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Want to see the output? &lt;a href="https://brandmov.com/tools/competitor-ads-lookup" rel="noopener noreferrer"&gt;Competitor Ads Lookup&lt;/a&gt;  free, no signup, paste a brand, get deduped live ads.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  1. Bot detection is a decision, not a bug
&lt;/h2&gt;

&lt;p&gt;Stances, cheapest → most expensive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Browser mimicry.&lt;/strong&gt; Real viewport, locale, timezone, pointer cadence, no automation flags leaking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IP rotation.&lt;/strong&gt; Residential or mobile proxies, per-session or per-request, geo-matched to content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gate solving.&lt;/strong&gt; Captcha APIs, token relays, third-party solver SLAs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-loop.&lt;/strong&gt; Operator clears challenge, session stays warm for N minutes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fingerprint surface to audit on your own client:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TLS / JA3 / JA4  headless browsers ship with distinguishable TLS stacks.&lt;/li&gt;
&lt;li&gt;HTTP/2 frame order and SETTINGS values.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;navigator.webdriver&lt;/code&gt;, Chrome DevTools Protocol flags, missing &lt;code&gt;chrome&lt;/code&gt; global.&lt;/li&gt;
&lt;li&gt;Canvas, WebGL, AudioContext fingerprints.&lt;/li&gt;
&lt;li&gt;Font list, plugin list, &lt;code&gt;Accept-Language&lt;/code&gt; vs timezone mismatch.&lt;/li&gt;
&lt;li&gt;Mouse jitter, scroll velocity, keypress dwell time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Per-session telemetry to emit (before you need it):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Challenge shown? (yes/no/type)&lt;/li&gt;
&lt;li&gt;HTTP error rate in window.&lt;/li&gt;
&lt;li&gt;Response latency distribution vs baseline.&lt;/li&gt;
&lt;li&gt;Empty / truncated payload rate.&lt;/li&gt;
&lt;li&gt;4xx by code, 5xx by code, explicit rate-limit codes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Escalation policy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start at the cheapest stance that has ever worked for this target.&lt;/li&gt;
&lt;li&gt;Promote on two consecutive sessions failing the health check.&lt;/li&gt;
&lt;li&gt;Demote after K clean sessions.&lt;/li&gt;
&lt;li&gt;Promote is config, not code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common traps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting a custom &lt;code&gt;User-Agent&lt;/code&gt; but forgetting to match &lt;code&gt;sec-ch-ua&lt;/code&gt; hints.&lt;/li&gt;
&lt;li&gt;Spoofing timezone but not &lt;code&gt;Intl.DateTimeFormat().resolvedOptions().timeZone&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Proxy pools that share IPs across tenants  you inherit someone else's ban.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. Your queue is the system. The scraper is a worker.
&lt;/h2&gt;

&lt;p&gt;What the for-loop can't do, the queue can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At-least-once delivery with idempotency keys.&lt;/li&gt;
&lt;li&gt;Visibility timeouts sized for worst-case task duration × 1.5.&lt;/li&gt;
&lt;li&gt;Explicit retry counter with max-attempts.&lt;/li&gt;
&lt;li&gt;Dead-letter lane on attempt N.&lt;/li&gt;
&lt;li&gt;Resume after host restart without replaying successful work.&lt;/li&gt;
&lt;li&gt;Per-target concurrency caps, independent of worker count.&lt;/li&gt;
&lt;li&gt;Backlog metrics per priority tier.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Queue options, rough comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Option&lt;/th&gt;
&lt;th&gt;Durability&lt;/th&gt;
&lt;th&gt;Ordering&lt;/th&gt;
&lt;th&gt;Fairness primitives&lt;/th&gt;
&lt;th&gt;Ops cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SQS&lt;/td&gt;
&lt;td&gt;high&lt;/td&gt;
&lt;td&gt;FIFO optional&lt;/td&gt;
&lt;td&gt;none native&lt;/td&gt;
&lt;td&gt;low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis Streams&lt;/td&gt;
&lt;td&gt;medium&lt;/td&gt;
&lt;td&gt;per-stream&lt;/td&gt;
&lt;td&gt;consumer groups&lt;/td&gt;
&lt;td&gt;low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NATS JetStream&lt;/td&gt;
&lt;td&gt;high&lt;/td&gt;
&lt;td&gt;per-subject&lt;/td&gt;
&lt;td&gt;subject hierarchy&lt;/td&gt;
&lt;td&gt;medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Postgres-backed&lt;/td&gt;
&lt;td&gt;high&lt;/td&gt;
&lt;td&gt;SQL-controlled&lt;/td&gt;
&lt;td&gt;full SQL&lt;/td&gt;
&lt;td&gt;free if you already run PG&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Fairness patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Oldest-first.&lt;/strong&gt; Priority = &lt;code&gt;now - last_success&lt;/code&gt;. Starves nothing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weighted round-robin.&lt;/strong&gt; Each category gets a fixed slice of each tick.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stratified sampling.&lt;/strong&gt; Sample N from each bucket per run, not N from the whole set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLA tiers.&lt;/strong&gt; Hot tier runs daily, warm tier weekly, cold tier monthly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Poison detection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same error class on attempt 3 → dead-letter.&lt;/li&gt;
&lt;li&gt;Rising cost with no data returned → dead-letter.&lt;/li&gt;
&lt;li&gt;Seed that has never succeeded after 30 days → quarantine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Visibility timeout sizing rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start at &lt;code&gt;p95(task_duration) × 1.5&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If you see duplicate-execution incidents, raise it.&lt;/li&gt;
&lt;li&gt;If you see stuck-seed incidents, lower it and diagnose the slow task.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Dedupe on the platform's ID, or pay for it
&lt;/h2&gt;

&lt;p&gt;Options, with failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Platform ID.&lt;/strong&gt; Cheapest, correct when stable. Trap: IDs rotate across schema migrations, or are session-scoped and look stable for a single run. Probe longitudinally before trusting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content hash.&lt;/strong&gt; Cheap, almost never correct. Trap: whitespace diffs, CDN URL rotation, A/B copy tests, localized variants, truncation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Probabilistic match.&lt;/strong&gt; Necessary cross-source. Wrong-merges are harder to audit than misses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ID stability probe (run once before trusting an ID):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Capture the same entity weekly for ≥ 4 weeks.&lt;/li&gt;
&lt;li&gt;Confirm ID is stable across: different geo, different sort, different pagination offset.&lt;/li&gt;
&lt;li&gt;Confirm ID survives a platform-side field rename.&lt;/li&gt;
&lt;li&gt;If any fail, the ID is session-scoped; fall back to content hash + field triangulation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What "invariant content" actually means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not the title (A/B tests it).&lt;/li&gt;
&lt;li&gt;Not the body (localized).&lt;/li&gt;
&lt;li&gt;Not the media URL (CDN-rotated).&lt;/li&gt;
&lt;li&gt;Not the status (time-varying).&lt;/li&gt;
&lt;li&gt;Maybe the creation date + brand + structural shape (card count, format).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Counters every dedupe path must emit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;dedupe.merged&lt;/code&gt;  two payloads collapsed into one record.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dedupe.distinct_same_shape&lt;/code&gt;  two records with identical non-ID fields, different IDs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dedupe.conflict&lt;/code&gt;  same ID, contradictory immutable fields.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dedupe.first_seen&lt;/code&gt;  new record.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The counter that tells you the platform changed: &lt;code&gt;distinct_same_shape&lt;/code&gt; spikes.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Store every sighting. Never &lt;code&gt;UPDATE&lt;/code&gt;.
&lt;/h2&gt;

&lt;p&gt;Schema shape (conceptual):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Entity&lt;/code&gt;  identity only. &lt;code&gt;entity_id&lt;/code&gt;, created-at, brand FK.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Observation&lt;/code&gt;  append-only. &lt;code&gt;observation_id&lt;/code&gt;, &lt;code&gt;entity_id&lt;/code&gt; FK, &lt;code&gt;observed_at&lt;/code&gt;, payload snapshot, run FK.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Run&lt;/code&gt;  metadata about each pipeline execution.&lt;/li&gt;
&lt;li&gt;Views: &lt;code&gt;latest_observation_per_entity&lt;/code&gt;, &lt;code&gt;first_seen&lt;/code&gt;, &lt;code&gt;last_seen&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Indexes you'll need within three months:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;(entity_id, observed_at DESC)&lt;/code&gt;  latest observation.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(run_id, entity_id)&lt;/code&gt;  reconstruct a run.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(brand_id, first_seen)&lt;/code&gt;  launch timelines.&lt;/li&gt;
&lt;li&gt;Partial index on &lt;code&gt;last_seen_before_now - 7d&lt;/code&gt;  "stopped in the last week."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Late-arriving observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allow &lt;code&gt;observed_at&lt;/code&gt; &amp;lt; &lt;code&gt;max(observed_at)&lt;/code&gt; for the entity  backfills happen.&lt;/li&gt;
&lt;li&gt;Never infer &lt;code&gt;last_seen&lt;/code&gt; as &lt;code&gt;max(observed_at)&lt;/code&gt; blindly; store it as a derived column, rebuild on backfill.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Retention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hot (queryable): last 90 days, in primary DB.&lt;/li&gt;
&lt;li&gt;Warm (occasional): 90d–2y, in columnar storage (Parquet on object store).&lt;/li&gt;
&lt;li&gt;Cold (audit): &amp;gt; 2y, archived.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Snapshot tables for hot queries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Materialize "currently-live entities per brand" nightly.&lt;/li&gt;
&lt;li&gt;Don't rebuild it per request.&lt;/li&gt;
&lt;li&gt;Invalidate on run completion, not on observation write.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5. Slow down on purpose
&lt;/h2&gt;

&lt;p&gt;Rate-limit layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Token bucket, per target, local.&lt;/strong&gt; N per minute. Your budget ≤ 50% of observed tolerance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token bucket, per session.&lt;/strong&gt; Avoids burst even when global budget allows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrent-session cap, per target.&lt;/strong&gt; Usually 1. Parallelism here buys bans.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Circuit breaker states:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;closed&lt;/code&gt;  normal.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;open&lt;/code&gt;  skip this target entirely until cooldown.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;half-open&lt;/code&gt;  next run sends a single canary; success returns to &lt;code&gt;closed&lt;/code&gt;, failure re-opens.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Backoff rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Between runs only. Never inside a burned session.&lt;/li&gt;
&lt;li&gt;Exponential with full jitter: &lt;code&gt;sleep = rand(0, base × 2^attempt)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Cap at &lt;code&gt;max_backoff&lt;/code&gt; (e.g. 24h) and alert.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Structured signals to respect unconditionally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP 429.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Retry-After&lt;/code&gt; header  honor exact value, don't halve it.&lt;/li&gt;
&lt;li&gt;Platform-specific rate-limit error codes in payload body.&lt;/li&gt;
&lt;li&gt;Explicit CAPTCHA interstitials.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Signals to ignore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"200 OK with empty body"  treat as soft failure, not success.&lt;/li&gt;
&lt;li&gt;"Slightly slower response times"  in range of noise.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Session hygiene:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rotate session on circuit-breaker trip, not per request.&lt;/li&gt;
&lt;li&gt;Cache warm sessions with TTL; a warm session that just completed a clean run is gold.&lt;/li&gt;
&lt;li&gt;Log session lineage: how many entities has this session touched, how old is it, what's its error rate.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6. Keep the raw. It's the cheapest insurance.
&lt;/h2&gt;

&lt;p&gt;Storage layout:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;raw/{target}/{run_id}/{seed_id}/{sequence}.json.zst&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Manifest per run: list of all raw artifacts, hashes, byte counts.&lt;/li&gt;
&lt;li&gt;Content-addressed storage optional; by-run layout is usually enough.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compression:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zstd at level 3–6 for JSON payloads. 5–10× smaller than raw.&lt;/li&gt;
&lt;li&gt;Don't compress per-request; batch at the run boundary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Replayability requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raw + run metadata + parser version = deterministic normalized output.&lt;/li&gt;
&lt;li&gt;Parser takes raw input only, no network calls during re-parse.&lt;/li&gt;
&lt;li&gt;Parser version is tagged per output row.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cold tier policy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raw &amp;gt; 180 days → move to archive tier (cheaper, slower retrieval).&lt;/li&gt;
&lt;li&gt;Keep indexes on what's in archive, not the archive itself.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What raw re-parsing has rescued (one year, three incidents):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Silently dropped nested field  re-parsed, backfilled.&lt;/li&gt;
&lt;li&gt;New attribute added  recovered six months of history with no rescrape.&lt;/li&gt;
&lt;li&gt;Platform shipped a new record variant  re-parsed old runs, reclassified.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Rule: fetcher and parser must be separate processes with a durable artifact between them. If they're one process, you can't evolve.&lt;/p&gt;

&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Bot detection is a stance.&lt;/strong&gt; Escalate on signal, don't hard-code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use a real queue.&lt;/strong&gt; Fairness is the product's freshness SLA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dedupe on the platform's ID.&lt;/strong&gt; Probe stability first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store every sighting.&lt;/strong&gt; History is the product.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be a polite guest.&lt;/strong&gt; Token buckets, circuit breakers, backoff between runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep the raw.&lt;/strong&gt; Parser and fetcher must be separable.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything else is tuning. Get the shape right, numbers come later.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://brandmov.com/tools/competitor-ads-lookup" rel="noopener noreferrer"&gt;Competitor Ads Lookup&lt;/a&gt;  free, no signup, deduped live ads for any brand.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://brandmov.com" rel="noopener noreferrer"&gt;Brandmov&lt;/a&gt;  the full tracker, with weekly observations and launch/stop dates.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>playwright</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
