DEV Community

Souymodeep Banerjee

How I scrape and de-dupe Meta ads for 1000 brands

I run Brandmov, a tracker for what DTC brands are running as Meta ads. Behind it: a weekly pipeline, a lot of brand seeds, a table that grows over time.

This isn't a recipe. No selectors, no endpoint names; those posts break in six weeks. These are the six problems you can't skip once your scraper becomes a scheduled job that has to survive for a year.

Want to see the output? Competitor Ads Lookup is free, no signup: paste a brand, get deduped live ads.

1. Bot detection is a decision, not a bug

Stances, cheapest → most expensive:

  • Browser mimicry. Real viewport, locale, timezone, pointer cadence, no automation flags leaking.
  • IP rotation. Residential or mobile proxies, per-session or per-request, geo-matched to content.
  • Gate solving. Captcha APIs, token relays, third-party solver SLAs.
  • Human-in-loop. Operator clears challenge, session stays warm for N minutes.

Fingerprint surface to audit on your own client:

  • TLS / JA3 / JA4: headless browsers ship with distinguishable TLS stacks.
  • HTTP/2 frame order and SETTINGS values.
  • navigator.webdriver, Chrome DevTools Protocol flags, missing chrome global.
  • Canvas, WebGL, AudioContext fingerprints.
  • Font list, plugin list, Accept-Language vs timezone mismatch.
  • Mouse jitter, scroll velocity, keypress dwell time.

Per-session telemetry to emit (before you need it):

  • Challenge shown? (yes/no/type)
  • HTTP error rate in window.
  • Response latency distribution vs baseline.
  • Empty / truncated payload rate.
  • 4xx by code, 5xx by code, explicit rate-limit codes.

Escalation policy:

  • Start at the cheapest stance that has ever worked for this target.
  • Promote on two consecutive sessions failing the health check.
  • Demote after K clean sessions.
  • Promotion thresholds are config, not code.
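
The escalation policy above can be sketched as a small state machine. The stance names and the promote/demote thresholds here are illustrative config values, not anything specific to Brandmov:

```python
# Illustrative sketch of the escalation policy: promote on consecutive
# failed health checks, demote after K clean sessions. All values are
# example config, not recommendations.
STANCES = ["mimicry", "ip_rotation", "gate_solving", "human_in_loop"]

class EscalationState:
    def __init__(self, promote_after=2, demote_after=5):
        self.level = 0               # index into STANCES, cheapest first
        self.fail_streak = 0         # consecutive failed health checks
        self.clean_streak = 0        # consecutive clean sessions
        self.promote_after = promote_after
        self.demote_after = demote_after

    def record_session(self, healthy: bool) -> str:
        if healthy:
            self.fail_streak = 0
            self.clean_streak += 1
            if self.clean_streak >= self.demote_after and self.level > 0:
                self.level -= 1      # demote to a cheaper stance
                self.clean_streak = 0
        else:
            self.clean_streak = 0
            self.fail_streak += 1
            if self.fail_streak >= self.promote_after and self.level < len(STANCES) - 1:
                self.level += 1      # promote to a more expensive stance
                self.fail_streak = 0
        return STANCES[self.level]
```

Because the thresholds are plain constructor arguments, "promotion is config, not code" falls out for free.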

Common traps:

  • Setting a custom User-Agent but forgetting to match sec-ch-ua hints.
  • Spoofing timezone but not Intl.DateTimeFormat().resolvedOptions().timeZone.
  • Proxy pools that share IPs across tenants: you inherit someone else's ban.

2. Your queue is the system. The scraper is a worker.

What the for-loop can't do, the queue can:

  • At-least-once delivery with idempotency keys.
  • Visibility timeouts sized for worst-case task duration × 1.5.
  • Explicit retry counter with max-attempts.
  • Dead-letter lane on attempt N.
  • Resume after host restart without replaying successful work.
  • Per-target concurrency caps, independent of worker count.
  • Backlog metrics per priority tier.

Queue options, rough comparison:

| Option          | Durability | Ordering       | Fairness primitives | Ops cost                   |
|-----------------|------------|----------------|---------------------|----------------------------|
| SQS             | high       | FIFO optional  | none native         | low                        |
| Redis Streams   | medium     | per-stream     | consumer groups     | low                        |
| NATS JetStream  | high       | per-subject    | subject hierarchy   | medium                     |
| Postgres-backed | high       | SQL-controlled | full SQL            | free if you already run PG |
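
A minimal in-memory sketch of the queue semantics above: idempotency keys, a visibility timeout, a retry counter, and a dead-letter lane. A real deployment would sit on SQS, Redis Streams, JetStream, or a Postgres table; every name here is made up:

```python
# In-memory queue sketch. NOT production code: no persistence, no
# concurrency control; it only demonstrates the semantics.
import time

class Task:
    def __init__(self, key):
        self.key = key               # idempotency key
        self.attempts = 0
        self.invisible_until = 0.0

class Queue:
    def __init__(self, visibility_timeout=1.5, max_attempts=3):
        self.tasks, self.dead = {}, []
        self.visibility_timeout = visibility_timeout
        self.max_attempts = max_attempts

    def put(self, key):
        self.tasks.setdefault(key, Task(key))    # dedupe on idempotency key

    def claim(self, now=None):
        now = time.monotonic() if now is None else now
        for t in self.tasks.values():
            if t.invisible_until <= now:
                t.attempts += 1
                t.invisible_until = now + self.visibility_timeout
                return t
        return None

    def ack(self, task):
        self.tasks.pop(task.key, None)           # done: remove for good

    def fail(self, task, now=None):
        now = time.monotonic() if now is None else now
        if task.attempts >= self.max_attempts:   # poison -> dead-letter lane
            self.dead.append(self.tasks.pop(task.key))
        else:
            task.invisible_until = now           # visible again immediately
```

Note that `ack` and `fail` are explicit: a crashed worker simply never calls either, and the task reappears when its visibility timeout lapses.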

Fairness patterns:

  • Oldest-first. Priority = now - last_success. Starves nothing.
  • Weighted round-robin. Each category gets a fixed slice of each tick.
  • Stratified sampling. Sample N from each bucket per run, not N from the whole set.
  • SLA tiers. Hot tier runs daily, warm tier weekly, cold tier monthly.
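
Of these, stratified sampling is the easiest to get wrong by accident (sampling N from the whole set quietly starves small buckets). A sketch, with bucket names invented for illustration:

```python
# Stratified sampling sketch: pick up to N seeds from EACH bucket per
# run, instead of N from the pooled set. Bucket names are illustrative.
import random

def stratified_sample(buckets, n_per_bucket, rng=None):
    """buckets: dict mapping tier name -> list of seeds."""
    rng = rng or random.Random(0)
    picked = []
    for tier, seeds in buckets.items():
        k = min(n_per_bucket, len(seeds))
        picked.extend(rng.sample(seeds, k))
    return picked
```

With a pooled sample, a 100-seed bucket drowns out a 3-seed bucket; per-bucket sampling guarantees every tier gets touched each run.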

Poison detection:

  • Same error class on attempt 3 → dead-letter.
  • Rising cost with no data returned → dead-letter.
  • Seed that has never succeeded after 30 days → quarantine.

Visibility timeout sizing rule:

  • Start at p95(task_duration) × 1.5.
  • If you see duplicate-execution incidents, raise it.
  • If you see stuck-seed incidents, lower it and diagnose the slow task.

3. Dedupe on the platform's ID, or pay for it

Options, with failure modes:

  • Platform ID. Cheapest, correct when stable. Trap: IDs rotate across schema migrations, or are session-scoped and look stable for a single run. Probe longitudinally before trusting.
  • Content hash. Cheap, almost never correct. Trap: whitespace diffs, CDN URL rotation, A/B copy tests, localized variants, truncation.
  • Probabilistic match. Necessary cross-source. Wrong-merges are harder to audit than misses.

ID stability probe (run once before trusting an ID):

  • Capture the same entity weekly for ≥ 4 weeks.
  • Confirm ID is stable across: different geo, different sort, different pagination offset.
  • Confirm ID survives a platform-side field rename.
  • If any fail, the ID is session-scoped; fall back to content hash + field triangulation.

What "invariant content" actually means:

  • Not the title (A/B tests it).
  • Not the body (localized).
  • Not the media URL (CDN-rotated).
  • Not the status (time-varying).
  • Maybe the creation date + brand + structural shape (card count, format).
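
A hedged sketch of a fallback key built only from those maybe-invariant fields. Every field name here is an assumption about the payload shape, not a real schema:

```python
# Fallback dedupe key from "invariant" content only: creation date,
# brand, and structural shape. Field names are illustrative.
import hashlib
import json

def fallback_key(record: dict) -> str:
    invariant = {
        "brand": record["brand"],
        "created": record["created_date"],
        "card_count": len(record.get("cards", [])),  # structural shape
        "format": record.get("format"),
    }
    blob = json.dumps(invariant, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()
```

Title, body, and media URL deliberately don't participate, so A/B copy tests and CDN rotation don't split one entity into many.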

Counters every dedupe path must emit:

  • dedupe.merged: two payloads collapsed into one record.
  • dedupe.distinct_same_shape: two records with identical non-ID fields, different IDs.
  • dedupe.conflict: same ID, contradictory immutable fields.
  • dedupe.first_seen: new record.

The counter that tells you the platform changed: distinct_same_shape spikes.
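
One way those four counters could be wired into a dedupe step. The storage, field names, and the shape comparison are all illustrative:

```python
# Dedupe step emitting the four counters. `seen` stands in for real
# storage; "created" plays the role of an immutable field.
from collections import Counter

def dedupe(seen: dict, payload: dict, counters: Counter):
    pid = payload["id"]
    if pid in seen:
        if seen[pid]["created"] != payload["created"]:
            counters["dedupe.conflict"] += 1      # same ID, immutable field differs
        else:
            counters["dedupe.merged"] += 1        # collapsed into existing record
        return
    shape = tuple(sorted((k, v) for k, v in payload.items() if k != "id"))
    same_shape = any(
        shape == tuple(sorted((k, v) for k, v in p.items() if k != "id"))
        for p in seen.values()
    )
    if same_shape:
        counters["dedupe.distinct_same_shape"] += 1  # the spike to watch for
    else:
        counters["dedupe.first_seen"] += 1
    seen[pid] = payload
```

If the platform rotates its IDs, identical payloads start arriving under fresh IDs and `dedupe.distinct_same_shape` climbs, which is exactly the alert you want.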

4. Store every sighting. Never UPDATE.

Schema shape (conceptual):

  • Entity: identity only (entity_id, created_at, brand FK).
  • Observation: append-only (observation_id, entity_id FK, observed_at, payload snapshot, run FK).
  • Run: metadata about each pipeline execution.
  • Views: latest_observation_per_entity, first_seen, last_seen.
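
The conceptual shape above, written out as SQLite DDL so it runs anywhere. Table and column names are illustrative; a production schema (and a real Postgres deployment) would differ:

```python
# Entity / observation / run split as runnable SQLite DDL.
# Observations are append-only: INSERT, never UPDATE.
import sqlite3

DDL = """
CREATE TABLE entity (
  entity_id  TEXT PRIMARY KEY,
  brand_id   TEXT NOT NULL,
  created_at TEXT NOT NULL
);
CREATE TABLE run (
  run_id     TEXT PRIMARY KEY,
  started_at TEXT NOT NULL
);
CREATE TABLE observation (
  observation_id INTEGER PRIMARY KEY,
  entity_id   TEXT NOT NULL REFERENCES entity(entity_id),
  run_id      TEXT NOT NULL REFERENCES run(run_id),
  observed_at TEXT NOT NULL,
  payload     TEXT NOT NULL
);
CREATE INDEX obs_latest ON observation(entity_id, observed_at DESC);
CREATE VIEW latest_observation_per_entity AS
  SELECT entity_id, MAX(observed_at) AS observed_at
  FROM observation GROUP BY entity_id;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```

Everything time-varying lives in `observation`; `entity` holds only identity, which is what makes "never UPDATE" practical.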

Indexes you'll need within three months:

  • (entity_id, observed_at DESC) for the latest observation.
  • (run_id, entity_id) to reconstruct a run.
  • (brand_id, first_seen) for launch timelines.
  • Partial index on last_seen within now - 7d: "stopped in the last week."

Late-arriving observations:

  • Allow observed_at < max(observed_at) for the entity; backfills happen.
  • Never infer last_seen as max(observed_at) blindly; store it as a derived column, rebuild on backfill.

Retention:

  • Hot (queryable): last 90 days, in primary DB.
  • Warm (occasional): 90d–2y, in columnar storage (Parquet on object store).
  • Cold (audit): > 2y, archived.

Snapshot tables for hot queries:

  • Materialize "currently-live entities per brand" nightly.
  • Don't rebuild it per request.
  • Invalidate on run completion, not on observation write.

5. Slow down on purpose

Rate-limit layers:

  • Token bucket, per target, local. N per minute. Your budget ≤ 50% of observed tolerance.
  • Token bucket, per session. Avoids burst even when global budget allows.
  • Concurrent-session cap, per target. Usually 1. Parallelism here buys bans.
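
A token bucket for the per-target layer might look like this. The rate and burst parameters are illustrative, not recommendations:

```python
# Token bucket: N tokens per minute, refilled continuously, with a
# burst cap. Time is passed in explicitly to keep the sketch testable.
class TokenBucket:
    def __init__(self, rate_per_min: float, burst: float):
        self.rate = rate_per_min / 60.0   # tokens per second
        self.capacity = burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Layering works by composition: a request goes out only if the per-target bucket, the per-session bucket, and the concurrency cap all say yes.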

Circuit breaker states:

  • closed: normal.
  • open: skip this target entirely until cooldown.
  • half-open: next run sends a single canary; success returns to closed, failure re-opens.
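
The three states translate almost directly into code; the cooldown value is an illustrative parameter:

```python
# Circuit breaker with closed / open / half-open states. A half-open
# breaker lets exactly one canary attempt through per cooldown lapse.
class CircuitBreaker:
    def __init__(self, cooldown: float):
        self.state = "closed"
        self.opened_at = 0.0
        self.cooldown = cooldown

    def should_attempt(self, now: float) -> bool:
        if self.state == "open" and now - self.opened_at >= self.cooldown:
            self.state = "half-open"      # next attempt is the canary
        return self.state != "open"

    def record(self, success: bool, now: float):
        if success:
            self.state = "closed"         # canary (or normal run) succeeded
        else:
            self.state = "open"           # failure re-opens, restart cooldown
            self.opened_at = now
```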

Backoff rules:

  • Between runs only. Never inside a burned session.
  • Exponential with full jitter: sleep = rand(0, base × 2^attempt).
  • Cap at max_backoff (e.g. 24h) and alert.

Structured signals to respect unconditionally:

  • HTTP 429.
  • Retry-After header: honor the exact value, don't halve it.
  • Platform-specific rate-limit error codes in payload body.
  • Explicit CAPTCHA interstitials.

Signals to read with judgment:

  • "200 OK with empty body": treat as a soft failure, not a success.
  • "Slightly slower response times": usually within the range of noise; ignore.

Session hygiene:

  • Rotate session on circuit-breaker trip, not per request.
  • Cache warm sessions with TTL; a warm session that just completed a clean run is gold.
  • Log session lineage: how many entities has this session touched, how old is it, what's its error rate.

6. Keep the raw. It's the cheapest insurance.

Storage layout:

  • raw/{target}/{run_id}/{seed_id}/{sequence}.json.zst
  • Manifest per run: list of all raw artifacts, hashes, byte counts.
  • Content-addressed storage optional; by-run layout is usually enough.
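
The layout and the per-run manifest are cheap to encode as helpers. The hash choice and all names here are illustrative:

```python
# Helpers for the by-run raw layout and its manifest:
# raw/{target}/{run_id}/{seed_id}/{sequence}.json.zst
import hashlib

def raw_key(target: str, run_id: str, seed_id: str, sequence: int) -> str:
    # Zero-padded sequence keeps lexicographic order == fetch order.
    return f"raw/{target}/{run_id}/{seed_id}/{sequence:06d}.json.zst"

def manifest_entry(key: str, blob: bytes) -> dict:
    # One entry per raw artifact: key, content hash, byte count.
    return {
        "key": key,
        "sha256": hashlib.sha256(blob).hexdigest(),
        "bytes": len(blob),
    }
```

The manifest is what lets you verify an archived run years later without re-reading every artifact.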

Compression:

  • Zstd at level 3–6 for JSON payloads. 5–10× smaller than raw.
  • Don't compress per-request; batch at the run boundary.

Replayability requirements:

  • Raw + run metadata + parser version = deterministic normalized output.
  • Parser takes raw input only, no network calls during re-parse.
  • Parser version is tagged per output row.

Cold tier policy:

  • Raw > 180 days → move to archive tier (cheaper, slower retrieval).
  • Keep indexes on what's in archive, not the archive itself.

What raw re-parsing has rescued (one year, three incidents):

  1. Silently dropped nested field: re-parsed, backfilled.
  2. New attribute added: recovered six months of history with no rescrape.
  3. Platform shipped a new record variant: re-parsed old runs, reclassified.

Rule: fetcher and parser must be separate processes with a durable artifact between them. If they're one process, you can't evolve.

The short version

  1. Bot detection is a stance. Escalate on signal, don't hard-code.
  2. Use a real queue. Fairness is the product's freshness SLA.
  3. Dedupe on the platform's ID. Probe stability first.
  4. Store every sighting. History is the product.
  5. Be a polite guest. Token buckets, circuit breakers, backoff between runs.
  6. Keep the raw. Parser and fetcher must be separable.

Everything else is tuning. Get the shape right, numbers come later.

  • Competitor Ads Lookup: free, no signup, deduped live ads for any brand.
  • Brandmov: the full tracker, with weekly observations and launch/stop dates.
