DEV Community

Souymodeep Banerjee

How I scrape and de-dupe Meta ads for 1000 brands

I run Brandmov, a tracker for what DTC brands are running as Meta ads. Behind it: a weekly pipeline, a lot of brand seeds, a table that grows over time.

This isn't a recipe. No selectors, no endpoint names; those posts break in six weeks. These are the six problems you can't skip once your scraper becomes a scheduled job that has to survive for a year.

Want to see the output? Competitor Ads Lookup is free, no signup: paste a brand, get deduped live ads.

1. Bot detection is a decision, not a bug

Stances, cheapest → most expensive:

  • Browser mimicry. Real viewport, locale, timezone, pointer cadence, no automation flags leaking.
  • IP rotation. Residential or mobile proxies, per-session or per-request, geo-matched to content.
  • Gate solving. Captcha APIs, token relays, third-party solver SLAs.
  • Human-in-loop. Operator clears challenge, session stays warm for N minutes.

Fingerprint surface to audit on your own client:

  • TLS / JA3 / JA4: headless browsers ship with distinguishable TLS stacks.
  • HTTP/2 frame order and SETTINGS values.
  • navigator.webdriver, Chrome DevTools Protocol flags, missing chrome global.
  • Canvas, WebGL, AudioContext fingerprints.
  • Font list, plugin list, Accept-Language vs timezone mismatch.
  • Mouse jitter, scroll velocity, keypress dwell time.

Per-session telemetry to emit (before you need it):

  • Challenge shown? (yes/no/type)
  • HTTP error rate in window.
  • Response latency distribution vs baseline.
  • Empty / truncated payload rate.
  • 4xx by code, 5xx by code, explicit rate-limit codes.

Escalation policy:

  • Start at the cheapest stance that has ever worked for this target.
  • Promote on two consecutive sessions failing the health check.
  • Demote after K clean sessions.
  • Promotion thresholds are config, not code.
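
The escalation policy above can be sketched as a small state machine. The stance names and the promote/demote thresholds here are illustrative config values, not anything specific to Brandmov:

```python
# Illustrative sketch of the escalation policy: promote on consecutive
# failed health checks, demote after K clean sessions. All values are
# example config, not recommendations.
STANCES = ["mimicry", "ip_rotation", "gate_solving", "human_in_loop"]

class EscalationState:
    def __init__(self, promote_after=2, demote_after=5):
        self.level = 0               # index into STANCES, cheapest first
        self.fail_streak = 0         # consecutive failed health checks
        self.clean_streak = 0        # consecutive clean sessions
        self.promote_after = promote_after
        self.demote_after = demote_after

    def record_session(self, healthy: bool) -> str:
        if healthy:
            self.fail_streak = 0
            self.clean_streak += 1
            if self.clean_streak >= self.demote_after and self.level > 0:
                self.level -= 1      # demote to a cheaper stance
                self.clean_streak = 0
        else:
            self.clean_streak = 0
            self.fail_streak += 1
            if self.fail_streak >= self.promote_after and self.level < len(STANCES) - 1:
                self.level += 1      # promote to a more expensive stance
                self.fail_streak = 0
        return STANCES[self.level]
```

Because the thresholds are plain constructor arguments, "promotion is config, not code" falls out for free.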

Common traps:

  • Setting a custom User-Agent but forgetting to match sec-ch-ua hints.
  • Spoofing timezone but not Intl.DateTimeFormat().resolvedOptions().timeZone.
  • Proxy pools that share IPs across tenants: you inherit someone else's ban.

2. Your queue is the system. The scraper is a worker.

What the for-loop can't do, the queue can:

  • At-least-once delivery with idempotency keys.
  • Visibility timeouts sized for worst-case task duration × 1.5.
  • Explicit retry counter with max-attempts.
  • Dead-letter lane on attempt N.
  • Resume after host restart without replaying successful work.
  • Per-target concurrency caps, independent of worker count.
  • Backlog metrics per priority tier.

Queue options, rough comparison:

| Option          | Durability | Ordering       | Fairness primitives | Ops cost                   |
|-----------------|------------|----------------|---------------------|----------------------------|
| SQS             | high       | FIFO optional  | none native         | low                        |
| Redis Streams   | medium     | per-stream     | consumer groups     | low                        |
| NATS JetStream  | high       | per-subject    | subject hierarchy   | medium                     |
| Postgres-backed | high       | SQL-controlled | full SQL            | free if you already run PG |
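
A minimal in-memory sketch of the queue semantics above: idempotency keys, a visibility timeout, a retry counter, and a dead-letter lane. A real deployment would sit on SQS, Redis Streams, JetStream, or a Postgres table; every name here is made up:

```python
# In-memory queue sketch. NOT production code: no persistence, no
# concurrency control; it only demonstrates the semantics.
import time

class Task:
    def __init__(self, key):
        self.key = key               # idempotency key
        self.attempts = 0
        self.invisible_until = 0.0

class Queue:
    def __init__(self, visibility_timeout=1.5, max_attempts=3):
        self.tasks, self.dead = {}, []
        self.visibility_timeout = visibility_timeout
        self.max_attempts = max_attempts

    def put(self, key):
        self.tasks.setdefault(key, Task(key))    # dedupe on idempotency key

    def claim(self, now=None):
        now = time.monotonic() if now is None else now
        for t in self.tasks.values():
            if t.invisible_until <= now:
                t.attempts += 1
                t.invisible_until = now + self.visibility_timeout
                return t
        return None

    def ack(self, task):
        self.tasks.pop(task.key, None)           # done: remove for good

    def fail(self, task, now=None):
        now = time.monotonic() if now is None else now
        if task.attempts >= self.max_attempts:   # poison -> dead-letter lane
            self.dead.append(self.tasks.pop(task.key))
        else:
            task.invisible_until = now           # visible again immediately
```

Note that `ack` and `fail` are explicit: a crashed worker simply never calls either, and the task reappears when its visibility timeout lapses.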

Fairness patterns:

  • Oldest-first. Priority = now - last_success. Starves nothing.
  • Weighted round-robin. Each category gets a fixed slice of each tick.
  • Stratified sampling. Sample N from each bucket per run, not N from the whole set.
  • SLA tiers. Hot tier runs daily, warm tier weekly, cold tier monthly.
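
Of these, stratified sampling is the easiest to get wrong by accident (sampling N from the whole set quietly starves small buckets). A sketch, with bucket names invented for illustration:

```python
# Stratified sampling sketch: pick up to N seeds from EACH bucket per
# run, instead of N from the pooled set. Bucket names are illustrative.
import random

def stratified_sample(buckets, n_per_bucket, rng=None):
    """buckets: dict mapping tier name -> list of seeds."""
    rng = rng or random.Random(0)
    picked = []
    for tier, seeds in buckets.items():
        k = min(n_per_bucket, len(seeds))
        picked.extend(rng.sample(seeds, k))
    return picked
```

With a pooled sample, a 100-seed bucket drowns out a 3-seed bucket; per-bucket sampling guarantees every tier gets touched each run.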

Poison detection:

  • Same error class on attempt 3 → dead-letter.
  • Rising cost with no data returned → dead-letter.
  • Seed that has never succeeded after 30 days → quarantine.

Visibility timeout sizing rule:

  • Start at p95(task_duration) × 1.5.
  • If you see duplicate-execution incidents, raise it.
  • If you see stuck-seed incidents, lower it and diagnose the slow task.

3. Dedupe on the platform's ID, or pay for it

Options, with failure modes:

  • Platform ID. Cheapest, correct when stable. Trap: IDs rotate across schema migrations, or are session-scoped and look stable for a single run. Probe longitudinally before trusting.
  • Content hash. Cheap, almost never correct. Trap: whitespace diffs, CDN URL rotation, A/B copy tests, localized variants, truncation.
  • Probabilistic match. Necessary cross-source. Wrong-merges are harder to audit than misses.

ID stability probe (run once before trusting an ID):

  • Capture the same entity weekly for ≥ 4 weeks.
  • Confirm ID is stable across: different geo, different sort, different pagination offset.
  • Confirm ID survives a platform-side field rename.
  • If any fail, the ID is session-scoped; fall back to content hash + field triangulation.

What "invariant content" actually means:

  • Not the title (A/B tests it).
  • Not the body (localized).
  • Not the media URL (CDN-rotated).
  • Not the status (time-varying).
  • Maybe the creation date + brand + structural shape (card count, format).
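
A hedged sketch of a fallback key built only from those maybe-invariant fields. Every field name here is an assumption about the payload shape, not a real schema:

```python
# Fallback dedupe key from "invariant" content only: creation date,
# brand, and structural shape. Field names are illustrative.
import hashlib
import json

def fallback_key(record: dict) -> str:
    invariant = {
        "brand": record["brand"],
        "created": record["created_date"],
        "card_count": len(record.get("cards", [])),  # structural shape
        "format": record.get("format"),
    }
    blob = json.dumps(invariant, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()
```

Title, body, and media URL deliberately don't participate, so A/B copy tests and CDN rotation don't split one entity into many.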

Counters every dedupe path must emit:

  • dedupe.merged: two payloads collapsed into one record.
  • dedupe.distinct_same_shape: two records with identical non-ID fields, different IDs.
  • dedupe.conflict: same ID, contradictory immutable fields.
  • dedupe.first_seen: new record.

The counter that tells you the platform changed: distinct_same_shape spikes.
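
One way those four counters could be wired into a dedupe step. The storage, field names, and the shape comparison are all illustrative:

```python
# Dedupe step emitting the four counters. `seen` stands in for real
# storage; "created" plays the role of an immutable field.
from collections import Counter

def dedupe(seen: dict, payload: dict, counters: Counter):
    pid = payload["id"]
    if pid in seen:
        if seen[pid]["created"] != payload["created"]:
            counters["dedupe.conflict"] += 1      # same ID, immutable field differs
        else:
            counters["dedupe.merged"] += 1        # collapsed into existing record
        return
    shape = tuple(sorted((k, v) for k, v in payload.items() if k != "id"))
    same_shape = any(
        shape == tuple(sorted((k, v) for k, v in p.items() if k != "id"))
        for p in seen.values()
    )
    if same_shape:
        counters["dedupe.distinct_same_shape"] += 1  # the spike to watch for
    else:
        counters["dedupe.first_seen"] += 1
    seen[pid] = payload
```

If the platform rotates its IDs, identical payloads start arriving under fresh IDs and `dedupe.distinct_same_shape` climbs, which is exactly the alert you want.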

4. Store every sighting. Never UPDATE.

Schema shape (conceptual):

  • Entity: identity only (entity_id, created_at, brand FK).
  • Observation: append-only (observation_id, entity_id FK, observed_at, payload snapshot, run FK).
  • Run: metadata about each pipeline execution.
  • Views: latest_observation_per_entity, first_seen, last_seen.
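
The conceptual shape above, written out as SQLite DDL so it runs anywhere. Table and column names are illustrative; a production schema (and a real Postgres deployment) would differ:

```python
# Entity / observation / run split as runnable SQLite DDL.
# Observations are append-only: INSERT, never UPDATE.
import sqlite3

DDL = """
CREATE TABLE entity (
  entity_id  TEXT PRIMARY KEY,
  brand_id   TEXT NOT NULL,
  created_at TEXT NOT NULL
);
CREATE TABLE run (
  run_id     TEXT PRIMARY KEY,
  started_at TEXT NOT NULL
);
CREATE TABLE observation (
  observation_id INTEGER PRIMARY KEY,
  entity_id   TEXT NOT NULL REFERENCES entity(entity_id),
  run_id      TEXT NOT NULL REFERENCES run(run_id),
  observed_at TEXT NOT NULL,
  payload     TEXT NOT NULL
);
CREATE INDEX obs_latest ON observation(entity_id, observed_at DESC);
CREATE VIEW latest_observation_per_entity AS
  SELECT entity_id, MAX(observed_at) AS observed_at
  FROM observation GROUP BY entity_id;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```

Everything time-varying lives in `observation`; `entity` holds only identity, which is what makes "never UPDATE" practical.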

Indexes you'll need within three months:

  • (entity_id, observed_at DESC) for the latest observation.
  • (run_id, entity_id) to reconstruct a run.
  • (brand_id, first_seen) for launch timelines.
  • Partial index on last_seen within now - 7d: "stopped in the last week."

Late-arriving observations:

  • Allow observed_at < max(observed_at) for the entity; backfills happen.
  • Never infer last_seen as max(observed_at) blindly; store it as a derived column, rebuild on backfill.

Retention:

  • Hot (queryable): last 90 days, in primary DB.
  • Warm (occasional): 90d–2y, in columnar storage (Parquet on object store).
  • Cold (audit): > 2y, archived.

Snapshot tables for hot queries:

  • Materialize "currently-live entities per brand" nightly.
  • Don't rebuild it per request.
  • Invalidate on run completion, not on observation write.

5. Slow down on purpose

Rate-limit layers:

  • Token bucket, per target, local. N per minute. Your budget ≤ 50% of observed tolerance.
  • Token bucket, per session. Avoids burst even when global budget allows.
  • Concurrent-session cap, per target. Usually 1. Parallelism here buys bans.
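
A token bucket for the per-target layer might look like this. The rate and burst parameters are illustrative, not recommendations:

```python
# Token bucket: N tokens per minute, refilled continuously, with a
# burst cap. Time is passed in explicitly to keep the sketch testable.
class TokenBucket:
    def __init__(self, rate_per_min: float, burst: float):
        self.rate = rate_per_min / 60.0   # tokens per second
        self.capacity = burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Layering works by composition: a request goes out only if the per-target bucket, the per-session bucket, and the concurrency cap all say yes.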

Circuit breaker states:

  • closed: normal.
  • open: skip this target entirely until cooldown.
  • half-open: next run sends a single canary; success returns to closed, failure re-opens.
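
The three states translate almost directly into code; the cooldown value is an illustrative parameter:

```python
# Circuit breaker with closed / open / half-open states. A half-open
# breaker lets exactly one canary attempt through per cooldown lapse.
class CircuitBreaker:
    def __init__(self, cooldown: float):
        self.state = "closed"
        self.opened_at = 0.0
        self.cooldown = cooldown

    def should_attempt(self, now: float) -> bool:
        if self.state == "open" and now - self.opened_at >= self.cooldown:
            self.state = "half-open"      # next attempt is the canary
        return self.state != "open"

    def record(self, success: bool, now: float):
        if success:
            self.state = "closed"         # canary (or normal run) succeeded
        else:
            self.state = "open"           # failure re-opens, restart cooldown
            self.opened_at = now
```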

Backoff rules:

  • Between runs only. Never inside a burned session.
  • Exponential with full jitter: sleep = rand(0, base × 2^attempt).
  • Cap at max_backoff (e.g. 24h) and alert.

Structured signals to respect unconditionally:

  • HTTP 429.
  • Retry-After header: honor the exact value, don't halve it.
  • Platform-specific rate-limit error codes in payload body.
  • Explicit CAPTCHA interstitials.

Signals to read with judgment:

  • "200 OK with empty body": treat as a soft failure, not a success.
  • "Slightly slower response times": usually within the range of noise; ignore.

Session hygiene:

  • Rotate session on circuit-breaker trip, not per request.
  • Cache warm sessions with TTL; a warm session that just completed a clean run is gold.
  • Log session lineage: how many entities has this session touched, how old is it, what's its error rate.

6. Keep the raw. It's the cheapest insurance.

Storage layout:

  • raw/{target}/{run_id}/{seed_id}/{sequence}.json.zst
  • Manifest per run: list of all raw artifacts, hashes, byte counts.
  • Content-addressed storage optional; by-run layout is usually enough.
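
The layout and the per-run manifest are cheap to encode as helpers. The hash choice and all names here are illustrative:

```python
# Helpers for the by-run raw layout and its manifest:
# raw/{target}/{run_id}/{seed_id}/{sequence}.json.zst
import hashlib

def raw_key(target: str, run_id: str, seed_id: str, sequence: int) -> str:
    # Zero-padded sequence keeps lexicographic order == fetch order.
    return f"raw/{target}/{run_id}/{seed_id}/{sequence:06d}.json.zst"

def manifest_entry(key: str, blob: bytes) -> dict:
    # One entry per raw artifact: key, content hash, byte count.
    return {
        "key": key,
        "sha256": hashlib.sha256(blob).hexdigest(),
        "bytes": len(blob),
    }
```

The manifest is what lets you verify an archived run years later without re-reading every artifact.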

Compression:

  • Zstd at level 3–6 for JSON payloads. 5–10× smaller than raw.
  • Don't compress per-request; batch at the run boundary.

Replayability requirements:

  • Raw + run metadata + parser version = deterministic normalized output.
  • Parser takes raw input only, no network calls during re-parse.
  • Parser version is tagged per output row.

Cold tier policy:

  • Raw > 180 days → move to archive tier (cheaper, slower retrieval).
  • Keep indexes on what's in archive, not the archive itself.

What raw re-parsing has rescued (one year, three incidents):

  1. Silently dropped nested field: re-parsed, backfilled.
  2. New attribute added: recovered six months of history with no rescrape.
  3. Platform shipped a new record variant: re-parsed old runs, reclassified.

Rule: fetcher and parser must be separate processes with a durable artifact between them. If they're one process, you can't evolve.

The short version

  1. Bot detection is a stance. Escalate on signal, don't hard-code.
  2. Use a real queue. Fairness is the product's freshness SLA.
  3. Dedupe on the platform's ID. Probe stability first.
  4. Store every sighting. History is the product.
  5. Be a polite guest. Token buckets, circuit breakers, backoff between runs.
  6. Keep the raw. Parser and fetcher must be separable.

Everything else is tuning. Get the shape right, numbers come later.

  • Competitor Ads Lookup: free, no signup, deduped live ads for any brand.
  • Brandmov: the full tracker, with weekly observations and launch/stop dates.
