I run Brandmov, a tracker for what DTC brands are running as Meta ads. Behind it: a weekly pipeline, a lot of brand seeds, a table that grows over time.
This isn't a recipe. No selectors, no endpoint names; those posts break in six weeks. These are the six problems you can't skip once your scraper becomes a scheduled job that has to survive for a year.
Want to see the output? Competitor Ads Lookup: free, no signup, paste a brand, get deduped live ads.
1. Bot detection is a decision, not a bug
Stances, cheapest → most expensive:
- Browser mimicry. Real viewport, locale, timezone, pointer cadence, no automation flags leaking.
- IP rotation. Residential or mobile proxies, per-session or per-request, geo-matched to content.
- Gate solving. Captcha APIs, token relays, third-party solver SLAs.
- Human-in-loop. Operator clears challenge, session stays warm for N minutes.
Fingerprint surface to audit on your own client:
- TLS / JA3 / JA4: headless browsers ship with distinguishable TLS stacks.
- HTTP/2 frame order and SETTINGS values.
- `navigator.webdriver`, Chrome DevTools Protocol flags, missing `chrome` global.
- Canvas, WebGL, AudioContext fingerprints.
- Font list, plugin list, `Accept-Language` vs timezone mismatch.
- Mouse jitter, scroll velocity, keypress dwell time.
Per-session telemetry to emit (before you need it):
- Challenge shown? (yes/no/type)
- HTTP error rate in window.
- Response latency distribution vs baseline.
- Empty / truncated payload rate.
- 4xx by code, 5xx by code, explicit rate-limit codes.
Escalation policy:
- Start at the cheapest stance that has ever worked for this target.
- Promote on two consecutive sessions failing the health check.
- Demote after K clean sessions.
- Promote is config, not code.
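The escalation policy above is small enough to sketch as a state machine. Everything concrete here is an assumption for illustration: the stance names, and the promote-after-2 / demote-after-K thresholds (which, per the last bullet, belong in config).

```python
from dataclasses import dataclass

# Stances, cheapest first; promotion moves right, demotion moves left.
STANCES = ["mimicry", "ip_rotation", "gate_solving", "human_in_loop"]

@dataclass
class EscalationPolicy:
    """Promote after N consecutive failed sessions, demote after K clean ones."""
    promote_after: int = 2   # consecutive health-check failures before promoting
    demote_after: int = 5    # K clean sessions before demoting (config, not code)
    stance: int = 0          # index into STANCES
    fails: int = 0
    cleans: int = 0

    def record_session(self, healthy: bool) -> str:
        if healthy:
            self.fails = 0
            self.cleans += 1
            if self.cleans >= self.demote_after and self.stance > 0:
                self.stance -= 1
                self.cleans = 0
        else:
            self.cleans = 0
            self.fails += 1
            if self.fails >= self.promote_after and self.stance < len(STANCES) - 1:
                self.stance += 1
                self.fails = 0
        return STANCES[self.stance]
```

Feeding it the per-session health-check verdict (built from the telemetry above) is the whole integration surface.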
Common traps:
- Setting a custom `User-Agent` but forgetting to match the `sec-ch-ua` hints.
- Spoofing the timezone but not `Intl.DateTimeFormat().resolvedOptions().timeZone`.
- Proxy pools that share IPs across tenants: you inherit someone else's ban.
2. Your queue is the system. The scraper is a worker.
What the for-loop can't do, the queue can:
- At-least-once delivery with idempotency keys.
- Visibility timeouts sized for worst-case task duration × 1.5.
- Explicit retry counter with max-attempts.
- Dead-letter lane on attempt N.
- Resume after host restart without replaying successful work.
- Per-target concurrency caps, independent of worker count.
- Backlog metrics per priority tier.
Queue options, rough comparison:
| Option | Durability | Ordering | Fairness primitives | Ops cost |
|---|---|---|---|---|
| SQS | high | FIFO optional | none native | low |
| Redis Streams | medium | per-stream | consumer groups | low |
| NATS JetStream | high | per-subject | subject hierarchy | medium |
| Postgres-backed | high | SQL-controlled | full SQL | free if you already run PG |
Fairness patterns:
- Oldest-first. Priority = `now - last_success`. Starves nothing.
- Weighted round-robin. Each category gets a fixed slice of each tick.
- Stratified sampling. Sample N from each bucket per run, not N from the whole set.
- SLA tiers. Hot tier runs daily, warm tier weekly, cold tier monthly.
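Two of those fairness patterns as pure functions. The data shapes are assumptions for the sketch: seeds as a map of seed → last-success timestamp, buckets as category → seed list.

```python
import random

def oldest_first(seeds: dict[str, float], now: float, n: int) -> list[str]:
    """Priority = now - last_success: the longest-unrefreshed seeds win."""
    return sorted(seeds, key=lambda s: now - seeds[s], reverse=True)[:n]

def stratified(buckets: dict[str, list[str]], per_bucket: int,
               rng: random.Random) -> list[str]:
    """Sample N from each bucket per run, so big categories can't starve small ones."""
    picked: list[str] = []
    for name in sorted(buckets):            # deterministic bucket order
        pool = buckets[name]
        picked += rng.sample(pool, min(per_bucket, len(pool)))
    return picked
```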
Poison detection:
- Same error class on attempt 3 → dead-letter.
- Rising cost with no data returned → dead-letter.
- Seed that has never succeeded after 30 days → quarantine.
Visibility timeout sizing rule:
- Start at `p95(task_duration) × 1.5`.
- If you see duplicate-execution incidents, raise it.
- If you see stuck-seed incidents, lower it and diagnose the slow task.
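The starting point is computable from observed task durations with the stdlib alone; `statistics.quantiles` with `n=20` yields the 95th percentile as its last cut point.

```python
import statistics

def visibility_timeout(durations_s: list[float], factor: float = 1.5) -> float:
    """p95(task_duration) × 1.5, per the sizing rule above."""
    p95 = statistics.quantiles(durations_s, n=20)[-1]  # 95th percentile
    return p95 * factor
```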
3. Dedupe on the platform's ID, or pay for it
Options, with failure modes:
- Platform ID. Cheapest, correct when stable. Trap: IDs rotate across schema migrations, or are session-scoped and look stable for a single run. Probe longitudinally before trusting.
- Content hash. Cheap, almost never correct. Trap: whitespace diffs, CDN URL rotation, A/B copy tests, localized variants, truncation.
- Probabilistic match. Necessary cross-source. Wrong-merges are harder to audit than misses.
ID stability probe (run once before trusting an ID):
- Capture the same entity weekly for ≥ 4 weeks.
- Confirm ID is stable across: different geo, different sort, different pagination offset.
- Confirm ID survives a platform-side field rename.
- If any fail, the ID is session-scoped; fall back to content hash + field triangulation.
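The probe's final check reduces to one set comparison, assuming you've captured weekly `{variation_context: platform_id}` maps for the entity (the context keys below are illustrative):

```python
def id_is_stable(captures: list[dict[str, str]]) -> bool:
    """captures: one {variation_context: platform_id} map per week, e.g.
    {"geo=US,sort=recent,page=0": "123", ...}. Stable means exactly one
    distinct ID across every context and every week."""
    ids = {pid for week in captures for pid in week.values()}
    return len(ids) == 1
```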
What "invariant content" actually means:
- Not the title (A/B tests it).
- Not the body (localized).
- Not the media URL (CDN-rotated).
- Not the status (time-varying).
- Maybe the creation date + brand + structural shape (card count, format).
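A fallback content key built only from fields assumed invariant. The field names here are hypothetical; in practice they come out of your own longitudinal probing, not this list.

```python
import hashlib
import json

# Hypothetical invariant fields, per the list above: creation date + brand +
# structural shape. Everything A/B-tested, localized, or CDN-rotated is excluded.
INVARIANT_FIELDS = ("created_date", "brand", "card_count", "format")

def content_key(record: dict) -> str:
    """Fallback identity when the platform ID proves session-scoped."""
    invariant = {f: record.get(f) for f in INVARIANT_FIELDS}
    # sort_keys + fixed separators make the serialization deterministic
    blob = json.dumps(invariant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()
```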
Counters every dedupe path must emit:
- `dedupe.merged`: two payloads collapsed into one record.
- `dedupe.distinct_same_shape`: two records with identical non-ID fields, different IDs.
- `dedupe.conflict`: same ID, contradictory immutable fields.
- `dedupe.first_seen`: new record.
The counter that tells you the platform changed: `distinct_same_shape` spikes.
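One way to emit all four counters from a single classification function. The in-memory `store` dict and the choice of `created_date` as the immutable field are illustrative assumptions.

```python
from collections import Counter

counters: Counter = Counter()

def classify(incoming: dict, store: dict[str, dict]) -> str:
    """store maps platform ID -> stored record; returns the counter emitted."""
    pid = incoming["id"]
    if pid in store:
        if store[pid]["created_date"] != incoming["created_date"]:
            event = "dedupe.conflict"       # same ID, immutable field changed
        else:
            event = "dedupe.merged"         # same ID, collapse into one record
    else:
        same_shape = any(
            {k: v for k, v in r.items() if k != "id"}
            == {k: v for k, v in incoming.items() if k != "id"}
            for r in store.values())
        event = "dedupe.distinct_same_shape" if same_shape else "dedupe.first_seen"
        store[pid] = incoming
    counters[event] += 1
    return event
```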
4. Store every sighting. Never UPDATE.
Schema shape (conceptual):
- `Entity`: identity only. `entity_id`, `created_at`, brand FK.
- `Observation`: append-only. `observation_id`, `entity_id` FK, `observed_at`, payload snapshot, run FK.
- `Run`: metadata about each pipeline execution.
- Views: `latest_observation_per_entity`, `first_seen`, `last_seen`.
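The conceptual shape, rendered as SQLite DDL so the sketch stays runnable; a production system would likely be Postgres, and the column types here are placeholders.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE run (run_id INTEGER PRIMARY KEY, started_at TEXT);
CREATE TABLE entity (entity_id TEXT PRIMARY KEY, created_at TEXT, brand_id TEXT);
CREATE TABLE observation (          -- append-only: INSERT, never UPDATE
    observation_id INTEGER PRIMARY KEY,
    entity_id TEXT REFERENCES entity(entity_id),
    run_id INTEGER REFERENCES run(run_id),
    observed_at TEXT,
    payload TEXT);                  -- snapshot of what the platform returned
CREATE VIEW latest_observation_per_entity AS
    SELECT entity_id, MAX(observed_at) AS observed_at
    FROM observation GROUP BY entity_id;
CREATE INDEX obs_latest ON observation (entity_id, observed_at DESC);
""")
```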
Indexes you'll need within three months:
- `(entity_id, observed_at DESC)`: latest observation.
- `(run_id, entity_id)`: reconstruct a run.
- `(brand_id, first_seen)`: launch timelines.
- Partial index on `last_seen` before `now - 7d`: "stopped in the last week."
Late-arriving observations:
- Allow `observed_at < max(observed_at)` for the entity; backfills happen.
- Never infer `last_seen` as `max(observed_at)` blindly; store it as a derived column and rebuild it on backfill.
Retention:
- Hot (queryable): last 90 days, in primary DB.
- Warm (occasional): 90d–2y, in columnar storage (Parquet on object store).
- Cold (audit): > 2y, archived.
Snapshot tables for hot queries:
- Materialize "currently-live entities per brand" nightly.
- Don't rebuild it per request.
- Invalidate on run completion, not on observation write.
5. Slow down on purpose
Rate-limit layers:
- Token bucket, per target, local. N per minute. Your budget ≤ 50% of observed tolerance.
- Token bucket, per session. Avoids burst even when global budget allows.
- Concurrent-session cap, per target. Usually 1. Parallelism here buys bans.
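A local per-target token bucket is only a few lines. The rate and capacity you'd actually configure come from the ≤ 50%-of-observed-tolerance budget above; the numbers in the test are illustrative.

```python
class TokenBucket:
    """Per-target local token bucket: sustained rate plus a bounded burst."""

    def __init__(self, rate_per_s: float, capacity: float):
        self.rate = rate_per_s       # refill rate, tokens per second
        self.capacity = capacity     # max burst size
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Taking `now` as a parameter instead of calling the clock keeps the bucket trivially testable.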
Circuit breaker states:
- `closed`: normal operation.
- `open`: skip this target entirely until cooldown.
- `half-open`: the next run sends a single canary; success returns to `closed`, failure re-opens.
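The three states as a minimal per-target breaker; the cooldown length is an example value.

```python
import enum

class State(enum.Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"

class Breaker:
    """Per-target breaker: trip on failure, cool down, probe with one canary."""

    def __init__(self, cooldown_s: float = 3600.0):
        self.state = State.CLOSED
        self.cooldown = cooldown_s
        self.opened_at = 0.0

    def allow(self, now: float) -> bool:
        if self.state is State.OPEN and now - self.opened_at >= self.cooldown:
            self.state = State.HALF_OPEN    # next run sends a single canary
        return self.state is not State.OPEN

    def record(self, ok: bool, now: float) -> None:
        if ok:
            self.state = State.CLOSED       # canary (or normal run) succeeded
        else:
            self.state, self.opened_at = State.OPEN, now
```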
Backoff rules:
- Between runs only. Never inside a burned session.
- Exponential with full jitter: `sleep = rand(0, base × 2^attempt)`.
- Cap at `max_backoff` (e.g. 24h) and alert.
Structured signals to respect unconditionally:
- HTTP 429.
- `Retry-After` header: honor the exact value, don't halve it.
- Platform-specific rate-limit error codes in the payload body.
- Explicit CAPTCHA interstitials.
Signals to ignore:
- "200 OK with empty body": treat it as a soft failure, not success.
- "Slightly slower response times": within the range of noise.
Session hygiene:
- Rotate session on circuit-breaker trip, not per request.
- Cache warm sessions with TTL; a warm session that just completed a clean run is gold.
- Log session lineage: how many entities has this session touched, how old is it, what's its error rate.
6. Keep the raw. It's the cheapest insurance.
Storage layout:
- `raw/{target}/{run_id}/{seed_id}/{sequence}.json.zst`
- Manifest per run: list of all raw artifacts, hashes, byte counts.
- Content-addressed storage is optional; a by-run layout is usually enough.
Compression:
- Zstd at level 3–6 for JSON payloads. 5–10× smaller than raw.
- Don't compress per-request; batch at the run boundary.
Replayability requirements:
- Raw + run metadata + parser version = deterministic normalized output.
- Parser takes raw input only, no network calls during re-parse.
- Parser version is tagged per output row.
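What those three requirements look like in practice: a parser that is a pure function of the raw bytes plus its own version tag, with no network and no clock. The payload shape (`items`, `id`, `brand`) is invented for the sketch.

```python
import json

PARSER_VERSION = "v3"   # illustrative; tagged onto every output row

def parse(raw_bytes: bytes) -> list[dict]:
    """Raw in, normalized rows out. No network calls, no clock reads, no
    globals beyond PARSER_VERSION, so raw + version => deterministic output."""
    payload = json.loads(raw_bytes)
    return [{"entity_id": item["id"],
             "brand": item.get("brand"),
             "parser_version": PARSER_VERSION}
            for item in payload.get("items", [])]
```

Re-parsing an archived run is then just streaming each stored artifact through `parse` again under a newer `PARSER_VERSION`.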
Cold tier policy:
- Raw > 180 days → move to archive tier (cheaper, slower retrieval).
- Keep indexes on what's in archive, not the archive itself.
What raw re-parsing has rescued (one year, three incidents):
- Silently dropped nested field: re-parsed, backfilled.
- New attribute added: recovered six months of history with no rescrape.
- Platform shipped a new record variant: re-parsed old runs, reclassified.
Rule: fetcher and parser must be separate processes with a durable artifact between them. If they're one process, you can't evolve.
The short version
- Bot detection is a stance. Escalate on signal, don't hard-code.
- Use a real queue. Fairness is the product's freshness SLA.
- Dedupe on the platform's ID. Probe stability first.
- Store every sighting. History is the product.
- Be a polite guest. Token buckets, circuit breakers, backoff between runs.
- Keep the raw. Parser and fetcher must be separable.
Everything else is tuning. Get the shape right; the numbers come later.
- Competitor Ads Lookup: free, no signup, deduped live ads for any brand.
- Brandmov: the full tracker, with weekly observations and launch/stop dates.