Right-click any Shopify store. View Source. You'll see <script> tags from every app they've installed, a Shopify.theme object with their exact theme, and tracking pixels from every ad platform they use. None of this is hidden.
I wanted to scrape all of it across 250K stores. That scraper became the detection engine behind StoreInspect, where I map the Shopify ecosystem by scanning what each store runs.
This post is the technical walkthrough. What worked, what didn't, and the parts that took way longer than expected.
The stack
- Rebrowser-Puppeteer (not regular Puppeteer)
- PostgreSQL with JSONB snapshots
- Webshare proxies + Tailscale SOCKS5 tunnels
- Detection logic bundled as a string so it runs in both Puppeteer and a Chrome Extension
Why not regular Puppeteer
Standard Puppeteer gets flagged immediately. Shopify itself doesn't block scrapers aggressively, but Cloudflare and the bot detection running on individual stores will. Rebrowser is a drop-in replacement that patches the obvious automation fingerprints, most notably the CDP-level leaks (like the Runtime.enable leak) that modern detectors key on.
I also set a real viewport (1920x1080), proper Chrome 131 user agent strings, and route through residential proxies. Nothing exotic. The bar for scraping Shopify stores is low because the data is public. You just need to not look like a bot.
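As a sketch, the launch settings look roughly like this. rebrowser-puppeteer exposes the same API as Puppeteer, so only the import changes; the proxy host below is a placeholder, and the exact user agent string is illustrative:

```javascript
// Sketch of the launch configuration. rebrowser-puppeteer is a drop-in
// replacement, so this is a plain Puppeteer options object.
// PROXY_HOST:PORT is a placeholder for a residential proxy endpoint.
const launchOptions = {
  headless: true,
  defaultViewport: { width: 1920, height: 1080 }, // real desktop viewport
  args: ['--proxy-server=http://PROXY_HOST:PORT'],
};

// A plausible Chrome 131 desktop user agent string (illustrative).
const userAgent =
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
  '(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36';
```

With rebrowser-puppeteer installed, these would be passed to `puppeteer.launch(launchOptions)` followed by `page.setUserAgent(userAgent)` before navigation.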
Page loading strategy
First mistake I made: using networkidle2 as the wait condition. Shopify stores have analytics scripts, chat widgets, and ad pixels that fire continuously. networkidle2 resolves only after 500ms with no more than two in-flight network connections, and on a busy storefront that window sometimes never arrives.
Switched to domcontentloaded plus a flat 5-second delay:
await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
await new Promise(r => setTimeout(r, 5000));
The 5 seconds lets lazy-loaded scripts, GTM tags, and deferred pixels fire. Not elegant, but it catches 95%+ of what I need.
I also block images, fonts, and media via request interception. I only need the HTML and scripts. This cut page load times by about 60%.
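The interception predicate itself is just a type check. Here's a minimal sketch of the filter, with the Puppeteer wiring shown in comments (the handler shape follows Puppeteer's request-interception API):

```javascript
// Resource types to block: only HTML and scripts matter for detection.
const BLOCKED_TYPES = new Set(['image', 'font', 'media']);

function shouldBlockResource(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Wiring inside the scraper, after `await page.setRequestInterception(true)`:
// page.on('request', (req) =>
//   shouldBlockResource(req.resourceType()) ? req.abort() : req.continue()
// );
```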
The four detection layers
No single method catches everything, so I use four:
1. Script URL matching
Most Shopify apps inject a script tag with a recognizable domain. Klaviyo loads from static.klaviyo.com. Judge.me loads from judge.me. Yotpo loads from staticw2.yotpo.com. Match the domain, you've identified the app.
This is the most reliable method. About 70% of detections come from script URLs.
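A minimal version of the URL matcher: a domain-to-app map (the three signatures below come straight from the examples above; the real table covers 180+ apps) and a scan over script src values:

```javascript
// Hypothetical subset of the signature table.
const SCRIPT_SIGNATURES = {
  'static.klaviyo.com': 'Klaviyo',
  'judge.me': 'Judge.me',
  'staticw2.yotpo.com': 'Yotpo',
};

function detectAppsFromScripts(scriptUrls) {
  const found = new Set();
  for (const src of scriptUrls) {
    let host;
    try {
      host = new URL(src).hostname;
    } catch {
      continue; // skip relative or malformed src values
    }
    for (const [domain, app] of Object.entries(SCRIPT_SIGNATURES)) {
      // Match the signature domain itself or any subdomain of it.
      if (host === domain || host.endsWith('.' + domain)) found.add(app);
    }
  }
  return [...found];
}
```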
2. JavaScript globals
Apps set window variables when they initialize. Klaviyo sets window.klaviyo and window._learnq. Gorgias sets window.GorgiasChat. TikTok's pixel sets window.ttq.
I check these with page.evaluate() after the page loads. Useful as a second signal when script URLs are obfuscated or loaded through a tag manager.
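Sketched as a pure function over a window-like object, so the same logic can run inside page.evaluate(). The globals are the ones named above; in the real scraper both the table and the function live inside the bundled detection string:

```javascript
// Hypothetical subset of the globals table; keys are window properties.
const GLOBAL_SIGNATURES = {
  klaviyo: 'Klaviyo',
  _learnq: 'Klaviyo',
  GorgiasChat: 'Gorgias',
  ttq: 'TikTok Pixel',
};

function detectAppsFromGlobals(win) {
  const found = new Set();
  for (const [key, app] of Object.entries(GLOBAL_SIGNATURES)) {
    if (win[key] !== undefined) found.add(app);
  }
  return [...found];
}

// In the browser this would be called as detectAppsFromGlobals(window),
// e.g. via page.evaluate('(' + detectAppsFromGlobals + ')(window)').
```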
3. DOM elements
Some apps only inject UI. A chat bubble, a reviews widget, a popup form. CSS selectors like .jdgm-widget (Judge.me) or [data-yotpo-instance-id] (Yotpo) catch these.
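Same idea with selectors. The argument can be the real document in the browser or any object exposing a querySelector method (the two selectors are the ones above; the full table is much larger):

```javascript
// Hypothetical subset of the selector table.
const DOM_SIGNATURES = {
  '.jdgm-widget': 'Judge.me',
  '[data-yotpo-instance-id]': 'Yotpo',
};

function detectAppsFromDom(doc) {
  const found = [];
  for (const [selector, app] of Object.entries(DOM_SIGNATURES)) {
    if (doc.querySelector(selector)) found.push(app);
  }
  return found;
}
```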
4. Theme App Extension blocks
This one took me a while to find. Shopify's Online Store 2.0 lets apps inject server-rendered blocks, and Shopify wraps them in HTML comments:
<!-- BEGIN app block: shopify://apps/judge-me-reviews/blocks/preview_badge/... -->
These are gold. They map directly to the app's Shopify App Store slug. A store can have zero client-side scripts from an app but still have its Theme App Extension block in the HTML. I maintain a map from Shopify slugs to app IDs to catch these.
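Extracting the slug is a single regex pass over the raw HTML. A sketch, keyed off the comment format shown above (the regex is my reconstruction of that marker, not an official Shopify spec):

```javascript
// Matches Theme App Extension block markers and captures the app slug,
// e.g. "judge-me-reviews" from shopify://apps/judge-me-reviews/blocks/...
const APP_BLOCK_RE = /<!--\s*BEGIN app block:\s*shopify:\/\/apps\/([^/]+)\//g;

function detectAppSlugs(html) {
  const slugs = new Set();
  for (const match of html.matchAll(APP_BLOCK_RE)) {
    slugs.add(match[1]);
  }
  return [...slugs];
}
```

The returned slugs are then resolved against the slug-to-app map mentioned above.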
The cookie consent problem
This one cost me a week. I was getting pixel detection rates way below what I expected. Meta Pixel was showing up on maybe 40% of stores when industry benchmarks say 50%+.
The problem: cookie consent managers. OneTrust, Cookiebot, and similar tools block ad pixels from loading until the user clicks "Accept." My scraper never clicks accept, so the pixels never fire.
Fix:
await page.evaluate(() => {
document.getElementById('onetrust-accept-btn-handler')?.click();
document.querySelector('button[id*="accept"]')?.click();
});
await new Promise(r => setTimeout(r, 8000));
Click accept, wait 8 seconds for GTM to process and load the blocked tags. Pixel detection accuracy went from roughly 60% to 95%.
The 8-second wait feels long but it's necessary. GTM doesn't fire tags synchronously after consent.
Detection bundle architecture
I wanted the same detection code in both the Puppeteer scraper (server-side) and a Chrome Extension (client-side). The signatures for 180+ apps, 40+ pixels, and theme detection logic need to be identical.
The solution: bundle everything into a single self-executing function string.
const detectionScript = `(function() {
const APP_SIGNATURES = { /* 180 apps */ };
const PIXEL_SIGNATURES = { /* 40 pixels */ };
// ... detection logic
return { isShopify, theme, apps, pixels };
})()`;
// Puppeteer
const result = await page.evaluate(detectionScript);
// Chrome Extension (content script)
const result = eval(detectionScript);
One source of truth. When I add a new app signature, both the scraper and extension pick it up.
Storing results in JSONB
Each scrape produces a snapshot stored as JSONB in PostgreSQL:
store_snapshots: {
store_id: int,
apps: jsonb, // [{name, category, detected_via}]
pixels: jsonb, // [{name, type, pixel_id}]
theme: jsonb, // {name, type, author}
metrics: jsonb, // {product_count, traffic_tier, ...}
snapshot_date: timestamp
}
Why JSONB instead of normalized tables? The schema changes constantly. Every time I add a new detection field or metric, I don't want to run a migration. JSONB lets me evolve the structure without downtime.
I also keep denormalized counts on the main stores table (app_count, pixel_count, theme_name) for fast filtering and sorting. The snapshots are for historical comparison.
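For filtering, PostgreSQL's JSONB containment operator (@>) lets you ask which snapshots include a given app without unpacking the array. A sketch of the query construction, with table and column names following the schema above (the helper itself is hypothetical):

```javascript
// Builds a parameterized query using JSONB containment:
// apps @> '[{"name": "..."}]' matches snapshots whose apps array
// contains at least one object with that name.
function buildAppFilterQuery(appName) {
  return {
    text:
      'SELECT store_id FROM store_snapshots ' +
      'WHERE apps @> $1::jsonb ' +
      'ORDER BY snapshot_date DESC',
    values: [JSON.stringify([{ name: appName }])],
  };
}
```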
Error handling at scale
At 250K stores, every edge case happens. Password-protected stores redirect to /password. Stores with lapsed billing return 402. Dead stores return 404. Some stores redirect endlessly between www and non-www.
I classify errors into retryable and non-retryable:
// Don't retry these
'dns_not_found' // Domain doesn't resolve
'ssl_error' // Cert problems
'store_closed' // 402 Payment Required
// Retry with backoff
'timeout' // Might be temporary
'blocked' // Try different proxy
'network_error' // Transient
DNS failures and SSL errors get marked and skipped permanently. Timeouts and blocks get retried with proxy rotation. This keeps the scraper from wasting cycles on stores that will never respond.
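One way to sketch the classifier: two lookup tables built from the codes above, with anything unrecognized flagged for manual triage (that default is my assumption, not necessarily how the real scraper handles it):

```javascript
// Error codes from the lists above.
const NON_RETRYABLE = new Set(['dns_not_found', 'ssl_error', 'store_closed']);
const RETRYABLE = new Set(['timeout', 'blocked', 'network_error']);

// Returns 'skip' (mark permanently), 'retry' (backoff + proxy rotation),
// or 'unknown' for codes not in either table.
function classifyError(code) {
  if (NON_RETRYABLE.has(code)) return 'skip';
  if (RETRYABLE.has(code)) return 'retry';
  return 'unknown';
}
```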
Regional domain deduplication
Brands like gymshark.com and gymshark.co.uk are the same store. Without deduplication, they'd appear as separate entries with identical tech stacks.
I check regional TLDs (.co.uk, .com.au, .de, .fr, etc.) against the .com version. If both exist, I skip the regional variant. Simple, but it prevented thousands of duplicates.
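A sketch of the check: strip a known regional TLD, rebuild the .com form, and skip the variant if that form is already known. The TLD list is the one above; the in-memory Set stands in for whatever existence check the real pipeline uses:

```javascript
const REGIONAL_TLDS = ['.co.uk', '.com.au', '.de', '.fr'];

// Returns the .com equivalent if `domain` uses a regional TLD, else null.
function comEquivalent(domain) {
  for (const tld of REGIONAL_TLDS) {
    if (domain.endsWith(tld)) {
      return domain.slice(0, -tld.length) + '.com';
    }
  }
  return null;
}

// Skip a regional variant when its .com version is already in the dataset.
function isDuplicateRegional(domain, knownDomains) {
  const com = comEquivalent(domain);
  return com !== null && knownDomains.has(com);
}
```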
What I learned about Shopify's ecosystem
After scanning 250K stores: the median store runs 1 app (usually Shop Pay) and 4 pixels (usually Shopify's own pixel plus GA4). 59% have no email marketing tool. 78% have no reviews app. The ecosystem is almost empty.
The detection engine is available as a free Chrome extension if you want to try it on individual stores.
The full dataset updates daily at storeinspect.com/report/state-of-shopify.
If you have questions about the scraping setup or detection patterns, drop them in the comments.