NexGenData

Posted on May 14 • Edited on May 18 • Originally published at thenextgennexus.com

Shopify Store Prospecting: Find + Qualify 10,000 Stores per Day

#apify #shopify #sales #prospecting

If you sell anything to ecommerce merchants — a Shopify app, a 3PL integration, a bookkeeping tool, an ads agency service — your ideal customer profile is "runs a Shopify store, is doing at least X in revenue, has already installed Y other apps." That last signal is the interesting one. App installs on Shopify are a near-perfect buyer-intent proxy: a merchant who installed Klaviyo and ReCharge is three to five times more likely to buy a retention tool than one running a bare theme.

The problem is that no official list of Shopify stores exists. Shopify does not publish a merchant directory. BuiltWith sells an approximate list starting at $295/month. Store Leads sells one starting at $75/month. Both are good, both are incomplete, and neither lets you re-qualify against your own criteria without constantly re-exporting.

What you can do instead: build your own prospecting stack. Scrape a niche (from Google results, directory sites, or an industry list), detect which of those URLs are actually Shopify stores, introspect the installed apps and theme on each, pull the product catalog to estimate revenue band, and rank the qualified shortlist by intent. At 10,000 stores per day and a marginal cost of cents per store, this replaces both BuiltWith and Store Leads for your specific use case.

This post walks through that stack end-to-end — three actors, one scoring function, about 80 lines of Python.

Grounding Numbers

Shopify reported 4.8 million merchants on the platform as of Q3 2025. Roughly 2.1 million have at least one product, custom domain, and live checkout — what we will call "active stores." The other 2.7 million are parked, test stores, or dormant.

Active Shopify stores break down roughly as: 68% using the free Dawn or similar native theme, 32% using a paid theme. 84% have installed at least 3 third-party apps; the median active store runs 7-11 apps. Stores doing more than $1M GMV (about 15% of active stores) average 18-25 apps installed.

Shopify itself takes a 15-30% cut of app revenue, so total app spend can be estimated in aggregate. Shopify's 2025 earnings deck reported partner ecosystem revenue of $1.8B; read as Shopify's cut at the 30% end of that range, it implies roughly $6B in merchant app spend, or about $3,000/year per active store on apps alone. That is your real addressable market.
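The arithmetic behind that estimate can be checked in a few lines. This is only a back-of-envelope sketch using the figures quoted above; treating the $1.8B as Shopify's cut and picking a take rate are assumptions about how the estimate was derived, not something the earnings deck states.

```python
# Back-of-envelope check of the app-spend estimate, using the figures quoted above.
PARTNER_REVENUE = 1.8e9   # reported partner ecosystem revenue, USD (assumed to be Shopify's cut)
ACTIVE_STORES = 2.1e6     # "active stores" as defined earlier in this post


def implied_app_spend(take_rate: float) -> float:
    """Total merchant app spend implied by a given take rate."""
    return PARTNER_REVENUE / take_rate


# Show both ends of the 15-30% range; the $6B figure corresponds to 30%.
for rate in (0.30, 0.15):
    spend = implied_app_spend(rate)
    per_store = spend / ACTIVE_STORES
    print(f"take rate {rate:.0%}: ${spend / 1e9:.0f}B total, ${per_store:,.0f} per store per year")
```

At the 15% end of the range the implied spend doubles, so the $3,000 figure is best read as a floor rather than a point estimate.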

For prospecting, the useful statistic is that app installs correlate with store GMV more strongly than Alexa rank, product count, or domain age. A 2024 Shopify Partner Academy analysis found r=0.71 between installed-app count and self-reported GMV band across ~20k surveyed merchants, which is why the pipeline below makes app count the primary qualification signal.

Why This Is Hard

Three reasons you cannot just curl a list of Shopify stores.

No merchant directory. Shopify's /admin is per-store, authenticated, and exposes nothing merchant-facing. Shopify has a public "success stories" page but it only covers a few hundred flagship merchants.
Shopify detection is not a one-liner. You can look for references to cdn.shopify.com, Shopify-specific HTTP response headers, the presence of /products.json, or a Shopify-Analytics script tag. Each signal has false positives and false negatives, and custom headless builds (Hydrogen, or Next.js Commerce on a Shopify backend) hide most of them. You need to check several signals, treat any one of them as evidence, and combine them into a confidence score.
Installed-app detection is actively obscured. Shopify used to expose /cart.js with a full script list including app scripts. In 2023 they started bundling and obfuscating these to improve pageload. Modern app detection requires checking: inline script src patterns (*.cdn.apps.shopify.com, cdn-shopify-*), meta tags (klaviyo-site-id, gorgias-widget-id), cookies (_recharge_session), and DNS subdomains (*.myshopify.com CNAMEs for apps like Judge.me). Each app has a distinct fingerprint.
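To make the multi-signal idea concrete, here is a minimal sketch of combining detection signals into one confidence value with a noisy-OR. The signal names, the weights, and the helper functions are all hypothetical; this is not how the shopify-store-detector actor is implemented, just one plausible way to score the signals listed above.

```python
import re

# Hypothetical weights: how strongly each signal suggests Shopify when it fires.
SIGNAL_WEIGHTS = {
    "cdn_reference": 0.6,     # cdn.shopify.com asset URLs in the HTML
    "shopify_header": 0.7,    # a response header whose name starts with "x-shopify"
    "products_json": 0.8,     # /products.json returned a products-shaped body
    "analytics_script": 0.5,  # ShopifyAnalytics referenced in an inline script
}


def detect_signals(html: str, headers: dict, products_json_ok: bool) -> set:
    """Return the names of the signals that fired for one fetched page."""
    found = set()
    if "cdn.shopify.com" in html:
        found.add("cdn_reference")
    if any(name.lower().startswith("x-shopify") for name in headers):
        found.add("shopify_header")
    if products_json_ok:
        found.add("products_json")
    if re.search(r"\bShopifyAnalytics\b", html):
        found.add("analytics_script")
    return found


def confidence(found: set) -> float:
    """Noisy-OR: 1 minus the product of (1 - weight) over the signals that fired."""
    miss = 1.0
    for name in found:
        miss *= 1.0 - SIGNAL_WEIGHTS[name]
    return round(1.0 - miss, 3)


# Example: a page with a CDN reference and a Shopify header, but no /products.json.
html = '<script src="https://cdn.shopify.com/s/files/1/theme.js"></script>'
found = detect_signals(html, {"X-Shopify-Stage": "production"}, products_json_ok=False)
print(sorted(found), confidence(found))
```

The noisy-OR means any single strong signal can carry a store over the threshold, while several weak signals reinforce each other, which suits headless builds where most signals are missing.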

Architecture

Three actors feed into one ranking step:

  [niche seed query]
  e.g. "sustainable pet products"
        |
        v
  [Google / directory scraper]
        |
        v
  [list of candidate URLs]
        |
        v
  +------------------------+
  | shopify-store-detector |   --> is_shopify, confidence
  +------------------------+
        |
        v  (keep confirmed Shopify stores)
        |
  +------------------------+
  | shopify-analyzer       |   --> installed apps, theme,
  |                        |       revenue estimate, traffic band
  +------------------------+
        |
        v
  +------------------------+
  | shopify-product-scraper|   --> product count, price range,
  |                        |       inventory depth
  +------------------------+
        |
        v
       [rank]
  (app_count × theme_tier × product_count × niche_fit)
        |
        v
     [shortlist]

The three actors share a common store URL field, so you can chain them in sequence and pass results through. At 10,000 candidate URLs fanned in, the full pipeline runs in roughly 90 minutes on Apify's standard compute and costs $20-40.
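One way to keep individual runs small and retryable at that volume is to submit candidates in batches. The sketch below is an assumption about how you might organize the fan-in, not a requirement of the actors; the batch size of 500 is arbitrary, and the cost helper simply restates the figures above per store.

```python
from typing import Iterator, List


def batches(urls: List[str], size: int = 500) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def cost_per_store(total_cost_usd: float, n_urls: int) -> float:
    """Average cost per candidate URL for a full pipeline run."""
    return total_cost_usd / n_urls


# Placeholder domains standing in for a real 10,000-URL candidate list.
candidate_urls = [f"https://store-{i}.example" for i in range(10_000)]
print(sum(1 for _ in batches(candidate_urls)), "batches")
print(f"${cost_per_store(20, 10_000):.4f} to ${cost_per_store(40, 10_000):.4f} per store")
```

Each batch can then be passed as the urls input of the detector run shown in the next section.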

Code: End-to-End Prospecting Run

The three actors: shopify-store-detector, shopify-analyzer, and shopify-product-scraper.

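import os

from apify_client import ApifyClient

# Read the API token from the environment rather than hard-coding it.
# Raises KeyError if APIFY_TOKEN is not set.
client = ApifyClient(os.environ["APIFY_TOKEN"])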

# Step 0: get candidate URLs. For a real run, feed in from Google SERP,
# an industry directory, or a curated seed list. We'll use 5 for the example.
candidates = [
    "https://www.beardbrand.com",
    "https://www.allbirds.com",
    "https://www.thefeed.com",
    "https://www.randomstore.example",
    "https://www.wildearth.com",
]

# Step 1: detect Shopify
detect_run = client.actor("nexgendata/shopify-store-detector").call(run_input={
    "urls": candidates,
    "confidence_threshold": 0.7,
})
detected = list(client.dataset(detect_run["defaultDatasetId"]).iterate_items())
shopify_urls = [d["url"] for d in detected if d["is_shopify"]]
print(f"Detected {len(shopify_urls)} Shopify stores out of {len(candidates)}")

# Step 2: analyze installed apps + theme
analyze_run = client.actor("nexgendata/shopify-analyzer").call(run_input={
    "urls": shopify_urls,
    "include_apps": True,
    "include_theme": True,
    "include_traffic_band": True,
})
analyzed = {a["url"]: a for a in client.dataset(analyze_run["defaultDatasetId"]).iterate_items()}

# Step 3: pull product catalog
products_run = client.actor("nexgendata/shopify-product-scraper").call(run_input={
    "urls": shopify_urls,
    "max_products_per_store": 500,
})

from collections import defaultdict
products = defaultdict(list)
for p in client.dataset(products_run["defaultDatasetId"]).iterate_items():
    products[p["store_url"]].append(p)

# Step 4: rank
def qualify(url):
    a = analyzed.get(url, {})
    prods = products.get(url, [])
    app_count = len(a.get("apps", []))
    theme_tier = 2 if a.get("theme", {}).get("paid") else 1
    n_prod = len(prods)
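    # Mean listed price; float() also covers prices returned as strings.
    avg_price = sum(float(p["price"]) for p in prods) / n_prod if n_prod else 0.0
    price_band = "high" if avg_price > 50 else "low"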
    score = app_count * theme_tier * (min(n_prod, 200) / 200)
    return {
        "url": url,
        "apps": app_count,
        "theme": a.get("theme", {}).get("name"),
        "paid_theme": theme_tier == 2,
        "products": n_prod,
        "price_band": price_band,
        "score": round(score, 1),
    }

ranked = sorted([qualify(u) for u in shopify_urls], key=lambda x: -x["score"])
for r in ranked:
    print(r)

Sample output for the candidate list above (price_band omitted for brevity):

{'url': 'https://www.allbirds.com', 'apps': 24, 'theme': 'custom', 'paid_theme': True, 'products': 180, 'score': 43.2}
{'url': 'https://www.beardbrand.com', 'apps': 18, 'theme': 'Impact', 'paid_theme': True, 'products': 95, 'score': 17.1}
{'url': 'https://www.thefeed.com', 'apps': 11, 'theme': 'Dawn', 'paid_theme': False, 'products': 230, 'score': 11.0}
{'url': 'https://www.wildearth.com', 'apps': 14, 'theme': 'Prestige', 'paid_theme': True, 'products': 62, 'score': 8.7}

The ranking surfaces Allbirds (large, mature, 24 apps) and Beardbrand (mid-size, 18 apps, paid theme) as the best-qualified prospects for a retention tool. The Feed has more products but fewer apps and a free theme — more like a catalog site than a sophisticated DTC operation.
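The architecture diagram lists niche_fit as a ranking factor, but the scoring function above does not compute one. A minimal sketch of how it could be added follows. Both the keyword-overlap heuristic and the assumption that each product item carries a title string are hypothetical and not part of the actors' documented output.

```python
def niche_fit(product_titles, niche_keywords):
    """Fraction of product titles mentioning at least one niche keyword (0.0 to 1.0).

    Plain substring matching, so short keywords can over-match
    (for example, "cat" also matches "catalog").
    """
    if not product_titles:
        return 0.0
    keywords = [k.lower() for k in niche_keywords]
    hits = sum(1 for title in product_titles if any(k in title.lower() for k in keywords))
    return hits / len(product_titles)


def score_with_fit(app_count, theme_tier, n_prod, fit):
    """Same score as qualify() above, scaled by the niche-fit fraction."""
    return round(app_count * theme_tier * (min(n_prod, 200) / 200) * fit, 1)


titles = ["Organic Dog Treats", "Leather Belt"]
fit = niche_fit(titles, ["dog", "cat"])
print(fit, score_with_fit(app_count=18, theme_tier=2, n_prod=95, fit=1.0))
```

Because the fit multiplies the score, a store with no on-niche products drops to zero regardless of how many apps it runs; if that is too aggressive, a floor such as max(fit, 0.1) is an easy adjustment.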

Worked Example: Sustainable Pet Products Niche

Say you run a Shopify app for subscription management. Your ideal customer is a DTC brand with 10+ apps installed, a paid theme, 50+ SKUs, and average order value above $30. You want a shortlist of 50 sustainable pet-product brands to pitch.

Step-by-step:

Seed with 200 candidate URLs. Pull them from a combination of: Google search for "sustainable pet" OR "eco dog" OR "organic cat food" site:*.com, the top 500 merchants in Shopify's pet category from a prior Store Leads export, and a scrape of r/dogs and r/cats merchant recommendations. Dedupe to ~180 unique domains.
Run shopify-store-detector against all 180. Typically 55-65% return positive with confidence >0.7. Say 112 confirmed Shopify stores.
Run shopify-analyzer against the 112. This returns, per store: the list of detected installed apps (usually 4-25 apps with known fingerprints), the theme name and whether it is paid, and an estimated monthly traffic band (low/medium/high) based on SimilarWeb-style signals.
Filter: keep stores with 10+ apps AND a paid theme. Typically 35-45% pass. Say 44 stores.
Run shopify-product-scraper against the 44 stores, capping at 500 products each. Calculate per-store: product count, median price, variant count (proxy for catalog complexity).
Final filter: keep stores with 50+ products AND median price >$30. Say 28 stores.
Add a qualitative signal: does the current app stack include a competitor to your product? If yes, they are aware of the category — good for upsell, tricky for displacement. If no, and they have a retention gap (no post-purchase app, no loyalty tool), they are your sweet spot.
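The filters in the steps above can be sketched as three small functions. The field names (apps, paid_theme) mirror the ranking code earlier in the post; the COMPETITORS and RETENTION_TOOLS sets are illustrative placeholders that you would replace with the apps relevant to your own product.

```python
from statistics import median

# Illustrative placeholders for a subscription-management vendor.
COMPETITORS = {"ReCharge"}
RETENTION_TOOLS = {"ReCharge", "Rebuy"}


def passes_stack_filter(store: dict) -> bool:
    """Keep stores with 10+ detected apps AND a paid theme."""
    return len(store["apps"]) >= 10 and bool(store["paid_theme"])


def passes_catalog_filter(prices: list) -> bool:
    """Keep stores with 50+ products AND a median price above $30."""
    return len(prices) >= 50 and median(prices) > 30


def segment(store: dict) -> str:
    """Qualitative signal: competitor already installed, retention gap, or neither."""
    apps = set(store["apps"])
    if apps & COMPETITORS:
        return "aware"        # knows the category: displacement pitch
    if not apps & RETENTION_TOOLS:
        return "sweet_spot"   # no retention tooling detected at all
    return "neutral"


store = {"apps": [f"app-{i}" for i in range(12)], "paid_theme": True}
print(passes_stack_filter(store), passes_catalog_filter([35.0] * 60), segment(store))
```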

Total run time: about 12 minutes of actor runtime, $3-5 in Apify credits. Output: a ranked CSV with 28 qualified pet-brand Shopify stores, each with contact page URL, installed apps, theme, product count, AOV band, and a score. That is an afternoon of BDR work compressed into a coffee break, with better data than most BDRs would have found manually.
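Writing the ranked CSV needs only the standard library. The sketch below assumes rows shaped like the dicts returned by qualify() earlier in the post; the column list is an assumption and can be extended with contact URL or AOV band if your rows carry them.

```python
import csv
import io

FIELDS = ["url", "apps", "theme", "paid_theme", "products", "price_band", "score"]


def write_shortlist(rows, fileobj):
    """Write rows to fileobj as CSV, highest score first."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for row in sorted(rows, key=lambda r: r["score"], reverse=True):
        writer.writerow(row)  # missing keys are written as empty cells


# Demo with an in-memory buffer; for a real run, pass
# open("shortlist.csv", "w", newline="") instead.
buf = io.StringIO()
write_shortlist(
    [{"url": "https://a.example", "score": 1.0}, {"url": "https://b.example", "score": 2.0}],
    buf,
)
print(buf.getvalue())
```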

Gotchas

Things that regularly break Shopify prospecting pipelines:

Hydrogen and headless Shopify hide almost everything. Allbirds ran on Hydrogen for a while and the detector's HTML-signal score dropped. The analyzer has a fallback that checks the Storefront API on a guessed subdomain, but headless stores will underreport installed apps by 30-50% because many apps run only on the Liquid storefront.
Plus stores sometimes strip headers. Shopify Plus merchants on enterprise plans occasionally remove the x-shopify-stage and related headers for security. The detector usually still picks them up via /products.json response shape, but expect 2-5% false negatives on the high end of the market.
App detection has false positives. Some theme developers bundle snippets that look like Klaviyo or Privy without actually calling those services. Confidence scoring handles this in the actor, but do not treat "appears to have Klaviyo installed" as gospel — sanity check a sample manually.
Product catalog scraping can be slow. Stores with 10,000+ products paginate heavily. Cap max_products_per_store at 500-1000 for prospecting; the tail of the catalog rarely changes your qualification decision.
Password-protected and pre-launch stores. Many new stores run behind a password page for weeks. The detector flags these as is_shopify=true, store_status="password_protected". Decide upstream whether you want to pitch pre-launch stores or skip them.
Rate limits. The three actors collectively burn through requests fast. For 10,000 URLs/day sustained, you will want to run during off-peak hours and set max_concurrency lower. At peak hours, Shopify's CDN occasionally serves CAPTCHAs to the product scraper.
Theme name is sometimes "custom" or empty. Shopify Plus merchants often customize past the point where the theme fingerprint matches anything. Do not use theme name as a hard filter; use the paid-vs-free signal instead.
CDN-masked subdomains. Many stores sit behind Cloudflare, which obscures the origin. The detector still works (it reads the HTML), but you will lose some hosting-band intelligence.
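For the app-detection false positives mentioned above, one defensive approach is to count an app as installed only when two independent fingerprints agree. The sketch below is a hypothetical illustration of that rule, not the analyzer's actual logic. The meta-tag and cookie names come from the fingerprints listed earlier in this post, while the script host substrings are assumptions to verify against live stores.

```python
# Hypothetical fingerprint table: each app has up to three signal types.
FINGERPRINTS = {
    "Klaviyo": {"meta": "klaviyo-site-id", "script": "klaviyo.com"},
    "Gorgias": {"meta": "gorgias-widget-id", "script": "gorgias.chat"},
    "ReCharge": {"cookie": "_recharge_session", "script": "rechargecdn.com"},
}


def classify_apps(html: str, cookie_names: set) -> dict:
    """Map app name to "confirmed" (2+ signal types) or "unconfirmed" (exactly 1)."""
    page = html.lower()
    result = {}
    for app, fp in FINGERPRINTS.items():
        hits = 0
        if "meta" in fp and fp["meta"] in page:
            hits += 1
        if "script" in fp and fp["script"] in page:
            hits += 1
        if "cookie" in fp and fp["cookie"] in cookie_names:
            hits += 1
        if hits >= 2:
            result[app] = "confirmed"
        elif hits == 1:
            result[app] = "unconfirmed"
    return result


# A leftover theme snippet (meta tag only) versus a real install (meta tag plus script).
leftover = '<meta name="gorgias-widget-id" content="123">'
real = ('<meta name="klaviyo-site-id" content="ABC">'
        '<script src="https://static.klaviyo.com/onsite/js/klaviyo.js"></script>')
print(classify_apps(leftover, set()), classify_apps(real, set()))
```

Only the confirmed entries would then feed the app count used for scoring, with the unconfirmed ones kept for manual sampling.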

FAQ

How does this compare to BuiltWith or Store Leads?
BuiltWith and Store Leads maintain crawled databases of all detected Shopify stores globally. Their coverage is broader than yours will ever be if you are scraping from seed URLs. But you cannot re-qualify their lists against custom criteria without exporting and re-processing. This pipeline gives you on-demand freshness and custom qualification; for deep historical coverage, Store Leads is complementary, not a competitor.

Is scraping Shopify stores legal?
Product catalogs served through /products.json are intended for public consumption; Shopify exposes this endpoint by design. HTML front pages are public. The question of legality typically turns on terms of service and jurisdiction: in hiQ v. LinkedIn the Ninth Circuit held that scraping publicly accessible data likely does not violate the CFAA, but breach-of-contract claims under a site's terms can still succeed. Consult counsel for commercial use. Do not scrape checkout pages or anything behind a login.

How accurate is the revenue estimate?
The analyzer's revenue band is a proxy based on app count, theme tier, product count, and traffic estimate. Accuracy is ±1 band (so a "medium" estimate could actually be low or high). For sales qualification that is fine; for investment decisions it is not.

Can I detect what plan the merchant is on?
Shopify Plus leaves detectable fingerprints (custom checkout URLs, checkout.shopify.com usage, sometimes specific meta tags). Basic vs Shopify vs Advanced is not reliably detectable from the outside. The analyzer returns plan_estimate: {plus: true/false/unknown}.

How fresh is the app-install data?
The actor pulls live at runtime. An app installed or uninstalled 10 minutes ago will be reflected in the result. This is the main advantage over Store Leads / BuiltWith, which have varying refresh cadences.

What if my target market is non-Shopify ecommerce?
The detector can be extended to BigCommerce, WooCommerce, Magento, Salesforce Commerce Cloud via fingerprint packs — the current actor supports Shopify primarily with experimental Woo and BigCommerce detection. For serious non-Shopify prospecting, you want BuiltWith's tech lookup as your seed source.

How do I find contact info for the shortlist?
The analyzer returns /contact, /pages/contact, and footer email where present. For outbound, combine with a domain-to-email tool like Hunter or your own website-email-extractor run. Do not spam; warm outreach with a personalized hook from the installed-app data converts dramatically better.

Can I schedule this weekly for fresh leads?
Yes. The three actors support scheduled runs on Apify. A reasonable cadence: weekly niche-seed re-scrape, daily re-detection of your existing tracked list to catch new stores, monthly full product catalog refresh.

Conclusion

Shopify prospecting at scale is a stack, not a tool. A detector to confirm the platform, an analyzer to read the installed-app signal, a product scraper to size the catalog, and a scoring function to rank. That stack lives entirely on public data — no Shopify API, no merchant directory license — and it costs cents per store.

The strategic point: installed apps are the most under-exploited B2B intent signal in ecommerce. A store running Klaviyo, ReCharge, Gorgias, and Rebuy is announcing its software budget and its sophistication. Every pitch to that store should reference which apps it runs. That is what owning your own prospecting pipeline enables.

Run your first niche through the shopify-store-detector, shopify-analyzer, and shopify-product-scraper on Apify. Pay per run, skip the subscription, own the output.
