TL;DR — Existing Shopify app detectors (Koala Inspector, ShopScan, Fera, BuiltWith) are Chrome extensions or SaaS dashboards. None do batch. I had 1,200 stores to qualify and View Source + Cmd-F was killing my afternoons, so I shipped an Apify actor that takes a list of Shopify URLs and returns the full app stack (Klaviyo, Yotpo, Judge.me, Loox, ReCharge…) + product catalog + reviews in JSON. No headless browser, ~$0.005 per store, 1,000 stores in 25 minutes. Live here → Shopify Scraper – Apps Spy + Reviews.
The afternoon that broke me
Six weeks ago I was prospecting for a B2B side-project. The hypothesis: "Shopify stores running Klaviyo + a paid reviews app are the right ICP — they spend money on retention tooling, so they will pay for ours."
To validate, I needed a list of Shopify stores and their installed apps.
The Shopify App Store does not give you that. The "stores using X" databases do, but the public ones are stale and the good ones are paid SaaS at $99–499/month for filters I did not need.
So I did what every founder does at 11 PM: I opened View Source on a competitor list, hit Cmd-F, and started typing klaviyo.
It worked. Sort of. I did 40 stores in two hours, then stopped, because I had a list of 1,200.
That night I wrote the first version of what is now Shopify Scraper – Apps Spy + Reviews.
What I actually wanted
Every "Shopify scraper" I found online did one of two things:
- Scraped a single store's products via `/products.json` — table stakes; dozens of free Apify actors do it.
- Spawned a headless browser to fingerprint a marketing site — slow, expensive, and brittle.
I wanted three things in one pass:
- Full product catalog (titles, prices, variants, images, vendor, tags) — nothing exotic.
- App detection: which third-party Shopify apps are installed (email, reviews, subscriptions, popups, search).
- Reviews when a reviews app is detected — pull them via the public API, not by parsing widgets.
And I wanted it to be cheap, because I had ~1,200 stores in my first batch and I planned to run it monthly.
The "no headless browser" decision
The thing nobody tells you about Shopify scraping is that you almost never need a headless browser. The signals you want for app detection live in three places, and all three are reachable with a plain HTTPS GET:
- **The HTML of the homepage.** Shopify apps inject `<script>` tags from their own CDNs: `cdn.judge.me`, `cdn.yotpo.com`, `loox.io/widget`, `klaviyo.com/onsite` — you grep the HTML and you know.
- **`/products.json`.** Shopify exposes the full catalog at this path on every store, paginated 250 items at a time. No auth, no headless. (You hit a soft rate limit around 2 req/s per IP, which is fine if you queue politely.)
- **App-specific public endpoints.** Judge.me has a JSON reviews endpoint. Yotpo too. Same for Loox, Stamped, Reviews.io. Once you know which app is installed, you go straight to its API — no DOM parsing.
The whole actor is built around that observation. No Puppeteer, no Playwright, no proxy farm. Just got-scraping, cheerio, and p-queue to keep concurrency civilized.
The result is that scanning a single store costs ~3–6 HTTPS requests and runs in 2 to 8 seconds depending on catalog size. Cost on Apify infra: about $0.005 per store for the "tech stack only" mode.
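Under those constraints, the catalog fetch is just a loop over `/products.json` pages until the store runs out of products. Here is a minimal sketch assuming Node 18+ (global `fetch`); the function name and the injectable `fetchJson` hook are mine for testability, not the actor's exact code:

```javascript
// Paginate a store's public catalog via /products.json.
// fetchJson is injectable so the pagination logic can be tested offline.
async function fetchCatalog(storeUrl, { maxProducts = 1000, fetchJson } = {}) {
  const getPage = fetchJson ?? (async (url) => {
    const res = await fetch(url, { headers: { accept: "application/json" } });
    if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
    return res.json();
  });

  const products = [];
  // Shopify serves up to 250 items per page at this path.
  for (let page = 1; products.length < maxProducts; page++) {
    const url = `${storeUrl}/products.json?limit=250&page=${page}`;
    const body = await getPage(url);
    if (!body.products || body.products.length === 0) break; // ran out of pages
    products.push(...body.products);
  }
  return products.slice(0, maxProducts);
}
```

Each page is one plain HTTPS GET, which is where the 3–6 requests per store figure comes from.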
The architecture (it is small on purpose)
I'll be honest — I almost over-engineered this. My first draft had Redis for de-dup, a queue, retry logic with exponential backoff, and a state machine. Then I deleted all of it.
Here is what shipped:
```
src/
├── main.js            # orchestration (p-queue, per-store flow)
├── crawlers/
│   ├── products.js    # /products.json + sitemap fallback
│   ├── apps.js        # detect apps from homepage HTML
│   └── reviews.js     # per-app reviews fetchers
└── lib/
    ├── normalize.js   # canonicalize URLs, normalize product schema
    ├── schemas.js     # zod validation for outputs
    └── billing.js     # Apify pay-per-event charges
```
A run goes:
- Canonicalize the store URL (handles `www`, custom domains, `*.myshopify.com`).
- Fetch the homepage once. Confirm it is Shopify (the `x-shopify-stage` header is a giveaway).
- From the same HTML, run the app detectors. Each detector is ~10 lines of regex matching against script tags + meta tags + inline JSON.
- Fetch `/products.json?page=N` until you hit the cap or run out of products.
- If the user asked for reviews and an installed reviews app was detected, fan out to that app's public reviews API.
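The canonicalization step in that flow can be sketched as a small pure function. This is a minimal version under my own assumptions — the real `normalize.js` likely handles more edge cases, and the function name is mine:

```javascript
// Canonicalize a store URL to a clean https origin.
// Accepts bare domains, http://, www. prefixes, and *.myshopify.com hosts.
function canonicalizeStoreUrl(input) {
  const withScheme = /^https?:\/\//i.test(input) ? input : `https://${input}`;
  const url = new URL(withScheme);
  let host = url.hostname.toLowerCase();
  if (host.startsWith("www.")) host = host.slice(4);
  return `https://${host}`; // drop path, query string, and trailing slash
}
```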
That is it. The whole thing is ~900 lines of JavaScript. I run it with `node --test` for unit tests against snapshots and a `tests/smoke-products.mjs` that hits 5 real stores end-to-end.
What I learned about app detection
The regex-against-HTML approach has one trap. Shopify themes minify, version, and CDN-rewrite their assets, so you cannot match on a single string. The Klaviyo loader, for example, ships under at least four URL patterns I have seen:
- `static.klaviyo.com/onsite/js/klaviyo.js`
- `static-tracking.klaviyo.com/onsite/js/...`
- `a.klaviyo.com/media/...`
- inline `_learnq` queue calls
You match any of those, and you call it Klaviyo. Same logic for every other app — every detector is an array of patterns, OR'd together, returning a single boolean. I wrote a snapshot test per app with a real store HTML page so a Klaviyo URL change does not silently break detection.
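Concretely, a detector boils down to a pattern array OR'd into one boolean. Here is a sketch of the Klaviyo one, built from the patterns listed above — the regexes and names are illustrative, not the actor's exact source:

```javascript
// One detector = one array of patterns, any match => installed.
const KLAVIYO_PATTERNS = [
  /static\.klaviyo\.com\/onsite\/js/i,
  /static-tracking\.klaviyo\.com\/onsite\/js/i,
  /a\.klaviyo\.com\/media/i,
  /_learnq\s*(?:=|\.push)/, // inline queue calls in theme JS
];

function detectKlaviyo(html) {
  return KLAVIYO_PATTERNS.some((re) => re.test(html));
}
```

Because every detector has the same shape, adding a new app really is just appending another pattern array.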
The detectors I shipped on day one:
- Email/SMS: Klaviyo, Omnisend, Postscript, Mailchimp, Attentive
- Reviews: Yotpo, Judge.me, Loox, Stamped, Reviews.io, Okendo
- Subscriptions: ReCharge, Bold, Loop, Skio
- Popups & SMS capture: Privy, Justuno, Klaviyo Forms
- Search & discovery: Searchanise, Boost, Algolia
- Loyalty: Smile.io, Yotpo Loyalty, LoyaltyLion
If you tell me an app I missed, I add a detector. Each one is a 15-minute job.
The pay-per-event pricing problem
Apify lets you charge per event instead of per compute minute. For a scraper that runs in seconds, this is the right model — your customer pays for the rows they get, not for compute time.
The mistake I made on my first push was leaving Apify's default `dataset_item` event on. Combined with my custom `product_extracted` event, every product was being charged twice. I caught it in monetization review and removed the synthetic event.
The pricing I landed on:
- `store_analyzed` — $0.003 per store (covers detection + products fetch)
- `product_extracted` — $0.0005 per product
- `apps_detected` — $0.001 per store at `standard`+
- `review_extracted` — $0.0003 per review
A 500-product store with reviews costs roughly $0.30 end to end. For comparison, the SaaS competitors charge $99 or more for similar lookups, batched and capped.
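The rough $0.30 figure is easy to sanity-check with a toy cost model over the event prices above. The helper name is mine, and the ~150-review count is an assumption I picked to make the arithmetic concrete:

```javascript
// Per-event prices as quoted in this post.
const PRICES = {
  store_analyzed: 0.003,
  apps_detected: 0.001,
  product_extracted: 0.0005,
  review_extracted: 0.0003,
};

// Estimate the cost of one run in USD, rounded to 4 decimals.
function estimateRunCost({ stores, products, reviews = 0 }) {
  const cost =
    stores * (PRICES.store_analyzed + PRICES.apps_detected) +
    products * PRICES.product_extracted +
    reviews * PRICES.review_extracted;
  return Number(cost.toFixed(4));
}
```

One store with 500 products and ~150 reviews comes out at $0.299, which lines up with the rough $0.30 quoted above.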
What surprised me
Three things, in order of how badly I underestimated them.
1. `/products.json` is more honest than the storefront. It exposes products that are unpublished from the theme but still live (out-of-stock holdovers, B2B-only SKUs). Useful for trend research. Sometimes shocking.
2. Reviews-app detection is a lead signal. A store on Judge.me Free plan vs. Yotpo Premium tells you a lot about their stage. I ended up using this internally to prioritize cold outbound — different pitch for a $30/month stack vs. a $1,200/month stack.
3. People want this as an MCP server. Two of my first three users asked if they could query it from Claude / ChatGPT. I have it on the roadmap. (My GPT Crawler MCP and Vinted MCP Server are the two MCP actors I shipped first; the Shopify one is next.)
How to use it in one minute
```jsonc
// On Apify, paste this in the actor input box
{
  "store_urls": ["https://allbirds.com", "https://gymshark.com"],
  "extract_level": "standard", // products + apps stack
  "max_products_per_store": 250
}
```
Output (one record per product, with `apps_detected` attached):

```json
{
  "store_domain": "allbirds.com",
  "product_title": "Wool Runner",
  "price": 110,
  "available": true,
  "vendor": "Allbirds",
  "apps_detected": {
    "email": ["Klaviyo"],
    "reviews": ["Yotpo"],
    "subscriptions": [],
    "search": ["Searchanise"]
  },
  "product_url": "https://allbirds.com/products/mens-wool-runners"
}
```
If you want reviews, set `extract_level: "full"` and a `max_reviews_per_product`. The actor will route to the correct reviews API based on what was detected.
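That routing step is trivial once detection has run: it is a lookup from the detected reviews app to its fetcher. A stub sketch — the fetcher bodies are placeholders, not the real per-app API clients:

```javascript
// Map detected reviews app -> fetcher. Bodies are stubs; the real ones
// call each app's public JSON reviews endpoint.
const reviewFetchers = {
  "Judge.me": async (storeDomain) => [], // stub
  "Yotpo": async (storeDomain) => [],    // stub
};

// Return the fetcher for the first detected reviews app, or null.
function routeReviews(appsDetected, fetchers = reviewFetchers) {
  const [app] = appsDetected.reviews ?? [];
  return app && fetchers[app] ? fetchers[app] : null;
}
```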
Direct link, free $5 credit on Apify, no account-creation drama: Shopify Scraper – Apps Spy + Reviews.
FAQ
Is scraping `/products.json` allowed?
Shopify exposes `/products.json` publicly on every store by default. Stores that disable it (rare) return 404 and the actor logs a skip. The actor never authenticates, never bypasses access controls, and respects standard rate limits.
What about reCAPTCHA or Cloudflare?
Not an issue for the standard catalog and app-detection flow: `/products.json` and the homepage HTML are not gated. For some reviews APIs, very high request volumes can trigger rate limits — the actor backs off and retries with jitter.
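The backoff is the standard exponential-delay-with-full-jitter pattern. A sketch with illustrative constants (the actor's exact tuning may differ):

```javascript
// Exponential backoff with "full jitter": delay is uniform in [0, min(cap, base * 2^attempt)).
// rand is injectable so the delay math can be tested deterministically.
function backoffDelayMs(attempt, { baseMs = 500, capMs = 30_000, rand = Math.random } = {}) {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rand() * exp);
}
```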
How is this different from Koala Inspector, ShopScan or BuiltWith?
Koala Inspector, ShopScan and Fera are excellent Chrome extensions for one-store lookups, but none of them do batch — you cannot paste 500 URLs and get a CSV back. BuiltWith is a generic tech-stack tool with broad coverage but its Shopify-app detection is shallow and you cannot pull products in the same call. This actor is purpose-built for Shopify and runs in batch via API: deeper app detection (subscriptions, reviews, popups, search, loyalty), full product catalog, and reviews — all in one pass, billed pay-per-event.
How long does a 1,000-store scan take?
About 25 minutes at default concurrency, costing ~$3 of Apify credits at the standard level. A full run with reviews is closer to an hour and ~$15 depending on review volume.
Can I get one record per variant instead of per product?
Yes. Set include_variants: true in the input and the dataset returns one row per SKU with size/color/price/availability normalized.
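The flattening is a simple map over the variants array in Shopify's public `products.json` shape. A sketch — the exact field selection in the actor may differ:

```javascript
// Explode one products.json product into one row per SKU.
// Note: products.json serves variant prices as strings, e.g. "110.00".
function explodeVariants(product) {
  return (product.variants ?? []).map((v) => ({
    product_title: product.title,
    sku: v.sku,
    variant_title: v.title,   // e.g. "9 / Black"
    price: Number(v.price),   // normalize string price to a number
    available: v.available ?? null,
  }));
}
```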
What is next
I want to add three things, in order:
- Revenue estimation at the `pro` tier — based on review velocity and product velocity, both of which are observable.
- MCP server mode so you can query it from Claude Desktop / Cursor.
- Theme detection — useful for agency outbound, less useful for me, but I keep being asked.
If you use it and something breaks, ping me — I am the only maintainer and I read every issue. The actor is on Apify Store at kazkn/shopify-scraper-apps-spy.
Tags: #shopify #ecommerce #api #indiehackers