Web scraping Shopify stores is one of the most common data extraction tasks in e-commerce intelligence. Whether you're doing competitive analysis, building price comparison tools, or aggregating product catalogs, understanding how to extract data from Shopify-powered stores is an essential skill.
In this comprehensive guide, we'll cover Shopify's architecture, the different approaches to extracting product data, reviews, and store metadata, and how to scale your scraping operations using cloud infrastructure like Apify.
Understanding Shopify's Architecture
Shopify powers over 4.8 million online stores worldwide, making it one of the most popular e-commerce platforms in the world. Every Shopify store follows a predictable URL structure and data format, which makes scraping significantly more straightforward than on custom-built platforms.
URL Structure
Shopify stores follow consistent URL patterns that you can rely on:
- Products listing: https://store.com/collections/all
- Individual product: https://store.com/products/product-handle
- Product JSON: https://store.com/products/product-handle.json
- Collections: https://store.com/collections/collection-handle
- All products JSON: https://store.com/products.json
- Collection JSON: https://store.com/collections/all/products.json
- Search: https://store.com/search?q=keyword&type=product
- Sitemap: https://store.com/sitemap.xml
This predictability is a huge advantage. Unlike custom-built e-commerce sites where every store has a different layout and data structure, you know exactly where to find data on any Shopify store from day one.
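Because the scheme is identical across stores, you can derive every useful URL from just a store domain and a product handle. A minimal sketch (the store URL and handle here are placeholders):

```javascript
// Derive the predictable Shopify URLs for a given store and product handle.
function shopifyUrls(storeUrl, handle) {
  const base = storeUrl.replace(/\/+$/, ''); // normalize trailing slashes
  return {
    allProducts: `${base}/collections/all`,
    product: `${base}/products/${handle}`,
    productJson: `${base}/products/${handle}.json`,
    productsJson: `${base}/products.json?limit=250`,
    sitemap: `${base}/sitemap.xml`
  };
}
```
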
The Built-In JSON Endpoints
One of Shopify's best-kept secrets for data extraction is its built-in JSON API. Most Shopify stores expose product data through JSON endpoints without any authentication required:
// Fetch products from any Shopify store
const response = await fetch('https://example-store.com/products.json?limit=250');
const data = await response.json();
console.log(data.products.length); // Up to 250 products per page
console.log(data.products[0].title);
console.log(data.products[0].variants[0].price);
The /products.json endpoint supports pagination, allowing you to iterate through all products in a store:
// Paginate through all products
async function getAllProducts(storeUrl) {
let page = 1;
let allProducts = [];
while (true) {
const url = `${storeUrl}/products.json?limit=250&page=${page}`;
const response = await fetch(url);
const data = await response.json();
if (data.products.length === 0) break;
allProducts = allProducts.concat(data.products);
page++;
// Be respectful - add delay between requests
await new Promise(resolve => setTimeout(resolve, 1500));
}
return allProducts;
}
const products = await getAllProducts('https://example-store.com');
console.log(`Found ${products.length} total products`);
Important caveat: As of late 2024, Shopify started rate-limiting and restricting access to some JSON endpoints on certain stores. Larger stores or those with custom configurations may block unauthenticated JSON access. Always test if the endpoint is accessible before building your entire scraper around it.
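A quick probe like the following can verify accessibility up front. This is a sketch: it assumes an open endpoint answers 200 with a JSON body containing a products array, and treats anything else (redirects to a block page, HTML error pages, 401/403/429) as unavailable.

```javascript
// Probe whether a store's /products.json endpoint is openly accessible
// before building a scraper around it.
async function jsonEndpointAvailable(storeUrl) {
  try {
    const res = await fetch(`${storeUrl}/products.json?limit=1`, { redirect: 'follow' });
    if (!res.ok) return false;
    // Block pages often come back as HTML with a 200 status
    const contentType = res.headers.get('content-type') || '';
    if (!contentType.includes('application/json')) return false;
    const data = await res.json();
    return Array.isArray(data.products);
  } catch {
    return false;
  }
}
```
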
Three Methods for Extracting Product Data
Method 1: JSON API (Preferred Approach)
The JSON API returns richly structured product data with everything you could need:
{
"product": {
"id": 123456789,
"title": "Premium Cotton T-Shirt",
"body_html": "<p>Made from 100% organic cotton...</p>",
"vendor": "BrandName",
"product_type": "T-Shirts",
"created_at": "2024-01-15T10:30:00-05:00",
"handle": "premium-cotton-t-shirt",
"tags": ["cotton", "organic", "bestseller"],
"variants": [
{
"id": 987654321,
"title": "Small / Blue",
"price": "29.99",
"compare_at_price": "39.99",
"sku": "PCT-S-BLU",
"inventory_quantity": 45,
"weight": 200,
"weight_unit": "g"
}
],
"images": [
{
"src": "https://cdn.shopify.com/s/files/1/image.jpg",
"alt": "Premium Cotton T-Shirt - Blue"
}
]
}
}
This gives you pricing, variants, inventory levels, images, tags, vendor information, and more — all cleanly structured and ready to process. For competitive analysis, the compare_at_price field is especially valuable because it reveals discount strategies.
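A small sketch of that analysis: given a product object in the shape above, compute the discount depth per variant. Note that the JSON API returns prices as strings, so they need parsing first.

```javascript
// Summarize discounting on a product from the JSON API: any variant whose
// compare_at_price exceeds its price is currently marked down.
function discountSummary(product) {
  return product.variants
    .filter(v => v.compare_at_price &&
                 parseFloat(v.compare_at_price) > parseFloat(v.price))
    .map(v => ({
      variant: v.title,
      price: parseFloat(v.price),
      wasPrice: parseFloat(v.compare_at_price),
      discountPct: Math.round(
        (1 - parseFloat(v.price) / parseFloat(v.compare_at_price)) * 100
      )
    }));
}
```
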
Method 2: DOM Scraping with Cheerio
When JSON endpoints are restricted, you fall back to parsing the HTML directly. Shopify themes vary widely, but the Crawlee framework with Cheerio makes this manageable:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $ }) {
// Product title - most themes use h1
const title = $('h1.product-title, h1.product__title, h1[itemprop="name"]')
.first().text().trim();
// Price extraction - look for common patterns
const price = $('[data-product-price], .product-price .money, .price-item--regular')
.first().text().trim();
const comparePrice = $('[data-compare-price], .price-item--sale, .product-price__compare')
.first().text().trim();
// Description
const description = $('.product-description, .product__description')
.text().trim();
// Images - Shopify lazy-loads with data-src
const images = [];
$('img[data-src], img.product__image').each((i, el) => {
const src = $(el).attr('data-src') || $(el).attr('src');
if (src && src.includes('cdn.shopify.com')) {
images.push(src.replace(/_\d+x\d*(?=\.)/, '')); // strip size suffix like _600x600 to get the full-size image
}
});
// Many themes embed variant JSON in a script tag
let variants = [];
$('script').each((i, el) => {
const text = $(el).html() || '';
const match = text.match(/"variants"\s*:\s*(\[.*?\])/s);
if (match) {
try { variants = JSON.parse(match[1]); } catch {}
}
});
console.log({ title, price, comparePrice, images: images.length, variants: variants.length });
}
});
The challenge with DOM scraping is that Shopify has thousands of themes, each with different CSS class names and HTML structures. Using multiple selectors with fallbacks (as shown above) helps handle theme variation.
Method 3: Storefront API (GraphQL)
For stores that expose a Storefront API token (often found in the page source), you can use GraphQL queries:
// Find the Storefront token in page source
// Look for: X-Shopify-Storefront-Access-Token or accessToken
const storefrontToken = 'found-in-page-source';
const query = `{
products(first: 50, after: null) {
pageInfo {
hasNextPage
endCursor
}
edges {
node {
title
description
handle
productType
vendor
priceRange {
minVariantPrice { amount currencyCode }
maxVariantPrice { amount currencyCode }
}
variants(first: 20) {
edges {
node {
title
price { amount currencyCode }
availableForSale
sku
}
}
}
images(first: 5) {
edges {
node { url altText }
}
}
}
}
}
}`;
const response = await fetch(
'https://store.myshopify.com/api/2024-01/graphql.json',
{
method: 'POST',
headers: {
'Content-Type': 'application/json',
'X-Shopify-Storefront-Access-Token': storefrontToken
},
body: JSON.stringify({ query })
}
);
const data = await response.json();
const products = data.data.products.edges.map(e => e.node);
The Storefront API is the most reliable data source but requires a valid access token. Many stores embed these tokens in their frontend JavaScript, making them discoverable.
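Discovery can be scripted by scanning the fetched page HTML for the token. The patterns below are assumptions about common embeddings (themes vary), and they assume the usual 32-character hex token format:

```javascript
// Hedged sketch: look for a Storefront access token in page source.
// Returns the token string, or null if no known pattern matches.
function findStorefrontToken(html) {
  const patterns = [
    /X-Shopify-Storefront-Access-Token['"]?\s*[:=]\s*['"]([a-f0-9]{32})['"]/i,
    /storefrontAccessToken['"]?\s*[:=]\s*['"]([a-f0-9]{32})['"]/i
  ];
  for (const re of patterns) {
    const m = html.match(re);
    if (m) return m[1];
  }
  return null;
}
```
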
Extracting Review Data
Product reviews are critical for competitive intelligence and sentiment analysis. Shopify doesn't have a native review system, so stores use third-party apps. Each has its own data structure and access method.
Judge.me Reviews
Judge.me is one of the most popular review apps on Shopify. Reviews can be fetched via their public widget API:
async function getJudgeMeReviews(shopDomain, productId) {
const url = `https://judge.me/api/v1/reviews?` +
`shop_domain=${shopDomain}&` +
`api_token=PUBLIC_TOKEN&` +
`product_id=${productId}&` +
`per_page=50&page=1`;
const response = await fetch(url);
const data = await response.json();
return data.reviews.map(r => ({
author: r.reviewer.name,
rating: r.rating,
title: r.title,
body: r.body,
date: r.created_at,
verified: r.verified === 'buyer',
images: r.pictures?.map(p => p.urls.original) || []
}));
}
Loox and Yotpo Reviews
Yotpo and Loox embed review widgets that load data via their own APIs:
// Yotpo - find the app key in page source
const appKey = pageSource.match(/yotpoAppKey['":\s]+['"](\w+)['"]/)?.[1];
const yotpoUrl = `https://api.yotpo.com/v1/widget/${appKey}/products/${productId}/reviews.json?per_page=50&page=1`;
// Loox - reviews are often in iframes
// Parse the Loox widget URL from the page
const looxReviews = await page.evaluate(() => {
const items = document.querySelectorAll('.loox-review');
return Array.from(items).map(el => ({
author: el.querySelector('.loox-review-author')?.textContent,
rating: el.querySelectorAll('.loox-star.loox-filled').length,
body: el.querySelector('.loox-review-content')?.textContent
}));
});
Generic DOM-Based Review Extraction
When you can't identify the review app or access its API, fall back to DOM parsing:
async requestHandler({ $, request }) {
const reviews = [];
// Common review selectors across multiple apps
const reviewSelectors = [
'.spr-review', '.jdgm-rev', '.yotpo-review',
'.loox-review', '.review-item', '[data-review-id]'
];
const selector = reviewSelectors.find(s => $(s).length > 0);
if (!selector) return reviews;
$(selector).each((i, el) => {
reviews.push({
author: $(el).find('.review-author, .spr-review-header-byline, [itemprop="author"]')
.text().trim(),
rating: parseFloat(
$(el).find('[data-rating], .star-rating, [itemprop="ratingValue"]')
.attr('data-rating') || $(el).find('.star.filled, .star-icon--full').length
),
date: $(el).find('.review-date, [itemprop="datePublished"]')
.text().trim(),
title: $(el).find('.review-title, .spr-review-header-title')
.text().trim(),
body: $(el).find('.review-body, .spr-review-content, [itemprop="reviewBody"]')
.text().trim(),
verified: $(el).find('.verified-badge, .spr-badge').length > 0
});
});
return reviews;
}
Monitoring Inventory and Stock Levels
Tracking competitor inventory levels provides actionable business intelligence — you can identify best-sellers, detect stockouts, and understand demand patterns:
async function monitorInventory(storeUrl, productHandle) {
const url = `${storeUrl}/products/${productHandle}.json`;
const { product } = await fetch(url).then(r => r.json());
return product.variants.map(variant => ({
title: variant.title,
sku: variant.sku,
available: variant.available,
inventory_quantity: variant.inventory_quantity,
price: variant.price,
compare_at_price: variant.compare_at_price,
inventory_policy: variant.inventory_policy
}));
}
// Track changes over time
async function trackInventoryChanges(storeUrl, handles) {
const snapshot = {};
for (const handle of handles) {
snapshot[handle] = await monitorInventory(storeUrl, handle);
await new Promise(r => setTimeout(r, 1000)); // rate limit
}
// Compare with previous snapshot to detect changes
return snapshot;
}
Note that many stores do not expose inventory_quantity through the public JSON endpoints at all (Shopify stripped it from unauthenticated responses years ago). In those cases, you can only see available: true/false without exact counts.
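The "compare with previous snapshot" step can be sketched as a diff over variant SKUs. This assumes the snapshot shape produced by monitorInventory above and that every variant carries a SKU:

```javascript
// Diff two inventory snapshots (arrays of variant records keyed by SKU)
// to surface stockouts, price changes, and newly added variants.
function diffInventory(prev, curr) {
  const bySku = arr => Object.fromEntries(arr.map(v => [v.sku, v]));
  const before = bySku(prev);
  const changes = [];
  for (const v of curr) {
    const old = before[v.sku];
    if (!old) {
      changes.push({ sku: v.sku, change: 'new_variant' });
      continue;
    }
    if (old.available && !v.available) {
      changes.push({ sku: v.sku, change: 'stockout' });
    }
    if (old.price !== v.price) {
      changes.push({ sku: v.sku, change: 'price', from: old.price, to: v.price });
    }
  }
  return changes;
}
```
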
Scaling with Apify
When you need to scrape dozens, hundreds, or thousands of Shopify stores, doing it on your local machine becomes impractical. Network limits, IP bans, and compute resources all become bottlenecks. This is where cloud scraping platforms like Apify become essential.
Why Use Apify?
- Proxy rotation: Built-in datacenter and residential proxy pools
- Scheduling: Run scrapers on cron schedules automatically
- Storage: Datasets, key-value stores, and request queues built in
- Monitoring: Track run history, errors, and performance
- Pre-built actors: Ready-to-use scrapers in the Apify Store
Using Pre-Built Actors from the Apify Store
The fastest way to start scraping Shopify stores is with pre-built actors from the Apify Store. These handle all the edge cases — rate limiting, proxy rotation, theme variations, anti-bot measures — so you can focus on what to do with the data:
import { ApifyClient } from 'apify-client';
const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });
// Run a Shopify scraper from the Apify Store
const run = await client.actor('shopify-scraper-actor').call({
startUrls: [
{ url: 'https://store1.com' },
{ url: 'https://store2.com' },
{ url: 'https://store3.com' }
],
maxProducts: 1000,
includeReviews: true,
proxyConfiguration: {
useApifyProxy: true,
apifyProxyGroups: ['RESIDENTIAL']
}
});
// Fetch all results
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Scraped ${items.length} products across 3 stores`);
// Export to CSV
const csvBuffer = await client.dataset(run.defaultDatasetId)
.downloadItems('csv');
Building a Custom Apify Actor
When you need custom extraction logic, the Crawlee framework combined with Apify's infrastructure is powerful:
import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';
await Actor.init();
const input = await Actor.getInput();
const { startUrls, maxProducts = 500, extractReviews = false } = input;
const crawler = new CheerioCrawler({
maxRequestsPerCrawl: maxProducts,
maxConcurrency: 5,
async requestHandler({ request, $, enqueueLinks }) {
const url = request.url;
// On collection pages, enqueue product links
if (url.includes('/collections/')) {
await enqueueLinks({
selector: 'a[href*="/products/"]',
baseUrl: request.loadedUrl
});
// Handle pagination
const nextPage = $('a.pagination__next, [rel="next"]').attr('href');
if (nextPage) {
await enqueueLinks({ urls: [new URL(nextPage, url).href] });
}
return;
}
// On product pages, try JSON first
try {
const jsonUrl = (url.endsWith('/') ? url.slice(0, -1) : url) + '.json';
const jsonResp = await fetch(jsonUrl);
if (jsonResp.ok) {
const { product } = await jsonResp.json();
await Dataset.pushData({
url,
source: 'json_api',
title: product.title,
vendor: product.vendor,
productType: product.product_type,
tags: product.tags,
variants: product.variants.map(v => ({
title: v.title, price: v.price, sku: v.sku,
available: v.available
})),
images: product.images.map(i => i.src),
scrapedAt: new Date().toISOString()
});
return;
}
} catch {}
// Fallback to DOM parsing
const jsonLd = $('script[type="application/ld+json"]').html();
let structured = {};
try { structured = JSON.parse(jsonLd || '{}'); } catch {} // malformed JSON-LD shouldn't crash the handler
await Dataset.pushData({
url,
source: 'dom_parsing',
title: structured.name || $('h1').first().text().trim(),
price: structured.offers?.price,
currency: structured.offers?.priceCurrency,
description: structured.description?.substring(0, 500),
image: structured.image,
brand: structured.brand?.name,
scrapedAt: new Date().toISOString()
});
}
});
await crawler.run(startUrls.map(u => u.url));
await Actor.exit();
Handling Anti-Scraping Measures
Rate Limiting
Always be respectful. Hammering a store with requests can get your IP banned and potentially cause legal issues:
const crawler = new CheerioCrawler({
maxConcurrency: 3,
maxRequestsPerMinute: 30,
requestHandlerTimeoutSecs: 60,
async requestHandler({ request, $ }) {
// Add randomized delays to appear more human
await new Promise(resolve =>
setTimeout(resolve, 1000 + Math.random() * 2000)
);
// ... extraction logic
}
});
JavaScript-Rendered Content
Some Shopify themes heavily rely on JavaScript for rendering product data. When Cheerio isn't enough, switch to Playwright:
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
launchContext: {
launchOptions: { headless: true }
},
async requestHandler({ page, request }) {
await page.waitForSelector('.product-grid, .collection-products', {
timeout: 15000
});
const products = await page.evaluate(() => {
return Array.from(document.querySelectorAll('.product-card')).map(el => ({
title: el.querySelector('.product-title, .card__heading')?.textContent?.trim(),
price: el.querySelector('.price, .money')?.textContent?.trim(),
link: el.querySelector('a')?.href,
image: el.querySelector('img')?.src
}));
});
for (const product of products) {
await Dataset.pushData(product);
}
}
});
Best Practices for Shopify Scraping
Always try JSON endpoints first — they're faster, more reliable, and return cleaner data than DOM scraping.
Respect robots.txt — check https://store.com/robots.txt before scraping. Many Shopify stores disallow certain paths.
Rate limit aggressively — keep requests under 30 per minute per store. Use random delays to vary your request pattern.
Leverage structured data — Shopify themes embed JSON-LD, Open Graph, and meta tags. Use these before parsing arbitrary HTML.
Handle errors gracefully — stores go down, pages get removed, products sell out. Build retry logic with exponential backoff.
Cache where possible — if scraping the same store daily, only re-fetch products that changed. Use ETags or Last-Modified headers.
Use residential proxies for scale — datacenter IPs get blocked quickly on popular stores. Residential proxies on Apify last much longer.
Monitor your scraper — set up alerts for sudden drops in data volume, which usually indicate blocking or site changes.
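The caching practice above can be sketched with conditional requests: send If-None-Match and skip re-processing when the server answers 304 Not Modified. The in-memory Map is an assumption for brevity; a real pipeline would persist the cache between runs, and not every store returns ETags.

```javascript
// ETag-based conditional fetching: re-download a URL only when it changed.
const etagCache = new Map();

async function fetchIfChanged(url) {
  const headers = {};
  const cached = etagCache.get(url);
  if (cached) headers['If-None-Match'] = cached.etag;

  const res = await fetch(url, { headers });
  if (res.status === 304) {
    // Unchanged since last run: reuse the cached body
    return { changed: false, data: cached.data };
  }
  const data = await res.json();
  etagCache.set(url, { etag: res.headers.get('etag'), data });
  return { changed: true, data };
}
```
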
Legal and Ethical Considerations
Web scraping legality varies by jurisdiction, but there are universal best practices:
- Only scrape publicly available data — never bypass authentication or access restricted areas
- Read and respect Terms of Service — some stores explicitly prohibit automated access
- Don't overload servers — excessive concurrent requests can constitute denial of service
- Comply with GDPR/CCPA — if collecting personal data (reviewer names, etc.), ensure proper compliance
- Use data responsibly — scraped data should inform business decisions, not enable spam or unfair practices
- Consider the store owner — would you be comfortable if someone scraped your store this way?
Conclusion
Shopify's predictable architecture makes it one of the most scraper-friendly e-commerce platforms. Whether you use the built-in JSON endpoints, DOM scraping with Cheerio, or the Storefront GraphQL API, the key is choosing the right approach for your specific use case and scaling needs.
For small-scale competitive analysis, a simple script using the JSON endpoints might be all you need. For large-scale market intelligence across hundreds of stores, leveraging cloud infrastructure like the Apify Store with its pre-built actors and proxy management will save you significant development and operational overhead.
The best approach is layered: start with JSON APIs, fall back to structured data (JSON-LD), then DOM parsing, and finally JavaScript rendering — each progressively more complex but more robust. Combined with respectful rate limiting and proper proxy rotation, you can build reliable Shopify scraping pipelines that deliver actionable e-commerce intelligence at scale.