Web scraping Shopify stores has become one of the most common data extraction tasks in e-commerce intelligence. Whether you're monitoring competitor prices, building a product comparison engine, or analyzing market trends, Shopify's predictable structure makes it an excellent target for automated data collection.
In this comprehensive guide, I'll walk you through everything you need to know about extracting product data, reviews, and inventory information from Shopify-powered stores.
Understanding Shopify's Storefront Architecture
Shopify powers over 4.6 million websites globally. What makes it particularly interesting for web scraping is its consistent, well-structured data layer. Every Shopify store follows the same underlying architecture, which means a scraper built for one store can often work across thousands of others with minimal modifications.
The JSON Product API
Every Shopify store exposes a built-in JSON API that doesn't require authentication for public product data. This is the single most important thing to understand about Shopify scraping:
https://store-domain.com/products.json
https://store-domain.com/products.json?page=2&limit=250
https://store-domain.com/collections/all/products.json
This endpoint returns structured JSON containing product titles, descriptions, variants, prices, images, tags, and more. No API key required for public storefronts.
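To get a feel for what comes back, here is a trimmed, invented payload and the one conversion it always invites: prices arrive as strings, so cast them before doing arithmetic.

```python
# A trimmed, invented /products.json payload, to show the response shape.
sample = {
    "products": [
        {
            "id": 1001,
            "title": "Classic Tee",
            "handle": "classic-tee",
            "vendor": "Acme",
            "variants": [
                {"id": 2001, "title": "S", "price": "19.00", "available": True},
                {"id": 2002, "title": "M", "price": "21.00", "available": False},
            ],
            "images": [{"src": "https://cdn.example.com/tee.jpg"}],
        }
    ]
}

# Prices arrive as strings; convert before doing arithmetic.
prices = [float(v["price"]) for p in sample["products"] for v in p["variants"]]
print(prices)  # [19.0, 21.0]
```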
Product Page Structure
Individual product pages follow this pattern:
https://store-domain.com/products/product-handle.json
Each product page also embeds structured data in JSON-LD format within the HTML, which search engines use and which provides another reliable extraction point.
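As a quick illustration, the JSON-LD block can be pulled out of a page's HTML with the standard library alone. The fragment below is invented; in a real run you would fetch the page first, and BeautifulSoup (used later in this guide) handles multiple blocks and messier markup more robustly.

```python
import json
import re

# An invented product-page fragment standing in for fetched HTML.
html = '''
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Classic Tee",
 "offers": {"@type": "Offer", "price": "19.00", "priceCurrency": "USD"}}
</script>
'''

# Grab the first JSON-LD block and parse it as ordinary JSON.
match = re.search(r'<script type="application/ld\+json">(.*?)</script>', html, re.S)
data = json.loads(match.group(1))
print(data["name"], data["offers"]["price"])  # Classic Tee 19.00
```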
Extracting Product Catalog Data
Let's start with the most fundamental task: pulling the complete product catalog from a Shopify store.
JavaScript Approach (Node.js)
const axios = require('axios');
// Note: inventory_quantity is often null or absent on public storefront
// endpoints, so treat it as optional.
async function scrapeShopifyProducts(storeUrl) {
  const products = [];
  let page = 1;
  const limit = 250; // Maximum allowed by Shopify

  while (true) {
    try {
      const url = `${storeUrl}/products.json?page=${page}&limit=${limit}`;
      const response = await axios.get(url, {
        headers: {
          'User-Agent': 'Mozilla/5.0 (compatible; ProductResearch/1.0)'
        }
      });

      const pageProducts = response.data.products;
      if (!pageProducts || pageProducts.length === 0) {
        break;
      }

      for (const product of pageProducts) {
        products.push({
          id: product.id,
          title: product.title,
          vendor: product.vendor,
          productType: product.product_type,
          handle: product.handle,
          createdAt: product.created_at,
          updatedAt: product.updated_at,
          tags: product.tags,
          variants: product.variants.map(v => ({
            id: v.id,
            title: v.title,
            price: v.price,
            compareAtPrice: v.compare_at_price,
            sku: v.sku,
            available: v.available,
            inventoryQuantity: v.inventory_quantity
          })),
          images: product.images.map(img => img.src)
        });
      }

      console.log(`Page ${page}: Found ${pageProducts.length} products`);
      page++;

      // Respectful delay between requests
      await new Promise(resolve => setTimeout(resolve, 1000));
    } catch (error) {
      if (error.response && error.response.status === 429) {
        console.log('Rate limited. Waiting 30 seconds...');
        await new Promise(resolve => setTimeout(resolve, 30000));
        continue;
      }
      break;
    }
  }

  console.log(`Total products scraped: ${products.length}`);
  return products;
}

// Usage
scrapeShopifyProducts('https://example-store.myshopify.com')
  .then(products => {
    const fs = require('fs');
    fs.writeFileSync('products.json', JSON.stringify(products, null, 2));
  });
Python Approach
import requests
import json
import time

def scrape_shopify_products(store_url):
    products = []
    page = 1
    limit = 250

    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (compatible; ProductResearch/1.0)'
    })

    while True:
        url = f"{store_url}/products.json?page={page}&limit={limit}"
        try:
            response = session.get(url, timeout=30)
            if response.status_code == 429:
                print("Rate limited. Waiting 30 seconds...")
                time.sleep(30)
                continue
            response.raise_for_status()

            data = response.json()
            page_products = data.get('products', [])
            if not page_products:
                break

            for product in page_products:
                products.append({
                    'id': product['id'],
                    'title': product['title'],
                    'vendor': product['vendor'],
                    'product_type': product['product_type'],
                    'handle': product['handle'],
                    'tags': product.get('tags', ''),
                    'variants': [{
                        'id': v['id'],
                        'title': v['title'],
                        'price': v['price'],
                        'compare_at_price': v.get('compare_at_price'),
                        'sku': v.get('sku'),
                        'available': v.get('available', False),
                    } for v in product.get('variants', [])],
                    'images': [img['src'] for img in product.get('images', [])]
                })

            print(f"Page {page}: {len(page_products)} products")
            page += 1
            time.sleep(1)
        except requests.exceptions.RequestException as e:
            print(f"Error on page {page}: {e}")
            break

    print(f"Total: {len(products)} products")
    return products

# Usage
products = scrape_shopify_products("https://example-store.myshopify.com")
with open("products.json", "w") as f:
    json.dump(products, f, indent=2)
Handling Pagination and Large Catalogs
Shopify's /products.json endpoint has a limit of 250 products per page. For stores with thousands of products, you need proper pagination handling.
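The loop's termination condition is easiest to see with the fetching separated out. A minimal sketch, with an injected `fetch_page` callable (an assumption made here for testability) standing in for the HTTP request:

```python
def fetch_all(fetch_page, limit=250):
    """Page through `fetch_page` until an empty page comes back.

    `fetch_page(page, limit)` is an injected callable; in a real scraper
    it would wrap the /products.json request for that page.
    """
    products, page = [], 1
    while True:
        batch = fetch_page(page, limit)
        if not batch:
            break
        products.extend(batch)
        page += 1
    return products

# A fake fetcher: two full pages, then an empty page ends the loop.
def fake_page(page, limit):
    return [{"id": i} for i in range(limit)] if page <= 2 else []

print(len(fetch_all(fake_page)))  # 500
```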
Cursor-Based Pagination
Some stores paginate with a `page_info` cursor returned in the `Link` response header rather than honoring the `page` parameter. Here's how to handle it:
async function scrapeWithCursor(storeUrl) {
  let cursor = null;
  const allProducts = [];

  while (true) {
    let url = `${storeUrl}/products.json?limit=250`;
    if (cursor) {
      url += `&page_info=${cursor}`;
    }

    const response = await axios.get(url);
    const products = response.data.products;
    if (products.length === 0) break;
    allProducts.push(...products);

    // Check for Link header with next page cursor
    const linkHeader = response.headers['link'];
    if (linkHeader && linkHeader.includes('rel="next"')) {
      const match = linkHeader.match(/page_info=([^>&]*)/);
      cursor = match ? match[1] : null;
    } else {
      break;
    }

    await new Promise(r => setTimeout(r, 1000));
  }

  return allProducts;
}
Collection-Based Approach
For very large stores, scraping by collection often yields better results:
import requests
import time

def get_collections(store_url):
    """Fetch all collections from a Shopify store."""
    response = requests.get(f"{store_url}/collections.json")
    return response.json().get('collections', [])

def scrape_by_collection(store_url):
    """Scrape products organized by collection."""
    collections = get_collections(store_url)
    all_products = {}

    for collection in collections:
        handle = collection['handle']
        page = 1
        while True:
            url = f"{store_url}/collections/{handle}/products.json?page={page}&limit=250"
            response = requests.get(url)
            products = response.json().get('products', [])
            if not products:
                break
            for product in products:
                product_id = product['id']
                if product_id not in all_products:
                    all_products[product_id] = product
                    all_products[product_id]['collections'] = []
                all_products[product_id]['collections'].append(handle)
            page += 1
            time.sleep(1)

    return list(all_products.values())
Extracting Product Reviews
Reviews are crucial for competitive analysis and sentiment monitoring. Shopify stores typically use third-party review apps, each with their own data format.
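Before picking an extraction strategy, it helps to know which app a store actually runs. A small heuristic sketch — the marker strings below are commonly seen fingerprints of each app's widget markup, not an exhaustive or guaranteed list:

```python
# Heuristic detector for which review app a storefront embeds, keyed on
# class-name and script-host fingerprints commonly seen in their widgets.
REVIEW_APP_MARKERS = {
    "judge.me": ["jdgm-rev", "judge.me"],
    "loox": ["loox-review", "loox.io"],
    "yotpo": ["yotpo-review", "yotpo.com"],
    "shopify-product-reviews": ["spr-review"],
}

def detect_review_app(html):
    """Return the first app whose marker appears in the page HTML, else None."""
    for app, markers in REVIEW_APP_MARKERS.items():
        if any(marker in html for marker in markers):
            return app
    return None

print(detect_review_app('<div class="jdgm-rev">Great!</div>'))  # judge.me
```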
Common Review Apps and Their Endpoints
Judge.me Reviews:
async function scrapeJudgeMeReviews(shopDomain, apiToken) {
  const reviews = [];
  let page = 1;

  while (true) {
    const url = `https://judge.me/api/v1/reviews?` +
      `shop_domain=${shopDomain}&page=${page}&per_page=100`;
    const response = await axios.get(url, {
      headers: { 'Authorization': `Bearer ${apiToken}` }
    });

    const data = response.data.reviews;
    if (!data || data.length === 0) break;

    reviews.push(...data.map(r => ({
      id: r.id,
      rating: r.rating,
      title: r.title,
      body: r.body,
      reviewer: r.reviewer.name,
      createdAt: r.created_at,
      productId: r.product_external_id,
      verified: r.verified_buyer
    })));

    page++;
    await new Promise(r => setTimeout(r, 500));
  }

  return reviews;
}
Extracting Reviews from HTML (Generic Approach):
from bs4 import BeautifulSoup
import requests
import re

def extract_reviews_from_page(product_url):
    """Generic review extraction from product page HTML."""
    response = requests.get(product_url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    })
    soup = BeautifulSoup(response.text, 'html.parser')
    reviews = []

    # Look for common review containers
    review_selectors = [
        '.spr-review',       # Shopify Product Reviews
        '.jdgm-rev',         # Judge.me
        '.loox-review',      # Loox
        '.yotpo-review',     # Yotpo
        '[data-review-id]',  # Generic data attribute
    ]

    for selector in review_selectors:
        review_elements = soup.select(selector)
        if review_elements:
            for elem in review_elements:
                review = {}

                # Extract rating
                stars = elem.select_one('[class*="star"]')
                if stars:
                    rating_match = re.search(r'(\d)', str(stars.get('style', '')))
                    review['rating'] = rating_match.group(1) if rating_match else None

                # Extract review text
                body = elem.select_one('[class*="body"], [class*="content"], [class*="text"]')
                review['body'] = body.get_text(strip=True) if body else ''

                # Extract author
                author = elem.select_one('[class*="author"], [class*="name"]')
                review['author'] = author.get_text(strip=True) if author else 'Anonymous'

                if review.get('body'):
                    reviews.append(review)
            break

    return reviews
Price Monitoring and Inventory Tracking
One of the highest-value applications of Shopify scraping is automated price monitoring.
Building a Price Monitor
const axios = require('axios');
const fs = require('fs');

class ShopifyPriceMonitor {
  constructor(stores) {
    this.stores = stores;
    this.historyFile = 'price_history.json';
    this.history = this.loadHistory();
  }

  loadHistory() {
    try {
      return JSON.parse(fs.readFileSync(this.historyFile, 'utf8'));
    } catch {
      return {};
    }
  }

  saveHistory() {
    fs.writeFileSync(this.historyFile, JSON.stringify(this.history, null, 2));
  }

  async checkPrices() {
    const changes = [];
    const timestamp = new Date().toISOString();

    for (const store of this.stores) {
      try {
        const response = await axios.get(
          `${store.url}/products.json?limit=250`
        );

        for (const product of response.data.products) {
          for (const variant of product.variants) {
            const key = `${store.name}:${variant.id}`;
            const currentPrice = parseFloat(variant.price);
            const previousEntry = this.history[key];

            if (previousEntry) {
              const previousPrice = previousEntry.price;
              if (currentPrice !== previousPrice) {
                const change = {
                  store: store.name,
                  product: product.title,
                  variant: variant.title,
                  oldPrice: previousPrice,
                  newPrice: currentPrice,
                  change: ((currentPrice - previousPrice) / previousPrice * 100).toFixed(2),
                  timestamp
                };
                changes.push(change);
                console.log(
                  `PRICE CHANGE: ${product.title} ` +
                  `$${previousPrice} -> $${currentPrice} ` +
                  `(${change.change}%)`
                );
              }
            }

            this.history[key] = {
              price: currentPrice,
              available: variant.available,
              product: product.title,
              variant: variant.title,
              lastChecked: timestamp
            };
          }
        }

        await new Promise(r => setTimeout(r, 2000));
      } catch (err) {
        console.error(`Error checking ${store.name}: ${err.message}`);
      }
    }

    this.saveHistory();
    return changes;
  }
}

// Usage
const monitor = new ShopifyPriceMonitor([
  { name: 'Store A', url: 'https://store-a.myshopify.com' },
  { name: 'Store B', url: 'https://store-b.myshopify.com' }
]);

monitor.checkPrices().then(changes => {
  console.log(`Found ${changes.length} price changes`);
});
Inventory Level Detection
import requests

def check_inventory_status(store_url):
    """Check inventory availability across all products."""
    response = requests.get(f"{store_url}/products.json?limit=250")
    products = response.json().get('products', [])

    inventory_report = []
    for product in products:
        for variant in product.get('variants', []):
            status = {
                'product': product['title'],
                'variant': variant['title'],
                'price': variant['price'],
                'available': variant.get('available', False),
                'inventory_policy': variant.get('inventory_policy', 'deny'),
                # Exact quantities are rarely exposed on public storefront
                # endpoints, so expect 'N/A' for most stores.
                'inventory_quantity': variant.get('inventory_quantity', 'N/A')
            }
            inventory_report.append(status)

    out_of_stock = [item for item in inventory_report if not item['available']]
    low_stock = [item for item in inventory_report
                 if item['available']
                 and isinstance(item['inventory_quantity'], int)
                 and item['inventory_quantity'] < 10]

    return {
        'total_variants': len(inventory_report),
        'out_of_stock': len(out_of_stock),
        'low_stock': len(low_stock),
        'details': inventory_report
    }
Using Apify Store for Shopify Scraping
While building your own scraper gives you full control, using pre-built actors on the Apify Store can save significant development time.
Apify offers several ready-to-use Shopify scrapers that handle all the edge cases — rate limiting, pagination, proxy rotation, and data formatting. These actors run in the cloud, so you don't need to manage infrastructure.
Key Benefits of Using Apify Actors
- Proxy management: Automatic rotation to avoid IP blocks
- Scheduled runs: Set up daily, hourly, or custom monitoring schedules
- Cloud execution: No local infrastructure needed
- Data export: Direct export to CSV, JSON, Excel, or webhooks
- Monitoring dashboards: Track run success rates and data quality
Example: Using Apify SDK
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
  token: 'YOUR_APIFY_TOKEN',
});

async function runShopifyScraper(storeUrl) {
  const run = await client.actor('apify/shopify-scraper').call({
    startUrls: [{ url: storeUrl }],
    maxProducts: 1000,
    includeReviews: true,
    proxy: {
      useApifyProxy: true,
      apifyProxyGroups: ['RESIDENTIAL']
    }
  });

  const { items } = await client.dataset(run.defaultDatasetId).listItems();
  console.log(`Scraped ${items.length} products`);
  return items;
}
Best Practices and Rate Limiting
When scraping Shopify stores, following best practices ensures reliability and avoids issues:
Respectful Scraping Guidelines
Add delays between requests: At minimum 1 second between requests. Shopify's rate limit is typically 2 requests per second for unauthenticated access.
Use meaningful User-Agent strings: Identify your bot clearly rather than pretending to be a browser.
Handle 429 responses gracefully: Implement exponential backoff when rate limited.
Cache responses: Don't re-scrape data that hasn't changed. Check `updated_at` timestamps.
Respect robots.txt: While `/products.json` is generally accessible, check the store's robots.txt for any specific restrictions.
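The backoff guideline above can be sketched in a few lines. The `backoff_delay` helper and its jitter factor are illustrative choices, not a fixed recipe:

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with jitter: roughly 1s, 2s, 4s, ... capped at `cap`.

    The jitter multiplies the delay by a random factor in [0.5, 1.0) so
    concurrent scrapers don't all retry at the same instant.
    """
    return min(cap, base * (2 ** attempt)) * (0.5 + random.random() / 2)

# Without jitter, the schedule doubles each attempt until it hits the cap:
print([min(60.0, 1.0 * 2 ** a) for a in range(7)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```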
Error Handling Pattern
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_resilient_session():
    session = requests.Session()
    retries = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"]
    )
    adapter = HTTPAdapter(max_retries=retries)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
Data Storage and Analysis
Once you've collected product data, proper storage enables analysis:
import sqlite3
import json

def store_products(products, db_path="shopify_data.db"):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS products (
            id INTEGER PRIMARY KEY,
            title TEXT,
            vendor TEXT,
            product_type TEXT,
            handle TEXT,
            tags TEXT,
            created_at TEXT,
            updated_at TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    ''')

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS variants (
            id INTEGER PRIMARY KEY,
            product_id INTEGER,
            title TEXT,
            price REAL,
            compare_at_price REAL,
            sku TEXT,
            available BOOLEAN,
            FOREIGN KEY (product_id) REFERENCES products(id)
        )
    ''')

    for product in products:
        cursor.execute('''
            INSERT OR REPLACE INTO products
            (id, title, vendor, product_type, handle, tags, created_at, updated_at)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?)
        ''', (
            product['id'], product['title'], product['vendor'],
            product['product_type'], product['handle'],
            json.dumps(product.get('tags', [])),
            product.get('created_at'), product.get('updated_at')
        ))

        for variant in product.get('variants', []):
            cursor.execute('''
                INSERT OR REPLACE INTO variants
                (id, product_id, title, price, compare_at_price, sku, available)
                VALUES (?, ?, ?, ?, ?, ?, ?)
            ''', (
                variant['id'], product['id'], variant['title'],
                float(variant['price']),
                float(variant['compare_at_price']) if variant.get('compare_at_price') else None,
                variant.get('sku'), variant.get('available', False)
            ))

    conn.commit()
    conn.close()
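Once the data is in SQLite, analysis is a query away. A self-contained sketch with toy, invented rows mirroring the schema above, computing average variant price per vendor:

```python
import sqlite3

# In-memory database with a few invented rows, mirroring the
# products/variants schema, to show the kind of query the storage enables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, vendor TEXT, title TEXT)")
conn.execute("CREATE TABLE variants (id INTEGER PRIMARY KEY, product_id INTEGER, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [(1, "Acme", "Tee"), (2, "Acme", "Hat"), (3, "Globex", "Mug")])
conn.executemany("INSERT INTO variants VALUES (?, ?, ?)",
                 [(1, 1, 19.0), (2, 1, 21.0), (3, 2, 15.0), (4, 3, 9.0)])

# Average variant price per vendor.
rows = conn.execute("""
    SELECT p.vendor, ROUND(AVG(v.price), 2)
    FROM variants v JOIN products p ON p.id = v.product_id
    GROUP BY p.vendor ORDER BY p.vendor
""").fetchall()
print(rows)  # [('Acme', 18.33), ('Globex', 9.0)]
```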
Conclusion
Shopify's consistent architecture makes it one of the most accessible e-commerce platforms for web scraping. The built-in JSON API eliminates much of the complexity you'd face with other platforms, and the predictable URL structure means your scrapers are reliable and maintainable.
Key takeaways:
- Start with `/products.json` — it's the fastest path to product data
- Handle pagination properly — use cursor-based pagination for large catalogs
- Respect rate limits — 1-2 second delays between requests prevent blocks
- Monitor prices over time — single snapshots are less valuable than trend data
- Use cloud platforms like Apify for production workloads that need proxy rotation and scheduling
- Store data in structured formats for analysis and comparison
Whether you're building a price comparison tool, tracking competitor inventory, or analyzing market trends, the techniques in this guide give you a solid foundation for extracting value from Shopify's vast ecosystem of online stores.