
agenthustler


Shopify Store Scraping: Extract Product Data, Reviews and Inventory

Web scraping Shopify stores has become one of the most common data extraction tasks in e-commerce intelligence. Whether you're monitoring competitor prices, building a product comparison engine, or analyzing market trends, Shopify's predictable structure makes it an excellent target for automated data collection.

In this comprehensive guide, I'll walk you through everything you need to know about extracting product data, reviews, and inventory information from Shopify-powered stores.

Understanding Shopify's Storefront Architecture

Shopify powers over 4.6 million websites globally. What makes it particularly interesting for web scraping is its consistent, well-structured data layer. Every Shopify store follows the same underlying architecture, which means a scraper built for one store can often work across thousands of others with minimal modifications.

The JSON Product API

Every Shopify store exposes a built-in JSON API that doesn't require authentication for public product data. This is the single most important thing to understand about Shopify scraping:

https://store-domain.com/products.json
https://store-domain.com/products.json?page=2&limit=250
https://store-domain.com/collections/all/products.json

This endpoint returns structured JSON containing product titles, descriptions, variants, prices, images, tags, and more. No API key required for public storefronts.
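Before writing a full scraper, it helps to know the payload shape. The snippet below mimics a trimmed `/products.json` response (field names are real; the values are made up) and highlights a common gotcha: prices come back as strings, so convert them before doing arithmetic.

```python
# Illustrative slice of a /products.json response (values are made up)
sample = {
    "products": [
        {
            "id": 123456789,
            "title": "Example Tee",
            "handle": "example-tee",
            "vendor": "Example Co",
            "tags": ["cotton", "summer"],
            "variants": [
                {"id": 111, "title": "Small", "price": "19.99", "available": True},
                {"id": 112, "title": "Large", "price": "24.99", "available": False},
            ],
            "images": [{"src": "https://cdn.shopify.com/example.jpg"}],
        }
    ]
}

def in_stock_prices(payload):
    """Collect numeric prices for available variants; prices arrive as strings."""
    return [
        float(v["price"])
        for p in payload["products"]
        for v in p["variants"]
        if v["available"]
    ]
```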

Product Page Structure

Individual product pages follow this pattern:

https://store-domain.com/products/product-handle.json

Each product page also embeds structured data in JSON-LD format within the HTML. Search engines rely on it, and it gives you another reliable extraction point.
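Pulling that JSON-LD out of a page you've already fetched can be done with the standard library alone. This is a rough sketch using a regex over the raw HTML; a real HTML parser is more robust against edge cases like nested script tags.

```python
import json
import re

# Matches <script type="application/ld+json"> blocks in page HTML
JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

def extract_json_ld(html):
    """Return every parseable JSON-LD block embedded in the HTML."""
    blocks = []
    for raw in JSON_LD_RE.findall(html):
        try:
            blocks.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip malformed or templated blocks
    return blocks
```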

Extracting Product Catalog Data

Let's start with the most fundamental task: pulling the complete product catalog from a Shopify store.

JavaScript Approach (Node.js)

const axios = require('axios');

async function scrapeShopifyProducts(storeUrl) {
    const products = [];
    let page = 1;
    const limit = 250; // Maximum allowed by Shopify

    while (true) {
        try {
            const url = `${storeUrl}/products.json?page=${page}&limit=${limit}`;
            const response = await axios.get(url, {
                headers: {
                    'User-Agent': 'Mozilla/5.0 (compatible; ProductResearch/1.0)'
                }
            });

            const pageProducts = response.data.products;

            if (!pageProducts || pageProducts.length === 0) {
                break;
            }

            for (const product of pageProducts) {
                products.push({
                    id: product.id,
                    title: product.title,
                    vendor: product.vendor,
                    productType: product.product_type,
                    handle: product.handle,
                    createdAt: product.created_at,
                    updatedAt: product.updated_at,
                    tags: product.tags,
                    variants: product.variants.map(v => ({
                        id: v.id,
                        title: v.title,
                        price: v.price,
                        compareAtPrice: v.compare_at_price,
                        sku: v.sku,
                        available: v.available,
                        inventoryQuantity: v.inventory_quantity // often absent from public endpoints
                    })),
                    images: product.images.map(img => img.src)
                });
            }

            console.log(`Page ${page}: Found ${pageProducts.length} products`);
            page++;

            // Respectful delay between requests
            await new Promise(resolve => setTimeout(resolve, 1000));
        } catch (error) {
            if (error.response && error.response.status === 429) {
                console.log('Rate limited. Waiting 30 seconds...');
                await new Promise(resolve => setTimeout(resolve, 30000));
                continue;
            }
            console.error(`Stopping at page ${page}: ${error.message}`);
            break;
        }
    }

    console.log(`Total products scraped: ${products.length}`);
    return products;
}

// Usage
scrapeShopifyProducts('https://example-store.myshopify.com')
    .then(products => {
        const fs = require('fs');
        fs.writeFileSync('products.json', JSON.stringify(products, null, 2));
    });

Python Approach

import requests
import json
import time

def scrape_shopify_products(store_url):
    products = []
    page = 1
    limit = 250

    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (compatible; ProductResearch/1.0)'
    })

    while True:
        url = f"{store_url}/products.json?page={page}&limit={limit}"

        try:
            response = session.get(url, timeout=30)

            if response.status_code == 429:
                print("Rate limited. Waiting 30 seconds...")
                time.sleep(30)
                continue

            response.raise_for_status()
            data = response.json()

            page_products = data.get('products', [])
            if not page_products:
                break

            for product in page_products:
                products.append({
                    'id': product['id'],
                    'title': product['title'],
                    'vendor': product['vendor'],
                    'product_type': product['product_type'],
                    'handle': product['handle'],
                    'tags': product.get('tags', []),  # tags arrive as a list of strings
                    'variants': [{
                        'id': v['id'],
                        'title': v['title'],
                        'price': v['price'],
                        'compare_at_price': v.get('compare_at_price'),
                        'sku': v.get('sku'),
                        'available': v.get('available', False),
                    } for v in product.get('variants', [])],
                    'images': [img['src'] for img in product.get('images', [])]
                })

            print(f"Page {page}: {len(page_products)} products")
            page += 1
            time.sleep(1)

        except requests.exceptions.RequestException as e:
            print(f"Error on page {page}: {e}")
            break

    print(f"Total: {len(products)} products")
    return products

# Usage
products = scrape_shopify_products("https://example-store.myshopify.com")
with open("products.json", "w") as f:
    json.dump(products, f, indent=2)

Handling Pagination and Large Catalogs

Shopify's /products.json endpoint has a limit of 250 products per page. For stores with thousands of products, you need proper pagination handling.

Cursor-Based Pagination

Newer Shopify API versions replace page numbers with cursor-based pagination: a page_info token delivered through the Link response header. If page-based requests stop returning results, check for this header and follow the cursor instead:

const axios = require('axios');

async function scrapeWithCursor(storeUrl) {
    let cursor = null;
    const allProducts = [];

    while (true) {
        let url = `${storeUrl}/products.json?limit=250`;
        if (cursor) {
            url += `&page_info=${cursor}`;
        }

        const response = await axios.get(url);
        const products = response.data.products;

        if (products.length === 0) break;

        allProducts.push(...products);

        // Check for Link header with next page cursor
        const linkHeader = response.headers['link'];
        if (linkHeader && linkHeader.includes('rel="next"')) {
            const match = linkHeader.match(/page_info=([^>&]*)/);
            cursor = match ? match[1] : null;
        } else {
            break;
        }

        await new Promise(r => setTimeout(r, 1000));
    }

    return allProducts;
}

Collection-Based Approach

For very large stores, scraping collection by collection is often more reliable, and it captures which collections each product belongs to — something the global /products.json listing doesn't tell you:

import requests
import time

def get_collections(store_url):
    # Fetch collections (collections.json is paginated too; one page covers most stores)
    response = requests.get(f"{store_url}/collections.json?limit=250")
    return response.json().get('collections', [])

def scrape_by_collection(store_url):
    # Scrape products organized by collection
    collections = get_collections(store_url)
    all_products = {}

    for collection in collections:
        handle = collection['handle']
        page = 1

        while True:
            url = f"{store_url}/collections/{handle}/products.json?page={page}&limit=250"
            response = requests.get(url)
            products = response.json().get('products', [])

            if not products:
                break

            for product in products:
                product_id = product['id']
                if product_id not in all_products:
                    all_products[product_id] = product
                    all_products[product_id]['collections'] = []
                all_products[product_id]['collections'].append(handle)

            page += 1
            time.sleep(1)

    return list(all_products.values())

Extracting Product Reviews

Reviews are crucial for competitive analysis and sentiment monitoring. Shopify stores typically use third-party review apps, each with their own data format.

Common Review Apps and Their Endpoints

Judge.me Reviews (requires the shop's Judge.me API token; check Judge.me's API documentation for the exact parameters your plan supports):

async function scrapeJudgeMeReviews(shopDomain, apiToken) {
    const reviews = [];
    let page = 1;

    while (true) {
        const url = `https://judge.me/api/v1/reviews?` +
            `shop_domain=${shopDomain}&page=${page}&per_page=100`;

        const response = await axios.get(url, {
            headers: { 'Authorization': `Bearer ${apiToken}` }
        });

        const data = response.data.reviews;
        if (!data || data.length === 0) break;

        reviews.push(...data.map(r => ({
            id: r.id,
            rating: r.rating,
            title: r.title,
            body: r.body,
            reviewer: r.reviewer.name,
            createdAt: r.created_at,
            productId: r.product_external_id,
            verified: r.verified_buyer
        })));

        page++;
        await new Promise(r => setTimeout(r, 500));
    }

    return reviews;
}

Extracting Reviews from HTML (Generic Approach):

from bs4 import BeautifulSoup
import requests
import re

def extract_reviews_from_page(product_url):
    # Generic review extraction from product page HTML
    response = requests.get(product_url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    })
    soup = BeautifulSoup(response.text, 'html.parser')

    reviews = []

    # Look for common review containers
    review_selectors = [
        '.spr-review',           # Shopify Product Reviews
        '.jdgm-rev',             # Judge.me
        '.loox-review',          # Loox
        '.yotpo-review',         # Yotpo
        '[data-review-id]',      # Generic data attribute
    ]

    for selector in review_selectors:
        review_elements = soup.select(selector)
        if review_elements:
            for elem in review_elements:
                review = {}

                # Extract rating (crude: grabs the first digit from the star
                # element's inline style; markup varies by review app)
                stars = elem.select_one('[class*="star"]')
                if stars:
                    rating_match = re.search(r'(\d)', str(stars.get('style', '')))
                    review['rating'] = rating_match.group(1) if rating_match else None

                # Extract review text
                body = elem.select_one('[class*="body"], [class*="content"], [class*="text"]')
                review['body'] = body.get_text(strip=True) if body else ''

                # Extract author
                author = elem.select_one('[class*="author"], [class*="name"]')
                review['author'] = author.get_text(strip=True) if author else 'Anonymous'

                if review.get('body'):
                    reviews.append(review)

            break

    return reviews
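Once reviews are extracted — from any of the apps above — a small aggregation step turns them into stats you can compare across stores. This sketch assumes the dict shape produced by extract_reviews_from_page (rating as a digit string, body, author):

```python
def summarize_reviews(reviews):
    """Reduce a list of review dicts to headline stats for comparison."""
    ratings = [int(r["rating"]) for r in reviews if r.get("rating")]
    return {
        "count": len(reviews),                      # all reviews found
        "rated": len(ratings),                      # reviews with a parseable rating
        "average_rating": round(sum(ratings) / len(ratings), 2) if ratings else None,
    }
```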

Price Monitoring and Inventory Tracking

One of the highest-value applications of Shopify scraping is automated price monitoring.

Building a Price Monitor

const axios = require('axios');
const fs = require('fs');

class ShopifyPriceMonitor {
    constructor(stores) {
        this.stores = stores;
        this.historyFile = 'price_history.json';
        this.history = this.loadHistory();
    }

    loadHistory() {
        try {
            return JSON.parse(fs.readFileSync(this.historyFile, 'utf8'));
        } catch {
            return {};
        }
    }

    saveHistory() {
        fs.writeFileSync(this.historyFile, JSON.stringify(this.history, null, 2));
    }

    async checkPrices() {
        const changes = [];
        const timestamp = new Date().toISOString();

        for (const store of this.stores) {
            try {
                const response = await axios.get(
                    `${store.url}/products.json?limit=250`
                );

                for (const product of response.data.products) {
                    for (const variant of product.variants) {
                        const key = `${store.name}:${variant.id}`;
                        const currentPrice = parseFloat(variant.price);
                        const previousEntry = this.history[key];

                        if (previousEntry) {
                            const previousPrice = previousEntry.price;
                            if (currentPrice !== previousPrice) {
                                const change = {
                                    store: store.name,
                                    product: product.title,
                                    variant: variant.title,
                                    oldPrice: previousPrice,
                                    newPrice: currentPrice,
                                    change: ((currentPrice - previousPrice) / previousPrice * 100).toFixed(2),
                                    timestamp
                                };
                                changes.push(change);
                                console.log(
                                    `PRICE CHANGE: ${product.title} ` +
                                    `$${previousPrice} -> $${currentPrice} ` +
                                    `(${change.change}%)`
                                );
                            }
                        }

                        this.history[key] = {
                            price: currentPrice,
                            available: variant.available,
                            product: product.title,
                            variant: variant.title,
                            lastChecked: timestamp
                        };
                    }
                }

                await new Promise(r => setTimeout(r, 2000));
            } catch (err) {
                console.error(`Error checking ${store.name}: ${err.message}`);
            }
        }

        this.saveHistory();
        return changes;
    }
}

// Usage
const monitor = new ShopifyPriceMonitor([
    { name: 'Store A', url: 'https://store-a.myshopify.com' },
    { name: 'Store B', url: 'https://store-b.myshopify.com' }
]);

monitor.checkPrices().then(changes => {
    console.log(`Found ${changes.length} price changes`);
});

Inventory Level Detection

import requests

def check_inventory_status(store_url):
    # Check inventory availability (first 250 products only; paginate for larger stores)
    response = requests.get(f"{store_url}/products.json?limit=250")
    products = response.json().get('products', [])

    inventory_report = []

    for product in products:
        for variant in product.get('variants', []):
            status = {
                'product': product['title'],
                'variant': variant['title'],
                'price': variant['price'],
                'available': variant.get('available', False),
                'inventory_policy': variant.get('inventory_policy', 'deny'),
                # inventory_quantity is rarely exposed on public storefronts
                'inventory_quantity': variant.get('inventory_quantity', 'N/A')
            }
            inventory_report.append(status)

    out_of_stock = [item for item in inventory_report if not item['available']]
    low_stock = [item for item in inventory_report
                 if item['available']
                 and isinstance(item['inventory_quantity'], int)
                 and item['inventory_quantity'] < 10]

    return {
        'total_variants': len(inventory_report),
        'out_of_stock': len(out_of_stock),
        'low_stock': len(low_stock),
        'details': inventory_report
    }
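To turn these reports into back-in-stock or sell-out alerts, diff two snapshots taken on successive runs. A minimal sketch, keyed by variant ID with a plain {variant_id: available} mapping:

```python
def availability_changes(previous, current):
    """Compare two {variant_id: available} snapshots and report flips."""
    changes = []
    for vid, now_available in current.items():
        before = previous.get(vid)
        if before is not None and before != now_available:
            changes.append({
                "variant_id": vid,
                "event": "restocked" if now_available else "sold_out",
            })
    return changes  # variants with no history are silently skipped
```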

Using Apify Store for Shopify Scraping

While building your own scraper gives you full control, using pre-built actors on the Apify Store can save significant development time.

Apify offers several ready-to-use Shopify scrapers that handle all the edge cases — rate limiting, pagination, proxy rotation, and data formatting. These actors run in the cloud, so you don't need to manage infrastructure.

Key Benefits of Using Apify Actors

  • Proxy management: Automatic rotation to avoid IP blocks
  • Scheduled runs: Set up daily, hourly, or custom monitoring schedules
  • Cloud execution: No local infrastructure needed
  • Data export: Direct export to CSV, JSON, Excel, or webhooks
  • Monitoring dashboards: Track run success rates and data quality

Example: Using Apify SDK

const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
    token: 'YOUR_APIFY_TOKEN',
});

async function runShopifyScraper(storeUrl) {
    // Actor ID and input schema are illustrative; check the actor's README
    const run = await client.actor('apify/shopify-scraper').call({
        startUrls: [{ url: storeUrl }],
        maxProducts: 1000,
        includeReviews: true,
        proxy: {
            useApifyProxy: true,
            apifyProxyGroups: ['RESIDENTIAL']
        }
    });

    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(`Scraped ${items.length} products`);
    return items;
}

Best Practices and Rate Limiting

When scraping Shopify stores, a few practices keep your scraper reliable and keep you on good terms with store owners:

Respectful Scraping Guidelines

  1. Add delays between requests: At minimum 1 second between requests. Shopify's rate limit is typically 2 requests per second for unauthenticated access.

  2. Use meaningful User-Agent strings: Identify your bot clearly rather than pretending to be a browser.

  3. Handle 429 responses gracefully: Implement exponential backoff when rate limited.

  4. Cache responses: Don't re-scrape data that hasn't changed. Check updated_at timestamps.

  5. Respect robots.txt: While /products.json is generally accessible, check the store's robots.txt for any specific restrictions.
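The robots.txt check can be automated with the standard library's robotparser. This sketch parses rules you've already downloaded (in practice, fetch the store's /robots.txt first); the rules and URLs in the usage are hypothetical.

```python
from urllib import robotparser

def allowed_to_fetch(robots_txt, url, agent="ProductResearch"):
    """Check a URL against already-downloaded robots.txt rules."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```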

Error Handling Pattern

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_resilient_session():
    session = requests.Session()
    retries = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"]
    )
    adapter = HTTPAdapter(max_retries=retries)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

Data Storage and Analysis

Once you've collected product data, proper storage enables analysis:

import sqlite3
import json

def store_products(products, db_path="shopify_data.db"):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS products (
            id INTEGER PRIMARY KEY,
            title TEXT,
            vendor TEXT,
            product_type TEXT,
            handle TEXT,
            tags TEXT,
            created_at TEXT,
            updated_at TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    ''')

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS variants (
            id INTEGER PRIMARY KEY,
            product_id INTEGER,
            title TEXT,
            price REAL,
            compare_at_price REAL,
            sku TEXT,
            available BOOLEAN,
            FOREIGN KEY (product_id) REFERENCES products(id)
        )
    ''')

    for product in products:
        cursor.execute('''
            INSERT OR REPLACE INTO products
            (id, title, vendor, product_type, handle, tags)
            VALUES (?, ?, ?, ?, ?, ?)
        ''', (
            product['id'], product['title'], product['vendor'],
            product['product_type'], product['handle'],
            json.dumps(product.get('tags', []))
        ))

        for variant in product.get('variants', []):
            cursor.execute('''
                INSERT OR REPLACE INTO variants
                (id, product_id, title, price, compare_at_price, sku, available)
                VALUES (?, ?, ?, ?, ?, ?, ?)
            ''', (
                variant['id'], product['id'], variant['title'],
                float(variant['price']),
                float(variant['compare_at_price']) if variant.get('compare_at_price') else None,
                variant.get('sku'), variant.get('available', False)
            ))

    conn.commit()
    conn.close()
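With the data in SQLite, analysis is a query away. For example, surfacing variants currently discounted below their compare-at price (table and column names match the schema above):

```python
import sqlite3

def discounted_variants(db_path="shopify_data.db"):
    """Find variants selling below their compare-at price, biggest discount first."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute('''
        SELECT p.title, v.title, v.price, v.compare_at_price
        FROM variants v JOIN products p ON p.id = v.product_id
        WHERE v.compare_at_price IS NOT NULL
          AND v.price < v.compare_at_price
        ORDER BY (v.compare_at_price - v.price) DESC
    ''').fetchall()
    conn.close()
    return rows
```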

Conclusion

Shopify's consistent architecture makes it one of the most accessible e-commerce platforms for web scraping. The built-in JSON API eliminates much of the complexity you'd face with other platforms, and the predictable URL structure means your scrapers are reliable and maintainable.

Key takeaways:

  • Start with /products.json — it's the fastest path to product data
  • Handle pagination properly — use cursor-based pagination for large catalogs
  • Respect rate limits — 1-2 second delays between requests prevent blocks
  • Monitor prices over time — single snapshots are less valuable than trend data
  • Use cloud platforms like Apify for production workloads that need proxy rotation and scheduling
  • Store data in structured formats for analysis and comparison

Whether you're building a price comparison tool, tracking competitor inventory, or analyzing market trends, the techniques in this guide give you a solid foundation for extracting value from Shopify's vast ecosystem of online stores.
