
agenthustler


Shopify Store Scraping: Extract Product Data, Reviews and Inventory

Web scraping Shopify stores has become one of the most common data extraction tasks in e-commerce intelligence. Whether you're monitoring competitor prices, building a product comparison engine, or analyzing market trends, Shopify's predictable structure makes it an excellent target for automated data collection.

In this comprehensive guide, I'll walk you through everything you need to know about extracting product data, reviews, and inventory information from Shopify-powered stores.

Understanding Shopify's Storefront Architecture

Shopify powers over 4.6 million websites globally. What makes it particularly interesting for web scraping is its consistent, well-structured data layer. Every Shopify store follows the same underlying architecture, which means a scraper built for one store can often work across thousands of others with minimal modifications.

The JSON Product API

Every Shopify store exposes a built-in JSON API that doesn't require authentication for public product data. This is the single most important thing to understand about Shopify scraping:

https://store-domain.com/products.json
https://store-domain.com/products.json?page=2&limit=250
https://store-domain.com/collections/all/products.json

This endpoint returns structured JSON containing product titles, descriptions, variants, prices, images, tags, and more. No API key required for public storefronts.
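Before writing a full scraper, it helps to know the payload shape. The snippet below mimics a trimmed `/products.json` response (field names are real; the values are made up) and highlights a common gotcha: prices come back as strings, so convert them before doing arithmetic.

```python
# Illustrative slice of a /products.json response (values are made up)
sample = {
    "products": [
        {
            "id": 123456789,
            "title": "Example Tee",
            "handle": "example-tee",
            "vendor": "Example Co",
            "tags": ["cotton", "summer"],
            "variants": [
                {"id": 111, "title": "Small", "price": "19.99", "available": True},
                {"id": 112, "title": "Large", "price": "24.99", "available": False},
            ],
            "images": [{"src": "https://cdn.shopify.com/example.jpg"}],
        }
    ]
}

def in_stock_prices(payload):
    """Collect numeric prices for available variants; prices arrive as strings."""
    return [
        float(v["price"])
        for p in payload["products"]
        for v in p["variants"]
        if v["available"]
    ]
```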

Product Page Structure

Individual product pages follow this pattern:

https://store-domain.com/products/product-handle.json

Each product page also embeds structured data in JSON-LD format within the HTML. Search engines rely on it, and it gives you another reliable extraction point.
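Pulling that JSON-LD out of a page you've already fetched can be done with the standard library alone. This is a rough sketch using a regex over the raw HTML; a real HTML parser is more robust against edge cases like nested script tags.

```python
import json
import re

# Matches <script type="application/ld+json"> blocks in page HTML
JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

def extract_json_ld(html):
    """Return every parseable JSON-LD block embedded in the HTML."""
    blocks = []
    for raw in JSON_LD_RE.findall(html):
        try:
            blocks.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip malformed or templated blocks
    return blocks
```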

Extracting Product Catalog Data

Let's start with the most fundamental task: pulling the complete product catalog from a Shopify store.

JavaScript Approach (Node.js)

const axios = require('axios');

async function scrapeShopifyProducts(storeUrl) {
    const products = [];
    let page = 1;
    const limit = 250; // Maximum allowed by Shopify

    while (true) {
        try {
            const url = `${storeUrl}/products.json?page=${page}&limit=${limit}`;
            const response = await axios.get(url, {
                headers: {
                    'User-Agent': 'Mozilla/5.0 (compatible; ProductResearch/1.0)'
                }
            });

            const pageProducts = response.data.products;

            if (!pageProducts || pageProducts.length === 0) {
                break;
            }

            for (const product of pageProducts) {
                products.push({
                    id: product.id,
                    title: product.title,
                    vendor: product.vendor,
                    productType: product.product_type,
                    handle: product.handle,
                    createdAt: product.created_at,
                    updatedAt: product.updated_at,
                    tags: product.tags,
                    variants: product.variants.map(v => ({
                        id: v.id,
                        title: v.title,
                        price: v.price,
                        compareAtPrice: v.compare_at_price,
                        sku: v.sku,
                        available: v.available,
                        inventoryQuantity: v.inventory_quantity // often absent from public endpoints
                    })),
                    images: product.images.map(img => img.src)
                });
            }

            console.log(`Page ${page}: Found ${pageProducts.length} products`);
            page++;

            // Respectful delay between requests
            await new Promise(resolve => setTimeout(resolve, 1000));
        } catch (error) {
            if (error.response && error.response.status === 429) {
                console.log('Rate limited. Waiting 30 seconds...');
                await new Promise(resolve => setTimeout(resolve, 30000));
                continue;
            }
            console.error(`Stopping at page ${page}: ${error.message}`);
            break;
        }
    }

    console.log(`Total products scraped: ${products.length}`);
    return products;
}

// Usage
scrapeShopifyProducts('https://example-store.myshopify.com')
    .then(products => {
        const fs = require('fs');
        fs.writeFileSync('products.json', JSON.stringify(products, null, 2));
    });

Python Approach

import requests
import json
import time

def scrape_shopify_products(store_url):
    products = []
    page = 1
    limit = 250

    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (compatible; ProductResearch/1.0)'
    })

    while True:
        url = f"{store_url}/products.json?page={page}&limit={limit}"

        try:
            response = session.get(url, timeout=30)

            if response.status_code == 429:
                print("Rate limited. Waiting 30 seconds...")
                time.sleep(30)
                continue

            response.raise_for_status()
            data = response.json()

            page_products = data.get('products', [])
            if not page_products:
                break

            for product in page_products:
                products.append({
                    'id': product['id'],
                    'title': product['title'],
                    'vendor': product['vendor'],
                    'product_type': product['product_type'],
                    'handle': product['handle'],
                    'tags': product.get('tags', []),  # tags arrive as a list of strings
                    'variants': [{
                        'id': v['id'],
                        'title': v['title'],
                        'price': v['price'],
                        'compare_at_price': v.get('compare_at_price'),
                        'sku': v.get('sku'),
                        'available': v.get('available', False),
                    } for v in product.get('variants', [])],
                    'images': [img['src'] for img in product.get('images', [])]
                })

            print(f"Page {page}: {len(page_products)} products")
            page += 1
            time.sleep(1)

        except requests.exceptions.RequestException as e:
            print(f"Error on page {page}: {e}")
            break

    print(f"Total: {len(products)} products")
    return products

# Usage
products = scrape_shopify_products("https://example-store.myshopify.com")
with open("products.json", "w") as f:
    json.dump(products, f, indent=2)

Handling Pagination and Large Catalogs

Shopify's /products.json endpoint has a limit of 250 products per page. For stores with thousands of products, you need proper pagination handling.

Cursor-Based Pagination

Newer Shopify API versions replace page numbers with cursor-based pagination: a page_info token delivered through the Link response header. If page-based requests stop returning results, check for this header and follow the cursor instead:

const axios = require('axios');

async function scrapeWithCursor(storeUrl) {
    let cursor = null;
    const allProducts = [];

    while (true) {
        let url = `${storeUrl}/products.json?limit=250`;
        if (cursor) {
            url += `&page_info=${cursor}`;
        }

        const response = await axios.get(url);
        const products = response.data.products;

        if (products.length === 0) break;

        allProducts.push(...products);

        // Check for Link header with next page cursor
        const linkHeader = response.headers['link'];
        if (linkHeader && linkHeader.includes('rel="next"')) {
            const match = linkHeader.match(/page_info=([^>&]*)/);
            cursor = match ? match[1] : null;
        } else {
            break;
        }

        await new Promise(r => setTimeout(r, 1000));
    }

    return allProducts;
}

Collection-Based Approach

For very large stores, scraping collection by collection is often more reliable, and it captures which collections each product belongs to — something the global /products.json listing doesn't tell you:

import requests
import time

def get_collections(store_url):
    # Fetch collections (collections.json is paginated too; one page covers most stores)
    response = requests.get(f"{store_url}/collections.json?limit=250")
    return response.json().get('collections', [])

def scrape_by_collection(store_url):
    # Scrape products organized by collection
    collections = get_collections(store_url)
    all_products = {}

    for collection in collections:
        handle = collection['handle']
        page = 1

        while True:
            url = f"{store_url}/collections/{handle}/products.json?page={page}&limit=250"
            response = requests.get(url)
            products = response.json().get('products', [])

            if not products:
                break

            for product in products:
                product_id = product['id']
                if product_id not in all_products:
                    all_products[product_id] = product
                    all_products[product_id]['collections'] = []
                all_products[product_id]['collections'].append(handle)

            page += 1
            time.sleep(1)

    return list(all_products.values())

Extracting Product Reviews

Reviews are crucial for competitive analysis and sentiment monitoring. Shopify stores typically use third-party review apps, each with their own data format.

Common Review Apps and Their Endpoints

Judge.me Reviews (requires the shop's Judge.me API token; check Judge.me's API documentation for the exact parameters your plan supports):

async function scrapeJudgeMeReviews(shopDomain, apiToken) {
    const reviews = [];
    let page = 1;

    while (true) {
        const url = `https://judge.me/api/v1/reviews?` +
            `shop_domain=${shopDomain}&page=${page}&per_page=100`;

        const response = await axios.get(url, {
            headers: { 'Authorization': `Bearer ${apiToken}` }
        });

        const data = response.data.reviews;
        if (!data || data.length === 0) break;

        reviews.push(...data.map(r => ({
            id: r.id,
            rating: r.rating,
            title: r.title,
            body: r.body,
            reviewer: r.reviewer.name,
            createdAt: r.created_at,
            productId: r.product_external_id,
            verified: r.verified_buyer
        })));

        page++;
        await new Promise(r => setTimeout(r, 500));
    }

    return reviews;
}

Extracting Reviews from HTML (Generic Approach):

from bs4 import BeautifulSoup
import requests
import re

def extract_reviews_from_page(product_url):
    # Generic review extraction from product page HTML
    response = requests.get(product_url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    })
    soup = BeautifulSoup(response.text, 'html.parser')

    reviews = []

    # Look for common review containers
    review_selectors = [
        '.spr-review',           # Shopify Product Reviews
        '.jdgm-rev',             # Judge.me
        '.loox-review',          # Loox
        '.yotpo-review',         # Yotpo
        '[data-review-id]',      # Generic data attribute
    ]

    for selector in review_selectors:
        review_elements = soup.select(selector)
        if review_elements:
            for elem in review_elements:
                review = {}

                # Extract rating (crude: grabs the first digit from the star
                # element's inline style; markup varies by review app)
                stars = elem.select_one('[class*="star"]')
                if stars:
                    rating_match = re.search(r'(\d)', str(stars.get('style', '')))
                    review['rating'] = rating_match.group(1) if rating_match else None

                # Extract review text
                body = elem.select_one('[class*="body"], [class*="content"], [class*="text"]')
                review['body'] = body.get_text(strip=True) if body else ''

                # Extract author
                author = elem.select_one('[class*="author"], [class*="name"]')
                review['author'] = author.get_text(strip=True) if author else 'Anonymous'

                if review.get('body'):
                    reviews.append(review)

            break

    return reviews
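Once reviews are extracted — from any of the apps above — a small aggregation step turns them into stats you can compare across stores. This sketch assumes the dict shape produced by extract_reviews_from_page (rating as a digit string, body, author):

```python
def summarize_reviews(reviews):
    """Reduce a list of review dicts to headline stats for comparison."""
    ratings = [int(r["rating"]) for r in reviews if r.get("rating")]
    return {
        "count": len(reviews),                      # all reviews found
        "rated": len(ratings),                      # reviews with a parseable rating
        "average_rating": round(sum(ratings) / len(ratings), 2) if ratings else None,
    }
```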

Price Monitoring and Inventory Tracking

One of the highest-value applications of Shopify scraping is automated price monitoring.

Building a Price Monitor

const axios = require('axios');
const fs = require('fs');

class ShopifyPriceMonitor {
    constructor(stores) {
        this.stores = stores;
        this.historyFile = 'price_history.json';
        this.history = this.loadHistory();
    }

    loadHistory() {
        try {
            return JSON.parse(fs.readFileSync(this.historyFile, 'utf8'));
        } catch {
            return {};
        }
    }

    saveHistory() {
        fs.writeFileSync(this.historyFile, JSON.stringify(this.history, null, 2));
    }

    async checkPrices() {
        const changes = [];
        const timestamp = new Date().toISOString();

        for (const store of this.stores) {
            try {
                const response = await axios.get(
                    `${store.url}/products.json?limit=250`
                );

                for (const product of response.data.products) {
                    for (const variant of product.variants) {
                        const key = `${store.name}:${variant.id}`;
                        const currentPrice = parseFloat(variant.price);
                        const previousEntry = this.history[key];

                        if (previousEntry) {
                            const previousPrice = previousEntry.price;
                            if (currentPrice !== previousPrice) {
                                const change = {
                                    store: store.name,
                                    product: product.title,
                                    variant: variant.title,
                                    oldPrice: previousPrice,
                                    newPrice: currentPrice,
                                    change: ((currentPrice - previousPrice) / previousPrice * 100).toFixed(2),
                                    timestamp
                                };
                                changes.push(change);
                                console.log(
                                    `PRICE CHANGE: ${product.title} ` +
                                    `$${previousPrice} -> $${currentPrice} ` +
                                    `(${change.change}%)`
                                );
                            }
                        }

                        this.history[key] = {
                            price: currentPrice,
                            available: variant.available,
                            product: product.title,
                            variant: variant.title,
                            lastChecked: timestamp
                        };
                    }
                }

                await new Promise(r => setTimeout(r, 2000));
            } catch (err) {
                console.error(`Error checking ${store.name}: ${err.message}`);
            }
        }

        this.saveHistory();
        return changes;
    }
}

// Usage
const monitor = new ShopifyPriceMonitor([
    { name: 'Store A', url: 'https://store-a.myshopify.com' },
    { name: 'Store B', url: 'https://store-b.myshopify.com' }
]);

monitor.checkPrices().then(changes => {
    console.log(`Found ${changes.length} price changes`);
});

Inventory Level Detection

import requests

def check_inventory_status(store_url):
    # Check inventory availability (first 250 products only; paginate for larger stores)
    response = requests.get(f"{store_url}/products.json?limit=250")
    products = response.json().get('products', [])

    inventory_report = []

    for product in products:
        for variant in product.get('variants', []):
            status = {
                'product': product['title'],
                'variant': variant['title'],
                'price': variant['price'],
                'available': variant.get('available', False),
                'inventory_policy': variant.get('inventory_policy', 'deny'),
                # inventory_quantity is rarely exposed on public storefronts
                'inventory_quantity': variant.get('inventory_quantity', 'N/A')
            }
            inventory_report.append(status)

    out_of_stock = [item for item in inventory_report if not item['available']]
    low_stock = [item for item in inventory_report
                 if item['available']
                 and isinstance(item['inventory_quantity'], int)
                 and item['inventory_quantity'] < 10]

    return {
        'total_variants': len(inventory_report),
        'out_of_stock': len(out_of_stock),
        'low_stock': len(low_stock),
        'details': inventory_report
    }
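To turn these reports into back-in-stock or sell-out alerts, diff two snapshots taken on successive runs. A minimal sketch, keyed by variant ID with a plain {variant_id: available} mapping:

```python
def availability_changes(previous, current):
    """Compare two {variant_id: available} snapshots and report flips."""
    changes = []
    for vid, now_available in current.items():
        before = previous.get(vid)
        if before is not None and before != now_available:
            changes.append({
                "variant_id": vid,
                "event": "restocked" if now_available else "sold_out",
            })
    return changes  # variants with no history are silently skipped
```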

Using Apify Store for Shopify Scraping

While building your own scraper gives you full control, using pre-built actors on the Apify Store can save significant development time.

Apify offers several ready-to-use Shopify scrapers that handle all the edge cases — rate limiting, pagination, proxy rotation, and data formatting. These actors run in the cloud, so you don't need to manage infrastructure.

Key Benefits of Using Apify Actors

  • Proxy management: Automatic rotation to avoid IP blocks
  • Scheduled runs: Set up daily, hourly, or custom monitoring schedules
  • Cloud execution: No local infrastructure needed
  • Data export: Direct export to CSV, JSON, Excel, or webhooks
  • Monitoring dashboards: Track run success rates and data quality

Example: Using Apify SDK

const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
    token: 'YOUR_APIFY_TOKEN',
});

async function runShopifyScraper(storeUrl) {
    // Actor ID and input schema are illustrative; check the actor's README
    const run = await client.actor('apify/shopify-scraper').call({
        startUrls: [{ url: storeUrl }],
        maxProducts: 1000,
        includeReviews: true,
        proxy: {
            useApifyProxy: true,
            apifyProxyGroups: ['RESIDENTIAL']
        }
    });

    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(`Scraped ${items.length} products`);
    return items;
}

Best Practices and Rate Limiting

When scraping Shopify stores, a few practices keep your scraper reliable and keep you on good terms with store owners:

Respectful Scraping Guidelines

  1. Add delays between requests: At minimum 1 second between requests. Shopify's rate limit is typically 2 requests per second for unauthenticated access.

  2. Use meaningful User-Agent strings: Identify your bot clearly rather than pretending to be a browser.

  3. Handle 429 responses gracefully: Implement exponential backoff when rate limited.

  4. Cache responses: Don't re-scrape data that hasn't changed. Check updated_at timestamps.

  5. Respect robots.txt: While /products.json is generally accessible, check the store's robots.txt for any specific restrictions.
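The robots.txt check can be automated with the standard library's robotparser. This sketch parses rules you've already downloaded (in practice, fetch the store's /robots.txt first); the rules and URLs in the usage are hypothetical.

```python
from urllib import robotparser

def allowed_to_fetch(robots_txt, url, agent="ProductResearch"):
    """Check a URL against already-downloaded robots.txt rules."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```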

Error Handling Pattern

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_resilient_session():
    session = requests.Session()
    retries = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"]
    )
    adapter = HTTPAdapter(max_retries=retries)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

Data Storage and Analysis

Once you've collected product data, proper storage enables analysis:

import sqlite3
import json

def store_products(products, db_path="shopify_data.db"):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS products (
            id INTEGER PRIMARY KEY,
            title TEXT,
            vendor TEXT,
            product_type TEXT,
            handle TEXT,
            tags TEXT,
            created_at TEXT,
            updated_at TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    ''')

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS variants (
            id INTEGER PRIMARY KEY,
            product_id INTEGER,
            title TEXT,
            price REAL,
            compare_at_price REAL,
            sku TEXT,
            available BOOLEAN,
            FOREIGN KEY (product_id) REFERENCES products(id)
        )
    ''')

    for product in products:
        cursor.execute('''
            INSERT OR REPLACE INTO products
            (id, title, vendor, product_type, handle, tags)
            VALUES (?, ?, ?, ?, ?, ?)
        ''', (
            product['id'], product['title'], product['vendor'],
            product['product_type'], product['handle'],
            json.dumps(product.get('tags', []))
        ))

        for variant in product.get('variants', []):
            cursor.execute('''
                INSERT OR REPLACE INTO variants
                (id, product_id, title, price, compare_at_price, sku, available)
                VALUES (?, ?, ?, ?, ?, ?, ?)
            ''', (
                variant['id'], product['id'], variant['title'],
                float(variant['price']),
                float(variant['compare_at_price']) if variant.get('compare_at_price') else None,
                variant.get('sku'), variant.get('available', False)
            ))

    conn.commit()
    conn.close()
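With the data in SQLite, analysis is a query away. For example, surfacing variants currently discounted below their compare-at price (table and column names match the schema above):

```python
import sqlite3

def discounted_variants(db_path="shopify_data.db"):
    """Find variants selling below their compare-at price, biggest discount first."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute('''
        SELECT p.title, v.title, v.price, v.compare_at_price
        FROM variants v JOIN products p ON p.id = v.product_id
        WHERE v.compare_at_price IS NOT NULL
          AND v.price < v.compare_at_price
        ORDER BY (v.compare_at_price - v.price) DESC
    ''').fetchall()
    conn.close()
    return rows
```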

Conclusion

Shopify's consistent architecture makes it one of the most accessible e-commerce platforms for web scraping. The built-in JSON API eliminates much of the complexity you'd face with other platforms, and the predictable URL structure means your scrapers are reliable and maintainable.

Key takeaways:

  • Start with /products.json — it's the fastest path to product data
  • Handle pagination properly — use cursor-based pagination for large catalogs
  • Respect rate limits — 1-2 second delays between requests prevent blocks
  • Monitor prices over time — single snapshots are less valuable than trend data
  • Use cloud platforms like Apify for production workloads that need proxy rotation and scheduling
  • Store data in structured formats for analysis and comparison

Whether you're building a price comparison tool, tracking competitor inventory, or analyzing market trends, the techniques in this guide give you a solid foundation for extracting value from Shopify's vast ecosystem of online stores.
