agenthustler
How to Scrape Shopify Stores: Products, Reviews, and Store Data

Web scraping Shopify stores is one of the most common data extraction tasks in e-commerce intelligence. Whether you're doing competitive analysis, building price comparison tools, or aggregating product catalogs, understanding how to extract data from Shopify-powered stores is an essential skill.

In this comprehensive guide, we'll cover Shopify's architecture, the different approaches to extracting product data, reviews, and store metadata, and how to scale your scraping operations using cloud infrastructure like Apify.


Understanding Shopify's Architecture

Shopify powers over 4.8 million online stores worldwide, making it one of the most popular e-commerce platforms on the planet. Every Shopify store follows a predictable URL structure and data format, which makes scraping significantly more straightforward than on custom-built platforms.

URL Structure

Shopify stores follow consistent URL patterns that you can rely on:

  • Products listing: https://store.com/collections/all
  • Individual product: https://store.com/products/product-handle
  • Product JSON: https://store.com/products/product-handle.json
  • Collections: https://store.com/collections/collection-handle
  • All products JSON: https://store.com/products.json
  • Collection JSON: https://store.com/collections/all/products.json
  • Search: https://store.com/search?q=keyword&type=product
  • Sitemap: https://store.com/sitemap.xml

This predictability is a huge advantage. Unlike custom-built e-commerce sites where every store has a different layout and data structure, you know exactly where to find data on any Shopify store from day one.
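The sitemap endpoint in particular is a convenient way to enumerate every product URL without paginating collections: Shopify's /sitemap.xml is an index that points at sub-sitemaps, conventionally named like sitemap_products_1.xml. A minimal sketch (the helper names are mine, and the regex-based XML parsing is a shortcut; prefer a real XML parser in production):

```javascript
// Pull every <loc>...</loc> value out of sitemap XML.
function extractLocs(xml) {
    return [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map(m => m[1]);
}

// Discover all product URLs for a store via its sitemap index.
async function getProductUrls(storeUrl) {
    const indexXml = await fetch(`${storeUrl}/sitemap.xml`).then(r => r.text());
    // The index lists sub-sitemaps; product ones contain "sitemap_products"
    const productSitemaps = extractLocs(indexXml)
        .filter(u => u.includes('sitemap_products'));

    const urls = [];
    for (const sitemapUrl of productSitemaps) {
        const xml = await fetch(sitemapUrl).then(r => r.text());
        urls.push(...extractLocs(xml).filter(u => u.includes('/products/')));
    }
    return urls;
}
```

This gives you a complete product URL list up front, which you can then feed into any of the extraction methods below.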

The Built-In JSON Endpoints

One of Shopify's best-kept secrets for data extraction is its built-in JSON API. Most Shopify stores expose product data through JSON endpoints without any authentication required:

// Fetch products from any Shopify store
const response = await fetch('https://example-store.com/products.json?limit=250');
const data = await response.json();

console.log(data.products.length); // Up to 250 products per page
console.log(data.products[0].title);
console.log(data.products[0].variants[0].price);

The /products.json endpoint supports pagination, allowing you to iterate through all products in a store:

// Paginate through all products
async function getAllProducts(storeUrl) {
    let page = 1;
    let allProducts = [];

    while (true) {
        const url = `${storeUrl}/products.json?limit=250&page=${page}`;
        const response = await fetch(url);
        const data = await response.json();

        if (data.products.length === 0) break;

        allProducts = allProducts.concat(data.products);
        page++;

        // Be respectful - add delay between requests
        await new Promise(resolve => setTimeout(resolve, 1500));
    }

    return allProducts;
}

const products = await getAllProducts('https://example-store.com');
console.log(`Found ${products.length} total products`);

Important caveat: As of late 2024, Shopify started rate-limiting and restricting access to some JSON endpoints on certain stores. Larger stores or those with custom configurations may block unauthenticated JSON access. Always test if the endpoint is accessible before building your entire scraper around it.
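Here's one way to run that test up front (a sketch; the helper name is mine):

```javascript
// Check whether a store's /products.json endpoint is actually open.
// Returns true only for an OK response that parses as JSON
// with a products array.
async function isJsonEndpointOpen(storeUrl) {
    try {
        const response = await fetch(`${storeUrl}/products.json?limit=1`);
        if (!response.ok) return false; // e.g. 401/403/430 = restricted
        const data = await response.json();
        return Array.isArray(data.products);
    } catch {
        return false; // network error, or an HTML block page instead of JSON
    }
}
```

If this returns false, plan on one of the fallback methods described below instead.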


Three Methods for Extracting Product Data

Method 1: JSON API (Preferred Approach)

The JSON API returns richly structured product data with everything you could need:

{
  "product": {
    "id": 123456789,
    "title": "Premium Cotton T-Shirt",
    "body_html": "<p>Made from 100% organic cotton...</p>",
    "vendor": "BrandName",
    "product_type": "T-Shirts",
    "created_at": "2024-01-15T10:30:00-05:00",
    "handle": "premium-cotton-t-shirt",
    "tags": ["cotton", "organic", "bestseller"],
    "variants": [
      {
        "id": 987654321,
        "title": "Small / Blue",
        "price": "29.99",
        "compare_at_price": "39.99",
        "sku": "PCT-S-BLU",
        "inventory_quantity": 45,
        "weight": 200,
        "weight_unit": "g"
      }
    ],
    "images": [
      {
        "src": "https://cdn.shopify.com/s/files/1/image.jpg",
        "alt": "Premium Cotton T-Shirt - Blue"
      }
    ]
  }
}

This gives you pricing, variants, images, tags, vendor information, and more, all cleanly structured and ready to process. Note that some fields shown above, such as inventory_quantity, come from the authenticated Admin API and are usually absent from the public storefront endpoint. For competitive analysis, the compare_at_price field is especially valuable because it reveals discount strategies.
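As a quick illustration of that analysis, here's a small helper (my own sketch) that summarizes discounting across the products array returned by /products.json:

```javascript
// Summarize discounting across products from /products.json:
// what share of variants are on sale, and the average markdown depth.
function discountSummary(products) {
    const variants = products.flatMap(p => p.variants);
    const onSale = variants.filter(v =>
        v.compare_at_price && parseFloat(v.compare_at_price) > parseFloat(v.price)
    );
    const avgDiscount = onSale.length === 0 ? 0 :
        onSale.reduce((sum, v) =>
            sum + (1 - parseFloat(v.price) / parseFloat(v.compare_at_price)), 0
        ) / onSale.length;
    return {
        totalVariants: variants.length,
        onSaleCount: onSale.length,
        avgDiscountPct: Math.round(avgDiscount * 100),
    };
}
```

Run daily, a summary like this lets you track how aggressively a competitor leans on markdowns over time.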

Method 2: DOM Scraping with Cheerio

When JSON endpoints are restricted, you fall back to parsing the HTML directly. Shopify themes vary widely, but the Crawlee framework with Cheerio makes this manageable:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Product title - most themes use h1
        const title = $('h1.product-title, h1.product__title, h1[itemprop="name"]')
            .first().text().trim();

        // Price extraction - look for common patterns
        const price = $('[data-product-price], .product-price .money, .price-item--regular')
            .first().text().trim();
        const comparePrice = $('[data-compare-price], .price-item--sale, .product-price__compare')
            .first().text().trim();

        // Description
        const description = $('.product-description, .product__description')
            .text().trim();

        // Images - Shopify lazy-loads with data-src
        const images = [];
        $('img[data-src], img.product__image').each((i, el) => {
            const src = $(el).attr('data-src') || $(el).attr('src');
            if (src && src.includes('cdn.shopify.com')) {
                images.push(src.replace(/_\d+x\d*(?=\.)/, '')); // strip size suffix like _300x300 to get the full-size image
            }
        });

        // Many themes embed variant JSON in a script tag
        let variants = [];
        $('script').each((i, el) => {
            const text = $(el).html() || '';
            const match = text.match(/"variants"\s*:\s*(\[.*?\])/s);
            if (match) {
                try { variants = JSON.parse(match[1]); } catch {}
            }
        });

        console.log({ title, price, comparePrice, images: images.length, variants: variants.length });
    }
});

The challenge with DOM scraping is that Shopify has thousands of themes, each with different CSS class names and HTML structures. Using multiple selectors with fallbacks (as shown above) helps handle theme variation.

Method 3: Storefront API (GraphQL)

For stores that expose a Storefront API token (often found in the page source), you can use GraphQL queries:

// Find the Storefront token in page source
// Look for: X-Shopify-Storefront-Access-Token or accessToken
const storefrontToken = 'found-in-page-source';

const query = `{
  products(first: 50, after: null) {
    pageInfo {
      hasNextPage
      endCursor
    }
    edges {
      node {
        title
        description
        handle
        productType
        vendor
        priceRange {
          minVariantPrice { amount currencyCode }
          maxVariantPrice { amount currencyCode }
        }
        variants(first: 20) {
          edges {
            node {
              title
              price { amount currencyCode }
              availableForSale
              sku
            }
          }
        }
        images(first: 5) {
          edges {
            node { url altText }
          }
        }
      }
    }
  }
}`;

const response = await fetch(
    'https://store.myshopify.com/api/2024-01/graphql.json',
    {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            'X-Shopify-Storefront-Access-Token': storefrontToken
        },
        body: JSON.stringify({ query })
    }
);

const data = await response.json();
const products = data.data.products.edges.map(e => e.node);

The Storefront API is the most reliable data source but requires a valid access token. Many stores embed these tokens in their frontend JavaScript, making them discoverable.
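A rough way to hunt for one in fetched page HTML (these regexes are my own heuristics; how a theme embeds the token varies, though Storefront tokens are conventionally 32 hex characters):

```javascript
// Heuristic: scan page/theme source for a Storefront API access token.
// Tokens are typically 32 hex characters; the surrounding key name
// differs between themes, so try a few common patterns.
function findStorefrontToken(pageSource) {
    const patterns = [
        /X-Shopify-Storefront-Access-Token['"]?\s*[:=]\s*['"]([a-f0-9]{32})['"]/i,
        /storefrontAccessToken['"]?\s*[:=]\s*['"]([a-f0-9]{32})['"]/i,
        /accessToken['"]?\s*[:=]\s*['"]([a-f0-9]{32})['"]/i,
    ];
    for (const pattern of patterns) {
        const match = pageSource.match(pattern);
        if (match) return match[1];
    }
    return null;
}
```

If this returns null, check the store's theme JavaScript files as well; tokens are sometimes loaded there rather than inline.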


Extracting Review Data

Product reviews are critical for competitive intelligence and sentiment analysis. Shopify doesn't have a native review system, so stores use third-party apps. Each has its own data structure and access method.
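Before choosing an extraction path, it helps to detect which review app a store actually runs. One heuristic is to scan the page's script URLs for each app's usual hosts (the signature list below is my own guess and not exhaustive):

```javascript
// Map of review apps to substrings commonly seen in their
// embedded <script src> URLs (heuristic, not exhaustive).
const REVIEW_APP_SIGNATURES = {
    'judge.me': ['judge.me', 'judgeme'],
    'yotpo': ['yotpo.com', 'staticw2.yotpo'],
    'loox': ['loox.io', 'loox.app'],
    'stamped': ['stamped.io'],
};

// Given the page's script src attributes, return the first matching app.
function detectReviewApp(scriptSrcs) {
    for (const [app, signatures] of Object.entries(REVIEW_APP_SIGNATURES)) {
        if (scriptSrcs.some(src => signatures.some(sig => src.includes(sig)))) {
            return app;
        }
    }
    return null;
}
```

With Cheerio you would feed it something like `detectReviewApp($('script[src]').map((i, el) => $(el).attr('src')).get())`, then branch into the app-specific extractor.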

Judge.me Reviews

Judge.me is one of the most popular review apps on Shopify. Reviews can be fetched via their public widget API:

async function getJudgeMeReviews(shopDomain, productId) {
    const url = `https://judge.me/api/v1/reviews?` +
        `shop_domain=${shopDomain}&` +
        `api_token=PUBLIC_TOKEN&` +
        `product_id=${productId}&` +
        `per_page=50&page=1`;

    const response = await fetch(url);
    const data = await response.json();

    return data.reviews.map(r => ({
        author: r.reviewer.name,
        rating: r.rating,
        title: r.title,
        body: r.body,
        date: r.created_at,
        verified: r.verified === 'buyer',
        images: r.pictures?.map(p => p.urls.original) || []
    }));
}

Loox and Yotpo Reviews

Yotpo and Loox embed review widgets that load data via their own APIs:

// Yotpo - find the app key in page source
const appKey = pageSource.match(/yotpoAppKey['":\s]+['"](\w+)['"]/)?.[1];
const yotpoUrl = `https://api.yotpo.com/v1/widget/${appKey}/products/${productId}/reviews.json?per_page=50&page=1`;

// Loox - reviews are often in iframes
// Parse the Loox widget URL from the page
const looxReviews = await page.evaluate(() => {
    const items = document.querySelectorAll('.loox-review');
    return Array.from(items).map(el => ({
        author: el.querySelector('.loox-review-author')?.textContent,
        rating: el.querySelectorAll('.loox-star.loox-filled').length,
        body: el.querySelector('.loox-review-content')?.textContent
    }));
});

Generic DOM-Based Review Extraction

When you can't identify the review app or access its API, fall back to DOM parsing:

// Standalone helper: pass in a Cheerio instance loaded with the product page
function extractReviews($) {
    const reviews = [];

    // Common review selectors across multiple apps
    const reviewSelectors = [
        '.spr-review', '.jdgm-rev', '.yotpo-review',
        '.loox-review', '.review-item', '[data-review-id]'
    ];

    const selector = reviewSelectors.find(s => $(s).length > 0);
    if (!selector) return reviews;

    $(selector).each((i, el) => {
        reviews.push({
            author: $(el).find('.review-author, .spr-review-header-byline, [itemprop="author"]')
                .text().trim(),
            rating: parseFloat(
                $(el).find('[data-rating], .star-rating, [itemprop="ratingValue"]')
                    .attr('data-rating') || $(el).find('.star.filled, .star-icon--full').length
            ),
            date: $(el).find('.review-date, [itemprop="datePublished"]')
                .text().trim(),
            title: $(el).find('.review-title, .spr-review-header-title')
                .text().trim(),
            body: $(el).find('.review-body, .spr-review-content, [itemprop="reviewBody"]')
                .text().trim(),
            verified: $(el).find('.verified-badge, .spr-badge').length > 0
        });
    });

    return reviews;
}

Monitoring Inventory and Stock Levels

Tracking competitor inventory levels provides actionable business intelligence — you can identify best-sellers, detect stockouts, and understand demand patterns:

async function monitorInventory(storeUrl, productHandle) {
    const url = `${storeUrl}/products/${productHandle}.json`;
    const { product } = await fetch(url).then(r => r.json());

    // Note: inventory_quantity and inventory_policy come from the
    // authenticated Admin API and may be undefined on the public endpoint
    return product.variants.map(variant => ({
        title: variant.title,
        sku: variant.sku,
        available: variant.available,
        inventory_quantity: variant.inventory_quantity,
        price: variant.price,
        compare_at_price: variant.compare_at_price,
        inventory_policy: variant.inventory_policy
    }));
}

// Track changes over time
async function trackInventoryChanges(storeUrl, handles) {
    const snapshot = {};
    for (const handle of handles) {
        snapshot[handle] = await monitorInventory(storeUrl, handle);
        await new Promise(r => setTimeout(r, 1000)); // rate limit
    }
    // Compare with previous snapshot to detect changes
    return snapshot;
}

Note that the public product endpoint usually omits exact inventory counts, since those come from the authenticated Admin API. In many cases you can only see a boolean availability flag (for example via the /products/product-handle.js endpoint), not precise quantities.
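To turn those snapshots into change events, diff consecutive snapshots per SKU. A sketch (field names follow the monitorInventory output above; the change types are illustrative):

```javascript
// Compare two inventory snapshots for one product and report changes.
// Each snapshot is an array of variant objects keyed by SKU.
function diffInventory(previous, current) {
    const prevBySku = new Map(previous.map(v => [v.sku, v]));
    const changes = [];

    for (const variant of current) {
        const before = prevBySku.get(variant.sku);
        if (!before) {
            changes.push({ sku: variant.sku, type: 'new_variant' });
            continue;
        }
        if (before.price !== variant.price) {
            changes.push({
                sku: variant.sku, type: 'price_change',
                from: before.price, to: variant.price
            });
        }
        if (before.available && !variant.available) {
            changes.push({ sku: variant.sku, type: 'stockout' });
        }
        if (!before.available && variant.available) {
            changes.push({ sku: variant.sku, type: 'restock' });
        }
    }
    return changes;
}
```

Stockout and restock events are often the most useful signals: a product that repeatedly sells out is a strong best-seller candidate.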


Scaling with Apify

When you need to scrape dozens, hundreds, or thousands of Shopify stores, doing it on your local machine becomes impractical. Network limits, IP bans, and compute resources all become bottlenecks. This is where cloud scraping platforms like Apify become essential.

Why Use Apify?

  • Proxy rotation: Built-in datacenter and residential proxy pools
  • Scheduling: Run scrapers on cron schedules automatically
  • Storage: Datasets, key-value stores, and request queues built in
  • Monitoring: Track run history, errors, and performance
  • Pre-built actors: Ready-to-use scrapers in the Apify Store

Using Pre-Built Actors from the Apify Store

The fastest way to start scraping Shopify stores is with pre-built actors from the Apify Store. These handle all the edge cases — rate limiting, proxy rotation, theme variations, anti-bot measures — so you can focus on what to do with the data:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Run a Shopify scraper from the Apify Store
const run = await client.actor('shopify-scraper-actor').call({
    startUrls: [
        { url: 'https://store1.com' },
        { url: 'https://store2.com' },
        { url: 'https://store3.com' }
    ],
    maxProducts: 1000,
    includeReviews: true,
    proxyConfiguration: {
        useApifyProxy: true,
        apifyProxyGroups: ['RESIDENTIAL']
    }
});

// Fetch all results
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Scraped ${items.length} products across 3 stores`);

// Export to CSV
const csvBuffer = await client.dataset(run.defaultDatasetId)
    .downloadItems('csv');

Building a Custom Apify Actor

When you need custom extraction logic, the Crawlee framework combined with Apify's infrastructure is powerful:

import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput();
const { startUrls, maxProducts = 500, extractReviews = false } = input;

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: maxProducts,
    maxConcurrency: 5,

    async requestHandler({ request, $, enqueueLinks }) {
        const url = request.url;

        // On collection pages, enqueue product links
        if (url.includes('/collections/')) {
            await enqueueLinks({
                selector: 'a[href*="/products/"]',
                baseUrl: request.loadedUrl
            });

            // Handle pagination
            const nextPage = $('a.pagination__next, [rel="next"]').attr('href');
            if (nextPage) {
                await enqueueLinks({ urls: [new URL(nextPage, url).href] });
            }
            return;
        }

        // On product pages, try JSON first
        try {
            const jsonUrl = url.replace(/\/$/, '') + '.json'; // strip any trailing slash before appending
            const jsonResp = await fetch(jsonUrl);
            if (jsonResp.ok) {
                const { product } = await jsonResp.json();
                await Dataset.pushData({
                    url,
                    source: 'json_api',
                    title: product.title,
                    vendor: product.vendor,
                    productType: product.product_type,
                    tags: product.tags,
                    variants: product.variants.map(v => ({
                        title: v.title, price: v.price, sku: v.sku,
                        available: v.available
                    })),
                    images: product.images.map(i => i.src),
                    scrapedAt: new Date().toISOString()
                });
                return;
            }
        } catch {}

        // Fallback to DOM parsing
        const jsonLd = $('script[type="application/ld+json"]').first().html();
        let structured = {};
        try { structured = JSON.parse(jsonLd || '{}'); } catch {} // malformed JSON-LD shouldn't kill the crawl

        await Dataset.pushData({
            url,
            source: 'dom_parsing',
            title: structured.name || $('h1').first().text().trim(),
            price: structured.offers?.price,
            currency: structured.offers?.priceCurrency,
            description: structured.description?.substring(0, 500),
            image: structured.image,
            brand: structured.brand?.name,
            scrapedAt: new Date().toISOString()
        });
    }
});

await crawler.run(startUrls.map(u => u.url));
await Actor.exit();

Handling Anti-Scraping Measures

Rate Limiting

Always be respectful. Hammering a store with requests can get your IP banned and potentially cause legal issues:

const crawler = new CheerioCrawler({
    maxConcurrency: 3,
    maxRequestsPerMinute: 30,
    requestHandlerTimeoutSecs: 60,

    async requestHandler({ request, $ }) {
        // Add randomized delays to appear more human
        await new Promise(resolve =>
            setTimeout(resolve, 1000 + Math.random() * 2000)
        );
        // ... extraction logic
    }
});

JavaScript-Rendered Content

Some Shopify themes heavily rely on JavaScript for rendering product data. When Cheerio isn't enough, switch to Playwright:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: { headless: true }
    },
    async requestHandler({ page, request }) {
        await page.waitForSelector('.product-grid, .collection-products', {
            timeout: 15000
        });

        const products = await page.evaluate(() => {
            return Array.from(document.querySelectorAll('.product-card')).map(el => ({
                title: el.querySelector('.product-title, .card__heading')?.textContent?.trim(),
                price: el.querySelector('.price, .money')?.textContent?.trim(),
                link: el.querySelector('a')?.href,
                image: el.querySelector('img')?.src
            }));
        });

        for (const product of products) {
            await Dataset.pushData(product);
        }
    }
});

Best Practices for Shopify Scraping

  1. Always try JSON endpoints first — they're faster, more reliable, and return cleaner data than DOM scraping.

  2. Respect robots.txt — check https://store.com/robots.txt before scraping. Many Shopify stores disallow certain paths.

  3. Rate limit aggressively — keep requests under 30 per minute per store. Use random delays to vary your request pattern.

  4. Leverage structured data — Shopify themes embed JSON-LD, Open Graph, and meta tags. Use these before parsing arbitrary HTML.

  5. Handle errors gracefully — stores go down, pages get removed, products sell out. Build retry logic with exponential backoff.

  6. Cache where possible — if scraping the same store daily, only re-fetch products that changed. Use ETags or Last-Modified headers.

  7. Use residential proxies for scale — datacenter IPs get blocked quickly on popular stores. Residential proxies on Apify last much longer.

  8. Monitor your scraper — set up alerts for sudden drops in data volume, which usually indicate blocking or site changes.
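Several of these practices (rate limiting, retries with exponential backoff) fit naturally into a small fetch wrapper. A sketch with illustrative retry counts and delays:

```javascript
// Fetch with exponential backoff and jitter. Retries on network
// errors and transient statuses (429/5xx); gives up after maxRetries.
async function fetchWithBackoff(url, maxRetries = 4, baseDelayMs = 1000) {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        let retryable = true;
        try {
            const response = await fetch(url);
            if (response.ok) return response;
            // Only retry statuses that suggest a transient problem
            retryable = [429, 500, 502, 503, 504].includes(response.status);
            throw new Error(`Request failed with status ${response.status}: ${url}`);
        } catch (err) {
            if (!retryable || attempt === maxRetries) throw err;
        }
        // Exponential backoff: 1x, 2x, 4x, 8x the base delay, plus jitter
        const delay = baseDelayMs * 2 ** attempt + Math.random() * baseDelayMs * 0.5;
        await new Promise(resolve => setTimeout(resolve, delay));
    }
}
```

Note that Crawlee's crawlers already have retry and backoff behavior built in; a wrapper like this is mainly useful for plain fetch-based scripts such as the JSON endpoint examples earlier.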


Legal and Ethical Considerations

Web scraping legality varies by jurisdiction, but there are universal best practices:

  • Only scrape publicly available data — never bypass authentication or access restricted areas
  • Read and respect Terms of Service — some stores explicitly prohibit automated access
  • Don't overload servers — excessive concurrent requests can constitute denial of service
  • Comply with GDPR/CCPA — if collecting personal data (reviewer names, etc.), ensure proper compliance
  • Use data responsibly — scraped data should inform business decisions, not enable spam or unfair practices
  • Consider the store owner — would you be comfortable if someone scraped your store this way?

Conclusion

Shopify's predictable architecture makes it one of the most scraper-friendly e-commerce platforms. Whether you use the built-in JSON endpoints, DOM scraping with Cheerio, or the Storefront GraphQL API, the key is choosing the right approach for your specific use case and scaling needs.

For small-scale competitive analysis, a simple script using the JSON endpoints might be all you need. For large-scale market intelligence across hundreds of stores, leveraging cloud infrastructure like the Apify Store with its pre-built actors and proxy management will save you significant development and operational overhead.

The best approach is layered: start with JSON APIs, fall back to structured data (JSON-LD), then DOM parsing, and finally JavaScript rendering — each progressively more complex but more robust. Combined with respectful rate limiting and proper proxy rotation, you can build reliable Shopify scraping pipelines that deliver actionable e-commerce intelligence at scale.
