agenthustler
How to Scrape Shopify Stores: Products, Reviews, and Store Data

Web scraping Shopify stores is one of the most common data extraction tasks in e-commerce intelligence. Whether you're doing competitive analysis, building price comparison tools, or aggregating product catalogs, understanding how to extract data from Shopify-powered stores is an essential skill.

In this comprehensive guide, we'll cover Shopify's architecture, the different approaches to extracting product data, reviews, and store metadata, and how to scale your scraping operations using cloud infrastructure like Apify.


Understanding Shopify's Architecture

Shopify powers over 4.8 million online stores worldwide, making it one of the most popular e-commerce platforms on the planet. Every Shopify store follows a predictable URL structure and data format, which makes scraping significantly more straightforward than on custom-built platforms.

URL Structure

Shopify stores follow consistent URL patterns that you can rely on:

  • Products listing: https://store.com/collections/all
  • Individual product: https://store.com/products/product-handle
  • Product JSON: https://store.com/products/product-handle.json
  • Collections: https://store.com/collections/collection-handle
  • All products JSON: https://store.com/products.json
  • Collection JSON: https://store.com/collections/all/products.json
  • Search: https://store.com/search?q=keyword&type=product
  • Sitemap: https://store.com/sitemap.xml

This predictability is a huge advantage. Unlike custom-built e-commerce sites where every store has a different layout and data structure, you know exactly where to find data on any Shopify store from day one.
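The sitemap endpoint in particular is a convenient way to enumerate every product URL without paginating collections: Shopify's /sitemap.xml is an index that points at sub-sitemaps, conventionally named like sitemap_products_1.xml. A minimal sketch (the helper names are mine, and the regex-based XML parsing is a shortcut; prefer a real XML parser in production):

```javascript
// Pull every <loc>...</loc> value out of sitemap XML.
function extractLocs(xml) {
    return [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map(m => m[1]);
}

// Discover all product URLs for a store via its sitemap index.
async function getProductUrls(storeUrl) {
    const indexXml = await fetch(`${storeUrl}/sitemap.xml`).then(r => r.text());
    // The index lists sub-sitemaps; product ones contain "sitemap_products"
    const productSitemaps = extractLocs(indexXml)
        .filter(u => u.includes('sitemap_products'));

    const urls = [];
    for (const sitemapUrl of productSitemaps) {
        const xml = await fetch(sitemapUrl).then(r => r.text());
        urls.push(...extractLocs(xml).filter(u => u.includes('/products/')));
    }
    return urls;
}
```

This gives you a complete product URL list up front, which you can then feed into any of the extraction methods below.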

The Built-In JSON Endpoints

One of Shopify's best-kept secrets for data extraction is its built-in JSON API. Most Shopify stores expose product data through JSON endpoints without any authentication required:

// Fetch products from any Shopify store
const response = await fetch('https://example-store.com/products.json?limit=250');
const data = await response.json();

console.log(data.products.length); // Up to 250 products per page
console.log(data.products[0].title);
console.log(data.products[0].variants[0].price);

The /products.json endpoint supports pagination, allowing you to iterate through all products in a store:

// Paginate through all products
async function getAllProducts(storeUrl) {
    let page = 1;
    let allProducts = [];

    while (true) {
        const url = `${storeUrl}/products.json?limit=250&page=${page}`;
        const response = await fetch(url);
        const data = await response.json();

        if (data.products.length === 0) break;

        allProducts = allProducts.concat(data.products);
        page++;

        // Be respectful - add delay between requests
        await new Promise(resolve => setTimeout(resolve, 1500));
    }

    return allProducts;
}

const products = await getAllProducts('https://example-store.com');
console.log(`Found ${products.length} total products`);

Important caveat: As of late 2024, Shopify started rate-limiting and restricting access to some JSON endpoints on certain stores. Larger stores or those with custom configurations may block unauthenticated JSON access. Always test if the endpoint is accessible before building your entire scraper around it.
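Here's one way to run that test up front (a sketch; the helper name is mine):

```javascript
// Check whether a store's /products.json endpoint is actually open.
// Returns true only for an OK response that parses as JSON
// with a products array.
async function isJsonEndpointOpen(storeUrl) {
    try {
        const response = await fetch(`${storeUrl}/products.json?limit=1`);
        if (!response.ok) return false; // e.g. 401/403/430 = restricted
        const data = await response.json();
        return Array.isArray(data.products);
    } catch {
        return false; // network error, or an HTML block page instead of JSON
    }
}
```

If this returns false, plan on one of the fallback methods described below instead.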


Three Methods for Extracting Product Data

Method 1: JSON API (Preferred Approach)

The JSON API returns richly structured product data with everything you could need:

{
  "product": {
    "id": 123456789,
    "title": "Premium Cotton T-Shirt",
    "body_html": "<p>Made from 100% organic cotton...</p>",
    "vendor": "BrandName",
    "product_type": "T-Shirts",
    "created_at": "2024-01-15T10:30:00-05:00",
    "handle": "premium-cotton-t-shirt",
    "tags": ["cotton", "organic", "bestseller"],
    "variants": [
      {
        "id": 987654321,
        "title": "Small / Blue",
        "price": "29.99",
        "compare_at_price": "39.99",
        "sku": "PCT-S-BLU",
        "inventory_quantity": 45,
        "weight": 200,
        "weight_unit": "g"
      }
    ],
    "images": [
      {
        "src": "https://cdn.shopify.com/s/files/1/image.jpg",
        "alt": "Premium Cotton T-Shirt - Blue"
      }
    ]
  }
}

This gives you pricing, variants, images, tags, vendor information, and more, all cleanly structured and ready to process. Note that some fields shown above, such as inventory_quantity, come from the authenticated Admin API and are usually absent from the public storefront endpoint. For competitive analysis, the compare_at_price field is especially valuable because it reveals discount strategies.
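As a quick illustration of that analysis, here's a small helper (my own sketch) that summarizes discounting across the products array returned by /products.json:

```javascript
// Summarize discounting across products from /products.json:
// what share of variants are on sale, and the average markdown depth.
function discountSummary(products) {
    const variants = products.flatMap(p => p.variants);
    const onSale = variants.filter(v =>
        v.compare_at_price && parseFloat(v.compare_at_price) > parseFloat(v.price)
    );
    const avgDiscount = onSale.length === 0 ? 0 :
        onSale.reduce((sum, v) =>
            sum + (1 - parseFloat(v.price) / parseFloat(v.compare_at_price)), 0
        ) / onSale.length;
    return {
        totalVariants: variants.length,
        onSaleCount: onSale.length,
        avgDiscountPct: Math.round(avgDiscount * 100),
    };
}
```

Run daily, a summary like this lets you track how aggressively a competitor leans on markdowns over time.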

Method 2: DOM Scraping with Cheerio

When JSON endpoints are restricted, you fall back to parsing the HTML directly. Shopify themes vary widely, but the Crawlee framework with Cheerio makes this manageable:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Product title - most themes use h1
        const title = $('h1.product-title, h1.product__title, h1[itemprop="name"]')
            .first().text().trim();

        // Price extraction - look for common patterns
        const price = $('[data-product-price], .product-price .money, .price-item--regular')
            .first().text().trim();
        const comparePrice = $('[data-compare-price], .price-item--sale, .product-price__compare')
            .first().text().trim();

        // Description
        const description = $('.product-description, .product__description')
            .text().trim();

        // Images - Shopify lazy-loads with data-src
        const images = [];
        $('img[data-src], img.product__image').each((i, el) => {
            const src = $(el).attr('data-src') || $(el).attr('src');
            if (src && src.includes('cdn.shopify.com')) {
                images.push(src.replace(/_\d+x\d*(?=\.)/, '')); // strip size suffix like _300x300 to get the full-size image
            }
        });

        // Many themes embed variant JSON in a script tag
        let variants = [];
        $('script').each((i, el) => {
            const text = $(el).html() || '';
            const match = text.match(/"variants"\s*:\s*(\[.*?\])/s);
            if (match) {
                try { variants = JSON.parse(match[1]); } catch {}
            }
        });

        console.log({ title, price, comparePrice, images: images.length, variants: variants.length });
    }
});

The challenge with DOM scraping is that Shopify has thousands of themes, each with different CSS class names and HTML structures. Using multiple selectors with fallbacks (as shown above) helps handle theme variation.

Method 3: Storefront API (GraphQL)

For stores that expose a Storefront API token (often found in the page source), you can use GraphQL queries:

// Find the Storefront token in page source
// Look for: X-Shopify-Storefront-Access-Token or accessToken
const storefrontToken = 'found-in-page-source';

const query = `{
  products(first: 50, after: null) {
    pageInfo {
      hasNextPage
      endCursor
    }
    edges {
      node {
        title
        description
        handle
        productType
        vendor
        priceRange {
          minVariantPrice { amount currencyCode }
          maxVariantPrice { amount currencyCode }
        }
        variants(first: 20) {
          edges {
            node {
              title
              price { amount currencyCode }
              availableForSale
              sku
            }
          }
        }
        images(first: 5) {
          edges {
            node { url altText }
          }
        }
      }
    }
  }
}`;

const response = await fetch(
    'https://store.myshopify.com/api/2024-01/graphql.json',
    {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
            'X-Shopify-Storefront-Access-Token': storefrontToken
        },
        body: JSON.stringify({ query })
    }
);

const data = await response.json();
const products = data.data.products.edges.map(e => e.node);

The Storefront API is the most reliable data source but requires a valid access token. Many stores embed these tokens in their frontend JavaScript, making them discoverable.
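A rough way to hunt for one in fetched page HTML (these regexes are my own heuristics; how a theme embeds the token varies, though Storefront tokens are conventionally 32 hex characters):

```javascript
// Heuristic: scan page/theme source for a Storefront API access token.
// Tokens are typically 32 hex characters; the surrounding key name
// differs between themes, so try a few common patterns.
function findStorefrontToken(pageSource) {
    const patterns = [
        /X-Shopify-Storefront-Access-Token['"]?\s*[:=]\s*['"]([a-f0-9]{32})['"]/i,
        /storefrontAccessToken['"]?\s*[:=]\s*['"]([a-f0-9]{32})['"]/i,
        /accessToken['"]?\s*[:=]\s*['"]([a-f0-9]{32})['"]/i,
    ];
    for (const pattern of patterns) {
        const match = pageSource.match(pattern);
        if (match) return match[1];
    }
    return null;
}
```

If this returns null, check the store's theme JavaScript files as well; tokens are sometimes loaded there rather than inline.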


Extracting Review Data

Product reviews are critical for competitive intelligence and sentiment analysis. Shopify doesn't have a native review system, so stores use third-party apps. Each has its own data structure and access method.
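Before choosing an extraction path, it helps to detect which review app a store actually runs. One heuristic is to scan the page's script URLs for each app's usual hosts (the signature list below is my own guess and not exhaustive):

```javascript
// Map of review apps to substrings commonly seen in their
// embedded <script src> URLs (heuristic, not exhaustive).
const REVIEW_APP_SIGNATURES = {
    'judge.me': ['judge.me', 'judgeme'],
    'yotpo': ['yotpo.com', 'staticw2.yotpo'],
    'loox': ['loox.io', 'loox.app'],
    'stamped': ['stamped.io'],
};

// Given the page's script src attributes, return the first matching app.
function detectReviewApp(scriptSrcs) {
    for (const [app, signatures] of Object.entries(REVIEW_APP_SIGNATURES)) {
        if (scriptSrcs.some(src => signatures.some(sig => src.includes(sig)))) {
            return app;
        }
    }
    return null;
}
```

With Cheerio you would feed it something like `detectReviewApp($('script[src]').map((i, el) => $(el).attr('src')).get())`, then branch into the app-specific extractor.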

Judge.me Reviews

Judge.me is one of the most popular review apps on Shopify. Reviews can be fetched via their public widget API:

async function getJudgeMeReviews(shopDomain, productId) {
    const url = `https://judge.me/api/v1/reviews?` +
        `shop_domain=${shopDomain}&` +
        `api_token=PUBLIC_TOKEN&` +
        `product_id=${productId}&` +
        `per_page=50&page=1`;

    const response = await fetch(url);
    const data = await response.json();

    return data.reviews.map(r => ({
        author: r.reviewer.name,
        rating: r.rating,
        title: r.title,
        body: r.body,
        date: r.created_at,
        verified: r.verified === 'buyer',
        images: r.pictures?.map(p => p.urls.original) || []
    }));
}

Loox and Yotpo Reviews

Yotpo and Loox embed review widgets that load data via their own APIs:

// Yotpo - find the app key in page source
const appKey = pageSource.match(/yotpoAppKey['":\s]+['"](\w+)['"]/)?.[1];
const yotpoUrl = `https://api.yotpo.com/v1/widget/${appKey}/products/${productId}/reviews.json?per_page=50&page=1`;

// Loox - reviews are often in iframes
// Parse the Loox widget URL from the page
const looxReviews = await page.evaluate(() => {
    const items = document.querySelectorAll('.loox-review');
    return Array.from(items).map(el => ({
        author: el.querySelector('.loox-review-author')?.textContent,
        rating: el.querySelectorAll('.loox-star.loox-filled').length,
        body: el.querySelector('.loox-review-content')?.textContent
    }));
});

Generic DOM-Based Review Extraction

When you can't identify the review app or access its API, fall back to DOM parsing:

// Standalone helper: pass in a Cheerio instance loaded with the product page
function extractReviews($) {
    const reviews = [];

    // Common review selectors across multiple apps
    const reviewSelectors = [
        '.spr-review', '.jdgm-rev', '.yotpo-review',
        '.loox-review', '.review-item', '[data-review-id]'
    ];

    const selector = reviewSelectors.find(s => $(s).length > 0);
    if (!selector) return reviews;

    $(selector).each((i, el) => {
        reviews.push({
            author: $(el).find('.review-author, .spr-review-header-byline, [itemprop="author"]')
                .text().trim(),
            rating: parseFloat(
                $(el).find('[data-rating], .star-rating, [itemprop="ratingValue"]')
                    .attr('data-rating') || $(el).find('.star.filled, .star-icon--full').length
            ),
            date: $(el).find('.review-date, [itemprop="datePublished"]')
                .text().trim(),
            title: $(el).find('.review-title, .spr-review-header-title')
                .text().trim(),
            body: $(el).find('.review-body, .spr-review-content, [itemprop="reviewBody"]')
                .text().trim(),
            verified: $(el).find('.verified-badge, .spr-badge').length > 0
        });
    });

    return reviews;
}

Monitoring Inventory and Stock Levels

Tracking competitor inventory levels provides actionable business intelligence — you can identify best-sellers, detect stockouts, and understand demand patterns:

async function monitorInventory(storeUrl, productHandle) {
    const url = `${storeUrl}/products/${productHandle}.json`;
    const { product } = await fetch(url).then(r => r.json());

    // Note: inventory_quantity and inventory_policy come from the
    // authenticated Admin API and may be undefined on the public endpoint
    return product.variants.map(variant => ({
        title: variant.title,
        sku: variant.sku,
        available: variant.available,
        inventory_quantity: variant.inventory_quantity,
        price: variant.price,
        compare_at_price: variant.compare_at_price,
        inventory_policy: variant.inventory_policy
    }));
}

// Track changes over time
async function trackInventoryChanges(storeUrl, handles) {
    const snapshot = {};
    for (const handle of handles) {
        snapshot[handle] = await monitorInventory(storeUrl, handle);
        await new Promise(r => setTimeout(r, 1000)); // rate limit
    }
    // Compare with previous snapshot to detect changes
    return snapshot;
}

Note that the public product endpoint usually omits exact inventory counts, since those come from the authenticated Admin API. In many cases you can only see a boolean availability flag (for example via the /products/product-handle.js endpoint), not precise quantities.
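To turn those snapshots into change events, diff consecutive snapshots per SKU. A sketch (field names follow the monitorInventory output above; the change types are illustrative):

```javascript
// Compare two inventory snapshots for one product and report changes.
// Each snapshot is an array of variant objects keyed by SKU.
function diffInventory(previous, current) {
    const prevBySku = new Map(previous.map(v => [v.sku, v]));
    const changes = [];

    for (const variant of current) {
        const before = prevBySku.get(variant.sku);
        if (!before) {
            changes.push({ sku: variant.sku, type: 'new_variant' });
            continue;
        }
        if (before.price !== variant.price) {
            changes.push({
                sku: variant.sku, type: 'price_change',
                from: before.price, to: variant.price
            });
        }
        if (before.available && !variant.available) {
            changes.push({ sku: variant.sku, type: 'stockout' });
        }
        if (!before.available && variant.available) {
            changes.push({ sku: variant.sku, type: 'restock' });
        }
    }
    return changes;
}
```

Stockout and restock events are often the most useful signals: a product that repeatedly sells out is a strong best-seller candidate.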


Scaling with Apify

When you need to scrape dozens, hundreds, or thousands of Shopify stores, doing it on your local machine becomes impractical. Network limits, IP bans, and compute resources all become bottlenecks. This is where cloud scraping platforms like Apify become essential.

Why Use Apify?

  • Proxy rotation: Built-in datacenter and residential proxy pools
  • Scheduling: Run scrapers on cron schedules automatically
  • Storage: Datasets, key-value stores, and request queues built in
  • Monitoring: Track run history, errors, and performance
  • Pre-built actors: Ready-to-use scrapers in the Apify Store

Using Pre-Built Actors from the Apify Store

The fastest way to start scraping Shopify stores is with pre-built actors from the Apify Store. These handle all the edge cases — rate limiting, proxy rotation, theme variations, anti-bot measures — so you can focus on what to do with the data:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Run a Shopify scraper from the Apify Store
const run = await client.actor('shopify-scraper-actor').call({
    startUrls: [
        { url: 'https://store1.com' },
        { url: 'https://store2.com' },
        { url: 'https://store3.com' }
    ],
    maxProducts: 1000,
    includeReviews: true,
    proxyConfiguration: {
        useApifyProxy: true,
        apifyProxyGroups: ['RESIDENTIAL']
    }
});

// Fetch all results
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Scraped ${items.length} products across 3 stores`);

// Export to CSV
const csvBuffer = await client.dataset(run.defaultDatasetId)
    .downloadItems('csv');

Building a Custom Apify Actor

When you need custom extraction logic, the Crawlee framework combined with Apify's infrastructure is powerful:

import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput();
const { startUrls, maxProducts = 500, extractReviews = false } = input;

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: maxProducts,
    maxConcurrency: 5,

    async requestHandler({ request, $, enqueueLinks }) {
        const url = request.url;

        // On collection pages, enqueue product links
        if (url.includes('/collections/')) {
            await enqueueLinks({
                selector: 'a[href*="/products/"]',
                baseUrl: request.loadedUrl
            });

            // Handle pagination
            const nextPage = $('a.pagination__next, [rel="next"]').attr('href');
            if (nextPage) {
                await enqueueLinks({ urls: [new URL(nextPage, url).href] });
            }
            return;
        }

        // On product pages, try JSON first
        try {
            const jsonUrl = url.replace(/\/$/, '') + '.json'; // strip any trailing slash before appending
            const jsonResp = await fetch(jsonUrl);
            if (jsonResp.ok) {
                const { product } = await jsonResp.json();
                await Dataset.pushData({
                    url,
                    source: 'json_api',
                    title: product.title,
                    vendor: product.vendor,
                    productType: product.product_type,
                    tags: product.tags,
                    variants: product.variants.map(v => ({
                        title: v.title, price: v.price, sku: v.sku,
                        available: v.available
                    })),
                    images: product.images.map(i => i.src),
                    scrapedAt: new Date().toISOString()
                });
                return;
            }
        } catch {}

        // Fallback to DOM parsing
        const jsonLd = $('script[type="application/ld+json"]').first().html();
        let structured = {};
        try { structured = JSON.parse(jsonLd || '{}'); } catch {} // malformed JSON-LD shouldn't kill the crawl

        await Dataset.pushData({
            url,
            source: 'dom_parsing',
            title: structured.name || $('h1').first().text().trim(),
            price: structured.offers?.price,
            currency: structured.offers?.priceCurrency,
            description: structured.description?.substring(0, 500),
            image: structured.image,
            brand: structured.brand?.name,
            scrapedAt: new Date().toISOString()
        });
    }
});

await crawler.run(startUrls.map(u => u.url));
await Actor.exit();

Handling Anti-Scraping Measures

Rate Limiting

Always be respectful. Hammering a store with requests can get your IP banned and potentially cause legal issues:

const crawler = new CheerioCrawler({
    maxConcurrency: 3,
    maxRequestsPerMinute: 30,
    requestHandlerTimeoutSecs: 60,

    async requestHandler({ request, $ }) {
        // Add randomized delays to appear more human
        await new Promise(resolve =>
            setTimeout(resolve, 1000 + Math.random() * 2000)
        );
        // ... extraction logic
    }
});

JavaScript-Rendered Content

Some Shopify themes heavily rely on JavaScript for rendering product data. When Cheerio isn't enough, switch to Playwright:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: { headless: true }
    },
    async requestHandler({ page, request }) {
        await page.waitForSelector('.product-grid, .collection-products', {
            timeout: 15000
        });

        const products = await page.evaluate(() => {
            return Array.from(document.querySelectorAll('.product-card')).map(el => ({
                title: el.querySelector('.product-title, .card__heading')?.textContent?.trim(),
                price: el.querySelector('.price, .money')?.textContent?.trim(),
                link: el.querySelector('a')?.href,
                image: el.querySelector('img')?.src
            }));
        });

        for (const product of products) {
            await Dataset.pushData(product);
        }
    }
});

Best Practices for Shopify Scraping

  1. Always try JSON endpoints first — they're faster, more reliable, and return cleaner data than DOM scraping.

  2. Respect robots.txt — check https://store.com/robots.txt before scraping. Many Shopify stores disallow certain paths.

  3. Rate limit aggressively — keep requests under 30 per minute per store. Use random delays to vary your request pattern.

  4. Leverage structured data — Shopify themes embed JSON-LD, Open Graph, and meta tags. Use these before parsing arbitrary HTML.

  5. Handle errors gracefully — stores go down, pages get removed, products sell out. Build retry logic with exponential backoff.

  6. Cache where possible — if scraping the same store daily, only re-fetch products that changed. Use ETags or Last-Modified headers.

  7. Use residential proxies for scale — datacenter IPs get blocked quickly on popular stores. Residential proxies on Apify last much longer.

  8. Monitor your scraper — set up alerts for sudden drops in data volume, which usually indicate blocking or site changes.
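Several of these practices (rate limiting, retries with exponential backoff) fit naturally into a small fetch wrapper. A sketch with illustrative retry counts and delays:

```javascript
// Fetch with exponential backoff and jitter. Retries on network
// errors and transient statuses (429/5xx); gives up after maxRetries.
async function fetchWithBackoff(url, maxRetries = 4, baseDelayMs = 1000) {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        let retryable = true;
        try {
            const response = await fetch(url);
            if (response.ok) return response;
            // Only retry statuses that suggest a transient problem
            retryable = [429, 500, 502, 503, 504].includes(response.status);
            throw new Error(`Request failed with status ${response.status}: ${url}`);
        } catch (err) {
            if (!retryable || attempt === maxRetries) throw err;
        }
        // Exponential backoff: 1x, 2x, 4x, 8x the base delay, plus jitter
        const delay = baseDelayMs * 2 ** attempt + Math.random() * baseDelayMs * 0.5;
        await new Promise(resolve => setTimeout(resolve, delay));
    }
}
```

Note that Crawlee's crawlers already have retry and backoff behavior built in; a wrapper like this is mainly useful for plain fetch-based scripts such as the JSON endpoint examples earlier.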


Legal and Ethical Considerations

Web scraping legality varies by jurisdiction, but there are universal best practices:

  • Only scrape publicly available data — never bypass authentication or access restricted areas
  • Read and respect Terms of Service — some stores explicitly prohibit automated access
  • Don't overload servers — excessive concurrent requests can constitute denial of service
  • Comply with GDPR/CCPA — if collecting personal data (reviewer names, etc.), ensure proper compliance
  • Use data responsibly — scraped data should inform business decisions, not enable spam or unfair practices
  • Consider the store owner — would you be comfortable if someone scraped your store this way?

Conclusion

Shopify's predictable architecture makes it one of the most scraper-friendly e-commerce platforms. Whether you use the built-in JSON endpoints, DOM scraping with Cheerio, or the Storefront GraphQL API, the key is choosing the right approach for your specific use case and scaling needs.

For small-scale competitive analysis, a simple script using the JSON endpoints might be all you need. For large-scale market intelligence across hundreds of stores, leveraging cloud infrastructure like the Apify Store with its pre-built actors and proxy management will save you significant development and operational overhead.

The best approach is layered: start with JSON APIs, fall back to structured data (JSON-LD), then DOM parsing, and finally JavaScript rendering — each progressively more complex but more robust. Combined with respectful rate limiting and proper proxy rotation, you can build reliable Shopify scraping pipelines that deliver actionable e-commerce intelligence at scale.
