Ivan Korostenskij

The Shopify products.json Trick: Scrape Any Store 25x Faster with Python

If you’ve ever done any web scraping, you know how annoying it is to parse HTML: finding the parents of elements, navigating a maze of nested tags, hoping CSS selectors don’t change over time and break everything.

However, Shopify - one of the largest and most useful repositories of product data - offers a trick that lets us bypass this process entirely, leaving us with beautiful JSON right out of the gate.

I recently built a fragrance drop tracker that scrapes dozens of Shopify stores in real time. Let’s walk through how and why this trick made that trivial.

The secret is /products.json

Every Shopify store exposes a secret endpoint: /products.json.

For perspective, I invite you to click on some links (promise they’re safe).

Zoologist is a store that sells niche perfumes in the spirit of animals (Cow, Penguin, T-rex…). Their regular home page is https://www.zoologistperfumes.com.

Now check out what the same store returns at https://www.zoologistperfumes.com/products.json:

{
  "products": [
    {
      "id": 1234567890,
      "title": "Bee (2023)",
      "handle": "bee-2023",
      "vendor": "Zoologist",
      "product_type": "Extrait de Parfum",
      "tags": ["honey", "beeswax", "floral"],
      "variants": [
        {
          "id": 9876543210,
          "title": "60ml",
          "price": "195.00",
          "available": true,
          "sku": "ZOO-BEE-60"
        }
      ],
      "images": [
        {
          "src": "https://cdn.shopify.com/...",
          "width": 1000,
          "height": 1000
        }
      ]
    }
  ]
  ... bunch of more stuff here ...
}

That’s money. Let’s walk through how we can fetch every product off this site in under a second with Python.

The code below is copy-pastable, and I’d highly encourage you to follow along - once you understand this, web scraping becomes a whole lot easier.

import asyncio

import httpx
from pydantic import BaseModel

# Define a few models so the nicely-formed JSON flows smoothly
# into Python objects instead of opaque dictionaries
class ShopifyVariant(BaseModel):
    id: int
    title: str
    price: str
    available: bool
    sku: str

class ShopifyProduct(BaseModel):
    id: int
    title: str
    vendor: str
    tags: list[str]
    variants: list[ShopifyVariant]

class ShopifyResponse(BaseModel):
    products: list[ShopifyProduct]

async def fetch_product(client: httpx.AsyncClient, url: str) -> ShopifyResponse:
    response: httpx.Response = await client.get(url)

    return ShopifyResponse(**response.json())

async def scrape_shopify_products(base_url: str, client: httpx.AsyncClient) -> list[ShopifyProduct]:
    all_products: list[ShopifyProduct] = []
    page_number: int = 1

    while True:
        products: ShopifyResponse = await fetch_product(
            client=client,
            url=f"{base_url}/products.json?page={page_number}"
        )

        if len(products.products) == 0:
            break

        all_products.extend(products.products)
        page_number += 1

    return all_products

async def main() -> None:
    async with httpx.AsyncClient() as client:
        products: list[ShopifyProduct] = await scrape_shopify_products(
            base_url="https://www.zoologistperfumes.com",
            client=client
        )

    print(f"Found {len(products)} products")
    # Output: Found 115 products

if __name__ == "__main__":
    asyncio.run(main())
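If you want to run the code yourself: the only third-party packages are httpx and pydantic (pip install httpx pydantic), and the asyncio.TaskGroup / except* syntax used in the next section requires Python 3.11+.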

Want a better understanding of how to efficiently batch mass amounts of async API calls in Python? Check out this article:


Pydantic SOLVES the issue of parsing JSON in these scenarios - use it

By setting up these Pydantic models, we give ourselves automatic validation and parsing of the raw JSON payload, documentation of exactly how the payload looks, and type-safe data access in our editors.

Plus, when you want to transform this data, you’re working with structured objects instead of dict soup.
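
To make that concrete, here’s a tiny sketch reusing the models defined above (the malformed payload is invented for illustration):

from pydantic import ValidationError

# Happy path: attribute access with editor autocomplete instead of
# product["variants"][0]["price"] dict-diving.
product = ShopifyProduct(
    id=1,
    title="Bee (2023)",
    vendor="Zoologist",
    tags=["honey"],
    variants=[
        ShopifyVariant(
            id=2, title="60ml", price="195.00", available=True, sku="ZOO-BEE-60"
        )
    ],
)
print(product.variants[0].price)  # 195.00

# Bad payload: fails loudly at the boundary instead of surfacing as a
# KeyError three functions later.
try:
    ShopifyProduct(
        id="not-an-int", title="Bee (2023)", vendor="Zoologist", tags=[], variants=[]
    )
except ValidationError as error:
    print(error)  # points at exactly which field is wrong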

Here’s how to scrape 5 sites in the same time as 1

async def main() -> None:
    shopify_sites_list: list[str] = [
        "https://www.zoologistperfumes.com",
        "https://bruvi.com/",
        "https://flourist.com/en-us",
        "https://packagefreeshop.com/",
        "https://meowmeowtweet.com/",
    ]

    products: list[ShopifyProduct] = []

    async with httpx.AsyncClient() as client:
        try:
            async with asyncio.TaskGroup() as task_group:
                tasks: list[asyncio.Task[list[ShopifyProduct]]] = [
                    task_group.create_task(
                        scrape_shopify_products(
                            base_url=site,
                            client=client
                        )
                    )
                    for site in shopify_sites_list
                ]
            for task in tasks:
                products.extend(task.result())

        except* Exception as eg:
            for error in eg.exceptions:
                print(f"Got an error: {error}")

    print(f"Found {len(products)} products")
    # Output: Found 626 products

if __name__ == "__main__":
    asyncio.run(main())

We’re adding a task group that fans out calls to each website concurrently, so no single website with 1,000 pages blocks any of the others.
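
One practical hedge: a task group will happily hit every store simultaneously, so as your site list grows you may want to cap concurrency. Here’s a minimal sketch using asyncio.Semaphore (the wrapper name and the limit of 5 are my choices, not anything Shopify mandates):

semaphore = asyncio.Semaphore(5)  # at most 5 stores in flight at once

async def scrape_politely(base_url: str, client: httpx.AsyncClient) -> list[ShopifyProduct]:
    # Wait for a free slot, then delegate to the scraper defined earlier.
    async with semaphore:
        return await scrape_shopify_products(base_url=base_url, client=client)

Swap scrape_shopify_products for scrape_politely inside the task group and nothing else changes.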

Make scraping 25x faster with binary search

Right now, our sequential scraper has to check page 1, page 2, page 3... until it hits an empty page. That's slow. This story explains how we fix that:

A Cambridge computer science professor got his bike stolen on campus one weekend. He went to the police and was relieved to find that they had a camera posted in plain view of his missing bike.

However, the police argued that they didn't have time to sift through all the footage, so he was out of luck.

So, he tried arguing, telling them:

  • Jump to the footage at midnight going into Sunday: if the bike was still there, it had to have been stolen sometime on Sunday (a 24-hour window)
  • Repeat at noon on Sunday: if the bike was still there, it had to have been stolen between 12pm and 12am that day (a 12-hour window)
  • Doing this 7 more times narrows it down to a roughly 5-minute window where the thief shows up

After they refused to listen, he angrily made the point that even if they’d had camera footage going back to the Cretaceous period, it would’ve taken less than an hour - following his approach - to find the exact second of the theft.

Figure 1: Comparison of binary (top) vs sequential (bottom) search, both searching for the target number 37, with binary search drastically winning out. This target number represents the last page of products in the sites we're scraping.
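
To see that gap in concrete terms, here’s a quick sketch counting how many requests each strategy needs to confirm that page 37 is the last one (the 1,000-page upper bound mirrors the search code below):

def probes_sequential(last_page: int) -> int:
    # Pages 1..last_page all return products; the empty page after them
    # is the request that tells us to stop.
    return last_page + 1

def probes_binary(last_page: int, high: int = 1000) -> int:
    low, count = 1, 0
    while low <= high:
        mid = (low + high) // 2
        count += 1
        if mid <= last_page:  # this page still has products: look higher
            low = mid + 1
        else:                 # empty page: look lower
            high = mid - 1
    return count

print(probes_sequential(37))  # 38 requests
print(probes_binary(37))      # 10 requests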

Let’s apply this. Right now, our algorithm doesn’t know how many pages there are, so instead of unleashing a horde of API calls to collect the data concurrently, we have to tiptoe around the unknown final page.

If we first:

  1. Find that last page through binary search
  2. Cache it

then we know the stopping point and can send dozens of concurrent requests to grab the data, drastically speeding things up.

async def find_last_page(base_url: str, client: httpx.AsyncClient) -> int:
    low, high = 1, 1000
    last_valid_page = 1

    while low <= high:
        mid = (low + high) // 2
        url = f"{base_url}/products.json?page={mid}"

        response: httpx.Response = await client.get(url)
        data: ShopifyResponse = ShopifyResponse(**response.json())

        if data.products:
            last_valid_page = mid
            low = mid + 1  # Try higher
        else:
            high = mid - 1  # Try lower

    return last_valid_page

async def scrape_all_pages_fast(base_url: str, client: httpx.AsyncClient) -> list[ShopifyProduct]:
    last_page: int = await find_last_page(base_url, client)
    all_products: list[ShopifyProduct] = []

    async with asyncio.TaskGroup() as tg:
        tasks: list[asyncio.Task[ShopifyResponse]] = [
            tg.create_task(
                fetch_product(client, f"{base_url}/products.json?page={page}")
            )
            for page in range(1, last_page + 1)
        ]

    for task in tasks:
        all_products.extend(task.result().products)

    return all_products
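One thing the snippet above doesn’t show is step 2 from earlier: caching the last page, which the cached timings below depend on. A minimal sketch using an in-memory dict (the cache and wrapper names are mine; a file or Redis would work the same way):

_last_page_cache: dict[str, int] = {}

async def find_last_page_cached(base_url: str, client: httpx.AsyncClient) -> int:
    # The binary search only runs once per store; afterwards the answer
    # is a dict lookup. Drop the entry if a store grows past it.
    if base_url not in _last_page_cache:
        _last_page_cache[base_url] = await find_last_page(base_url, client)
    return _last_page_cache[base_url]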

Here are the numbers:

Sequential approach: we scrape page by page until we find the last one.

Binary search approach: we find the last page with binary search, then concurrently scrape every page.

Math:

I tested 10 websites at once, averaging 23 pages (~600 products) each.

On the first run, the sequential approach took an average of 3.10s while the binary approach took almost double - 5.926s. That makes sense: binary search has to spend around 10 serial probes (log2 of the 1,000-page search space) finding the last page before it can fan out.

On subsequent runs, once that last page was cached, the binary approach went from 5.94s to 0.23s - a 25.8x speedup!

On smaller jobs, the binary search investment pays off after 2 runs; on larger ones it wipes the floor with the sequential approach.

Stress test

I also found a site with a massive catalog: 25,000 products across 833 pages (https://pelacase.ca/).

                         Sequential    Binary
  First run time         120.1s        11.56s
  Second run (cached)    101.4s        5.26s

Important Nuance

Custom Shopify storefronts sometimes turn the /products.json endpoint off, so check before you scrape.
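
A quick preflight check saves a wasted run; here’s a minimal sketch (the helper name is mine):

import httpx

def has_products_json(base_url: str) -> bool:
    # Stores that disable the endpoint typically return a 404 or an
    # HTML error page instead of JSON.
    response = httpx.get(f"{base_url}/products.json", params={"limit": 1})
    return (
        response.status_code == 200
        and "application/json" in response.headers.get("content-type", "")
    )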

"Fancy algorithms are slow when n is small, and n is usually small."

  • Rob Pike

Don’t prematurely optimize what you don’t need to. A simpler alternative to concurrent pagination + binary search for the mass-page problem is setting the per-page item limit to something ridiculous - then you only have to scrape 1 page (with plenty of other tradeoffs):

www.zoologistperfumes.com/products.json?limit=1000000&page=1
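One caveat before relying on this: many stores cap the limit parameter (250 is a commonly reported ceiling), so sanity-check the count you get back:

import httpx

response = httpx.get(
    "https://www.zoologistperfumes.com/products.json",
    params={"limit": 1000000, "page": 1},
)
count = len(response.json()["products"])

# If the count sits exactly at a round number like 250, the store likely
# truncated the response and you need pagination after all.
print(f"Got {count} products")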

Binary search is just another tool. You won’t always need it, but knowing it exists changes what problems you can solve efficiently when the time comes.

Conclusion

That’s it. You know:

  1. The backdoor products.json endpoint
  2. How to scrape multiple sites concurrently (TaskGroups, async)
  3. How to optimize pagination with binary search

More importantly, you learned that to optimize something, you often need to invest time illuminating its boundaries.

So pick a Shopify site, run the code, and watch it pull hundreds of products/second; add binary search and watch it get 25x faster.


Questions about web scraping or async Python? Drop a comment below.

Follow for more tutorials on building real production tools. :)
