If you’ve ever done any web scraping, you know how annoying it is to parse HTML: finding parents of elements, navigating a maze of nested tags, hoping CSS selectors don’t change over time and break everything.
However, Shopify - one of the largest and most useful repositories of product data - offers a trick that lets us bypass this process entirely, leaving us with beautiful JSON right out of the gate.
I recently built a fragrance drop tracker that scrapes dozens of Shopify stores in real time. Let’s walk through how and why this trick made that trivial.
The secret is /products.json
Nearly every Shopify store exposes the same quietly public endpoint: /products.json.
For perspective, let’s look at a live example (promise it’s safe): Zoologist Perfumes, a store that sells niche perfumes in the spirit of animals (Cow, Penguin, T-rex…). Their home page, https://www.zoologistperfumes.com, looks like any other storefront.
Now check out https://www.zoologistperfumes.com/products.json:
{
  "products": [
    {
      "id": 1234567890,
      "title": "Bee (2023)",
      "handle": "bee-2023",
      "vendor": "Zoologist",
      "product_type": "Extrait de Parfum",
      "tags": ["honey", "beeswax", "floral"],
      "variants": [
        {
          "id": 9876543210,
          "title": "60ml",
          "price": "195.00",
          "available": true,
          "sku": "ZOO-BEE-60"
        }
      ],
      "images": [
        {
          "src": "https://cdn.shopify.com/...",
          "width": 1000,
          "height": 1000
        }
      ]
    }
  ]
  ... bunch of more stuff here ...
}
That’s money. Let’s walk through how we can fetch every product off this site in under a second with Python.
The code below is copy-pastable, and I’d highly encourage you to follow along: once you understand this, web scraping becomes a whole lot easier.
import asyncio

import httpx
from pydantic import BaseModel


# Define a few models so the nicely-formed JSON flows
# smoothly into Python objects instead of opaque dictionaries
class ShopifyVariant(BaseModel):
    id: int
    title: str
    price: str
    available: bool
    sku: str


class ShopifyProduct(BaseModel):
    id: int
    title: str
    vendor: str
    tags: list[str]
    variants: list[ShopifyVariant]


class ShopifyResponse(BaseModel):
    products: list[ShopifyProduct]


async def fetch_product(client: httpx.AsyncClient, url: str) -> ShopifyResponse:
    response: httpx.Response = await client.get(url)
    return ShopifyResponse(**response.json())


async def scrape_shopify_products(base_url: str, client: httpx.AsyncClient) -> list[ShopifyProduct]:
    all_products: list[ShopifyProduct] = []
    page_number: int = 1
    while True:
        products: ShopifyResponse = await fetch_product(
            client=client,
            url=f"{base_url}/products.json?page={page_number}"
        )
        if len(products.products) == 0:
            break
        all_products.extend(products.products)
        page_number += 1
    return all_products


async def main() -> None:
    async with httpx.AsyncClient() as client:
        products: list[ShopifyProduct] = await scrape_shopify_products(
            base_url="https://www.zoologistperfumes.com",
            client=client
        )
        print(f"Found {len(products)} products")
        # Output: Found 115 products


if __name__ == "__main__":
    asyncio.run(main())
Want a better understanding of how to efficiently batch mass amounts of async API calls in Python? Check out this article:
What Modern Python Uses for Async API Calls: HTTPX & TaskGroups
Ivan Korostenskij ・ Nov 26
Pydantic SOLVES the issue of parsing JSON in these scenarios. Use it
By setting up these Pydantic models, we give ourselves automatic validation and parsing of the raw JSON payload, documentation of exactly what the payload looks like, and type-safe data access in our editors.
Plus, when you want to transform this data, you’re working with structured objects instead of dict soup.
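To make that concrete, here’s a tiny made-up example of what the models above buy you (it reuses the ShopifyResponse model defined earlier): a well-formed payload parses straight into typed objects, and a malformed one fails loudly at the boundary instead of deep inside your pipeline.

from pydantic import ValidationError

# Hypothetical payloads, trimmed down for illustration
good_payload = {
    "products": [
        {
            "id": 1,
            "title": "Bee (2023)",
            "vendor": "Zoologist",
            "tags": ["honey"],
            "variants": [
                {"id": 2, "title": "60ml", "price": "195.00", "available": True, "sku": "ZOO-BEE-60"}
            ],
        }
    ]
}
bad_payload = {"products": [{"id": "not-an-int"}]}

parsed = ShopifyResponse(**good_payload)
print(parsed.products[0].variants[0].price)  # "195.00", with autocomplete on every field

try:
    ShopifyResponse(**bad_payload)
except ValidationError as error:
    print(error)  # Pinpoints exactly which fields are missing or mistyped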
Here’s how to scrape 5 sites in the same time as 1
async def main() -> None:
    shopify_sites_list: list[str] = [
        "https://www.zoologistperfumes.com",
        "https://bruvi.com/",
        "https://flourist.com/en-us",
        "https://packagefreeshop.com/",
        "https://meowmeowtweet.com/",
    ]
    products: list[ShopifyProduct] = []
    async with httpx.AsyncClient() as client:
        try:
            async with asyncio.TaskGroup() as task_group:
                tasks: list[asyncio.Task[list[ShopifyProduct]]] = [
                    task_group.create_task(
                        scrape_shopify_products(
                            base_url=site,
                            client=client
                        )
                    )
                    for site in shopify_sites_list
                ]
            for task in tasks:
                products.extend(task.result())
        except* Exception as eg:
            for error in eg.exceptions:
                print(f"Got an error: {error}")
    print(f"Found {len(products)} products")
    # Output: Found 626 products


if __name__ == "__main__":
    asyncio.run(main())
We wrap the calls in a task group that launches the scrape for every website concurrently, so a single site with 1,000 pages doesn’t hold up progress on the others.
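One pattern worth layering on top (it’s not in the code above, and the names here are my own): once you start firing many requests at a single store at once - like the binary search version further down - you can trip rate limits. A minimal sketch using asyncio.Semaphore caps how many requests are in flight at a time while leaving the TaskGroup code untouched.

# Hypothetical politeness wrapper around the fetch_product helper
MAX_CONCURRENT_REQUESTS = 5
request_semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)


async def fetch_product_politely(client: httpx.AsyncClient, url: str) -> ShopifyResponse:
    # Only MAX_CONCURRENT_REQUESTS coroutines get past this line at once;
    # the rest wait their turn instead of hammering the store
    async with request_semaphore:
        return await fetch_product(client, url)

Swap it in wherever fetch_product is called and everything else stays the same.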
Make scraping 25x faster with binary search
Right now, our sequential scraper has to check page 1, page 2, page 3... until it hits an empty page. That's slow. This story explains how we fix that:
A Cambridge computer science professor got his bike stolen on campus one weekend. He went to the police and was relieved to find that they had a camera posted in plain view of his missing bike.
However, the police argued that they didn't have time to sift through all the footage, so he was out of luck.
So he tried a different argument:
- If they skipped the footage ahead to the start of Sunday and the bike was still there, it had to have been stolen on Sunday (a 24-hour window)
- Repeating this at noon on Sunday: if the bike was still there, it had to have been stolen between 12 pm and 12 am that day (a 12-hour window)
- Doing this 7 more times narrows it down to a roughly 5-minute window where the thief shows up
After they refused to listen, he angrily made the point that even if they had to sift through camera footage going back to the Cretaceous period, it would have taken less than an hour - following his approach - to find the exact second of the theft.
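His math holds up. Each check halves the remaining window, so the number of checks grows with the logarithm of the footage length, not the length itself. A rough sanity check (assuming ~66 million years since the Cretaceous and one check of the footage per minute):

import math

minutes_of_footage = 66_000_000 * 365 * 24 * 60   # ~66 million years, in minutes
five_minute_windows = minutes_of_footage / 5

checks_needed = math.ceil(math.log2(five_minute_windows))
print(checks_needed)  # 43 checks, i.e. ~43 minutes of work to isolate a 5-minute window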
Figure 1: Comparison of binary (top) vs sequential (bottom) search, both searching for the target number 37, with binary search drastically winning out. This target number represents the last page of products in the sites we're scraping.
Let’s apply this. Right now, our algorithm doesn’t know how many pages there are, so instead of unleashing a horde of concurrent API calls to collect the data, we have to tiptoe toward the unknown final page.
If we first:
- Find that last page through binary search
- Cache it
then we know the stopping point and can fire off dozens of concurrent requests to grab the data, drastically speeding things up.
async def find_last_page(base_url: str, client: httpx.AsyncClient) -> int:
    low, high = 1, 1000
    last_valid_page = 1
    while low <= high:
        mid = (low + high) // 2
        url = f"{base_url}/products.json?page={mid}"
        response: httpx.Response = await client.get(url)
        data: ShopifyResponse = ShopifyResponse(**response.json())
        if data.products:
            last_valid_page = mid
            low = mid + 1  # Try higher
        else:
            high = mid - 1  # Try lower
    return last_valid_page


async def scrape_all_pages_fast(base_url: str, client: httpx.AsyncClient) -> list[ShopifyProduct]:
    last_page: int = await find_last_page(base_url, client)
    all_products: list[ShopifyProduct] = []
    async with asyncio.TaskGroup() as tg:
        tasks: list[asyncio.Task[ShopifyResponse]] = [
            tg.create_task(
                fetch_product(client, f"{base_url}/products.json?page={page}")
            )
            for page in range(1, last_page + 1)
        ]
    for task in tasks:
        all_products.extend(task.result().products)
    return all_products
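One piece of the plan above that the snippet doesn’t show is the caching step. The benchmark numbers below assume the last page number is remembered between runs; here’s a minimal sketch of one way to do that with a small JSON file (the file name and helper names are my own):

import json
from pathlib import Path

# Hypothetical on-disk cache: base_url -> last known page number
CACHE_FILE = Path("last_page_cache.json")


def load_cache() -> dict[str, int]:
    if CACHE_FILE.exists():
        return json.loads(CACHE_FILE.read_text())
    return {}


def save_cache(cache: dict[str, int]) -> None:
    CACHE_FILE.write_text(json.dumps(cache))


async def find_last_page_cached(base_url: str, client: httpx.AsyncClient) -> int:
    cache = load_cache()
    if base_url not in cache:
        cache[base_url] = await find_last_page(base_url, client)
        save_cache(cache)
    return cache[base_url]

If a store adds enough products to spill onto a new page, the cached value goes stale, so in practice you’d want to expire or re-verify it periodically.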
Here are the numbers:
Sequential Approach: We scrape page by page until we find the last one
Binary Search Approach: We find the last page with binary search, then concurrently scrape every page
Math:
I tested 10 websites at once averaging 23 pages (600 products) each.
The first time, the sequential approach took an average of 3.10s, while the binary approach took almost double: 5.926s.
On the following runs, once that last page was cached, the binary approach went from 5.94s to 0.23s, a 25.8x speedup!
On smaller jobs, the upfront investment in binary search pays off after 2 runs; on larger jobs it wipes the floor with the sequential approach.
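Those break-even numbers are easy to sanity-check with the rounded averages above:

sequential_avg = 3.10   # seconds, paid on every run
binary_first = 5.93     # seconds, first run (binary search + concurrent fetch)
binary_cached = 0.23    # seconds, later runs (last page already cached)

for runs in range(1, 4):
    seq_total = sequential_avg * runs
    bin_total = binary_first + binary_cached * (runs - 1)
    print(f"{runs} run(s): sequential {seq_total:.2f}s vs binary {bin_total:.2f}s")

# 1 run(s): sequential 3.10s vs binary 5.93s
# 2 run(s): sequential 6.20s vs binary 6.16s   <- binary is already ahead
# 3 run(s): sequential 9.30s vs binary 6.39s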
Stress test
I also found a site with one of the biggest catalogs I could track down (25,000 products across 833 pages): https://pelacase.ca/.
| | Sequential | Binary |
|---|---|---|
| First Run Time | 120.1s | 11.56s |
| Second Run (cached) | 101.4s | 5.26s |
Important Nuance
Custom Shopify storefronts sometimes turn this endpoint off, so check that it responds before you build a scraper around it.
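A cheap way to check is to probe page 1 and confirm you actually get JSON back before committing to a full scrape. A minimal sketch (the helper name is my own; disabled stores often respond with an HTML page, a redirect, or a non-200 status):

async def products_json_enabled(base_url: str, client: httpx.AsyncClient) -> bool:
    # Probe a single product to see whether the endpoint responds with JSON
    response = await client.get(f"{base_url}/products.json?limit=1", follow_redirects=True)
    if response.status_code != 200:
        return False
    return "application/json" in response.headers.get("content-type", "")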
"Fancy algorithms are slow when n is small, and n is usually small."
- Rob Pike
Don’t prematurely optimize what you don’t need to. A simpler solution to the many-pages problem, instead of concurrent pagination plus binary search, is to crank the per-page item limit up to something ridiculous. Then you only have to scrape one page (with its own set of tradeoffs).
www.zoologistperfumes.com/products.json?limit=1000000&page=1
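Reusing the fetch_product helper from earlier, that shortcut is a single request (the function name here is my own). One caveat: many storefronts cap limit at around 250 items per page, so verify the response actually contains the whole catalog before relying on this.

async def scrape_single_page(base_url: str, client: httpx.AsyncClient) -> list[ShopifyProduct]:
    # Ask for an absurd limit and hope the store honours it; if the store
    # caps the limit (250 is common), this only returns the first chunk
    response = await fetch_product(
        client=client,
        url=f"{base_url}/products.json?limit=1000000&page=1",
    )
    return response.products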
Binary search is just another tool. You won’t always need it, but knowing it exists changes which problems you can solve efficiently when the time comes.
Conclusion
That’s it. You know:
- The backdoor /products.json endpoint
- How to scrape multiple sites concurrently (TaskGroups, async)
- How to optimize pagination with binary search
More importantly, you learned that to optimize something, you often need to invest time illuminating its boundaries.
So pick a Shopify site, run the code, and watch it pull hundreds of products/second; add binary search and watch it get 25x faster.
Questions about web scraping or async Python? Drop a comment below.
Follow for more tutorials on building real production tools. :)
