In this article, we'll explain how to scrape BestBuy, one of the most popular electronics retailers in the United States. We'll scrape different data types from product, search, review, and sitemap pages. Additionally, we'll employ a wide range of web scraping techniques, such as hidden JSON data, hidden APIs, and HTML and XML parsing. So, this guide serves as a comprehensive web scraping introduction!
Latest BestBuy Scraper Code
This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect, and here's a good summary of what not to do:
- Do not scrape at rates that could damage the website.
- Do not scrape data that's not available publicly.
- Do not store PII of EU citizens who are protected by GDPR.
- Do not repurpose entire public datasets, which can be illegal in some countries.
Scrapfly does not offer legal advice but these are good general rules to follow in web scraping, and for more you should consult a lawyer.
Why Scrape BestBuy?
Web scraping BestBuy yields a wealth of data that can empower both businesses and retail buyers in different ways:
Competitive Analysis
The market dynamics are aggressive and fast-changing, making it challenging for businesses to remain competitive. Scraping BestBuy allows businesses to compare their competitors' pricing, sales, and reviews. This provides a better understanding of current trends and interests, helping them stay up-to-date and attract new customers.
Customer Sentiment Analysis
BestBuy hosts thousands of reviews for different products. Web scraping BestBuy's reviews can be used to run sentiment analysis research, which provides useful insights into customers' satisfaction, preferences, and feedback.
Empowered Navigation
Manually browsing the excessive number of similar products on BestBuy can be tedious. Instead, retail buyers can web scrape BestBuy to compare many products quickly, allowing them to identify niche markets and undervalued products.
For further details, refer to our introduction on web scraping use cases.
Setup
To web scrape BestBuy, we'll use Python with a few community libraries:
- httpx: To request BestBuy pages and get the data as HTML, XML, or JSON.
- parsel: To parse the HTML and XML data using selectors, such as XPath and CSS.
- JMESPath: To refine and parse the BestBuy JSON datasets for the useful data only.
- loguru: To monitor and log our BestBuy scraper in beautiful terminal outputs.
- asyncio: To increase the web scraping speed by running the code asynchronously.
Since asyncio comes pre-installed in Python, we'll only have to install the other packages using the following pip command:
pip install httpx parsel jmespath loguru
How To Discover BestBuy Pages?
Scraping sitemaps is an efficient way to discover thousands of organized URLs. They are provided for search engine crawlers to index the pages, which we can use to discover web scraping targets on a website.
BestBuy's sitemaps can be found at bestbuy.com/robots.txt. It's a text file that provides crawling instructions along with the website's sitemap directory:
Sitemap: https://sitemaps.bestbuy.com/sitemaps_discover_learn.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_pdp.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_promos.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_qna.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_rnr.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_search_plps.xml
Sitemap: https://sitemaps.bestbuy.com/sitemaps_standalone_qa.xml
Sitemap: https://www.bestbuy.com/sitemap.xml
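The Sitemap entries above can be pulled out of robots.txt programmatically rather than copied by hand. Here's a minimal sketch; extract_sitemaps is a hypothetical helper name, and fetching the file itself would use the httpx client defined later in this guide:

```python
def extract_sitemaps(robots_txt: str) -> list[str]:
    """pull the Sitemap: entries out of a robots.txt body"""
    sitemaps = []
    for line in robots_txt.splitlines():
        if line.lower().startswith("sitemap:"):
            # keep everything after the first colon, stripped of whitespace
            sitemaps.append(line.split(":", 1)[1].strip())
    return sitemaps

robots = """User-agent: *
Disallow: /cart
Sitemap: https://sitemaps.bestbuy.com/sitemaps_promos.xml
Sitemap: https://www.bestbuy.com/sitemap.xml"""
print(extract_sitemaps(robots))
# ['https://sitemaps.bestbuy.com/sitemaps_promos.xml', 'https://www.bestbuy.com/sitemap.xml']
```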
Each of the above sitemaps represents a group of related page URLs found in an XML file that's compressed into a gzip file to reduce its size:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://sitemaps.bestbuy.com/sitemaps_pdp.0000.xml.gz</loc><lastmod>2024-03-08T10:16:14.901109+00:00</lastmod></sitemap>
<sitemap><loc>https://sitemaps.bestbuy.com/sitemaps_pdp.0001.xml.gz</loc><lastmod>2024-03-08T10:16:14.901109+00:00</lastmod></sitemap>
</sitemapindex>
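Note that the file above is a sitemap index pointing at child .gz sitemaps rather than at page URLs directly, so the child locations need to be collected first. A sketch using only the standard library's xml.etree (parse_sitemap_index is a hypothetical helper; the namespace handling is the tricky part):

```python
import xml.etree.ElementTree as ET

# every <sitemap><loc> element lives in the sitemaps.org namespace
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap_index(xml_text: str) -> list[str]:
    """collect child sitemap URLs from a <sitemapindex> document"""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS)]

index = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://sitemaps.bestbuy.com/sitemaps_pdp.0000.xml.gz</loc></sitemap>
<sitemap><loc>https://sitemaps.bestbuy.com/sitemaps_pdp.0001.xml.gz</loc></sitemap>
</sitemapindex>"""
print(parse_sitemap_index(index))
```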
The above gz file looks like the following after extraction:
<?xml version='1.0' encoding='utf-8'?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"><url><loc>https://www.bestbuy.com/site/aventon-aventure-step-over-ebike-w-45-mile-max-operating-range-and-28-mph-max-speed-medium-fire-black/6487149.p?skuId=6487149</loc></url>
<url><loc>https://www.bestbuy.com/site/detective-story-1951/34804554.p?skuId=34804554</loc></url>
<url><loc>https://www.bestbuy.com/site/flowers-lp-vinyl/35944053.p?skuId=35944053</loc></url>
<url><loc>https://www.bestbuy.com/site/apple-iphone-15-pro-max-1tb-natural-titanium-verizon/6525500.p?skuId=6525500</loc></url>
<url><loc>https://www.bestbuy.com/site/geeni-dual-outlet-outdoor-wi-fi-smart-plug-gray/6388590.p?skuId=6388590</loc></url>
<url><loc>https://www.bestbuy.com/site/dynasty-the-sixth-season-vol-1-4-discs-dvd/20139655.p?skuId=20139655</loc></url>
To scrape BestBuy's sitemaps, we'll request the compressed XML file, decode it, and parse it for the URLs. For this example, we'll use the promotions sitemap.
Python:
import asyncio
import json
import gzip
from typing import List
from httpx import AsyncClient, Response
from parsel import Selector
from loguru import logger as log
# initialize an async httpx client
client = AsyncClient(
    # enable http2
    http2=True,
    # add basic browser-like headers to prevent getting blocked
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
    },
)

def parse_sitemaps(response: Response) -> List[str]:
    """parse links for bestbuy sitemaps"""
    # decode the .gz file
    xml = str(gzip.decompress(response.content), 'utf-8')
    selector = Selector(xml)
    data = []
    for url in selector.xpath("//url/loc/text()"):
        data.append(url.get())
    return data

async def scrape_sitemaps(url: str) -> List[str]:
    """scrape link data from bestbuy sitemaps"""
    response = await client.get(url)
    promo_urls = parse_sitemaps(response)
    log.success(f"scraped {len(promo_urls)} urls from sitemaps")
    return promo_urls
ScrapFly:
import asyncio
import json
import gzip
from typing import List
from parsel import Selector
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")
def parse_sitemaps(response: ScrapeApiResponse) -> List[str]:
    """parse links for bestbuy sitemaps"""
    # decode the .gz file
    bytes_data = response.scrape_result['content'].getvalue()
    xml = str(gzip.decompress(bytes_data), 'utf-8')
    selector = Selector(xml)
    data = []
    for url in selector.xpath("//url/loc/text()"):
        data.append(url.get())
    return data

async def scrape_sitemaps(url: str) -> List[str]:
    """scrape link data from bestbuy sitemaps"""
    response = await SCRAPFLY.async_scrape(ScrapeConfig(url, country="US"))
    promo_urls = parse_sitemaps(response)
    log.success(f"scraped {len(promo_urls)} urls from sitemaps")
    return promo_urls
Run the code:
async def run():
    promo_urls = await scrape_sitemaps(
        url="https://sitemaps.bestbuy.com/sitemaps_promos.0000.xml.gz"
    )
    # save the data to a JSON file
    with open("promos.json", "w", encoding="utf-8") as file:
        json.dump(promo_urls, file, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    asyncio.run(run())
In the above code, we define an httpx client with common browser headers to minimize the chances of getting blocked. Additionally, we define two functions; let's break them down:
- scrape_sitemaps: To request the sitemap URL using the defined httpx client.
- parse_sitemaps: To decode the gz file into its XML content and then parse the XML for the URLs using an XPath selector.
Here is a sample output of the results we got:
[
"https://www.bestbuy.com/site/promo/4k-capable-memory-cards",
"https://www.bestbuy.com/site/promo/all-total-by-verizon",
"https://www.bestbuy.com/site/promo/shop-featured-intel-evo",
"https://www.bestbuy.com/site/promo/laser-heat-therapy",
"https://www.bestbuy.com/site/promo/save-on-select-grills",
....
]
For further details on scraping and discovering sitemaps, refer to our dedicated guide.
How To Scrape BestBuy Search Pages?
Let's start with the first part of our BestBuy scraper code: search pages. Search for any product on the website, like the "macbook" keyword, and you will get a page that looks like the following:
To scrape BestBuy search pages, we'll request the search page URL and then parse the HTML. First, let's start with the parsing logic.
Python:
def parse_search(response: Response) -> Dict:
    """parse search data from search pages"""
    selector = Selector(response.text)
    data = []
    for item in selector.xpath("//ol[@class='sku-item-list']/li[@class='sku-item']"):
        name = item.xpath(".//h4[@class='sku-title']/a/text()").get()
        link = item.xpath(".//h4[@class='sku-title']/a/@href").get()
        price = item.xpath(".//div[@data-testid='customer-price']/span/text()").get()
        price = int(price[price.index("$") + 1:].replace(",", "").replace(".", "")) // 100 if price else None
        original_price = item.xpath(".//div[@data-testid='regular-price']/span/text()").get()
        original_price = int(original_price[original_price.index("$") + 1:].replace(",", "").replace(".", "")) // 100 if original_price else None
        sku = item.xpath(".//div[@class='sku-model']/div[2]/span[@class='sku-value']/text()").get()
        model = item.xpath(".//div[@class='sku-model']/div[1]/span[@class='sku-value']/text()").get()
        rating = item.xpath(".//p[contains(text(),'out of 5')]/text()").get()
        rating_count = item.xpath(".//span[contains(@class,'c-reviews')]/text()").get()
        is_sold_out = bool(item.xpath(".//strong[text()='Sold Out']").get())
        image = item.xpath(".//img[contains(@class,'product-image')]/@src").get()
        data.append({
            "name": name,
            "link": "https://www.bestbuy.com" + link,
            "image": image,
            "sku": sku,
            "model": model,
            "price": price,
            "original_price": original_price,
            "save": f"{round((1 - price / original_price) * 100, 2):.2f}%" if price and original_price else None,
            "rating": float(rating[rating.index(" "):rating.index(" out")].strip()) if rating else None,
            "rating_count": int(rating_count.replace("(", "").replace(")", "").replace(",", "")) if rating_count and rating_count != "Not Yet Reviewed" else None,
            "is_sold_out": is_sold_out,
        })
    total_count = selector.xpath("//span[@class='item-count']/text()").get()
    total_count = -(int(total_count.split(" ")[0]) // -18)  # convert total items to pages (ceiling division), 18 items per page
    return {"data": data, "total_count": total_count}
ScrapFly
def parse_search(response: ScrapeApiResponse) -> Dict:
    """parse search data from search pages"""
    selector = response.selector
    data = []
    for item in selector.xpath("//ol[@class='sku-item-list']/li[@class='sku-item']"):
        name = item.xpath(".//h4[@class='sku-title']/a/text()").get()
        link = item.xpath(".//h4[@class='sku-title']/a/@href").get()
        price = item.xpath(".//div[@data-testid='customer-price']/span/text()").get()
        price = int(price[price.index("$") + 1:].replace(",", "").replace(".", "")) // 100 if price else None
        original_price = item.xpath(".//div[@data-testid='regular-price']/span/text()").get()
        original_price = int(original_price[original_price.index("$") + 1:].replace(",", "").replace(".", "")) // 100 if original_price else None
        sku = item.xpath(".//div[@class='sku-model']/div[2]/span[@class='sku-value']/text()").get()
        model = item.xpath(".//div[@class='sku-model']/div[1]/span[@class='sku-value']/text()").get()
        rating = item.xpath(".//p[contains(text(),'out of 5')]/text()").get()
        rating_count = item.xpath(".//span[contains(@class,'c-reviews')]/text()").get()
        is_sold_out = bool(item.xpath(".//strong[text()='Sold Out']").get())
        image = item.xpath(".//img[contains(@class,'product-image')]/@src").get()
        data.append({
            "name": name,
            "link": "https://www.bestbuy.com" + link,
            "image": image,
            "sku": sku,
            "model": model,
            "price": price,
            "original_price": original_price,
            "save": f"{round((1 - price / original_price) * 100, 2):.2f}%" if price and original_price else None,
            "rating": float(rating[rating.index(" "):rating.index(" out")].strip()) if rating else None,
            "rating_count": int(rating_count.replace("(", "").replace(")", "").replace(",", "")) if rating_count and rating_count != "Not Yet Reviewed" else None,
            "is_sold_out": is_sold_out,
        })
    total_count = selector.xpath("//span[@class='item-count']/text()").get()
    total_count = -(int(total_count.split(" ")[0]) // -18)  # convert total items to pages (ceiling division), 18 items per page
    return {"data": data, "total_count": total_count}
Here, we define a parse_search function, which does the following:
- Iterates over the product boxes on the HTML.
- Parses each product's data, such as the name, price, link, etc.
- Gets the total number of search pages available and returns the search data.
Next, we'll utilize the above parsing logic while sending requests to scrape and crawl the search pages.
Python:
import asyncio
import json
import urllib.parse
from typing import List, Dict, Union
from httpx import AsyncClient, Response
from parsel import Selector
from urllib.parse import urlencode, quote_plus
from loguru import logger as log
# initialize an async httpx client
client = AsyncClient(
    # enable http2
    http2=True,
    # add basic browser-like headers to prevent getting blocked
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Cookie": "intl_splash=false"
    },
)

def parse_search(response: Response) -> Dict:
    """parse search data from search pages"""
    # rest of the function logic

async def scrape_search(
    search_query: str, sort: str = None, max_pages: int = None
) -> List[Dict]:
    """scrape search data from bestbuy search
    sort: "-bestsellingsort" or "-Best-Discount"; None = best match
    """
    def form_search_url(page_number: int):
        """form the search url"""
        base_url = "https://www.bestbuy.com/site/searchpage.jsp?"
        # search parameters (urlencode quotes the values; omit sp when no sort is chosen)
        params = {
            "st": search_query,
            "cp": page_number,
        }
        if sort:
            params["sp"] = sort
        return base_url + urlencode(params)

    first_page = await client.get(form_search_url(1))
    data = parse_search(first_page)
    search_data = data["data"]
    total_count = data["total_count"]
    # get the number of total search pages to scrape
    if max_pages and max_pages < total_count:
        total_count = max_pages
    log.info(f"scraping search pagination, {total_count - 1} more pages")
    # add the remaining pages to a scraping list to scrape them concurrently
    to_scrape = [
        client.get(form_search_url(page_number))
        for page_number in range(2, total_count + 1)
    ]
    for response in asyncio.as_completed(to_scrape):
        response = await response
        data = parse_search(response)["data"]
        search_data.extend(data)
    log.success(f"scraped {len(search_data)} products from search pages")
    return search_data
ScrapFly
import asyncio
import json
from typing import Dict, List, Union
from urllib.parse import urlencode, quote_plus
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")
def parse_search(response: ScrapeApiResponse) -> Dict:
    """parse search data from search pages"""
    # rest of the function logic

async def scrape_search(
    search_query: str, sort: str = None, max_pages: int = None
) -> List[Dict]:
    """scrape search data from bestbuy search
    sort: "-bestsellingsort" or "-Best-Discount"; None = best match
    """
    def form_search_url(page_number: int):
        """form the search url"""
        base_url = "https://www.bestbuy.com/site/searchpage.jsp?"
        # search parameters (urlencode quotes the values; omit sp when no sort is chosen)
        params = {
            "st": search_query,
            "cp": page_number,
        }
        if sort:
            params["sp"] = sort
        return base_url + urlencode(params)

    first_page = await SCRAPFLY.async_scrape(ScrapeConfig(form_search_url(1), country="US", asp=True))
    data = parse_search(first_page)
    search_data = data["data"]
    total_count = data["total_count"]
    # get the number of total search pages to scrape
    if max_pages and max_pages < total_count:
        total_count = max_pages
    log.info(f"scraping search pagination, {total_count - 1} more pages")
    # add the remaining pages to a scraping list to scrape them concurrently
    to_scrape = [
        ScrapeConfig(form_search_url(page_number), country="US", asp=True)
        for page_number in range(2, total_count + 1)
    ]
    async for response in SCRAPFLY.concurrent_scrape(to_scrape):
        data = parse_search(response)["data"]
        search_data.extend(data)
    log.success(f"scraped {len(search_data)} products from search pages")
    return search_data
Run the code:
async def run():
    search_data = await scrape_search(
        search_query="macbook",
        max_pages=3
    )
    # save the results to a JSON file
    with open("search.json", "w", encoding="utf-8") as file:
        json.dump(search_data, file, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    asyncio.run(run())
Let's break down the execution flow of the above scrape_search function:
- Form a search URL based on the search keyword, sorting option, and page number.
- Request the search URL and parse it with the parse_search function.
- Get the number of pagination pages to scrape using the max_pages parameter.
- Add the remaining pagination URLs to a list and request them concurrently.
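The URL-forming step is worth seeing on its own: urlencode takes care of quoting the query (spaces become +), and the sp parameter only matters when a sort is chosen. A quick sketch of the same helper, standalone and with the sp parameter dropped when no sort is given:

```python
from urllib.parse import urlencode

def form_search_url(search_query: str, page_number: int, sort: str = None) -> str:
    """form a bestbuy search URL; sort of None means best match"""
    params = {"st": search_query, "cp": page_number}
    if sort:
        params["sp"] = sort  # e.g. "-bestsellingsort" or "-Best-Discount"
    return "https://www.bestbuy.com/site/searchpage.jsp?" + urlencode(params)

print(form_search_url("macbook pro", 2))
# https://www.bestbuy.com/site/searchpage.jsp?st=macbook+pro&cp=2
```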
The above BestBuy scraping code will extract product data from three search pages. Here is what the results should look like:
[
{
"name": "MacBook Pro 13.3\" Laptop - Apple M2 chip - 24GB Memory - 1TB SSD (Latest Model) - Silver",
"link": "https://www.bestbuy.com/site/macbook-pro-13-3-laptop-apple-m2-chip-24gb-memory-1tb-ssd-latest-model-silver/6382795.p?skuId=6382795",
"image": "https://pisces.bbystatic.com/image2/BestBuy_US/images/products/6382/6382795_sd.jpg;maxHeight=200;maxWidth=300",
"sku": "6382795",
"model": "MNEX3LL/A",
"price": 1499,
"original_price": 2099,
"save": "28.59%",
"rating": 4.8,
"rating_count": 4,
"is_sold_out": false
},
....
]
The above code can scrape the product data that is visible on the search pages. However, it can be extended with crawling logic to scrape the full details of each product from its respective URL. For further details on crawling while scraping, refer to our dedicated guide.
How To Scrape BestBuy Product Pages?
Let's add support for scraping product pages to our BestBuy scraper. Before we start, let's have a look at what product pages look like. Go to any product page on the website, like this one, and you will get a page similar to this:
Data on product pages is comprehensive but scattered across the page, making it challenging to scrape with selectors alone. Instead, we'll extract it as JSON datasets from script tags. To locate these script tags, follow the steps below:
- Open the browser developer tools by pressing the F12 key.
- Search for the script tags using the selector //script[@type='application/json'].
After following the above steps, you will find several script tags that include JSON data. However, we are only interested in a few of them:
The above JSON datasets contain the same data seen on the page, but before it gets rendered into the HTML. This is often known as hidden web data.
To scrape the product data, we will select the script tags containing the JSON data and parse them.
Python:
import jmespath
import asyncio
import json
from typing import List, Dict
from httpx import AsyncClient, Response
from parsel import Selector
from loguru import logger as log
# initialize an async httpx client
client = AsyncClient(
    # enable http2
    http2=True,
    # add basic browser-like headers to prevent getting blocked
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Cookie": "intl_splash=false"
    },
)

def refine_product(data: Dict) -> Dict:
    """refine the JSON product data"""
    parsed_product = {}
    specifications = data["shop-specifications"]["specifications"]["categories"]
    pricing = data["pricing"]["app"]["data"]["skuPriceDomain"]
    ratings = jmespath.search(
        """{
            featureRatings: aggregateSecondaryRatings,
            positiveFeatures: distillation.positiveFeatures[].{name: name, score: representativeQuote.score, totalReviewCount: totalReviewCount},
            negativeFeatures: distillation.negativeFeatures[].{name: name, score: representativeQuote.score, totalReviewCount: totalReviewCount}
        }""",
        data["reviews"]["app"],
    )
    faqs = []
    for item in data["faqs"]["app"]["questions"]["results"]:
        result = jmespath.search(
            """{
                sku: sku,
                questionTitle: questionTitle,
                answersForQuestion: answersForQuestion[].answerText
            }""",
            item,
        )
        faqs.append(result)
    # define the final parsed product
    parsed_product["specifications"] = specifications
    parsed_product["pricing"] = pricing
    parsed_product["ratings"] = ratings
    parsed_product["faqs"] = faqs
    return parsed_product

def parse_product(response: Response) -> Dict:
    """parse product data from bestbuy product pages"""
    selector = Selector(response.text)
    data = {}
    data["shop-specifications"] = json.loads(selector.xpath("//script[contains(@id, 'shop-specifications')]/text()").get())
    data["faqs"] = json.loads(selector.xpath("//script[contains(@id, 'content-question')]/text()").get())
    data["pricing"] = json.loads(selector.xpath("//script[contains(@id, 'pricing-price')]/text()").get())
    data["reviews"] = json.loads(selector.xpath("//script[contains(@id, 'ratings-and-reviews')]/text()").get())
    parsed_product = refine_product(data)
    return parsed_product

async def scrape_products(urls: List[str]) -> List[Dict]:
    """scrape product data from bestbuy product pages"""
    to_scrape = [client.get(url) for url in urls]
    data = []
    for response in asyncio.as_completed(to_scrape):
        response = await response
        product_data = parse_product(response)
        data.append(product_data)
    log.success(f"scraped {len(data)} products from product pages")
    return data
ScrapFly
import json
import jmespath
from typing import Dict, List
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")
def refine_product(data: Dict) -> Dict:
    """refine the JSON product data"""
    parsed_product = {}
    specifications = data["shop-specifications"]["specifications"]["categories"]
    pricing = data["pricing"]["app"]["data"]["skuPriceDomain"]
    ratings = jmespath.search(
        """{
            featureRatings: aggregateSecondaryRatings,
            positiveFeatures: distillation.positiveFeatures[].{name: name, score: representativeQuote.score, totalReviewCount: totalReviewCount},
            negativeFeatures: distillation.negativeFeatures[].{name: name, score: representativeQuote.score, totalReviewCount: totalReviewCount}
        }""",
        data["reviews"]["app"],
    )
    faqs = []
    for item in data["faqs"]["app"]["questions"]["results"]:
        result = jmespath.search(
            """{
                sku: sku,
                questionTitle: questionTitle,
                answersForQuestion: answersForQuestion[].answerText
            }""",
            item,
        )
        faqs.append(result)
    # define the final parsed product
    parsed_product["specifications"] = specifications
    parsed_product["pricing"] = pricing
    parsed_product["ratings"] = ratings
    parsed_product["faqs"] = faqs
    return parsed_product

def parse_product(response: ScrapeApiResponse) -> Dict:
    """parse product data from bestbuy product pages"""
    selector = response.selector
    data = {}
    data["shop-specifications"] = json.loads(selector.xpath("//script[contains(@id, 'shop-specifications')]/text()").get())
    data["faqs"] = json.loads(selector.xpath("//script[contains(@id, 'content-question')]/text()").get())
    data["pricing"] = json.loads(selector.xpath("//script[contains(@id, 'pricing-price')]/text()").get())
    data["reviews"] = json.loads(selector.xpath("//script[contains(@id, 'ratings-and-reviews')]/text()").get())
    parsed_product = refine_product(data)
    return parsed_product

async def scrape_products(urls: List[str]) -> List[Dict]:
    """scrape product data from bestbuy product pages"""
    to_scrape = [ScrapeConfig(url, country="US", asp=True) for url in urls]
    data = []
    async for response in SCRAPFLY.concurrent_scrape(to_scrape):
        product_data = parse_product(response)
        data.append(product_data)
    log.success(f"scraped {len(data)} products from product pages")
    return data
Run the code:
async def run():
    data = await scrape_products(
        urls=[
            "https://www.bestbuy.com/site/macbook-air-13-3-laptop-apple-m1-chip-8gb-memory-256gb-ssd-gold-gold/6418599.p",
            "https://www.bestbuy.com/site/apple-macbook-air-15-laptop-m2-chip-8gb-memory-256gb-ssd-midnight/6534606.p",
            "https://www.bestbuy.com/site/macbook-pro-13-3-laptop-apple-m2-chip-8gb-memory-256gb-ssd-latest-model-silver/6509654.p"
        ]
    )
    with open("product.json", "w", encoding="utf-8") as file:
        json.dump(data, file, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    asyncio.run(run())
Let's break down the above BestBuy scraping code:
- refine_product: To refine the product JSON datasets with JMESPath, excluding the unnecessary fields and keeping the useful ones.
- parse_product: To parse the product's hidden JSON data from the HTML with XPath.
- scrape_products: To request the product page URLs concurrently and parse the HTML output with the parse_product function.
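The code above relies on parsel's XPath to find the script tags, but the hidden web data technique itself needs nothing beyond the standard library. A stripped-down sketch on a made-up HTML fragment (the tag id and JSON shape only mimic the real page):

```python
import json
import re

html = """<html><body>
<script type="application/json" id="pricing-price-123">
{"app": {"data": {"skuPriceDomain": {"currentPrice": 999.99}}}}
</script>
</body></html>"""

# grab the body of the script tag whose id contains "pricing-price"
match = re.search(
    r'<script[^>]*id="[^"]*pricing-price[^"]*"[^>]*>(.*?)</script>',
    html,
    re.DOTALL,
)
data = json.loads(match.group(1))
print(data["app"]["data"]["skuPriceDomain"]["currentPrice"])  # 999.99
```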
The output is a comprehensive JSON dataset that looks like the following:
[
{
"specifications": [
{
"displayName": "Key Specs",
"specifications": [
{
"displayName": "Screen Size",
"value": "13.3 inches",
"definition": "Size of the screen, measured diagonally from corner to corner.",
"id": "TQqJBgOyVv"
}
....
]
},
....
],
"pricing": {
"skuId": "6418599",
"regularPrice": 999.99,
"currentPrice": 999.99,
"priceEventType": "regular",
"totalSavings": 0,
"totalSavingsPercent": 0,
"totalPaidMemberSavings": 0,
"totalNonPaidMemberSavings": 0,
"customerPrice": 999.99,
"isMAP": false,
"isPriceMatchGuarantee": true,
"offerQualifiers": [
{
"offerId": "634974",
"offerName": "Apple - Apple Music 3 Month Trial GWP",
"offerVersion": 662398,
"offerDiscountType": "Free",
"id": 634974002,
"comOfferType": "FREEITEM",
"comRuleType": "10",
"instanceId": 5,
"offerRevocableOnReturns": true,
"excludeFromBundleBreakage": false
},
....
],
"giftSkus": [
{
"skuId": "6484511",
"quantity": 1,
"offerId": "465099",
"savings": 0,
"isRequiredWithOffer": false
},
....
],
"totalGiftSavings": 0,
"gspUnitPrice": 999.99,
"financeOption": {
"offerId": "384913",
"financeCodeName": "12-Month Financing",
"financeCode": 7,
"rank": 8,
"financeTerm": 12,
"monthlyPayment": 83.34,
"monthlyPaymentIncludingTax": 83.34,
"defaultPlan": true,
"priority": 1,
"planType": "Deferred",
"rate": 0,
"totalCost": 999.99,
"termsAndConditions": "NO INTEREST IF PAID IN FULL WITHIN 12 MONTHS. If the deferred interest balance is not paid in full by the end of the promotional period, interest will be charged from the purchase date at rates otherwise applicable under your Card Agreement. Min. payments required. See Card Agreement for details.",
"totalCostIncludingTax": 999.99,
"financeCodeDescLong": "No interest if paid in full within 12 months (no points)"
},
"financeOptions": [
{
"offerId": "384913",
"financeCodeName": "12-Month Financing",
"financeCode": 7,
"rank": 8,
"financeTerm": 12,
"monthlyPayment": 83.34,
"monthlyPaymentIncludingTax": 83.34,
"defaultPlan": true,
"priority": 1,
"planType": "Deferred",
"rate": 0,
"totalCost": 999.99,
"termsAndConditions": "NO INTEREST IF PAID IN FULL WITHIN 12 MONTHS. If the deferred interest balance is not paid in full by the end of the promotional period, interest will be charged from the purchase date at rates otherwise applicable under your Card Agreement. Min. payments required. See Card Agreement for details.",
"totalCostIncludingTax": 999.99,
"financeCodeDescLong": "No interest if paid in full within 12 months (no points)"
}
],
....
},
"ratings": {
"featureRatings": [
{
"attribute": "BatteryLife",
"attributeLabel": "Battery Life",
"avg": 4.856636035826451,
"count": 17194
},
....
],
"positiveFeatures": [
{
"name": "Speed",
"score": 4,
"totalReviewCount": 2386
},
....
],
"negativeFeatures": [
{
"name": "Touch screen",
"score": 16,
"totalReviewCount": 168
},
....
]
},
"faqs": [
{
"sku": "6418599",
"questionTitle": "Does this MacBook have a built-in HDMI port?",
"answersForQuestion": [
"No. It has 2 Thunderbolt 3 ports that you can get an adapter for to give you HDMI.",
"No. However, you can connect your MacBook Air to HDMI using the a USB-C Digital AV Multiport Adapter. (sold separately)",
"I am afraid not for Mac book air and pro m1 2020 it has only the thunderbolts 2 points"
]
},
....
]
}
]
Note that the HTML structure of the BestBuy product pages differs based on product type and category. Therefore, the above product parsing logic should be adjusted for other product types.
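Since those script tags aren't guaranteed to exist on every product type, a defensive variant of the json.loads step avoids crashing on a missing or unexpected tag. A sketch (load_hidden is a hypothetical helper, not part of the code above):

```python
import json
from typing import Optional

def load_hidden(script_text: Optional[str]) -> Optional[dict]:
    """parse a hidden-data script tag's text, tolerating missing or broken tags"""
    if script_text is None:
        return None  # the XPath found no matching script tag
    try:
        return json.loads(script_text)
    except json.JSONDecodeError:
        return None  # the tag exists but holds something other than JSON

print(load_hidden('{"a": 1}'))  # {'a': 1}
print(load_hidden(None))  # None
print(load_hidden("not json"))  # None
```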
Cool! The above BestBuy scraping code can extract the full details of each product. However, it lacks the product reviews - let's scrape them in the next section!
How to Scrape BestBuy Review Pages?
Reviews on BestBuy can be found on each product page:
The above review data are split into two categories:
- Product ratings: review and rating summaries attached to each product's specification, which we scraped earlier from the product page itself.
- User reviews: detailed user reviews of the product, which we'll scrape in this section.
To scrape BestBuy reviews, we'll utilize the hidden reviews API. To locate this API, follow the below steps:
- Open the browser developer tools by pressing the F12 key.
- Select the network tab and filter by Fetch/XHR requests.
- Sort the reviews using the sort option or click on the next review page.
After following the above steps, you will find the reviews API recorded on the browser:
The above API is called in the background by the browser, and its response is then rendered into the HTML. The request can be copied as cURL and imported into HTTP clients like Postman.
To scrape the product reviews, we'll request the above API and paginate it.
Python:
import asyncio
import json
from typing import List, Dict
from httpx import AsyncClient, Response
from loguru import logger as log
# initialize an async httpx client
client = AsyncClient(
    # enable http2
    http2=True,
    # add basic browser-like headers to prevent getting blocked
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Cookie": "intl_splash=false"
    },
)

def parse_reviews(response: Response) -> Dict:
    """parse review data from the review API responses"""
    data = json.loads(response.text)
    total_count = data["totalPages"]
    review_data = data["topics"]
    return {"data": review_data, "total_count": total_count}

async def scrape_reviews(skuid: int, max_pages: int = None) -> List[Dict]:
    """scrape review data from the reviews API"""
    first_page = await client.get(f"https://www.bestbuy.com/ugc/v2/reviews?page=1&pageSize=20&sku={skuid}&sort=MOST_RECENT")
    data = parse_reviews(first_page)
    review_data = data["data"]
    total_count = data["total_count"]
    # get the number of total review pages to scrape
    if max_pages and max_pages < total_count:
        total_count = max_pages
    log.info(f"scraping reviews pagination, {total_count - 1} more pages")
    # add the remaining pages to a scraping list to scrape them concurrently
    to_scrape = [
        client.get(f"https://www.bestbuy.com/ugc/v2/reviews?page={page_number}&pageSize=20&sku={skuid}&sort=MOST_RECENT")
        for page_number in range(2, total_count + 1)
    ]
    for response in asyncio.as_completed(to_scrape):
        response = await response
        data = parse_reviews(response)["data"]
        review_data.extend(data)
    log.success(f"scraped {len(review_data)} reviews from the reviews API")
    return review_data
Python:
import asyncio
import json
from typing import Dict, List
from loguru import logger as log
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

SCRAPFLY = ScrapflyClient(key="Your ScrapFly API key")

def parse_reviews(response: ScrapeApiResponse) -> Dict:
    """parse review data from the review API responses"""
    data = json.loads(response.scrape_result['content'])
    total_count = data["totalPages"]
    review_data = data["topics"]
    return {"data": review_data, "total_count": total_count}

async def scrape_reviews(skuid: int, max_pages: int = None) -> List[Dict]:
    """scrape review data from the reviews API"""
    first_page = await SCRAPFLY.async_scrape(ScrapeConfig(
        f"https://www.bestbuy.com/ugc/v2/reviews?page=1&pageSize=20&sku={skuid}&sort=MOST_RECENT",
        asp=True, country="US"
    ))
    data = parse_reviews(first_page)
    review_data = data["data"]
    total_count = data["total_count"]
    # limit the number of review pages to scrape
    if max_pages and max_pages < total_count:
        total_count = max_pages
    log.info(f"scraping reviews pagination, {total_count - 1} more pages")
    # add the remaining pages to a scraping list to scrape them concurrently
    to_scrape = [
        ScrapeConfig(
            f"https://www.bestbuy.com/ugc/v2/reviews?page={page_number}&pageSize=20&sku={skuid}&sort=MOST_RECENT",
            asp=True, country="US"
        )
        for page_number in range(2, total_count + 1)
    ]
    async for response in SCRAPFLY.concurrent_scrape(to_scrape):
        data = parse_reviews(response)["data"]
        review_data.extend(data)
    log.success(f"scraped {len(review_data)} reviews from the reviews API")
    return review_data
The above part of our BestBuy scraper is fairly straightforward. We only use two functions:
- scrape_reviews: Requests the reviews API, which accepts a product SKU ID, sorting option, and page number. It starts by requesting the first page and then adds the remaining API URLs to a scraping list to request them concurrently.
- parse_reviews: Parses the JSON response of the reviews API. The response contains various review data types, but the function only extracts the user reviews.
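To make the parsing logic concrete, here is parse_reviews applied to a canned API payload. The shape assumed below (a "totalPages" field for pagination and a "topics" array holding the review objects) matches what the scraper above reads; the function is adapted to take raw JSON text instead of an HTTP response object so it runs standalone.

```python
import json

# a canned payload mimicking the reviews API response shape
sample_body = json.dumps({
    "totalPages": 42,
    "topics": [
        {"id": "abc", "rating": 5, "title": "Amazing!"},
    ],
})

def parse_reviews(body: str) -> dict:
    """extract the review objects and the page count from a raw JSON body"""
    data = json.loads(body)
    return {"data": data["topics"], "total_count": data["totalPages"]}

parsed = parse_reviews(sample_body)
print(parsed["total_count"])  # 42
print(len(parsed["data"]))    # 1
```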
Here is a sample output of the above BestBuy scraping code:
[
{
"id": "6b88383f-3830-3c78-915c-d3cf9f16596d",
"topicType": "review",
"rating": 5,
"recommended": true,
"title": "Amazing!",
"text": "An absolutly amazing console very fast and smooth.",
"author": "CocaNoot",
"positiveFeedbackCount": 0,
"negativeFeedbackCount": 0,
"commentCount": 0,
"writeCommentUrl": "/site/reviews/submission/6565065/review/337294210?campaignid=RR_&return=",
"submissionTime": "2024-03-02T10:52:07.000-06:00",
"brandResponses": [],
"badges": [
{
"badgeCode": "Incentivized",
"badgeDescription": "This reviewer received promo considerations or sweepstakes entry for writing a review.",
"badgeName": "Incentivized",
"badgeType": "Custom",
"fileName": null,
"iconText": null,
"iconPath": null,
"index": 90900
},
{
"badgeCode": "VerifiedPurchaser",
"badgeDescription": "We've verified that this content was written by people who purchased this item at Best Buy.",
"badgeName": "Verified Purchaser",
"badgeType": "Custom",
"fileName": "badgeContextual-verifiedPurchaser.jpg",
"imageURL": "https://bestbuy.ugc.bazaarvoice.com/static/3545w/badgeContextual-verifiedPurchaser.jpg",
"iconText": "Verified Purchase",
"iconPath": "/ugc-raas/ugc-common-assets/ugc-badge-verified-check.svg",
"index": 100000,
"iconUrl": "https://www.bestbuy.com/~assets/bby/_com/ugc-raas/ugc-common-assets/ugc-badge-verified-check.svg"
},
{
"badgeCode": "rewardZoneNumberV3",
"badgeDescription": "My Best Buy members receive promotional considerations or entries into drawings for writing reviews.",
"badgeName": "My Best Buy\u00ae Member",
"badgeType": "Custom",
"fileName": "badgeRewardZoneStd.gif",
"imageURL": "https://bestbuy.ugc.bazaarvoice.com/static/3545w/badgeRewardZoneStd.gif",
"iconText": "",
"iconPath": "/ugc-raas/ugc-common-assets/badge-my-bestbuy-core.svg",
"index": 100500,
"iconUrl": "https://www.bestbuy.com/~assets/bby/_com/ugc-raas/ugc-common-assets/badge-my-bestbuy-core.svg"
}
],
"photos": [
{
"photoId": "008b1a1e-ba1b-38ea-b86e-effb7c0ca162",
"caption": null,
"normalUrl": "https://photos-us.bazaarvoice.com/photo/2/cGhvdG86YmVzdGJ1eQ/e79a5ff1-e891-57fa-ae03-e9f52bb4d7c4",
"piscesUrl": "https://pisces.bbystatic.com/image2/BestBuy_US/ugc/photos/thumbnail/8db68b60f7a60bcea8f6cd1470938da9.jpg",
"thumbnailUrl": "https://photos-us.bazaarvoice.com/photo/2/cGhvdG86YmVzdGJ1eQ/bd287ee8-1c8b-52ae-9c12-4a379d7ecb24",
"reviewId": "6b88383f-3830-3c78-915c-d3cf9f16596d"
}
],
"qualityRating": null,
"valueRating": null,
"easeOfUseRating": null,
"daysOfOwnership": 70,
"pros": null,
"cons": null,
"secondaryRatings": [
{
"attribute": "Performance",
"value": 5,
"attributeLabel": "Performance",
"valueLabel": "Excellent"
},
{
"attribute": "StorageCapacity",
"value": 5,
"attributeLabel": "Storage Capacity",
"valueLabel": "Excellent"
},
{
"attribute": "Controller",
"value": 5,
"attributeLabel": "Controller",
"valueLabel": "Excellent"
}
]
},
...
]
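Once reviews are scraped, computing aggregates such as the average rating or the recommendation rate is a single pass over the list. This is a hypothetical post-processing step, not part of the scraper itself; the dictionaries below only carry the two fields the aggregates need.

```python
from statistics import mean

# a small sample of scraped review records (rating and recommended
# mirror the fields seen in the API output above)
reviews = [
    {"rating": 5, "recommended": True},
    {"rating": 4, "recommended": True},
    {"rating": 2, "recommended": False},
]

# average star rating across all reviews
avg_rating = mean(r["rating"] for r in reviews)
# fraction of reviewers who recommended the product
recommend_rate = sum(r["recommended"] for r in reviews) / len(reviews)

print(round(avg_rating, 2))      # 3.67
print(round(recommend_rate, 2))  # 0.67
```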
With this last feature, our BestBuy scraper is complete. It can scrape sitemaps, search, product, and review data.
Avoid BestBuy Scraping Blocking
We have successfully scraped BestBuy data from various pages. However, scaling up the scraping rate will get our IP address blocked. To avoid this, we'll use ScrapFly, a web scraping API that enables scraping at scale by providing:
- Residential proxies in over 50 countries - For scraping from almost any geographical location while also preventing IP address throttling and blocking.
- JavaScript rendering - For scraping dynamic web pages through cloud headless browsers without running them yourself.
- Easy to use Python and Typescript SDKs, as well as Scrapy integration.
- And much more!
ScrapFly service does the heavy lifting for you!
Here is how we can scrape without getting blocked with ScrapFly. All we have to do is replace the HTTP client with the ScrapFly client, enable the asp
parameter, and select a proxy country:
# standard web scraping code
import httpx
from parsel import Selector
response = httpx.get("some bestbuy.com URL")
selector = Selector(response.text)
# in ScrapFly becomes this 👇
from scrapfly import ScrapeConfig, ScrapflyClient
# replaces your HTTP client (httpx in this case)
scrapfly = ScrapflyClient(key="Your ScrapFly API key")
response = scrapfly.scrape(ScrapeConfig(
url="website URL",
asp=True, # enable the anti scraping protection to bypass blocking
    country="US", # set the proxy location to a specific country
render_js=True # enable rendering JavaScript (like headless browsers) to scrape dynamic content if needed
))
# use the built in Parsel selector
selector = response.selector
# access the HTML content
html = response.scrape_result['content']
FAQ
To wrap up this guide on web scraping BestBuy, let's have a look at some frequently asked questions.
Are there public APIs for BestBuy?
Yes, BestBuy offers official APIs for developers. In this guide, we scraped review data from a hidden BestBuy API; the same approach can be applied to other data sources on the website.
Are there alternatives for scraping BestBuy?
Yes, other popular e-commerce platforms include Amazon and Walmart. We have covered scraping Amazon and Walmart in previous tutorials. For more guides on similar scraping targets, refer to our #scrapeguide blog tag.
Summary
In this guide, we have explained how to scrape BestBuy. We went through a step-by-step guide on scraping BestBuy with Python for different pages on the website, which are:
- Sitemaps for BestBuy page URLs.
- Search pages for product data on search results.
- Product pages for various details, including specifications, pricing, and ratings.
- Review pages for user reviews on products.