Introduction
What is Web Scraping?
Web scraping is the process of automatically collecting data from websites using code. Instead of copying and pasting data by hand, a program visits a webpage, reads its content, and extracts exactly what you need. This is super helpful for gathering a lot of information very quickly. For this project, we will be scraping an e-commerce website called Lazada.
By the end of this article, you will learn how to:
- Understand how web scraping works on dynamic websites.
- Use Selenium to load and control a web browser.
- Extract specific product information using BeautifulSoup.
- Save your cleanly structured data into a JSON file.
Why do we need Selenium and BeautifulSoup?
Many modern websites, like Lazada, load products dynamically. This means the product details don't exist in the page's basic HTML code right away; they appear a few seconds later, rendered by JavaScript. Because of this, simple scraping tools (like the requests library) won't see them.
To fix this, we use Selenium. Selenium acts like a real user by opening a browser and waiting for the products to fully load. Once the page is ready, we use a tool called BeautifulSoup to read the HTML and grab the exact product details we want.
Prerequisites
To follow along with this project, you will need:
- A basic understanding of Python.
- A code editor (like VS Code or PyCharm).
Step 1: Create a virtual environment
A virtual environment keeps your project's dependencies isolated from the rest of your system. Open your terminal and create one using:
python -m venv venv
Activate it with:
(Windows)
venv\Scripts\activate
(Mac/Linux)
source venv/bin/activate
Step 2: Install the required libraries
Make sure you have Selenium and BeautifulSoup installed by typing these commands into your terminal:
pip install selenium
pip install beautifulsoup4
Alright, let’s do this noh!
Before we write our scraper, we need a plan. We are going to collect basic product details from Lazada's search results page. We want to grab the following for each product:
- Product Name
- Price
- Number of items sold
- Number of reviews
- Seller location
- Link to the product
All of this information is visible right on the main search cards, so we don't even need to click into each product individually.
To extract these details, we need to identify specific elements in the HTML that contain the data we want. You can use your browser’s Inspect tool to see the HTML structure.
Building the Scraper Step-by-Step
1. Getting User Input
First, our program needs to ask the user what they want to search for and how many pages they want to scrape.
search_item = input("Enter a product: ")
max_pages = int(input("Enter how many pages: "))
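One thing to watch out for: int(input(...)) will crash with a ValueError if the user types something non-numeric. A small guard like the one below (a sketch, not part of the original script) makes this friendlier:

```python
def read_page_count(raw: str) -> int:
    # int() raises ValueError on non-numeric text; fall back to 1 page.
    try:
        return max(1, int(raw))
    except ValueError:
        return 1

print(read_page_count("2"))    # 2
print(read_page_count("oops")) # 1
```

You would then call it as max_pages = read_page_count(input("Enter how many pages: ")).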
The script uses the product name to build the URL. For example, if you type "laptop", the URL becomes https://www.lazada.com.ph/tag/laptop. A loop then visits multiple pages automatically based on the number you typed in; each results page displays 40 items.
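The URL construction can be sketched like this. Note that quoting the search term handles spaces (e.g. "mechanical keyboard"), and that the page query parameter is my assumption about how Lazada paginates; check your browser's address bar on page 2 to confirm:

```python
from urllib.parse import quote

def build_page_url(search_item: str, page: int) -> str:
    # quote() escapes spaces and special characters in the search term.
    # "?page=N" is an assumed pagination scheme; verify it on the site.
    return f"https://www.lazada.com.ph/tag/{quote(search_item)}?page={page}"

print(build_page_url("laptop", 2))
# https://www.lazada.com.ph/tag/laptop?page=2
```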
2. Setting Up Selenium
Next, we tell Selenium to open Google Chrome, but we want it to be sneaky!
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")
options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(options=options)
driver.get(url)  # url built from the search term in step 1
- Headless mode: This tells the browser to run invisibly in the background. You won't actually see a window pop up, which makes the code run faster.
- AutomationControlled: Disabling this feature helps hide the fact that a bot is controlling the browser, reducing the chance that Lazada will block us.
3. Waiting for Products to Load
This is the most important step for dynamic websites! The program pauses and waits up to 10 seconds until the product container with a class name Bm3ON appears on the screen. If we don't wait, the scraper might try to read the page before the products actually show up.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.Bm3ON"))
)
html = driver.page_source
Once the page is fully loaded, BeautifulSoup reads this messy HTML and turns it into a neat structure that our code can easily search through to find the product containers.
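To make that "neat structure" concrete, here is a minimal sketch using stand-in HTML (real pages are much larger, and the Bm3ON class name comes from the article and may change over time):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source with two fake product cards.
html = """
<div class="Bm3ON"><a title="Laptop A" href="//www.lazada.com.ph/products/a"></a></div>
<div class="Bm3ON"><a title="Laptop B" href="//www.lazada.com.ph/products/b"></a></div>
"""

soup = BeautifulSoup(html, "html.parser")
items = soup.select("div.Bm3ON")  # one entry per product card
print(len(items))  # 2
```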
4. Extracting the Data
Now we loop through every single product container on the page and pull out the details.
for item in soup.select("div.Bm3ON"):
    name = None
    link = None

    anchor = item.select_one("a[title]")
    if anchor:
        name = anchor.get("title")
        href = anchor.get("href")
        if href:
            # Lazada links are often protocol-relative ("//www...")
            link = href if href.startswith("http") else "https:" + href

    price_element = item.select_one("span.ooOxS")
    price = price_element.get_text(strip=True) if price_element else None

    sold_element = item.select_one("span._1cEkb")
    sold = sold_element.get_text(strip=True) if sold_element else None

    reviews_element = item.select_one("span.qzqFw")
    reviews = reviews_element.get_text(strip=True) if reviews_element else None

    location_element = item.select_one("span.oa6ri")
    location = location_element.get_text(strip=True) if location_element else None

    products.append({
        "name": name,
        "price": price,
        "sold": sold,
        "reviews": reviews,
        "location": location,
        "link": link,
    })
- We use CSS selectors to pinpoint the exact text we want.
- We use if statements as a safety net. If a product is missing a review count or a location, the program just labels it as None instead of crashing.
- Finally, it saves all these details into a list.
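The "select, then check for None" pattern above repeats four times, so it could be factored into a small helper. This is an optional refactor sketch, not part of the original script:

```python
from bs4 import BeautifulSoup

def text_or_none(item, selector):
    """Return the stripped text of the first match, or None when absent."""
    element = item.select_one(selector)
    return element.get_text(strip=True) if element else None

# Quick check against a stand-in product card (class names from the article).
card = BeautifulSoup(
    '<div class="Bm3ON"><span class="ooOxS">₱1,250</span></div>', "html.parser"
)
print(text_or_none(card, "span.ooOxS"))  # ₱1,250
print(text_or_none(card, "span.qzqFw"))  # None
```

Inside the loop, each extraction then becomes a one-liner like price = text_or_none(item, "span.ooOxS").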
5. Saving Data to JSON and Closing Up
After gathering all the data across all the pages, we save it into a file and shut down the robot browser.
- Saving: The data is exported as a JSON file.
with open(f"data/{search_item}.json", "w", encoding="utf-8") as f:
    json.dump(products, f, indent=4, ensure_ascii=False)
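The ensure_ascii=False flag is what keeps the peso sign (₱) readable in the file instead of an escape sequence like \u20b1. A quick round-trip check, using a temporary stand-in filename rather than the script's real one:

```python
import json
import os
import tempfile

products = [{"name": "Sample Keyboard", "price": "₱1,250"}]  # stand-in data

path = os.path.join(tempfile.gettempdir(), "sample_products.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(products, f, indent=4, ensure_ascii=False)  # keeps ₱ as-is

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded[0]["price"])  # ₱1,250
```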
- Closing: Selenium safely closes the hidden Chrome browser so it doesn't waste your computer's memory.
driver.quit()
And that's it! Your web scraper is done.
How to Use the Scraper
The program will ask you what you want to search for and how many pages to scrape. Type your answers and hit Enter!
Enter a product: mechanical keyboard
Enter how many pages: 2
Scraping...
Saved 80 products to mechanical keyboard.json
Sample Output
Because we saved our extracted data as a JSON file, it is highly structured and easy to read. If you open your new mechanical keyboard.json file, you will see a neatly organized list of dictionaries that looks something like this:
[
{
"name": "RGB Mechanical Gaming Keyboard Blue Switch",
"price": "₱1,250",
"sold": "54 Sold",
"reviews": "12 Reviews",
"location": "Metro Manila",
"link": "https://www.lazada.com.ph/products/example-link"
},
{
"name": "Wireless 65% Mechanical Keyboard Hot-Swappable",
"price": "₱2,100",
"sold": "850 Sold",
"reviews": "315 Reviews",
"location": "Overseas",
"link": "https://www.lazada.com.ph/products/example-link-2"
}
]
Summary
- Takes user input for the product name and page count.
- Opens a hidden browser using Selenium.
- Waits patiently for the dynamic products to appear.
- Extracts specific details (price, name, sold count) using BeautifulSoup.
- Saves everything neatly into a JSON file.
- Closes the browser safely.
Conclusion
Congratulations! You now understand the core mechanics of scraping dynamic e-commerce websites. By combining Selenium's ability to automate a real web browser with BeautifulSoup's efficient data parsing, you can bypass the limitations of basic HTML scraping and gather valuable data from modern, JavaScript-heavy sites like Lazada.
As a final reminder, e-commerce sites frequently update their layouts and class names (like div.Bm3ON), so if your script ever starts returning empty data, just inspect the webpage again and update your CSS selectors. Always scrape responsibly, avoid sending too many requests too quickly, and happy coding!