With the dramatic increase in the amount of available data, web crawling has become an essential tool in fields such as data science, market research, and competitive analysis. Among programming languages, Python has become the preferred choice for developing web crawlers thanks to its concise syntax and powerful library support. Whether you are extracting data from an e-commerce platform or collecting the latest articles from a news website, a Python web crawler can complete the task efficiently. This article provides a 2025 step-by-step guide to building a robust web crawler in Python, from basic knowledge to advanced techniques, to comprehensively improve your web crawling capabilities.
What is a Web Crawler in Python and Why It’s Important for Data Extraction
A web crawler is an automated program that collects information from the Internet according to specific rules. It accesses web pages by simulating a browser, extracts the required data, and stores it locally. This process usually includes selecting initial URLs, downloading page content, parsing the HTML, following links, and repeating the cycle to obtain more data. Web crawlers are crucial for data extraction because they can efficiently collect information from large numbers of pages, supporting tasks such as search engine index building and data analysis.
Advantages of Web Crawler in Python
There are many advantages to writing web crawlers in Python, especially in terms of flexibility and ease of use. First, Python's syntax is concise and easy to learn, allowing developers to quickly get started and implement complex crawling logic. Second, Python has a wealth of libraries and frameworks, such as Scrapy and BeautifulSoup, which greatly simplify the process of web page parsing and data extraction. In addition, Python's cross-platform nature allows crawlers to run on different operating systems, thereby increasing the flexibility of development and deployment.
đź’ˇ Related Reading: Web Scraping with Python in 2025
Advanced Techniques for Web Crawling in Python
When developing a Python web crawler, several advanced techniques can enhance your web scraping capabilities, especially when dealing with dynamic content and anti-scraping measures. These strategies are crucial for overcoming challenges such as JavaScript rendering, CAPTCHA solving, and IP blocking. Here are some key strategies (a short code sketch follows the list):
- Handling Dynamic Web Pages:
- Use Selenium: This library allows you to automate browser actions, enabling you to wait for JavaScript content to load before extracting data.
- Perform Ajax Requests: Analyze network requests in your browser's developer tools to identify API endpoints. Use the requests library in Python to send direct requests to these endpoints for more efficient data retrieval.
- Bypassing Anti-Scraping Measures:
- Utilize Proxies: Implement rotating proxy IPs to distribute requests across multiple IP addresses, making it harder for websites to detect and block your scraping activities.
- Spoof User-Agent: Modify the User-Agent string in your request headers to mimic popular browsers. This helps reduce the likelihood of being flagged as a bot.
- Enhancing Efficiency:
- Implement Asynchronous Programming: Use libraries like asyncio and aiohttp to make concurrent requests, significantly speeding up the data extraction process.
- Leverage XPath or CSS Selectors: These tools allow for precise targeting of HTML elements, improving the accuracy and efficiency of your data extraction.
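As a minimal sketch of two of these ideas, the snippet below fetches several pages concurrently with asyncio and aiohttp while rotating the User-Agent header and, optionally, a proxy on each request. The User-Agent strings and proxy URLs are placeholders you would replace with your own; this is an illustration under those assumptions, not a production crawler.

import asyncio
import random

import aiohttp

# Placeholder values -- substitute your own User-Agent strings and proxy URLs.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]
PROXIES = []  # e.g. ["http://user:pass@proxy-host:8000"]

async def fetch(session, url):
    # Rotate the User-Agent (and proxy, if any) on every request.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    proxy = random.choice(PROXIES) if PROXIES else None
    async with session.get(url, headers=headers, proxy=proxy,
                           timeout=aiohttp.ClientTimeout(total=15)) as resp:
        return await resp.text()

async def crawl_all(urls):
    async with aiohttp.ClientSession() as session:
        # gather() runs all requests concurrently instead of one by one.
        return await asyncio.gather(*(fetch(session, u) for u in urls))

if __name__ == "__main__":
    pages = asyncio.run(crawl_all(["https://example.com", "https://example.org"]))
    print([len(p) for p in pages])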
Setting Up Your Python Environment for Web Crawling
Before you start web crawling, you need to prepare a basic environment:
- Python 3+: Download the installer, double-click it, and follow the installation wizard.
- A Python IDE: Visual Studio Code or PyCharm with the Python extension installed. Then, enter the following commands in the terminal to initialize a project called python-crawler:
mkdir python-crawler
cd python-crawler
python -m venv env
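After creating the virtual environment, activate it before installing any packages. The command depends on your operating system:

# macOS / Linux
source env/bin/activate
# Windows
env\Scripts\activate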
When doing web crawling, we need to use two libraries for HTTP requests and HTML parsing. The two most popular libraries in Python are:
- requests: A powerful HTTP client library that can send HTTP requests and process responses.
- beautifulsoup4: A full-featured HTML and XML parser.
Install both by typing the following command in the terminal:
pip install beautifulsoup4 requests
In the project folder, create crawler.py and import the project dependencies:
import requests
from bs4 import BeautifulSoup
With the project set up, let's start crawling the web.
How to Scrape Amazon Data Using Python
Scraping Amazon can yield product information, reviews, and trend data. However, Amazon's anti-scraping measures, such as CAPTCHAs and IP rate limiting, make the process challenging. Later in this guide, we'll walk through how to scrape Amazon data using Python.
How to Build a Simple Web Crawler in Python
After setting up the crawling environment as described above, follow the steps below to create a simple web crawler in Python.
Step 1: Basic Web Crawler Using Requests and BeautifulSoup
Code Example
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

class SimpleWebCrawler:
    def __init__(self, start_url):
        self.start_url = start_url
        self.visited_urls = set()
        self.urls_to_visit = [start_url]

    def crawl(self):
        while self.urls_to_visit:
            current_url = self.urls_to_visit.pop(0)
            if current_url in self.visited_urls:
                continue
            print(f"Crawling: {current_url}")
            response = requests.get(current_url, timeout=10)
            if response.status_code == 200:
                soup = BeautifulSoup(response.content, 'html.parser')
                self.visited_urls.add(current_url)
                self.extract_links(soup, current_url)

    def extract_links(self, soup, base_url):
        for link in soup.find_all('a', href=True):
            # Resolve relative hrefs against the current page's URL.
            absolute_link = urljoin(base_url, link['href'])
            if absolute_link not in self.visited_urls and absolute_link not in self.urls_to_visit:
                self.urls_to_visit.append(absolute_link)

if __name__ == "__main__":
    crawler = SimpleWebCrawler("https://example.com")
    crawler.crawl()
Explanation
- Initialization: The SimpleWebCrawler class initializes with a starting URL and sets to track visited URLs and URLs to visit.
- Crawling Logic: The crawl method processes URLs in the urls_to_visit list, fetching each page's content.
- Link Extraction: The extract_links method finds all hyperlinks on the page, resolves relative hrefs to absolute URLs with urljoin, and adds them to the list of URLs to visit if they haven't been seen yet.
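Note that this crawler will keep following links until it runs out of URLs, which on a real site can mean crawling indefinitely. A common safeguard, shown here as a variation of ours rather than part of the original example, is to cap the number of pages and pause between requests:

import time

class PoliteWebCrawler(SimpleWebCrawler):
    def crawl(self, max_pages=50, delay=1.0):
        pages = 0
        while self.urls_to_visit and pages < max_pages:
            current_url = self.urls_to_visit.pop(0)
            if current_url in self.visited_urls:
                continue
            print(f"Crawling: {current_url}")
            response = requests.get(current_url, timeout=10)
            if response.status_code == 200:
                soup = BeautifulSoup(response.content, 'html.parser')
                self.visited_urls.add(current_url)
                self.extract_links(soup, current_url)
                pages += 1
            time.sleep(delay)  # pause between requests to avoid hammering the server

PoliteWebCrawler("https://example.com").crawl(max_pages=20)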
Step 2: Using Scrapy for More Complex Crawling
If your project requires more advanced features like handling multiple requests concurrently or scraping large websites efficiently, consider using Scrapy.
Basic Scrapy Example
import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ['https://example.com']

    def parse(self, response):
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, self.parse)
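In practice you usually want the spider to yield data as well as follow links. As a minimal sketch, parse can yield a dictionary for each page alongside the follow requests; the "title" field below is just a placeholder for whatever your target pages actually expose:

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ['https://example.com']

    def parse(self, response):
        # Yield one item per page; Scrapy collects these automatically.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),  # placeholder field
        }
        # Continue following links found on the page.
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, self.parse)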
Running Scrapy
You can run your Scrapy spider from the command line. Inside a Scrapy project (created with scrapy startproject), use:
scrapy crawl my_spider
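If the spider lives in a standalone file rather than a Scrapy project, you can run it directly with runspider and optionally export the results to a file:

scrapy runspider my_spider.py -o results.json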
How to Scrape Amazon Data with Python
This section walks through how to crawl Amazon data with Python in detail.
Step 1. First, we need to fetch the product page by making a request with the get method:
url = "https://www.amazon.com/Breathable-Athletic-Sneakers-Comfortable-Lightweight/dp/B0CMTJ7JS7/?_encoding=UTF8&pd_rd_w=XsBL5&content-id=amzn1.sym.61d4ee60-9341-4d7a-912d-bc661951aa32&pf_rd_p=61d4ee60-9341-4d7a-912d-bc661951aa32&pf_rd_r=8M3TP83H0CZQD08XHGBR&pd_rd_wg=6d3lc&pd_rd_r=a6a366f4-4ec7-491f-87ec-67672fe48a55&ref_=pd_hp_d_btf_cr_simh&th=1"
response = requests.get(url)
response.content contains the HTML document generated by the server. This is fed into BeautifulSoup, and the html.parser option lets you specify the parser the library will use:
soup = BeautifulSoup(response.content, "html.parser")
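Note that Amazon frequently answers bare requests with a CAPTCHA page or a 503 response. Sending browser-like headers reduces, but does not eliminate, the chance of being blocked; the User-Agent string below is just an example value of ours, not a requirement of any API:

headers = {
    # Example browser User-Agent; replace with a current one.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}
response = requests.get(url, headers=headers, timeout=15)
# The rest of the walkthrough (BeautifulSoup parsing) stays the same.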
Step 2. Next, we need to extract the data we want to crawl. We can use CSS selectors to locate the corresponding elements.
BeautifulSoup provides two methods, select and select_one, which both support CSS selector strategies.
Before writing code, you can open your browser's developer tools to inspect the CSS of each element.
- Get the product title:
product_title = soup.select_one("#productTitle").text
- Get the product description:
description = soup.select_one("#productFactsDesktopExpander ul.a-unordered-list").text
- Get the price of a product:
prices = soup.select_one(".a-price-range")
real_price = prices.select(".a-offscreen")
min_price = real_price[0].text
max_price = real_price[1].text
- Get product reviews:
star_info = soup.select('.a-meter[role=progressbar]')
five_star = star_info[0].attrs['aria-valuenow'] + '%'
four_star = star_info[1].attrs['aria-valuenow'] + '%'
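These selectors depend on Amazon's current page layout, and select_one returns None when an element is missing, which makes the .text access raise an AttributeError. A small defensive helper (an addition of ours, not part of the original walkthrough) keeps the script from crashing on layout changes:

def safe_text(soup, selector):
    # Return the stripped text of the first match, or None if nothing matches.
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element else None

product_title = safe_text(soup, "#productTitle")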
Step 3. Now that we have crawled the page and obtained the data we want, we can export the crawled information to a CSV file.
To do this, add the following to the top of the file:
import csv
Write the crawled data into a csv file:
with open("product.csv", "w") as csv_file:
writer = csv.writer(csv_file)
writer.writerow([
"product_title",
"description",
"min_price",
"max_price",
"five_star",
"four_star"
])
writer.writerow([
product_title,
description,
min_price,
max_price,
five_star,
four_star
])
Run the following command in the terminal to execute the script:
python crawler.py
Step 4. After execution completes, a product.csv file appears in the project folder. Open it to see the crawled data.
The complete code is as follows:
import csv
import requests
from bs4 import BeautifulSoup
url = "https://www.amazon.com/Breathable-Athletic-Sneakers-Comfortable-Lightweight/dp/B0CMTJ7JS7/?_encoding=UTF8&pd_rd_w=XsBL5&content-id=amzn1.sym.61d4ee60-9341-4d7a-912d-bc661951aa32&pf_rd_p=61d4ee60-9341-4d7a-912d-bc661951aa32&pf_rd_r=8M3TP83H0CZQD08XHGBR&pd_rd_wg=6d3lc&pd_rd_r=a6a366f4-4ec7-491f-87ec-67672fe48a55&ref_=pd_hp_d_btf_cr_simh&th=1"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
product_title = soup.select_one("#productTitle").text
description = soup.select_one("#productFactsDesktopExpander ul.a-unordered-list").text
prices = soup.select_one(".a-price-range")
real_price = prices.select(".a-offscreen")
min_price = real_price[0].text
max_price = real_price[1].text
star_info = soup.select('.a-meter[role=progressbar]')
five_star = star_info[0].attrs['aria-valuenow'] + '%'
four_star = star_info[1].attrs['aria-valuenow'] + '%'
with open("product.csv", "w") as csv_file:
writer = csv.writer(csv_file)
writer.writerow([
"product_title",
"description",
"min_price",
"max_price",
"five_star",
"four_star"
])
writer.writerow([
product_title,
description,
min_price,
max_price,
five_star,
four_star
])
How Scrapeless’ Amazon Scraping API Can Simplify Your Web Crawling Tasks
Scrapeless's Amazon Scraping API is designed to automate and simplify the process of extracting data from Amazon, making it a valuable tool for developers and businesses. Unlike a hand-written Python web crawler, which often requires extensive manual coding and handling of challenges like IP rotation or CAPTCHA bypassing, the Scrapeless API streamlines the process. It offers a range of features that enhance efficiency, allowing users to easily collect data such as product prices, reviews, and descriptions without complex Python scripting.
In addition to the Amazon Scraping API, Scrapeless also includes Shopee Scraping API, Lazada Scraping API, Google Trends Scraping API, Google Flights Scraping API, Google Search Scraping API, Airbnb Scraping API, etc., providing a comprehensive solution for web data extraction.
Ready to start scraping effortlessly?
Sign up for Scrapeless today and get your free trial to experience the power of our APIs. Unlock seamless data extraction from top eCommerce platforms like Amazon, Shopee, and more. Don't miss out—get started now!
Advantages over manual Python web crawlers
1. Automation and efficiency
The Amazon Scraping API automates the entire data extraction process, ensuring that users can quickly and accurately collect large amounts of data. This eliminates the complex coding that is typically required for manual Python web crawlers, which often involves dealing with various challenges such as dynamic content and anti-scraping measures.
2. Built-in infrastructure
With Scrapeless' API, users benefit from a robust infrastructure that automatically handles proxy management, IP rotation, and CAPTCHA solving. In contrast, manual Python web crawlers require developers to implement these features themselves, which can be time-consuming and error-prone.
3. Code-free interface
The API provides a code-free interface that allows users to initiate scraping tasks with simple API calls. This is much easier than writing and debugging code for a Python web crawler, making it accessible to users of all skill levels.
Efficiently extract Amazon data through API
Using Scrapeless’s Amazon Scraping API, users can easily extract structured data by following these steps:
1. API key generation: Sign up for Scrapeless and generate your unique API key.
2. Click Scraping API and select Amazon.
3. Define your requirements: Specify the type of data you want to scrape (e.g., product details, reviews).
4. Click Start Scraping: Request data from Amazon using simple API calls.
5. Receive structured data: The Scrapeless API provides the collected data in various formats (e.g., JSON) for analysis or integration into your system.
By leveraging Scrapeless’s Amazon Scraping API, users can greatly simplify their web scraping tasks, allowing them to focus on analyzing insights rather than managing the complexity of web scraping. This powerful tool not only improves productivity, but also ensures compliance with data protection regulations, making it ideal for businesses looking to gain a competitive advantage in their market research efforts.
If you need to integrate Scrapeless into your own project, you can refer to the sample code below; note that the samples show only the request structure, so consult the documentation for how to attach your API key. You can also click here to view the full documentation.
Request samples - Product
import requests
import json
url = "https://api.scrapeless.com/api/v1/scraper/request"
payload = json.dumps({
    "actor": "scraper.amazon",
    "input": {
        "url": "https://www.amazon.com/dp/B0BQXHK363",
        "action": "product"
    }
})
headers = {
    'Content-Type': 'application/json'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
Request samples - Seller
import http.client
import json
conn = http.client.HTTPSConnection("api.scrapeless.com")
payload = json.dumps({
    "actor": "scraper.amazon",
    "input": {
        "url": "",
        "action": "seller"
    }
})
headers = {
    'Content-Type': 'application/json'
}
conn.request("POST", "/api/v1/scraper/request", payload, headers)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8"))
Request samples - Keywords
import http.client
import json
conn = http.client.HTTPSConnection("api.scrapeless.com")
payload = json.dumps({
    "actor": "scraper.amazon",
    "input": {
        "action": "keywords",
        "keywords": "iPhone 12",
        "page": "5",
        "domain": "com"
    }
})
headers = {
    'Content-Type': 'application/json'
}
conn.request("POST", "/api/v1/scraper/request", payload, headers)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8"))
Join the Scrapeless Discord community today!
Stay updated with weekly news, exclusive updates, and participate in exciting events for a chance to win credits. Don't miss out on the fun—be part of the action now!
FAQ about Web Crawlers in Python
FAQ #1: What is the difference between web crawlers and web scrapers in Python?
Web crawlers and web scrapers have different uses in the field of data extraction. A web crawler is primarily focused on discovery; it browses websites to find and index URLs, essentially creating a map of the internet or a specific website. The output of a web crawler is usually a list of URLs. In contrast, a web scraper extracts specific data from these URLs, such as product details or pricing information. While both processes involve downloading HTML content, the goal of a crawler is to collect links, while the goal of a scraper is to filter and extract relevant data points from these pages.
FAQ #2: How to handle CAPTCHA when building a web crawler with Python?
Handling CAPTCHA is one of the most challenging aspects of building a web crawler with Python, as it is specifically designed to prevent automated access. Here are some effective strategies for handling CAPTCHA:
- Use headless browsers: Headless browsers combined with tools such as Puppeteer or Playwright can help bypass CAPTCHAs by mimicking real browser behavior.
- Avoid triggering CAPTCHAs:
  - Use a proxy service to rotate IP addresses to prevent detection.
  - Randomize request headers (e.g., User-Agent) and introduce delays between requests to mimic human activity.
While these methods can help bypass CAPTCHAs, always make sure your actions comply with the website's terms of service and legal requirements.
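As a small illustration of the second point, the loop below randomizes the User-Agent and sleeps for a random interval between requests. The header values and target URLs are placeholders, and this only reduces the chance of triggering a CAPTCHA; it does not solve one:

import random
import time

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # placeholder UA strings
    "Mozilla/5.0 (X11; Linux x86_64)",
]
urls = ["https://example.com/page1", "https://example.com/page2"]  # hypothetical targets

for url in urls:
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers, timeout=10)
    print(url, response.status_code)
    time.sleep(random.uniform(2, 6))  # random pause to look less like a bot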
FAQ #3: Is it legal to scrape data from websites like Amazon using Python?
The legality of web scraping depends on a variety of factors, especially when targeting e-commerce platforms like Amazon. Here are some key considerations:
- Robots.txt compliance: Websites often include a file that outlines which parts of the site can be scraped. While ignoring it is not illegal in itself, it may be considered unethical or against best practices.
- Fair use and public data: If the data is publicly accessible and used for non-commercial purposes (such as academic research), it may fall under "fair use" in some jurisdictions. However, this is not guaranteed to be the case. To avoid legal issues:
- Before scraping data, always check the website's terms of service.
- If possible, ask for permission.
- Use a legal website scraping API, such as Scrapeless.
Conclusion
In this article, we explored the importance of web crawlers in Python, especially their wide application in e-commerce data scraping. As a flexible and powerful programming language, Python provides a wealth of libraries and tools that help developers efficiently crawl e-commerce platforms and obtain key data such as product information, prices, and reviews. However, manually writing and maintaining a web crawler often requires a lot of time and effort, especially when faced with complex anti-scraping mechanisms.
In this context, Scrapeless's Amazon Scraping API provides an efficient alternative. For users who need to crawl large-scale e-commerce data, the Scrapeless API not only simplifies the crawling process but also automatically handles various complex problems, helping users save time and effort and easily obtain the required Amazon data. Whether for a small business or large-scale data needs, Scrapeless is an ideal choice.