1. Setting Up the Environment
To start web scraping with Python, you first have to set up your development environment.
First, navigate to your project directory:
cd path/to/your/project
Then create a virtual environment by running:
python -m venv venv
This creates a virtual environment named venv in your project's root directory.
Next, activate the virtual environment. On macOS or Linux, run:
source venv/bin/activate
On Windows, the command depends on your shell:
Git Bash: source venv/Scripts/activate
Command Prompt: venv\Scripts\activate.bat
PowerShell: venv\Scripts\Activate.ps1
A virtual environment isolates your project's Python dependencies from the rest of your system. This means you can have multiple projects on the same machine, each with its own Python version and library versions, without conflicts. It also makes it easier to share your projects with others by listing all your dependencies in a requirements.txt file.
Once you activate your virtual environment, your terminal prompt will change. You’ll see the name of the virtual environment in parentheses before your username, for example:
(venv) user@User:~/Personal/WebScraping$
This is a clear sign that your project is running inside its own isolated Python environment. From here, any packages you install (like requests or beautifulsoup4) will only be available inside this project, not system-wide, which is exactly what we want.
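If you ever want to double-check which interpreter is active, an optional sanity check is to ask Python itself. This short snippet is not part of the tutorial's scraper, just a quick way to confirm the venv is in use:
# Quick sanity check: print the interpreter and environment in use
import sys
print(sys.executable)  # should point inside your project's venv directory
print(sys.prefix)      # the root of the active virtual environment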
2. Installing Necessary Packages
Now that our virtual environment is active, we can install the packages we need for web scraping.
The most common libraries for beginners are:
- requests -> used to send HTTP requests and download web pages.
- beautifulsoup4 -> a powerful tool that works hand-in-hand with an HTML parser. It takes the structured data created from the raw HTML and gives you a simple way to search, navigate, and extract information from it.
You can install these libraries by running:
pip install requests beautifulsoup4
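To confirm the installation worked, you can run a small throwaway snippet that parses an HTML string in memory, without touching the network. The HTML here is just a made-up example:
# Verify the install: parse a small HTML string, no network needed
from bs4 import BeautifulSoup
html = "<html><body><h2>Hello, scraper!</h2></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.find("h2").text)  # prints: Hello, scraper!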
Once installation is done, it's a good idea to save your dependencies in a file called requirements.txt. This makes it easy to recreate the same environment later or share it with others.
pip freeze > requirements.txt
Now your requirements.txt will list all the installed packages.
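Anyone with this file (including future you) can then recreate the environment by creating and activating a fresh virtual environment and running:
pip install -r requirements.txt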
3. Writing Your First Scraper
Here is a simple Python web scraper that fetches a web page and prints its title.
This example is great for beginners because it shows the entire scraping flow in just a few lines of code:
1. Send an HTTP request to download a page.
2. Parse the page's HTML with BeautifulSoup.
3. Extract a specific element (the h2 tag in this case).
4. Print the result.
# Import the BeautifulSoup class from the bs4 library
from bs4 import BeautifulSoup
# Import the requests module to send HTTP requests
import requests
# Define the target URL (this is a demo e-commerce site for practicing web scraping)
url = "https://webscraper.io/test-sites/e-commerce/scroll"
# Send a GET request to the URL
# This will download the HTML content of the page and return a response object
response = requests.get(url)
# Parse the HTML content of the page using BeautifulSoup
# The "html.parser" is a built-in parser that converts the raw HTML into a navigable tree
soup = BeautifulSoup(response.content, "html.parser")
# Find the first <h2> element on the page and extract its text content
# .text gets only the text between the tags, without the HTML tags themselves
title = soup.find('h2').text
# Print the result to confirm that everything worked correctly
print("Title:", title)
When you run the script, it should print the text of the first <h2> element on the page, prefixed with "Title:".
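As an optional refinement (not required for the tutorial), you could make the same script a little more defensive. This sketch uses the same URL and adds a timeout, an HTTP status check, and a guard in case no <h2> exists on the page:
# A slightly more defensive version of the same scraper
from bs4 import BeautifulSoup
import requests

url = "https://webscraper.io/test-sites/e-commerce/scroll"
# A timeout makes the request fail fast instead of hanging indefinitely
response = requests.get(url, timeout=10)
# Raise an exception for 4xx/5xx responses instead of parsing an error page
response.raise_for_status()
soup = BeautifulSoup(response.content, "html.parser")
heading = soup.find("h2")
if heading is not None:
    print("Title:", heading.text.strip())
else:
    print("No <h2> element found on the page.")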
4. Important: Web Scraping Best Practices
Before we continue, it's crucial to understand the legal and ethical aspects of web scraping:
- Always check robots.txt (e.g., https://website.com/robots.txt; a programmatic check is sketched after this list)
- Respect rate limits - don't overwhelm servers
- Check the website's Terms of Service
- Use scraping for legitimate purposes only
- Consider using official APIs when available
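To check robots.txt from Python rather than in a browser, the standard library ships a parser for it. Here is a minimal sketch using the demo site from earlier; the user agent "*" stands in for any scraper:
# Check whether robots.txt allows fetching a given URL
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://webscraper.io/robots.txt")
rp.read()  # downloads and parses the robots.txt file
allowed = rp.can_fetch("*", "https://webscraper.io/test-sites/e-commerce/scroll")
print("Allowed to scrape:", allowed)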
5. Recap
In this tutorial, you've successfully:
- Set up an isolated Python development environment using virtual environments.
- Installed essential web scraping libraries (requests and beautifulsoup4).
- Written your first functional web scraper that extracts data from a live website.
- Learned crucial best practices for ethical and responsible scraping.
While requests and BeautifulSoup are perfect for learning the fundamentals, modern web scraping often requires more advanced tools. In upcoming posts, we'll explore Playwright for handling JavaScript-heavy sites and n8n for building complete automation workflows without code.