Emmanuel Joseph

Web Scraping with Python: A Step-by-Step Guide

Web scraping is like being a digital Sherlock Holmes, extracting hidden clues (or data) from websites. This guide will show you how to build a simple web scraper in Python using the requests library to fetch web pages and BeautifulSoup to parse HTML content. Grab your virtual magnifying glass and let's get started!

Prerequisites

Before you can start sleuthing, ensure Python is installed on your machine. You will also need to install the requests and BeautifulSoup4 libraries. Think of these as your detective tools. Install them using pip:

pip install requests
pip install beautifulsoup4

Step 1: Import Libraries

Begin by importing the necessary libraries. No detective can start without their toolkit:

import requests
from bs4 import BeautifulSoup

Step 2: Fetch the Web Page

Use the requests library to fetch the content of the web page you want to scrape. Let's scrape a hypothetical webpage, http://example.com. (Imagine it's the internet's version of 221B Baker Street.)

url = 'http://example.com'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    page_content = response.content
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

Step 3: Parse HTML Content

Time to bring out BeautifulSoup, your HTML parsing sidekick. Together, you'll make sense of the garbled mess that is HTML.

soup = BeautifulSoup(page_content, 'html.parser')

Step 4: Extract Data

Assume we want to extract the title of the page and all the hyperlinks. It's like finding the headlines and the getaway routes. Elementary, my dear Watson!

Extracting the Title

page_title = soup.title.string
print(f"Page Title: {page_title}")

Extracting Hyperlinks

To extract all hyperlinks (<a> tags) and their corresponding URLs:

links = soup.find_all('a')
for link in links:
    href = link.get('href')
    # link.string is None when the tag contains nested elements;
    # link.get_text() is a more forgiving alternative
    link_text = link.string
    print(f"Link Text: {link_text}, URL: {href}")
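
One caveat: href values are often relative paths like /about rather than full URLs. Here is a minimal sketch, using urljoin from Python's standard library, that resolves them against the page URL:

from urllib.parse import urljoin

links = soup.find_all('a')
for link in links:
    href = link.get('href')
    if href:  # some <a> tags carry no href attribute at all
        # Resolve relative paths like '/about' against the page URL
        absolute_url = urljoin(url, href)
        print(f"Absolute URL: {absolute_url}")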

Full Example

Combining all the steps, here is the complete script. It’s like the big reveal at the end of a mystery novel:

import requests
from bs4 import BeautifulSoup

# Step 1: Fetch the web page
url = 'http://example.com'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    page_content = response.content
    # Step 2: Parse HTML content
    soup = BeautifulSoup(page_content, 'html.parser')

    # Step 3: Extract the title
    page_title = soup.title.string
    print(f"Page Title: {page_title}")

    # Step 4: Extract hyperlinks
    links = soup.find_all('a')
    for link in links:
        href = link.get('href')
        link_text = link.string
        print(f"Link Text: {link_text}, URL: {href}")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

Conclusion

And there you have it, a web scraper worthy of its own detective novel! By using the requests library to fetch web pages and BeautifulSoup to parse and extract information, you can automate data collection from the web. Always remember to respect the robots.txt file of websites and their terms of service to ensure ethical scraping practices. After all, even digital detectives have a code of honor. Happy sleuthing!
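
To put that last point into practice, here is a minimal sketch of consulting robots.txt before scraping, using urllib.robotparser from Python's standard library (with the same hypothetical example.com):

from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt file
rp = RobotFileParser()
rp.set_url('http://example.com/robots.txt')
rp.read()

# can_fetch() reports whether a given user agent may crawl a URL
if rp.can_fetch('*', 'http://example.com/'):
    print("Scraping this page is allowed by robots.txt")
else:
    print("robots.txt disallows scraping this page")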

Top comments (15)

msc2020

Good post. Thanks for sharing! I have made a similar one, dev.to/msc2020/raspagem-de-dados-d..., but in pt-BR, for a news site.

Emmanuel Joseph

Good to hear! I went through your work; I found it a little hard to understand since I don't speak the language.

msc2020

No problem!

kenara

I know this is meant as an introduction for beginners, but it needs to be said that without JavaScript execution, e.g. via a tool like Selenium, the results will be extremely limited.

Emmanuel Joseph

While the method works well for static pages, many modern websites load content dynamically using JavaScript. The requests library and BeautifulSoup cannot execute JavaScript, which means that they may fail to retrieve dynamically loaded content.

For scraping such websites, you can use tools like Selenium, which can interact with web pages just like a web browser, including executing JavaScript.
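
Here is a minimal sketch of the same link extraction with Selenium. It assumes the selenium package is installed (pip install selenium) and Chrome is available; recent Selenium versions download a matching driver automatically:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://example.com')

# By this point the browser has executed the page's JavaScript,
# so dynamically loaded elements are present in the DOM
for link in driver.find_elements(By.TAG_NAME, 'a'):
    print(f"Link Text: {link.text}, URL: {link.get_attribute('href')}")

driver.quit()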

Arafat Hossain Ar

Nice! I have a question.
How is this helpful for an SEO person?

Emmanuel Joseph

Web scraping for SEO is like being a digital detective, but instead of solving crimes, you're uncovering keyword capers and backlink bandits. It's Sherlock Holmes with a keyboard and a penchant for meta tags. You'll slip into competitors' websites like a spy on a mission, extracting juicy keywords and unraveling their content strategies. And just when you think you've cracked the case, you discover broken links lurking in the shadows, like the villains of the SEO world. But fear not: with your trusty scraper in hand, you'll prevail.

Arafat Hossain Ar

To be honest, that just passed right over my head. Could you please provide a summary so I can better understand it?

Emmanuel Joseph

import requests
from bs4 import BeautifulSoup

# Step 1: Fetch the web page
url = 'http://example.com'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    page_content = response.content
    # Step 2: Parse HTML content
    soup = BeautifulSoup(page_content, 'html.parser')

    # Step 3: Extract the title
    page_title = soup.title.string
    print(f"Page Title: {page_title}")

    # Step 4: Extract hyperlinks
    links = soup.find_all('a')
    for link in links:
        href = link.get('href')
        link_text = link.string
        print(f"Link Text: {link_text}, URL: {href}")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

This is everything put together.
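
And for the SEO angle specifically, here is a minimal sketch that pulls the title and meta description, the starting point of an on-page audit (same hypothetical URL):

import requests
from bs4 import BeautifulSoup

url = 'http://example.com'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Title and meta description are the basics of an on-page SEO check
page_title = soup.title.string if soup.title else None
meta_desc = soup.find('meta', attrs={'name': 'description'})
description = meta_desc.get('content') if meta_desc else None

print(f"Title: {page_title}")
print(f"Meta description: {description}")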

Dusan Petkovic

It would be good to host a demo of this on GitHub, so people can clone it and test it out!

Thanks!

Emmanuel Joseph

Yes, thanks! I will add my GitHub link 🔗 to my profile.

Tarek Gritli

Amazing post!

Emmanuel Joseph

Thank you, Tarek Gritli!

Nolan Miller

This is much simpler than I thought. Giving me some ideas πŸ€”

Emmanuel Joseph

Thank you for the acknowledgment!