DEV Community

Web Scraping with Python: A Step-by-Step Guide

Emmanuel Joseph on May 20, 2024

Web scraping is like being a digital Sherlock Holmes, extracting hidden clues (or data) from websites. This guide will show you how to build a simp...

msc2020 • May 21 '24

Good post. Thanks for sharing! I wrote a similar one, dev.to/msc2020/raspagem-de-dados-d..., but in pt-BR and for a news site.

Emmanuel Joseph • May 21 '24

Good to hear. I went through your work, but I found the language a little hard to understand.

msc2020 • May 27 '24

No problem!

Arafat Hossain Ar • May 21 '24

Nice! I have a question.
How is this helpful for an SEO person?

Emmanuel Joseph • May 21 '24

Web scraping for SEO is like being a digital detective, but instead of solving crimes, you're uncovering keyword capers and backlink bandits. It's like Sherlock Holmes with a keyboard and a penchant for meta tags. You'll comb through competitors' websites like a spy on a mission, extracting juicy keywords and unraveling their content strategies. And just when you think you've cracked the case, you discover broken links lurking in the shadows, like the villains of the SEO world. But fear not: with your trusty scraper in hand, you will overcome them.

Arafat Hossain Ar • May 21 '24

To be honest, that went right over my head. Could you please provide a summary so I can better understand it?

Emmanuel Joseph • May 21 '24

import requests
from bs4 import BeautifulSoup

# Step 1: Fetch the web page
url = 'https://example.com'  # requests needs the scheme (https://) in the URL
response = requests.get(url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    page_content = response.content

    # Step 2: Parse HTML content
    soup = BeautifulSoup(page_content, 'html.parser')

    # Step 3: Extract the title (a page may have no <title> tag)
    page_title = soup.title.string if soup.title else None
    print(f"Page Title: {page_title}")

    # Step 4: Extract hyperlinks
    links = soup.find_all('a')
    for link in links:
        href = link.get('href')
        # get_text() also works when the link contains nested tags
        link_text = link.get_text(strip=True)
        print(f"Link Text: {link_text}, URL: {href}")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

This is everything put together
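
To make the SEO angle a bit more concrete, here is a rough sketch of pulling the things an SEO person usually looks at (title, meta description, h1 headings, internal versus external links) out of a page. It is only an illustration: the sample HTML, the `SeoParser` class, the `seo_summary` function, and the `https://example.com/` base URL are all made up for this example, and it uses Python's built-in `html.parser` so it runs without a network connection or extra packages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical sample page, inlined so the sketch runs offline.
SAMPLE_HTML = """
<html><head>
<title>Blue Widgets | Example Shop</title>
<meta name="description" content="Hand-made blue widgets.">
</head><body>
<h1>Blue Widgets</h1>
<a href="/pricing">Pricing</a>
<a href="https://other.example.org/review">Review</a>
<a>No link here</a>
</body></html>
"""


class SeoParser(HTMLParser):
    """Collects the title, meta description, <h1> text, and link targets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.description = None
        self.h1 = []
        self.internal_links = []
        self.external_links = []
        self._current = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            self._current = tag
            if tag == "h1":
                self.h1.append("")
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links, then compare hosts with the base URL.
            full = urljoin(self.base_url, attrs["href"])
            same_host = urlparse(full).netloc == urlparse(self.base_url).netloc
            bucket = self.internal_links if same_host else self.external_links
            bucket.append(full)

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h1":
            self.h1[-1] += data


def seo_summary(html, base_url):
    """Return a dict of basic on-page SEO signals for one HTML document."""
    parser = SeoParser(base_url)
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "h1": [text.strip() for text in parser.h1],
        "internal_links": parser.internal_links,
        "external_links": parser.external_links,
    }


summary = seo_summary(SAMPLE_HTML, "https://example.com/")
for key, value in summary.items():
    print(f"{key}: {value}")
```

In a real run you would pass in `response.text` from the snippet above instead of the sample string. Checking each collected link for a broken status would be a separate step that needs one request per link.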

Tarek Gritli • May 22 '24

Amazing post!

Ahmed Gargouri • Jan 11

Did you like the post?

Emmanuel Joseph • May 23 '24

Thank you Tarek Gritli

kenara • May 21 '24

I know this is meant as an introduction for beginners but it needs to be said that without JavaScript interpretation, using a tool like Selenium, the results will be extremely limited.

Emmanuel Joseph • May 21 '24

While the method works well for static pages, many modern websites load content dynamically using JavaScript. The requests library and BeautifulSoup cannot execute JavaScript, which means that they may fail to retrieve dynamically loaded content.

For scraping such websites, you can use tools like Selenium, which can interact with web pages just like a web browser, including executing JavaScript.
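
A small offline sketch of why this happens, under made-up assumptions: the `DYNAMIC_HTML` page and the `count_static_items` helper below are hypothetical, and Python's built-in `html.parser` stands in for any parser that reads the HTML without running scripts. The list items exist only inside the JavaScript, so a static parse of the server's response finds none of them.

```python
from html.parser import HTMLParser

# Hypothetical page: the list is empty in the HTML the server sends, and the
# script would fill it in only when a browser actually runs the page.
DYNAMIC_HTML = """
<html><body>
<ul id="products"></ul>
<script>
  const list = document.getElementById("products");
  ["Widget A", "Widget B"].forEach(name => {
    const li = document.createElement("li");
    li.textContent = name;
    list.appendChild(li);
  });
</script>
</body></html>
"""


class ListItemCounter(HTMLParser):
    """Counts <li> elements and records whether a <script> tag is present."""

    def __init__(self):
        super().__init__()
        self.li_count = 0
        self.has_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_count += 1
        elif tag == "script":
            self.has_script = True


def count_static_items(html):
    """Return (number of <li> tags, whether a script exists) without running JS."""
    counter = ListItemCounter()
    counter.feed(html)
    return counter.li_count, counter.has_script


li_count, has_script = count_static_items(DYNAMIC_HTML)
print(f"<li> elements found: {li_count}")  # prints "<li> elements found: 0"
print(f"Script present: {has_script}")     # prints "Script present: True"
```

The two widgets only show up after a browser executes the script, which is the gap a browser-driving tool fills.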

Dusan Petkovic • May 20 '24

It would be good to host a demo of this on GitHub, so people can clone it and test it out!

Thanks!

Emmanuel Joseph • May 21 '24

Yes, thanks. I will add my GitHub link 🔗 to my profile.

Nolan Miller • May 20 '24

This is much simpler than I thought. Giving me some ideas 🤔

Emmanuel Joseph • May 21 '24

Thanks for acknowledging it.

ManTheFan • Apr 21

Loved this guide! I’ve been through the whole blocked-IP hell too 😅. What helped me: BrightData (super stable, expensive), Oxylabs (nice balance), and cloud.hiphop — dirt cheap and does the job well. Worth checking out if you’re scraping anything at scale.