Daniel Azevedo
Getting Started with Web Scraping in Python (For Beginners)

Hey everyone!

If you're just starting out with web scraping, Python is an awesome tool to have in your arsenal. It's straightforward, flexible, and the community has built some amazing libraries to make the process smoother.

So, what exactly is web scraping? Simply put, it's the process of automatically extracting data from websites. Instead of manually copying and pasting information, you can write a script to do that for you in seconds.

Tools You'll Need

To get started, you'll need a couple of essential Python libraries:

  1. Requests: To make HTTP requests and get the page content.
  2. BeautifulSoup: To parse the HTML and extract data.
  3. VS Code: (or your favorite code editor, but I prefer VS Code!) to write and test your Python scripts.

Let’s go through a basic example of scraping using requests and BeautifulSoup.

Setting Up

First, if you don’t have these libraries installed, fire up your terminal or command prompt and install them:

pip install requests beautifulsoup4

Simple Web Scraping Example

Let’s start with something super simple. We'll scrape data from a practice site called Books to Scrape, which lists books and prices in an easy-to-scrape HTML format.

Here's the code:

import requests
from bs4 import BeautifulSoup

# Send a request to the website
url = "http://books.toscrape.com/"
response = requests.get(url)

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Extract all book titles and prices
books = soup.find_all(class_="product_pod")

for book in books:
    title = book.h3.a['title']
    price = book.find(class_="price_color").text
    print(f"Title: {title}, Price: {price}")

What’s Happening Here?

  1. We use requests.get() to send a request to the website and grab the HTML.
  2. Then we pass the HTML to BeautifulSoup, which helps us parse the page.
  3. Finally, we look for the elements that contain book titles and prices, and print them out.

When you run this in VS Code (make sure to use a Python environment), you'll see the titles and prices of books printed to the console. Easy, right?
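The same parsing pattern works on any HTML you hand to BeautifulSoup, so you can experiment without even hitting the network. Here's a small, self-contained sketch; the HTML snippet below is made up to mirror the structure of Books to Scrape, and it also shows one extra step you'll almost always want: stripping the currency symbol so the price becomes a real number.

```python
from bs4 import BeautifulSoup

# A made-up snippet mimicking the structure of Books to Scrape
html = """
<article class="product_pod">
  <h3><a title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a title="Tipping the Velvet">Tipping the Velvet</a></h3>
  <p class="price_color">£53.74</p>
</article>
"""

soup = BeautifulSoup(html, 'html.parser')

results = []
for book in soup.find_all(class_="product_pod"):
    # The full title lives in the <a> tag's title attribute,
    # since the visible link text is often truncated
    title = book.h3.a['title']
    # Strip the currency symbol so the price can be used as a number
    price = float(book.find(class_="price_color").text.lstrip('£'))
    results.append((title, price))

print(results)
```

Once prices are floats instead of strings like "£51.77", you can sort, sum, or filter them directly.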

Testing with More Complex Pages

Sometimes, pages are more dynamic (using JavaScript to load content), and that's where Selenium comes in. It allows us to interact with dynamic web pages like a real browser.

Here’s an example using Selenium:

  1. Install Selenium:

pip install selenium

  2. Download a driver for your browser (like ChromeDriver for Chrome). If you're on Selenium 4.6 or newer, Selenium Manager can fetch the right driver for you automatically.
  3. Here’s a quick script that opens a browser, navigates to a page, and grabs content:
from selenium import webdriver
from selenium.webdriver.common.by import By

# Set up the webdriver (Selenium 4.6+ locates the driver for you)
driver = webdriver.Chrome()

# Open the website
driver.get('http://books.toscrape.com/')

# Get book elements using Selenium's By locators
books = driver.find_elements(By.CLASS_NAME, 'product_pod')

for book in books:
    title = book.find_element(By.TAG_NAME, 'h3').text
    print(f"Title: {title}")

driver.quit()

This approach is helpful when websites require interaction or have dynamic content.

Final Thoughts

Web scraping is super useful when you need to gather large amounts of data efficiently. Just remember to always check a website’s robots.txt file to ensure you're not violating any scraping policies, and be mindful of the ethical considerations.
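Python's standard library can help with that robots.txt check: `urllib.robotparser` understands the format. Here's a sketch that parses an example file locally; the rules below are invented for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Invented example rules, just to show the API
robots_txt = """
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check whether a given URL may be fetched by any crawler ("*")
print(parser.can_fetch("*", "http://example.com/catalogue/page-1.html"))  # True
print(parser.can_fetch("*", "http://example.com/private/data.html"))      # False
```

For a live site you'd call `parser.set_url("http://books.toscrape.com/robots.txt")` followed by `parser.read()` instead of parsing a string, then run the same `can_fetch` checks before scraping.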

Let me know if you're trying this out in VS Code or have any questions!

Happy coding!

Top comments (2)

Stan Kukučka

I think BeautifulSoup is a somewhat outdated library to use for scraping.

Daniel Azevedo

You're right that BeautifulSoup might be considered a bit outdated for more complex scraping tasks. However, it's still great for simpler projects due to its ease of use. For more advanced scraping, tools like Scrapy or Playwright might be better choices, especially for dynamic content.