Education Ecosystem

How to use ChatGPT to create a web scraping tool in Python

Introduction

Web scraping is the process of automatically gathering data from websites using software or scripts. The data can be text, links, images, or any other information available on the site. The process is also referred to as web data extraction or web harvesting, and it is closely related to web crawling. Python is one of the most popular languages for writing web scraping scripts, and it has many libraries for the job: Requests for fetching pages, Beautiful Soup for parsing HTML, and Selenium for driving a real browser. Depending on the type of scraping script you are building, some libraries are better suited than others.

ChatGPT is a large language model developed by OpenAI that can generate human-like text. For programmers, it has many use cases, including writing simple scripts such as web scrapers, checking code, and creating simple static pages in HTML and CSS. In this article, we will see how you can build a web scraping script in Python using ChatGPT.

Example #1: Web scraping for price monitoring

In the first example, we will use Beautiful Soup to create a simple web scraping script that monitors the price of the Asus GeForce RTX 3060 Ti listed on Newegg. Here is one example of a prompt you can enter in the ChatGPT chat box:

write a price monitoring script in python beautifulsoup for https://www.newegg.com/asus-geforce-rtx-3060-ti-dual-rtx3060ti-o8g-v2/p/N82E16814126521. Check every 5 minutes.

Here is the code snippet generated by ChatGPT:

import requests
from bs4 import BeautifulSoup
import time

url = 'https://www.newegg.com/asus-geforce-rtx-3060-ti-dual-rtx3060ti-o8g-v2/p/N82E16814126521'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

def check_price():
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    price = soup.find('li', {'class': 'price-current'}).strong.text
    price = float(price.replace(',', ''))
    return price

while True:
    price = check_price()
    print(f'Current price: ${price:.2f}')
    time.sleep(300)  # wait 5 minutes before checking again

Here is the result when the code is run:

[Image: screenshot of the script's output]

You can change the requirements in the prompt depending on how you want to monitor the price. For example, you can change how often the script checks, or have it test whether the price has dropped below a certain threshold. You can also ask for a notification, for instance through a Slack chatbot.
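As a minimal sketch of those two variations, the check and the notification could look like the code below. It assumes a hypothetical target price and a Slack incoming webhook URL that you would create yourself. The names `TARGET_PRICE`, `WEBHOOK_URL`, `is_below_target`, `build_slack_request`, and `notify_if_cheap` are illustrative and are not part of the ChatGPT output. Only the standard library is used.

```python
import json
import urllib.request

# Hypothetical values for illustration; replace them with your own.
TARGET_PRICE = 400.00
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder, not a real webhook


def is_below_target(price: float, target: float = TARGET_PRICE) -> bool:
    """Return True when the current price is strictly below the target."""
    return price < target


def build_slack_request(price: float, webhook_url: str = WEBHOOK_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request with a JSON message for a Slack incoming webhook."""
    payload = json.dumps({"text": f"Price dropped to ${price:.2f}"}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify_if_cheap(price: float) -> bool:
    """Send a Slack message when the price is below the target; return whether one was sent."""
    if not is_below_target(price):
        return False
    with urllib.request.urlopen(build_slack_request(price), timeout=10):
        pass  # the response body is not needed
    return True
```

In the monitoring loop, `notify_if_cheap(price)` could be called right after `check_price()`, so a message is sent only on the checks where the price is under the target.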

Note that ChatGPT does not always generate fully functional code. When we first asked it to create this script, it produced non-functioning code that required troubleshooting. We analyzed the code to identify the problem, modified the prompt accordingly, and asked ChatGPT again to get a working script. Always review and test the code ChatGPT generates, and adjust it as needed so that it works correctly for your specific use case.
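One part of the generated script that is easy to test without touching the network is the conversion of the scraped text into a number. The helper below is a sketch, not the code ChatGPT produced. `parse_price` is a hypothetical name, and the sample strings are made up. The generated script reads only the `<strong>` element, and whether that element holds the full price depends on the page's markup, so it is worth comparing the parsed value with what the browser shows.

```python
import re
from typing import Optional


def parse_price(text: Optional[str]) -> Optional[float]:
    """Convert scraped text such as '$1,299.99' to a float, or return None if no number is found."""
    if not text:
        return None
    # First run of digits, allowing thousands separators and an optional decimal part
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


# Quick checks on made-up inputs
assert parse_price("$1,299.99") == 1299.99
assert parse_price("  399 ") == 399.0
assert parse_price("Out of stock") is None
assert parse_price(None) is None
```

Returning `None` instead of raising lets the calling loop decide whether to skip a check or stop, rather than crashing on an unexpected page layout.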

Example #2: Web scraping for data mining

In this second example, we will ask ChatGPT for a script that scrapes the land area of a US city:

write a web scraper script in python that gathers the land area listed here https://worldpopulationreview.com/us-cities/seattle-wa-population

Here is the code snippet generated by ChatGPT:

import requests
from bs4 import BeautifulSoup

# specify the url of the website
url = "https://worldpopulationreview.com/us-cities/seattle-wa-population"

# send an HTTP request to the website and get its HTML content
response = requests.get(url)
html_content = response.content

# create a Beautiful Soup object from the HTML content
soup = BeautifulSoup(html_content, "html.parser")

# find the element that contains the land area information
land_area_element = soup.find("div", {"class": "jsx-1487038792 col-md-6 col-lg-4"})

# extract the land area value from the element
land_area_value = land_area_element.find("span").text.strip()

# print the land area value
print("Seattle's land area is:", land_area_value)

Here is the result when the code is run:

[Image: screenshot of the script's output]

As in the first example, you can state further requirements in the prompt, or ask for a different scraping library such as Selenium. Data mining of this kind is usually done to obtain the data needed for data analysis.
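If the scraped value is meant for later analysis, one option is to append each result to a CSV file. This is a sketch using only the standard library. The function name `append_result`, the file name, and the column layout are assumptions for illustration and do not come from the generated script.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path


def append_result(path: str, city: str, land_area: str) -> None:
    """Append one scraped value to a CSV file, writing a header row if the file is new or empty."""
    file_path = Path(path)
    is_new = not file_path.exists() or file_path.stat().st_size == 0
    with file_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["scraped_at", "city", "land_area"])
        # Store the UTC time of the scrape next to the value
        writer.writerow([datetime.now(timezone.utc).isoformat(), city, land_area])


# Example call, reusing the variable from the script above:
# append_result("land_areas.csv", "Seattle", land_area_value)
```

Keeping the value as text at this stage avoids guessing at units or number formats; conversion can happen later in the analysis step.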

ChatGPT is a great tool for writing simple scripts that gather data, but it may fall short when your requirements are more complex, such as robust error handling. ChatGPT is not specifically designed for web scraping. It also does not have direct access to the web, so it cannot extract data from websites itself or check its script against the live page.
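Error handling is one such requirement. As a sketch of what you might add by hand, or ask for explicitly in the prompt, the wrapper below retries a failing fetch a few times with an increasing delay. `fetch_with_retry` and its parameters are hypothetical names. The fetch function is passed in as an argument so the retry logic can be tried without network access.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

With the first example, this could be called as `fetch_with_retry(check_price)`, so that a temporary network error or a missing price element does not end the monitoring loop on the first failure.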

Conclusion

We have now seen two examples of how ChatGPT can be used to create web scraping scripts in Python with Beautiful Soup. This is just one use case of ChatGPT. You can also use it to debug code, create test cases, or get guidance on programming concepts, syntax, and algorithms. It can also help you brainstorm which types of data to scrape and how to structure the code. Overall, ChatGPT is a helpful tool for writing web scraping scripts, but it is best used alongside programming tools and resources specifically designed for web scraping.