Web Scraping
Web scraping, also known as web harvesting or web data extraction, is a form of data scraping used to extract data from websites.
Web Crawling
A web crawler, sometimes called a **spider** or **spiderbot** and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically operated by search engines for the purpose of Web indexing (web spidering).
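To make the idea concrete, here is a minimal sketch of a breadth-first crawler that stays on a single domain: it keeps a queue of pages to visit and a set of already-visited URLs, fetching each page and collecting its links. The function name, page limit, and starting URL are arbitrary choices for illustration, not part of any standard.
from collections import deque
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=10):
    """Breadth-first crawl limited to the start URL's domain (illustrative sketch)."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'lxml')
        # Collect same-domain links and enqueue the ones we have not seen yet
        for link in soup.find_all('a', href=True):
            absolute = urljoin(url, link['href'])
            if urlparse(absolute).netloc == domain and absolute not in visited:
                queue.append(absolute)
    return visited

crawl('https://toscrape.com/')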
Robots.txt
robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
Example: https://booking.com/robots.txt
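Python's standard library can read a robots.txt file directly via urllib.robotparser; a quick sketch (the URL checked here is just illustrative):
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://booking.com/robots.txt')
rp.read()  # fetch and parse the file
# can_fetch(user_agent, url) returns True if that agent is allowed to visit the URL
rp.can_fetch('*', 'https://booking.com/')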
download_page.py
import requests
from bs4 import BeautifulSoup
URL = 'https://toscrape.com/'
web_scraping_sandbox = requests.get(URL)  # send a GET request; returns a Response object
web_scraping_sandbox.status_code
# output --> 200
web_scraping_sandbox.text
# output -->
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Scraping Sandbox</title>
<link href="./css/bootstrap.min.css" rel="stylesheet">
<link href="./css/main.css" rel="stylesheet">
</head>
<body>
<div class="container">
<div class="row">
.
.
.
</div>
</div>
</body>
</html>
web_scraping_sandbox.headers
# output -->
{'Date': 'Fri, 13 Oct 2023 02:20:03 GMT', 'Content-Type': 'text/html', 'Content-Length': '3939', 'Connection': 'keep-alive', 'Last-Modified': 'Wed, 08 Feb 2023 21:02:33 GMT', 'ETag': '"63e40de9-f63"', 'Accept-Ranges': 'bytes', 'Strict-Transport-Security': 'max-age=0; includeSubDomains; preload'}
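The headers behave like a case-insensitive dictionary, so a single value can be read directly:
web_scraping_sandbox.headers['Content-Type']
# output --> 'text/html'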
web_scraping_sandbox.request.method
# output --> 'GET'
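In real scripts it helps to guard the request; below is a minimal sketch, assuming a 10-second timeout and a made-up User-Agent string are acceptable for the target site:
import requests

try:
    response = requests.get(
        'https://toscrape.com/',
        headers={'User-Agent': 'my-scraper/0.1'},  # hypothetical UA string
        timeout=10,  # fail fast instead of hanging on a slow server
    )
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
except requests.RequestException as exc:
    print(f'Request failed: {exc}')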
Adding BeautifulSoup
soup = BeautifulSoup(web_scraping_sandbox.text, 'lxml')
print(type(soup))
# output --> <class 'bs4.BeautifulSoup'>
print(soup.find('h2'))
# output --> <h2>Books</h2>
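find returns only the first match, while find_all returns every match. For example, to list all the links on the page (a generic pattern, not specific to toscrape.com):
for link in soup.find_all('a', href=True):
    print(link['href'], '-', link.get_text(strip=True))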
Under Construction