Web Scraping
Web scraping, also known as web harvesting or web data extraction, is a form of data scraping used to extract data from websites.
Web Crawling
A web crawler, sometimes called a **spider** or **spiderbot** and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically operated by search engines for the purpose of Web indexing (web spidering).
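To make the idea concrete, here is a minimal sketch of a breadth-first crawler that stays on a single domain: it keeps a queue of pages to visit and a set of already-visited URLs, fetching each page and collecting its links. The function name, page limit, and starting URL are arbitrary choices for illustration, not part of any standard.
from collections import deque
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=10):
    """Breadth-first crawl limited to the start URL's domain (illustrative sketch)."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'lxml')
        # Collect same-domain links and enqueue the ones we have not seen yet
        for link in soup.find_all('a', href=True):
            absolute = urljoin(url, link['href'])
            if urlparse(absolute).netloc == domain and absolute not in visited:
                queue.append(absolute)
    return visited

crawl('https://toscrape.com/')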
Robots.txt
robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
Example: https://booking.com/robots.txt
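Python's standard library can read a robots.txt file directly via urllib.robotparser; a quick sketch (the URL checked here is just illustrative):
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://booking.com/robots.txt')
rp.read()  # fetch and parse the file
# can_fetch(user_agent, url) returns True if that agent is allowed to visit the URL
rp.can_fetch('*', 'https://booking.com/')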
download_page.py
import requests
from bs4 import BeautifulSoup
URL = 'https://toscrape.com/'
web_scraping_sandbox = requests.get(URL)  # send a GET request; returns a Response object
web_scraping_sandbox.status_code
# output --> 200
web_scraping_sandbox.text
# output -->
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Scraping Sandbox</title>
<link href="./css/bootstrap.min.css" rel="stylesheet">
<link href="./css/main.css" rel="stylesheet">
</head>
<body>
<div class="container">
<div class="row">
.
.
.
</div>
</div>
</body>
</html>
web_scraping_sandbox.headers
# output -->
{'Date': 'Fri, 13 Oct 2023 02:20:03 GMT', 'Content-Type': 'text/html', 'Content-Length': '3939', 'Connection': 'keep-alive', 'Last-Modified': 'Wed, 08 Feb 2023 21:02:33 GMT', 'ETag': '"63e40de9-f63"', 'Accept-Ranges': 'bytes', 'Strict-Transport-Security': 'max-age=0; includeSubDomains; preload'}
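The headers behave like a case-insensitive dictionary, so a single value can be read directly:
web_scraping_sandbox.headers['Content-Type']
# output --> 'text/html'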
web_scraping_sandbox.request.method
# output --> 'GET'
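In real scripts it helps to guard the request; below is a minimal sketch, assuming a 10-second timeout and a made-up User-Agent string are acceptable for the target site:
import requests

try:
    response = requests.get(
        'https://toscrape.com/',
        headers={'User-Agent': 'my-scraper/0.1'},  # hypothetical UA string
        timeout=10,  # fail fast instead of hanging on a slow server
    )
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
except requests.RequestException as exc:
    print(f'Request failed: {exc}')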
Adding BeautifulSoup
soup = BeautifulSoup(web_scraping_sandbox.text, 'lxml')
print(type(soup))
# output --> <class 'bs4.BeautifulSoup'>
print(soup.find('h2'))
# output --> <h2>Books</h2>
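find returns only the first match, while find_all returns every match. For example, to list all the links on the page (a generic pattern, not specific to toscrape.com):
for link in soup.find_all('a', href=True):
    print(link['href'], '-', link.get_text(strip=True))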
Under Construction