Vic

Web Scraping CLI tool for scanning websites

This Python script is a web scraper built with several libraries: click, requests, beautifulsoup4, chompjs, and rich.

Import Libraries: The script starts by importing the necessary libraries.

Define the Main Function, scrape: The scrape function is the entry point of the script. It is decorated with click decorators that define its command-line interface: a URL to scrape, a flag to save the result, and a filename for the saved output.
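
As a reference point, here is a minimal, self-contained sketch of the click pattern in play; the command name demo is illustrative:

import click

@click.command()
@click.argument('url')                  # required positional argument
@click.option('--save', is_flag=True)   # boolean flag, defaults to False
def demo(url, save):
    # click parses sys.argv and passes the values in as parameters
    click.echo(f"url={url}, save={save}")

if __name__ == '__main__':
    demo()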

Instantiate Console: Inside the scrape function, a rich Console object is created, which provides styled, more readable console output.
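
In isolation, Console usage looks like this; the style string controls color and weight:

from rich.console import Console

console = Console()
console.print("Status: OK", style="bold green")  # prints in bold green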

Time the Request: The script measures how long the request to the provided URL takes, which gives a rough sense of the page's response performance.
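
Worth noting: requests also records its own timing in response.elapsed (the time from sending the request until the response headers arrive, as a timedelta), so wall-clock timing like the script's additionally captures the body download:

import requests

response = requests.get("https://example.com")
# elapsed covers sending the request up to parsing the response headers
print(response.elapsed.total_seconds())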

Parse HTML: The response is parsed with the BeautifulSoup class from the beautifulsoup4 library, which enables easy access to and manipulation of the page's HTML.
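
A tiny standalone example of the kind of access this enables (the markup here is made up):

from bs4 import BeautifulSoup

html = "<html><body><h1>Hello</h1><a href='/about'>About</a></body></html>"
soup = BeautifulSoup(html, 'html.parser')
print(soup.h1.text)            # Hello
print(soup.find('a')['href'])  # /about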

Print Request Information: Information about the request is printed to the console, including the original URL, the domain, the final URL after any redirects, the response time and size, the status code, the response and request headers, and any cookies set by the server.

Print HTML Code: The parsed HTML is then pretty-printed to the console with syntax highlighting via the rich library's Syntax class, unless the --save flag was passed.
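
Syntax wraps a string of source code for highlighted printing; a minimal example with the same options the script uses:

from rich.console import Console
from rich.syntax import Syntax

console = Console()
snippet = Syntax("<p>hello</p>", "html", theme="monokai", line_numbers=True)
console.print(snippet)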

Find JavaScript Objects: The script then attempts to parse any JavaScript objects found in script tags on the page, using the get_js_objects function defined later.
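
The heavy lifting here is done by chompjs, which can parse JavaScript object literals that are not valid JSON, for example ones with unquoted keys and single-quoted strings:

from chompjs import parse_js_object

data = parse_js_object("{user: 'vic', ids: [1, 2, 3]}")
print(data)  # {'user': 'vic', 'ids': [1, 2, 3]}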

Save HTML: If the --save flag was passed, the prettified HTML of the page is written to a file with the provided filename.

Define the get_js_objects Function: The get_js_objects function finds script tags on the page and parses any JavaScript objects they contain; scrape uses it to extract embedded JavaScript data.

Run the Script: Finally, if the script is run as the main file, the scrape function is invoked and click handles the command-line arguments and options.

Here's the code in full:

import click
import requests
from bs4 import BeautifulSoup
from chompjs import parse_js_object
from rich.console import Console
from rich.syntax import Syntax
import time
from urllib.parse import urlparse

@click.command()
@click.argument('url')
@click.option('--save', is_flag=True, help='Save results to a file')
@click.option('--filename', default='results.html', help='Specify the filename')
def scrape(url, save, filename):
    console = Console()

    # Time the GET request (redirects are followed)
    start_time = time.time()
    response = requests.get(url, allow_redirects=True)
    end_time = time.time()

    response_time = end_time - start_time
    response_size = len(response.content)

    # Parse the response body into a navigable HTML tree
    soup = BeautifulSoup(response.content, 'html.parser')

    # Print request/response metadata
    console.print(f"URL Requested: {url}", style="bold green")
    console.print(f"Domain: {urlparse(url).netloc}", style="bold green")
    console.print(f"Final URL: {response.url}", style="bold green")
    console.print(f"Response time: {response_time} seconds", style="bold green")
    console.print(f"Response size: {response_size} bytes", style="bold green")
    console.print(f"Status code: {response.status_code}", style="bold green")
    console.print(f"Response headers: {response.headers}", style="bold green")
    console.print(f"Request headers: {response.request.headers}", style="bold green")
    console.print(f"Cookies: {response.cookies}", style="bold green")

    # Pretty-print the HTML with syntax highlighting, unless we are saving instead
    syntax = Syntax(soup.prettify(), "html", theme="monokai", line_numbers=True)
    if not save:
        console.print("HTML code: ", style="bold red")
        console.print(syntax)

    # Extract any JavaScript objects embedded in <script> tags
    data_sources, failed = get_js_objects(response)
    console.print("JavaScript data sources found: ", style="bold red")
    console.print(data_sources)

    console.print("JavaScript data sources failed: ", style="bold red")
    console.print(failed)

    if save:
        with open(filename, 'w') as f:
            f.write(soup.prettify())

def get_js_objects(response: requests.models.Response) -> tuple:
    """Return (parsed JS objects, raw scripts that failed to parse)."""
    script_tags = BeautifulSoup(response.content, 'html.parser').find_all('script')
    all_data_sources = []
    failed = []
    for script in script_tags:
        # script.string is None for external scripts (src=...), so skip those
        if script.string:
            try:
                all_data_sources.append(parse_js_object(script.string))
            except Exception:
                failed.append(script.string)

    return all_data_sources, failed

if __name__ == '__main__':
    scrape()
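To try it, assuming the script is saved as scraper.py (any filename works):

python scraper.py https://example.com
python scraper.py https://example.com --save --filename page.html

The first call prints the request details and the highlighted HTML; the second skips the HTML dump and writes the page to page.html instead.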

If you found this useful, don't forget to follow me on my social networks! Linktree
