How to extract all links in a website using Python

#python #codenewbie #tutorial #webscraping

Hello Pythonistas,
In this tutorial, you’re going to learn how to extract all links from a given website or URL using BeautifulSoup and requests.

If you’re new to web scraping, I recommend starting with a beginner tutorial on web scraping and coming back to this one once you’re comfortable with the basics.

How do we extract all links?

We will use the requests library to download the raw HTML of the page, and then use BeautifulSoup to parse that HTML and pull the link out of every anchor (<a>) tag it contains.
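The parsing step can be tried on its own, without any network request. The following is a minimal sketch, assuming BeautifulSoup is already installed (see Installation); the HTML string and the variable names are made up for illustration and stand in for a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a page fetched with requests.
sample_html = """
<html><body>
  <a href="https://example.com/">Home</a>
  <a href="/about">About</a>
  <a name="top">No href here</a>
</body></html>
"""

# Parse the HTML and collect every <a> tag.
soup = BeautifulSoup(sample_html, "html.parser")
anchors = soup.find_all("a")

# .get('href') returns None when a tag has no href attribute.
hrefs = [a.get("href") for a in anchors]
print(hrefs)  # ['https://example.com/', '/about', None]
```

Note that an anchor without an href attribute yields None, and that relative links such as /about come back exactly as they are written in the page.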

Requirements

To follow along with this tutorial, you need to have the requests and Beautiful Soup libraries installed.

Installation

$ pip install requests
$ pip install beautifulsoup4

The code below prompts you for the URL of a website, uses requests to send a GET request for the HTML page, and then uses BeautifulSoup to extract the href of every anchor (<a>) tag in that HTML.

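import requests
from bs4 import BeautifulSoup


def extract_all_links(site):
    # Download the raw HTML of the page
    response = requests.get(site, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors such as 404
    # Parse the HTML and keep only the <a> tags that have an href attribute
    soup = BeautifulSoup(response.text, 'html.parser')
    anchors = soup.find_all('a', href=True)
    return [anchor['href'] for anchor in anchors]


site_link = input('Enter URL of the site: ')
all_links = extract_all_links(site_link)
print(all_links)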

Output:

kalebu@kalebu-PC:~/$ python3 link_spider.py
Enter URL of the site: https://kalebujordan.com/

['#main-content', 'mailto://kalebjordan.kj@gmail.com', 
'https://web.facebook.com/kalebu.jordan', 'https://twitter.com/j_kalebu',
'https://kalebujordan.com/'.....]
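As the output shows, the raw list can mix page fragments (such as '#main-content'), mailto links, and full URLs. If you only want web links, one possible clean-up step is sketched below using the standard library's urllib.parse. The normalize_links helper and the sample list are hypothetical and not part of the original script.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Return absolute http(s) URLs from raw href values, without duplicates."""
    seen = set()
    cleaned = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # skip missing hrefs and same-page fragments
        # Resolve relative paths such as '/about' against the page URL
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript: and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


# Made-up sample resembling the kind of values a page can contain
raw = ["#main-content", "mailto:someone@example.com", "/about",
       "https://example.com/", None, "/about"]
print(normalize_links("https://example.com/", raw))
# ['https://example.com/about', 'https://example.com/']
```

In the script above, this could be applied as normalize_links(site_link, all_links) before printing, assuming you want absolute URLs only.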

I hope you find this useful. Feel free to share it with your fellow developers.

The original article can be found on kalebujordan.com.
