Sm0ke


HTML Parser - Extract information from a LIVE website

Hello Coder,

In this article, I will present a shortlist of code snippets useful for extracting information from a live website. The code is written in Python on top of the BeautifulSoup HTML parsing library.


Thank you! Content provided by AppSeed - App Generator.


What is an HTML Parser

According to Wikipedia, parsing or syntactic analysis is the process of analyzing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. Applied here, HTML parsing means crawling the HTML code of a page and extracting and processing relevant information, like the head title, page assets, and main sections.
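To make the idea concrete, here is a minimal sketch: BeautifulSoup takes a hard-coded HTML string (a made-up stand-in for a real page) and turns the flat text into a navigable tree:

```python
from bs4 import BeautifulSoup

# a tiny, invented HTML document to illustrate what parsing produces
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# the flat string is now a tree we can navigate by tag name
print(soup.title.string)   # Demo
print(soup.body.p.string)  # Hello
```

The rest of the article does exactly this, only on HTML pulled from a live URL instead of a hard-coded string.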


Setup the environment

To execute the sample code, we need a Python environment and a few useful dependencies:

$ pip install ipython         # the console where we execute the code
$ pip install requests        # a library to pull the entire HTML page
$ pip install beautifulsoup4  # the real magic is here

If all goes well, we can start coding. Please type ipython to start the interactive Python console:

# import libraries
import requests
from bs4 import BeautifulSoup as bs

# define the URL to crawl & parse
# feel free to change this URL with your own app
app_url = 'https://flask-bulma-css.appseed.us/'

# crawling the page. This might take a few seconds
page = requests.get( app_url )

# to check the crawl status, just type:
page
<Response [200]> # all good

# to print the page contents type:
page.content 

At this point, we have the page content. Let's inject the HTML into BeautifulSoup and get some information from the remote page.

soup = bs(page.content, 'html.parser')

# print the entire page head
soup.head

# print only the title
soup.head.title
<title>Flask Bulma CSS - BulmaPlay Open-Source App </title>
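The head usually carries more than the title. As a sketch over an invented head fragment (the tag names are standard, the values are made up), we can pull meta information the same way:

```python
from bs4 import BeautifulSoup

# invented head markup; real pages expose the same standard tags
html = """<head>
  <title>BulmaPlay</title>
  <meta charset="utf-8"/>
  <meta name="description" content="Open-source Flask app"/>
</head>"""
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string
# find() with an attrs dict targets one specific meta tag
description = soup.find("meta", attrs={"name": "description"})["content"]
print(title)        # BulmaPlay
print(description)  # Open-source Flask app
```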

To verify the accuracy of the result, we can view the source of the page.


Where to go from here?

Using the BS library, we can easily navigate the DOM. For instance, let's print the JavaScript files used by the page, in just a few lines of code:


# the code (only direct children of <body>; inline scripts have no 'src')
for script in soup.body.find_all('script', recursive=False):
    if script.has_attr('src'):
        print(' Js = ' + script['src'])

# the output
 Js = /static/assets/js/jquery.min.js
 Js = /static/assets/js/jquery.lazy.min.js
 Js = /static/assets/js/slick.min.js
 Js = /static/assets/js/scrollreveal.min.js
 Js = /static/assets/js/jquery.waypoints.min.js
 Js = /static/assets/js/jquery.waypoints-sticky.min.js
 Js = /static/assets/js/jquery.counterup.min.js
 Js = https://cdnjs.cloudflare.com/ajax/libs/modernizr/2.8.3/modernizr.min.js
 Js = /static/assets/js/app.js
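The same approach works for stylesheets. Here is a sketch over an invented head fragment (the paths are placeholders), filtering the `<link>` tags whose rel contains "stylesheet":

```python
from bs4 import BeautifulSoup

# made-up markup; the href values are placeholders
html = """<head>
  <link rel="stylesheet" href="/static/assets/css/app.css"/>
  <link rel="icon" href="/favicon.ico"/>
</head>"""
soup = BeautifulSoup(html, "html.parser")

# 'rel' can hold several values, so check membership instead of equality
css_files = [link["href"] for link in soup.find_all("link")
             if "stylesheet" in link.get("rel", [])]
for href in css_files:
    print(" Css = " + href)
```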

Let's print the content of the app.js file:


# app_url was initialized a few lines above:
# app_url = 'https://flask-bulma-css.appseed.us/'
# note: app_url already ends with a slash
app_js = requests.get(app_url + 'static/assets/js/app.js')

# to check the status, just type the name of the object
app_js
<Response [200]> # all good, let's print the content of the remote file

app_js.content
# some unminified js code will be listed here. 

Let's print the level-one elements from the page body:


# the code
for elem in soup.body.children:
    if elem.name:  # skip text nodes, which have no tag name
        print(' -> elem ' + elem.name)

# the output
 -> elem div
 -> elem section
 -> elem section
 -> elem section
 -> elem section
 -> elem section
 -> elem section
 -> elem footer
 -> elem div
 -> elem div
 -> elem div
 -> elem script
 -> elem script
 -> elem script
 -> elem script
 -> elem script
 -> elem script
 -> elem script
 -> elem script
 -> elem script
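To summarize the structure instead of listing every child, a Counter can tally the direct children by tag name. A sketch over a made-up body that mimics the output above:

```python
from collections import Counter
from bs4 import BeautifulSoup

# invented body markup mirroring the structure listed above
html = """<body>
  <div></div>
  <section></section><section></section>
  <footer></footer>
  <script></script><script></script>
</body>"""
soup = BeautifulSoup(html, "html.parser")

# count direct children by tag name, skipping text nodes (name is None)
counts = Counter(child.name for child in soup.body.children if child.name)
print(counts)
```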

Let's print the footer:

soup.footer

# to have a nice print of elements, we can use BS prettify() helper
# using prettify(), the output is nicely indented 

print(soup.footer.prettify())

# the output
<footer class="footer footer-dark">
 <div class="container">
  <div class="columns">
   <div class="column">
    <div class="footer-logo">
     <img alt="Footer Logo for BulmaPlay - JAMStack Bulma CSS Web App." src="/static/assets/images/logos/bulmaplay-logo.png"/>
    </div>
....
    </div>
   </div>
  </div>
 </div>
</footer>

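Beyond pretty-printing, get_text() strips all markup and returns only the visible text. A sketch over an invented footer fragment:

```python
from bs4 import BeautifulSoup

# invented footer content for illustration
html = ('<footer class="footer"><div class="container">'
        '<p>BulmaPlay - built by AppSeed</p></div></footer>')
soup = BeautifulSoup(html, "html.parser")

# strip=True trims surrounding whitespace from each text node
print(soup.footer.get_text(strip=True))  # BulmaPlay - built by AppSeed
```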

And for the last code snippet, let's print the anchors referenced in the footer section:


# the code
for elem in soup.body.footer.find_all('a'):
    print(' footer href = ' + elem['href'])

# the output
 footer href = https://bulma.io
 footer href = https://github.com/app-generator/flask-bulma-css
 footer href = https://appseed.us/apps/bulma-css?flask-bulma-css
 footer href = https://blog.appseed.us/tag/bulma-css
 footer href = https://absurd.design/
 footer href = https://github.com/cssninjaStudio/fresh

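Some of these hrefs are relative to the site root. The standard library's urljoin can normalize both forms against the base URL; a sketch with invented footer links:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

app_url = 'https://flask-bulma-css.appseed.us/'

# invented footer markup mixing a relative and an absolute link
html = '<footer><a href="/docs">Docs</a> <a href="https://bulma.io">Bulma</a></footer>'
soup = BeautifulSoup(html, "html.parser")

# urljoin resolves relative hrefs and leaves absolute URLs untouched
resolved = [urljoin(app_url, a['href']) for a in soup.find_all('a')]
for url in resolved:
    print(' footer href = ' + url)
```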

Related Articles


Other Parsing Resources


Useful? AMA in the comments. Thank you & happy HTML parsing!

Top comments (3)

Azad Husen

Can I convert any HTML website to mobile responsive?

Areahints

Awesome guide!

I made a simple scraper that writes the page to HTML. so I have an offline copy of the site, will your tuts work with the offline copy?

Sm0ke

Hello, thank you!
Yes, you can load the HTML from a file, instead of crawling. Please take a look at this article:

dev.to/sm0ke/html-parser-developer...

The relevant code snippet:


# read_file returns the file content as a string
html_content = read_file('index.html')
soup = bs(html_content, 'html.parser')
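As a self-contained sketch of the offline workflow (the file name and page content are invented), write a sample page to a temp directory and parse it back:

```python
import os
import tempfile
from bs4 import BeautifulSoup as bs

# write an invented sample page to disk
path = os.path.join(tempfile.gettempdir(), 'index.html')
with open(path, 'w') as f:
    f.write('<html><head><title>Offline Demo</title></head></html>')

# load the HTML from the file instead of crawling
with open(path) as f:
    soup = bs(f.read(), 'html.parser')

print(soup.title.string)  # Offline Demo
```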

Happy parsing!
.. <('_')> ..