<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Edward Pasenidis</title>
    <description>The latest articles on DEV Community by Edward Pasenidis (@pasenidis).</description>
    <link>https://dev.to/pasenidis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F285547%2F76494172-1203-42a3-b093-6ecff2eae0e3.png</url>
      <title>DEV Community: Edward Pasenidis</title>
      <link>https://dev.to/pasenidis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pasenidis"/>
    <language>en</language>
    <item>
      <title>Basics of Scraping with Python 🐍</title>
      <dc:creator>Edward Pasenidis</dc:creator>
      <pubDate>Sun, 26 Jul 2020 14:57:13 +0000</pubDate>
      <link>https://dev.to/pasenidis/basics-of-scraping-with-python-40bo</link>
      <guid>https://dev.to/pasenidis/basics-of-scraping-with-python-40bo</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8ipi7mjpvgb1gclc1pu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8ipi7mjpvgb1gclc1pu1.png" alt="Snap"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prologue
&lt;/h2&gt;

&lt;p&gt;Hello, in this post I am going to walk through writing a scraper script in Python, with the help of the Beautiful Soup library.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing the dependencies
&lt;/h2&gt;

&lt;p&gt;First of all, since Beautiful Soup is a third-party community project, you have to install it from the PyPI registry.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;beautifulsoup4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
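If the install succeeded, the package is importable. One gotcha worth knowing: the distribution name (`beautifulsoup4`) differs from the module name (`bs4`). A quick sanity check (the version printed will vary with whatever pip installed):

```python
# The package installs as "beautifulsoup4" but is imported as "bs4".
import bs4

print(bs4.__version__)  # prints whatever version pip installed
```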



&lt;h2&gt;
  
  
  Philosophy of Beautiful Soup
&lt;/h2&gt;

&lt;p&gt;Beautiful Soup is a library that sits on top of an HTML or XML parser (in our case, the former): the parser does the low-level work, and Beautiful Soup turns its output into a tree of Python objects you can search and navigate.&lt;/p&gt;
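To make that concrete, here is a minimal sketch: we hand Beautiful Soup a string of markup plus the name of a parser backend (`html.parser`, which ships with Python), and it gives back a navigable tree of tag objects:

```python
from bs4 import BeautifulSoup

# Beautiful Soup doesn't parse markup itself; it wraps a parser backend.
# "html.parser" is in the Python standard library, so no extra installs.
soup = BeautifulSoup("<p class='intro'>Hello, <b>world</b>!</p>", "html.parser")

print(soup.p.get_text())  # "Hello, world!" - text content with tags stripped
print(soup.b.name)        # "b" - each node knows its own tag name
```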

&lt;h2&gt;
  
  
  Basic Script
&lt;/h2&gt;

&lt;p&gt;Now that we know how it works, let's write a tiny script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.request&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urlopen&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;


&lt;span class="n"&gt;WEBSITE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://google.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WEBSITE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;bs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example we use &lt;code&gt;urlopen&lt;/code&gt; from the standard-library &lt;code&gt;urllib&lt;/code&gt; module, which downloads the HTML for us.&lt;br&gt;
Then we call &lt;code&gt;read()&lt;/code&gt; on the &lt;code&gt;html&lt;/code&gt; response object, which contains the google.com document, and hand the result to BeautifulSoup for parsing.&lt;/p&gt;
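The key detail is that `urlopen()` returns a file-like object whose `read()` yields raw bytes, and BeautifulSoup accepts those bytes directly, working out the encoding itself. A sketch that simulates the response with `io.BytesIO` (a stand-in so it runs without network access):

```python
from io import BytesIO

from bs4 import BeautifulSoup

# Stand-in for the file-like object urlopen() returns; a real run
# would fetch these bytes from the network instead.
fake_response = BytesIO(b"<html><head><title>Example</title></head></html>")

# .read() returns raw bytes; BeautifulSoup detects the encoding.
bs = BeautifulSoup(fake_response.read(), "html.parser")
print(bs.title.string)  # -> Example
```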
&lt;h2&gt;
  
  
  Parsing data
&lt;/h2&gt;

&lt;p&gt;Sometimes, we want to get specific parts of a document, such as a paragraph or an image.&lt;/p&gt;

&lt;p&gt;You can search for a specific HTML tag in BeautifulSoup with the &lt;code&gt;find()&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;Let's scrape the Google logo tag from their homepage!&lt;br&gt;
Add the following lines of code to the already existing file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;google_logo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;img&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hplogo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;google_logo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These two lines of code should produce output like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;img&lt;/span&gt; 
&lt;span class="na"&gt;alt=&lt;/span&gt;&lt;span class="s"&gt;"Google"&lt;/span&gt; 
&lt;span class="na"&gt;height=&lt;/span&gt;&lt;span class="s"&gt;"92"&lt;/span&gt; 
&lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"hplogo"&lt;/span&gt; 
&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png"&lt;/span&gt;
&lt;span class="na"&gt;style=&lt;/span&gt;&lt;span class="s"&gt;"padding:28px 0 14px"&lt;/span&gt; 
&lt;span class="na"&gt;width=&lt;/span&gt;&lt;span class="s"&gt;"272"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, how does this work?&lt;br&gt;
Well, we are using the &lt;code&gt;find()&lt;/code&gt; method and passing it two arguments: a tag name and a dictionary of attributes.&lt;br&gt;
To be exact, we are telling it to search for an &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt; tag whose id is &lt;code&gt;'hplogo'&lt;/code&gt;.&lt;/p&gt;
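Since Google can change its homepage markup at any time, here is the same `find()` call against a small hand-written document (the tags below are made up for illustration), which also shows how to pull individual attributes out of the match:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = """
<html><body>
  <img id="hplogo" alt="Google" src="/images/branding/logo.png">
  <img id="other" alt="Other" src="/images/other.png">
</body></html>
"""
bs = BeautifulSoup(html, "html.parser")

# find() returns the first matching Tag, or None when nothing matches.
logo = bs.find("img", {"id": "hplogo"})
print(logo["alt"])        # "Google" - attributes read like dict entries
print(logo.get("class"))  # None - .get() avoids a KeyError for absent attrs
```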

&lt;h2&gt;
  
  
  Epilogue
&lt;/h2&gt;

&lt;p&gt;That's all for now.&lt;br&gt;
To learn more about Beautiful Soup, read the &lt;a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="noopener noreferrer"&gt;docs&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>webdev</category>
    </item>
    <item>
      <title>A personal blog for fun</title>
      <dc:creator>Edward Pasenidis</dc:creator>
      <pubDate>Thu, 23 Jul 2020 19:09:13 +0000</pubDate>
      <link>https://dev.to/pasenidis/a-personal-blog-for-fun-2p2b</link>
      <guid>https://dev.to/pasenidis/a-personal-blog-for-fun-2p2b</guid>
      <description>&lt;p&gt;I made a blog with VueJS that makes AJAX requests to DEV.TO and parses the articles I've written. This way, I don't need any backend or DB.&lt;/p&gt;

</description>
      <category>blog</category>
      <category>vue</category>
      <category>frontend</category>
    </item>
    <item>
      <title>Making apps during quarantine!</title>
      <dc:creator>Edward Pasenidis</dc:creator>
      <pubDate>Thu, 26 Mar 2020 05:05:02 +0000</pubDate>
      <link>https://dev.to/pasenidis/making-apps-during-quarantine-1hl6</link>
      <guid>https://dev.to/pasenidis/making-apps-during-quarantine-1hl6</guid>
      <description>&lt;h2&gt;
  
  
  Boring, huh?
&lt;/h2&gt;

&lt;p&gt;Quarantine: a different take on "staying home as usual", only it's unusual and you can't go out when you get bored. Bad, huh? Eventually it made me so bored that I created a COVID-19 tracker.&lt;br&gt;
But how does it work? I mean, what sets it apart from the many other crappy trackers? Well, this one is developed by two people &amp;amp; it contains time charts :) (&lt;a href="https://covid-19-system.herokuapp.com/developers"&gt;https://covid-19-system.herokuapp.com/developers&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  What is this tracker all about?
&lt;/h2&gt;

&lt;p&gt;I mean, now you can compare two time periods (e.g. December &amp;amp; March).&lt;br&gt;
Kinda useless? Maybe, but social media love phrases like "the COVID-19 infection rate has risen, 5% more than it was in February" and things like that. Who knows, maybe journalists will use this thing. The funny part is that the API wasn't even created by us, yeah - you heard right!&lt;br&gt;
Basically, we will soon be using a second API, which is also not ours!&lt;br&gt;
That's open source for you, beginners! (yes, contributing especially is amazing). Back to our topic: we haven't even implemented a custom API, although I may do that later. Anyway, we will be adding more charts, country search, better mobile responsiveness &amp;amp; much more.&lt;/p&gt;

&lt;p&gt;Now, let's see how that thing works under the hood...&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploring the project
&lt;/h2&gt;

&lt;p&gt;So, start by running &lt;code&gt;git clone&lt;/code&gt; on the site repository to download it. Let's explore it - open the &lt;strong&gt;src&lt;/strong&gt; folder to get started. See? There are many files: some are Pug templates, others are browser JavaScript, and there is also a CSS file - there is a lot going on in that repo.&lt;/p&gt;

&lt;h2&gt;
  
  
  But how do they talk?
&lt;/h2&gt;

&lt;p&gt;Well, if you type &lt;code&gt;npm start&lt;/code&gt;, a Node Express server starts up. Express is responsible for the routes &amp;amp; a few minor things inside the repo.&lt;/p&gt;

&lt;p&gt;Then comes Pug, an HTML pre-processor: a template engine that replaces placeholders inside HTML with real content!&lt;/p&gt;

&lt;p&gt;Next up is the public directory, which contains the CSS files and the JavaScript that runs in the browser (not related to Node; it's linked in by Pug).&lt;br&gt;
This code fetches information from an API, which you can find on the project's GitHub repository linked at the end of this article. [1]&lt;/p&gt;

&lt;p&gt;This was a brief walkthrough; I am not going to dive deeper - you will be able to do that yourself when the major release is ready!&lt;/p&gt;

&lt;p&gt;Let's not forget to mention the developers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Me (Edward, also the writer of this post)&lt;/li&gt;
&lt;li&gt;Lean (Tasos, a cool dude who has built everything from Discord bots to an Arduino-to-Discord webhook system)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Some important links
&lt;/h2&gt;

&lt;p&gt;[1]. &lt;a href="https://github.com/pasenidis/covid19-stats"&gt;https://github.com/pasenidis/covid19-stats&lt;/a&gt;&lt;br&gt;
[2]. &lt;a href="https://github.com/pasenidis"&gt;https://github.com/pasenidis&lt;/a&gt;&lt;br&gt;
[3]. &lt;a href="https://github.com/TasosY2K"&gt;https://github.com/TasosY2K&lt;/a&gt;&lt;/p&gt;

</description>
      <category>node</category>
      <category>api</category>
      <category>express</category>
      <category>coronavirus</category>
    </item>
  </channel>
</rss>
