DEV Community: Kajal

8 ML/AI projects to make your portfolio stand out.🔥

Kajal — Tue, 04 Aug 2020 12:34:39 +0000

Sentiment analysis for depression based on social media posts
Sports match video to text summarization using neural networks
Handwritten equation solver using CNN
Business meeting summary generation using NLP
Facial recognition to detect mood and suggest songs accordingly
Finding habitable exoplanets from images captured by space vehicles like Kepler
Image regeneration for old damaged reel pictures
Music generation using deep learning

If you want to explore the ideas listed above, you can go through this blog.

Scraping websites using Octoparse

Kajal — Tue, 28 Jul 2020 16:34:59 +0000

Web Scraping Series : Using Python and Software
1. Scraping web pages without using Software : Python
2. Scraping web Pages using Software : Octoparse

INTRODUCTION

WHY THIS ARTICLE?

This article is the second in my web-scraping series.
As I mentioned in my first article, I chose to write about scraping because, while building my project, a Fake-News Detection System, it took me days of research to get the data I needed, as I wasn't able to find a dataset that matched my needs.

So, if you haven't gone through my first article, I would strongly recommend reading it once, and if you have a programming background, then you must read the first article of this series.

WHOM THIS ARTICLE IS USEFUL FOR?

For users with a programming background, I have already written a blog. If you know Python in particular, I would suggest scraping with it instead of any software, because I find that easier than spending days understanding the interface of a particular tool.

But if you don't have any programming background, you can follow along with me and get familiar with the interface and working of this software.

OVERVIEW

This article covers the second part of the series, Scraping web-pages using software : Octoparse.

However, there are many other tools that you can easily find on the internet for automating this, such as

ParseHub, ScrapeSimple, Diffbot, Mozenda.

Brief introduction to the different automation tools:

1.ParseHub:
Website: https://www.parsehub.com/

Purpose: Parsehub is a phenomenal tool for building web scrapers without coding to extract tremendous data. It is used by data scientists, data journalists, data analysts, E-commerce websites,job boards, marketing & sales, finance & many more.

Features: Its interface is dead simple to use; you can build web scrapers simply by clicking on the data that you want. It then exports the data in JSON or Excel format. It has many handy features such as automatic IP rotation, allowing scraping behind login walls, going through dropdowns and tabs, getting data from tables and maps, and much much more. In addition, it has a generous free tier, allowing users to scrape up to 200 pages of data in just 40 minutes! Parsehub is also nice in that it provides desktop clients for Windows, Mac OS, and Linux, so you can use them from your computer no matter what system you’re running.

2.ScrapeSimple:
Website: https://www.scrapesimple.com

Purpose: ScrapeSimple is the perfect service for people who want a custom scraper built for them. Web scraping is made as simple as filling out a form with instructions for what kind of data you want.

Features: ScrapeSimple lives up to its name with a fully managed service that builds and maintains custom web scrapers for customers. Just tell them what information you need from which sites, and they will design a custom web scraper to deliver the information to you periodically (could be daily, weekly, monthly, or whatever) in CSV format directly to your inbox. This service is perfect for businesses that just want an HTML scraper without needing to write any code themselves. Response times are quick and the service is incredibly friendly and helpful, making this service perfect for people who just want the full data extraction process taken care of for them.

3.Diffbot:
Website: https://www.diffbot.com

Purpose: Enterprises who have specific data crawling and screen scraping needs, particularly those who scrape websites that often change their HTML structure.

Features: Diffbot is different from most page scraping tools out there in that it uses computer vision (instead of html parsing) to identify relevant information on a page. This means that even if the HTML structure of a page changes, your web scrapers will not break as long as the page looks the same visually. This is an incredible feature for long running mission critical web scraping jobs. While they may be a bit pricey (the cheapest plan is $299/month), they do a great job offering a premium service that may make it worth it for large customers.

4.Mozenda:
Website: https://www.mozenda.com/

Purpose: Enterprises looking for a cloud based self serve webpage scraping platform need look no further. With over 7 billion pages scraped, Mozenda has experience in serving enterprise customers from all around the world.

Features: Mozenda allows enterprise customers to run web scrapers on their robust cloud platform. They set themselves apart with the customer service (providing both phone and email support to all paying customers). Its platform is highly scalable and will allow for on premise hosting as well. Like Diffbot, they are a bit pricy, and their lowest plans start at $250/month.

In this article, though, I am going to talk about Octoparse in detail, since it is the only one I have used.

OCTOPARSE

Website: https://www.octoparse.com/

Purpose: Octoparse is a fantastic tool for people who want to extract data from websites without having to code, while still having control over the full process with their easy to use user interface.

Features: Octoparse is the perfect tool for people who want to scrape websites without learning to code. It features a point and click screen scraper, allowing users to scrape behind login forms, fill in forms, input search terms, scroll through infinite scroll, render javascript, and more. It also includes a site parser and a hosted solution for users who want to run their scrapers in the cloud. Best of all, it comes with a generous free tier allowing users to build up to 10 crawlers for free. For enterprise level customers, they also offer fully customized crawlers and managed solutions where they take care of running everything for you and just deliver the data to you directly.

Step by Step explanation to extract data from 1000's of news articles

Step-1: Download Octoparse

Go to the website: https://www.octoparse.com/download
and follow the guidelines given there.

Step-2: Sign-up

After downloading and installing it, sign up for an account if you haven't created one before.

Step-3: Explore it

Before starting on your own, I would strongly recommend exploring its different sections, which will ultimately help you interact with the interface while working on it later.
- Go through the popular templates section; it has templates for popular websites, and you might find your required data there.
- Go through the tutorials on both template mode and advanced mode.

Step-4: Enter URL

If you want to scrape data from just one website, you can simply paste the copied URL on the home page and click Start.
But if you want to scrape data from more than one website, go to the NEW tab and then click the ADVANCED option.
You will see a new window like this, in which you can easily organise your work with more advanced options and keep track of your directories too.
- You can upload up to 10k different URLs here, on the condition that all of these URLs share the same layout so that the data can be extracted together. Otherwise, it will still automate the process but give you separate results for each layout, so you will not get all the data merged together.

Step-5: Specifying scraping details & attributes

Click save and you will see a window like this:
Here, the left section is for maintaining the workflow, the center displays the web page of the first URL that you entered, and the section below shows the data preview.

From there, you have two options to move further:

Auto-detect web-page data

You can either choose "Auto-detect web page data", which will scrape the important features according to its own understanding and return five different results. From these, you can skip whatever is of no use to you, or keep a result if it scraped all the attributes you wanted.

In the dialog box below, you can edit as per your need.

So, in the above image you are provided with 3 options that you can enable or disable as per your need.

First: Scroll down the page to load more data. If a website is not divided into different pages, all of the data may be on the same page, and to extract it all you just need to enable the "load more data" option.
Second: Click the next button to capture multiple pages
Enabling it will paginate through the pages using whatever you select as the "next button".
- It will allow you to check or edit the next button. On clicking Check, you will see the automatically detected next button highlighted in the web-page section.
- If it is not detected correctly, click Edit and then click on whatever on the web-page screen you want to be treated as the "next" button. For instance, there may be no "next" button and ">" may be working as the next button instead. Or, if you don't want to automate scraping all the way to the last page, you can choose to paginate up to a specific page like "1", "2", "3" and so on.
Third: Click the state_url to capture data on the page that follows
- It will allow you to capture the content or text of the page that follows and make another attribute containing that text (that is, the content of the page that opens on clicking a particular URL).

There is one more option in TIPS, "Switch auto-detect results (1/5)". On clicking this link, you will be able to see 5 different sets of auto-detected data. You can keep whichever suits your need.
- After you are done with the editing, click on "save settings".
- You can see the scraped results in "Data Preview" and edit the "Attribute names" too.

You will see changes in your workflow as follows:

Edit task workflow manually

Or you can choose to edit the workflow manually as per your need and select specific elements from the web page to be shown in your dataset as attributes.

There, on pointing at the down-arrows, you will find a "+" sign to add elements as per your need.

It will help you to be specific and organized with your work. In the previous case of auto-detecting data, it scraped too many irrelevant attributes automatically. So, to be specific about what you need, I would suggest opting for this second option.
It also allows you to rename, delete or edit any specific element, or change the settings as needed.
I will demonstrate this with an example of extracting 1000's of articles having 6 attributes: News Headline, Link, Source, Stated on, Date, Image_url.

Extract data manually

To extract all of them, go to the web-page section and select the specific details of the very first article ("News Headline", "Link of news", "Source of news", "Stated On", "Date") by simply clicking on these items. The selected portion will be highlighted as shown in the window below:

Data extraction for all news-articles on first page of listed url

Then, choosing the "Select all" option will select the same details for each article up to the end of the web page. You will see 30 captured lines in the data preview as below:

Now, clicking on the "Extract data" option will extract all details of all articles listed on the 1st page of the entered URL.
- You can now see the changes in the workflow:

A dialog box will open asking whether to extract more elements:

Since we haven't scraped image_url yet, we will select it separately, following the same procedure as above:
- Select an image in the web-page section:

A dialog box will come up with different options > select the "Select All" option.

Another pop-up window will come up asking you to choose among different options > select "Extract image urls".

And you are all done with scraping image URLs as well for all news articles on the first page. It will add a new attribute to your data preview.
- This is how your data preview will look after editing the attribute names:

Again a pop-up window will come up asking to extract more elements. Since we want to extract data from more than one page, we will do pagination.

Pagination

Now, if you need a huge amount of data, you can loop up to a specific page or to the last page of the listed URL.
- To do pagination, all you need is to look for the keyword that leads to the next page of the particular website, like "next", ">", or anything else > click on that keyword (in my case, it is the "next" button itself) > it will be highlighted and a new window will pop up.

Select "Loop click next page".
After you are done, your workflow will look something like the auto-detected one's workflow.

When you are done with all the editing and have organized the data, click Save & Run.

Step-6: Exporting data to your machine

Clicking on the Save & Run option will open a new window listing 3 options for how to run it.

Only if you are a premium user can you access the last two options, which come with features like scraping any website on a daily, weekly or monthly basis, and many more. Their server will take care of your data and send it to you, organized on whatever basis you choose.
If you are a free user, select the first option, "Run task on your device".
It will start extracting all of the data. You need to pay a little attention while it is extracting, even though the process is automated, because if the data limit exceeds 10k, it will stop and you will have to sit for another hour waiting to extract it from scratch.

You also need to keep your system awake, because if your screen sleeps while extracting data, it is possible that the extraction will stop at that point and you will have to start again to extract more.

Choose "export data":

Choose the format to save your file:

So I chose to export my file as a .csv file and save it to my desktop. Don't close this window; we will use it to export the data in .xlsx format again.

Now, let's have a look at the data that we have extracted.

Uh-oh! It's in a suspicious form that is not readable or organized at all.

Let's go to the export window again: export data > choose .xlsx format this time > click ok.

Now, let's have a look at the exported data in .xlsx format.

Voila! Now it does make sense; all ambiguity is removed from our data.

Step-7: Formatting Excel file using formulas

Inspecting Dataset

The first thing to do is inspect your dataset. On inspecting mine, I found some irrelevant things that I wasn't able to edit at the time of scraping.

So we will do some formatting tasks on the Excel file.

I. Look at my "Image url" attribute. I extracted the image URL in order to get the label from it: since the label value is written within the image URL, I didn't find any better option to extract it.

So if you look at the "Image url" attribute closely, there is a small string ".jpg" and a bigger string "https://static.politifact.com/img/meter-" which are common to all rows of the attribute.
So, we will replace both strings with "" to get the label values.
- Press Ctrl+H > fill the "Find what" field with ".jpg" and leave the "Replace with" field empty (meaning you don't have to specify anything there) > click "Replace All" > press "OK". Repeat the same steps with "https://static.politifact.com/img/meter-" in the "Find what" field, and you are done with your labels.
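If you are comfortable with Python, the same find-and-replace can be sketched in a few lines. This is only a minimal sketch, not something Octoparse or Excel runs for you: the function name label_from_image_url and the example URL are made up for illustration, and only the two strings come from the column described above.

```python
# The two strings that are common to every row of the "Image url" column.
PREFIX = "https://static.politifact.com/img/meter-"
SUFFIX = ".jpg"


def label_from_image_url(image_url):
    """Replace both common strings with nothing, like two rounds of Ctrl+H."""
    return image_url.replace(PREFIX, "").replace(SUFFIX, "")


# Hypothetical example value, built only to show the idea.
example_url = PREFIX + "false" + SUFFIX
print(label_from_image_url(example_url))  # prints: false
```

Whatever sits between the two common strings is what survives, which is exactly the label we are after.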

This is how your attribute will look after formatting:

As you can see, there are two more problems with it: first, the value in the first cell is a hyperlink; second, there are extra spaces.

To remove the hyperlink from a particular cell > right-click on it > select "remove hyperlinks" from the drop-down. To remove them from an entire column > select the entire column > right-click on it > select "remove hyperlinks" from the drop-down.
To remove extra spaces from a particular attribute > go to any empty cell > write the formula =TRIM(address of the first cell of the attribute) > press Enter > you will see the formatted value of the first cell > to apply the change to all cells, drag the first cell's fill handle down to the last cell of the attribute > you will see all values in the applied format > now replace the old column with the new one: select the new column entirely > copy it > select the old column where you need to paste it > go to paste options > select the "paste values (V)" option from the drop-down.
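For readers who prefer code, here is a rough Python equivalent of what TRIM does to one cell: it drops leading and trailing spaces and squeezes repeated spaces inside the text down to one. The function name excel_trim and the sample value are my own, used only for this sketch.

```python
def excel_trim(value):
    """Mimic Excel's TRIM for ordinary spaces in a single cell value."""
    # Splitting on " " leaves empty pieces wherever spaces repeat,
    # so we keep only the non-empty pieces and join them with one space.
    return " ".join(piece for piece in value.split(" ") if piece)


print(excel_trim("   half   true  "))  # prints: half true
```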

Fantastic! You are all done with the "Label" column.
Have a look now:

II. Look at my other attribute, "Stated On", from which the only data of concern to me is the date.

To delete the rest of the text, we will work in two steps:

As you have already seen, when a similar string repeats throughout the column, we can use the previous approach of finding it and replacing it with nothing. So we will replace the sub-string "stated on" with nothing.

This is how our column will look:

Looking at the above image, you will find that the other sub-string is different for every row of this column. What do we do with that? Since we only want the date, we will extract it this time:

Let's see: in a new empty cell > type the formula "=MID(address of the first cell of the Stated On attribute, starting index of the string you want to extract, number of characters you want to extract)" > press Enter > you will see the first formatted value > then repeat the steps above to fill all values and replace the old column with the new values.
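To see what MID is doing, here is a small Python sketch of the same idea using string slicing. The helper name excel_mid and the sample cell value are assumptions for illustration; your own column may need a different starting index and length.

```python
def excel_mid(text, start, length):
    """Mimic Excel's MID(text, start, length); start counts from 1, as in Excel."""
    if start < 1 or length < 0:
        raise ValueError("start must be >= 1 and length must be >= 0")
    begin = start - 1  # Python counts positions from 0
    return text[begin:begin + length]


# Hypothetical cell value after "stated on" has been replaced with nothing.
cell = " July 20, 2020 in a Facebook post:"
print(excel_mid(cell, 2, 13))  # prints: July 20, 2020
```

Note that a fixed length only works when every date has the same number of characters, which is worth checking before you drag the formula down the whole column.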

Format of writing formula:

On pressing enter:

Have a look at new "Stated on" attribute:

III. In the "Date" attribute, we don't want any text other than the date. We also can't use the MID formula here, as the date sits at the end of the string and its starting point is not fixed; it changes from cell to cell.

So, we are going to achieve this by using RIGHT:

Go to a new empty cell > type the formula "=RIGHT(E2,LEN(E2)-FIND("•",E2))" > press Enter > and follow the same steps as above to replace the old column's values with the new column's values.
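The RIGHT, LEN and FIND combination simply keeps everything that comes after the first "•" in the cell. A minimal Python sketch of the same logic follows; the function name text_after_marker and the sample cell value are hypothetical.

```python
def text_after_marker(text, marker="•"):
    """Keep everything after the first marker, like RIGHT(A, LEN(A) - FIND(marker, A))."""
    position = text.find(marker)
    if position == -1:
        # Excel's FIND would show an error here, so we fail loudly as well.
        raise ValueError("marker not found in text")
    return text[position + len(marker):]


# Hypothetical cell value: some text, the bullet, then the date.
cell = "By Jane Doe • July 22, 2020"
print(text_after_marker(cell).strip())  # prints: July 22, 2020
```

The strip() call at the end plays the role of TRIM, removing the space that follows the bullet.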

write formula:
press enter:
New column will look like:

Our Final Dataset after all formatting:

So, this data is now all cleaned and ready to use. I hope you find this article informative and useful. Do share your thoughts about it in the comment box and let me know if you have any queries. ✌️

You can reach me via the following:

Subscribe to my YouTube channel for video contents coming soon here
Connect and reach me on LinkedIn

Scraping 1000's of News Articles using 10 simple steps

Kajal — Tue, 28 Jul 2020 16:29:36 +0000

Web Scraping Series: Using Python and Software
1. Scraping web pages without using Software: Python
2. Scraping web Pages using Software: Octoparse

Table Of Content
1.1 Introduction
1.1.1 Why This article?
1.1.2 WHOM THIS ARTICLE IS USEFUL FOR ?
1.2 Overview
1.2.1 A brief introduction to webpage design and HTML
1.2.2 Web-scraping using BeautifulSoup in PYTHON
Step-1: Installing Packages
Step-2: Importing Libraries
Step-3: Making Simple requests
Step-4: Inspecting the Response Object
Step-5: Delaying request time
Step-6: Extracting Content from HTML
Step-7: Finding elements and attributes
Step-8: Making Dataset
Step-9: Visualising Dataset
Step-10: Making csv file & saving it to your machine
1.3 Suggestion & conclusion
1.3.1 Full Code

INTRODUCTION

WHY THIS ARTICLE?

The aim of this article is to scrape news articles from different websites using Python. Generally, web scraping involves accessing numerous websites and collecting data from them. However, we can limit ourselves to collecting large amounts of information from a single source and using it as a dataset.

Web Scraping is a technique employed to extract large amounts of data from websites whereby the data is extracted and saved to a local file on your computer or to a database in table (spreadsheet) format.

I got motivated to do web scraping while working on my machine-learning project, a Fake News Detection System. Whenever we begin a machine learning project, the first thing that we need is a dataset. While there are many datasets that you can find online with varied information, sometimes you wish to extract data on your own and begin your own investigation. I needed a dataset that I wasn't able to find anywhere in the form I needed.

So this motivated me to make my own dataset for my project. And that's how I did my project from scratch. My project was basically about classifying news articles into two main categories, FAKE and REAL.

FAKE-NEWS DATASET

For this project, the first task was to get a dataset that is already labeled "FAKE". This can be achieved by scraping data from verified and certified fact-checking websites that we can rely on for the facts behind news articles; it is really very difficult to get genuine "FAKE NEWS".
I went through these news websites to get my FAKE-NEWS dataset

But honestly speaking, I ended up scraping data from only one website, Politifact.
And there is a strong reason for that. As you go through the links listed above, you will conclude that we need a dataset that is already labeled with a category, i.e., "FAKE", but we also don't want our news articles to be in a modified form. We want to extract the raw news article, without any keywords indicating whether the given article is "FAKE" or not.
For example, if you go through the link "BoomLive.in", you will find that the news articles marked "FAKE" are not in their original form and have been altered based on the fact-checking team's analysis. Training an ML model on this altered text will give us a biased result every time, and a model built on this kind of dataset will turn out to be a dumb one that can only predict news articles having keywords like "FAKE", "DID?", "IS?" in them and will not perform well on a new test set.
That's why we use Politifact to scrape our "FAKE-NEWS DATASET".
There are challenges in labelling news articles too, but we will cover that in a later section.

REAL-NEWS DATASET

The second task was to create a "REAL-NEWS" dataset. That is easy if you are scraping news articles from trusted or verified news websites like "TOI", "IndiaToday", "TheHindu" and many more. We can trust these websites to list factual data, and even if they don't, we assume they do and train our model accordingly.
But for my project, I scraped both real and fake data from one website only (i.e., Politifact.com), since I got what I needed from it, and also because it is advisable, when scraping data using Python, to use one website at a time. You can, however, scrape multiple pages of that website in one module by just running an outer for loop.

WHOM THIS ARTICLE IS USEFUL FOR?

Whoever is working on a project where you need to scrape data in the thousands, this article is definitely for you 😃. It doesn't matter whether you are from a programming background or not, because there are many times when people from other backgrounds need data for their project, survey, or whatever purpose. But non-programmers find it difficult to understand any programming language, so I will make scraping easy for them too by introducing some software with which they can scrape any kind of data in huge amounts easily.

That said, scraping using Python is not that difficult if you follow along with me while reading this blog 😎; the only thing that you need to focus on is the HTML source code of a webpage. Once you are able to understand how webpages are written in HTML and to identify the elements and attributes of your interest, you can scrape any website 😋. For non-programmers who want to do web scraping using Python: just focus mainly on the HTML code. Python syntax is not that difficult to understand; it's just some libraries, functions, and keywords that you need to remember and understand. I have tried to explain every step transparently, and I hope that by the end of this series you will be able to scrape webpages with different types of layouts.

OVERVIEW

This post covers the first part: News articles web scraping using PYTHON. We’ll create a script that scrapes the latest news articles from different newspapers and stores the text, which will be fed into the model afterward to get a prediction of its category.

A brief introduction to webpage design and HTML:

If we want to be able to extract news articles (or, in fact, any other kind of text) from a website, the first step is to know how a website works.

We will follow an example to understand this:

When we enter a URL into the web browser (e.g. Google Chrome, Firefox, etc.) and access it, what we see is the combination of three technologies:

HTML (HyperText Markup Language): it is the standard language for adding content to a website. It allows us to insert text, images, and other things to our site. In one word, HTML defines the content of every webpage on the internet.

CSS (Cascading Style Sheets): this language allows us to set the visual design of a website. This means it determines the style/presentation of a webpage including colors, layouts, and fonts.

JavaScript: JavaScript is a dynamic computer programming language. It allows us to make the content and the style interactive & provides a dynamic interface between client-side script and user.

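Note that all three are languages, although strictly speaking only JavaScript is a programming language: HTML is a markup language and CSS is a style sheet language. Together they allow us to create and manipulate every aspect of the design of a webpage.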

Let’s illustrate these concepts with an example. When we visit the Politifact page, we see the following:

If we disabled JavaScript, we would not be able to use this pop-up anymore; as you can see, the video pop-up window no longer appears:

And if we deleted the CSS content from the webpage, we would see something like this:

So, at this point, I am going to ask you a question.

"If you want to extract the content of a webpage via web scraping, where do you need to look?"

So, at this point, I hope you are clear about what kind of source code we need to scrape. 😎 Yeah, you are absolutely right if you are thinking about HTML 😉

So, the last step before performing web scraping is to understand a bit of the HTML language.

HTML

HTML, the "hypertext markup language", defines the content of a webpage and consists of elements and attributes. For scraping data, you should be familiar with inspecting those elements.
- An element could be a heading, paragraph, division, anchor tag and so on.
- An attribute is extra information written inside an element's opening tag, such as a class name or a link target (href).
Elements are written with an opening tag <tag> and a closing tag </tag>,
e.g.,

<p>This is paragraph.</p> <h1><b>This is heading one in bold letters</b></h1>
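To make the difference between elements and attributes concrete, here is a small sketch that uses Python's built-in html.parser module (not BeautifulSoup, which comes later) to list each opening tag together with its attributes. The class name TagLister and the sample HTML string are made up for this illustration.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Remember every opening tag (the element) together with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.seen.append((tag, dict(attrs)))


# Hypothetical snippet: a paragraph with a class and a link with an href.
sample_html = '<p class="intro">This is a paragraph.</p><a href="https://www.politifact.com/">Politifact</a>'

parser = TagLister()
parser.feed(sample_html)
for tag, attributes in parser.seen:
    print(tag, attributes)
```

Here p and a are the elements, while class and href are attributes that carry extra information about them.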

Web-scraping using BeautifulSoup in PYTHON

Enough talk, show me the code.

Step-1 : Installing Packages

We will first begin with installing necessary packages:
beautifulsoup4
To install it, type the following command into your Python environment.

! pip install beautifulsoup4

BeautifulSoup, from the bs4 package, is a library used to parse HTML and XML documents in Python in a very easy and convenient way and to access their elements by identifying them with their tags and attributes.

It is a very easy to use yet very powerful package for extracting any kind of data from the internet in just 5-6 lines.

requests
To install it, use the following command in your IDE or command shell.

! pip install requests

To provide BeautifulSoup with the HTML code of any page, we will need the requests module.

urllib
There is nothing to install here: urllib is part of Python's standard library, so it is already available.

urllib is the URL handling module for Python. It is used to fetch URLs (Uniform Resource Locators).

Alongside it, we will also import two more standard-library modules:

time (using which we can call the sleep() function to delay or suspend execution for the given number of seconds).
sys (it is used here to get exception info like the type of error, the error object, and info about the error).

Step-2 : Importing Libraries

Now we will import all the required libraries:
BeautifulSoup
To import it, use the following command in your IDE

from bs4 import BeautifulSoup

This library helps us with getting HTML structure of any page that we want to work with and provides functions to access specific elements and extract relevant info.

urllib
To import it, type following command

import urllib.request,sys,time

urllib.request: It defines functions and classes that help in opening URLs.
sys: Its functions help us retrieve exception info.
time: Python has a module named time which provides several useful functions to handle time-related tasks. One of the popular functions among them is sleep().

requests
To import it, just type import before the library name.

import requests

This module allows us to send the HTTP requests to web-server using python. (HTTP messages consist of requests from client to server and responses from server to client.)

pandas

import pandas as pd

It is a high-level data-manipulation tool that we need to visualize our structured scraped data.

We will use this library to make a DataFrame (the key data structure of this library). DataFrames allow us to store and manipulate tabular data in rows of observations and columns of variables.

import urllib.request,sys,time
from bs4 import BeautifulSoup
import requests
import pandas as pd

Step-3 : Making Simple requests

With the requests module, we can get the HTML content and store it in the page variable.
Make a simple GET request (just fetching a page):

#url of the page that we want to scrape
#str() converts the int page number to a string so that it can be concatenated to the URL for pagination purposes.
#page_number is supplied by the outer pagination loop (for example, 1 for the first page).
url = 'https://www.politifact.com/factchecks/list/?page='+str(page_number)
#Send a GET request for the URL. This is a risky command that might throw an exception.
page = requests.get(url)

Since requests.get(url) is a risky call that might throw an exception, we will call it in a try-except block:

try:
    # this might throw an exception if something goes wrong.
    page = requests.get(url)
# this describes what to do if an exception is thrown
except Exception as e:
    # get the exception information
    error_type, error_obj, error_info = sys.exc_info()
    # print the link that caused the problem
    print('ERROR FOR LINK:', url)
    # print error info and the line that threw the exception
    print(error_type, 'Line:', error_info.tb_lineno)
    # skip to the next page; this only works inside the pagination loop
    continue

We will also use an outer for loop for pagination purposes.
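Putting the pieces together, the outer loop could look roughly like the sketch below. To keep it runnable without touching the network, the function takes a fetch argument: in the real script that would be requests.get, while here a made-up fake_fetch stands in for it and fails on purpose for page 2. The names fetch_pages, fake_fetch and BASE_URL are my own for this illustration.

```python
import sys

BASE_URL = "https://www.politifact.com/factchecks/list/?page="


def fetch_pages(page_numbers, fetch):
    """Fetch each listing page and skip the ones that raise an exception.

    fetch is any callable that takes a URL and returns a response;
    in the real script it would be requests.get.
    """
    responses = {}
    for page_number in page_numbers:
        url = BASE_URL + str(page_number)
        try:
            responses[page_number] = fetch(url)
        except Exception:
            error_type, _error_obj, error_info = sys.exc_info()
            print("ERROR FOR LINK:", url)
            print(error_type, "Line:", error_info.tb_lineno)
            continue  # allowed here because we are inside the for loop
    return responses


def fake_fetch(url):
    """Stand-in for requests.get that fails for page 2 (simulation only)."""
    if url.endswith("=2"):
        raise ConnectionError("simulated failure")
    return "<html>page for " + url + "</html>"


results = fetch_pages(range(1, 4), fake_fetch)
print(sorted(results))  # prints: [1, 3]
```

Page 2 is reported and skipped, and the loop carries on with page 3 instead of crashing.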

Step-4 : Inspecting the Response Object

I. See what response code the server sent back (useful for detecting 4XX or 5XX errors).

  page.status_code

Output:

The HTTP 200 OK success status response code indicates that the request has succeeded.
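If you want to turn a status code into something readable, Python's standard http module already knows the official names. Below is a minimal sketch; the helper describe_status is my own, and in the real script you would pass it page.status_code.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a readable label such as '404 Not Found (client error)'."""
    status = HTTPStatus(status_code)  # raises ValueError for unknown codes
    if status_code >= 500:
        kind = "server error"
    elif status_code >= 400:
        kind = "client error"
    elif status_code >= 300:
        kind = "redirect"
    elif status_code >= 200:
        kind = "success"
    else:
        kind = "informational"
    return f"{status_code} {status.phrase} ({kind})"


for code in (200, 404, 503):
    print(describe_status(code))
```

Anything in the 4XX or 5XX range means the page was not delivered properly, so there is no point in parsing it.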

II. Access the full response as text(get the HTML of the page in a big string)

  page.text

Output:

It will return HTML content of a response object in Unicode.
Alternative:

  page.content

Output:

This, in contrast, returns the content of the response in bytes.
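The difference between the two is easiest to see on a tiny hand-written example. In this sketch the variable names and the HTML snippet are made up; raw_content plays the role of page.content and decoded_text plays the role of page.text, assuming the page is encoded as UTF-8.

```python
# Hypothetical response body, written by hand instead of being fetched.
raw_content = "<p>Fact-check: café</p>".encode("utf-8")  # bytes, like page.content
decoded_text = raw_content.decode("utf-8")               # str, like page.text

print(type(raw_content).__name__, len(raw_content))
print(type(decoded_text).__name__, len(decoded_text))
```

The bytes version is one item longer because the letter é takes two bytes in UTF-8, which is why you normally work with the decoded text when searching or parsing.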

III. Look for a specific substring of text within the response.

     if "Politifact" in page.text:
         print("Yes, Scrape it")

IV. Check the response’s Content Type (see if you got back HTML, JSON, XML, etc)

  print (page.headers.get("content-type", "unknown"))

Output:

Step-5 : Delaying request time

Next, with the time module, we can call the sleep(2) function with a value of 2 seconds. Here it delays sending the next request to the web server by 2 seconds.

time.sleep(2)

The sleep() function suspends execution of the current thread for a given number of seconds.
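
An optional variation, which the code in this article does not use, is to add a small random extra to the fixed delay so that requests are not sent at perfectly regular intervals. The helper name polite_sleep and its parameters are hypothetical.

```python
import random
import time


def polite_sleep(base_seconds=2.0, jitter_seconds=1.0):
    """Sleep for base_seconds plus a random extra of up to jitter_seconds.

    Returns the delay that was actually used, in seconds.
    """
    delay = base_seconds + random.uniform(0, jitter_seconds)
    time.sleep(delay)
    return delay
```

Calling polite_sleep() inside the pagination loop would then wait between 2 and 3 seconds before the next request.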

Step-6 : Extracting Content from HTML

Now that you’ve made your HTTP request and gotten some HTML content, it’s time to parse it so that you can extract the values you’re looking for.

A) Using Regular Expressions
Using regular expressions to parse HTML content is strongly discouraged.

However, regular expressions are still useful for finding specific string patterns like prices, email addresses, or phone numbers.

Run a regular expression on the response text to look for specific string patterns:

import re  # put this at the top of the file
...
print(re.findall(r'\$[0-9,.]+', page.text))
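
To see what this pattern matches without fetching a page, here is a small self-contained example. The sample sentence is made up for illustration.

```python
import re

# made-up sample text standing in for page.text
sample_text = "The plan costs $1,200.50 a year and saves $3 per person"

# a dollar sign followed by one or more digits, commas or dots
prices = re.findall(r'\$[0-9,.]+', sample_text)
print(prices)  # ['$1,200.50', '$3']
```

Note that the pattern also picks up a comma or full stop that directly follows a price, so the matches may need a little cleaning afterwards.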


B) Using BeautifulSoup's soup object
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

soup = BeautifulSoup(page.text, "html.parser")

The command below looks for all <li> tags whose class attribute is 'o-listicle__item'.

links=soup.find_all('li',attrs={'class':'o-listicle__item'})

INSPECTING WEBPAGE
To understand the above code, you need to inspect the webpage, so please follow along:
1) Go to the URL listed above.
2) Press ctrl+shift+I to inspect it.
3) The 'Inspect window' opens. Press ctrl+shift+C to select an element in the page to inspect it, or click the leftmost arrow in the header of the Inspect window.
4) To find the specific element and attribute used above in the Inspect window:
First, hover over every section of the webpage and watch the changes in your Inspect window. You will quickly grasp how webpages work, which element is which, and what each attribute contributes to the webpage.
Once you are done with the above step, I am assuming that you understand how the <li> element and its attribute work.
Since I needed the news section of a particular article, I went to that article section by using the select-element option in the Inspect window. It highlights that article section on the webpage and its HTML source in the Inspect window. Voila!✨

Were you able to locate the same tag on your machine?

If yes, you are all set to understand every HTML tag I have used in my code.

Continuing with my code: 😅

print(len(links))

This command shows how many news articles there are on a given page.
It helps you decide how many pages your loop needs to go through to extract a large amount of data.

Step-7 : Finding elements and attributes

Look for all anchor tags on the page (useful if you’re building a crawler and need to find the next pages to visit)

#stored under a separate name so that 'links' still holds the news articles found above
anchors = soup.find_all("a")

The next command finds a div tag under the <li> tag, where the div must have the given class attribute value. Here 'j' is the loop variable that iterates over 'links', the result list that holds all news articles listed on a given page.

Statement = j.find("div",attrs={'class':'m-statement__quote'})

.text returns the text contained within this tag, and strip() removes any leading and trailing whitespace such as spaces, '\n' and '\t' from that string.

   Statement = j.find("div",attrs={'class':'m-statement__quote'}).text.strip()

Voila! 🌟 We have scraped the first attribute, i.e., the Statement of our dataset 😋

In the same div section, the next command looks for the anchor tag and returns the value of its hyperlink (href). Again, strip() is used to tidy the value so that our CSV file looks good.

Link=j.find("div",attrs={'class':'m-statement__quote'}).find('a')['href'].strip()

To get the Date attribute, you need to inspect the webpage first, because the date is part of a longer string. If you call .text without any indexing, you get the whole string and not just the date. We don't need any text other than the date, so I use indexing (slicing), although you could also clean the attribute later using a regular expression. 'footer' is the element that contains the required text.

Date = j.find('div',attrs={'class':'m-statement__body'}).find('footer').text[-14:-1].strip()
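
A fixed slice assumes that the date always has the same length, so a longer month name would be cut off. As an alternative, here is a minimal sketch of the regex clean-up mentioned above. The helper name extract_date and the sample footer text are made up, and the pattern assumes that the date is written like 'July 28, 2020'.

```python
import re


def extract_date(footer_text):
    """Return the first 'Month D, YYYY' date found in footer_text, or None."""
    match = re.search(r'[A-Z][a-z]+ \d{1,2}, \d{4}', footer_text)
    return match.group(0) if match else None


# made-up footer text for illustration
print(extract_date("By Some Author • July 28, 2020\n"))  # July 28, 2020
```

In the scraping loop, the footer text would be passed in as extract_date(j.find('footer').text).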

Here too, I have done everything the same as before, except for get(), which extracts the value of the attribute passed to it (i.e., title).

Source = j.find('div', attrs={'class':'m-statement__author'}).find('a').get('title').strip()

For my project, I needed a dataset that had not already been altered, and I also needed to know in advance which category each of thousands of articles belongs to for my training data. No one can do that manually. On this website, I did find articles that already have labels attached, but the label text cannot be retrieved directly because it is contained in an image. For this kind of task, you can use get() to retrieve the text effectively. Here, I am passing 'alt' as the attribute to get(), which contains our Label text.

Label = j.find('div', attrs ={'class':'m-statement__content'}).find('img',attrs={'class':'c-image__original'}).get('alt').strip()

In the lines of code below, I have put all the concepts together and fetched the details for the five attributes of my dataset.

#empty list that collects the rows of the current page
frame = []
for j in links:
    Statement = j.find("div",attrs={'class':'m-statement__quote'}).text.strip()
    Link = j.find("div",attrs={'class':'m-statement__quote'}).find('a')['href'].strip()
    Date = j.find('div',attrs={'class':'m-statement__body'}).find('footer').text[-14:-1].strip()
    Source = j.find('div', attrs={'class':'m-statement__author'}).find('a').get('title').strip()
    Label = j.find('div', attrs ={'class':'m-statement__content'}).find('img',attrs={'class':'c-image__original'}).get('alt').strip()
    frame.append([Statement,Link,Date,Source,Label])
upperframe.extend(frame)

Step-8 : Making Dataset

For each article, append the attribute values as one row to the list 'frame', which starts out empty.

frame.append([Statement,Link,Date,Source,Label])

Then, for each page, extend the list 'upperframe', which also starts out empty, with the rows in 'frame'.

upperframe.extend(frame)
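
The difference between append() and extend() is easy to see on plain lists. The sketch below uses two made-up "pages" of made-up rows in place of scraped data.

```python
upperframe = []  # rows from every page

# two made-up pages, each holding made-up article rows
pages = [
    [["statement 1", "link 1"], ["statement 2", "link 2"]],
    [["statement 3", "link 3"]],
]

for articles in pages:
    frame = []  # start with an empty list for every page
    for row in articles:
        frame.append(row)      # append adds one row as a single item
    upperframe.extend(frame)   # extend adds each row of the page individually

print(len(upperframe))  # 3
```

If 'frame' were not reset for every page, the rows of earlier pages would be added to 'upperframe' again on each later page.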

Step-9 : Visualising Dataset

If you want to visualise your data in Jupyter, you can use a pandas DataFrame to do so.

data=pd.DataFrame(upperframe, columns=['Statement','Link','Date','Source','Label'])
data.head()

Step-10 : Making csv file & saving it to your machine

A) Opening & writing to file
The commands below write a CSV file and save it to your machine. Because the filename has no path, the file is created in the current working directory, which is the directory of your Python file if you run it from there.

    filename="NEWS.csv"
    f=open(filename,"w")
    headers="Statement,Link,Date,Source,Label\n"
    f.write(headers)
    ....
        f.write(Statement.replace(",","^")+","+Link+","+Date.replace(",","^")+","+Source.replace(",","^")+","+Label.replace(",","^")+"\n")

This line writes each attribute to the file, replacing any ',' with '^'.

f.write(Statement.replace(",","^")+","+Link+","+Date.replace(",","^")+","+Source.replace(",","^")+","+Label.replace(",","^")+"\n")

So, when you run this file from the command shell, it creates a CSV file in your .py file's directory.
On opening it, you might see messy data if you didn't use strip() while scraping, so do try it once without strip() to see the difference. The data will also look odd until you replace '^' with ','.
Replace it using these simple steps:

1) Open your .csv file in Excel.
2) Press ctrl+H (a pop-up window will ask for 'Find what' and 'Replace with').
3) Enter '^' in the 'Find what' field and ',' in the 'Replace with' field.
4) Press Replace All.
5) Click Close & Wohoo!😍 Your dataset is now in perfect form.

Don't forget to close your file with the following command once both for loops are done:

f.close()

Also, running the same code again and again might throw an error if the dataset file has already been created using the file-writing method.
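
As a variation on the file-writing method, and not the method used in this article, Python's built-in csv module puts quotes around any value that contains a comma, so the '^' replacement and the Excel step are not needed. The sketch below uses a made-up row and writes to an in-memory buffer instead of a real file.

```python
import csv
import io

# one made-up row; note the comma inside the statement
rows = [
    ["Says taxes went up, then down", "/factchecks/example/",
     "July 28, 2020", "Some Speaker", "false"],
]

# the buffer stands in for open("NEWS.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Statement", "Link", "Date", "Source", "Label"])
writer.writerows(rows)

print(buffer.getvalue())
```

With a real file, opening it in a with block closes it automatically, so the separate f.close() call is not needed.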

B) Converting dataframe into csv file using to_csv()
Instead of this lengthy method, you can opt for another one: to_csv() converts a dataframe into a CSV file, and it also lets you specify the path where the file is saved.

path = 'C:\\Users\\Kajal\\Desktop\\KAJAL\\Project\\Datasets\\'
data.to_csv(path+'NEWS.csv')

To avoid the ambiguity and allow portability of your code you can use this:

import os
data.to_csv(os.path.join(path,r'NEWS.csv'))

This will append your CSV file name to your destination path correctly.
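
To see what os.path.join() does on its own, here is a tiny example. The folder names are hypothetical and only serve as an illustration.

```python
import os

# hypothetical folder names, joined with the separator of the current system
folder = os.path.join("Project", "Datasets")
csv_path = os.path.join(folder, "NEWS.csv")
print(csv_path)
```

Because the separator is chosen for you, the same line works on Windows, Linux and macOS without writing '\\' or '/' by hand.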

SUGGESTION & CONCLUSION

I suggest using the first method: open the file, write to it, and then close it. I know it is a bit lengthy and tedious to implement, but at least it will not give you ambiguous data, as the to_csv method mostly does.

See in the above image how it produces ambiguous data for the Statement attribute.
So, instead of spending hours cleaning your data manually, I would suggest writing the few extra lines of code specified in the first method.
Now, you are done with it.✌️

IMPORTANT NOTE: If you copy-paste my source code to scrape a different website and run it, it will most likely throw an error. In fact, it will definitely throw an error, because each webpage's layout is different, and you need to make changes accordingly.

I hope you will find it useful and liked my article.😇 Please feel free to share your thoughts and hit me up with any queries you might have.😉

Full Code

The Dataset:

This article is the first part of the web-scraping series. Those who come from non-technical backgrounds can read the second part of this series here.

You can reach me via the following:

Subscribe to my YouTube channel for video content coming soon here
Connect and reach me on LinkedIn