Web Scraping Series: Using Python and Software
1. Scraping web pages without using software: Python
2. Scraping web pages using software: Octoparse
INTRODUCTION
WHY THIS ARTICLE?
This article is the second in my web-scraping series.
As I mentioned in my first article, I chose to write about scraping because, while building my project Fake-News Detection System, it took me days of research since I wasn't able to find a dataset that fit my needs.
So, if you haven't gone through my first article, I strongly recommend reading it once; if you have a programming background, then you should definitely read the first article of this series.
WHO IS THIS ARTICLE USEFUL FOR?
For readers with a programming background, especially those who know Python, I have already written a blog post, and I would suggest scraping with Python instead of any software, because I find it easier than spending days learning the interface of a particular tool.
But if you don't have any programming background, you can follow along with me and get familiar with the interface and workflow of this software.
OVERVIEW
This article covers the second part of the series, Scraping web-pages using software : Octoparse.
There are many tools you can easily find on the internet for automating this purpose, such as
ParseHub, ScrapeSimple, Diffbot, and Mozenda.
A brief introduction to these automation tools:
1.ParseHub:
Website: https://www.parsehub.com/
Purpose: ParseHub is a phenomenal tool for building web scrapers without coding and extracting large amounts of data. It is used by data scientists, data journalists, data analysts, e-commerce websites, job boards, marketing & sales, finance & many more.
Features: Its interface is dead simple to use; you can build web scrapers simply by clicking on the data you want. It then exports the data in JSON or Excel format. It has many handy features such as automatic IP rotation, scraping behind login walls, going through dropdowns and tabs, getting data from tables and maps, and much more. In addition, it has a generous free tier, allowing users to scrape up to 200 pages of data in just 40 minutes! ParseHub also provides desktop clients for Windows, Mac OS, and Linux, so you can use it no matter what system you're running.
2.ScrapeSimple:
Website: https://www.scrapesimple.com
Purpose: ScrapeSimple is the perfect service for people who want a custom scraper built for them. Web scraping is made as simple as filling out a form with instructions for what kind of data you want.
Features: ScrapeSimple lives up to its name with a fully managed service that builds and maintains custom web scrapers for customers. Just tell them what information you need from which sites, and they will design a custom web scraper to deliver it to you periodically (daily, weekly, monthly, or on whatever schedule you choose) in CSV format, directly to your inbox. This service is perfect for businesses that just want an HTML scraper without needing to write any code themselves. Response times are quick and the service is incredibly friendly and helpful, making it perfect for people who want the full data extraction process taken care of for them.
3.Diffbot:
Website: https://www.diffbot.com
Purpose: Enterprises who have specific data crawling and screen scraping needs, particularly those who scrape websites that often change their HTML structure.
Features: Diffbot is different from most page scraping tools out there in that it uses computer vision (instead of HTML parsing) to identify relevant information on a page. This means that even if the HTML structure of a page changes, your web scrapers will not break as long as the page looks the same visually. This is an incredible feature for long-running, mission-critical web scraping jobs. While it may be a bit pricey (the cheapest plan is $299/month), it does a great job offering a premium service that may make it worth it for large customers.
4.Mozenda:
Website: https://www.mozenda.com/
Purpose: Enterprises looking for a cloud-based, self-serve web-page scraping platform need look no further. With over 7 billion pages scraped, Mozenda has experience in serving enterprise customers from all around the world.
Features: Mozenda allows enterprise customers to run web scrapers on its robust cloud platform. It sets itself apart with its customer service (providing both phone and email support to all paying customers). The platform is highly scalable and allows for on-premises hosting as well. Like Diffbot, Mozenda is a bit pricey, with its lowest plans starting at $250/month.
- However, in this article I am going to talk about Octoparse in detail, since that is the one I have used.
OCTOPARSE
Website: https://www.octoparse.com/
Purpose: Octoparse is a fantastic tool for people who want to extract data from websites without having to code, while still retaining control over the full process through its easy-to-use interface.
Features: Octoparse is the perfect tool for people who want to scrape websites without learning to code. It features a point-and-click screen scraper, allowing users to scrape behind login forms, fill in forms, enter search terms, scroll through infinite scroll, render JavaScript, and more. It also includes a site parser and a hosted solution for users who want to run their scrapers in the cloud. Best of all, it comes with a generous free tier allowing users to build up to 10 crawlers for free. For enterprise-level customers, they also offer fully customized crawlers and managed solutions where they take care of running everything and deliver the data to you directly.
Step-by-step explanation of extracting data from thousands of news articles
Step-1: Download Octoparse
- Go to the website https://www.octoparse.com/download and follow the community guidelines to install it.
Step-2: Sign-up
- After downloading and installing, sign up for an account if you haven't created one before.
Step-3: Explore it
- Before starting on your own, I strongly recommend exploring the different sections of the tool; this will ultimately help you interact with the interface when you work on it later.
- Go through the popular templates section; it contains templates for popular websites, and you might find the data you need there.
- Go through the tutorials for both Template Mode and Advanced Mode.
Step-4: Enter URL
If you want to scrape data from just one website, you can simply paste the copied URL on the home page and click Start.
But if you want to scrape data from more than one website, go to the New tab and then click the Advanced option.
You will see a new window like this, in which you can easily organize your work with more advanced options and keep track of your directories too.
- You can add up to 10,000 different URLs here, but the condition is that these URLs must share the same page layout for the data to be extracted together; otherwise the process will still be automated, but you will get separate results for each layout and the data will not be merged. A quick way to build such a URL list is sketched below.
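For readers comfortable with a little Python (from part one of this series), here is a minimal sketch of generating such a batch of same-layout URLs; the URL pattern and file name are purely hypothetical, so substitute the real listing-page pattern of the site you are scraping.

```python
# Generate a batch of same-layout listing-page URLs to paste into Octoparse.
# The URL pattern below is hypothetical; replace it with the real pattern
# of the site you want to scrape.
base = "https://example.com/news/list/?page={}"
urls = [base.format(page) for page in range(1, 101)]  # pages 1 to 100

# One URL per line, ready to paste into the batch URL box or import as a file.
with open("urls.txt", "w") as f:
    f.write("\n".join(urls))
```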
Step-5: Specifying scraping details & attributes
Click save and you will see a window like this:
The left section is for maintaining the workflow, the center displays the web page of the first URL you entered, and the section below shows the data preview.
- From here, you have two options to move forward: auto-detecting the web-page data, or editing the task workflow manually.
Auto-detect web-page data
Choosing "Auto-detect web page data" will scrape the important fields according to Octoparse's own understanding and return five different result sets. You can then skip whatever is of no use to you, or keep a set if it scraped all the attributes you wanted.
In the dialog box below, you can make edits as per your need.
In the above image, you are provided with three options that you can enable or disable as needed.
First: "Scroll down the page to load more data". If a website is not divided into separate pages, all of the data might sit on one continuously loading page; to extract everything, all you need to do is enable this option.
Second: Click the next button to capture multiple pages
Enabling this will paginate through pages using the element you select as the "next" button. It also lets you check or edit that button: on clicking Check, you will see the automatically detected next button highlighted in the web-page section.
- If it is not detected correctly, click Edit and then click whatever element on the web-page screen you want to use as the "next" button. For instance, there may be no "next" button and a ">" symbol may play that role instead; or, if you don't want to scrape all the way to the last page, you can choose to paginate only up to a specific page such as "1", "2", "3", and so on.
Third: Click the state_url to capture data on the page that follows
- This allows you to capture the content or text of the page that follows and creates another attribute containing that text (i.e. the content of the page that opens when you click a particular URL).
There is one more option in the Tips panel, "Switch auto-detect results (1/5)"; clicking this link lets you see the five different sets of auto-detected data, and you can keep whichever fits your need.
- Once you are done with the editing, click "Save settings".
- You can see the scraped results in the "Data Preview" section and edit the attribute names too.
- You will see the changes in your workflow as follows:
Edit task workflow manually
- Alternatively, you can choose to edit the workflow manually as per your need and select specific elements from the web page to appear in your dataset as attributes.
- There, when you hover over the down arrows, you will find a "+" sign for adding elements as needed.
- This helps you stay specific and organized with your work. In the earlier auto-detection case, too many irrelevant attributes were scraped automatically, so if you want to be precise about what you need, I would suggest opting for this second option.
It also allows you to rename, delete, or edit any specific element, or change its settings as needed.
I will demonstrate with an example of extracting thousands of articles with six attributes: News Headline, Link, Source, Stated On, Date, and Image_url.
Extract data manually
- To extract all of them, go to the web-page section and select the specific details of the very first article, such as "News Headline", "Link of news", "Source of news", "Stated On", and "Date", by simply clicking on these items in that first article; each selected portion will be highlighted as shown in the window below:
Data extraction for all news articles on the first page of the listed URL
- Then, choosing the "Select all" option will select the same details for every article down to the bottom of the web page. You will see 30 captured rows in the data preview, as shown below:
Now, clicking the "Extract data" option will extract all the details of every article listed on the first page of the entered URL.
- You can now see changes in the workflow:
- A dialog box will open asking whether you want to extract more elements:
- Since we haven't scraped Image_url yet, we will select it separately, following the same procedure as above:
- Select the image in the web-page section:
- A dialog box with different options will appear > choose "Select All".
- Another pop-up window will appear asking you to choose among different options > select "Extract image URLs".
And you are all done with scraping the image URLs for all the news articles on the first page; a new attribute will be added to your data preview.
- This is how your data preview will look after editing the attribute names:
- Again, a pop-up window will appear asking whether to extract more elements; since we want to extract data from more than one page, we will set up pagination.
Pagination
- Now, if you need a large amount of data, you can loop through pages up to a specific page or to the last page of the listed URL.
- To set up pagination, all you need to do is find the element that indicates the next page on this particular website, such as "next", ">", or anything similar > click on that element (in my case, it is the "next" button itself) > it will be highlighted and a new window will pop up.
- Select "Loop click next page".
- Once you are done, your workflow will look something like the one produced by the auto-detect option.
- When you are done with all the editing and have organized the data, click "Save & Run".
Step-6: Exporting data to your machine
- Clicking the "Save & Run" option will open a new window listing three options for how to run the task.
Only premium users can access the last two options, which provide features such as scraping a website on a daily, weekly, or monthly basis, and many more. Octoparse's servers will take care of your data and send it to you, organized according to whatever schedule you choose.
If you are a free user, select the first option, "Run task on your device".
It will start extracting all the data. Although the process is automated, you still need to pay a little attention while it runs, because if the data exceeds the 10k-row limit, it will stop and you will have to sit through another hour of extraction from scratch.
You also need to keep your system awake, because if your screen sleeps while data is being extracted, the extraction may stop at that point and you will have to start it again to extract more, or as much as you can.
- Choose "export data":
- Choose a format to save your file:
- I chose to export my file as a .csv and save it to my desktop. Don't close this window; we will use it to export the data in .xlsx format again.
- Now, let's have a look at the data we have extracted.
Uh-oh! It's in a garbled form that is not readable or organized at all.
Let's go to the export window again: Export data > choose the .xlsx format this time > click OK.
- Now, let's have a look at the exported data in .xlsx format.
Voila! Now it makes sense; all the ambiguity is gone from our data.
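As an aside, if you would rather keep the .csv export, the unreadable output is often just an encoding issue when Excel opens the file directly. Assuming you have Python available (as in part one of this series), a small sketch like the one below usually recovers it; the file name here is hypothetical.

```python
import pandas as pd

# Hypothetical file name; point it at wherever Octoparse saved your export.
# utf-8-sig handles the byte-order mark that often confuses Excel and is
# one common reason a CSV looks garbled when opened directly.
df = pd.read_csv("octoparse_export.csv", encoding="utf-8-sig")
print(df.head())  # quick sanity check of the first few rows

# Re-save as .xlsx so it opens cleanly in Excel (needs the openpyxl package).
df.to_excel("octoparse_export.xlsx", index=False)
```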
Step-7: Formatting Excel file using formulas
Inspecting Dataset
- The first thing to do is inspect your dataset. On inspecting mine, I found some irrelevant things that I wasn't able to edit at the time of scraping.
So we will do some formatting on the Excel file.
I. Look at my "Image url" attribute. I extracted the image URL in order to get the label from it, since the label value is written inside the image URL and I didn't find any better way to extract it.
- If you look closely at the "Image url" attribute, there is a small string ".jpg" and a longer string "https://static.politifact.com/img/meter-" that are common to all rows of the column.
So, we will replace both strings with "" (i.e. nothing) to get the label values.
- Press Ctrl+H > fill the "Find what" field with ".jpg" and leave the "Replace with" field empty (you don't have to type anything there) > click "Replace All" > press "OK" > repeat the same for "https://static.politifact.com/img/meter-", and you are done with your labels.
This is how your attribute will look after formatting:
- As you can see, there are two more problems with it: first, the value in the first cell is a hyperlink; second, there are extra spaces.
To remove a hyperlink from a particular cell > right-click on it > select "Remove Hyperlink" from the drop-down. To remove them from an entire column > select the whole column > right-click on it > select "Remove Hyperlinks" from the drop-down.
To remove extra spaces from an attribute > go to any empty cell > type the formula =TRIM(address of the first cell of the attribute) > press Enter > you will see the formatted value of the first cell > to apply the change to all cells, drag the fill handle from that first cell down to the last cell of the attribute > you will see all the values in the new format > now replace the old column with the new one: select the new column entirely > copy it > select the old column where you want to paste it > go to Paste Options > select "Paste Values (V)" from the drop-down.
Fantastic! You are all done with the "Label" column.
Have a look now:
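For readers who followed part one and prefer doing this cleanup in code, here is a rough pandas equivalent of the Find & Replace and TRIM steps above; the file name and column names are assumptions based on the attributes described in this article.

```python
import pandas as pd

# Hypothetical file name; column names assumed to match the attributes above.
df = pd.read_excel("octoparse_export.xlsx")

# Strip the common prefix and suffix from the image URL to recover the label,
# then trim stray whitespace - the same effect as Find & Replace plus TRIM.
df["Label"] = (
    df["Image url"]
    .str.replace("https://static.politifact.com/img/meter-", "", regex=False)
    .str.replace(".jpg", "", regex=False)
    .str.strip()
)
print(df["Label"].head())
```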
II. Look at my other attribute, "Stated On", in which the only data I care about is the date.
To delete the rest of the text, we will work in two steps:
- As you have already seen, when the same substring repeats throughout a column, we can use the earlier pattern of finding it and replacing it with nothing. So we will replace the substring "stated on" with nothing.
This is how our column will look:
- Looking at the image above, you will notice that the remaining substring differs from row to row, so what do we do with that? Since we only want the date, this time we will extract it:
Let's see: in a new empty cell > type the formula =MID(address of the first cell of the "Stated On" attribute, starting position of the text you want to extract, number of characters you want to extract) > press Enter > you will see the first formatted value > then repeat the steps above to fill in all the values and replace the old column with the new one.
Have a look at the new "Stated On" attribute:
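If you would rather do this step in code instead of with MID, here is a small pandas sketch that pulls the date out of the cleaned "Stated On" text with a regular expression; the file and column names are assumptions, and the pattern assumes the month is spelled out in full as it is on these pages.

```python
import pandas as pd

# Hypothetical file and column names, matching the attributes described above.
df = pd.read_excel("octoparse_export.xlsx")

# Extract a date such as "June 15, 2020" from the remaining "Stated On" text.
date_pattern = r"([A-Z][a-z]+ \d{1,2}, \d{4})"
df["Stated On"] = df["Stated On"].str.extract(date_pattern, expand=False)
print(df["Stated On"].head())
```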
III. In "Date" attribute, we don't want text other than date, also we can't use "mid" formula here, as the date is specified at suffix of a string & there is no clarity about starting point, as it is changing for all diff cells.
So, we will going to achieve this task by using "RIGHT":
Go to new empty cell > type command "RIGHT(E2,LEN(E20-FIND("•",E2))" > press enter > and do steps same as above to replace new column inserted values with that of old values.
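To see what this formula is doing, take a made-up cell value such as "By Some Reporter • October 5, 2020": FIND("•", E2) returns the position of the bullet, LEN(E2) minus that position is the number of characters after it, and RIGHT keeps exactly those characters, giving " October 5, 2020" (you can wrap the whole formula in TRIM to drop the leading space). The same step in pandas, under the same hypothetical file and column names as before, would look like this:

```python
import pandas as pd

# Hypothetical file and column names; the "•" separator is assumed to appear
# in every cell, exactly as the Excel formula above assumes.
df = pd.read_excel("octoparse_export.xlsx")

# Keep everything after the bullet and strip the leading space - the pandas
# equivalent of RIGHT(E2, LEN(E2) - FIND("•", E2)) wrapped in TRIM.
df["Date"] = df["Date"].str.split("•").str[-1].str.strip()
print(df["Date"].head())
```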
Our Final Dataset after all formatting:
So, this data is now all cleaned and ready to use. I hope you find this article informative and useful. Do share your thoughts in the comment box and let me know if you have any queries. ✌️
You can reach me via the following:
Subscribe to my YouTube channel for video contents coming soon here
Connect and reach me on LinkedIn