DEV Community

FRANCIS ODERO


Web Scraping With Python.

Suppose you want data about a company's products, say the prices of all its commodities in a comma-separated values (CSV) file, or photos from a social media site. What would you do?
You could copy the information from the site and paste it into your own file. But what if you need a huge amount of information as quickly as possible, such as a large dataset from a website to train a machine learning algorithm?
In that case, copy and paste will not work, and you will need web scraping. Web scraping uses automated methods to collect thousands or even millions of data points in far less time.

What is Web Scraping?

Web scraping is a means of extracting large volumes of data from websites in an automated manner. Most of this data is unstructured HTML that is converted into structured data in a spreadsheet or database before being used in various applications.
Web scraping can be done in several ways: using online services, using specific APIs, or writing your own scraping code from scratch. Many large websites, such as Google, Twitter, Facebook, and StackOverflow, provide APIs that let you access their data in a structured format.
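To make the idea of turning unstructured HTML into structured data concrete, here is a minimal sketch. It uses Python's built-in html.parser and csv modules rather than the libraries installed later in this article, and the table contents, class name, and function name are made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            self._in_cell = False
            if self._row:
                self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Return the table cells found in `html` as CSV text."""
    parser = TableCellParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Made-up sample page fragment, not taken from a real site.
SAMPLE_HTML = """
<table>
  <tr><td>Sugar</td><td>1.20</td></tr>
  <tr><td>Rice</td><td>2.50</td></tr>
</table>
"""

print(html_table_to_csv(SAMPLE_HTML))
```

Each table row becomes one CSV line, which is the kind of structured output a spreadsheet or database can load directly.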

Applications of web scraping

  1. Market research
  2. Price monitoring
  3. News monitoring
  4. Email marketing
  5. Sentiment Analysis

Prerequisites

  • Python

Why Python? 🤔 It is one of the most popular languages for web scraping because it handles most of the process easily. It also has a variety of libraries created specifically for web scraping, such as Scrapy and Beautiful Soup.

So let's start 😀😀😀💪💪

1. Installing Python.

Install Python 3 and the virtual environment module, then create a virtual environment.

First install Python 3 by running the following command in a terminal (on Debian or Ubuntu):

$ sudo apt install python3

Then install the virtual environment module:

$ sudo apt install python3-venv

After installing both, create a project folder, create a virtual environment inside it, and activate it.

  • Create project folder:

mkdir web_scrap

Move into the web_scrap directory:

cd web_scrap

  • Create the virtual environment (using the venv module installed above):

python3 -m venv env

  • Activate the virtual environment:

. env/bin/activate

These are the basic steps to set up our coding environment; check out this for more.
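If you are unsure whether the environment is active, a quick check from Python can help. This is a small sketch; the function name in_virtualenv is made up for this example.

```python
import sys


def in_virtualenv():
    """Return True when Python is running inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the system installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Virtual environment active:", in_virtualenv())
    print("Interpreter:", sys.executable)
```

When the environment is active, the interpreter path printed should sit inside the env folder created above.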

2. Create a Python file.

Create a Python file named scrap.py and open it in Visual Studio or your favorite text editor.

3. Import packages.

Install the required packages inside the virtual environment:

pip install requests

pip install beautifulsoup4

pip install termcolor

The Python modules that we will be using:

  1. re - regular expressions.
  2. requests - to fetch data directly from Instagram.
  3. BeautifulSoup - to pick out a specific part of the fetched data.
  4. urllib - to download a file from a URL.
  5. os - to work with the folder that stores our downloaded files.
  6. termcolor - to color the text printed in the terminal.
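The import lines themselves are not shown in the text of this article, so the following is only a sketch of what the top of scrap.py might look like. It assumes that DFU (used in the download step) is an alias for urllib.request and that colored comes from termcolor. The missing_packages helper is made up for this example and simply reports packages that still need installing.

```python
import importlib.util
import os
import re
import urllib.request as DFU  # alias assumed from the download step later on

# Third-party packages from the pip commands above, listed by import name.
THIRD_PARTY = ("requests", "bs4", "termcolor")


def missing_packages(names=THIRD_PARTY):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    import requests
    from bs4 import BeautifulSoup
    from termcolor import colored
```

If the script prints a list of names, install them with pip inside the activated virtual environment and run it again.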


4. Get website link.

Let's add a simple prompt that reads a URL from the user:

url = input("enter here your url from instagram")

Paste any post URL from Instagram, then fetch the page at that URL using requests.

data = requests.get(url)

You can print the data and check the results.

print(data)

Printing the response object shows its status code, for example <Response [200]> when the request succeeds.
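Printing the response only shows the status, not the page itself. As a hedged sketch, the helper below fetches a page and returns its HTML text; the name fetch_page and its get parameter are made up for this example, and the parameter exists only so the function can be tried without a network connection.

```python
def fetch_page(url, get=None, timeout=10):
    """Return the HTML text of `url`, raising if the server reports an error."""
    if get is None:
        import requests  # imported here so the helper can be tested without it

        get = requests.get
    response = get(url, timeout=timeout)
    response.raise_for_status()  # turns 4xx/5xx responses into exceptions
    return response.text
```

With this helper, the page text that the later steps call str could be obtained with str = fetch_page(url).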

Now let's take the case of a video.

https://www.instagram.com/p/B_wH2aCnyEh/?utm_medium=copy_link

This is the page with the video. Open its source code (in most browsers, right-click the page and choose View Page Source) and search for mp4 with Ctrl + F. You will find something like the text below; the link that contains mp4 is the main thing we need:

"https://instagram.fnbo9-1.fna.fbcdn.net/v/t50.2886-16/95332972_323221645317471_817729865566514230_n.mp4?efg=eyJ2ZW5jb2RlX3RhZyI6InZ0c192b2RfdXJsZ2VuLjQ4MC5mZWVkLmRlZmF1bHQiLCJxZV9ncm91cHMiOiJbXCJpZ193ZWJfZGVsaXZlcnlfdnRzX290ZlwiXSJ9\u0026_nc_ht=instagram.fnbo9-1.fna.fbcdn.net\u0026_nc_cat=103\u0026_nc_ohc=Q1fkDGBA2oEAX9xsGin\u0026edm=AABBvjUBAAAA\u0026vs=18035297806253182_2714272676\u0026_nc_vs=HBksFQAYJEdHeXFyZ1ZmVlZybjl5VUJBRGJzVWUtNktGa0xia1lMQUFBRhUAAsgBABUAGCRHSFlhdkFWNG9oRUFsSEFHQVAwaFlDdDdtOVl0YmtZTEFBQUYVAgLIAQAoABgAGwGIB3VzZV9vaWwBMBUAACb8yIvTv8CJQBUCKAJDMywXQCbul41P3zsYEmRhc2hfYmFzZWxpbmVfMV92MREAdeoHAA%3D%3D\u0026ccb=7-4\u0026oe=621DCC10\u0026oh=00_AT_7jbU74b8Fm9-U5y6GQhURJihmzKNI_AEvVNjI4e-Blw\u0026_nc_sid=83d603"

Because of Instagram's terms of use, use the link below for the video instead:

https://www.w3schools.com/html/movie.mp4

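match = re.findall(r'url\W\W\W([-\W\w]+)\W\W\Wvideo_view_count', str)

Here str holds the page source as text (for example, data.text). The regular expression captures whatever sits between url and video_view_count in that text, which is the video link, each time we run the code.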
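To see what this pattern captures, here is a small sketch that runs it on a made-up fragment of page source. The sample text and the example.com link are invented for illustration and are not real Instagram data.

```python
import re

# Made-up stand-in for a fragment of the page source.
SAMPLE_SOURCE = '"video_url":"https://example.com/clip.mp4","video_view_count":10'

# Three \W match the quote, colon, quote after "url"; the group captures the
# link; three more \W match the quote, comma, quote before video_view_count.
match = re.findall(r"url\W\W\W([-\W\w]+)\W\W\Wvideo_view_count", SAMPLE_SOURCE)
print(match)
```

re.findall returns a list of captured groups, so match here is a list with a single link in it.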

To save the video, we declare a variable named extraction that stores the file extension for the video, as shown below.

extraction = ".mp4"

The same approach works for an image, using the profile_pic_url field:

"https://instagram.fnbo9-1.fna.fbcdn.net/v/t51.2885-19/274607143_1204294113308064_418123174948225933_n.jpg?stp=dst-jpg_s150x150\u0026_nc_ht=instagram.fnbo9-1.fna.fbcdn.net\u0026_nc_cat=100\u0026_nc_ohc=L3oR46dvCW0AX-fS68k\u0026edm=AABBvjUBAAAA\u0026ccb=7-4\u0026oh=00_AT_7whkb_tXXNikAlnrI8yBifCb9zDwZK0Zt5q462q93Vw\u0026oe=6222855B\u0026_nc_sid=83d603"

As with the video, this link appears in the page source; search for profile_pic_url to find it.

For the image, use this link instead:

https://www.w3schools.com/html/pic_trulli.jpg

match = re.findall(r'profile_pic_url\W\W\W([\W\w]+)\W\W\Wdisplay_resources', str)

Now the value of our extraction variable is:

extraction = ".jpg"

The last line of this step takes the first item of the list returned by re.findall, so that the video or image URL is stored as a plain string:

res = match[0]
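Note that match[0] raises an IndexError when nothing was found, and that links in the page source escape "&" as \u0026 (as in the long links shown earlier). A hedged sketch of a safer version follows; the helper name first_link is made up for this example.

```python
def first_link(matches):
    """Return the first captured link as a clean string, or "" if there is none."""
    if not matches:
        return ""
    # Links in the page source escape "&" as \u0026; undo that before downloading.
    return matches[0].replace("\\u0026", "&")
```

Writing res = first_link(match) keeps res as an empty string when nothing matches, which fits the res != "" check used in the download step.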

5. Data extraction.

Here we download the file and get the caption of the post.

We will use BeautifulSoup to get the caption or title of the post. We pass all the page data (str) through BS4 and filter it.

page = BeautifulSoup(str, "html.parser")
title = page.find("title")
title = title.get_text()

This code finds the title of the page and stores it in the title variable.
Next, we use a regular expression to turn the title into a safe file name and place the file in a download folder.

title = re.sub(r"\W+", "_", title)
title = "download/web_scrap"+title+"web_scrap"
print("\n"+title)

We use download/ because we want to store our downloaded file in a new folder called download/.
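As a quick illustration of what the substitution does, the sketch below applies the same steps to a made-up title. The function name build_file_name and the sample title are invented for this example.

```python
import re


def build_file_name(title, folder="download"):
    """Replace every run of non-word characters with "_" and add the folder prefix."""
    safe_title = re.sub(r"\W+", "_", title)
    return folder + "/web_scrap" + safe_title + "web_scrap"


print(build_file_name("My post: hello, world!"))
# prints download/web_scrapMy_post_hello_world_web_scrap
```

Spaces and punctuation collapse into single underscores, so the result is safe to use as a file name.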

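# Assumes these imports at the top of scrap.py:
#   import os
#   import urllib.request as DFU
#   from termcolor import colored
if res != "":
    # Show the link that was found in bold green
    print('found \n \n' + '\033[1m' + colored(res, 'green') + '\033[0m' + '\n')
    download = input("Do you want to download(y/N) : ")
    if (download == "y" or download == "Y"):
        try:
            fileName = title
            os.makedirs("download", exist_ok=True)  # urlretrieve needs the folder to exist
            print("Downloading.....")
            DFU.urlretrieve(res, fileName + extraction)
            print("Downloaded successfully!")
            os.system("tree download")  # list the folder (needs the tree command)
        except Exception:
            print("Sorry! Download Unsuccessful")
else:
    print('did not find or post is from private account')
    exit()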

If the res variable is not empty, the script prints the actual link of the post. It then asks whether you want to download the file; answer with y or n. If the answer is Y or y, the download continues:

if (download == "y" or download == "Y"):
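The download step can also be wrapped in small functions, which makes it easier to try out on its own. This is a sketch under stated assumptions: the names download and wants_download are made up for this example, and it calls urllib.request directly instead of through the DFU alias.

```python
import os
import urllib.request


def download(url, file_name, extension):
    """Save `url` as file_name + extension; return the path, or None on failure."""
    destination = file_name + extension
    folder = os.path.dirname(destination)
    if folder:
        os.makedirs(folder, exist_ok=True)  # urlretrieve will not create it
    try:
        urllib.request.urlretrieve(url, destination)
    except (OSError, ValueError) as error:
        # URLError and HTTPError are OSError subclasses; ValueError covers bad URLs.
        print(f"Sorry! Download unsuccessful: {error}")
        return None
    return destination


def wants_download(answer):
    """Treat "y" or "Y" (with stray spaces) as yes, anything else as no."""
    return answer.strip().lower() == "y"
```

For example, download("https://www.w3schools.com/html/movie.mp4", "download/web_scrap_demo", ".mp4") would save the sample video as download/web_scrap_demo.mp4, and wants_download(input("Do you want to download(y/N) : ")) accepts both y and Y in a single comparison.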

That's all on how to download an image and a video from Instagram.

Get the source code here

Thank you for taking the time to go through this article.

KEEP MOVING ON 💪💪💪💪💪💪

HAPPY CODING

Top comments (8)

AH Design • Edited

Thanks for sharing. I'm learning programming languages and hopefully one day i might be good programmer. I have my personal website launched as I'm a freelance web designer in Dubai. Appreciate work you put on to post this.

FRANCIS ODERO

Thank you so much for taking your time to read my article I appreciate

dunnyk

A good doc, keep informing.
if (download == "y" or download == "Y"):
in the above code I think you can reduce it by taking in any input whether Y or y, then convert it to lower.
if download.lower() == 'y':
do something...
This can replace the below statement saving you more...
if (download == "y" or download == "Y"):

FRANCIS ODERO

on it thank you

Brayan Kai

Great article 👌 Here, really insightful. Keep it up 🥳

FRANCIS ODERO

Thank you Brayan

Rahul kumar

I have built a tool for content creators to generate open graph images for social media posts.

see -> og-image-client.vercel.app

Must check it out

SHEM MAINA

Great work