Fredy Somy
Scrape GitHub User Details with Python

When I was learning web scraping, one of the ideas that came to my mind was a GitHub scraper.
Here I will try my best to describe each step.

Let's start.

We have to install a couple of packages first:

  • requests
  • beautifulsoup4
  • html5lib

pip install requests
pip install html5lib
pip install beautifulsoup4

  • Then open https://github.com/yourusername
  • Open DevTools.
  • This is what I see when I open my profile page with DevTools open.
  • When scraping a web page, we need an element's id, class name, or XPath to locate it.

  • We will be scraping the name, username, number of repos, followers, following, and profile image.

import requests
from bs4 import BeautifulSoup
import html5lib
  • Import the modules.

r=requests.get("https://github.com/fredysomy")
soup=BeautifulSoup(r.content,'html5lib')
  • Make a request to the website.
  • Parse the HTML received as the response in r.content using BeautifulSoup with the html5lib parser.

  • From here, the actual scraping starts.


namediv=soup.find("h1" ,class_="vcard-names pl-2 pl-md-0")
name=namediv.find_all('span')[0].getText()
u_name=namediv.find_all('span')[1].getText()
  • Here we are selecting the element with the class name vcard-names pl-2 pl-md-0.
  • The name and username are in span elements inside that div.
  • We have assigned that element to the namediv variable.
  • We find all span elements, pick one by index (0: name, 1: username), and get its text with the getText() function.
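The find_all indexing above can be tried offline on a minimal stand-in snippet (the markup below is hypothetical, and Python's built-in html.parser is used so no extra parser is needed):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the profile's name block.
html = '''
<h1 class="vcard-names pl-2 pl-md-0">
  <span>Fredy Somy</span>
  <span>fredysomy</span>
</h1>
'''
soup = BeautifulSoup(html, 'html.parser')
namediv = soup.find("h1", class_="vcard-names pl-2 pl-md-0")
spans = namediv.find_all('span')   # [name span, username span]
name = spans[0].getText().strip()
u_name = spans[1].getText().strip()
print(name, u_name)
```

Note that passing the full space-separated string to class_ matches the element's exact class attribute value, which is handy for GitHub's long utility-class names.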

statstab=soup.find(class_="flex-order-1 flex-md-order-none mt-2 mt-md-0")
elements=statstab.find(class_="mb-3")
followers=elements.find_all('a')[0].find('span').getText().strip(' ')
following=elements.find_all('a')[1].find('span').getText().strip(' ')
totstars=elements.find_all('a')[2].find('span').getText().strip(' ')
  • Here the same thing happens.
  • Followers, following, and stargazers are inside the element with the class name flex-order-1 flex-md-order-none mt-2 mt-md-0, nested in the mb-3 element inside it.

  • Let's get that and store it in the elements variable.

  • Calling find_all('a') on elements returns a list of links:

    • Followers is at index 0
    • Following is at index 1
    • Stargazers is at index 2
elements.find_all('a')[2].find('span').getText().strip(' ')
  • Here we take the item at index 2 of that list and call getText() on the span inside it. We use strip(' ') to remove unnecessary blank spaces from the result.
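The same pattern can be checked against a small hypothetical stats block. One detail worth knowing: strip(' ') only removes spaces, while strip() with no argument also removes newlines and tabs, which is usually safer here:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the followers/following/stars block.
html = '''
<div class="mb-3">
  <a><span> 10 </span> followers</a>
  <a><span> 5 </span> following</a>
  <a><span> 3 </span></a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
links = soup.find(class_="mb-3").find_all('a')
# strip() with no argument removes all surrounding whitespace.
followers = links[0].find('span').getText().strip()
following = links[1].find('span').getText().strip()
totstars = links[2].find('span').getText().strip()
print(followers, following, totstars)
```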
u_img=soup.find(class_="avatar avatar-user width-full border bg-white")['src']
  • The above code selects the image tag, and we read its src attribute.
repo_num=soup.find(class_="UnderlineNav-body").find('span',class_="Counter").getText()
  • Here we get the number of repositories the user has.
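The repo count uses two chained find() calls: first the tab bar, then the counter badge inside it. A quick offline check on hypothetical stand-in markup:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the repositories tab and its Counter badge.
html = '<nav class="UnderlineNav-body"><a>Repositories <span class="Counter">12</span></a></nav>'
soup = BeautifulSoup(html, 'html.parser')
# find() returns the first match, so we land on the first Counter badge in the tab bar.
repo_num = soup.find(class_="UnderlineNav-body").find('span', class_="Counter").getText()
print(repo_num)
```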

  • That is all you need to scrape user details with Python.

    Source Code

import requests
from bs4 import BeautifulSoup

r = requests.get("https://github.com/fredysomy")
soup = BeautifulSoup(r.content, 'html5lib')

# Name and username
namediv = soup.find("h1", class_="vcard-names pl-2 pl-md-0")
name = namediv.find_all('span')[0].getText()
u_name = namediv.find_all('span')[1].getText()

# Followers, following, and total stars
statstab = soup.find(class_="flex-order-1 flex-md-order-none mt-2 mt-md-0")
elements = statstab.find(class_="mb-3")
followers = elements.find_all('a')[0].find('span').getText().strip(' ')
following = elements.find_all('a')[1].find('span').getText().strip(' ')
totstars = elements.find_all('a')[2].find('span').getText().strip(' ')

# Profile image URL and repository count
u_img = soup.find(class_="avatar avatar-user width-full border bg-white")['src']
repo_num = soup.find(class_="UnderlineNav-body").find('span', class_="Counter").getText()
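To make the script reusable (and testable without a network request), the parsing steps can be wrapped in a function that takes the page HTML. This is a sketch: the class names are GitHub's markup at the time of writing and may break when the page is redesigned, and the sample HTML below is a hypothetical stand-in for a real profile page:

```python
from bs4 import BeautifulSoup

def parse_profile(html):
    """Extract user details from the HTML of a GitHub profile page."""
    soup = BeautifulSoup(html, 'html.parser')
    namediv = soup.find("h1", class_="vcard-names pl-2 pl-md-0")
    spans = namediv.find_all('span')
    stats = soup.find(class_="flex-order-1 flex-md-order-none mt-2 mt-md-0")
    links = stats.find(class_="mb-3").find_all('a')
    return {
        "name": spans[0].getText().strip(),
        "username": spans[1].getText().strip(),
        "followers": links[0].find('span').getText().strip(),
        "following": links[1].find('span').getText().strip(),
        "stars": links[2].find('span').getText().strip(),
        "image": soup.find(class_="avatar avatar-user width-full border bg-white")['src'],
        "repos": soup.find(class_="UnderlineNav-body").find('span', class_="Counter").getText().strip(),
    }

# Hypothetical stand-in HTML mirroring the structure the selectors expect.
SAMPLE = '''
<nav class="UnderlineNav-body"><a>Repositories <span class="Counter">12</span></a></nav>
<img class="avatar avatar-user width-full border bg-white" src="https://example.com/avatar.png">
<h1 class="vcard-names pl-2 pl-md-0"><span>Fredy Somy</span><span>fredysomy</span></h1>
<div class="flex-order-1 flex-md-order-none mt-2 mt-md-0">
  <div class="mb-3">
    <a><span>10</span> followers</a>
    <a><span>5</span> following</a>
    <a><span>3</span></a>
  </div>
</div>
'''

profile = parse_profile(SAMPLE)
print(profile)
```

For live use, pass `requests.get("https://github.com/yourusername").content` to the function instead of the sample string.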
  • The idea is that we make the program navigate to the element we want and then select the required element.
  • Refer to some BeautifulSoup methods here.

  • I have also made a PyPI module to scrape GitHub. See it here and give it a star if you like it.

If you have any doubts or need clarification, comment down below.

Stay tuned for part 2, where we will scrape the user's repo details.
