Fredy Somy
Scrape GitHub User Details with Python

When I was learning web scraping, one of the ideas that came to my mind was a GitHub scraper.
Here I will try my best to describe each step.

Let's start.

We have to install a couple of packages first:

  • requests
  • beautifulsoup4
  • html5lib

pip install requests
pip install html5lib
pip install beautifulsoup4

  • Then open https://github.com/yourusername
  • Open DevTools.
  • This is what I see when I open my profile page with DevTools open.
  • When scraping a web page, we need an element's id, class name, or XPath to locate it.

  • We will be scraping the name, username, number of repos, followers, following, and profile image.

import requests
from bs4 import BeautifulSoup
import html5lib
  • Import the modules.

r=requests.get("https://github.com/fredysomy")
soup=BeautifulSoup(r.content,'html5lib')
  • Make a request to the website.
  • Parse the HTML received as the response in r.content using BeautifulSoup with the html5lib parser.

  • From here, the actual scraping starts.


namediv=soup.find("h1" ,class_="vcard-names pl-2 pl-md-0")
name=namediv.find_all('span')[0].getText()
u_name=namediv.find_all('span')[1].getText()
  • Here we are selecting the element with the class name vcard-names pl-2 pl-md-0.
  • The name and username are in span elements inside that div.
  • We have assigned that element to the namediv variable.
  • We find all span elements, pick one by index (0: name, 1: username), and get its text with the getText() function.
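The find_all indexing above can be tried offline on a minimal stand-in snippet (the markup below is hypothetical, and Python's built-in html.parser is used so no extra parser is needed):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the profile's name block.
html = '''
<h1 class="vcard-names pl-2 pl-md-0">
  <span>Fredy Somy</span>
  <span>fredysomy</span>
</h1>
'''
soup = BeautifulSoup(html, 'html.parser')
namediv = soup.find("h1", class_="vcard-names pl-2 pl-md-0")
spans = namediv.find_all('span')   # [name span, username span]
name = spans[0].getText().strip()
u_name = spans[1].getText().strip()
print(name, u_name)
```

Note that passing the full space-separated string to class_ matches the element's exact class attribute value, which is handy for GitHub's long utility-class names.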

statstab=soup.find(class_="flex-order-1 flex-md-order-none mt-2 mt-md-0")
elements=statstab.find(class_="mb-3")
followers=elements.find_all('a')[0].find('span').getText().strip(' ')
following=elements.find_all('a')[1].find('span').getText().strip(' ')
totstars=elements.find_all('a')[2].find('span').getText().strip(' ')
  • Here the same thing happens.
  • Followers, following, and stargazers are inside the element with the class name flex-order-1 flex-md-order-none mt-2 mt-md-0, nested in the mb-3 element inside it.

  • Let's get that and store it in the elements variable.

  • Calling find_all('a') on elements returns a list of links:

    • Followers is at index 0
    • Following is at index 1
    • Stargazers is at index 2
elements.find_all('a')[2].find('span').getText().strip(' ')
  • Here we take the item at index 2 of that list and call getText() on the span inside it. We use strip(' ') to remove unnecessary blank spaces from the result.
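The same pattern can be checked against a small hypothetical stats block. One detail worth knowing: strip(' ') only removes spaces, while strip() with no argument also removes newlines and tabs, which is usually safer here:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the followers/following/stars block.
html = '''
<div class="mb-3">
  <a><span> 10 </span> followers</a>
  <a><span> 5 </span> following</a>
  <a><span> 3 </span></a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
links = soup.find(class_="mb-3").find_all('a')
# strip() with no argument removes all surrounding whitespace.
followers = links[0].find('span').getText().strip()
following = links[1].find('span').getText().strip()
totstars = links[2].find('span').getText().strip()
print(followers, following, totstars)
```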
u_img=soup.find(class_="avatar avatar-user width-full border bg-white")['src']
  • The above code selects the image tag, and we read its src attribute.
repo_num=soup.find(class_="UnderlineNav-body").find('span',class_="Counter").getText()
  • Here we get the number of repositories the user has.
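The repo count uses two chained find() calls: first the tab bar, then the counter badge inside it. A quick offline check on hypothetical stand-in markup:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the repositories tab and its Counter badge.
html = '<nav class="UnderlineNav-body"><a>Repositories <span class="Counter">12</span></a></nav>'
soup = BeautifulSoup(html, 'html.parser')
# find() returns the first match, so we land on the first Counter badge in the tab bar.
repo_num = soup.find(class_="UnderlineNav-body").find('span', class_="Counter").getText()
print(repo_num)
```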

  • That is all you need to scrape user details with Python.

    Source Code

import requests
from bs4 import BeautifulSoup

r = requests.get("https://github.com/fredysomy")
soup = BeautifulSoup(r.content, 'html5lib')

# Name and username
namediv = soup.find("h1", class_="vcard-names pl-2 pl-md-0")
name = namediv.find_all('span')[0].getText()
u_name = namediv.find_all('span')[1].getText()

# Followers, following, and total stars
statstab = soup.find(class_="flex-order-1 flex-md-order-none mt-2 mt-md-0")
elements = statstab.find(class_="mb-3")
followers = elements.find_all('a')[0].find('span').getText().strip(' ')
following = elements.find_all('a')[1].find('span').getText().strip(' ')
totstars = elements.find_all('a')[2].find('span').getText().strip(' ')

# Profile image URL and repository count
u_img = soup.find(class_="avatar avatar-user width-full border bg-white")['src']
repo_num = soup.find(class_="UnderlineNav-body").find('span', class_="Counter").getText()
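To make the script reusable (and testable without a network request), the parsing steps can be wrapped in a function that takes the page HTML. This is a sketch: the class names are GitHub's markup at the time of writing and may break when the page is redesigned, and the sample HTML below is a hypothetical stand-in for a real profile page:

```python
from bs4 import BeautifulSoup

def parse_profile(html):
    """Extract user details from the HTML of a GitHub profile page."""
    soup = BeautifulSoup(html, 'html.parser')
    namediv = soup.find("h1", class_="vcard-names pl-2 pl-md-0")
    spans = namediv.find_all('span')
    stats = soup.find(class_="flex-order-1 flex-md-order-none mt-2 mt-md-0")
    links = stats.find(class_="mb-3").find_all('a')
    return {
        "name": spans[0].getText().strip(),
        "username": spans[1].getText().strip(),
        "followers": links[0].find('span').getText().strip(),
        "following": links[1].find('span').getText().strip(),
        "stars": links[2].find('span').getText().strip(),
        "image": soup.find(class_="avatar avatar-user width-full border bg-white")['src'],
        "repos": soup.find(class_="UnderlineNav-body").find('span', class_="Counter").getText().strip(),
    }

# Hypothetical stand-in HTML mirroring the structure the selectors expect.
SAMPLE = '''
<nav class="UnderlineNav-body"><a>Repositories <span class="Counter">12</span></a></nav>
<img class="avatar avatar-user width-full border bg-white" src="https://example.com/avatar.png">
<h1 class="vcard-names pl-2 pl-md-0"><span>Fredy Somy</span><span>fredysomy</span></h1>
<div class="flex-order-1 flex-md-order-none mt-2 mt-md-0">
  <div class="mb-3">
    <a><span>10</span> followers</a>
    <a><span>5</span> following</a>
    <a><span>3</span></a>
  </div>
</div>
'''

profile = parse_profile(SAMPLE)
print(profile)
```

For live use, pass `requests.get("https://github.com/yourusername").content` to the function instead of the sample string.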
  • The idea is that we make the program navigate to the element we want and then select the required element.
  • Refer to some BeautifulSoup methods here.

  • I have also made a PyPI module to scrape GitHub. See it here and give it a star if you like it.

If you have any doubts or need clarification, comment down below.

Stay tuned for part 2, where we will scrape the user's repo details.
