Muchamad Faiz for Zetta

Scraping Top Repositories for GitHub Topics

GitHub is a popular website for sharing open-source projects and code repositories. For example, the tensorflow repository contains the entire source code of the TensorFlow deep learning framework.

Repositories on GitHub can be tagged with topics. For example, the tensorflow repository has the topics python, machine-learning, deep-learning, etc.

The page https://github.com/topics provides a list of the top topics on GitHub. In this project, we'll retrieve information from this page using web scraping: the process of extracting information from a website in an automated fashion using code. We'll use the Python libraries Requests and Beautiful Soup to scrape data from this page.
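As a tiny preview of how these two libraries fit together (a minimal sketch; the real functions come later in the post):

import requests
from bs4 import BeautifulSoup

# download the topics page and parse its HTML
response = requests.get("https://github.com/topics")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.text)  # the page's <title> text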

Project Outline

  1. We're going to scrape https://github.com/topics.
  2. We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
  3. For each topic, we'll get the top 25 repositories listed on the topic page.
  4. For each repository, we'll grab the repo name, username, stars, and repo URL.
  5. By the end of the project, we'll create a CSV and an XLSX file in the following format:

(Screenshot of the resulting CSV/XLSX output.)
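Concretely, each row of the output combines the topic information with one repository's information. A rough sketch of a row (the column names match the code below; the values are only illustrative):

title,desc,repo_username,repo_name,repo_url,repo_star
3D,3D refers to the use of three-dimensional graphics...,mrdoob,three.js,https://github.com/mrdoob/three.js,96000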

Install and import all the libraries needed

Before we begin, let's install the libraries with pip:

pip install requests
pip install beautifulsoup4 
pip install pandas

Then, let's import all the libraries in the code editor:

import requests as req
from bs4 import BeautifulSoup
import pandas as pd

Write a function to download the page

def get_topic_link(base_url):
    # Download the topics page and print the status code as a quick check
    response = req.get(url=base_url)
    page_content = response.text
    print(response.status_code)

    # Each topic on https://github.com/topics is wrapped in a <div> with this class
    soup = BeautifulSoup(page_content, "html.parser")
    tags = soup.find_all("div", class_="py-4 border-bottom d-flex flex-justify-between")
    topic_links = []
    for tag in tags:
        url_end = tag.find("a")["href"]  # e.g. "/topics/3d"
        topic_link = f"https://github.com{url_end}"
        topic_links.append(topic_link)
    return topic_links
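A quick way to sanity-check the function (a sketch; the exact links depend on what GitHub currently lists on the page):

topic_links = get_topic_link("https://github.com/topics")
print(len(topic_links))   # roughly 30 topics on the first page
print(topic_links[:3])    # e.g. ['https://github.com/topics/3d', ...]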

Write functions to extract the information

def get_info_topic(topic_link):
    # Grab the topic title (e.g. "3D") and its description from the topic page
    response1 = req.get(topic_link)
    topic_soup = BeautifulSoup(response1.text, "html.parser")
    topic_tag = topic_soup.find("h1").text.strip()
    topic_desc = topic_soup.find("p").text.strip()
    info_topic = {
        "title": topic_tag,
        "desc": topic_desc
    }
    return info_topic


def get_info_tags(topic_link):
    # Each repository card on a topic page is wrapped in a <div> with this class
    response = req.get(topic_link)
    info_soup = BeautifulSoup(response.text, "html.parser")
    repo_tags = info_soup.find_all("div", class_="d-flex flex-justify-between flex-items-start flex-wrap gap-2 my-3")
    return repo_tags


def get_info(tag):
    # The first <a> in the card is the owner, the second is the repository itself
    repo_username = tag.find_all("a")[0]
    repo_name = tag.find_all("a")[1]

    # The repository link's href looks like "/owner/repo"
    url_end = repo_name["href"]
    repo_url = f"https://github.com{url_end}"

    # Star counts are shown in shorthand, e.g. "96.4k"; convert them to integers
    repo_star = tag.find("span", {"id": "repo-stars-counter-star"}).text.strip()
    if repo_star.endswith("k"):
        repo_value = int(float(repo_star[:-1]) * 1000)
    else:
        repo_value = int(repo_star)

    topics_data = {
        "repo_username": repo_username.text.strip(),
        "repo_name": repo_name.text.strip(),
        "repo_url": repo_url,
        "repo_star": repo_value,
        }
    return topics_data
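To see these three helpers in action on a single topic (a sketch; the actual titles and numbers depend on GitHub's current rankings):

links = get_topic_link("https://github.com/topics")
print(get_info_topic(links[0]))            # e.g. {'title': '3D', 'desc': '3D refers to ...'}
first_repo_tag = get_info_tags(links[0])[0]
print(get_info(first_repo_tag))            # owner, repo name, repo URL, and star count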

Create CSV file(s) with the extracted information

def save_CSV(results):
    df = pd.DataFrame(results)
    df.to_csv("github.csv", index=False)

Create XLSX file(s) with the extracted information

def save_XLX(results):
    df = pd.DataFrame(results)
    df.to_excel("github.xlsx", index=False)
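Note that pandas hands off .xlsx writing to an external engine, so if to_excel raises an ImportError you will also need to install openpyxl:

pip install openpyxl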

Putting it all together

  • We have a function to get the list of topics.
  • We have functions to extract topic and repository information and to save the results as CSV and XLSX files.
  • Let's write a main function to put them all together.
def main():
    base_url = "https://github.com/topics"
    topic_links = get_topic_link(base_url)  # list of URLs, e.g. https://github.com/topics/3d, https://github.com/topics/ajax, etc.
    result2 = []
    for topic_link in topic_links:  # e.g. https://github.com/topics/3d
        print(f"getting info {topic_link}")
        topic_tags = get_info_topic(topic_link)  # {"title": ..., "desc": ...}
        repo_tags = get_info_tags(topic_link)    # repository cards we can loop over
        result1 = []
        for tag in repo_tags:
            repo_info = get_info(tag)
            result1.append(repo_info)
        for x in result1:
            # merge ("gabungan") the topic info with each repo's info; dict union needs Python 3.9+
            gabungan = topic_tags | x
            result2.append(gabungan)
        # rewrite the output files after every topic so partial progress is kept
        save_CSV(result2)
        save_XLX(result2)


if __name__ == "__main__":
    main()
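As an optional improvement: the requests above send no headers and never check for failed responses. A hypothetical fetch helper like the one below could be reused inside the functions to make the scraper a little more polite and robust (the User-Agent string is just an example).

import requests as req

HEADERS = {"User-Agent": "github-topics-scraper (learning project)"}

def fetch(url):
    # send an explicit User-Agent and fail loudly on 4xx/5xx responses
    response = req.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.text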

Conclusion

We are done here. I hope this simple project is useful practice for web scraping in Python.

GitHub: https://github.com/muchamadfaiz
Email: muchamadfaiz@gmail.com
