Revanth Reddy

TAMU DATATHON

DaHack - Hackathons Dataset!

GitHub Link: https://github.com/revanth-reddy/tamudatathon

A hackathons dataset, built in a hackathon!


This dataset was built as part of TAMU Datathon 2021. It contains details of hackathons, including title, location, start date, end date, prize money, number of participants, host, and themes.

Among the TAMU Datathon 2021 challenges, we chose the TD Data Synthesis Challenge and tackled the problem of synthesizing a dataset of hackathons from around the world.

Construction of Dataset

The data for this dataset was obtained by web scraping with Python and Selenium.

Size

  • Due to time constraints, we generated a dataset of 6,236 hackathons. However, this size can be increased tremendously.

Uniqueness

  • This is the first hackathons dataset with 8 unique features covering the major datapoints of a hackathon.

Usefulness

  • To observe trends:
    • Number of hackathons per year.
    • Prize money trends over the years.
    • The rise of themes through the years.
    • Participation across years and themes.
    • And many more ...
  • To build ML models (see the sketch after this list):
    • To predict prize money from features like participation, themes, and location.
    • To estimate participation from location, prize money, themes, etc.
    • To analyze the relationships between the above features.
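
Once the CSV is loaded, trend questions like these reduce to simple group-bys. Here is a minimal sketch, assuming the file is named hackathons.csv with columns such as Start Date and Prize Money (the actual schema in the repository may differ):

```python
# Hypothetical trend analysis over the hackathons dataset.
# Column names ("Start Date", "Prize Money") are assumptions about the
# CSV schema, not confirmed from the repository.
import pandas as pd

df = pd.read_csv("hackathons.csv")

# Parse dates and prize amounts, coercing malformed rows to NaT/NaN.
df["Start Date"] = pd.to_datetime(df["Start Date"], errors="coerce")
df["Year"] = df["Start Date"].dt.year
df["Prize Money"] = pd.to_numeric(df["Prize Money"], errors="coerce")

# Trend 1: number of hackathons per year.
print(df.groupby("Year").size())

# Trend 2: median prize money per year.
print(df.groupby("Year")["Prize Money"].median())
```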

Visualization

(Visualization images of the dataset)

Team's Approach

  • To build a dataset, we needed a rich and authentic source; we chose Devpost.
  • To get the data from the website, we used web scraping tools to collect the raw data.
  • We used Selenium to scrape the website, with a Chrome WebDriver to automate the scrolling and scraping (a simplified sketch follows this list).
  • After scraping, the raw data was further processed and cleaned.
  • The clean data is stored as a dataset in a CSV file.
  • Unique datapoints were extracted for each hackathon: number of participants, start date, end date, location (online or offline), prize money, and themes.
  • After obtaining a decent dataset, we used Google Data Studio to provide insights into the dataset and the relations between its variables.
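
The sketch below shows the shape of that pipeline: open the Devpost listing page, scroll to load more tiles, extract a few fields, and write them to CSV. It assumes Selenium 4 with Chrome installed; the CSS selectors are illustrative placeholders, not Devpost's actual markup.

```python
# Simplified scraping pipeline sketch: scroll, extract, write to CSV.
import csv
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes Chrome is installed locally
driver.get("https://devpost.com/hackathons")

# Scroll a few times so the page loads more hackathon tiles.
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the dynamic content time to render

rows = []
for tile in driver.find_elements(By.CSS_SELECTOR, ".hackathon-tile"):  # placeholder selector
    rows.append([
        tile.find_element(By.CSS_SELECTOR, ".title").text,     # placeholder selector
        tile.find_element(By.CSS_SELECTOR, ".location").text,  # placeholder selector
        tile.find_element(By.CSS_SELECTOR, ".prize").text,     # placeholder selector
    ])
driver.quit()

with open("hackathons.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Location", "Prize Money"])
    writer.writerows(rows)
```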

Reflection

Problems

  • The content of the website is not rendered as static HTML; because it is dynamic, it is hard to scrape.
  • The website paginates via page scrolling: initially x items are loaded, scrolling loads another x items, and so on.
  • Unless we scroll, we can't scrape the entire dataset; a sketch of how to handle this follows.
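
One way to handle the infinite scroll is to keep scrolling until the page height stops growing, meaning no new items were loaded. A minimal sketch, reusing the driver from the pipeline above:

```python
import time

def scroll_to_end(driver, pause=2.0, max_rounds=100):
    """Scroll until the document height stops changing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # wait for the next batch of items to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content loaded; we've reached the end
        last_height = new_height
```
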
Future Scope

  • The dataset can be used to build machine learning models (to help pick a theme, estimate prize money or participation, and analyze hackathon trends).
  • Scraping can be automated with job schedulers (CI/CD) to collect data periodically.
  • Larger data can be scraped from various sources.
  • The scraping process can be further optimized with multithreading, as sketched below.
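
For example, listing pages could be fetched in parallel with a thread pool. This is a hypothetical sketch: fetch_page is a placeholder, and the page query parameter is an assumed URL scheme, not a confirmed Devpost interface.

```python
# Hypothetical parallel scraping with a thread pool.
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch_page(url):
    # Placeholder: in practice this step would drive a browser;
    # a plain GET is shown only to keep the sketch self-contained.
    return requests.get(url, timeout=10).text

# Assumed URL scheme, for illustration only.
urls = [f"https://devpost.com/hackathons?page={i}" for i in range(1, 11)]

with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch_page, urls))
```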

Team DaHack

GitHub: https://github.com/pcarribean/ | https://github.com/msharsha/ | https://github.com/revanth-reddy/

LinkedIn: manisha-bachu | msharsha | revanth--reddy

Team Image
