During a Software Engineering class I took while studying abroad, I developed an application that uses Naive Bayes to classify GitHub issues as resolved or simply closed.
This type of analysis allows developers to understand which repositories are fixing issues and which are simply closing them due to inactivity or other such reasons.
View Project
claire-1 / github-metrics
Github Metrics: The Classification of Github Issues as Resolved or Unresolved using Naive Bayes
I wrote a blog post about this project here!
I created this project for a Software Engineering class I took.
Project Description
For this project, I collected comments from issues on GitHub repositories. I used Naive Bayes to run sentiment analysis on the comments and decide whether each issue was really resolved or simply closed. After classifying the issues, I display the results in a bar chart and in a bubble chart. The bar chart lets you see the number of issues resolved and closed over time. The bubble chart lets you see the individual resolved and closed issues and click on links to see more information about each issue.
How to run
This project is fully dockerized. You must have Docker and docker-compose installed.
To run this project, go to the…
Technical Details
This project is fully dockerized. The backend is written in Java and the frontend in JavaScript. I used some SQL for the database as well.
Process
- In one docker container, this application fetches the comments from issues in a specified repository using this Github Java API.
- Next, it runs a Naive Bayes classifier on the last comment on the closed issue to decide whether the issue was actually resolved or just closed.
- This classification is then put into a SQL database running in another docker container.
- Once all the issues have been processed, the program queries the database to calculate the number of closed issues. It groups them by the times they were closed. It outputs the result in a JSON file.
- Next, it queries the database again to get the url associated with each issue so that it can display all the closed issues with clickable links.
- To generate the display, the start-up script for the main docker container runs the http-server command to start a webserver that is accessible from the host machine of the docker container.
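The classification step in the list above can be sketched roughly as follows. This is a minimal multinomial Naive Bayes with add-one smoothing, not the project's actual classifier: the class name `CommentClassifier`, the labels, and the training comments are all made up for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a Naive Bayes classifier for issue comments.
class CommentClassifier {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // label -> word -> count
    private final Map<String, Integer> totalWords = new HashMap<>();              // label -> words seen
    private final Map<String, Integer> docCounts = new HashMap<>();               // label -> comments seen
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Lowercase the text and split on anything that is not a letter or digit.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }

    // Record one labeled training comment.
    void train(String label, String comment) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : tokenize(comment)) {
            if (word.isEmpty()) continue;
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // Return the label with the highest log-probability, or null if untrained.
    String classify(String comment) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // Log prior: share of training comments carrying this label.
            double score = Math.log((double) docCounts.get(label) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String word : tokenize(comment)) {
                if (word.isEmpty()) continue;
                // Add-one (Laplace) smoothing so unseen words do not zero out the score.
                int count = counts.getOrDefault(word, 0);
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        CommentClassifier classifier = new CommentClassifier();
        classifier.train("resolved", "Fixed in the latest release, thanks!");
        classifier.train("resolved", "This was resolved by the merged patch");
        classifier.train("closed", "Closing due to inactivity");
        classifier.train("closed", "Closing as stale, no response from reporter");
        System.out.println(classifier.classify("Fixed by the latest patch"));
    }
}
```

Working in log space avoids numeric underflow when a comment has many words. With only four training comments the result means little, which is why the quality of the training data matters so much.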
Once this process completes, you can view the classification of the issues by visiting http://localhost:8080/bubble.html or http://localhost:8080/bar-chart.html in your browser.
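The counts behind the bar chart might be computed along these lines. The project does this with SQL queries against its database; the sketch below does the same kind of grouping in plain Java instead. The `ClosedIssue` class, its fields, the choice of grouping by month, and the JSON shape are all assumptions made for illustration.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch: count classified issues per month and emit JSON.
class IssueCounts {
    // Count issues with the given classification, keyed by the month they were closed.
    // A TreeMap keeps the months in chronological order.
    static Map<YearMonth, Long> countByMonth(List<ClosedIssue> issues, boolean resolved) {
        return issues.stream()
                .filter(issue -> issue.resolved == resolved)
                .collect(Collectors.groupingBy(
                        issue -> YearMonth.from(issue.closedOn),
                        TreeMap::new,
                        Collectors.counting()));
    }

    // Serialize the counts as a JSON array; the keys are plain year-month strings,
    // so no escaping is needed here.
    static String toJson(Map<YearMonth, Long> counts) {
        return counts.entrySet().stream()
                .map(e -> "{\"month\":\"" + e.getKey() + "\",\"count\":" + e.getValue() + "}")
                .collect(Collectors.joining(",", "[", "]"));
    }

    public static void main(String[] args) {
        List<ClosedIssue> issues = List.of(
                new ClosedIssue(1, LocalDate.of(2019, 3, 2), true),
                new ClosedIssue(2, LocalDate.of(2019, 3, 15), false),
                new ClosedIssue(3, LocalDate.of(2019, 4, 1), true));
        System.out.println(toJson(countByMonth(issues, true)));
    }
}

// Hypothetical row type; the real project reads this data from its SQL database.
final class ClosedIssue {
    final int number;
    final LocalDate closedOn;
    final boolean resolved;

    ClosedIssue(int number, LocalDate closedOn, boolean resolved) {
        this.number = number;
        this.closedOn = closedOn;
        this.resolved = resolved;
    }
}
```

Calling `countByMonth` once with `true` and once with `false` gives the two series (resolved and simply closed) that a bar chart over time would need.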
Example Display
Future Work
I used an API to access GitHub, and it is not perfect. It does not report when rate limiting prevents it from fetching all the issues for a repository. As a result, if a repository has a large number of closed issues, this application might not display all of them: it doesn't know that more exist and has no way to fetch the rest. The API also includes closed pull requests when asked for all the closed issues.
Additionally, the classification of issues as resolved or closed is heavily dependent on the training data. More and better training data would lead to more accurate predictions.
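One possible way to at least detect that more closed issues remain would be to talk to GitHub's REST API directly and read the `Link` response header it sends on paginated results. The sketch below only parses that header. It is not how the Java API I used works, and the `Pagination` class and `nextPageUrl` method are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: find the next page of results from a Link header.
class Pagination {
    // Matches one entry of the form <url>; rel="next"
    private static final Pattern NEXT_LINK =
            Pattern.compile("<([^>]+)>\\s*;\\s*rel=\"next\"");

    // Return the URL of the next page, or empty if this is the last page
    // (or the response carried no Link header at all).
    static Optional<String> nextPageUrl(String linkHeader) {
        if (linkHeader == null) {
            return Optional.empty();
        }
        Matcher matcher = NEXT_LINK.matcher(linkHeader);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String header = "<https://api.github.com/repositories/1/issues?state=closed&page=2>; rel=\"next\", "
                + "<https://api.github.com/repositories/1/issues?state=closed&page=5>; rel=\"last\"";
        System.out.println(nextPageUrl(header).orElse("no more pages"));
    }
}
```

A fetch loop could keep requesting pages until `nextPageUrl` comes back empty, and report an incomplete result if a request fails before then. In the same loop, pull requests could be skipped: in GitHub's REST API, issue objects that are really pull requests carry a `pull_request` key.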
Final Thoughts
I learned a lot about developing iteratively while working on this project due to the numerous moving parts involved.
I also learned a lot about working with open source code in making this project. The existing Java GitHub API let me easily access data from GitHub, even if the API has a few bugs. Additionally, I used some existing JavaScript code to display my final data, and found several tutorials on implementing Naive Bayes in Java that were immensely helpful. I could not have completed this project without the help of a vibrant open source community! Thank you!
Header image source: https://pixabay.com/illustrations/question-mark-pile-questions-symbol-2492009/