I worked on this project as part of the finals for my Artificial Intelligence class. This project uses Machine Learning to predict the outcome of a football match when given some stats from half time.
You can check out the demo here: https://football-predictor.projects.aziztitu.com/
You can check out the source code here: https://github.com/AzeeSoft/football-match-predictor
For this project, I decided to use Python since I was very familiar with it, and also because it had a lot of awesome tools for machine learning.
Firstly, in order for this match prediction to work, I needed some good datasets. After looking around for a while, I found this site: https://datahub.io/collections/football, which contained structured datasets for a variety of football competitions ranging from national leagues to world cups. But to keep things simple, I decided to select the datasets for the top 5 European Leagues that contained the match results for the last 9 years.
Even though the data was already structured, I had to clean up some missing/misleading data. You can check out the repo for more info on this process.
After cleaning up the data, I applied several data analysis techniques, such as Chi-Squared analysis and the Variance Inflation Factor (VIF), to extract the most important features.
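Here is a minimal sketch of what this kind of feature analysis can look like. It is not the exact code from the repo: the data is synthetic, the column names are made up for illustration, and the VIF is computed by hand with NumPy as 1 / (1 - R²), where R² comes from regressing each feature on the others. The Chi-Squared scores come from scikit-learn's `chi2`, which expects non-negative features such as counts.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Synthetic stand-in data; column names are hypothetical, not the repo's.
rng = np.random.default_rng(0)
n = 500
home_shots = rng.poisson(6, n)
away_shots = rng.poisson(5, n)
total_shots = home_shots + away_shots  # deliberately redundant feature
home_goals_ht = rng.binomial(home_shots, 0.15)
away_goals_ht = rng.binomial(away_shots, 0.15)

# Simulated full-time result: 0 = home win, 1 = draw, 2 = away win.
home_ft = home_goals_ht + rng.poisson(0.7, n)
away_ft = away_goals_ht + rng.poisson(0.6, n)
y = np.where(home_ft > away_ft, 0, np.where(home_ft == away_ft, 1, 2))

names = ["home_shots", "away_shots", "total_shots",
         "home_goals_ht", "away_goals_ht"]
X = np.column_stack([home_shots, away_shots, total_shots,
                     home_goals_ht, away_goals_ht])


def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        residuals = target - A @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        # A perfectly explained column has an unbounded VIF.
        vifs.append(float("inf") if 1.0 - r2 < 1e-12 else 1.0 / (1.0 - r2))
    return vifs


# Chi-Squared: how strongly each feature is associated with the outcome.
chi2_scores, p_values = chi2(X, y)
vifs = variance_inflation_factors(X)

for name, score, p, vif in zip(names, chi2_scores, p_values, vifs):
    print(f"{name:15s} chi2={score:8.2f}  p={p:.4f}  VIF={vif:.2f}")
```

A feature with a high Chi-Squared score is a good candidate to keep, while a very high VIF (a common rule of thumb is above 5 to 10) suggests the feature is largely redundant with the others, like `total_shots` in this toy data.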
With this processed data, I trained 3 different models, namely:
- Naive Bayes
- Random Forest
- Logistic Regression (One vs Rest)
After further tweaks and adjustments, the Logistic Regression and Random Forest models performed best, each reaching 70% accuracy, while the Naive Bayes model reached around 65%.
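To give an idea of how the three models can be trained and compared side by side, here is a rough sketch using scikit-learn. It runs on a synthetic 3-class dataset (think home win, draw, away win), so the accuracies it prints are not the ones quoted above. The choice of `GaussianNB` for Naive Bayes and all hyperparameters are assumptions made for the example, not necessarily what the repo uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic 3-class data standing in for the processed match features.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    # One vs Rest: one binary logistic regression per outcome class.
    "Logistic Regression (One vs Rest)": OneVsRestClassifier(
        LogisticRegression(max_iter=1000)
    ),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracies[name]:.3f}")
```

Holding out a test split like this keeps the comparison fair, since every model is scored on matches it has never seen.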
I really wanted to show off this model to my friends and professors. So, I decided to deploy it on a remote server.
To do this, I exported the trained model to a file using a Python package called 'joblib'. Then, I created a simple Django web server with a REST API that loads this trained model and makes the prediction. You can check out the final result here: https://football-predictor.projects.aziztitu.com/
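The export and load step looks roughly like the sketch below. The model, the file name `match_predictor.joblib`, and the helper `predict_outcome` are placeholders for illustration and are not taken from the repo. The Django part is left out so the example can run on its own.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small placeholder model on synthetic 3-class data.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Export the trained model to a file (hypothetical file name).
model_path = os.path.join(tempfile.mkdtemp(), "match_predictor.joblib")
joblib.dump(model, model_path)

# On the server: load the model once at startup, then reuse it per request.
loaded_model = joblib.load(model_path)


def predict_outcome(features):
    """Return the predicted class for one row of feature values.

    This is the kind of helper a REST view could call after parsing
    the half-time stats out of the request body.
    """
    return int(loaded_model.predict([features])[0])


print(predict_outcome(list(X[0])))
```

In a Django view, the request's JSON payload would be turned into the `features` list and the result returned with something like `JsonResponse`. Loading the model once rather than on every request keeps the API responsive.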
NOTE: For a more detailed description of the process, check out the Readme in the repo.
Initially, I did not think I was going to get 70% accuracy with these models, so it is really cool to see it in action. Analysing the dataset, preprocessing it, and selecting the right features were the most stressful parts of this project. But looking back, it was all totally worth it.
Some things I'd like to add to this project in the future are:
One of the drawbacks at the moment is that the identity of the teams doesn't have much impact on the predicted outcome. But in practice, which teams are playing matters a lot. A first-division team has a much higher chance of winning a game against a third-division team, even if the match is played at the third-division team's home ground.
The other thing I want the model to take into account is the ability of a team to bounce back. There are certain teams in football that play defensively in the first half and more aggressively in the second half, or vice versa.
In order for the model to take these things into account, I plan to pre-compute these values for each team and store them locally. I can then re-train the models with these features. During prediction, I can use each team's pre-computed values as supplemental features, which should help the model make better predictions.
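As a rough sketch of that plan, the snippet below pre-computes two values per team from past results and appends them to the half-time stats at prediction time. Everything here is hypothetical: the `Match` fields, the team names, and the two chosen values (overall win rate, and a "comeback rate" defined as avoiding defeat after trailing at half time) are just one way to capture team strength and the ability to bounce back.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Match:
    home: str
    away: str
    home_ht: int  # home goals at half time
    away_ht: int  # away goals at half time
    home_ft: int  # home goals at full time
    away_ft: int  # away goals at full time


def precompute_team_features(matches):
    """Compute win rate and comeback rate for every team."""
    totals = defaultdict(
        lambda: {"played": 0, "wins": 0, "trailing_at_ht": 0, "comebacks": 0}
    )
    for m in matches:
        # Look at each match once from each team's point of view.
        sides = [
            (m.home, m.home_ht, m.away_ht, m.home_ft, m.away_ft),
            (m.away, m.away_ht, m.home_ht, m.away_ft, m.home_ft),
        ]
        for team, ht_for, ht_against, ft_for, ft_against in sides:
            t = totals[team]
            t["played"] += 1
            t["wins"] += int(ft_for > ft_against)
            if ht_for < ht_against:
                t["trailing_at_ht"] += 1
                # Comeback: trailing at half time but not losing in the end.
                t["comebacks"] += int(ft_for >= ft_against)

    features = {}
    for team, t in totals.items():
        trailing = t["trailing_at_ht"]
        features[team] = {
            "win_rate": t["wins"] / t["played"],
            "comeback_rate": t["comebacks"] / trailing if trailing else 0.0,
        }
    return features


def build_feature_row(half_time_stats, home, away, team_features):
    """Append both teams' pre-computed values to the half-time stats."""
    default = {"win_rate": 0.0, "comeback_rate": 0.0}  # unseen team
    h = team_features.get(home, default)
    a = team_features.get(away, default)
    return list(half_time_stats) + [
        h["win_rate"], h["comeback_rate"],
        a["win_rate"], a["comeback_rate"],
    ]


matches = [
    Match("A", "B", 0, 1, 2, 1),  # A trails at half time, then wins
    Match("B", "A", 1, 0, 1, 0),  # A trails at half time and loses
    Match("A", "C", 1, 1, 1, 1),  # level at half time, ends in a draw
]
team_features = precompute_team_features(matches)
row = build_feature_row([1, 0], "A", "B", team_features)
print(row)
```

The lookup table only needs to be rebuilt when new results come in, so prediction stays a cheap dictionary lookup plus the usual model call.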
I'd like the model to also take the players on the pitch into consideration when making the prediction. In practice, a team has a higher chance of winning the game when its star players are on the pitch.
This is more of a long shot. As of now, the model makes the prediction based on the half-time stats. Eventually I'd like the model to predict the results for a live match all the way from minute 0 to minute 90. To do this, it must learn to account for the current match time. But training the model to account for this is going to be extremely hard.
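One possible way to approach this, sketched under the big assumption that goal-minute data is available (the half-time datasets described above may not include it), is to expand each past match into several snapshots, each with the current minute as a feature. The function name and the row layout below are made up for illustration.

```python
def minute_snapshots(home_goal_minutes, away_goal_minutes,
                     step=15, match_length=90):
    """Expand one finished match into training rows, one per time step.

    Every row holds the minute, the score at that minute, and the
    final result ("H", "D" or "A") as the label to predict.
    """
    final_home = len(home_goal_minutes)
    final_away = len(away_goal_minutes)
    if final_home > final_away:
        result = "H"
    elif final_home < final_away:
        result = "A"
    else:
        result = "D"

    rows = []
    for minute in range(0, match_length + 1, step):
        rows.append({
            "minute": minute,
            # Count only the goals scored up to this minute.
            "home_goals": sum(1 for g in home_goal_minutes if g <= minute),
            "away_goals": sum(1 for g in away_goal_minutes if g <= minute),
            "result": result,
        })
    return rows


# Hypothetical match: home scores at 23' and 78', away scores at 40'.
rows = minute_snapshots([23, 78], [40])
for r in rows:
    print(r)
```

A model trained on rows like these would see the same match at many points in time, which is one way it could learn that a 1-0 lead at minute 80 means more than a 1-0 lead at minute 10.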