A case study with Linear, Decision Tree, Random Forest, SVR & Gradient Boosting Regressor
Introduction
When we talk of the market value of a soccer player, we refer to an estimate in monetary terms of the worth of the player in the world of soccer. This amount is usually what the player's club is willing to accept in order to sell or transfer the player's contract to a different club.
The ability to predict the market value of a soccer player can provide a commercial advantage to richer clubs, since only a small subset of players is highly valuable.
In this article, I detail the steps used to predict a player's market value with some common machine learning regression algorithms in Python.
Get the Dataset
The dataset used can be downloaded from https://www.kaggle.com/karangadiya/fifa19
Prerequisites
Ensure you have the following installed on your machine:
- Python
- NumPy
- Pandas
- Scikit-learn
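If any of them are missing, they can be installed from the command line with pip:
pip install numpy pandas scikit-learn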
Building the Models
Import the dependencies
# Data manipulation
import pandas as pd
import numpy as np
# Machine Learning Algorithms
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
# Model Selection and Evaluation
from sklearn.model_selection import train_test_split
# Feature Scaling
from sklearn.preprocessing import StandardScaler
# Performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
Data Preprocessing
Load the dataset
dataset = pd.read_csv('data.csv')
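A quick look at what was loaded never hurts (optional):
# Check the raw dataset's shape and first few rows
print(dataset.shape)   # (18207, 89)
print(dataset.head())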
Dropping irrelevant columns
Our dataset is composed of 18207 rows and 89 columns. Not all of these columns are needed for our models, so we keep only the relevant ones.
to_stay = ["ID", "Age", "Overall", "Potential", "Value", "Wage"]
dataset.drop(dataset.columns.difference(to_stay), axis="columns", inplace=True)
Set the ID column as the dataset index
dataset.set_index("ID",inplace=True)
Our dataset now looks like this:
| ID | Age | Overall | Potential | Value | Wage |
| --- | --- | --- | --- | --- | --- |
| 158023 | 31 | 94 | 94 | €110.5M | €565K |
| 16254 | 39 | 72 | 72 | €210K | €3K |
Convert the Value column from string to numerical value
As can be observed, the Value column is a string. Using the value like this would produce suboptimal results for our regression models, so our next step is to convert it from string to number.
# Remove the euro sign
dataset['Value'] = dataset['Value'].apply(lambda x: x.split('€')[1])
# Convert values ending in 'M' to millions and those ending in 'K' to thousands
dataset['Value'] = dataset['Value'].apply(
    lambda x: float(x.split('M')[0]) * 1000000
    if 'M' in x else float(x.split('K')[0]) * 1000
)
Convert the Wage column from string to numerical value
Just as we converted the Value column, we also need to convert the Wage column from string to number.
# Same treatment as the Value column
dataset['Wage'] = dataset['Wage'].apply(lambda x: x.split('€')[1])
dataset['Wage'] = dataset['Wage'].apply(
    lambda x: float(x.split('M')[0]) * 1000000
    if 'M' in x else float(x.split('K')[0]) * 1000
)
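Since Value and Wage need identical handling, you could also factor the conversion into a small helper and apply it to both columns. This sketch is an alternative to the two blocks above, not an extra step (the helper name is my own):
def euro_string_to_number(amount):
    """Convert strings like '€110.5M' or '€565K' to a plain number of euros."""
    amount = amount.replace('€', '')
    if amount.endswith('M'):
        return float(amount[:-1]) * 1000000
    if amount.endswith('K'):
        return float(amount[:-1]) * 1000
    return float(amount)

for column in ['Value', 'Wage']:
    dataset[column] = dataset[column].apply(euro_string_to_number)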
Split the dataset into Training set and Test set
X = dataset[['Age', 'Overall', 'Potential', 'Wage']]
y = dataset['Value']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
Building our Models
Multiple Linear Regression
# Training the Multiple Linear Regression model on the Training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)
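Once fitted, you can peek at what the linear model learned; the coefficients line up with the Age, Overall, Potential and Wage columns (optional):
# Map each feature to its learned coefficient
print(dict(zip(X.columns, regressor.coef_)))
print(regressor.intercept_)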
Decision Tree Regression
# Training the Decision Tree Regression model on the Training set
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)
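If you are curious how complex the fitted tree turned out to be, scikit-learn exposes its depth and leaf count (optional):
print('Tree depth:', regressor.get_depth())
print('Number of leaves:', regressor.get_n_leaves())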
Random Forest Regression model
# Training the Random Forest Regression model on the Training set
regressor = RandomForestRegressor(n_estimators = 300)
regressor.fit(X_train, y_train)
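Tree ensembles also report how much each feature contributed to the splits, which is a quick sanity check on the columns we kept (optional):
# Rank the features by importance
importances = pd.Series(regressor.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))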
SVR model
When building an SVR model, we have to scale both the features and the target before fitting the model.
# Feature Scaling
sc_X = StandardScaler()
sc_y = StandardScaler()
scaled_X_train = sc_X.fit_transform(X_train)
scaled_y_train = sc_y.fit_transform(y_train.values.reshape(-1, 1))
# Training the SVR model on the Training set
regressor = SVR(kernel = 'rbf')
regressor.fit(scaled_X_train, scaled_y_train.ravel())  # SVR expects a 1-D target
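If you would rather not manage the scalers by hand, scikit-learn can bundle the feature and target scaling for you. This optional sketch uses Pipeline and TransformedTargetRegressor (neither appears in the code above) and stores the model under a separate name so it does not interfere with the steps that follow:
from sklearn.pipeline import make_pipeline
from sklearn.compose import TransformedTargetRegressor

svr_model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel='rbf')),
    transformer=StandardScaler(),
)
svr_model.fit(X_train, y_train)
# svr_model.predict(X_test) returns values directly in euros, no manual inverse_transform needed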
Gradient Boosting Regression model
# Training the Gradient Boosting Regression model on the Training set
regressor = GradientBoostingRegressor(n_estimators = 500)
regressor.fit(X_train, y_train)
Predicting the Test set results
To predict the test set results for the non-SVR models:
y_pred = regressor.predict(X_test)
For the SVR model, the inputs must be scaled and the predictions transformed back to euros:
# inverse_transform expects a 2-D array, hence the reshape
y_pred = sc_y.inverse_transform(
    regressor.predict(sc_X.transform(X_test)).reshape(-1, 1)
).ravel()
Evaluating the Models' Performance
You can evaluate each model's performance on the test set by running the code below:
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
score = r2_score(y_test, y_pred)  # R² score, reported below as a percentage
print('Accuracy:', format(score * 100, '.2f'), '%')
But a better way to evaluate your models is to use cross-validation:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator = regressor, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(scores.mean() * 100))
print("Standard Deviation: {:.2f} %".format(scores.std() * 100))
Evaluating the different models reveals that the Gradient Boosting Regression model has the best performance.
Predicting New Observations
# New observation
age = 20
overall_rating = 70
potential_rating = 80
wage = 50000
For the non-SVR models:
regressor.predict([[age, overall_rating, potential_rating, wage]])
For the SVR model:
sc_y.inverse_transform(
    regressor.predict(sc_X.transform([[age, overall_rating, potential_rating, wage]])).reshape(-1, 1)
)
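Recent scikit-learn versions warn when a model fitted on a DataFrame receives a plain list of lists, so you may prefer to wrap the new observation in a DataFrame with the same column names (a small optional tweak):
# Build the observation with the training column names
new_player = pd.DataFrame(
    [[age, overall_rating, potential_rating, wage]],
    columns=['Age', 'Overall', 'Potential', 'Wage']
)
regressor.predict(new_player)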
Conclusion
In this guide, we built several machine learning models to predict the market value of a soccer player.
Happy coding, and enjoy machine learning!