Jonathan Utsu Undelikwo

Predicting FIFA19 Player Market Value

A case study with Linear, Decision Tree, Random Forest, SVR & Gradient Boosting Regressor

Introduction

When we talk of the market value of a soccer player, we refer to an estimate in monetary terms of the worth of the player in the world of soccer. This amount is usually what the player's club is willing to accept in order to sell or transfer the player's contract to a different club.
The ability to predict a player's market value may give richer clubs a commercial advantage, as only a small subset of soccer players is highly valuable.

In this article, I detail the steps used to predict a player's market value with some common machine learning regression algorithms in Python.

Get the Dataset

The dataset used can be downloaded from https://www.kaggle.com/karangadiya/fifa19

Prerequisite

Ensure you have the following installed on your machine:

  1. Python
  2. NumPy
  3. Pandas
  4. Scikit-learn
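
A quick way to confirm that the packages above can be imported is a small check like the sketch below. The helper name `missing_packages` is my own and not part of any library; note that Scikit-learn is imported under the name `sklearn`.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]

# Import names for NumPy, Pandas and Scikit-learn
required = ["numpy", "pandas", "sklearn"]
print("Missing:", missing_packages(required) or "none")
```

Anything reported as missing can usually be installed with pip before moving on.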

Building the Models

Import the dependencies

# Data manipulation
import pandas as pd
import numpy as np

# Machine Learning Algorithms
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor

# Model Selection and Evaluation
from sklearn.model_selection import train_test_split, cross_val_score

# Feature Scaling
from sklearn.preprocessing import StandardScaler

# Performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

Data Preprocessing

Load the dataset

dataset = pd.read_csv('data.csv')

Dropping irrelevant columns
Our dataset is composed of 18207 rows and 89 columns. Not all of these columns are needed for our models, so we keep only six of them.

to_stay=["ID","Age","Overall","Potential","Value","Wage"]

dataset.drop(dataset.columns.difference(to_stay),axis="columns",inplace=True)

Set the ID column as the dataset index

dataset.set_index("ID",inplace=True)
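
As an alternative sketch, the load, drop and set-index steps can be collapsed into a single `read_csv` call with its `usecols` and `index_col` parameters. To keep the example runnable without the Kaggle file, it reads two sample rows from an in-memory string; the extra `Name` column and its placeholder values are assumptions made only for this demonstration.

```python
import io
import pandas as pd

# Two sample rows laid out like data.csv, plus one column we do not need.
# The Name values are placeholders, not real data.
sample_csv = io.StringIO(
    "ID,Name,Age,Overall,Potential,Value,Wage\n"
    "158023,Player A,31,94,94,€110.5M,€565K\n"
    "16254,Player B,39,72,72,€210K,€3K\n"
)

keep = ["ID", "Age", "Overall", "Potential", "Value", "Wage"]

# Read only the columns we need and use ID as the index in one step
dataset = pd.read_csv(sample_csv, usecols=keep, index_col="ID")
print(dataset)
```

With the real file, `sample_csv` would simply be replaced by `'data.csv'`.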

Our dataset now looks like this (two sample rows):
ID, Age, Overall, Potential, Value, Wage
158023, 31, 94, 94, €110.5M, €565K
16254, 39, 72, 72, €210K, €3K

Convert the Value column from string to numerical value
As can be observed in our dataset, the Value column holds strings such as €110.5M. Regression models need a numeric target, so our next step is to convert these strings to numbers.

# Remove the euro sign
dataset['Value'] = dataset['Value'].apply(lambda x: x.split('€')[1])

# Convert values containing 'M' to millions; treat all others as thousands ('K')
dataset['Value'] = dataset['Value'].apply(
    lambda x: float(x.split('M')[0]) * 1000000
    if 'M' in x else float(x.split('K')[0]) * 1000
)

Convert the Wage column from string to numerical value
Just as we converted the Value column from string to number, we equally need to convert the Wage column from string to number

# Remove the euro sign
dataset['Wage'] = dataset['Wage'].apply(lambda x: x.split('€')[1])

# Convert wages containing 'M' to millions; treat all others as thousands ('K')
dataset['Wage'] = dataset['Wage'].apply(
    lambda x: float(x.split('M')[0]) * 1000000
    if 'M' in x else float(x.split('K')[0]) * 1000
)
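
Since the Value and Wage conversions follow the same rule, they could share one helper. The sketch below is one possible version; the function name `parse_money` is my own. It differs from the lambdas above in one deliberate way: a string with no suffix is read as plain euros rather than thousands.

```python
def parse_money(text):
    """Convert strings such as '€110.5M', '€210K' or '€0' to a float in euros."""
    multipliers = {"M": 1_000_000, "K": 1_000}
    number = text.replace("€", "").strip()
    suffix = number[-1:]  # last character, or '' for an empty string
    if suffix in multipliers:
        return float(number[:-1]) * multipliers[suffix]
    return float(number)  # no suffix: the amount is already in euros

print(parse_money("€110.5M"))  # 110500000.0
print(parse_money("€210K"))    # 210000.0
print(parse_money("€0"))       # 0.0
```

It would then be applied to each column in the usual way, for example `dataset['Value'].apply(parse_money)`.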

Split the dataset into Training set and Test set

X = dataset[['Age', 'Overall', 'Potential', 'Wage']]
y = dataset['Value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
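
To see what the split produces, here is a small sketch on made-up data (the arrays `X_demo` and `y_demo` are placeholders, not the FIFA dataset). With `test_size=0.2`, one fifth of the rows is held out for testing, and `random_state` makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 made-up rows with 2 features each
y_demo = np.arange(10)                 # one made-up target per row

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Every row ends up in exactly one of the two sets, so the test set stays unseen during training.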

Building our Models

Multiple Linear Regression

# Training the Multiple Linear Regression model on the Training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

Decision Tree Regression

# Training the Decision Tree Regression model on the Training set
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)

Random Forest Regression model

# Training the Random Forest Regression model on the Training set
regressor = RandomForestRegressor(n_estimators = 300)
regressor.fit(X_train, y_train)

SVR model
SVR is sensitive to the scale of its inputs, so we have to carry out feature scaling on our dataset before building the model

# Feature Scaling
sc_X = StandardScaler()
sc_y = StandardScaler()
scaled_X_train = sc_X.fit_transform(X_train)
# StandardScaler expects a 2D array, so reshape y into a single column
scaled_y_train = sc_y.fit_transform(y_train.values.reshape(-1, 1))

# Training the SVR model on the Training set
regressor = SVR(kernel = 'rbf')
# SVR expects a 1D target, so flatten the scaled column
regressor.fit(scaled_X_train, scaled_y_train.ravel())
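
The manual scaling above can also be bundled into a single estimator, so the scalers cannot be forgotten at prediction time. The sketch below is an alternative to the approach in this article, not the method used here: it combines scikit-learn's `make_pipeline` and `TransformedTargetRegressor`, and it runs on made-up data (`X_demo`, `y_demo`) whose ranges only loosely imitate age, ratings and wage.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Made-up data: 4 features on very different scales, target in the millions
rng = np.random.default_rng(0)
X_demo = rng.uniform([16, 50, 50, 1_000], [40, 95, 95, 500_000], size=(200, 4))
y_demo = 50_000 * X_demo[:, 1] + 20 * X_demo[:, 3]

svr_model = TransformedTargetRegressor(
    # Features are standardized inside the pipeline before reaching SVR
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    # The target is standardized for fitting and un-scaled again on predict
    transformer=StandardScaler(),
)
svr_model.fit(X_demo, y_demo)

y_pred_demo = svr_model.predict(X_demo[:5])  # already in the original units
print(y_pred_demo)
```

With this wrapper, `predict` takes raw features and returns values in the original units, so no explicit `inverse_transform` call is needed.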

Gradient Boosting Regression model

# Training the Gradient Boosting Regression model on the Training set
regressor = GradientBoostingRegressor(n_estimators = 500)
regressor.fit(X_train, y_train)

Predicting the Test set results

To predict the test set results with the non-SVR models:

y_pred = regressor.predict(X_test)

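For the SVR model:

# predict returns a 1D array, but inverse_transform expects a 2D column
y_pred = sc_y.inverse_transform(regressor.predict(sc_X.transform(X_test)).reshape(-1, 1)).ravel()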

Evaluating the Models' Performance

You can evaluate each model's performance by running the code below

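mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
score = r2_score(y_test, y_pred)
print('MSE:', format(mse, '.2f'))
print('MAE:', format(mae, '.2f'))
# r2_score is the coefficient of determination, not a classification accuracy
print('R2 score:', format(score * 100, '.2f'), '%')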
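
To make the three metrics less of a black box, the sketch below computes them directly from their definitions on three made-up values. The function name `regression_metrics` is my own; in practice the scikit-learn functions above do the same job.

```python
def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R^2) computed directly from their definitions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n      # mean absolute error
    mse = sum(e * e for e in errors) / n       # mean squared error
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)        # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                   # coefficient of determination
    return mae, mse, r2

demo_mae, demo_mse, demo_r2 = regression_metrics([3, 5, 7], [2, 5, 9])
print(demo_mae, demo_mse, demo_r2)  # MAE 1.0, MSE about 1.67, R^2 0.375
```

MAE and MSE are in the units of the target (euros and squared euros here), while R^2 is unitless and equals 1 for a perfect fit.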

A more reliable way to evaluate your models is k-fold cross-validation, which averages the score over several train/test splits

scores = cross_val_score(estimator = regressor, X = X_train, y = y_train, cv = 10)
# For regressors, cross_val_score reports R2 by default
print("Mean R2: {:.2f} %".format(scores.mean() * 100))
print("Standard Deviation: {:.2f} %".format(scores.std() * 100))
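
For intuition on what `cv = 10` does: the training rows are cut into 10 folds, and each fold takes one turn as the validation set while the model is fit on the rest. The sketch below reproduces that idea in plain Python for contiguous, unshuffled folds, which to my understanding is how an integer `cv` behaves for regressors. The function name `kfold_indices` is my own.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample
        size = base + 1 if fold < extra else base
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop

for train_idx, test_idx in kfold_indices(10, 3):
    print(len(train_idx), test_idx)
# 6 [0, 1, 2, 3]
# 7 [4, 5, 6]
# 7 [7, 8, 9]
```

Each row is used for validation exactly once, which is why the mean and standard deviation of the fold scores say more than a single split.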

Evaluating the different models reveals that the Gradient Boosting Regression model has the best performance
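
A comparison like this can be run in one loop instead of refitting `regressor` by hand. The sketch below shows the pattern on made-up data (`X_demo` and `y_demo` stand in for `X_train` and `y_train`), with smaller `n_estimators` values chosen only to keep it fast, so its scores say nothing about the FIFA dataset. SVR is left out because it would first need the scaling step described earlier.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Made-up stand-in for X_train / y_train so the sketch runs on its own
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(150, 4))
y_demo = 3 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(0, 0.1, size=150)

models = {
    "Linear": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=30, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # Default scoring for regressors is R^2
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

Swapping in the real `X_train` and `y_train`, and the `n_estimators` values used above, gives a side-by-side view of the models in this article.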

Predicting New Observations

# New observation
age = 20
overall_rating = 70
potential_rating = 80
wage = 50000

For non-SVR models:

regressor.predict([[age, overall_rating, potential_rating, wage]])

For the SVR model:

# predict returns a 1D array, but inverse_transform expects a 2D column
sc_y.inverse_transform(regressor.predict(sc_X.transform([[age, overall_rating, potential_rating, wage]])).reshape(-1, 1))
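
Because the models were fit on a DataFrame with named columns, recent scikit-learn versions may warn about missing feature names when given a plain list. One way around this, sketched below, is to wrap the new observation in a DataFrame with the same column names. The tiny training set and `demo_model` are made up so the example runs on its own; they are not the models trained above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["Age", "Overall", "Potential", "Wage"]

# Tiny made-up training set, only so the sketch is self-contained
train = pd.DataFrame(
    [[31, 94, 94, 565000], [39, 72, 72, 3000],
     [20, 70, 80, 50000], [25, 80, 85, 90000]],
    columns=features,
)
target = [110_500_000, 210_000, 3_000_000, 20_000_000]  # made-up values

demo_model = LinearRegression().fit(train, target)

# New observation with the same column names and order as the training data
new_player = pd.DataFrame([[20, 70, 80, 50000]], columns=features)
prediction = demo_model.predict(new_player)
print(prediction)
```

The same `new_player` frame could be passed to `sc_X.transform` for the SVR model.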




Conclusion

In this guide, we built several machine learning models to predict the market value of a soccer player.

Happy coding and enjoy machine learning
