A case study with Linear, Decision Tree, Random Forest, SVR & Gradient Boosting Regressor
Introduction
When we talk of the market value of a soccer player, we refer to an estimate in monetary terms of the worth of the player in the world of soccer. This amount is usually what the player's club is willing to accept in order to sell or transfer the player's contract to a different club.
The ability to predict the market value of a soccer player can provide a commercial advantage to richer clubs, since only a small subset of players is highly valuable.
In this article, I detail the steps used to predict a player's market value with some common machine learning regression algorithms in Python.
Get the Dataset
The dataset used can be downloaded from https://www.kaggle.com/karangadiya/fifa19
Prerequisites
Ensure you have the following installed on your machine:
- Python
- NumPy
- Pandas
- Scikit-learn
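If any of them are missing, they can be installed from the command line with pip:
pip install numpy pandas scikit-learn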
Building the Models
Import the dependencies
# Data manipulation
import pandas as pd
import numpy as np
# Machine Learning Algorithms
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
# Model Selection and Evaluation
from sklearn.model_selection import train_test_split
# Feature Scaling
from sklearn.preprocessing import StandardScaler
# Performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
Data Preprocessing
Load the dataset
dataset = pd.read_csv('data.csv')
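A quick look at what was loaded never hurts (optional):
# Check the raw dataset's shape and first few rows
print(dataset.shape)   # (18207, 89)
print(dataset.head())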
Dropping irrelevant columns
Our dataset is composed of 18207 rows and 89 columns. Not all of these columns are needed for our models, so we keep only the relevant ones.
to_stay = ["ID", "Age", "Overall", "Potential", "Value", "Wage"]
dataset.drop(dataset.columns.difference(to_stay), axis="columns", inplace=True)
Set the ID column as the dataset index
dataset.set_index("ID",inplace=True)
Our dataset now looks like this:
| ID | Age | Overall | Potential | Value | Wage |
| --- | --- | --- | --- | --- | --- |
| 158023 | 31 | 94 | 94 | €110.5M | €565K |
| 16254 | 39 | 72 | 72 | €210K | €3K |
Convert the Value column from string to numerical value
As can be observed, the Value column is a string. Using the value like this would produce suboptimal results for our regression models, so our next step is to convert it from string to number.
# Remove the euro sign
dataset['Value'] = dataset['Value'].apply(lambda x: x.split('€')[1])
# Convert values ending in 'M' to millions and those ending in 'K' to thousands
dataset['Value'] = dataset['Value'].apply(
    lambda x: float(x.split('M')[0]) * 1000000
    if 'M' in x else float(x.split('K')[0]) * 1000
)
Convert the Wage column from string to numerical value
Just as we converted the Value column, we also need to convert the Wage column from string to number.
# Same treatment as the Value column
dataset['Wage'] = dataset['Wage'].apply(lambda x: x.split('€')[1])
dataset['Wage'] = dataset['Wage'].apply(
    lambda x: float(x.split('M')[0]) * 1000000
    if 'M' in x else float(x.split('K')[0]) * 1000
)
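Since Value and Wage need identical handling, you could also factor the conversion into a small helper and apply it to both columns. This sketch is an alternative to the two blocks above, not an extra step (the helper name is my own):
def euro_string_to_number(amount):
    """Convert strings like '€110.5M' or '€565K' to a plain number of euros."""
    amount = amount.replace('€', '')
    if amount.endswith('M'):
        return float(amount[:-1]) * 1000000
    if amount.endswith('K'):
        return float(amount[:-1]) * 1000
    return float(amount)

for column in ['Value', 'Wage']:
    dataset[column] = dataset[column].apply(euro_string_to_number)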
Split the dataset into Training set and Test set
X = dataset[['Age', 'Overall', 'Potential', 'Wage']]
y = dataset['Value']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
Building our Models
Multiple Linear Regression
# Training the Multiple Linear Regression model on the Training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)
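Once fitted, you can peek at what the linear model learned; the coefficients line up with the Age, Overall, Potential and Wage columns (optional):
# Map each feature to its learned coefficient
print(dict(zip(X.columns, regressor.coef_)))
print(regressor.intercept_)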
Decision Tree Regression
# Training the Decision Tree Regression model on the Training set
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)
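If you are curious how complex the fitted tree turned out to be, scikit-learn exposes its depth and leaf count (optional):
print('Tree depth:', regressor.get_depth())
print('Number of leaves:', regressor.get_n_leaves())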
Random Forest Regression model
# Training the Random Forest Regression model on the Training set
regressor = RandomForestRegressor(n_estimators = 300)
regressor.fit(X_train, y_train)
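Tree ensembles also report how much each feature contributed to the splits, which is a quick sanity check on the columns we kept (optional):
# Rank the features by importance
importances = pd.Series(regressor.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))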
SVR model
When building an SVR model, we have to scale both the features and the target before fitting the model.
# Feature Scaling
sc_X = StandardScaler()
sc_y = StandardScaler()
scaled_X_train = sc_X.fit_transform(X_train)
scaled_y_train = sc_y.fit_transform(y_train.values.reshape(-1, 1))
# Training the SVR model on the Training set
regressor = SVR(kernel = 'rbf')
regressor.fit(scaled_X_train, scaled_y_train.ravel())  # SVR expects a 1-D target
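If you would rather not manage the scalers by hand, scikit-learn can bundle the feature and target scaling for you. This optional sketch uses Pipeline and TransformedTargetRegressor (neither appears in the code above) and stores the model under a separate name so it does not interfere with the steps that follow:
from sklearn.pipeline import make_pipeline
from sklearn.compose import TransformedTargetRegressor

svr_model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel='rbf')),
    transformer=StandardScaler(),
)
svr_model.fit(X_train, y_train)
# svr_model.predict(X_test) returns values directly in euros, no manual inverse_transform needed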
Gradient Boosting Regression model
# Training the Gradient Boosting Regression model on the Training set
regressor = GradientBoostingRegressor(n_estimators = 500)
regressor.fit(X_train, y_train)
Predicting the Test set results
To predict the test set results for the non-SVR models:
y_pred = regressor.predict(X_test)
For the SVR model, the inputs must be scaled and the predictions transformed back to euros:
# inverse_transform expects a 2-D array, hence the reshape
y_pred = sc_y.inverse_transform(
    regressor.predict(sc_X.transform(X_test)).reshape(-1, 1)
).ravel()
Evaluating the Models' Performance
You can evaluate each model's performance on the test set by running the code below:
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
score = r2_score(y_test, y_pred)  # R² score, reported below as a percentage
print('Accuracy:', format(score * 100, '.2f'), '%')
But a better way to evaluate your models is to use cross-validation:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator = regressor, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(scores.mean() * 100))
print("Standard Deviation: {:.2f} %".format(scores.std() * 100))
Evaluating the different models reveals that the Gradient Boosting Regression model has the best performance.
Predicting New Observations
# New observation
age = 20
overall_rating = 70
potential_rating = 80
wage = 50000
For the non-SVR models:
regressor.predict([[age, overall_rating, potential_rating, wage]])
For the SVR model:
sc_y.inverse_transform(
    regressor.predict(sc_X.transform([[age, overall_rating, potential_rating, wage]])).reshape(-1, 1)
)
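Recent scikit-learn versions warn when a model fitted on a DataFrame receives a plain list of lists, so you may prefer to wrap the new observation in a DataFrame with the same column names (a small optional tweak):
# Build the observation with the training column names
new_player = pd.DataFrame(
    [[age, overall_rating, potential_rating, wage]],
    columns=['Age', 'Overall', 'Potential', 'Wage']
)
regressor.predict(new_player)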
Conclusion
In this guide, we built several machine learning models to predict the market value of a soccer player.
Happy coding, and enjoy machine learning!