*A case study with Linear, Decision Tree, Random Forest, SVR & Gradient Boosting Regressor*

# Introduction

When we talk of the market value of a soccer player, we refer to an estimate in monetary terms of the worth of the player in the world of soccer. This amount is usually what the player's club is willing to accept in order to sell or transfer the player's contract to a different club.

The ability to predict the market value of a soccer player may provide a commercial advantage to richer clubs, as a small subset of soccer players are highly valuable.

In this article, I detail the steps used to predict a player's market value using some common machine learning regression algorithms in Python.

## Get the Dataset

The dataset used can be downloaded from https://www.kaggle.com/karangadiya/fifa19

## Prerequisite

Ensure you have the following installed on your machine:

- Python
- NumPy
- Pandas
- Scikit-learn

# Building the Models

## Import the dependencies

```
# Data manipulation
import pandas as pd
import numpy as np
# Machine Learning Algorithms
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
# Model Selection and Evaluation
from sklearn.model_selection import train_test_split
# Feature Scaling
from sklearn.preprocessing import StandardScaler
# Performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
```

## Data Preprocessing

**Load the dataset**

```
dataset = pd.read_csv('data.csv')
```

**Dropping irrelevant columns**

Our dataset is composed of 18207 rows and 89 columns. Not all of the columns are needed for our models, so we keep only the ones we will use and drop the rest.

```
to_stay=["ID","Age","Overall","Potential","Value","Wage"]
dataset.drop(dataset.columns.difference(to_stay),axis="columns",inplace=True)
```

**Set the ID column as the dataset index**

```
dataset.set_index("ID",inplace=True)
```
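As an alternative to loading every column and dropping most of them, the selection and the index can be set while reading the file. The following is a minimal sketch, assuming the `usecols` and `index_col` parameters of `pd.read_csv` fit your workflow. It reads from an in-memory string instead of `data.csv` so that it runs on its own; the `Name` values are placeholders, and `csv_text`, `columns` and `sample` are illustrative names.

```python
import io

import pandas as pd

# Tiny stand-in for data.csv; the "Name" values are made-up placeholders
csv_text = (
    "ID,Name,Age,Overall,Potential,Value,Wage\n"
    "158023,Player A,31,94,94,€110.5M,€565K\n"
    "16254,Player B,39,72,72,€210K,€3K\n"
)

columns = ["ID", "Age", "Overall", "Potential", "Value", "Wage"]

# usecols keeps only the listed columns; index_col makes ID the index
sample = pd.read_csv(io.StringIO(csv_text), usecols=columns, index_col="ID")
print(sample)
```

With the real file, the same call would be `pd.read_csv('data.csv', usecols=columns, index_col="ID")`.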

Our dataset currently looks like this:

| ID | Age | Overall | Potential | Value | Wage |
|---|---|---|---|---|---|
| 158023 | 31 | 94 | 94 | €110.5M | €565K |
| 16254 | 39 | 72 | 72 | €210K | €3K |

**Convert the Value column from string to numerical value**

As can be observed in our dataset, the Value column is a string. A regression model cannot be trained on a string target, so our next step is to convert it from a string to a number.

```
# Remove the euro sign
dataset['Value'] = dataset['Value'].apply(lambda x: x.split('€')[1])
# Convert values containing 'M' to millions and all others ('K') to thousands
dataset['Value'] = dataset['Value'].apply(
    lambda x: float(x.split('M')[0]) * 1000000
    if 'M' in x else float(x.split('K')[0]) * 1000
)
```

**Convert the Wage column from string to numerical value**

Just as we converted the Value column from a string to a number, we also need to convert the Wage column.

```
# Remove the euro sign
dataset['Wage'] = dataset['Wage'].apply(lambda x: x.split('€')[1])
# Convert values containing 'M' to millions and all others ('K') to thousands
dataset['Wage'] = dataset['Wage'].apply(
    lambda x: float(x.split('M')[0]) * 1000000
    if 'M' in x else float(x.split('K')[0]) * 1000
)
```
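Since the Value and Wage conversions follow the same pattern, they could share one small helper. The sketch below is one possible way to write it; `parse_euro` is a hypothetical name, and the function assumes every entry looks like `€110.5M`, `€565K` or a plain number such as `€0`.

```python
def parse_euro(amount: str) -> float:
    """Convert strings such as '€110.5M', '€565K' or '€0' to a float in euros."""
    multipliers = {"M": 1_000_000, "K": 1_000}
    # Drop surrounding whitespace and the leading euro sign
    text = amount.strip().lstrip("€")
    suffix = text[-1:].upper()
    if suffix in multipliers:
        # The last character is a unit suffix: scale the numeric part
        return float(text[:-1]) * multipliers[suffix]
    # No suffix: the text is already a plain number
    return float(text)


for raw in ["€110.5M", "€565K", "€0"]:
    print(raw, "->", parse_euro(raw))
```

It could then be applied to both columns, for example `dataset['Value'] = dataset['Value'].apply(parse_euro)`.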

**Split the dataset into Training set and Test set**

```
X = dataset[['Age', 'Overall', 'Potential', 'Wage']]
y = dataset['Value']
# train_test_split was already imported with the other dependencies
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
```

## Building our Models

**Multiple Linear Regression**

```
# Training the Multiple Linear Regression model on the Training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)
```

**Decision Tree Regression**

```
# Training the Decision Tree Regression model on the Training set
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)
```

**Random Forest Regression model**

```
# Training the Random Forest Regression model on the Training set
regressor = RandomForestRegressor(n_estimators = 300)
regressor.fit(X_train, y_train)
```

**SVR model**

An SVR model is sensitive to the scale of its inputs, so we have to carry out feature scaling on our dataset before building the model.

```
# Feature Scaling
sc_X = StandardScaler()
sc_y = StandardScaler()
scaled_X_train = sc_X.fit_transform(X_train)
scaled_y_train = sc_y.fit_transform(y_train.values.reshape(-1, 1))
# Training the SVR model on the Training set
regressor = SVR(kernel = 'rbf')
# SVR expects a 1D target, so flatten the scaled column back to 1D
regressor.fit(scaled_X_train, scaled_y_train.ravel())
```

**Gradient Boosting Regression model**

```
# Training the Gradient Boosting Regression model on the Training set
regressor = GradientBoostingRegressor(n_estimators = 500)
regressor.fit(X_train, y_train)
```

# Predicting the Test set results

To predict the test set results with the non-SVR models:

```
y_pred = regressor.predict(X_test)
```

For the SVR model, scale the inputs first and convert the predictions back to the original scale:

```
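# predict() returns a 1D array, but inverse_transform() expects a 2D column
y_pred = sc_y.inverse_transform(
    regressor.predict(sc_X.transform(X_test)).reshape(-1, 1)
).ravel()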
```
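The manual scaling and inverse scaling around SVR can also be bundled into a single estimator, so that `fit` and `predict` work like they do for the other models. The sketch below is one possible way to do this with scikit-learn's `make_pipeline` and `TransformedTargetRegressor`. It is not the approach used in this article, the data is randomly generated rather than taken from the FIFA dataset, and `X_demo`, `y_demo` and `svr_model` are illustrative names.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for (Age, Overall, Potential, Wage) and Value
rng = np.random.default_rng(0)
X_demo = rng.uniform(
    low=[16, 50, 50, 1_000], high=[40, 95, 95, 500_000], size=(200, 4)
)
y_demo = 1_000 * X_demo[:, 1] ** 2 + 20 * X_demo[:, 3] + rng.normal(0, 1e5, size=200)

svr_model = TransformedTargetRegressor(
    # Features are standardized inside the pipeline before reaching SVR
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    # The target is standardized for fitting and un-scaled again on predict
    transformer=StandardScaler(),
)
svr_model.fit(X_demo, y_demo)

# Predictions come back in the original units of y_demo
y_pred_demo = svr_model.predict(X_demo[:5])
print(y_pred_demo)
```

A wrapper like this also keeps the scalers from being fitted on anything other than the data passed to `fit`, which matters once cross-validation is involved.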

# Evaluating the Models Performance

You can evaluate the performance of a model by running the code below:

```
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
score = r2_score(y_test, y_pred)
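print('MSE:', format(mse, '.2f'))
print('MAE:', format(mae, '.2f'))
# r2_score is the coefficient of determination, not a classification accuracy
print('R² score:', format(score * 100, '.2f'), '%')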
```
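To make the three metrics above more concrete, here is a small sketch that computes them by hand in plain Python on a toy example. The function name `regression_metrics` and the numbers are made up for illustration; in practice the scikit-learn functions do the same job.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R²) for two equally long sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    # Mean squared error: average of the squared errors
    mse = sum(e ** 2 for e in errors) / n
    # Mean absolute error: average of the absolute errors
    mae = sum(abs(e) for e in errors) / n
    # R² compares the model's squared error with that of always
    # predicting the mean of y_true
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return mse, mae, r2


# Three perfect predictions and one that is off by 2
mse, mae, r2 = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(mse, mae, round(r2, 4))  # MSE = 1.0, MAE = 0.5, R² ≈ 0.2
```

Because the errors are squared, a single large miss affects MSE (and therefore R²) much more than MAE.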

But a better way to evaluate your models is to use cross-validation, which averages the score over several train/validation splits:

```
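from sklearn.model_selection import cross_val_score
# 10-fold cross-validation; for regressors the default score is R²
scores = cross_val_score(estimator = regressor, X = X_train, y = y_train, cv = 10)
print("Mean R² score: {:.2f} %".format(scores.mean() * 100))
print("Standard Deviation: {:.2f} %".format(scores.std() * 100))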
```

Evaluating the different models reveals that the Gradient Boosting Regression model has the best performance.
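Because each model in this article is stored in the same `regressor` variable, comparing them means re-running the evaluation after each fit. One possible alternative is to keep the models in a dictionary and cross-validate them in a loop, as sketched below. The data here is randomly generated, so the printed scores are not the results of this article; `X_demo`, `y_demo`, `models` and `results` are illustrative names, and the smaller `n_estimators` values are chosen only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with a non-linear component
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(300, 4))
y_demo = 3 * X_demo[:, 0] + np.sin(6 * X_demo[:, 1]) + rng.normal(0, 0.1, size=300)

models = {
    "Linear": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validated R² for each model on the same data
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    results[name] = (scores.mean(), scores.std())
    print("{}: mean R² = {:.3f}, std = {:.3f}".format(name, scores.mean(), scores.std()))
```

An SVR could be added to the same dictionary if it is wrapped together with its scaling steps, since `cross_val_score` refits the estimator on every fold.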

# Predicting New Observations

```
# New observation
age = 20
overall_rating = 70
potential_rating = 80
wage = 50000
```

**For non SVR models**

```
regressor.predict([[age,overall_rating, potential_rating, wage]])
```

**For SVR model**

`sc_y.inverse_transform(regressor.predict(sc_X.transform([[age, overall_rating, potential_rating, wage]])).reshape(-1, 1))`
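When a model was fitted on a pandas DataFrame, recent scikit-learn versions warn if `predict` receives a plain list, because the feature names are missing. A minimal sketch of one way around this is to wrap the new observation in a one-row DataFrame with the same column names. The training rows below are toy data for illustration only (the first two come from the sample rows shown earlier, the rest are made up), and `X_demo`, `y_demo`, `demo_model` and `new_player` are illustrative names.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

feature_names = ["Age", "Overall", "Potential", "Wage"]

# Toy training data, only to make the example runnable
X_demo = pd.DataFrame(
    [
        [31, 94, 94, 565_000],
        [39, 72, 72, 3_000],
        [22, 68, 78, 10_000],
        [27, 80, 82, 90_000],
        [19, 60, 75, 2_000],
        [30, 75, 75, 25_000],
    ],
    columns=feature_names,
)
y_demo = pd.Series([110_500_000, 210_000, 1_500_000, 14_000_000, 400_000, 6_000_000])

demo_model = LinearRegression().fit(X_demo, y_demo)

# New observation as a one-row DataFrame with matching column names
new_player = pd.DataFrame([[20, 70, 80, 50_000]], columns=feature_names)
prediction = demo_model.predict(new_player)
print(prediction)
```

The column order in `new_player` should match the order used for training.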

# Conclusion

In this guide we were able to build several machine learning models to predict the market value for a soccer player.

Happy coding, and enjoy machine learning!
