In this project, I explore how to predict football player market value with a simple, clean ML pipeline built with Python, Pandas, Seaborn, and Scikit-Learn.
📌 Full code and notebook available on GitHub:
👉 https://github.com/KhushiSingla-tech/Football-player-price-pridiction
Dataset & Setup
We start by loading a local data.csv.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
dataset = pd.read_csv('data.csv')
First look:
dataset.head()
dataset.columns
dataset.describe()
dataset.shape
dataset.dtypes
dataset['nationality'].value_counts()
Tip: keep an eye on data types and missing values. position_cat should already be numeric in this workflow—if it weren’t, we’d need to encode it first.
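A quick way to check both at once is below; the commented-out line shows one simple encoding you could fall back on in the hypothetical case that position_cat arrived as text:
print(dataset.isnull().sum())  # missing values per column
# Hypothetical fallback if position_cat were a string column:
# dataset['position_cat'] = dataset['position_cat'].astype('category').cat.codes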
Quick EDA
Below are the core visuals from the notebook, in the order they were generated; each chart's title matches the code that produced it.
1. Name vs Age (top 50)
plt.figure(figsize=(10,6))
graph = sns.barplot(x='name', y='age', data=dataset[:50], palette="rocket")
graph.set(xlabel="Name", ylabel="Age", title="Name VS Age")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('talk'); sns.despine(); plt.show()
2. Members per Club
plt.figure(figsize=(10,6))
graph = sns.countplot(x='club', data=dataset, palette="vlag")
graph.set(xlabel="Club", ylabel="Member", title="Members per club")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('talk'); sns.despine(); plt.show()
3. Name vs Market Value (top 50)
plt.figure(figsize=(16,6))
graph = sns.barplot(x='name', y='market_value', data=dataset[:50], palette="colorblind")
graph.set(xlabel="Name", ylabel="Market Value", title="Name VS Market Value")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('notebook'); sns.despine(); plt.show()
4. Name vs Position Category (top 50)
plt.figure(figsize=(16,6))
graph = sns.pointplot(x='name', y='position_cat', data=dataset[:50], palette="deep")
graph.set(xlabel="Name", ylabel="Position category", title="Name VS Position Category")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('talk'); sns.despine(); plt.show()
5. Name vs Region (top 50)
plt.figure(figsize=(16,6))
graph = sns.pointplot(x='name', y='region', data=dataset[:50], palette="rocket")
graph.set(xlabel="Name", ylabel="Region", title="Name VS Region")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('poster'); sns.despine(); plt.show()
6. Players by Nationality
plt.figure(figsize=(20,6))
graph = sns.countplot(x='nationality', data=dataset, palette="muted")
graph.set(xlabel="Nationality", ylabel="Players", title="No. of players amoung different nationality")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('paper'); sns.despine(); plt.show()
7. Players by Region
graph = sns.countplot(x='region', data=dataset, palette="vlag")
graph.set(xlabel="Region", ylabel="Players", title="No. of players amoung various regions")
sns.set_context('paper'); sns.despine(); plt.show()
8. Name vs FPL Points (top 50)
plt.figure(figsize=(16,6))
graph = sns.barplot(x='name', y='fpl_points', data=dataset[:50], palette="pastel")
graph.set(xlabel="Name", ylabel="FPL Points", title="Name VS FPL points")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('poster'); sns.despine(); plt.show()
9. Name vs FPL Value (top 50)
plt.figure(figsize=(16,6))
graph = sns.pointplot(x='name', y='fpl_value', data=dataset[:50], palette="dark")
graph.set(xlabel="Name", ylabel="FPL Value", title="Name VS FPL value")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('notebook'); sns.despine(); plt.show()
10. New Foreign (Count)
graph = sns.countplot(x='new_foreign', data=dataset, palette="dark")
graph.set(xlabel="New Foreign", ylabel="Amount", title="How many are new signing from a different league")
sns.set_context('notebook'); sns.despine(); plt.show()
11. New Foreign (By Name)
plt.figure(figsize=(20,6))
graph = sns.pointplot(x='name', y='new_foreign', data=dataset[:100], palette="dark")
graph.set(xlabel="Name", ylabel="New Foreign", title="Whether a new signing from a different league")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('notebook'); sns.despine(); plt.show()
12. New Signing (Count)
graph = sns.countplot(x='new_signing', data=dataset, palette="rocket")
graph.set(xlabel="New Signing", ylabel="Amount", title="How many are new signing ")
sns.set_context('notebook'); sns.despine(); plt.show()
13. New Signing (By Name)
plt.figure(figsize=(20,6))
graph = sns.pointplot(x='name', y='new_signing', data=dataset[:100], palette="bright")
graph.set(xlabel="Name", ylabel="New Signing", title="Whether a new signing")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('notebook'); sns.despine(); plt.show()
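The thirteen plot cells above all repeat the same pattern, so a small helper keeps the notebook tidy. This is just a sketch (plot_top is a hypothetical helper, reusing dataset and the column names from above):
def plot_top(plot_fn, x, y=None, n=None, palette="deep", figsize=(16, 6)):
    # Reusable wrapper for the bar/count/point charts above.
    data = dataset[:n] if n is not None else dataset
    plt.figure(figsize=figsize)
    kwargs = {"x": x, "data": data, "palette": palette}
    if y is not None:
        kwargs["y"] = y
    graph = plot_fn(**kwargs)
    graph.set(xlabel=x.replace('_', ' ').title(),
              ylabel=(y or "count").replace('_', ' ').title(),
              title=f"{x} vs {y or 'count'}")
    graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
    sns.despine(); plt.show()
# Reproduces chart 3:
plot_top(sns.barplot, x="name", y="market_value", n=50, palette="colorblind")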
Feature Selection
For modeling, I use the following five predictors:
dataset = pd.read_csv('data.csv')  # re-load for a clean copy after the EDA above
X = dataset[['age', 'fpl_value', 'fpl_points', 'page_views', 'position_cat']]
Y = dataset['market_value']
Why these?
- age: price typically varies with age and prime years.
- fpl_value, fpl_points: performance and fantasy value often correlate with perceived market value.
- page_views: a soft proxy for popularity/visibility.
- position_cat: price dynamics differ by position.
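A quick sanity check on those choices is to look at each feature's Pearson correlation with the target (all five columns are already numeric here):
print(X.join(Y).corr()['market_value'].sort_values(ascending=False))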
Train/Test Split + Scaling
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(
X, Y, test_size=0.2, random_state=0
)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
- Why scaling? Plain OLS predictions are actually unaffected by feature scale, but standardization makes the coefficients comparable across features and becomes essential once you move to regularized models like Ridge or Lasso.
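One caveat: fitting the scaler once before cross-validation leaks fold statistics into training. A Pipeline keeps the scaling inside each fold; a minimal sketch, not part of the original notebook:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
# The scaler is re-fit on each training fold during cross-validation
pipe = make_pipeline(StandardScaler(), LinearRegression())
Passing pipe instead of regressor to cross_val_score later would give leakage-free fold scores.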
Model: Linear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
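Because the features were standardized, the fitted coefficients are on a comparable scale and give a rough sense of relative importance:
# Coefficients line up with the column order of X
print(dict(zip(X.columns, regressor.coef_)))
print(regressor.intercept_)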
Make predictions on the test set:
Y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': Y_test, 'Predicted': Y_pred})
df.head()
To see predictions across the full dataset (note this includes rows the model was trained on):
X1 = sc_X.transform(X)
Y_pred1 = regressor.predict(X1)
Output (truncated; one predicted value per player):
array([ 6.28665327e+01, 4.63344124e+01, 1.71999185e+01, 2.70542543e+01,
        1.67576333e+01, 2.31341678e+01, 3.06614290e+01, 1.26441319e+01,
        ...
        5.11293760e+00])
Evaluation: 10-Fold Cross-Validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=regressor, X=X_train, y=Y_train, cv=10)  # R² per fold
print(scores.mean())
print(scores.std())
Metric: by default, cross_val_score with LinearRegression uses the estimator's .score(), which is R².
Report:
- CV Mean R²: {{CV_MEAN_R2}}
- CV Std R²: {{CV_STD_R2}}
If you prefer error metrics, import mean_absolute_error, mean_squared_error, and r2_score from sklearn.metrics and compute MAE/RMSE/R² on the held-out test set, as sketched below.
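For example, reusing the Y_test and Y_pred computed earlier:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print("MAE: ", mean_absolute_error(Y_test, Y_pred))
print("RMSE:", np.sqrt(mean_squared_error(Y_test, Y_pred)))
print("R²:  ", r2_score(Y_test, Y_pred))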
What I Learned
- Even a small, clean feature set can drive a reasonable baseline.
- fpl_value and fpl_points usually show strong signal; if you have access to richer performance data (minutes, xG, assists/90, age-curve features), add them.
- page_views captures attention, which influences pricing; try other popularity proxies.
- Consider regularized models (Ridge/Lasso) or tree ensembles (RandomForest, XGBoost) and compare CV scores.
- Plot residuals vs. predicted to check for systematic under/over-valuation, especially on very high-value players (a sketch follows below).
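A minimal residual check, again reusing Y_test and Y_pred from the test-set predictions above:
residuals = Y_test - Y_pred
plt.figure(figsize=(8, 5))
plt.scatter(Y_pred, residuals, alpha=0.6)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted market value')
plt.ylabel('Residual (actual - predicted)')
plt.title('Residuals vs. Predicted')
plt.show()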
Follow-Up Questions
I’d love to hear your thoughts!
- Which additional features would you include to improve prediction accuracy?
- Do you think football player value is more influenced by performance or popularity metrics?
- Would you like to see a version of this project using RandomForest/XGBoost?
- Should I deploy this model as an interactive web app where you can enter player stats and get predictions?
Feel free to comment below; I'm happy to discuss and expand this project further!
Connect With Me
Let’s learn and build cool data projects together!
- 💼 LinkedIn: https://www.linkedin.com/in/singla-khushi/
- 🔗 GitHub: https://github.com/KhushiSingla-tech
- 📩 Comments below are always welcome!