In this project, I explore how to predict football player market value with a simple, clean ML pipeline built with Python, Pandas, Seaborn, and Scikit-Learn.
📌 Full code and notebook available on GitHub:
👉 https://github.com/KhushiSingla-tech/Football-player-price-pridiction
Dataset & Setup
We start by loading a local data.csv.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
dataset = pd.read_csv('data.csv')
First look:
dataset.head()
dataset.columns
dataset.describe()
dataset.shape
dataset.dtypes
dataset['nationality'].value_counts()
Tip: keep an eye on data types and missing values. position_cat should already be numeric in this workflow—if it weren’t, we’d need to encode it first.
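A quick way to check both at once is below; the commented-out line shows one simple encoding you could fall back on in the hypothetical case that position_cat arrived as text:
print(dataset.isnull().sum())  # missing values per column
# Hypothetical fallback if position_cat were a string column:
# dataset['position_cat'] = dataset['position_cat'].astype('category').cat.codes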
Quick EDA
Below are the core visuals from the notebook, in the order they were generated; each chart's title matches the code that produced it.
1. Name vs Age (top 50)
plt.figure(figsize=(10,6))
graph = sns.barplot(x='name', y='age', data=dataset[:50], palette="rocket")
graph.set(xlabel="Name", ylabel="Age", title="Name VS Age")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('talk'); sns.despine(); plt.show()
2. Members per Club
plt.figure(figsize=(10,6))
graph = sns.countplot(x='club', data=dataset, palette="vlag")
graph.set(xlabel="Club", ylabel="Member", title="Members per club")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('talk'); sns.despine(); plt.show()
3. Name vs Market Value (top 50)
plt.figure(figsize=(16,6))
graph = sns.barplot(x='name', y='market_value', data=dataset[:50], palette="colorblind")
graph.set(xlabel="Name", ylabel="Market Value", title="Name VS Market Value")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('notebook'); sns.despine(); plt.show()
4. Name vs Position Category (top 50)
plt.figure(figsize=(16,6))
graph = sns.pointplot(x='name', y='position_cat', data=dataset[:50], palette="deep")
graph.set(xlabel="Name", ylabel="Position category", title="Name VS Position Category")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('talk'); sns.despine(); plt.show()
5. Name vs Region (top 50)
plt.figure(figsize=(16,6))
graph = sns.pointplot(x='name', y='region', data=dataset[:50], palette="rocket")
graph.set(xlabel="Name", ylabel="Region", title="Name VS Region")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('poster'); sns.despine(); plt.show()
6. Players by Nationality
plt.figure(figsize=(20,6))
graph = sns.countplot(x='nationality', data=dataset, palette="muted")
graph.set(xlabel="Nationality", ylabel="Players", title="No. of players amoung different nationality")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('paper'); sns.despine(); plt.show()
7. Players by Region
graph = sns.countplot(x='region', data=dataset, palette="vlag")
graph.set(xlabel="Region", ylabel="Players", title="No. of players amoung various regions")
sns.set_context('paper'); sns.despine(); plt.show()
8. Name vs FPL Points (top 50)
plt.figure(figsize=(16,6))
graph = sns.barplot(x='name', y='fpl_points', data=dataset[:50], palette="pastel")
graph.set(xlabel="Name", ylabel="FPL Points", title="Name VS FPL points")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('poster'); sns.despine(); plt.show()
9. Name vs FPL Value (top 50)
plt.figure(figsize=(16,6))
graph = sns.pointplot(x='name', y='fpl_value', data=dataset[:50], palette="dark")
graph.set(xlabel="Name", ylabel="FPL Value", title="Name VS FPL value")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('notebook'); sns.despine(); plt.show()
10. New Foreign (Count)
graph = sns.countplot(x='new_foreign', data=dataset, palette="dark")
graph.set(xlabel="New Foreign", ylabel="Amount", title="How many are new signing from a different league")
sns.set_context('notebook'); sns.despine(); plt.show()
11. New Foreign (By Name)
plt.figure(figsize=(20,6))
graph = sns.pointplot(x='name', y='new_foreign', data=dataset[:100], palette="dark")
graph.set(xlabel="Name", ylabel="New Foreign", title="Whether a new signing from a different league")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('notebook'); sns.despine(); plt.show()
12. New Signing (Count)
graph = sns.countplot(x='new_signing', data=dataset, palette="rocket")
graph.set(xlabel="New Signing", ylabel="Amount", title="How many are new signing ")
sns.set_context('notebook'); sns.despine(); plt.show()
13. New Signing (By Name)
plt.figure(figsize=(20,6))
graph = sns.pointplot(x='name', y='new_signing', data=dataset[:100], palette="bright")
graph.set(xlabel="Name", ylabel="New Signing", title="Whether a new signing")
graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
sns.set_context('notebook'); sns.despine(); plt.show()
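The thirteen plot cells above all repeat the same pattern, so a small helper keeps the notebook tidy. This is just a sketch (plot_top is a hypothetical helper, reusing dataset and the column names from above):
def plot_top(plot_fn, x, y=None, n=None, palette="deep", figsize=(16, 6)):
    # Reusable wrapper for the bar/count/point charts above.
    data = dataset[:n] if n is not None else dataset
    plt.figure(figsize=figsize)
    kwargs = {"x": x, "data": data, "palette": palette}
    if y is not None:
        kwargs["y"] = y
    graph = plot_fn(**kwargs)
    graph.set(xlabel=x.replace('_', ' ').title(),
              ylabel=(y or "count").replace('_', ' ').title(),
              title=f"{x} vs {y or 'count'}")
    graph.set_xticklabels(graph.get_xticklabels(), rotation=90)
    sns.despine(); plt.show()
# Reproduces chart 3:
plot_top(sns.barplot, x="name", y="market_value", n=50, palette="colorblind")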
Feature Selection
For modeling, I use the following five predictors:
dataset = pd.read_csv('data.csv')  # re-load for a clean copy after the EDA above
X = dataset[['age', 'fpl_value', 'fpl_points', 'page_views', 'position_cat']]
Y = dataset['market_value']
Why these?
- age: price typically varies with age and prime years.
- fpl_value, fpl_points: performance and fantasy value often correlate with perceived market value.
- page_views: a soft proxy for popularity/visibility.
- position_cat: price dynamics differ by position.
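A quick sanity check on those choices is to look at each feature's Pearson correlation with the target (all five columns are already numeric here):
print(X.join(Y).corr()['market_value'].sort_values(ascending=False))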
Train/Test Split + Scaling
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(
X, Y, test_size=0.2, random_state=0
)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
- Why scaling? Plain OLS predictions are actually unaffected by feature scale, but standardization makes the coefficients comparable across features and becomes essential once you move to regularized models like Ridge or Lasso.
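One caveat: fitting the scaler once before cross-validation leaks fold statistics into training. A Pipeline keeps the scaling inside each fold; a minimal sketch, not part of the original notebook:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
# The scaler is re-fit on each training fold during cross-validation
pipe = make_pipeline(StandardScaler(), LinearRegression())
Passing pipe instead of regressor to cross_val_score later would give leakage-free fold scores.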
Model: Linear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
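Because the features were standardized, the fitted coefficients are on a comparable scale and give a rough sense of relative importance:
# Coefficients line up with the column order of X
print(dict(zip(X.columns, regressor.coef_)))
print(regressor.intercept_)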
Make predictions on the test set:
Y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': Y_test, 'Predicted': Y_pred})
df.head()
To see predictions across the full dataset (note this includes rows the model was trained on):
X1 = sc_X.transform(X)
Y_pred1 = regressor.predict(X1)
Output (truncated; one predicted value per player):
array([ 6.28665327e+01, 4.63344124e+01, 1.71999185e+01, 2.70542543e+01,
        1.67576333e+01, 2.31341678e+01, 3.06614290e+01, 1.26441319e+01,
        ...
        5.11293760e+00])
Evaluation: 10-Fold Cross-Validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=regressor, X=X_train, y=Y_train, cv=10)  # R² per fold
print(scores.mean())
print(scores.std())
Metric: by default, cross_val_score with LinearRegression uses the estimator's .score(), which is R².
Report:
- CV Mean R²: {{CV_MEAN_R2}}
- CV Std R²: {{CV_STD_R2}}
If you prefer error metrics, import mean_absolute_error, mean_squared_error, and r2_score from sklearn.metrics and compute MAE/RMSE/R² on the held-out test set, as sketched below.
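For example, reusing the Y_test and Y_pred computed earlier:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print("MAE: ", mean_absolute_error(Y_test, Y_pred))
print("RMSE:", np.sqrt(mean_squared_error(Y_test, Y_pred)))
print("R²:  ", r2_score(Y_test, Y_pred))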
What I Learned
- Even a small, clean feature set can drive a reasonable baseline.
- fpl_value and fpl_points usually show strong signal; if you have access to richer performance data (minutes, xG, assists/90, age-curve features), add them.
- page_views captures attention, which influences pricing; try other popularity proxies.
- Consider regularized models (Ridge/Lasso) or tree ensembles (RandomForest, XGBoost) and compare CV scores.
- Plot residuals vs. predicted to check for systematic under/over-valuation, especially on very high-value players (a sketch follows below).
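A minimal residual check, again reusing Y_test and Y_pred from the test-set predictions above:
residuals = Y_test - Y_pred
plt.figure(figsize=(8, 5))
plt.scatter(Y_pred, residuals, alpha=0.6)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted market value')
plt.ylabel('Residual (actual - predicted)')
plt.title('Residuals vs. Predicted')
plt.show()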
Follow-Up Questions
I’d love to hear your thoughts!
- Which additional features would you include to improve prediction accuracy?
- Do you think football player value is more influenced by performance or popularity metrics?
- Would you like to see a version of this project using RandomForest/XGBoost?
- Should I deploy this model as an interactive web app where you can enter player stats and get predictions?
Feel free to comment below; I'm happy to discuss and expand this project further!
Connect With Me
Let’s learn and build cool data projects together!
- 💼 LinkedIn: https://www.linkedin.com/in/singla-khushi/
- 🔗 GitHub: https://github.com/KhushiSingla-tech
- 📩 Comments below are always welcome!