Word on the street? Deepnote is now open-source! Jakub Jurových, the Founder of Deepnote, announced on LinkedIn this week.
During my research, I found out that the European Soccer Database from Kaggle has 25K+ matches, 10K+ players, and 2008–2016 seasons across 11 leagues. In this guide will use Deepnote in Windsurf to analyse the European sports data from Kaggle for three use cases:
- To find out which team improved most post-2012 from the data?
- To predict "star players" and their value, with the Top 10 players by goal impact
- To analyse win rates, goal differences, and home/away advantages over seasons for the teams
Let's get started!
Prerequisites
You will need the following to proceed
- A Kaggle account
- Windsurf account and IDE installed on your computer
- A Deepnote account
What is Deepnote?
Deepnote is a cloud-based, collaborative data science notebook platform that enhances Jupyter notebooks with real-time co-editing, integrated data connections, and simplified sharing, acting as a central workspace for teams to develop and deploy data-driven products. It supports Python, R, and SQL
To get started, we will download the European soccer database from Kaggle. Click on the Download button like so
Then click on Download dataset as zip
When the dataset is downloaded on your computer, extract it to your preferred directory. For this project, I created a directory on my computer called windsurf_deepnote. You can do the same and extract the dataset to that specific folder.
After the dataset is extracted, you get a file named database.sqlite in the directory.
Now that we have the dataset, lets setup Windsurf.
Setting up Windsurf IDE
Windsurf is an integrated development environment (IDE) for coding. It is similar to VS Code, with AI-first features.
Go to the Windsurf website, and download the application suited for your operating system (OS).
After it has downloaded, sign up with your preferred method, and open the Windsurf IDE with the directory you want to work with.
You can also view your Windsurf account activity in the profile section like so
Using Deepnote in Windsurf
Now we have the dataset and Windsurf setup, let's proceed to working with Deepnote in Windsurf.
Sign up here to create your Deepnote account
In the Windsurf IDE, go to the extensions tab at the left sidebar, search for Deepnote and click on the Install button like so
After it has installed, this tab loads. It has information on how to get started using Deepnote in Windsurf
Next, use the Ctrl+Shift+P command to create a new Deepnote project, and name it soccer.

It then creates a new project called soccer and loads a new Deepnote block, just as in the Deepnote web app UI
We are using Kaggle to get the football/soccer data that we will be analysing.
You can enter python code in blocks and run like so
If you want to convert a notebook to a deepnote notebook, open your terminal and run this command:
npx @deepnote/convert notebook.ipynb # This will convert the notebook and create notebook.deepnote
Enter yes (y) for the prompts that come up during installation
After it has been installed, Then open your .deepnote file in Windsurf IDE.
Implementing the data analysis of the three use Cases
From the soccer dataset, we will run each of this Python codes to analyse the data.
Find Teams That Improved Most Post-2012
We will find which teams improved most after 2012, by comparing average team performance (win rate, goals) before and after 2012
Enter this code:
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
# Connect to the SQLite database
conn = sqlite3.connect("database.sqlite")
# Load match data
matches = pd.read_sql_query("""
SELECT
m.id, m.season, m.home_team_api_id, m.away_team_api_id,
m.home_team_goal, m.away_team_goal,
t.team_long_name AS home_team
FROM Match m
JOIN Team t ON m.home_team_api_id = t.team_api_id
""", conn)
# Compute win rate per team per season
matches["result"] = matches.apply(
lambda x: "win" if x["home_team_goal"] > x["away_team_goal"] else
("loss" if x["home_team_goal"] < x["away_team_goal"] else "draw"), axis=1)
team_stats = matches.groupby(["home_team", "season"])["result"].value_counts().unstack().fillna(0)
team_stats["win_rate"] = team_stats["win"] / (team_stats["win"] + team_stats["loss"] + team_stats["draw"])
# Split pre- and post-2012 averages
pre_2012 = team_stats[team_stats.index.get_level_values("season") < "2012/2013"].groupby("home_team")["win_rate"].mean()
post_2012 = team_stats[team_stats.index.get_level_values("season") >= "2012/2013"].groupby("home_team")["win_rate"].mean()
improvement = (post_2012 - pre_2012).dropna().sort_values(ascending=False).head(10)
# Visualize
plt.figure(figsize=(10,5))
improvement.plot(kind="bar", color="green")
plt.title("Top 10 Most Improved Teams Post-2012")
plt.ylabel("Win Rate Improvement")
plt.tight_layout()
plt.show()
After we run the code block above, we get this visualization:
From the data, we confirm that Juventus showed the biggest leap after 2012, improving their win rate by +30%.
Borussia Mönchengladbach and AS Saint-Étienne followed closely, both gaining around 25%, proving that the post-2012 era rewarded better management, new ownership money, and tactics.
Predict Star Players by Goal Impact
Next we predict star players and rank the Top 10 by goal impact score, by combining player attributes and match stats, then using regression to find impact.
Enter this python code block and run it:
import sqlite3
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
conn = sqlite3.connect("database.sqlite")
# Load player data with names
players = pd.read_sql_query("""
SELECT p.player_api_id, p.player_name,
pa.overall_rating, pa.potential
FROM Player p
JOIN Player_Attributes pa ON p.player_api_id = pa.player_api_id
""", conn)
# Load match data
goals = pd.read_sql_query("""
SELECT home_player_1, home_player_2, home_player_3, home_team_goal
FROM Match
""", conn)
# Aggregate goals by player
player_goals = goals.melt(
value_vars=['home_player_1', 'home_player_2', 'home_player_3'],
var_name='position',
value_name='player_api_id'
)
player_goals = player_goals.groupby('player_api_id').size().reset_index(name='goal_count')
# Merge with player data and calculate impact score
merged = players.groupby(['player_api_id', 'player_name']).mean().reset_index()
merged = merged.merge(player_goals, on='player_api_id', how='inner').dropna()
# Model goal impact
X = merged[['overall_rating', 'potential']]
y = merged['goal_count']
model = LinearRegression().fit(X, y)
merged['impact_score'] = model.predict(X)
# Get top 10 players
top_players = merged.sort_values('impact_score', ascending=False).head(10)
# Plot with player names
plt.figure(figsize=(12, 6))
plt.barh(top_players['player_name'], top_players['impact_score'], color='orange')
plt.title("Top 10 Star Players by Goal Impact Score", pad=20)
plt.xlabel("Predicted Impact Score")
plt.ylabel("Player Name")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
After running the code, we get this visualisation:
From the data, we confirm that Iker Casillas tops the list as the star player with the highest predicted goal-impact score (~41), followed closely by Sergio Ramos, Manuel Neuer, and Petr Čech. John Terry also rounds out the top spot.
Win Rate & Home Advantage Analysis
Then we analyse win rates, goal differences, and home/away advantages across seasons. We do this by comparing home vs away wins by season and visualize trends.
Run this python code block:
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
# Connect to the database
conn = sqlite3.connect("database.sqlite")
# Load match data with team names
matches = pd.read_sql_query("""
SELECT
m.season,
t1.team_long_name as home_team,
t2.team_long_name as away_team,
m.home_team_goal,
m.away_team_goal
FROM Match m
JOIN Team t1 ON m.home_team_api_id = t1.team_api_id
JOIN Team t2 ON m.away_team_api_id = t2.team_api_id
WHERE m.season IS NOT NULL
""", conn)
# Calculate match results
matches['home_win'] = matches['home_team_goal'] > matches['away_team_goal']
matches['draw'] = matches['home_team_goal'] == matches['away_team_goal']
matches['away_win'] = matches['home_team_goal'] < matches['away_team_goal']
# Calculate win rates per team
home_stats = matches.groupby('home_team').agg({
'home_win': 'sum',
'home_team_goal': 'sum',
'away_team_goal': 'sum'
}).rename(columns={
'home_win': 'wins',
'home_team_goal': 'goals_for',
'away_team_goal': 'goals_against'
})
away_stats = matches.groupby('away_team').agg({
'away_win': 'sum',
'away_team_goal': 'sum',
'home_team_goal': 'sum'
}).rename(columns={
'away_win': 'wins',
'away_team_goal': 'goals_for',
'home_team_goal': 'goals_against'
})
# Combine home and away stats
team_stats = home_stats.add(away_stats, fill_value=0)
team_stats['total_matches'] = matches['home_team'].value_counts().add(
matches['away_team'].value_counts(), fill_value=0)
team_stats['win_rate'] = (team_stats['wins'] / team_stats['total_matches']) * 100
team_stats['goal_difference'] = team_stats['goals_for'] - team_stats['goals_against']
# Get top 10 teams by win rate
top_teams = team_stats.nlargest(10, 'win_rate').sort_values('win_rate')
# Plotting
plt.figure(figsize=(12, 8))
# Create bar plot
bars = plt.barh(top_teams.index, top_teams['win_rate'], color='skyblue')
plt.title('Top 10 Teams by Win Rate', fontsize=16, pad=20)
plt.xlabel('Win Rate (%)', fontsize=12)
plt.ylabel('Team', fontsize=12)
# Add value labels
for bar in bars:
width = bar.get_width()
plt.text(width + 0.5,
bar.get_y() + bar.get_height()/2,
f'{width:.1f}%',
va='center',
fontsize=10)
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
# Save the plot
plt.savefig('team_win_rates.png', bbox_inches='tight', dpi=100)
plt.close()
print("Top 10 Teams by Win Rate:")
print(top_teams[['win_rate', 'total_matches', 'goal_difference']].sort_values('win_rate', ascending=False).to_string())
print("\nPlot saved as 'team_win_rates.png'")
After running the code, we get the visualization of the dataset like so:
We confirm that FC Barcelona dominates with a 77% overall win rate across the entire period, followed by Real Madrid (75%) and SL Benfica (74.6%). The data shows the classic home advantage of roughly 55–60% win rate for home teams throughout 2008–2016.
And that's it! We have a great analysis of the soccer data!
Next steps
Show the Deepnote some support and star the GitHub repository !😎
Thank you for reading!















Top comments (0)