This fall I took the course Mathematical Modelling of Football from Uppsala University. It was taught by Professor David Sumpter, and I believe this is the first academic course of its kind. The main subjects covered are modelling and analysis of events (on-the-ball actions), movement and pitch control (tracking data), player evaluation, and match result simulations. There were also several guest lectures from (among others) William Spearman, lead data scientist at Liverpool FC, and Javier Fernández, head of sports analytics at FC Barcelona.
The tools used were Python (using Anaconda) with NumPy, Pandas and Matplotlib. The course was a lot of work, especially the assignments, but I really enjoyed it and learned a lot.
I found out about the course from a tweet from Professor Sumpter this summer. The course sounded interesting, so I signed up. I was qualified to officially register for it, but almost all the course material is freely available even if you don’t register for it.
Since I already know Python, have taken many mathematics courses before, and like football, I thought the course would be fairly easy for me. But I was wrong. I had used NumPy and Pandas before, but only in the course Computational Investing, so they were mostly new to me. Also, a lot of the mathematics involved fitting functions to data with various statistical regressions. You didn’t need to know the details of how they worked to use them, but it was an area of mathematics I was not familiar with. The biggest factor, though, was simply how much work the assignments required.
I started following Professor Sumpter on Twitter after I read his book Soccermatics. The book combines football and mathematics in a really interesting way, as I wrote in my review of it. In addition to giving this course, Professor Sumpter is also a data analyst for the Swedish team Hammarby in Stockholm. This is great, because he then knows how football analytics is used in practice, not just in theory. Professor Sumpter is also very involved in the group Friends of Tracking, which has a lot of overlap with the course.
Most of the course uses publicly available data sets of matches from seasons and tournaments a few years back. I used data from the 2018 World Cup, the Premier League and La Liga 2017/18 seasons, as well as games played by the Swedish club Hammarby in 2019.
Event data for a game records everything that happens with the ball. Examples of events from the data provider Wyscout are: pass, cross, free kick, air duel, shot, throw-in, corner, foul. Each event has a time stamp with sub-second resolution, beginning and end coordinates, and player, team and match ids. In the Premier League season we looked at, there were around 1700 events in a game, so on average almost one event every three seconds.
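To make this concrete, here is a minimal sketch of loading a couple of hand-made events into Pandas and unpacking the start coordinates. The field names are modelled on Wyscout’s JSON format, but the values (and ids) are invented for illustration.

```python
import pandas as pd

# Two hand-made events, roughly in Wyscout's shape: coordinates are
# percentages of the pitch, eventSec is seconds since the half started.
events = [
    {"eventName": "Pass", "eventSec": 2.4,
     "positions": [{"x": 50, "y": 50}, {"x": 60, "y": 45}],
     "playerId": 1, "teamId": 10, "matchId": 100},
    {"eventName": "Shot", "eventSec": 5.1,
     "positions": [{"x": 88, "y": 48}, {"x": 100, "y": 50}],
     "playerId": 2, "teamId": 10, "matchId": 100},
]

df = pd.DataFrame(events)
# Unpack the start coordinates into their own columns for easy plotting.
df["x"] = df["positions"].apply(lambda p: p[0]["x"])
df["y"] = df["positions"].apply(lambda p: p[0]["y"])

shots = df[df["eventName"] == "Shot"]  # one shot in this toy data
```

From here, plotting where shots were taken is just a scatter plot of the `x` and `y` columns on a drawn pitch.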
In the course there is an emphasis on analyzing and visualizing data. There are many example programs available on GitHub. In many lectures there is a code walkthrough, explaining how the code works. This was very useful when working on the assignments.
The very first lecture of the first week goes through how to load and process event data, and how to plot actions. The example used is where on the pitch shots on goal were taken in a game. Then there is a great guest lecture from Peter McKeever that shows, step by step, how to make a professional visualization of Arsenal’s goal difference (for vs. against) over 10 seasons.
The next week covers statistical models of actions. An example of this is Expected Goals (xG). The xG value gives the probability that a shot from a given position on the pitch will result in a goal. For example, if the xG value is 0.2 (20%), then 1 out of 5 shots from that position will typically result in a goal. To build an xG model, you start with event data of all shots from at least one season of a league. For each shot, the event data has the x and y coordinates of where it was taken, as well as whether it resulted in a goal. You also need a theory of which factors influence the outcome; in the lecture we assumed that the distance to goal and the angle subtended by the goal influence whether a goal is scored. Since the output of the xG function should be a probability, you can use a binomial logistic regression to find the parameters that best fit the data. Once the parameters have been found, you have a function that takes the distance and angle to goal as input and returns the probability that a shot from that position will be a goal.
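As a rough illustration of the fitting step, here is a self-contained sketch that generates synthetic shots from a made-up “true” model and fits a binomial logistic regression on distance and angle with plain gradient descent. The coefficients and data are invented; the course examples use real Wyscout shots and library fitting routines instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic shots: in this made-up "true" model, scoring gets less
# likely with distance and more likely with a wider angle to goal.
n = 2000
dist = rng.uniform(5, 35, n)       # metres from goal
angle = rng.uniform(0.1, 1.2, n)   # angle subtended by the goal, radians
true_logit = 1.0 - 0.15 * dist + 1.5 * angle
goal = rng.random(n) < 1 / (1 + np.exp(-true_logit))

# Fit a binomial logistic regression by plain gradient ascent on the
# log-likelihood; standardising the features keeps the ascent stable.
feats = np.column_stack([dist, angle])
mu, sd = feats.mean(axis=0), feats.std(axis=0)
X = np.column_stack([np.ones(n), (feats - mu) / sd])
beta = np.zeros(3)
for _ in range(5000):
    p = 1 / (1 + np.exp(-X @ beta))      # current predicted probabilities
    beta += 0.5 * X.T @ (goal - p) / n   # gradient of the log-likelihood

def xG(d, a):
    """Estimated probability that a shot from distance d (metres) and
    angle a (radians) results in a goal."""
    z = np.array([1.0, (d - mu[0]) / sd[0], (a - mu[1]) / sd[1]])
    return 1 / (1 + np.exp(-z @ beta))
```

A close shot with a wide angle, e.g. `xG(8, 1.0)`, should come out with a much higher probability than a long shot from a narrow angle such as `xG(30, 0.2)`.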
The third part of this section discusses how to simulate match results. Professor Sumpter starts by showing (from event data) that goals are equally likely to be scored at every minute of a game. In recent years, there are on average 2.7 goals per match. This can be modelled as a Poisson process with a given rate. By comparing the actual number of goals scored to that of the Poisson process, we see that they match really well. By taking all the match results of a complete season and fitting a Poisson regression to the results, you get the rates at which each team scores against every other team, both at home and away. With these rates, it is simple to simulate a whole season and see where the different teams end up in the table. A really nice write-up of this approach is this blog post.
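A toy version of the simulation step, with invented scoring rates for three teams (in practice the rates come from the Poisson regression on a real season of results):

```python
import numpy as np

rng = np.random.default_rng(42)

teams = ["A", "B", "C"]
# (home team, away team): (home goals per match, away goals per match).
# These rates are invented for illustration.
rates = {
    ("A", "B"): (1.8, 0.9), ("A", "C"): (2.1, 0.7),
    ("B", "A"): (1.1, 1.4), ("B", "C"): (1.6, 1.0),
    ("C", "A"): (0.9, 1.7), ("C", "B"): (1.2, 1.3),
}

def simulate_season():
    """Draw Poisson goal counts for every fixture and tally the points."""
    points = {t: 0 for t in teams}
    for (home, away), (lam_h, lam_a) in rates.items():
        goals_h, goals_a = rng.poisson(lam_h), rng.poisson(lam_a)
        if goals_h > goals_a:
            points[home] += 3
        elif goals_h < goals_a:
            points[away] += 3
        else:
            points[home] += 1
            points[away] += 1
    return points

# Simulate many seasons and count how often each team wins the league.
titles = {t: 0 for t in teams}
for _ in range(10_000):
    season = simulate_season()
    titles[max(season, key=season.get)] += 1
```

Team A is clearly the strongest under these invented rates, so it should win the simulated title far more often than team C.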
Event data records actions related to where the ball is. Tracking data on the other hand gives the position of the ball and all the players 25 times a second for the whole match. So there is much more data to process compared to event data. With the tracking data, you can find the speed and acceleration of a player at any given moment.
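For example, with positions sampled 25 times a second, speed and acceleration can be estimated with finite differences. The positions below are made up (a player moving at a constant 4 m/s):

```python
import numpy as np

FPS = 25  # tracking data frame rate

# Made-up positions (metres) for one player over two seconds of play:
# constant 4 m/s along x, standing still in y.
t = np.arange(0, 2, 1 / FPS)
x = 4.0 * t
y = np.zeros_like(t)

# Frame-to-frame velocity; np.gradient uses central differences and
# handles the end points for us.
vx = np.gradient(x, 1 / FPS)
vy = np.gradient(y, 1 / FPS)
speed = np.hypot(vx, vy)             # m/s at every frame
accel = np.gradient(speed, 1 / FPS)  # m/s^2
```

Real tracking positions are noisy, so in practice you would smooth them (for example with a Savitzky-Golay filter) before differentiating, or the acceleration estimates become useless.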
You can also build models of pitch control: if the ball appeared at a given point, which team would get to it first? In the simplest case, for each position on the pitch, you check which team’s player is closest to that point. If a player from the green team is closer, then the green team “controls” that part of the pitch. Doing this for all points on the pitch gives you a map of which space the green team controls, and which space the other team controls.
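This simplest model can be sketched in a few lines of NumPy, with invented player positions:

```python
import numpy as np

# Invented player positions (x, y) in metres on a 105 x 68 m pitch.
green = np.array([[30.0, 30.0], [50.0, 20.0]])
red = np.array([[70.0, 40.0], [55.0, 50.0]])

# A grid of points covering the pitch, one metre apart.
xs, ys = np.meshgrid(np.arange(0, 106), np.arange(0, 69))
pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)

def dist_to_nearest(players, pts):
    """Distance from every grid point to the team's nearest player."""
    d = np.linalg.norm(pts[:, None, :] - players[None, :, :], axis=2)
    return d.min(axis=1)

# A point is "controlled" by green if a green player is closest to it.
green_controls = dist_to_nearest(green, pts) < dist_to_nearest(red, pts)
green_share = green_controls.mean()  # fraction of the pitch green controls
```

Plotting `green_controls` reshaped back onto the grid gives the control map described above.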
A more advanced model would consider the direction and speed each player is running with, as well as using a probability model for how likely the first player arriving at the ball is to control it.
The final section of the course starts with models for assessing an individual player’s contribution. A very simple model is a plus-minus model: players on the pitch are credited when their team scores, and penalized when their team concedes. However, there are many problems with this model. In football relatively few goals are scored in a game (compared to, for example, basketball), so it is hard to draw conclusions from them. Also, a plus-minus model does not account for the skill of the opposition, or the skills of the teammates.
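The basic plus-minus bookkeeping is simple to sketch. The goal log and player names below are invented:

```python
from collections import defaultdict

# Toy goal log: (scoring team, players on the pitch for each team).
goals = [
    ("A", {"A": ["p1", "p2"], "B": ["q1", "q2"]}),
    ("B", {"A": ["p1", "p3"], "B": ["q1", "q2"]}),
    ("A", {"A": ["p1", "p2"], "B": ["q1", "q3"]}),
]

plus_minus = defaultdict(int)
for scoring_team, on_pitch in goals:
    for team, players in on_pitch.items():
        # +1 for everyone on the scoring team, -1 for everyone conceding.
        delta = 1 if team == scoring_team else -1
        for player in players:
            plus_minus[player] += delta
```

Even on this toy data you can see the weakness: a player’s score depends heavily on who happened to be on the pitch with them.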
A better approach is to use event data and look at various aspects, such as goals scored, pass completion rates, interceptions, tackles etc. These different aspects can be combined into a player radar. It shows how good a given player is in each category, compared to the other players in the league in the same position. A good example of this kind of analysis (mentioned in the lecture) is this blog post.
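The radar values are essentially percentile ranks. Here is a sketch with invented per-90 stats for a league of midfielders:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented per-90 stats for 200 league midfielders; one column each for
# completed passes, interceptions, tackles and shots.
league = rng.normal(loc=[45, 1.5, 2.0, 1.0],
                    scale=[10, 0.5, 0.7, 0.4], size=(200, 4))
player = np.array([60.0, 2.2, 2.5, 0.8])  # the player we want to profile

# Radar value per category: the player's percentile rank in the league.
percentiles = (league < player).mean(axis=0) * 100
```

To draw the actual radar you would plot `percentiles` on a polar Matplotlib axis, one spoke per category.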
Next, Markov models of how the ball is passed are presented, but these don’t give much insight into how good the team and players are. The next idea presented is possession chains – you analyze all the events leading up to a goal, in order to find actions that contribute to the goal. Finally, models where you combine pitch control with the expected value of an action (with the ultimate goal of scoring) are presented. There is also a lecture on how player fitness data is analyzed and used.
A big part of the course was to apply what you have learned by modifying and writing programs to analyze data sets. You could choose either Python or R. My feeling was that most people used Python, probably because all the examples in the lectures used Python. Professor Sumpter also used Anaconda for interactive coding and variable inspection. I had not done much interactive, exploratory coding in Python before, but I quite liked it. The variable inspection was especially nice, particularly for large and deeply nested JSON structures.
The first two assignments were 10 points each, and the final project 20. My grade was 31 out of 40 (77.5%), which I am quite happy with.
The first assignment was to plot and analyze the actions of a player in the recent men’s or women’s World Cup. It was only to get us going, and did not count towards the grade of the course.
I remember thinking that Luka Modric passed really well in the 2018 World Cup. So I plotted his passes in the final – France vs. Croatia (4–2). I also wanted to see if he was an “important” passer for Croatia, so I counted all passes in the final by each player (from both teams). It turns out that he made the second-most passes (77) for Croatia; the most passes (99) were made by Marcelo Brozovic.
The top six passers in that game were all from Croatia. The first French player was Paul Pogba with 34. Croatia had 554 passes versus France’s 292. However, passing a lot does not necessarily mean that you will win, as the result demonstrates. So there needs to be some sort of quality measure on the passes as well for them to tell more of the story.
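Counting passes per player like this is essentially a one-liner with Pandas. A toy sketch with invented rows (the real counts come from the full event data):

```python
import pandas as pd

# A toy pass log in the shape of the event data; rows are invented.
passes = pd.DataFrame({
    "player": ["Modric", "Brozovic", "Modric",
               "Pogba", "Brozovic", "Brozovic"],
    "team": ["Croatia", "Croatia", "Croatia",
             "France", "Croatia", "Croatia"],
})

# Passes per player, most frequent first.
counts = (passes.groupby(["team", "player"])
                .size()
                .sort_values(ascending=False))
```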
This assignment is about evaluating passes in order to identify players who are good at passing. The Wyscout event data has all the passes made in a complete season, including whether each one succeeded. However, some passes (for example into the final third) are harder than others (for example in your own half). So we want to find players who are good at making difficult passes.
The first step is to create a statistical model (logistic regression, for example) that predicts pass success as a function of where on the pitch the pass is taken. Next, the model should be improved by adding other parameters, such as pass length or distance to the opponent’s goal. You should then use the model to identify players who are good at making difficult passes. You hand in runnable, commented code as well as a two-page document with explanations, plots and results.
Initially, a big problem for me was slow code. The sample code used a loop over all passes, but when running on a full season it would slow to a crawl. This looked like it was caused by memory allocation when repeatedly appending to a large dataframe, and in the end I got around it by pre-allocating the dataframe. There were also numerous other difficulties, mostly related to how to use NumPy and Pandas correctly and efficiently. I got a lot of help from the Slack channel, but I still spent a lot of time on this assignment.
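For what it’s worth, the general shape of the fix looks like this: avoid growing a dataframe one row at a time, and vectorize the computation instead (the column names here are just for illustration):

```python
import numpy as np
import pandas as pd

n = 100_000
xs = np.random.rand(n)

# The pattern that crawls: growing a DataFrame row by row copies and
# reallocates the whole frame on every append.
#   df = pd.DataFrame(columns=["x", "x_squared"])
#   for x in xs:
#       df.loc[len(df)] = [x, x * x]

# Much faster: build the frame once from complete arrays...
df = pd.DataFrame({"x": xs})
# ...and compute derived columns with vectorized operations, not loops.
df["x_squared"] = df["x"] ** 2
```

Pre-allocating a NumPy array of the final size and filling it in, as I ended up doing, follows the same principle: allocate once, then write into it.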
Here we were to use match data from Signality to analyze Hammarby games from the summer of 2019. The first task was to calculate the speed and acceleration of all players, and to plot the positions and speeds of all players at four different moments. Next, we had to plot the speed, acceleration and distance from goal for players over 15-second intervals. Finally, we had to plot the distance to the closest teammate and the closest opposition player in a sequence leading up to a goal.
It was difficult to be sure about the coordinate system Signality used – I couldn’t find any official documentation. Finding the times when the goals were scored was also difficult, as was syncing the tracking data to the events.
The final project was done in groups. Different groups were to analyze a club from three different perspectives: club performance next season, player performance, and player fitness. My group (5 people) worked on analyzing Real Madrid, based on the match results of the 2017/2018 season of La Liga. We used the Dixon-Coles model for match results, which is an adjustment of the basic Poisson model. The grading was based on the written report, as well as on a 10-minute presentation given to the other groups. In addition, each person also had to write a document listing pros and cons of the method used.
Our group cooperated quite well. We used Zoom meetings and Slack to coordinate, and divided up the tasks between us.
For all the assignments, there was example code to start from. However, it usually needed to be changed (for example, predicting pass success instead of expected goals). Also, the different data providers use different coordinate systems for positions – sometimes absolute numbers, sometimes percentages of the pitch size, and with different origins. So some coordinate transformation was always needed.
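A sketch of the kind of helper I mean, assuming percentage coordinates and a 105 × 68 m pitch. Both the pitch size and the origin convention here are assumptions; every provider differs, so you have to check each data set:

```python
def pct_to_metres(x_pct, y_pct, length=105.0, width=68.0,
                  attack_left_to_right=True):
    """Convert percentage coordinates (0-100) to metres, with the origin
    at the bottom-left corner. Pitch dimensions and origin convention
    are assumptions - verify them against your provider's data."""
    x = x_pct / 100.0 * length
    y = y_pct / 100.0 * width
    if not attack_left_to_right:
        # Flip so all the data attacks in the same direction.
        x, y = length - x, width - y
    return x, y
```

Normalizing every data source into one coordinate system like this early on saves a lot of confusion later.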
There were many things to like about this course. Professor Sumpter did a very good job of explaining the material. He was also always very cheerful and pleasant to listen to. The fact that he is also working as an analyst for Hammarby, a club in the highest division in Sweden, is great. It ensures that he has practical experience from using the various methods. The guest lectures were also impressive, from some of the leading football analysts in the world!
I also enjoyed using Anaconda and the more exploratory way of programming, where you execute snippets of code and look at the results. I really liked the variable inspection, but missed some of the more advanced features and shortcuts from PyCharm (I suspect I could instead have used PyCharm in a similar way, but it was also good to work with Anaconda). I also learned more of NumPy, Pandas and Matplotlib, which was great.
It would have been good if the example code had included idiomatic examples of how to use Pandas array operations efficiently. The course was fully remote, which was great. All the main lectures were available on the course page. However, some of the guest lectures and the tutorial sessions were not. Since I am working and only took the course in my spare time, it was not always possible for me to attend the live-only streams, which means I missed out on some of the content. It would be great if everything were available to watch later.
Our assignments were graded by Professor Sumpter, but they were also peer-reviewed by three people taking the course. This is a good idea – I learned a lot by looking at other students’ code and reading their reports. However, not everybody did their peer reviews – I only received feedback from one person in total.
The system used by Uppsala University is called Instructure and is not very good. It was quite confusing, and it was hard to know where to look for information. For example, the peer reviews you were supposed to do were not shown clearly enough (which maybe explains why I got so few reviews). The chats in Instructure for the course were mostly deserted too. However, there were very active Slack channels, mostly set up for the people who were not officially taking the course. I was stuck several times with weird NumPy/Pandas problems, and when I asked in Slack I got really helpful answers within a few hours.
I signed up for the course because I am interested in both football and mathematics, so it sounded like a good fit. In addition to watching some football, I also (still) like to play myself. At work I am part of a team that plays in a “Sunday league” (Korpen in Swedish), and I also play Sunday nights with friends. So I am also interested in what a course like this can teach me that applies when I play myself. The most obvious lesson for me is how unlikely you are to score when far from the goal (from the expected goals calculations).
I was also coaching my son’s football team for many years, before they got too advanced for me. In addition to being a lot of fun, playing yourself is also very humbling. It makes you appreciate how difficult something that looks easy from the sidelines really is.
I am also always on the lookout for good resources on how to understand tactics better, and tips on how to play better. Michael Cox has written two really good books on the evolution of the game, both in England and in Europe. And Dan Blank has written several good books with tips on how to play better, for example Soccer IQ. If anybody has any other good resources to recommend, I am all ears.
I really enjoyed taking Mathematical Modelling of Football, even though it turned out to be a lot more work than I expected (and only worth five academic points). The main reason it was so much work was that the assignments were realistic tasks – you had to work with real, large data sets to get insights from the analysis.
It is also an exciting field where there is a lot of development, and Professor Sumpter seems to be well connected with many of the key contributors. After the final assignment presentations, Professor Sumpter tweeted that he was impressed by the results, and if any clubs needed data analysts, they should contact him for recommendations. So I think Professor Sumpter was also happy with the course!