Expected goals (xG) from scratch, and why the scoreboard lies

#datascience #machinelearning #logisticregression #python

On 22 November 2022, Argentina, one of the favourites to win the whole World Cup, lost their opening game to Saudi Arabia, a team almost nobody gave a chance. One to two. It went down as one of the biggest upsets in the tournament's history, and the story everyone told was simple: Saudi Arabia were the better team that day.

They weren't. Not even close. And you can show it with a single number that quietly runs modern football: expected goals, or xG. The final score is the most important number in the sport, and also one of the least honest: a whole game can go by with one team battering the other while the scoreboard says the opposite. So analysts built a number that doesn't. Let's build it too, from scratch, and use it to put this upset on trial.

The one idea: a shot is a probability

Forget goal or miss for a second; that's the noisy part. Ask the better question: if you took this exact shot a hundred times, how many would go in? That fraction, a number between 0 and 1, is the shot's xG. A tap-in might be 0.9. A hopeful effort from the halfway line, 0.01.

Two things mostly decide it: how far you are from the goal, and how much of the goal you can see, which is the angle between the two posts from where you're standing. Close and central, the goal is a barn door. Out wide and far, it's a letterbox. That's our whole input: two numbers per shot.

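import math

def features(x, y):                       # pitch coords, goal line at x=120
    dist = math.hypot(120 - x, 40 - y)    # distance to the centre of the goal
    a = math.hypot(120 - x, 36 - y)       # distance to the post at y=36
    b = math.hypot(120 - x, 44 - y)       # distance to the post at y=44
    if a == 0 or b == 0:                  # standing on a post: no goal to see
        return dist, 0.0
    # law of cosines; 64 = 8**2, the squared gap between the posts
    cos_angle = (a*a + b*b - 64) / (2*a*b)
    angle = math.acos(max(-1.0, min(1.0, cos_angle)))   # clamp float rounding
    return dist, angle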
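As a cross-check on the geometry, here is a minimal sketch of the same two features computed with atan2 instead of the law of cosines. It assumes the same layout as above (goal line at x=120, posts at y=36 and y=44), and the name features_atan2 is made up for this example. The two forms give the same angle, and this one never divides by a distance, so a shot from right on a post needs no special handling.

```python
import math

def features_atan2(x, y):
    """Distance to goal centre and opening angle, goal line at x=120."""
    dx = 120 - x                 # distance to the goal line
    dy = y - 40                  # sideways offset from the centre of the goal
    dist = math.hypot(dx, dy)
    # Angle between the vectors pointing at the two posts (y=36 and y=44):
    # their cross product is 8*dx, their dot product is dx^2 + dy^2 - 16.
    angle = math.atan2(8 * dx, dx * dx + dy * dy - 16)
    return dist, angle

# Penalty spot: 12 units out, dead centre
dist, angle = features_atan2(108, 40)
print(round(dist, 2), round(math.degrees(angle), 1))   # 12.0 36.9
```

So from the penalty spot the goal fills about 37 degrees of your view. Move the same distance out to the wing and that angle collapses, which is exactly the barn door turning into a letterbox.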

Squash it into a probability

Now we turn "far and narrow" into an actual chance. A sigmoid takes any number and squashes it into the range 0 to 1: big positive numbers go near 1, big negative numbers go near 0. Feed it a tiny linear model, a weight per feature plus a bias, and you have a logistic regression: a straight-line score, squashed into a probability.

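def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def xg(shot, w, b):
    # shot has .x and .y; w holds one weight per feature (.dist, .angle); b is the bias
    d, ang = features(shot.x, shot.y)
    z = b + w.dist * d + w.angle * ang    # a linear score, any real number
    return sigmoid(z)                     # squashed into a probability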

We learn the weights w and bias b the usual way: run every shot through the model, compare the predicted probability to whether it actually went in, and nudge the weights to shrink the error. Do that over a pile of real shots and the model teaches itself what a good chance looks like.
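To make "nudge the weights" concrete, here is a minimal sketch of that loop as plain batch gradient descent on the log loss, which is one common way to fit a logistic regression. Everything specific in it is an assumption for illustration: the function names fit and log_loss, the learning rate, the number of passes, and above all the data, which is synthetic (random shot positions with outcomes drawn from made-up "true" weights), not real World Cup shots.

```python
import math
import random

def features(x, y):
    dist = math.hypot(120 - x, 40 - y)
    a = math.hypot(120 - x, 36 - y)
    b = math.hypot(120 - x, 44 - y)
    cos_angle = (a * a + b * b - 64) / (2 * a * b)
    return dist, math.acos(max(-1.0, min(1.0, cos_angle)))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def log_loss(rows, w_dist, w_angle, b):
    """Average penalty for being confidently wrong; lower is better."""
    total = 0.0
    for d, ang, goal in rows:
        p = sigmoid(b + w_dist * d + w_angle * ang)
        p = min(max(p, 1e-12), 1 - 1e-12)          # keep log() finite
        total -= math.log(p) if goal else math.log(1 - p)
    return total / len(rows)

def fit(rows, lr=0.002, epochs=2000):
    """Batch gradient descent on the log loss, starting from all zeros."""
    w_dist = w_angle = b = 0.0
    n = len(rows)
    for _ in range(epochs):
        g_dist = g_angle = g_b = 0.0
        for d, ang, goal in rows:
            err = sigmoid(b + w_dist * d + w_angle * ang) - goal   # predicted minus actual
            g_dist += err * d
            g_angle += err * ang
            g_b += err
        w_dist -= lr * g_dist / n                  # step against the gradient
        w_angle -= lr * g_angle / n
        b -= lr * g_b / n
    return w_dist, w_angle, b

# Synthetic stand-in for real shot data: random positions, with outcomes
# drawn from made-up "true" weights so there is a pattern to learn.
random.seed(0)
rows = []
for _ in range(400):
    d, ang = features(random.uniform(85, 119), random.uniform(15, 65))
    p_true = sigmoid(-1.0 - 0.1 * d + 1.5 * ang)
    rows.append((d, ang, 1 if random.random() < p_true else 0))

before = log_loss(rows, 0.0, 0.0, 0.0)
w_dist, w_angle, b = fit(rows)
after = log_loss(rows, w_dist, w_angle, b)
print(f"log loss before {before:.3f}, after {after:.3f}")
```

The learning rate is deliberately small because distance runs into the tens while the angle stays below a few radians; with unscaled features, a bigger step can overshoot. Scaling the features first, or handing the job to a library, would converge much faster.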

Score the match

One shot is a party trick. The magic is doing it to all of them. Train on every shot at the 2022 World Cup, then score every attempt in Argentina vs Saudi Arabia and add them up. Argentina come out at about 2.52 expected goals. Saudi Arabia, about 0.21.

So the scoreboard says Argentina lost 1-2. The math says that on an average day, they win this two or three to nothing. The team that went on to lift the trophy lost its opener to a side it completely dominated.
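The bookkeeping behind those two totals is short. A minimal sketch, assuming each shot is stored as a (team, x, y) tuple and that the model has already been trained; the shots and weights below are made up to show the mechanics and are not the real match data or the fitted model.

```python
import math
from collections import defaultdict

def features(x, y):
    dist = math.hypot(120 - x, 40 - y)
    a = math.hypot(120 - x, 36 - y)
    b = math.hypot(120 - x, 44 - y)
    cos_angle = (a * a + b * b - 64) / (2 * a * b)
    return dist, math.acos(max(-1.0, min(1.0, cos_angle)))

def xg(x, y, w_dist, w_angle, b):
    d, ang = features(x, y)
    return 1 / (1 + math.exp(-(b + w_dist * d + w_angle * ang)))

def match_xg(shots, w_dist, w_angle, b):
    """Sum per-shot xG for each team. Each shot is a (team, x, y) tuple."""
    totals = defaultdict(float)
    for team, x, y in shots:
        totals[team] += xg(x, y, w_dist, w_angle, b)
    return dict(totals)

# Made-up shots and made-up weights, purely to show the mechanics
shots = [("Home", 110, 40), ("Home", 102, 33), ("Home", 114, 44), ("Away", 95, 52)]
totals = match_xg(shots, w_dist=-0.1, w_angle=1.5, b=-1.0)
for team, total in totals.items():
    print(f"{team}: {total:.2f} xG")
```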

But xG lies too

If I stopped there I'd be doing the exact thing I just complained about. xG lies too, and you should know exactly how.

xG does not know that Saudi Arabia's keeper had the game of his life. It doesn't know about the deflection, or the one-in-a-hundred finish. It tells you what should happen on an average day, not what did happen on this one. Football is a low-scoring, high-variance sport, so in any single match the gap between xG and the scoreboard can be enormous. A team can rack up three xG and lose. That's not a bug. That's the whole reason the sport is fun. xG measures the process; the scoreboard measures the process plus the luck on top.
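You can put a number on that variance. If you treat every shot as an independent coin flip weighted by its xG (a simplifying assumption), the chance of each scoreline falls out exactly, with no simulation needed. A minimal sketch; the function names are invented for this example, and the per-shot values are made up, chosen only so the two sides total roughly 2.5 and 0.2 xG. They are not the real shots from any match.

```python
def goal_distribution(xgs):
    """P(exactly k goals) for k = 0..len(xgs), treating shots as independent."""
    dist = [1.0]                       # before any shot: zero goals, for certain
    for p in xgs:
        nxt = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            nxt[k] += prob * (1 - p)   # this shot misses
            nxt[k + 1] += prob * p     # this shot goes in
        dist = nxt
    return dist

def outcome_probs(xgs_a, xgs_b):
    """Return (P(A wins), P(draw), P(B wins))."""
    da, db = goal_distribution(xgs_a), goal_distribution(xgs_b)
    win = draw = loss = 0.0
    for i, pa in enumerate(da):
        for j, pb in enumerate(db):
            if i > j:
                win += pa * pb
            elif i == j:
                draw += pa * pb
            else:
                loss += pa * pb
    return win, draw, loss

# Hypothetical per-shot xG values, not real match data
dominant = [0.76, 0.35, 0.30, 0.25, 0.20, 0.15, 0.12, 0.10, 0.08, 0.07, 0.06, 0.05]
outsider = [0.09, 0.07, 0.05]
win, draw, loss = outcome_probs(dominant, outsider)
print(f"dominant side wins {win:.0%}, draws {draw:.0%}, loses {loss:.0%}")
```

The losing probability is small but never zero, and that sliver is where upsets live.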

And yet it works

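So why trust it? Because over one game it's noisy, but over many it's scarily accurate. Run the model over all 1,430 shots at the tournament: 152 goals were actually scored, and the model expected 152.8. Treat that total with some care, though: these are the same shots the model learned from, and a logistic regression with a bias term is pulled towards matching the overall goal count on its own training data, so the sterner test is matches it has never seen. Shot for shot, it lands close to the professional models clubs pay real money for. Don't bet your life on one match; over a season, xG can tell you which teams are about to fall off a cliff and which are quietly elite, long before the table does.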
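Comparing totals is the simplest version of a calibration check. A slightly sharper one groups shots by predicted xG and asks whether each group scored about as often as promised. A minimal sketch, with an invented function name and a tiny hand-made example rather than the tournament data:

```python
def calibration_table(preds, outcomes, n_bins=5):
    """Group shots by predicted xG; compare expected and actual goals per group.

    Returns rows of (bin_low, bin_high, shots, expected_goals, actual_goals),
    skipping empty bins.
    """
    bins = [[0, 0.0, 0] for _ in range(n_bins)]    # shots, expected, actual
    for p, goal in zip(preds, outcomes):
        i = min(int(p * n_bins), n_bins - 1)       # p == 1.0 lands in the top bin
        bins[i][0] += 1
        bins[i][1] += p
        bins[i][2] += goal
    return [(i / n_bins, (i + 1) / n_bins, n, exp, act)
            for i, (n, exp, act) in enumerate(bins) if n]

# Hand-made example; real use would pass the model's xG for every shot
preds = [0.05, 0.10, 0.15, 0.30, 0.35, 0.70, 0.85]
outcomes = [0, 0, 1, 0, 1, 1, 1]
print(f"expected {sum(preds):.2f} goals, actual {sum(outcomes)}")
for low, high, n, exp, act in calibration_table(preds, outcomes):
    print(f"xG {low:.1f}-{high:.1f}: {n} shots, expected {exp:.2f}, actual {act}")
```

A well-calibrated model has expected and actual goals close together in every row, not just in the grand total.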

How the pros go further

Our model used two numbers. Real xG keeps adding context, and it's the same idea each time, just more features feeding the same kind of model:

Body part: a header is worth less than a foot from the same spot.
Defenders in the cone: bodies between the ball and the goal pull the number down.
Keeper position: an open net sends it way up.
Big chance: a flag for the sitters.
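In code, "more features feeding the same kind of model" just means a longer weighted sum inside the same sigmoid. A minimal sketch, assuming features arrive as a dict of named values; the feature names and every weight here are made up for illustration, where real ones would be learned from data.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def xg_extended(feats, weights, bias):
    """Same logistic model, with a dict of named features instead of two numbers."""
    z = bias + sum(weights[name] * value for name, value in feats.items())
    return sigmoid(z)

# Made-up weights for illustration only
weights = {"dist": -0.1, "angle": 1.5, "is_header": -0.8,
           "defenders_in_cone": -0.3, "open_net": 2.0, "big_chance": 1.0}
bias = -1.0

base = {"dist": 10.0, "angle": 0.76, "is_header": 0,
        "defenders_in_cone": 1, "open_net": 0, "big_chance": 0}
header = dict(base, is_header=1)       # same spot, but with the head
foot_xg = xg_extended(base, weights, bias)
header_xg = xg_extended(header, weights, bias)
print(f"foot {foot_xg:.2f}, header {header_xg:.2f}")
```

A negative weight drags the chance down and a positive one lifts it, so the sign of each learned weight tells you which way that piece of context pushes.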

Two more honest details. First, model choice: we used logistic regression because you can read every line. The pros often reach for gradient boosting, hundreds of little decision trees stacked together, each fixing the last one's mistakes. It buys a sliver more accuracy (think 0.86 vs 0.84) at the cost of being a black box, and our readable version still lands within a whisker.

Second, special cases: some shots break the rules. A penalty is the same shot every time (twelve yards, no defenders), so it skips the model and just gets a flat number, about 0.76. Set pieces get their own treatment too, so the model can learn they're a different animal.
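The penalty rule is the easiest of these to sketch: check for the special case first, and only fall through to the model for everything else. The "is_penalty" key and the function names are hypothetical, and toy_model is a stand-in for whatever trained model you have.

```python
PENALTY_XG = 0.76   # the flat value quoted above

def xg_with_special_cases(shot, model):
    """Give penalties a flat value; send every other shot through the model.

    `shot` is a dict with a hypothetical "is_penalty" key; `model` is any
    function that maps a shot to a probability.
    """
    if shot.get("is_penalty"):
        return PENALTY_XG
    return model(shot)

def toy_model(shot):
    return 0.1          # stand-in for a trained model

print(xg_with_special_cases({"is_penalty": True}, toy_model))    # 0.76
print(xg_with_special_cases({"is_penalty": False}, toy_model))   # 0.1
```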

The takeaway

A shot is a probability. Two numbers, a sigmoid, and a pile of real data, and you can measure something the scoreboard hides. And the engine under the hood, logistic regression, is the same workhorse behind spam filters, medical risk scores, and half the models you'll ever build. Learn it here and you've learned it everywhere.

If you want the full build, the real dataset, the feature engineering, and the validation, it's the data science track on IWTLP, where you build the things behind the tools you use, from scratch.