mirandaauhl for Tiger Data (Creators of TimescaleDB)

Posted on Jul 30, 2021 • Originally published at blog.timescale.com

Hacking NFL data with PostgreSQL, TimescaleDB, and SQL

#postgres #database #analytics #timeseries

The NFL dataset
Accessing the data
Let's start exploring!
The power of SQL
Faster insights with PostgreSQL and TimescaleDB
Faster queries with TimescaleDB continuous aggregates
Advanced SQL data analysis with TimescaleDB hyperfunctions
Where can the data take you?

Learn how to use time-series data provided by the NFL to uncover valuable insights into many player performance metrics – and ways to apply the same methods to improve your fantasy league team, your knowledge of the game, or your viewing experience - all with PostgreSQL, standard SQL, and freely available extensions.

Time-series data is everywhere, including, much to our surprise, the world of professional sports. At Timescale, we're always looking for fun ways to showcase the expanding reach of time-series data. Stock, cryptocurrency, IoT, and infrastructure metrics data are relatively common and widely understood time-series data scenarios. Head to Twitter on any given day, search for #timeseries or #TimescaleDB, and you're sure to find questions about high-frequency trading or massive scale observability data with tools like Prometheus.

You can imagine our excitement, then, when we happened upon the NFL Big Data Bowl, an annual competition that encourages the data science community to use historical player position and play data to create machine learning models.

Did the NFL really give access to 18+ million rows of detailed play data from every regular season NFL game?

For background, the National Football League (NFL) is the US professional sports league for American football, and the NFL season is followed by tens of millions of people, culminating in the annual Super Bowl (which attracts 100M+ global viewers, whether for the game or for the commercials).

Each NFL game takes place as a series of “plays,” in which the two teams try to score and prevent the other team from scoring. There are approximately 200 plays per game, with up to 15 games a week during the regular season. A healthy amount of data, but nothing unmanageable.

So, at first glance, football game metrics might not immediately jump out as anything special.

But then the NFL did something pretty ambitious and amazing.

All NFL players are equipped with RFID chips that track players’ position, speed, and various other metrics, which teams use to identify trends, mitigate risks, and continuously optimize. The NFL started tracking and storing data for every player on the field, for every play, for every game.

As a result, we now have access to a very detailed analysis of exactly how a play unfolded, how quickly various players accelerated during each play, and the play’s outcome. A traditional view of play-by-play metrics is “down and distance” and the result of the play (yards gained, whether or not there was a score, and so on). With the NFL’s dataset, we're able to mine approximately 100 data points at 100-millisecond intervals throughout the play to see speed, distance, involved players, and much more.

This isn’t ordinary data. This is time-series data. Time-series data is a sequence of data points collected over time intervals, giving us the ability to track changes over time. In the case of the NFL’s dataset, we have time-series data that represents how a play changes, including the locations of the players on the field, the location of the ball, the relative acceleration of players in the field of play, and so much more.

Time-series data comes at you fast, sometimes generating millions of data points per second (read more about time-series data). Because of the sheer volume and rate of information, time-series data can already be complex to query and analyze, which is why we built TimescaleDB, a multi-node, petabyte-scale, completely free relational database for time-series.

We couldn't pass up the opportunity to look at the NFL dataset with TimescaleDB, exploring ways we could peer deeper into player performance in hopes of providing insights about overall player performance in the coming season.

Read on for more information about the NFL’s dataset and how you can start using it, plus some sample queries to jumpstart your analysis. They may help you get more enjoyment out of the game.

If you’d like to get started with NFL data, you can spin up a fully managed TimescaleDB service: create an account to try it for free for 30 days. The instructions later in this post will take you through how to ingest the data and start using it for analysis.

If you’re new to time-series data or just have some questions you’d like to ask about the dataset, join our public Slack community, where you’ll find Timescale team members and thousands of time-series enthusiasts, and we’ll be happy to help you.

The NFL dataset

Over the last few years, the NFL and Kaggle have collaborated on the NFL Big Data Bowl. The goal is to use historical data to answer a predetermined genre of questions, typically producing a machine learning model that can help predict the outcome of certain plays during regular season games.

Although the 2020/2021 contest is over, the sample dataset they provided from a prior season is still available for download and analysis. The 2020/2021 competition focused on pass play defense efficiency; therefore, only the tracking data for offensive and defensive "playmakers" is available in the dataset. No offensive or defensive linemen data is included. (You can read more about last year’s winners.)

(Keep watching the NFL website for more information on the next Big Data Bowl.)

Accessing the data

For the purposes of this blog post and accompanying tutorial, we will use the sample data provided by the NFL. This data is from the 2018 NFL season and is available as CSV files, including game-specific data and week-by-week tracking data for each player involved in the "offensive" part of the pass play. Contest participants in the next season of the contest will have access to new weekly game data.

This data is also very relational in nature, which means that SQL is a great medium to start gleaning value – without the need for Jupyter notebooks, other data science specific languages (like Python or R), or additional toolsets.

If you want to follow along - or recreate! - the queries we go through below, follow our tutorial to set up the tables, ingest data, and start analyzing data in TimescaleDB. For those unfamiliar with TimescaleDB, it’s built on PostgreSQL, so you’ll find that all of our queries are standard SQL. If you know SQL, you’ll know how to do everything here. (Some of the more advanced query examples we provide require our new, advanced hyperfunctions, which come pre-installed with any Timescale Forge instance.)

Let's start exploring!

We've provided the steps needed to ingest the dataset into TimescaleDB in the accompanying tutorial, so we won’t go into that here.

The NFL dataset includes the following data:

Games: all relevant data about each game of the regular season, including date, teams, time, and location
Players: information on each player, including what team they play for and their originating college
Plays: a wealth of data about each pass play in the game. Helpful fields include the down, description of the play that happened, line of scrimmage, and total offensive yardage, among other details.
Week [1-17]: for each week of the season, the NFL provides a new CSV file with the tracking data of every player, for every play (pass plays for this data). Interesting fields include X/Y position data (relative to the football field) every few hundred milliseconds throughout each play, player acceleration, and the "type" of a route that was taken. (In our tutorial, this data is imported into the tracking table and totals almost 20 million rows of time-series data.)

In addition to the NFL dataset, we also provide some extra data from Wikipedia that includes game scores and stadium conditions for each game, which you can load as part of the tutorial. With other time-series databases, it can be difficult to combine your time-series data with any other data you may have on hand (see our TimescaleDB vs. InfluxDB comparison for reference).

Because TimescaleDB is PostgreSQL with time-series super powers, it supports JOINS, so any extra relational data you want to add for deeper analysis is just a SQL query away. In our case, we’re able to combine the NFL’s play-by-play data along with weather data for each stadium.

Once you have the data ready, the world of NFL playmakers is at your fingertips, so let’s get started!

The power of SQL

Year after year, we see SQL listed as one of the most popular languages among developers on the StackOverflow survey. Sometimes, however, we can be lured into thinking that the only way to gain insights from relational data is to query it with powerful data analytics tools and languages, create data frames, and use specialized regression algorithms before we can do anything productive.

SQL, it often feels, is only useful for getting and storing data in applications and that we need to leave the "heavy lifting" of analysis to more mature tools.

Not so! SQL can data munge with the best of them! Let's look at a first, quick example.

Average yards per position, per game

For this first example, we'll query the tracking table (the player movement data from all 17 weeks of games) and join to the game table to determine the number of yards per player position, per game.

The results give you a quick overview of how many yards different positions ran throughout each game. You could use this later to compare specific players to see how they compared, more or less yards, to that total.

WITH total_position_yards AS (
    SELECT sum(dis) position_yards, POSITION, gameid FROM tracking t 
    GROUP BY POSITION, gameid)
SELECT avg(position_yards), position, game_date
FROM game g
INNER JOIN total_position_yards tpy ON g.game_id = tpy.gameid
WHERE POSITION IN ('QB','RB','WR','TE')
GROUP BY game_date, POSITION;

Number of plays by offensive player

As a season progresses and players get injured (or traded), it's helpful to know which of the available players have more playing experience, rather than those that have been sitting on the sideline for most of the season. Players with more playing time are often able to contribute to the outcome of the game.

This query finds all players that were on the offense for any play and counts how many total passing plays they have been a part of, ordered by total passing plays descending.

WITH snap_events AS (
-- Create a table that filters the play events to show only snap plays
-- and display the players team information
 SELECT DISTINCT player_id, t.event, t.gameid, t.playid,
   CASE
     WHEN t.team = 'away' THEN g.visitor_team
     WHEN t.team = 'home' THEN g.home_team
     ELSE NULL
     END AS team_name
 FROM tracking t
 LEFT JOIN game g ON t.gameid = g.game_id
 WHERE t.event IN ('snap_direct','ball_snap')
)
-- Count these events & filter results to only display data when the player was
-- on the offensive
SELECT a.player_id, pl.display_name, COUNT(a.event) AS play_count, a.team_name
FROM snap_events a
LEFT JOIN play p ON a.gameid = p.gameid AND a.playid = p.playid
LEFT JOIN player pl ON a.player_id = pl.player_id
WHERE a.team_name = p.possessionteam
GROUP BY a.player_id, pl.display_name, a.team_name
ORDER BY play_count DESC;

player_id	display_name	play_count	team_name
2506109	Ben Roethlisberger	725	PIT
2558149	JuJu Smith-Schuster	691	PIT
2533031	Andrew Luck	683	IND
2508061	Antonio Brown	679	PIT
310	Matt Ryan	659	ATL
2506363	Aaron Rodgers	656	GB
2505996	Eli Manning	639	NYG
2543495	Davante Adams	630	GB
2540158	Zach Ertz	629	PHI
2532820	Kirk Cousins	621	MIN
79860	Matthew Stafford	619	DET
2504211	Tom Brady	613	NE

If you’re familiar with American football, you might know that players are substituted in and out of the game based on game conditions. Stronger, larger players may play in some situations, while faster, more agile players may play in others.

Quarterbacks, however, are the most “important” players on the field, and tend to play more than others. However, by omitting quarterbacks, we can get a deeper insight into players across all other positions.

WITH snap_events AS (
-- Create a table that filters the play events to show only snap plays
-- and display the players team information
 SELECT DISTINCT player_id, t.event, t.gameid, t.playid,
   CASE
     WHEN t.team = 'away' THEN g.visitor_team
     WHEN t.team = 'home' THEN g.home_team
     ELSE NULL
     END AS team_name
 FROM tracking t
 LEFT JOIN game g ON t.gameid = g.game_id
 WHERE t.event IN ('snap_direct','ball_snap')
)
-- Count these events & filter results to only display data when the player was
-- on the offensive
SELECT a.player_id, pl.display_name, COUNT(a.event) AS play_count, a.team_name, pl."position"
FROM snap_events a
LEFT JOIN play p ON a.gameid = p.gameid AND a.playid = p.playid
LEFT JOIN player pl ON a.player_id = pl.player_id
WHERE a.team_name = p.possessionteam AND pl."position" != 'QB'
GROUP BY a.player_id, pl.display_name, a.team_name, pl."position"
ORDER BY play_count DESC;

So, now we can see the non-quarterbacks who are on offense the most in a season:

player_id	display_name	play_count	team_name	position
2558149	JuJu Smith-Schuster	691	PIT	WR
2508061	Antonio Brown	679	PIT	WR
2543495	Davante Adams	630	GB	WR
2540158	Zach Ertz	629	PHI	TE
2541785	Adam Thielen	612	MIN	WR
2543468	Mike Evans	610	TB	WR
2555295	Sterling Shepard	610	NYG	WR
2540169	Robert Woods	604	LA	WR
2552600	Nelson Agholor	604	PHI	WR
2543488	Jarvis Landry	592	CLE	WR
2540165	DeAndre Hopkins	587	HOU	WR
2543498	Brandin Cooks	581	LA	WR

Sack percentage by quarterback on passing plays

We can start to go a little deeper by extracting specific data from the tracking table and layering queries on top of it to make correlations. One piece of information that might be helpful in your analysis is knowing which quarterbacks are sacked most often during passing plays. In football, a “sack” is a negative play for the offense, and quarterbacks who get sacked more often tend to be lower performers overall.

Once you know those players, you could expand your analysis to see if they are sacked more on specific types of plays (shotgun formation) or maybe if sacks occur more often in a specific quarter of the game (maybe the fourth quarter because the offensive line is more tired, or the team tends to be behind late in games and must pass more often).

Queries like this can quickly show you quarterbacks that are more likely to get sacked, particularly when they play a strong defensive team. To get started, we wanted to find the sack percentage of each quarterback based on the total number of pass plays they were involved in during the regular season. To do that we approached the tracking data by layering on Common Table Expressions so that each query could build upon previous results.

First, we select the distinct list of all plays, for each quarterback (qb_plays). The reason we do a SELECT DISTINCT… is because the tracking table holds multiple entries for each player, for each play. We just need one row for each play, for each quarterback.

With this result, we can then count the number of total plays per quarterback (total_qb_plays), the total number of games each quarterback played (qb_games) and then finally the number of pass plays the quarterback was a part of that resulted in a sack (sacks).

With that data in hand, we can finally query all of the values, do a percentage calculation, and order it by the total sack count.

WITH qb_plays AS (
    SELECT DISTINCT ON (POSITION, playid, gameid) POSITION, playid, player_id, gameid 
    FROM tracking t 
    WHERE POSITION = 'QB'
),
total_qb_plays AS (
    SELECT count(*) play_count, player_id FROM qb_plays
    GROUP BY player_id
),
qb_games AS (
    SELECT count(DISTINCT gameid) game_count, player_id FROM qb_plays 
    GROUP BY player_id
),
sacks AS (
    SELECT count(*) sack_count, player_id 
    FROM play p
    INNER JOIN qb_plays ON p.gameid = qb_plays.gameid AND p.playid = qb_plays.playid
    WHERE p.passresult = 'S'
    GROUP BY player_id
)
SELECT play_count, game_count, sack_count, (sack_count/play_count::float)*100 sack_percentage, display_name FROM total_qb_plays tqp
INNER JOIN qb_games qg ON tqp.player_id = qg.player_id
LEFT JOIN sacks s ON s.player_id = qg.player_id
INNER JOIN player ON tqp.player_id = player.player_id
ORDER BY sack_count DESC NULLS last;

If you're an ardent football fan, the results from 2018 probably don't surprise you.

play_count	game_count	sack_count	sack_percentage	display_name
579	16	65	11.23	Deshaun Watson
602	16	55	9.14	Dak Prescott
611	16	53	8.67	Derek Carr
656	16	49	7.47	Aaron Rodgers
462	15	48	10.39	Russell Wilson
639	16	47	7.36	Eli Manning
448	14	45	10.04	Josh Rosen
659	16	43	6.53	Matt Ryan
386	14	43	11.14	Marcus Mariota
619	16	41	6.62	Matthew Stafford
621	15	38	6.12	Kirk Cousins
324	11	37	11.42	Ryan Tannehill
447	11	36	8.05	Carson Wentz

Of course, there are a few quarterbacks that always seem to have a way of avoiding a sack.

play_count	game_count	sack_count	sack_percentage	display_name
725	16	25	3.45	Ben Roethlisberger
682	16	22	3.23	Andrew Luck
613	16	21	3.43	Tom Brady

Now, let’s try some more “advanced” queries and analyses.

Faster insights with PostgreSQL and TimescaleDB

So far, the queries we've shown are interesting and help provide insights to various players throughout the season – but if you were looking closely, they're all regular SQL statements.

Examining a season of NFL tracking data isn't like typical time-series data, however. Most of the queries we want to perform need to examine all 20 million rows in some way.

This is where a tool that's been built for time-series analysis, even when the data isn't typical time-series data, can significantly improve your ability to examine the data and save money at the same time.

Faster queries with TimescaleDB continuous aggregates

We noticed that we often needed to build queries that started with the tracking table, filtering data by specific players, positions, and games. Part of the reason is that the play table doesn't list all of the players who were involved in a particular play. As a result, we need to cross-reference the tracking table to identify the players who were involved in any given play.

The first example query we demonstrated - “average yards per position, per game” - is a good example of this. The query begins by summing all yards, by position, for each game.

This means that every row in tracking has to be read and aggregated before we can do any other analysis. Scanning those 20 million rows is pretty boring, repetitive, and slow work – especially compared to the analysis we want to do!

On our small test instance, the "average yards" query takes about 8 seconds to run. We could increase the size of the instance (which will cost us more money), or we could be smarter about how we query the data (which will cost us more time).

Instead, we can use continuous aggregates to pre-aggregate the data we're querying over and over again, which reduces the amount of work TimescaleDB needs to do every time we run the query. (Continuous aggregates are like PostgreSQL materialized views. For more info, check out our continuous aggregates docs.)

CREATE MATERIALIZED VIEW player_yards_by_game_
WITH (timescaledb.continuous) AS
SELECT player_id, position, gameid,
 time_bucket(INTERVAL '1 day', "time") AS bucket,
 SUM(dis) AS yards
FROM tracking t
GROUP BY player_id, position, gameid, bucket;

After running this query and creating a continuous aggregate, we can modify that first query just slightly, using this as our basis table.

WITH total_position_yards AS (
    SELECT sum(yards) position_yards, POSITION, gameid 
FROM player_yards_by_game t 
    GROUP BY POSITION, gameid)
SELECT avg(position_yards), position, game_date
FROM game g
INNER JOIN total_position_yards tpy ON g.game_id = tpy.gameid
WHERE POSITION IN ('QB','RB','WR','TE')
GROUP BY game_date, POSITION
ORDER BY game_date, position;

We get the same result, but now the query runs in 100ms - 800x faster!

Advanced SQL data analysis with TimescaleDB hyperfunctions

Finally, the more we dug into the data, the more and more we found we needed (or wanted) functions specifically tuned for time-series data analysis to answer the types of questions we wanted to ask.

It is for this kind of analysis that we built TimescaleDB hyperfunctions, a series of SQL functions within TimescaleDB that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code.

Grouping data into percentiles

The NFL dataset is a great use case for percentiles. Being able to quickly find players that perform better or worse than some cohort is really powerful.

As an example, we'll use the same continuous aggregate we created earlier (total yards, per game, per player) to find the median total yards traveled by position for each game.

WITH sum_yards AS (
--Add position to the table to allow for grouping by it later
 SELECT a.player_id, display_name, SUM(yards) AS yards, p.position, gameid
 FROM player_yards_by_game a
 LEFT JOIN player p ON a.player_id = p.player_id
 GROUP BY a.player_id, display_name, p.position, gameid
)
--Find the mean and median for each position type
SELECT position, mean(percentile_agg(yards)) AS mean_yards, approx_percentile(0.5, percentile_agg(yards)) AS median_yards
FROM sum_yards
WHERE POSITION IS NOT null
GROUP BY position
ORDER BY mean_yards DESC;

position	mean_yards	median_yards
FS	595.583433048431	626.388099960848
CB	572.3336749867212	592.2175990890378
WR	552.6508570179277	555.5030569048633
S	530.6436781609186	550.5961518474892
SS	522.5604103343453	551.1296628916651
MLB	462.70229007633407	490.77906906009343
ILB	402.7882871125599	403.3779668359464
OLB	393.40014271151847	390.6742117791442
QB	334.7025466893028	352.1192705472368
LB	328.9812527472519	257.72003396053884
TE	327.9515596330271	257.72003396053884

Finding extreme outliers

Finally, we can build upon this percentile query to find players at each position that run more than 95% of all other players at that position. For some positions, like wide receiver or free safety, this could help us find the “outlier” players that are able to travel the field consistently throughout a game – and make plays!

WITH sum_yards AS (
--Add position to the table to allow for grouping by it later
 SELECT a.player_id, display_name, SUM(yards) AS yards, p.position
 FROM player_yards_by_game a
 LEFT JOIN player p ON a.player_id = p.player_id
 GROUP BY a.player_id, display_name, p.position
),
position_percentile AS (
    SELECT POSITION, approx_percentile(0.95, percentile_agg(yards)) AS p95
    FROM sum_yards 
    GROUP BY position
)
SELECT a.POSITION, a.display_name, yards, p95
    FROM sum_yards a
    LEFT JOIN position_percentile pp ON a.POSITION = pp.position
    WHERE yards >= p95
AND a.POSITION IN ('WR','FS','QB','TE')
ORDER BY position;

position	display_name	yards	p95
FS	Eric Weddle	13869.759999999997	12320.288323166456
FS	Adrian Amos	12989.439999999966	12320.288323166456
FS	Tyrann Mathieu	12565.219999999956	12320.288323166456
QB	Aaron Rodgers	7422.35999999995	6667.51452813257
QB	Patrick Mahomes	6985.989999999952	6667.51452813257
QB	Matt Ryan	6759.959999999969	6667.51452813257
TE	Zach Ertz	13124.58999999995	10667.986199523099
TE	Jimmy Graham	12693.679999999982	10667.986199523099
TE	Travis Kelce	12218.129999999957	10667.986199523099
TE	David Njoku	11502.159999999965	10667.986199523099
TE	George Kittle	11058.099999999975	10667.986199523099
TE	Kyle Rudolph	10761.949999999968	10667.986199523099
TE	Jared Cook	10678.22999999998	10667.986199523099
WR	Antonio Brown	16877.559999999965	14271.23409723974
WR	Brandin Cooks	15510.01999999995	14271.23409723974
WR	JuJu Smith-Schuster	15492.76999999996	14271.23409723974
WR	Robert Woods	15253.179999999958	14271.23409723974
WR	Nelson Agholor	15180.32999999997	14271.23409723974
WR	Tyreek Hill	15106.609999999973	14271.23409723974
WR	Zay Jones	14790.589999999967	14271.23409723974
WR	Sterling Shepard	14673.79999999996	14271.23409723974
WR	Mike Evans	14620.129999999983	14271.23409723974
WR	Davante Adams	14574.509999999951	14271.23409723974
WR	Kenny Golladay	14354.499999999973	14271.23409723974
WR	Jarvis Landry	14281.509999999971	14271.23409723974

Where can the data take you?

As you’ve seen in this example, time-series data is everywhere. Being able to harness it gives you a huge advantage, whether you’re working on a professional solution or a personal project.

We’ve shown you a few ways that time-series queries can unlock interesting insights, give you a greater appreciation for the game and its players, and (hopefully) inspired you to dig into the data yourself.

To get started with the NFL data:

Spin up a fully managed TimescaleDB service: create an account to try it for free for 30 days.
Follow our complete tutorial for step-by-step instructions for preparing and ingesting the dataset, along with several more queries to help you glean insights from the dataset.

If you’re new to time-series data or just have some questions about how to use TimescaleDB to analyze the NFL’s dataset, join our public Slack community. You’ll find Timescale engineers and thousands of time-series enthusiasts from around the world – and we’ll be happy to help you.

🙏 We’d like to thank the NFL for making this data available, and the millions of passionate fans around the world who make the NFL such an exciting game to watch.

And, Geaux Saints 🏈!

The original blog post was a collaboration between

Attila Toth, Miranda Auhl, Ryan Booz

DEV Community