In this article, I will introduce you to pybaseball, an extremely helpful tool for collecting baseball data, and show some basic examples to get you started.
For my Capstone project in the Flatiron School’s Data Science program, there was no question that I wanted a project I was passionate about. Inspired by my extended family’s fantasy league smack talk, I decided that a project involving baseball was the right choice. However, one immediate difficulty was that I was essentially drowning in data: many of the datasets available online have too many entries, too many features, or both. After some discussions with a colleague, I started using a Python library called pybaseball. If you’re doing anything with pitching or hitting data, look no further: your life is about to get a lot easier.
I’m going to run through some basics so that you can query for your own data with whatever teams or players you desire. To begin, install the library via pip:
pip install pybaseball
Next, we’ll go over a basic example using this library. Let’s say that ESPN is down and I want to check to see how my team, the St. Louis Cardinals, is doing in the NL Central race. In one of my practice notebooks, I have the following block of code:
from pybaseball import standings
# standings() returns a list of division tables; index 4 is the NL Central
data = standings(2021)[4]
print(data)
                    Tm   W   L  W-L%    GB  E#
1    Milwaukee Brewers  86  55  .610    --  --
2      Cincinnati Reds  74  67  .525  12.0  10
3  St. Louis Cardinals  70  68  .507  14.5   9
4         Chicago Cubs  65  76  .461  21.0   1
5   Pittsburgh Pirates  50  90  .357  35.5   ☠
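Indexing the result with [4] works, but it relies on remembering which position holds the NL Central. Here is a minimal sketch of an alternative that looks the division up by team name. It assumes every table has the Tm column shown in the output above; the helper names division_for_team and show_division are hypothetical and not part of pybaseball.

```python
def division_for_team(tables, team_name):
    """Return the first standings table whose 'Tm' column lists team_name."""
    for table in tables:
        # list(...) works for a pandas Series as well as a plain list
        if team_name in list(table["Tm"]):
            return table
    raise ValueError(f"{team_name!r} not found in any division table")


def show_division(season, team_name):
    """Print one team's division table (needs pybaseball and network access)."""
    from pybaseball import standings

    print(division_for_team(standings(season), team_name))
```

In a notebook, calling show_division(2021, "St. Louis Cardinals") should print the same table as above, assuming the team name matches the Tm column exactly.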
The standings function also accepts earlier seasons, and it returns its results as pandas DataFrames. Let’s move on to a more useful example: pitch data. Say we are trying to figure out how to strike out an elite hitter in our division, Jesse Winker of the Cincinnati Reds. We are going to pull in all the Statcast data we can over the last 5 years (Statcast metrics have been actively measured since 2015). First, we need to find his player ID; then we use that ID to pull in the Statcast data in a separate call.
from pybaseball import playerid_lookup, statcast_batter
player_info_df = playerid_lookup('winker','jesse')
player_info_df.head()
print(player_info_df['key_mlbam'][0])
608385
jwinker_id = 608385
df = statcast_batter('2016-08-01','2021-08-01', jwinker_id)
print(df.shape)
Gathering Player Data
(5805, 92)
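Copying the ID by hand is fine for one player, but the two calls can also be chained. The sketch below is one way to do that, assuming the lookup result has the key_mlbam column used above and that the first row is the player you want (a common name may return several rows). The names first_mlbam_id and fetch_batter_pitches are hypothetical helpers, not pybaseball functions.

```python
def first_mlbam_id(lookup_table):
    """Return the MLBAM ID from the first row of a playerid_lookup result."""
    ids = list(lookup_table["key_mlbam"])
    if not ids:
        raise ValueError("lookup returned no players")
    return int(ids[0])


def fetch_batter_pitches(last_name, first_name, start_date, end_date):
    """Look up a batter by name, then pull his Statcast pitches.

    Needs pybaseball and network access.
    """
    from pybaseball import playerid_lookup, statcast_batter

    player_id = first_mlbam_id(playerid_lookup(last_name, first_name))
    return statcast_batter(start_date, end_date, player_id)
```

With this in place, fetch_batter_pitches('winker', 'jesse', '2016-08-01', '2021-08-01') should issue the same request as the two-step version.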
There are a lot of features from this one request (92 columns), but I believe most of them are essential and relevant. I’m not showing the head of the dataframe here because of its size. Fortunately, the columns are easy to filter down. For my model, I started with the following fields, as they were the most relevant for pitch info:
['pitch_type','p_throws','release_speed', 'plate_x','plate_z','pfx_x','pfx_z','vx0','vy0','vz0', 'ax','ay','az','release_spin_rate','strikes','balls']
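As a rough sketch of that filtering step, the snippet below keeps only the fields listed above. It assumes df is the DataFrame returned by statcast_batter; missing_columns and select_pitch_columns are hypothetical helper names. Checking for missing names first gives a clearer error than a bare indexing failure if a column is ever renamed.

```python
PITCH_COLUMNS = [
    'pitch_type', 'p_throws', 'release_speed',
    'plate_x', 'plate_z', 'pfx_x', 'pfx_z',
    'vx0', 'vy0', 'vz0', 'ax', 'ay', 'az',
    'release_spin_rate', 'strikes', 'balls',
]


def missing_columns(available, wanted=PITCH_COLUMNS):
    """List the wanted column names that are absent from `available`."""
    available = set(available)
    return [name for name in wanted if name not in available]


def select_pitch_columns(df, wanted=PITCH_COLUMNS):
    """Return a copy of df restricted to the wanted columns."""
    missing = missing_columns(df.columns, wanted)
    if missing:
        raise KeyError(f"columns not in DataFrame: {missing}")
    return df[list(wanted)].copy()
```

Calling select_pitch_columns(df) on the Winker data should then leave the same number of rows with 16 columns instead of 92.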
To save you some time, I'll tell you what some of these metrics refer to. The pfx_x and pfx_z fields refer to the horizontal and vertical movement of the ball, respectively, relative to where the pitcher initially threw the ball. The plate_x and plate_z fields give the location of the pitch as it crosses home plate, from the catcher's perspective. The fields starting with 'v' (vx0, vy0, vz0) give the velocity in each of the 3 dimensions, and the fields starting with 'a' (ax, ay, az) give the acceleration in each of the 3 dimensions.
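To make the component fields more concrete, here is a small sketch that combines them into single numbers. It assumes the units are the ones I understand Baseball Savant to document (velocity components in feet per second, pfx values in feet); please check that against the field list before relying on it. The function names are hypothetical.

```python
import math

# 1 ft/s = 3600/5280 mph
FT_PER_SEC_TO_MPH = 3600 / 5280


def initial_speed_mph(vx0, vy0, vz0):
    """Combine the three velocity components (assumed ft/s) into a speed in mph."""
    return math.sqrt(vx0 ** 2 + vy0 ** 2 + vz0 ** 2) * FT_PER_SEC_TO_MPH


def total_movement_inches(pfx_x, pfx_z):
    """Combine horizontal and vertical movement (assumed feet) into inches."""
    return math.hypot(pfx_x, pfx_z) * 12
```

The speed computed this way can serve as a quick sanity check against release_speed, although the two may be measured at different points in the ball's flight and need not match exactly.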
Unfortunately, the pybaseball documentation does not say much about what these column names mean. To help with this, Baseball Savant has put together a list describing these fields. If you’re unsure what a metric is, you can reference this page. This is something I wish I had access to when I first started becoming familiar with the pybaseball library.
For my project, I attempted to create a model that would classify a pitch's outcome for the pitcher (negative, neutral, or positive) based on the metrics of the pitch, in order to better understand what makes a “good pitch” in the MLB. To do this, I built the model on all pitches thrown to every member of the Cincinnati Reds' starting roster. If you’re interested in the results of this classification project, I highly recommend that you check it out at my GitHub here. I hope this gives you enough to get started so that you’re not overwhelmed by the amount of data being collected in Major League Baseball today.