<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Eric Wehmueller</title>
    <description>The latest articles on DEV Community by Eric Wehmueller (@ewehmueller).</description>
    <link>https://dev.to/ewehmueller</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F702521%2F65a5863b-7d99-44e5-86ee-d223c65cb47a.jpg</url>
      <title>DEV Community: Eric Wehmueller</title>
      <link>https://dev.to/ewehmueller</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ewehmueller"/>
    <language>en</language>
    <item>
      <title>Handling Imbalanced Multiclass and Binary Classification Datasets</title>
      <dc:creator>Eric Wehmueller</dc:creator>
      <pubDate>Fri, 10 Sep 2021 07:53:47 +0000</pubDate>
      <link>https://dev.to/ewehmueller/handling-imbalanced-multiclass-and-binary-classification-datasets-24om</link>
      <guid>https://dev.to/ewehmueller/handling-imbalanced-multiclass-and-binary-classification-datasets-24om</guid>
      <description>&lt;p&gt;&lt;em&gt;In this article, I’m going to discuss how to properly handle imbalanced datasets that can be either multiclass or binary classification problems using XGBoost and problems I encountered doing this for the first time.  This is somewhat oddly specific, but it can be applied to other classification problems with similar issues.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Multiclass Classification
&lt;/h1&gt;

&lt;p&gt;While midway through my Capstone project in the Flatiron School’s Data Science program, I encountered an interesting issue.  Using the default XGBoost settings, the recall and precision scores for one of my three classes were both sitting at an extremely low 0.22.  I had never seen scores that low for a model with a respectable overall accuracy until I realized: I had an imbalanced dataset.  Sure enough, this is the distribution of outcomes and classes I observed in my classification report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        class  precision    recall  f1-score   support

          -1       0.27      0.24      0.25       142
           0       0.82      0.73      0.77       532
           1       0.74      0.81      0.78       750

    accuracy                           0.72      1424
   macro avg       0.61      0.59      0.60      1424
weighted avg       0.72      0.72      0.72      1424

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you find yourself in a similar scenario, the fix is actually fairly simple. There are two parameters you need to set on your XGBoost classifier: the objective, which should be ‘multi:softmax’, and ‘num_class’, which should be your number of classes.  The documentation was somewhat misleading, and it took me quite a while to get this running without warnings.  By default, XGBoost will attempt to set the objective to ‘binary:logistic’, which is simply not what our problem is.  As a result, we can see why our “-1” outcome class has such low values across the board.  This is what our code should look like in this instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model_xgb = XGBClassifier(objective='multi:softmax', num_class=3)
model_xgb.fit(X_train, y_train)
pred_xgb_test = model_xgb.predict(X_test)
print(classification_report(y_test, pred_xgb_test))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although my recall was not improved for this class, the precision shot up to nearly 80%. This is a vast improvement over our last model, as we now have our parameters and classifier set up correctly.  It was at this point in my project that a thought entered my mind: what if one of the class outcomes is irrelevant? Dropping it would turn this into a binary classification problem.&lt;/p&gt;

&lt;h1&gt;
  
  
  Binary Classification... With a Surprise
&lt;/h1&gt;

&lt;p&gt;At this point, I had already done quite a bit of research on XGBoost through its documentation and a variety of Stack Overflow posts from people having similar issues.  During that search I had already found the solution for binary classification: the scale_pos_weight parameter needed to be set. Essentially, this parameter weights the positive class (the “1” labels) by whatever factor you provide, to even out the class sizes.  However, there was an issue: my positive class was the majority class… by a factor of 6.  So what does one do in a scenario like this? There is no “scale_neg_weight” parameter, as I found out.  This meant I was going to re-engineer my outcome feature to essentially invert my classes.  Without modification, each value in my outcome field was one of:&lt;br&gt;
&lt;code&gt;[-1,0,1]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;For context, this was originally a “favors_pitcher” field, and the values speak for themselves.  The class I was trying to remove was the “0” class: the neutral outcomes that don’t really favor either team in the short term. So I dropped those rows and engineered a new outcome field with a couple lines of Python.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;no_neutrals_df['favors_hitter_binary'] = 1
no_neutrals_df.loc[no_neutrals_df['favors_pitcher']==1, 'favors_hitter_binary'] = 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Perfect. Now I had only zeros and ones in my new “favors_hitter_binary” column for my new binary classification problem.  There were six times as many “zero” entries as there were “one” entries, so now we were able to use our XGBoost parameter from earlier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nn_model_xgb = XGBClassifier(scale_pos_weight=6) #6x value count diff
nn_model_xgb.fit(X_train_nn, y_train_nn)
pred_xgb_nn = nn_model_xgb.predict(X_test_nn)
print(classification_report(y_test_nn, pred_xgb_nn))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
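&lt;p&gt;Rather than hardcoding the 6, you can also compute scale_pos_weight from the class counts themselves.  Here is a minimal sketch using a made-up label list in place of y_train_nn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import Counter

# toy labels standing in for y_train_nn
labels = [0] * 600 + [1] * 100

counts = Counter(labels)
# ratio of negative to positive examples
ratio = counts[0] / counts[1]
print(ratio)  # 6.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a real dataframe, &lt;code&gt;value_counts()&lt;/code&gt; on the outcome column gives the same numbers.&lt;/p&gt;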



&lt;p&gt;Although my binary model ended up overfitting and not really meeting my needs in the grand scheme of the project, learning how to handle both scenarios felt like a worthwhile endeavour.  The model could have performed extremely well across the board for the positive hitter outcome class, and I would have been one happy programmer. However, the precision and recall scores were too low, both around 0.33.  Additionally, I had to redo my train-test split just for this tangent, which was not ideal, even though I deemed it worth investigating.&lt;/p&gt;

&lt;p&gt;Regardless, I hope this retrospective is helpful as you tackle your own classification problems or work with XGBoost.  This is definitely something I’ll be able to handle more easily in the future. If you’d like to see the full context of what I was working on, please check out my MLB pitch classification project &lt;a href="https://github.com/ewehmueller/pitch-analysis"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>An Intro to pybaseball</title>
      <dc:creator>Eric Wehmueller</dc:creator>
      <pubDate>Fri, 10 Sep 2021 03:14:07 +0000</pubDate>
      <link>https://dev.to/ewehmueller/an-intro-to-pybaseball-88k</link>
      <guid>https://dev.to/ewehmueller/an-intro-to-pybaseball-88k</guid>
      <description>&lt;p&gt;&lt;em&gt;In this article, I will introduce you to pybaseball- an extremely helpful tool for collecting baseball data by showing some basic examples to get you started.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For my Capstone project in the Flatiron School’s Data Science program, there was no question I wanted to select a project I was passionate about.  After being inspired by my extended family’s fantasy league smack talk, I decided that a project involving baseball was the right choice.  However, one immediate difficulty was that I was essentially drowning in data; there are so many datasets available online that provide too many entries, too many features, or both.  After some discussions with a colleague of mine, I started using a Python library called pybaseball.  If you’re doing anything related to pitch or hitter data, look no further: your life is about to get a lot easier.&lt;/p&gt;

&lt;p&gt;I’m going to run through some basics to help you get started, so that you can query for your own data with whatever teams or players you desire.  To get started, you can install via pip:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install pybaseball&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Next, we’ll go over a basic example using this library.  Let’s say that ESPN is down and I want to check how my team, the St. Louis Cardinals, is doing in the NL Central race.  In one of my practice notebooks, I have the following block of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pybaseball import standings
data = standings(2021)[4]  # index 4 = NL Central
print(data)

                   Tm   W   L  W-L%    GB  E#
1    Milwaukee Brewers  86  55  .610    --  --
2      Cincinnati Reds  74  67  .525  12.0  10
3  St. Louis Cardinals  70  68  .507  14.5   9
4         Chicago Cubs  65  76  .461  21.0   1
5   Pittsburgh Pirates  50  90  .357  35.5   ☠
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This can also pull previous seasons’ results, and the standings are returned as a dataframe.  Let’s move on to a more useful example: pitch data.  Say we are trying to figure out how to strike out an elite hitter in our division, Jesse Winker of the Cincinnati Reds. We are going to pull in all the Statcast data that we can (these metrics have been actively measured since 2015) over the last 5 years.  First, we need to find a player ID, then use that ID to pull in the Statcast data in a separate call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;player_info_df = playerid_lookup('winker','jesse')
player_info_df.head()
print(player_info_df['key_mlbam'][0])

608385
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jwinker_id = 608385
df = statcast_batter('2016-08-01','2021-08-01', jwinker_id)
print(df.shape)

Gathering Player Data
(5805, 92)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are a &lt;strong&gt;lot&lt;/strong&gt; of features just from this one request (92), but I believe most of them to be essential, relevant, and applicable. I’m omitting the head of the dataframe here only because of its size. Fortunately, the number of features can easily be filtered down.  For my specific model, I started with the following fields, as they were the most relevant pitch info:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['pitch_type','p_throws','release_speed', 'plate_x','plate_z','pfx_x','pfx_z','vx0','vy0','vz0', 'ax','ay','az','release_spin_rate','strikes','balls']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To save you some time, I'll tell you what some of these metrics refer to.  The pfx_x and pfx_z fields refer to the horizontal and vertical movement of the ball, respectively, relative to where the pitcher initially threw the ball. The plate_x and plate_z refer to the pitch location from the catcher's perspective. The fields starting with 'v' indicate the velocity in all 3 dimensions, and the fields starting with 'a' indicate the acceleration in all 3 dimensions.  &lt;/p&gt;
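&lt;p&gt;Narrowing the Statcast pull down to those fields is a single indexing operation.  Here is a sketch using a tiny stand-in dataframe in place of a real statcast_batter result, with just three of the columns listed above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# stand-in for the 92-column statcast_batter result
df = pd.DataFrame({'pitch_type': ['FF', 'SL'],
                   'release_speed': [95.1, 84.3],
                   'strikes': [0, 1],
                   'player_name': ['Jesse Winker'] * 2})

# keep only the pitch-metric columns, dropping rows with missing values
pitch_cols = ['pitch_type', 'release_speed', 'strikes']
pitch_df = df[pitch_cols].dropna()
print(pitch_df.shape)  # (2, 3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;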

&lt;p&gt;Unfortunately, the documentation is not very verbose about what all these column names actually mean.  To help with this, a website called Baseball Savant has put together a list of these fields.  If you’re unsure what a metric is, you can reference &lt;a href="https://baseballsavant.mlb.com/csv-docs"&gt;this page&lt;/a&gt;.  This is something I wish I’d had when I first started becoming familiar with the pybaseball library.&lt;/p&gt;

&lt;p&gt;For my project, I attempted to create a model that would classify a pitch’s outcome (negative, neutral, or positive) based on the metrics of the pitch, in order to better understand what makes a “good pitch” in the MLB. To do this, I built a model on all pitches thrown against the starting roster of the Cincinnati Reds.  If you’re interested in the results of this classification project, I highly recommend you check it out on my GitHub &lt;a href="https://github.com/ewehmueller/pitch-analysis"&gt;here&lt;/a&gt;.  I hope this gives you enough to get started so that you’re not overwhelmed by the amount of data being collected in Major League Baseball today.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>Explorations in League of Legends Data</title>
      <dc:creator>Eric Wehmueller</dc:creator>
      <pubDate>Thu, 09 Sep 2021 23:55:03 +0000</pubDate>
      <link>https://dev.to/ewehmueller/explorations-in-league-of-legends-data-4g3b</link>
      <guid>https://dev.to/ewehmueller/explorations-in-league-of-legends-data-4g3b</guid>
      <description>&lt;p&gt;&lt;em&gt;In this article, I’m going to discuss a recent classification project I worked on relating to the massively popular video game, League of Legends, as well as enhancements and other insights that could be gained from this data.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Setup
&lt;/h1&gt;

&lt;p&gt;As a part of my Data Science curriculum, I was instructed to devise a scenario in which I attempt to model some type of classification problem.  Unlike other projects, I was not given a data set or starting point.  Instead, the only instruction was to find something I was passionate about, as long as it resulted in a classification problem.  After searching Kaggle for inspiration, I found a data set that really excited me: League of Legends data.  I immediately realized it would make a perfect classification problem; a variety of metrics are “snapshotted” at the 10-minute mark, along with the end result (win or loss) of the same game.&lt;/p&gt;

&lt;p&gt;At this point, I created my hypothetical business problem: I have been hired by the esports organization Cloud9 as a player coach/analyst for the professional League of Legends team. They are competing at the top level and are looking to win every game they possibly can, as there is a lot of money on the line. My job is to help them determine the most important factors in winning League of Legends games. I am to investigate what I should be advising our players to focus on in the first 10 minutes of each game to provide the highest chance to win the game.&lt;/p&gt;

&lt;h1&gt;
  
  
  Caveat
&lt;/h1&gt;

&lt;p&gt;As a fair warning, from this point on I’m going to discuss my findings from the perspective of a player, to give as much of a “deep dive” into this data as possible. If you’re not familiar with the game and some of its terminology, it will be hard to follow along with the rest of this article.  If you’re in this boat, I highly recommend you check out my project on GitHub &lt;a href="https://github.com/ewehmueller/lol-classification"&gt;here&lt;/a&gt;. The notebook (for a technical audience) and corresponding PDF presentation (for a non-technical audience) give a much higher-level look at my findings than I will discuss here.&lt;/p&gt;

&lt;h1&gt;
  
  
  Investigation
&lt;/h1&gt;

&lt;p&gt;After taking an initial look at the data available to me, I sought out answers to three questions, which I believe could be answered effectively.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the single most important determining factor in winning a game?&lt;/li&gt;
&lt;li&gt;What objectives should our players prioritize?&lt;/li&gt;
&lt;li&gt;What objectives should our players ignore?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After some initial cleaning, feature engineering, and iteration over multiple models, the feature-importances graph for my XGBoost model showed some interesting results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GkAk5xSf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g3bsdg8ne8dp98ywlcbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GkAk5xSf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g3bsdg8ne8dp98ywlcbe.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Obviously, the single most important factor in determining game outcome in professional games is the gold differential at 10 minutes.  I say “obviously” because anything worth doing in League of Legends gives you gold. During professional game broadcasts, the total gold of each team is shown on the scoreboard (along with total kills and towers).  Any decent player knows this is a fundamental part of the game: having more gold means you have more items; more items mean more stats on your character; and more stats mean you’re more likely to win fights and accrue gold even faster.  So ultimately, this doesn’t provide much insight to a high-level player; it mostly confirms a basic fundamental of the game that one learns from playing.  At this point in my data science adventure, though, this was extremely reassuring, because I understood what it meant both from a technical perspective and as a player: my model was working properly.&lt;/p&gt;

&lt;p&gt;From this visual, one can also see that the number of dragons taken is the next highest determining factor in predicting outcome for our model.  This is where we start to get actual insights for professional players.  League is a game of many choices, and as a team you have to decide what objectives to take. There’s only a limited amount of time in the game to accomplish things.  Each player can really only be in one place at one time, and it takes time to move across the map.  For example, if Red team decides to send four players for dragon, but blue team can only have two players there to contest it, the red team will almost assuredly secure the dragon while potentially killing the blue players in a 4vs2 scenario. In return, however, the remaining blue members will be getting waves and potential towers on the other side of the map.  This model seems to place high importance on making sure to secure those dragons, even if it is at the cost of losing towers and minions.  Similarly, the model places the absolute lowest value on taking Rift Heralds, as it is not a heavy contributing factor to wins in these games.&lt;/p&gt;

&lt;h1&gt;
  
  
  Model Weaknesses and Potential Future Work
&lt;/h1&gt;

&lt;p&gt;We were able to answer the questions laid out at the start, but I definitely question how useful this would be for players within an organization.  First, there are definitely some issues with the data.  The data was taken from 10,000 professional games; however, this is over the course of two years’ worth of matches.  Something not taken into account is the fact that the game developer, Riot Games, introduces balance patches every two weeks to keep the game/meta feeling different and “fresh”.  The game itself is in a constant state of change; one week they may suddenly decide to change the gold value that turrets provide, or the amount of stats given by a dragon type.  Our model cannot easily account for this.&lt;/p&gt;

&lt;p&gt;Another drawback of these features is that none of the champion picks are taken into account.  In my opinion, this can sometimes be more important than the actual game state itself, as some teams build around sacrificing early objectives so that their late-game “scaling comp” can reach its power spikes.  There are over 150 champions in League of Legends, and the team combinations are nearly endless.  Although professionals can generally get a good grasp on what the strongest picks are for each role, sometimes the strongest individual picks do not synergize well with other “strongest” picks in other roles. Sadly, the data I had available does not capture this, but I think it’s an extremely important part of the game.&lt;/p&gt;

&lt;p&gt;If I were an actual analyst for a professional team, I think one of the most valuable questions I could answer would be: “Which champion would be best to pick in this game, for this particular patch?” Although the data in my project is not able to answer a question like this, there are some interesting resources I’ve found which help me make my own champion selections in games.  For example, &lt;a href="https://lolalytics.com/lol/draven/build/?patch=30"&gt;this site&lt;/a&gt; shows Draven’s win rate against every other champion over the past 30 days for the top 10% of players, still bringing in a huge sample size of nearly 300,000 games.  Let’s say the enemy team has already picked Draven; we can consult the win rate matrix per role to determine which of our own picks gives us the best chance to win.  In the future, expanding the project to take champion selections and win rate matrices into account would be difficult but extremely exciting.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>A Simple Guide to an "Easy on the Eyes" Jupyter Notebook</title>
      <dc:creator>Eric Wehmueller</dc:creator>
      <pubDate>Thu, 09 Sep 2021 20:02:57 +0000</pubDate>
      <link>https://dev.to/ewehmueller/a-simple-guide-to-an-easy-on-the-eyes-jupyter-notebook-47c3</link>
      <guid>https://dev.to/ewehmueller/a-simple-guide-to-an-easy-on-the-eyes-jupyter-notebook-47c3</guid>
      <description>&lt;p&gt;&lt;em&gt;In this article, I will discuss some visual settings and other general recommendations I have for setting up your Jupyter Notebook.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Before embarking on my journey into the world of Data Science, I was a software developer working on a variety of technologies in the mobile space. Unfortunately for me, this meant setting up many different IDEs and other environments required for the tasks I needed to complete.  Immediately upon installation, I would always make sure to change the “Theme”.  I’m not typically a person obsessed with making every aesthetic look perfect; however, one month’s work in Xcode with its default settings is all it took for my tired eyes to tell me, “I can’t continue working like this”.  Something I can’t stand is how nearly every development application defaults to a “Light Mode” equivalent, when its primary users are developers spending long hours staring at a very bright, white background.  Beginning my work in a Python environment was no exception.  I’d like to pass on the settings I found for a Jupyter Notebook environment, in hopes of saving you some time, and your eyes, in the long run.&lt;/p&gt;

&lt;p&gt;I highly recommend using a package called “jupyterthemes”. Not only does this give you some quick and easy theme options, but it also gives you the customization freedom to tweak any other visual elements to your preference.&lt;/p&gt;

&lt;p&gt;I’m not going to go in-depth on the initial setup and installation of Python and Anaconda; the following instructions only apply if you’re able to run a Jupyter Notebook on your local machine.  I will, however, go over the installation of jupyterthemes, which is the first step! You’re welcome to follow along with the documentation &lt;a href="https://github.com/dunovank/jupyter-themes"&gt;here&lt;/a&gt;, but I will be highlighting the parts that were most important for me.&lt;/p&gt;

&lt;p&gt;From the command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install jupyterthemes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simply wait for this process to complete. Easy. Now comes the fun part. I know everyone’s preferences are different, but I’ll run you through the setup I use on any new device while using Python. Here it is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jt -t onedork -fs 95 -altp -tfs 11 -nfs 115 -cellw 65% -T -N -altmd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command might look confusing to a new user, but I’ll go over each of the important settings in this single line. Hopefully, you’re already much happier with this more classic-looking dark theme for your notebook.  It is worth noting that you may need to restart or refresh your notebook in order for these changes to be applied. Let’s look through these settings briefly, as the documentation for each of these is not very verbose.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;-t onedork&lt;/code&gt;&lt;br&gt;
This is simply our setting for the “onedork” theme.  This is my personal favorite, but other options include grade3, oceans16, chesterish, monokai, solarized1, and solarizedd. You can preview these with your own “jt -t” commands, or by viewing the previews uploaded &lt;a href="https://github.com/dunovank/jupyter-themes/tree/master/screens"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;-fs 95&lt;/code&gt;&lt;br&gt;
This is the Font Size. I generally don’t like my font to be very large, and this setting makes the font slightly smaller than default to show slightly more code on the screen at a time. The value 95 translates to 95% of the default size. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;-altp&lt;/code&gt;&lt;br&gt;
This is the “alternate prompt layout”.  It leaves a narrower layout and excludes the prompt line numbers.  Typically, I like my notebook looking as clean as possible when showing it to someone, and this setting reduces a little of that clutter.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;-tfs 11&lt;/code&gt;&lt;br&gt;
This is the font size for text within Markdown cells, slightly smaller than the default.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;-nfs 115&lt;/code&gt;&lt;br&gt;
This is the font size of the notebook’s own UI, namely “File”, “Kernel”, and the other options fixed at the top of the page. 115 is equivalent to 115% of its default size.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;-cellw 65%&lt;/code&gt;&lt;br&gt;
This is the “cell width”. I have this option set to 65% so that the cell will essentially extend outward from the center, covering 65% of the width of the window.  I like this value, because at higher values for this setting, our code is not as “centered”.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;-T -N -altmd&lt;/code&gt;&lt;br&gt;
These settings toggle on the toolbar, the name and logo of the notebook, and an alternate markdown color. With the alt markdown toggle, the markdown background blends into the color of the notebook itself. I like this setting because it keeps the focus on the actual code, and the markdown ends up looking cleaner as a result.&lt;/p&gt;
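&lt;p&gt;One more flag worth knowing: if you ever want to undo all of this and go back to the stock look, jupyterthemes can restore the default styling with a single command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jt -r
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After running it, restart or refresh your notebook, just as with applying a theme.&lt;/p&gt;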

&lt;p&gt;I hope this saves you some time trying out all of these settings yourself, or at the very least gives you a solid visual option for working within a Jupyter Notebook many hours at a time. However, if you’re a developer, you’re probably going to want to try EVERY option at your disposal.  If that’s your thing, check out the jupyterthemes documentation at &lt;a href="https://github.com/dunovank/jupyter-themes"&gt;https://github.com/dunovank/jupyter-themes&lt;/a&gt; to find your perfect settings.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
  </channel>
</rss>
