<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: timhugele</title>
    <description>The latest articles on DEV Community by timhugele (@timhugele).</description>
    <link>https://dev.to/timhugele</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F343896%2F83031903-9161-4ded-b903-1144f2c3dfb3.png</url>
      <title>DEV Community: timhugele</title>
      <link>https://dev.to/timhugele</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/timhugele"/>
    <language>en</language>
    <item>
      <title>Capstone Project: Predicting March Madness</title>
      <dc:creator>timhugele</dc:creator>
      <pubDate>Wed, 27 May 2020 21:04:20 +0000</pubDate>
      <link>https://dev.to/timhugele/capstone-project-predicting-march-madness-4849</link>
      <guid>https://dev.to/timhugele/capstone-project-predicting-march-madness-4849</guid>
      <description>&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Github&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tim Hugele&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/timhugele"&gt;Github&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As a student at the Flatiron School, I decided to continue the work from my previous module 4 project and try to predict March Madness outcomes for my Capstone Project. I wasn't particularly happy with my earlier results and wanted to see if I could improve my model. I ended up starting over completely and didn't use any of the code from my previous attempt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Source
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Most of the data for this project was obtained from &lt;a href="https://www.kaggle.com/c/mens-machine-learning-competition-2019"&gt;Kaggle&lt;/a&gt;. 
Additional data for this project was obtained from Kaggle users and from websites cited at the bottom of this page.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Jupyter notebooks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/timhugele/March-Madness/blob/master/Data_Preparation.ipynb"&gt;Data_Preparation.ipynb&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;This notebook contains much of the initial feature engineering that I conducted. The main features created here were season averages for each team going into that year's tournament.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/timhugele/March-Madness/blob/master/Massey_Ordinals.ipynb"&gt;Massey_Ordinals.ipynb&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;This notebook is exclusively dedicated to engineering the Massey Ordinals features.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/timhugele/March-Madness/blob/master/TrueSkill.ipynb"&gt;TrueSkill.ipynb&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;This notebook is exclusively dedicated to engineering the TrueSkill Rating feature.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/timhugele/March-Madness/blob/master/Game_Location.ipynb"&gt;Game_Location.ipynb&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;This notebook is dedicated to engineering features that capture the distance from each team's home town to the location of the game site and also the difference in elevation between the two.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/timhugele/March-Madness/blob/master/Modeling_Notebook_2.ipynb"&gt;Modeling_Notebook_2.ipynb&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;This notebook contains all of the modeling work that I conducted for this project. Some additional feature engineering is also contained in this notebook.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project Presentation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.google.com/presentation/d/1liFjM5xYvqdQNxZi9qKMhB44Lz-C2MYtIOs1EvtKDK0/edit?usp=sharing"&gt;Google Slides&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Problem Summary
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Using the data from Kaggle and other outside sources, I am attempting to predict the outcomes of NCAA Men's Basketball Tournament games. I intend to use this information to place bets on an NCAA Tournament bracket in future years.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;p&gt;The metrics that I am using for this project are log loss and accuracy. I used both of these metrics because they each provide useful information that the other doesn't communicate. Log loss is 0 for perfect predictions and grows without bound as predictions become more confidently wrong, so lower is better. It is useful because it takes the probability values that my model generated and compares them to the actual outcomes. For example, I used a value of 1 to indicate that team 1 won a particular game. If team 1 did win, a predicted win probability of 0.8 earns a better log loss than one of 0.75: the score rewards confidence when you are right and punishes it when you are wrong. However, it doesn't directly tell you how many predictions were correct, meaning how many games gave the eventual winner a probability greater than 0.5.&lt;/p&gt;

&lt;p&gt;Therefore, I used a second metric, accuracy. Accuracy is simply the number of correct predictions divided by the total number of games. This metric is useful because it is very easy to understand and gives you the most important information. However, I thought that the log loss metric would still be useful, because ideally I would like my predictions to have a high degree of confidence.&lt;/p&gt;
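As a toy illustration (the games and probabilities below are invented, not from the project), both metrics can be computed by hand; a value of 1 means team 1 won:

```python
# Toy example of the two metrics; y_prob is the predicted probability
# that team 1 wins. Numbers are invented for illustration only.
import math

y_true = [1, 1, 0, 1]
y_prob = [0.80, 0.75, 0.30, 0.55]

log_loss = -sum(
    math.log(p) if y == 1 else math.log(1 - p)
    for y, p in zip(y_true, y_prob)
) / len(y_true)
accuracy = sum((p > 0.5) == (y == 1) for y, p in zip(y_true, y_prob)) / len(y_true)

print(round(log_loss, 3), accuracy)  # 0.366 1.0
```

Note how all four picks are correct (accuracy 1.0) but the log loss still penalizes the less confident ones.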

&lt;h3&gt;
  
  
  Process
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Research&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first thing that I did after deciding on my project was to look at what others had done. Due to this being an annual Kaggle competition there were numerous notebooks that others had posted laying out how they approached this problem. I was fortunate to have access to so many different ideas and I ended up taking ideas from a few different notebooks and incorporating them into my project. I listed the notebooks that I used at the bottom of this page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Obtain Data/Clean Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next step I took was to obtain the data that I wanted to use for my project. Fortunately, much of the data that I wanted to use was made easily available on the Kaggle competition page. The most important dataset was the one containing the outcomes of NCAA Tournament games, along with all of the traditional box score statistics for those games. I also wanted to use the same type of data from regular season games in order to base my tournament predictions on how the teams performed during the regular season. Kaggle also provided data on tournament seeds, the location of each tournament game, different spellings of team names, Massey Ordinals, and the conferences that each team played in. I engineered features using all of the previously listed data. &lt;/p&gt;

&lt;p&gt;I also obtained some data from outside sources. One dataset that I was eager to use was from Kenpom.com. This website provides advanced college basketball analytics; however, the data that I wanted to use was behind a paywall. Fortunately, a user on Kaggle posted much of the data that I wanted. I also turned to Kaggle users for data on the latitudes, longitudes, and elevations of the cities where the tournament games were played and of the cities and towns where each university is located. &lt;/p&gt;

&lt;p&gt;My final data collection step was to obtain data that was missing from the previously mentioned datasets. One such dataset was the one containing game locations. The dataset only went back to 2010 while I was hoping to use data going back to 2003. Therefore, I scraped the remaining game locations from &lt;a href="https://www.basketball-reference.com/"&gt;basketball-reference.com&lt;/a&gt;. I was also missing data on the latitudes, longitudes, and elevations of some of the cities in my datasets, and due to the low number of missing values I decided to enter in the values manually. &lt;/p&gt;

&lt;p&gt;Fortunately, most of the data was relatively clean, so aside from having to scrape and manually add a small amount of data I did not have to do much additional data cleaning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature Engineering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most of the work that I did on this project was in the feature engineering step, resulting in dozens of features. The largest share of features came from computing season averages for each team: I took the box score statistics for every game for every team and found the season averages for each of those statistics, including defensive rebounds, offensive rebounds, blocked shots, steals, free throws made, free throws attempted, etc. In addition to these I created features for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wins over the last 15 days of the regular season&lt;/li&gt;
&lt;li&gt;Tournament Seed&lt;/li&gt;
&lt;li&gt;Kenpom statistics (adjusted offensive and defensive efficiency, adjusted net efficiency, and a measure for luck)&lt;/li&gt;
&lt;li&gt;Win/loss target feature (1 if team 1 won the game, 0 if team 1 lost)&lt;/li&gt;
&lt;li&gt;Number of wins against top 25 teams, with the top 25 being determined by adjusted net efficiency&lt;/li&gt;
&lt;li&gt;Massey Ordinals. The dataset includes a large number of rating systems and each team's ratings throughout the season. I found which rating systems were the best predictors of outcomes and took the top 4 and averaged their results.&lt;/li&gt;
&lt;li&gt;TrueSkill rating. This is a rating system developed by Microsoft for its video games to determine team and player quality. &lt;/li&gt;
&lt;li&gt;Each team's travel distance from its college's home town to the location where the game is played.&lt;/li&gt;
&lt;li&gt;The difference in elevation between a college's hometown and the game location.&lt;/li&gt;
&lt;/ul&gt;
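A minimal sketch of the season-average step, with invented column names and numbers (the real Kaggle files use their own schema):

```python
# Hypothetical box-score frame: one row per team per game.
import pandas as pd

games = pd.DataFrame({
    "Season": [2019, 2019, 2019, 2019],
    "TeamID": [1101, 1101, 1102, 1102],
    "Score":  [70, 80, 65, 75],
    "DR":     [25, 27, 22, 30],   # defensive rebounds
})

# Season averages per team, the largest share of the engineered features.
season_avgs = (
    games.groupby(["Season", "TeamID"]).mean().add_suffix("_avg").reset_index()
)
print(season_avgs)
```

The same groupby pattern extends to any box-score column, which is how one step produces so many features at once.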

&lt;p&gt;I also created a separate dataset for 2019 data which included all of the previously mentioned features.&lt;/p&gt;

&lt;p&gt;I then had to merge each feature onto the original dataframe.&lt;/p&gt;
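The merge step looks roughly like this (illustrative frames and column names, not the project's actual data):

```python
import pandas as pd

# Main modeling frame and one engineered feature, keyed on Season and TeamID.
main = pd.DataFrame({"Season": [2019, 2019], "TeamID": [1101, 1102], "Seed": [1, 8]})
feature = pd.DataFrame({"Season": [2019, 2019], "TeamID": [1101, 1102], "TrueSkill": [31.2, 24.8]})

# A left merge keeps every game row even when a feature value is missing.
main = main.merge(feature, on=["Season", "TeamID"], how="left")
print(list(main.columns))  # ['Season', 'TeamID', 'Seed', 'TrueSkill']
```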

&lt;h3&gt;
  
  
  Models
&lt;/h3&gt;

&lt;p&gt;I used a combination of two models for my final results: a Logistic Regression and an XGBoost classification model. After obtaining predictions from both models, I found that weighting the Logistic Regression model's predictions by 0.7 and the XGBoost model's by 0.3 produced the best results. &lt;/p&gt;
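The 0.7/0.3 weighting amounts to a simple convex blend of the two models' win probabilities (the numbers below are invented):

```python
import numpy as np

logreg_probs = np.array([0.80, 0.55, 0.30])   # Logistic Regression win probabilities
xgb_probs    = np.array([0.70, 0.45, 0.40])   # XGBoost win probabilities

# Weighted average; still a valid probability since the weights sum to 1.
blended = 0.7 * logreg_probs + 0.3 * xgb_probs
print(blended.round(2))  # [0.77 0.52 0.33]
```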

&lt;p&gt;I validated the model by training on seasons prior to the 2019 season and using the data from 2019 to test the model on.&lt;/p&gt;
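Assuming a Season column in the modeling frame (a hypothetical name for illustration), that validation split is just a boolean mask:

```python
import pandas as pd

# Tiny stand-in frame; the real one has dozens of engineered feature columns.
df = pd.DataFrame({"Season": [2017, 2018, 2018, 2019, 2019],
                   "Team1Won": [1, 0, 1, 1, 0]})

train = df[df["Season"] != 2019]   # fit on earlier seasons
test = df[df["Season"] == 2019]    # evaluate on 2019
print(len(train), len(test))  # 3 2
```

Splitting by season rather than at random keeps the evaluation honest: the model never sees any games from the season it is predicting.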

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Difficult to pick upsets, particularly among the top seeded teams.&lt;/li&gt;
&lt;li&gt;Whether a feature is important depends on the model being used. Feature importances differed quite a bit between the logistic regression model and the XGBoost model.&lt;/li&gt;
&lt;li&gt;Limiting data to only the past few years improved my results. I found that the optimal results came when I limited the training set to only the past 3 years of tournament games.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Initial Results
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The best model picks 55 winners correctly, missing 12, for an accuracy of 0.82 and a log loss of 0.41.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reproduction Instructions
&lt;/h3&gt;

&lt;p&gt;The first step in reproducing my results would be to obtain the data. I have attached all the data that I used in my project in a folder labeled data. However, anyone wishing to use my notebooks for the 2021 NCAA Tournament would need to get an updated dataset. The first place to look would be Kaggle, which hosts a competition every year to predict the tournament outcomes. I obtained the Kenpom data from a user on Kaggle, so to get updated Kenpom data you would either need to find another user who has posted it or scrape it from the Kenpom website. &lt;/p&gt;

&lt;p&gt;Once all the data is obtained, the next step is feature engineering. To reproduce my results I would recommend running my notebooks in the following order: Data Preparation, Massey Ordinals, TrueSkill, and Game Location. The last step is to run Modeling Notebook 2. &lt;/p&gt;

&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;p&gt;I would like to try predicting on all college basketball games for the past few years, instead of predicting on season averages dating &lt;br&gt;
back to 2003. I found that my predictions were getting better when I limited the training data to only the past few years. Therefore, &lt;br&gt;
this leads me to believe that college basketball changes a bit over time. The reason I didn't do this initially was that I would be &lt;br&gt;
unable to use some of my best predicting metrics. The data that I have for the Kenpom metrics only has end of year statistics, and &lt;br&gt;
therefore I would be unable to use them during the regular season. Furthermore, tournament seed was a useful statistic that I wouldn't be &lt;br&gt;
able to use either. Despite these reasons, I would still like to attempt to model it because it would allow me to have a much larger data &lt;br&gt;
set, all being data from the recent past.&lt;/p&gt;

&lt;p&gt;I would also like to try to use individual player quality as a factor in finding the quality of a team. Doing this would allow me to model the data in a different way, and it would also allow me to factor player injuries into my predictions. If a team's star player gets injured before the NCAA tournament, my model would probably prove inadequate at predicting that team's results. By rating players individually, I would be able to estimate the quality of the team without said player by reweighting the contributions of the other players.&lt;/p&gt;

&lt;p&gt;I would also like to try and obtain information from other sources that I did not have for this project. One dataset that I used, &lt;br&gt;
  Kenpom data, was provided by a user on Kaggle. However, there is more data on the Kenpom website behind a paywall. I am considering &lt;br&gt;
  subscribing to the site to try out some of their advanced statistics.&lt;/p&gt;

&lt;p&gt;I would also like to try more models and different ways of combining them to see if I can capture the strengths of each one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sources for Notebooks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/shahules/kenpom-2020"&gt;Kenpom Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/c/mens-machine-learning-competition-2018/discussion/50209"&gt;City's Lat, Long, and Elevation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.microsoft.com/en-us/research/project/trueskill-ranking-system/"&gt;TrueSkill&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/DaveLorenz/NCAAMarchMadnessNN"&gt;Kaggle Competition Submission&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/raddar/paris-madness#Time-to-build-some-models!"&gt;Kaggle Competition Submission&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/zeemeen/ncaa-trueskill-script/execution"&gt;TrueSkill&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/joseleiva/massey-s-ordinal-s-ordinals"&gt;Massey Ordinals&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/addisonhoward/basic-starter-kernel-ncaa-men-s-dataset-2019"&gt;Kaggle Starter Kernel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/re-hoop-per-rate/training-a-neural-network-to-fill-out-my-march-madness-bracket-2e5ee562eab1"&gt;Medium Article&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/machine-learning-madness-predicting-every-ncaa-tournament-matchup-7d9ce7d5fc6d"&gt;Toward Data Science Article&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.basketball-reference.com/"&gt;Basketball-Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/@kvnamipara/a-better-visualisation-of-pie-charts-by-matplotlib-935b7667d77f"&gt;Matplotlib Pie Chart How-To&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Making a March Madness Bracket</title>
      <dc:creator>timhugele</dc:creator>
      <pubDate>Fri, 08 May 2020 15:33:32 +0000</pubDate>
      <link>https://dev.to/timhugele/making-a-march-madness-bracket-1j82</link>
      <guid>https://dev.to/timhugele/making-a-march-madness-bracket-1j82</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.gannett-cdn.com%2Fpresto%2F2019%2F04%2F07%2FPDTF%2Fab41cf89-eee1-424f-8586-4c03200dce85-UST_040619_Final_Four_M_2.jpg%3Fwidth%3D540%26height%3D%26fit%3Dbounds%26auto%3Dwebp" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.gannett-cdn.com%2Fpresto%2F2019%2F04%2F07%2FPDTF%2Fab41cf89-eee1-424f-8586-4c03200dce85-UST_040619_Final_Four_M_2.jpg%3Fwidth%3D540%26height%3D%26fit%3Dbounds%26auto%3Dwebp" alt="March Madness"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a student at the Flatiron School, I was given the opportunity to come up with my own idea for a Module 4 Project. As a lifelong basketball fan, I thought a good project would be to try to use data science to generate the best possible March Madness bracket. Fortunately, every year on Kaggle there is a competition to see who can create the best bracket. As a result, I had the benefit of seeing what other, more experienced data scientists had already done. While helpful, this ended up taking much more of my time than I should have allowed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Research/Data Collection
&lt;/h3&gt;

&lt;p&gt;My first step was going to the 2019 Men's March Madness Kaggle competition page and looking at the data. This is where my problems began. &lt;/p&gt;

&lt;p&gt;When I looked at the data set, I quickly became overwhelmed with the amount of data. There were approximately 70 csv files included with the competition. Being a beginner in data science, I was having a difficult time figuring out where to start.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/l4pTfSeH65zpQW7yo/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/l4pTfSeH65zpQW7yo/giphy.gif" alt="Upset Kid"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I then decided to go through some of the previous Kaggle competitions and look into what some of the top performing models looked like. However, many of the first models I looked at were written in R (which I don't have any experience with yet). After spending too much time trying to understand what was going on in R-coded notebooks full of code that was foreign to me, I ended up focusing on the code of one poster in particular, which can be found &lt;a href="https://www.kaggle.com/c/mens-machine-learning-competition-2019/discussion/89645" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Due to being so intimidated by the many datasets, I ended up going through all of the code in this particular competition submission, trying to follow how the author prepared his data. After spending too much time doing this, I decided to ultimately use the csv files that he created out of the data provided by the competition. &lt;/p&gt;

&lt;h3&gt;
  
  
  Modeling
&lt;/h3&gt;

&lt;p&gt;Kaggle accepted two different types of final results. One was a prediction for the outcomes of every possible matchup from 2014-2018. The second was a prediction for every possible matchup in 2019. I decided to focus on the latter. Furthermore, the metric required for the competition was log loss.&lt;/p&gt;

&lt;p&gt;Now that I had my data, I started trying to run some models. I liked some of the modeling ideas that I got from the previously mentioned notebook that I had been learning from, so I used them in my approach.&lt;/p&gt;

&lt;p&gt;The first idea was to get a couple of benchmark log loss results. A model which gives every matchup a 50/50 likelihood produced a log loss of about 0.69. I then tried a second benchmark to see how a model based purely on the betting markets would do; using a logistic regression model, this produced a log loss of 0.55.&lt;/p&gt;
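The 0.69 benchmark is just the log loss of a coin-flip prediction, which works out to the natural log of 2:

```python
import math

# A constant 0.5 prediction scores -ln(0.5) = ln(2) on every game,
# so the average log loss is about 0.693 no matter what the outcomes are.
print(round(-math.log(0.5), 3))  # 0.693
```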

&lt;p&gt;For my final result, I ran a logistic regression model based on the Adjusted Offensive and Defensive Efficiencies of each team. This produced a log loss of 0.52 (on 2016 data).&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusions
&lt;/h3&gt;

&lt;p&gt;While I was able to produce a model that performed better than the betting odds, the model didn't really do as well as it appears. It ultimately picked the top seed in nearly every matchup in the 2019 tournament, picking an upset in only 11 of the 2278 possible matchups. While this does disclose some information (seeds were better indicators of success than betting markets), I would have liked to have produced a result superior to just picking the top seeds every time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Future Work
&lt;/h3&gt;

&lt;p&gt;For future work I would like to go back and create the data set myself. I ended up relying on someone else's data due to the slow progress that I was making, and the fact that I had an approaching deadline for finishing my project.&lt;/p&gt;

&lt;p&gt;I would also like to use more features than just Adjusted Offensive and Defensive Efficiency. I used these because they were the features that worked best in the notebook that I was learning from, but I would at least like to test out many more features to see if I can come up with better results. Specifically, I would like to use the Massey Ordinals, which collect many different rankings of college basketball teams, see which ones tend to make the best predictions, and implement those in my model.&lt;/p&gt;

&lt;p&gt;I would also like to engineer a TrueSkill feature. This would produce my own rankings of how good each team in the tournament is. &lt;/p&gt;
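TrueSkill itself models each team's skill as a Gaussian that is updated after every game. As a much simpler stand-in for the same idea (this is an Elo-style update, not TrueSkill, with conventional default constants):

```python
# Elo-style rating update: each result nudges the winner up and the loser down,
# by more when the result was a surprise.
def elo_update(winner, loser, k=32.0):
    expected_win = 1.0 / (1.0 + 10 ** ((loser - winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return winner + delta, loser - delta

a, b = 1500.0, 1500.0
for _ in range(3):         # team A beats team B three games in a row
    a, b = elo_update(a, b)
print(round(a), round(b))  # A's rating climbs, B's falls by the same amount
```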

&lt;p&gt;And finally, I would also like to create a feature that finds each school's distance from where its games are played and see whether teams that play close to home experience a home-court advantage.&lt;/p&gt;
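The distance part of that feature can be sketched with the haversine formula (the coordinates below are approximate and only for illustration):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# e.g. Lawrence, KS to Indianapolis, IN: roughly 785 km
print(round(haversine_km(38.97, -95.24, 39.77, -86.16)))
```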

&lt;h3&gt;
  
  
  Lessons Learned
&lt;/h3&gt;

&lt;p&gt;I realized that in the future I need to think through my approach to a project before diving in. I did not do that with this project, and as a result I wasted a significant amount of time and produced a project that I was disappointed with.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Predicting Hong Kong Horse Racing Outcomes</title>
      <dc:creator>timhugele</dc:creator>
      <pubDate>Mon, 20 Apr 2020 14:36:43 +0000</pubDate>
      <link>https://dev.to/timhugele/predicting-hong-kong-horse-racing-outcomes-562d</link>
      <guid>https://dev.to/timhugele/predicting-hong-kong-horse-racing-outcomes-562d</guid>
      <description>&lt;p&gt;&lt;a href="https://i.giphy.com/media/lH6KM3wmpG1ck/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/lH6KM3wmpG1ck/giphy.gif" alt="Horse Racing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a student at the Flatiron School, for my module 3 project I teamed up with Abzal Seitkaziyev to try and predict the winner of horse races. &lt;/p&gt;

&lt;h3&gt;
  
  
  Data Collection
&lt;/h3&gt;

&lt;p&gt;We started by getting our data from Kaggle, which had a data set from the Hong Kong Jockey Club website. The races included were from 2014 to 2017.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Cleaning
&lt;/h3&gt;

&lt;p&gt;The first step that I undertook was looking at the data and seeing if there were any categories that provided information that I didn't think would be useful. After dropping a large number of the columns, I checked the remaining categories for null values. &lt;/p&gt;

&lt;p&gt;Fortunately, after removing unwanted columns there was only one column with null values. Since there were relatively few of them, I simply removed those rows from the data set. &lt;/p&gt;

&lt;p&gt;However, there were some other issues with the data that needed to be dealt with. One was that there were missing values that didn't show up as missing because they were input as '---'. As a result, I searched columns for these types of missing values and removed them as well. &lt;/p&gt;

&lt;h3&gt;
  
  
  Feature Engineering
&lt;/h3&gt;

&lt;p&gt;Feature engineering ended up being by far my main focus while undertaking the project. In hindsight, I should have planned ahead and set a limit on how much time I would spend engineering new features. As it was, I ended up having to rush through the encoding and modeling portions of the project in order to finish in time. &lt;/p&gt;

&lt;p&gt;While I won't go through all of the features that I engineered, I will comment on a couple of the main things that I focused on. The first thing that I knew I needed to do was make sure that I wasn't using future information to predict the outcome of races. &lt;/p&gt;

&lt;p&gt;For example, how fast a horse ran was clearly an important component in creating the model. However, we had to make sure to remove the speed from a particular race from the prediction process for that race, because that is not information that we would have going in. Therefore, when factoring in the fastest a horse had run up until that point, I made sure to exclude the time from the race being predicted.&lt;/p&gt;

&lt;p&gt;Additionally, in order to use the categorical variables in my model, I knew that I had to encode them. I ultimately decided to use target encoding, but I knew that I had to be careful because of the possibility of target leakage. Since target encoding uses information about the target in the encoding process, it can bias the prediction process. You want to make predictions without any knowledge of the result, because in the future, when using the model, I would not have access to that information. Therefore, I used a pipeline in order to make sure that leakage did not occur.&lt;/p&gt;
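The leakage-safe idea can be shown in a few lines: fit the encoding statistics on training rows only, then apply them to held-out rows, which is what wrapping the encoder in a pipeline automates per fold. Column names and values below are invented:

```python
import pandas as pd

df = pd.DataFrame({
    "jockey": ["a", "b", "a", "c", "b", "a"],
    "won":    [1,   0,   0,   1,   0,   1],
})
train, test = df.iloc[:4], df.iloc[4:]

# Fit the target statistics on the training rows only...
means = train.groupby("jockey")["won"].mean()
overall = train["won"].mean()          # fallback for unseen categories

# ...then apply them to the held-out rows, so no test target leaks in.
test_encoded = test["jockey"].map(means).fillna(overall)
print(test_encoded.tolist())  # [0.0, 0.5]
```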

&lt;h3&gt;
  
  
  Modeling
&lt;/h3&gt;

&lt;p&gt;When I got to the modeling, at first I tried a basic version of a number of models to see which one performed best right off the bat. I tried a decision tree, random forest, logistic regression, support vector machine, AdaBoost, and a gradient boosted classifier.&lt;/p&gt;

&lt;p&gt;The metric that I judged the models on was area under the curve (AUC). I went with this metric because of what we were ultimately using the model for. I plan on using this model to bet on races, and therefore think that the most important things to keep in mind are the true positive and false positive rates. In other words, how often will I win when the model tells me I should place a bet on a particular horse? The area under the curve metric captures these factors.&lt;/p&gt;
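AUC can be read as the probability that a randomly chosen winner is scored above a randomly chosen loser. A small pure-Python check of that reading, with invented scores:

```python
def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.7]))  # 0.75
```

In practice a library routine such as scikit-learn's roc_auc_score does the same job; the hand-rolled version is only here to make the pairwise interpretation concrete.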

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;Ultimately, I was able to achieve an area under the curve score of approximately 0.78 using the logistic regression model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/26ufkBRB1E836CxYA/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/26ufkBRB1E836CxYA/giphy.gif" alt="Gambling Gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Future Work
&lt;/h3&gt;

&lt;p&gt;I think that I can improve on my area under the curve score fairly easily by working more with the models. I spent most of my time engineering features and didn't have as much time as I would have liked to tune the models and check which features should be included and which shouldn't.&lt;/p&gt;

&lt;p&gt;Furthermore, I would like to collect more data since the Kaggle dataset only included races from 2014-2017.&lt;/p&gt;

&lt;p&gt;I would also like to see if I can get better results by trying different types of bets instead of simply picking the winning horse.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>My Experience Trying to Learn Web Scraping</title>
      <dc:creator>timhugele</dc:creator>
      <pubDate>Mon, 23 Mar 2020 15:54:51 +0000</pubDate>
      <link>https://dev.to/timhugele/my-experience-trying-to-learning-web-scraping-511l</link>
      <guid>https://dev.to/timhugele/my-experience-trying-to-learning-web-scraping-511l</guid>
      <description>&lt;p&gt;For a class project, I was attempting to scrape data from IMDb. I started by using Beautiful Soup. This was my first attempt at web scraping, and I am a beginner in data science and coding in general, having only started this data science program 2 weeks earlier. Despite struggling to get a handle on the syntax, after a lot of trial and error I eventually was able to start getting the data that I was looking for. &lt;/p&gt;

&lt;p&gt;I was attempting to scrape the data from a page that listed the top grossing movies from 1990 through March of 2020. This particular list contained 50 movies per page. In order to do my trial and error method quickly, I was practicing scraping only the 50 movies on the first page. &lt;/p&gt;

&lt;p&gt;My first task was to scrape the movie titles. To do this I simply went through each movie entry, pulled out the name, and appended it to a list of movie titles. (Only after I was too far along did I find out that using a dictionary for each movie would probably have been a better strategy. Live and learn.) After I got that to work, I went through most of the data that was included for each movie: the release year, IMDb rating, Metascore, MPAA rating, runtime, genres, votes cast for the movie, and gross earnings.&lt;/p&gt;

&lt;p&gt;With each successive category I was getting more comfortable with the process and was getting much quicker. Once I got finished getting each of these categories I was starting to feel like I was getting it...&lt;/p&gt;

&lt;p&gt;Then the problems started. As I said before, in order to have the process go quickly, I was only scraping one page at a time, for a total of 50 data points in each category. I did make sure that my loop for running through each page worked, but then I cut back to only 1 page at a time. When I finally tried to go through 2000 movies I started getting errors. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/h36vh423PiV9K/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/h36vh423PiV9K/giphy.gif" alt="Banging Head Gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing Values&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What I eventually realized was that missing data was causing my errors. For example, some of the movies had no IMDb rating. When trying to navigate to a particular line of text, like below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UD-xe36w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jsefbd2ks5hqp3cr044u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UD-xe36w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/jsefbd2ks5hqp3cr044u.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;sometimes there was no text to find.&lt;/p&gt;

&lt;p&gt;What I eventually figured out was that an if statement checking whether the data point exists could fix it. Since the rating was wrapped in "strong" tags, I just needed an if statement checking whether the strong tag existed. If it did not, the else branch would append None to the list.&lt;/p&gt;
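&lt;p&gt;That check can be sketched like this (again with simplified stand-in markup; the class name is an assumption):&lt;/p&gt;

```python
from bs4 import BeautifulSoup

# Two entries: one with a rating in <strong>, one without, to mimic
# the missing-rating case that was breaking the scraper.
html = """
<div class="lister-item-content"><strong>8.1</strong></div>
<div class="lister-item-content"></div>
"""
soup = BeautifulSoup(html, "html.parser")

ratings = []
for entry in soup.find_all("div", class_="lister-item-content"):
    if entry.strong:                       # strong tag present: grab the rating
        ratings.append(float(entry.strong.get_text()))
    else:                                  # strong tag missing: record None
        ratings.append(None)
```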

&lt;p&gt;What this taught me was that, as a general rule, it is a good idea to make your code more robust by planning for the possibility of missing values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing Values Part 2&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The second big problem I faced was similar to the first. This time, however, it involved missing values other than the one I was targeting. It arose when I decided to try to get some data beyond what was listed on the pages I had been scraping. The information I had previously collected was all on the list page, which contained the top-grossing movies since 1990. But I was also interested in each movie's budget, and for that I had to scrape each individual movie's IMDb page.&lt;/p&gt;

&lt;p&gt;Being able to go to each movie's page was easy enough. The problem arose when trying to obtain the budget value. On the first movie's page, I simply entered the code that pulled up the budget data:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;soup2.findAll('h4', class_='inline')[13].nextSibling.rstrip()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The problem came when I tried to go through all 2000 movies. My code searched for every h4 inline entry and selected the budget from that list by position. There were usually 21 h4 inline data points for each movie, but sometimes values were missing. For example, some pages had no information on filming locations. When that value was missing, the budget was the 12th h4 inline value instead of the 13th. Other times, several values were missing and it would be only the 8th.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7jBWjqVM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/vr4jkc75sniv9u4gjyi4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7jBWjqVM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/vr4jkc75sniv9u4gjyi4.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ultimately, I used if/elif statements to check each h4 inline value to see whether it was the budget, and when it was, I appended that value to the budget list. If the budget value itself was missing, I used the process I described earlier to handle it. &lt;/p&gt;
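&lt;p&gt;The idea is to match on the label text instead of a fixed index, so missing entries no longer shift the budget out of position. A minimal sketch, with simplified stand-in markup for the movie page:&lt;/p&gt;

```python
from bs4 import BeautifulSoup

# Movie detail pages list facts as <h4 class="inline">Label:</h4> value pairs;
# checking each label survives pages where some entries are missing.
html = """
<h4 class="inline">Country:</h4> USA
<h4 class="inline">Budget:</h4> $160,000,000
"""
soup = BeautifulSoup(html, "html.parser")

budget = None
for h4 in soup.find_all("h4", class_="inline"):
    if h4.get_text().startswith("Budget"):
        # The value follows the tag as a plain text node
        budget = h4.next_sibling.strip()
        break
```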

&lt;p&gt;Being pretty new to coding, I needed quite a bit of time to figure these problems out (along with the multitude of smaller problems that kept arising). The truly frustrating part is how simple the answer usually turns out to be, and the intense aggravation of not having figured it out sooner. &lt;/p&gt;

&lt;p&gt;On the other hand, it is incredibly rewarding to struggle for so long to get something right, and then finally run the code and have it all come together. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why I joined the Flatiron Data Science Program</title>
      <dc:creator>timhugele</dc:creator>
      <pubDate>Mon, 02 Mar 2020 15:07:04 +0000</pubDate>
      <link>https://dev.to/timhugele/why-i-joined-the-flatiron-data-science-program-4fho</link>
      <guid>https://dev.to/timhugele/why-i-joined-the-flatiron-data-science-program-4fho</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WcQcrVuG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://media1.giphy.com/media/wpoLqr5FT1sY0/giphy.gif%3Fcid%3D790b7611770b534a9ccee1cdc5886e9f223658c7e8b393bd%26rid%3Dgiphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WcQcrVuG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://media1.giphy.com/media/wpoLqr5FT1sY0/giphy.gif%3Fcid%3D790b7611770b534a9ccee1cdc5886e9f223658c7e8b393bd%26rid%3Dgiphy.gif" alt="Dog Typing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My name is Tim Hugele, and I just started the Data Science program at the Flatiron School in Houston, TX.&lt;/p&gt;

&lt;p&gt;I originally studied Economics at Texas A&amp;amp;M and Petroleum Engineering at the University of Houston. However, after failing to find a good job I decided to try something different. &lt;/p&gt;

&lt;p&gt;I chose to study Data Science because:&lt;/p&gt;

&lt;p&gt;1) I wanted to gain skills that would be in demand in the labor market. &lt;/p&gt;

&lt;p&gt;2) I felt like my comfort with calculus and statistics from my background in Engineering and Economics would make Data Science a good fit.&lt;/p&gt;

&lt;p&gt;3) As a kid I was a basketball fanatic and would buy books on basketball statistics. I loved the idea of being able to go beyond the simple statistics that were in those books and possibly finding unique trends in the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/BmmfETghGOPrW/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/BmmfETghGOPrW/giphy.gif" alt="Hangover Gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4) Data Science seems to be a versatile skill set that can be packaged with either of my previously acquired degrees.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
