Sports are all about winning championships. Dynasties are built on them, and players become legends when the titles they bring their teams cement their legacies. In the NBA, superstars control the game more than in any other team sport because each team has only five players on the court at one time. Therefore, every team tries to get as many superstars as possible. There are three ways to get a superstar: attract one in free agency, trade for one, or draft and develop one. The first two require a lot of capital, in terms of money or talent. So teams try to draft rookies and develop them, without having to give up money or talent. Therefore, picking the right player becomes vital to the success of a franchise.
For my analysis, I will be looking at approximately 1300 rookies and their stats and trying to predict whether they lasted in the NBA for more than five years. This is a good start for trying to predict superstar talent because rookie contracts generally last for four years. If the team sees value in the player, then it can sign him to another contract when the rookie one ends. (Also, the data set where I got all of this information from had a five-years-played target, so it was convenient.) This is the first step, and as a rookie myself, I will be trying to use methods that I've recently learned, like logit and decision trees, to create a model that predicts whether or not a rookie will get a second contract.
First, I decided to take a look at all of the correlations in the data.
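For anyone who wants to reproduce this kind of check, here is a minimal sketch of computing a correlation matrix with pandas. It runs on synthetic data, not the real rookie table, and the column names (MIN, FGA, FGM, 3P%, TARGET_5Yrs) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the rookie table. Column names are assumptions,
# and the numbers are made up purely to illustrate the mechanics.
rng = np.random.default_rng(0)
n = 500
minutes = rng.uniform(5, 35, n)                 # minutes per game
fga = 0.3 * minutes + rng.normal(0, 1, n)       # shot attempts grow with minutes
fgm = 0.45 * fga + rng.normal(0, 0.5, n)        # makes grow with attempts
three_pct = rng.uniform(0, 45, n)               # generated independently of the rest
target = (minutes + rng.normal(0, 8, n) > 18).astype(int)

df = pd.DataFrame({
    "MIN": minutes,
    "FGA": fga,
    "FGM": fgm,
    "3P%": three_pct,
    "TARGET_5Yrs": target,
})

# Pairwise Pearson correlations between every pair of columns.
corr = df.corr()
print(corr.round(2))

# Correlation of each stat with the five-year target, strongest first.
target_corr = corr["TARGET_5Yrs"].drop("TARGET_5Yrs").sort_values(ascending=False)
print(target_corr.round(2))
```

The resulting matrix is what a heat map would color in. Sorting the target column is a quick way to see which stats move with career length.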
As you can see in the picture above, there are a lot of correlations in the data. Most of the correlations make sense, like the relationship between shot attempts and made shots (the more you shoot, the more you'll make). Furthermore, there are a lot of correlations between counting stats (like points, rebounds, and assists) and minutes played, which also makes sense because the more you play, the more of these stats you can get. There seems to be a signal that the number of minutes played strongly affects the longevity of the player's career. One surprising observation was that there was little to no correlation between shooting threes and whether the player survived for five years, because in today's NBA the three-point shot is the most important shot in the game. Next, I will look at the players who got to the five-year mark and compare them to the players who did not.
For this, I decided to analyze the data by incorporating some visualizations. I compared all of the given stats; however, the most important differences I noticed were in the number of games played and the number of minutes per game.
This makes sense since the good players usually get to play more and show off their talents and skill with more time, thus earning a second contract.
Sometimes it might be a bad idea to play more.
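The group comparison can also be done numerically. The sketch below uses synthetic data and assumed column names (GP, MIN, TARGET_5Yrs) to compare the average stats of the two groups.

```python
import numpy as np
import pandas as pd

# Synthetic example: players who lasted (1) are generated with more games
# and minutes than those who did not (0). Column names are assumptions.
rng = np.random.default_rng(1)
n = 400
lasted = rng.integers(0, 2, n)
df = pd.DataFrame({
    "GP": np.clip(rng.normal(50 + 15 * lasted, 12, n), 1, 82).round(),
    "MIN": np.clip(rng.normal(14 + 6 * lasted, 5, n), 1, 40),
    "TARGET_5Yrs": lasted,
})

# Mean of each stat within each group (row 0 = did not last, row 1 = lasted).
group_means = df.groupby("TARGET_5Yrs")[["GP", "MIN"]].mean()
print(group_means.round(1))

# Difference between the groups; a large gap flags a stat worth plotting.
gap = group_means.loc[1] - group_means.loc[0]
print(gap.round(1))
```

Stats with the biggest gaps are natural candidates for a side-by-side box plot or histogram.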
Now I will try running a basic logistic regression model and test how well it does. I created an 80/20 train/test split to score my baseline model on. Then I used the train set to fit the model. Then I created a scorecard and tested the model using it. I used "roc_auc" as the base metric to score the model on the data and ended up getting this:
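Here is a minimal sketch of that baseline in scikit-learn, assuming a plain 80/20 split and ROC AUC computed on the held-out 20%. The data is synthetic and the scaling step is my own addition, so the printed score says nothing about the real rookie data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and a binary target (stand-in for "lasted five years").
rng = np.random.default_rng(42)
n = 1300
X = rng.normal(size=(n, 5))
logits = 1.2 * X[:, 0] + 0.8 * X[:, 1]          # only two features carry signal
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# 80/20 split, stratified so both sets keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale the features (an assumption, not from the post), then fit the model.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# ROC AUC is computed from predicted probabilities, not hard 0/1 labels.
proba = model.predict_proba(X_test)[:, 1]
baseline_auc = roc_auc_score(y_test, proba)
print(f"baseline ROC AUC: {baseline_auc:.3f}")
```

The "roc_auc" scoring string in scikit-learn refers to this same metric, so the number is comparable with scores from cross_val_score(..., scoring="roc_auc").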
This is the baseline score that I now want to beat. Time to do some feature engineering.
For my next step, I will try to find the most important features needed to create the model. The correlation heat map earlier indicated that a lot of the stats were related to each other, so reducing the number of features is necessary in order to make the model better. I decided to test feature importance against the scorecard using permutation importance. My results brought me to this graph.
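Permutation importance shuffles one feature at a time and measures how much the score drops. Below is a sketch using scikit-learn's permutation_importance on synthetic data. The post does not say which library produced the graph, so treat this as one possible way to do it. The feature names, including the deliberately useless NOISE column, are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: GP matters most, FGM matters some, NOISE not at all.
rng = np.random.default_rng(7)
n = 1300
X = pd.DataFrame({
    "GP": rng.normal(size=n),
    "FGM": rng.normal(size=n),
    "NOISE": rng.normal(size=n),
})
logits = 2.0 * X["GP"] + 1.0 * X["FGM"]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Shuffle each column 10 times on the test set and record the drop in ROC AUC.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)

# Mean score drop per feature, largest first.
importances = pd.Series(result.importances_mean, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```

One caveat worth keeping in mind: when two features are strongly correlated, shuffling one of them may barely hurt the score because the model can lean on the other, so both can look less important than they really are.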
The results are a bit surprising. Some of the signals make sense. For example, field goals made and games played being two of the highest signals is understandable. The more games a rookie plays and the more field goals they make, the more they are able to showcase their talents. I did not expect the model to give little importance to the number of minutes played, since more minutes let rookies rack up more counting stats. Furthermore, the model found a lot of correlated variables important, like FGM, FGA, and FG%. But hey, maybe the model is finding something my intuition can't, right?
I dropped the low-importance features and retrained the model, but the score went down. Seems like the features that I took out contributed to the overall accuracy. Time to go back to the drawing board.
Clearly the permutation importance algorithm missed some of the signals, which caused the accuracy to go down. My next plan is to try PCA as another way to reduce the features. Hopefully this will produce a better score.
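As a rough sketch of what that PCA step could look like, the pipeline below standardizes the features, keeps enough principal components to explain 95% of the variance, and feeds them to logistic regression. The data is synthetic, and the 95% threshold is an arbitrary assumption, not a recommendation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: six correlated "stats" built from two hidden factors.
rng = np.random.default_rng(3)
n = 1300
latent = rng.normal(size=(n, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.3 * rng.normal(size=(n, 6))
y = (rng.random(n) < 1 / (1 + np.exp(-1.5 * latent[:, 0]))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=3, stratify=y
)

# PCA is scale-sensitive, so standardize first. A float n_components asks
# for the smallest number of components reaching that share of variance.
pca_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    LogisticRegression(max_iter=1000),
)
pca_model.fit(X_train, y_train)

n_kept = pca_model.named_steps["pca"].n_components_
pca_auc = roc_auc_score(y_test, pca_model.predict_proba(X_test)[:, 1])
print(f"components kept: {n_kept} of {X.shape[1]}")
print(f"ROC AUC with PCA: {pca_auc:.3f}")
```

Putting PCA inside the pipeline means it is fit on the training set only, so the test score stays an honest comparison against the baseline.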
If anyone has any more ideas about what I should try, I would appreciate hearing them in the comments!
Data source: https://data.world/exercises/logistic-regression-exercise-1