DEV Community

loganwohlers

Scraping the NBA p1- Players/Teams

As a side project, I've been building out an NBA API using free statistics from basketball-reference.com. Ultimately my goal is to provide a simple API for anyone who wants to use basketball stats, so no one else has to jump through all the hoops that I have. Why the NBA doesn't provide free JSON data is beyond me, but that's where this project comes in. The best current option is balldontlie.io, which is nice but doesn't provide all the statistics that the API I'm envisioning should have. To remedy this, I've been scraping the data en masse and saving it to my own database, which will be hosted somewhere with documented endpoints. The project is decently close to release, and I've got the scraping process down, so I figured I would expand on it here. I started this project in Rails using the Nokogiri gem but have since switched to Node with Cheerio/Puppeteer for the scraping work. The process is basically the same, but since I've been more into JS lately I'll go from that perspective. So without further ado, here's part one of this series: Players and Teams.

Let's start with teams, since I'll go into much more detail on them in a later post. For now, the 30 teams in the NBA (RIP SONICS) are hardcoded in a static JSON file, with each team represented by an object that contains its name, city, conference, and tri-code (e.g. LAL for Los Angeles Lakers, ATL for Atlanta Hawks, and so on). There is a corresponding table in the database with this information, so whenever team data needs to be seeded, the process is as simple as running through this file and creating a row for every team. In my current build, teams also have seasonal data with their average stats as well as their opponents', which can be found at this URL (https://www.basketball-reference.com/leagues/NBA_2019.html). This will be expanded on in a later post, but for now a simple team table is more than enough to get started with our players.
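
As a rough sketch of that setup, the static file and the seeding step might look something like the following. The field names (`name`, `city`, `conference`, `triCode`), the `seedTeams` helper, and the in-memory array standing in for the database table are all assumptions for illustration, and only two of the 30 teams are shown.

```javascript
// Hypothetical contents of the static teams file (only 2 of the 30 teams shown).
// The field names are assumptions for illustration, not the project's actual schema.
const teams = [
  { name: 'Lakers', city: 'Los Angeles', conference: 'West', triCode: 'LAL' },
  { name: 'Hawks', city: 'Atlanta', conference: 'East', triCode: 'ATL' },
]

// Stand-in for the database table: seedTeams creates one row per team,
// skipping any tri-code that is already present so re-running is harmless.
const seedTeams = (teamList, table = []) => {
  for (const team of teamList) {
    const exists = table.some((row) => row.triCode === team.triCode)
    if (!exists) {
      table.push({ id: table.length + 1, ...team })
    }
  }
  return table
}

const teamTable = seedTeams(teams)
console.log(teamTable.map((row) => `${row.id}: ${row.triCode}`))
```

Keying on the tri-code is a design choice made here because it is the value the player table exposes for each team, so it doubles as the lookup key later on.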

Now on to some actual scraping for player data. The current database is set up such that a player is their own entity. That is, a player doesn't BELONG to a team; rather, a player has player_seasons, each of which belongs to a team and a season. For every season, Basketball Reference provides a table containing every player who finished the season on an NBA roster along with their season averages (e.g. https://www.basketball-reference.com/leagues/NBA_2019_per_game.html). A quick inspection of the page reveals that the table has an id of per_game_stats (so the selector is #per_game_stats). Using any scraping method, we first load this URL and then go straight to this table.
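
To make that relationship concrete, here is a minimal sketch of the entities as plain JavaScript objects. The field names and ids, the placeholder player, and the `teamsForPlayer` helper are all hypothetical stand-ins for whatever the real schema uses.

```javascript
// Hypothetical rows; ids and field names are illustrative only.
const players = [{ id: 1, name: 'Example Player' }]
const teams = [
  { id: 10, triCode: 'LAL' },
  { id: 11, triCode: 'ATL' },
]
const seasons = [
  { id: 100, year: 2018 },
  { id: 101, year: 2019 },
]

// A player has no team column. Each player_season row links
// one player to one team for one season.
const playerSeasons = [
  { id: 1000, playerId: 1, teamId: 11, seasonId: 100 },
  { id: 1001, playerId: 1, teamId: 10, seasonId: 101 },
]

// Look up which team a player was on in each season by walking player_seasons.
const teamsForPlayer = (playerId) =>
  playerSeasons
    .filter((ps) => ps.playerId === playerId)
    .map((ps) => ({
      year: seasons.find((s) => s.id === ps.seasonId).year,
      triCode: teams.find((t) => t.id === ps.teamId).triCode,
    }))

console.log(teamsForPlayer(players[0].id))
```

Because the team lives on the player_season row, a player changing teams between seasons is just another row rather than an update to the player.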

All of the actual player info is contained in the table body, so we go to the body, find all the table rows, and iterate through them with something like a for loop. For every row in the body, we loop through all of its td cells and grab their data. I just made an empty array, and then for every row mapped the stat names and values from its td cells into an object that was pushed onto the array. The names of the stats are actually provided on the td's themselves as an attribute called data-stat, which lets you forgo the table header column names and get all the relevant data straight from the body. Here's a snippet of what that simple code looked like.

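 const cheerio = require('cheerio')

 // parsePerGame and html are illustrative names; html is the page source loaded earlier
 const parsePerGame = (html) => {
     const $ = cheerio.load(html)
     const result = []
     const tableBody = $('#per_game_stats').children('tbody')
     tableBody.find('tr').each((rowIndex, tr) => {
         const row = {}
         // Each td carries its stat name in its data-stat attribute
         $(tr).find('td').each((cellIndex, td) => {
             const statName = $(td).data('stat')
             const statVal = $(td).text()
             row[statName] = statVal
         })
         result.push(row)
     })
     return result
 }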
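
One thing to note about the snippet above is that every value comes back as a string, and a row with no td cells at all produces an empty object (which would happen if, say, the table repeats its header row inside the body). A small post-processing pass along these lines could tidy that up. This is a sketch rather than the project's actual code: the `cleanRows` helper is hypothetical, and the sample keys and values are made up for illustration.

```javascript
// Hypothetical helper: tidies the array produced by the scraping loop.
const cleanRows = (rows) =>
  rows
    // Drop rows that had no td cells and so produced an empty object
    .filter((row) => Object.keys(row).length > 0)
    .map((row) => {
      const cleaned = {}
      for (const [statName, statVal] of Object.entries(row)) {
        const text = String(statVal).trim()
        // Convert text that parses as a number (e.g. "27.4" or ".510");
        // leave everything else, including empty cells, as a string
        const isNumeric = text !== '' && !Number.isNaN(Number(text))
        cleaned[statName] = isNumeric ? Number(text) : text
      }
      return cleaned
    })

// Made-up sample input in the shape the scraping loop returns
const sample = [
  { player: 'Example Player', pos: 'PG', pts_per_g: '27.4', fg_pct: '.510' },
  {},
]
console.log(cleanRows(sample))
```

Doing the conversion in a separate pass keeps the scraping loop itself as simple as it is above.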

With all this set up, to actually seed the database I just had to find or create a row for the player by name (the first value in the row), find the pre-seeded team's id using its tri-code, and create a new player season row with references to that player and team. This process is actually very fast, given there are only ~600-800 players contained in this table every season.
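
Those three steps can be sketched with in-memory arrays standing in for the database tables. Everything named here is an assumption for illustration: the `findOrCreatePlayer` and `seedPlayerSeasons` helpers, the table shapes, and the `player` and `team_id` row keys (the real keys are whatever data-stat values the page uses).

```javascript
// In-memory stand-ins for the database tables (hypothetical shapes).
const db = {
  teams: [
    { id: 1, triCode: 'LAL' },
    { id: 2, triCode: 'ATL' },
  ],
  players: [],
  playerSeasons: [],
}

// Find a player row by name, or create it if it does not exist yet.
const findOrCreatePlayer = (name) => {
  let player = db.players.find((p) => p.name === name)
  if (!player) {
    player = { id: db.players.length + 1, name }
    db.players.push(player)
  }
  return player
}

// For each scraped row: resolve the team and player, then add a player_season.
const seedPlayerSeasons = (rows, seasonYear) => {
  for (const row of rows) {
    const team = db.teams.find((t) => t.triCode === row.team_id)
    if (!team) continue // unknown tri-code: skip rather than guess
    const player = findOrCreatePlayer(row.player)
    db.playerSeasons.push({
      playerId: player.id,
      teamId: team.id,
      seasonYear,
      stats: row,
    })
  }
}

// Made-up rows in the shape the scraping loop returns
seedPlayerSeasons(
  [
    { player: 'Example Player', team_id: 'LAL', pts_per_g: '27.4' },
    { player: 'Another Player', team_id: 'ATL', pts_per_g: '19.1' },
    { player: 'Example Player', team_id: 'ATL', pts_per_g: '10.0' },
  ],
  2019
)
console.log(db.players.length, db.playerSeasons.length)
```

Matching on name alone mirrors the description above. Two different players who share a name would collapse into one row, so a real build may want a stronger key.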

Next week I'll dive a bit deeper into the harder part: taking a season and seeding a boxscore for every game (1230 in a season). So stay tuned.

Thanks for reading, and let me know if you have any questions or comments!

Logan
