Introduction
In this article, we'll walk through how to set up an NBA data lake using AWS services. A data lake is a centralized repository that allows you to store structured and unstructured data at any scale. By leveraging AWS services such as Amazon S3, AWS Glue, and Amazon Athena, we can automate the creation of a scalable and efficient data lake for NBA analytics. This setup enables seamless data storage, transformation, and querying to facilitate insightful analysis.
Overview
The setup_nba_data_lake.py script automates the process of creating a data lake for NBA-related data. Specifically, it performs the following actions:
- Creates an Amazon S3 bucket to store raw and processed NBA data.
- Uploads sample NBA data (in JSON format) to the S3 bucket.
- Configures AWS Glue to create a database and an external table for structured querying.
- Integrates Amazon Athena for querying NBA data efficiently using SQL.
By the end of this setup, users will have a fully functional data lake that can be queried using Athena, making it easier to analyze NBA statistics and trends.
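To make the overview concrete, here is a simplified sketch of the kind of boto3 calls involved. It is not the full script from the repository; the bucket name, database name, and column list below are taken from the examples later in this article, and the real script adds configuration and error handling.

```python
# Simplified sketch of the core resource-creation calls (not the full repo script).
# Assumes AWS credentials are already configured for boto3.
import json

import boto3

BUCKET_NAME = "victor-analytics-data-lake"   # replace if you chose a different name
GLUE_DATABASE = "glue_nba_data_lake"

s3 = boto3.client("s3")
glue = boto3.client("glue")

# 1. Create the S3 bucket that will hold raw and processed NBA data
#    (outside us-east-1 you must also pass CreateBucketConfiguration)
s3.create_bucket(Bucket=BUCKET_NAME)

# 2. Upload sample NBA data in JSON Lines form so Athena can read it row by row
sample_rows = [{"FirstName": "Sample", "LastName": "Player", "Position": "PG", "Team": "FA"}]
s3.put_object(
    Bucket=BUCKET_NAME,
    Key="raw-data/nba_player_data.json",
    Body="\n".join(json.dumps(row) for row in sample_rows).encode("utf-8"),
)

# 3. Create a Glue database and an external table over the raw data,
#    so it can be queried from Athena with SQL
glue.create_database(DatabaseInput={"Name": GLUE_DATABASE})
glue.create_table(
    DatabaseName=GLUE_DATABASE,
    TableInput={
        "Name": "nba_players",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "firstname", "Type": "string"},
                {"Name": "lastname", "Type": "string"},
                {"Name": "position", "Type": "string"},
                {"Name": "team", "Type": "string"},
            ],
            "Location": f"s3://{BUCKET_NAME}/raw-data/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
        },
    },
)
print("Data lake resources created.")
```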
Prerequisites
Before running the script, ensure you have the following:
AWS Account
You must have an active AWS account to create and manage the required resources.
SportsData API Key
- Create a free account at SportsData.
- Navigate to "Developers" > "API Resources" > "Introduction & Testing."
- Sign up for the "SportsDataIO API Free Trial," selecting NBA as the preferred data source.
- Retrieve your API key from "Query String Parameters."
IAM Role/Permissions
Ensure the IAM user or role executing the script has the following permissions:
- S3: s3:CreateBucket, s3:PutObject, s3:DeleteBucket, s3:ListBucket
- Glue: glue:CreateDatabase, glue:CreateTable, glue:DeleteDatabase, glue:DeleteTable
- Athena: athena:StartQueryExecution, athena:GetQueryResults
You can create a policy using the policy file from the repository and apply it to the user.
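If you would rather attach the policy programmatically instead of creating it in the console, a rough sketch looks like the following. The policy file in the repository remains the authoritative version; "your-iam-user" is a placeholder, and the Glue and Athena statements are left unscoped for brevity.

```python
# Rough sketch: attach a least-privilege inline policy to the IAM user that
# will run the setup script. The policy file in the repository is the
# authoritative version; "your-iam-user" is a placeholder.
import json

import boto3

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:CreateBucket", "s3:PutObject", "s3:DeleteBucket", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::victor-analytics-data-lake",
                "arn:aws:s3:::victor-analytics-data-lake/*",
            ],
        },
        {
            "Effect": "Allow",
            "Action": ["glue:CreateDatabase", "glue:CreateTable", "glue:DeleteDatabase", "glue:DeleteTable"],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": ["athena:StartQueryExecution", "athena:GetQueryResults"],
            "Resource": "*",
        },
    ],
}

iam = boto3.client("iam")
iam.put_user_policy(
    UserName="your-iam-user",               # placeholder: your IAM user name
    PolicyName="nba-data-lake-policy",
    PolicyDocument=json.dumps(policy_document),
)
```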
If you're using a different S3 bucket name, update all instances of "victor-analytics-data-lake" accordingly:
- In the IAM policy: modify the bucket name under the Resource field.
- In the resource deletion script: update the bucket name declaration.
- In the setup_nba_data_lake.py script, under the bucket name section: ensure it matches your chosen bucket name to avoid execution errors.
Below is a screenshot highlighting all areas that need to be updated if using a different S3 bucket name. If you are using "victor-analytics-data-lake", you can ignore this step.
AWS CloudShell or CLI
Use AWS CloudShell or set up the AWS CLI on your local machine to execute the script.
With these prerequisites in place, you're ready to start setting up the NBA data lake. In the next section, I'll guide you through the step-by-step implementation.
Step-by-Step Guide
Step 1: Log in to Your AWS Console
Go to AWS and sign in to your account. At the top, next to the search bar, you will see a square icon with a >_ inside. Click this to open AWS CloudShell.
Step 2: Create a .env File
In the CLI (Command Line Interface), type:
nano .env
Paste the following lines into your file, replacing your_sportsdata_api_key with your actual API key:
SPORTS_DATA_API_KEY=your_sportsdata_api_key
NBA_ENDPOINT=https://api.sportsdata.io/v3/nba/scores/json/Players
Note: If you're not comfortable using Nano, prepare the content in a text editor before pasting it into the terminal.
Press Ctrl + O to save, then Enter, and Ctrl + X to exit Nano.
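Optionally, you can sanity-check the .env file before running the full setup. The snippet below is an illustrative helper, not part of the repository; it assumes the requests and python-dotenv packages are installed (pip3 install requests python-dotenv) and that the API key is passed as a query-string parameter, as described in the SportsDataIO prerequisites above.

```python
# check_env.py - optional, illustrative sanity check (not part of the repository).
# Requires: pip3 install requests python-dotenv
import os

import requests
from dotenv import load_dotenv

load_dotenv()  # reads SPORTS_DATA_API_KEY and NBA_ENDPOINT from the .env file

api_key = os.getenv("SPORTS_DATA_API_KEY")
endpoint = os.getenv("NBA_ENDPOINT")

# SportsDataIO accepts the API key as a query-string parameter
response = requests.get(endpoint, params={"key": api_key}, timeout=30)
response.raise_for_status()
print(f"Fetched {len(response.json())} player records from the NBA endpoint")
```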
Step 3: Create the setup_nba_data_lake.py File
- In the CLI, type:
nano setup_nba_data_lake.py
- In another window, visit the GitHub repository.
- Copy the contents of the setup_nba_data_lake.py file. (Don't forget to edit the bucket name in the code if you want to.)
- Return to the CloudShell window and paste the contents into your file.
- Save and exit the file.
Step 4: Run the Script
Run the following command:
python3 setup_nba_data_lake.py
You should see messages confirming that the resources were successfully created, the sample data was uploaded, and the Data Lake setup was completed.
Troubleshooting: If you encounter an error about dotenv not being installed, run pip3 install python-dotenv and then rerun the previous command.
With the script successfully executed, everything is set up and can be manually verified in the AWS Console.
Step 5: Manually Check for Resources
- In the AWS search bar, type S3 and click on the S3 service.
- Locate the bucket named victor-analytics-data-lake and click on it.
- Inside the bucket, you should see two objects.
- Click on raw-data; it contains nba_player_data.json. Select the file and click Open at the top to view its contents. You should see a long string of NBA player data in JSON format.
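If you prefer to verify from the CloudShell terminal rather than the console, a short boto3 snippet such as the one below lists the bucket contents and previews the JSON file. The bucket name and object key are assumed to match the defaults used in this walkthrough.

```python
# Optional: verify the uploaded data from CloudShell instead of the console.
# Bucket name and object key are assumed to match the defaults above.
import boto3

BUCKET_NAME = "victor-analytics-data-lake"

s3 = boto3.client("s3")

# List every object in the bucket
for obj in s3.list_objects_v2(Bucket=BUCKET_NAME).get("Contents", []):
    print(obj["Key"], obj["Size"], "bytes")

# Preview the first 500 characters of the player data file
body = s3.get_object(Bucket=BUCKET_NAME, Key="raw-data/nba_player_data.json")["Body"].read()
print(body[:500].decode("utf-8"))
```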
Querying Data in Amazon Athena
To analyze the data, follow these steps:
- Go to the AWS Athena console.
- In the left panel, select the database you created (e.g., glue_nba_data_lake).
- Set up the Athena query result location:
- Navigate to Settings.
- Set Query result location to:
s3://victor-analytics-data-lake/athena-results/
- Click Save.
- In the Editor tab, paste the following SQL query to retrieve point guards (PGs):
SELECT FirstName, LastName, Position, Team
FROM nba_players
WHERE Position = 'PG';
- Click Run Query. The results will be stored in S3 and displayed under Query Results.
- To verify the data, run:
SELECT * FROM nba_players LIMIT 5;
- You should see a sample of player data.
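The same queries can also be run programmatically through the Athena API. The following is a minimal sketch that assumes the database, table, and query result location shown above; note that polling the query status uses athena:GetQueryExecution, which is not in the minimal permission list from the prerequisites.

```python
# Minimal sketch of running the point-guard query through the Athena API.
# Database, table, and result location are assumed to match the setup above.
# Note: polling the status uses athena:GetQueryExecution, which is not in the
# minimal permission list from the prerequisites.
import time

import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT FirstName, LastName, Position, Team FROM nba_players WHERE Position = 'PG'",
    QueryExecutionContext={"Database": "glue_nba_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://victor-analytics-data-lake/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Wait for the query to finish
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:  # the first row is the column header
        print([col.get("VarCharValue") for col in row["Data"]])
else:
    print(f"Query finished with state {state}")
```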
Deleting Resources
To delete and clean up all the resources created, follow these steps:
- Copy the contents of delete_aws_resource.py from the online repository.
- In the terminal, create a new file by running:
nano delete_aws_resource.py
- Paste the copied content into the file, save, and exit.
- Run the script to remove all resources:
python3 delete_aws_resource.py
- This will ensure a clean teardown of all the resources provisioned during the setup.
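For reference, the teardown amounts to emptying and deleting the bucket, then removing the Glue table and database. The sketch below is a simplified approximation of what delete_aws_resource.py does; the repository script is the authoritative version, and deleting objects requires s3:DeleteObject, which you may need to add to the IAM policy.

```python
# Simplified approximation of the teardown; the repository script is authoritative.
# Note: deleting objects requires s3:DeleteObject in addition to the permissions
# listed in the prerequisites.
import boto3

BUCKET_NAME = "victor-analytics-data-lake"
GLUE_DATABASE = "glue_nba_data_lake"

s3 = boto3.client("s3")
glue = boto3.client("glue")

# A bucket must be empty before it can be deleted
for obj in s3.list_objects_v2(Bucket=BUCKET_NAME).get("Contents", []):
    s3.delete_object(Bucket=BUCKET_NAME, Key=obj["Key"])
s3.delete_bucket(Bucket=BUCKET_NAME)

# Remove the Glue table and database created during setup
glue.delete_table(DatabaseName=GLUE_DATABASE, Name="nba_players")
glue.delete_database(Name=GLUE_DATABASE)
print("All data lake resources deleted.")
```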
What We Learned
- Securing AWS services using least-privilege IAM policies.
- Automating resource creation using a script.
- Integrating external APIs into cloud-based workflows.
Future Enhancements
- Automate data ingestion using AWS Lambda.
- Implement a data transformation layer with AWS Glue ETL.
- Add advanced analytics and visualizations using AWS QuickSight.