Project Description
This project automates the creation of a data lake for NBA analytics using AWS services. It uses Amazon S3, AWS Glue, and Amazon Athena to store and query NBA-related data.
This is the third project for the DevOps All Stars Challenge.
Project Architecture
Project Overview
The script data-lake.py creates an Amazon S3 bucket to store raw data, uploads NBA data (in JSON format) to the bucket, and creates an AWS Glue database and table to query the data. It also configures Amazon Athena to query the data stored in the S3 bucket.
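The storage half of that flow can be sketched roughly as follows. This is a minimal illustration, not the contents of src/data-lake.py: the function names and the object key raw-data/nba_player_data.json are hypothetical, and the choice to write one JSON object per line is an assumption made because Athena's JSON SerDes read one record per line. The client is passed in as an argument; in practice it would be something like boto3.client("s3").

```python
import json


def to_json_lines(records):
    """Serialize a list of dicts as newline-delimited JSON (one object per line)."""
    return "\n".join(json.dumps(record) for record in records)


def create_bucket(s3_client, bucket_name, region="us-east-1"):
    """Create the bucket; us-east-1 is the one region that takes no LocationConstraint."""
    if region == "us-east-1":
        s3_client.create_bucket(Bucket=bucket_name)
    else:
        s3_client.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={"LocationConstraint": region},
        )


def upload_players(s3_client, bucket_name, records,
                   key="raw-data/nba_player_data.json"):  # hypothetical key
    """Upload the records as one line-delimited JSON object and return its key."""
    s3_client.put_object(
        Bucket=bucket_name,
        Key=key,
        Body=to_json_lines(records),
    )
    return key
```

Keeping the serialization in its own function makes it easy to check the file layout locally before anything is sent to S3.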
Project Setup
Get your API key from SportsData (sportsdata.io).
Create a file called data-lake.py and copy the contents of src/data-lake.py into it.
Create the env file:
nano .env
Add the following variables in the env file:
SPORTS_DATA_API_KEY="{SPORTS_DATA_API_KEY}"
NBA_ENDPOINT="https://api.sportsdata.io/v3/nba/scores/json/Players"
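As a rough sketch of how a script might consume these two variables, the snippet below reads them from the process environment and builds the request with the standard library. Several things here are assumptions rather than facts about src/data-lake.py: the function names are hypothetical, the variables are assumed to have been loaded into the environment already (for example by python-dotenv's load_dotenv(), if that package is in requirements.txt), and the key is assumed to be sent in the Ocp-Apim-Subscription-Key header, so check the SportsData documentation and the original script before relying on it.

```python
import json
import os
import urllib.request


def build_request(api_key, endpoint):
    """Build a GET request carrying the API key in a header (assumed header name)."""
    return urllib.request.Request(
        endpoint,
        headers={"Ocp-Apim-Subscription-Key": api_key},
    )


def fetch_players(api_key=None, endpoint=None, timeout=10):
    """Fetch and decode the player list; falls back to the .env variable names."""
    api_key = api_key or os.environ["SPORTS_DATA_API_KEY"]
    endpoint = endpoint or os.environ["NBA_ENDPOINT"]
    request = build_request(api_key, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Sending the key in a header rather than in the URL keeps it out of most access logs, which is the reason for that choice in this sketch.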
- Install the requirements from requirements.txt:
python3 -m venv venv ## create a virtual environment
source venv/bin/activate ## activate the virtual environment
pip install -r requirements.txt ## install the requirements
Set your bucket_name and glue_database_name in the data-lake.py file.
Run the code with:
python3 src/data-lake.py
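To make the role of bucket_name and glue_database_name concrete, here is a hedged sketch of how a Glue database and table pointing at the bucket could be defined. It is not the original script: the function names and the raw-data/ prefix are hypothetical, and the column list only covers the four fields used in the Athena query later in this post, all assumed to be strings. The client is passed in; in practice it would be something like boto3.client("glue").

```python
def glue_table_input(table_name, bucket_name, prefix="raw-data/"):
    """Describe an external JSON table over s3://<bucket_name>/<prefix>."""
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            # Assumed subset of columns; the real data has more fields.
            "Columns": [
                {"Name": "FirstName", "Type": "string"},
                {"Name": "LastName", "Type": "string"},
                {"Name": "Position", "Type": "string"},
                {"Name": "Team", "Type": "string"},
            ],
            "Location": f"s3://{bucket_name}/{prefix}",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe",
            },
        },
    }


def create_catalog(glue_client, database_name, table_name, bucket_name):
    """Create the Glue database, then register the table inside it."""
    glue_client.create_database(DatabaseInput={"Name": database_name})
    glue_client.create_table(
        DatabaseName=database_name,
        TableInput=glue_table_input(table_name, bucket_name),
    )
```

Because the table is external, Glue only stores the schema and the S3 location; the JSON files themselves stay in the bucket.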
- Log in to the AWS Console and confirm that:
- The S3 bucket has been created
- The AWS Glue database has been created
- To confirm that the data is queryable, go to Amazon Athena and paste the following into the Query Editor:
SELECT FirstName, LastName, Position, Team
FROM nba_players
WHERE Position = 'PG';
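The same check can also be run from Python instead of the console. The sketch below is an illustration under assumptions, not part of the project: run_query is a hypothetical helper, output_location is an S3 path you choose for Athena's result files, and the client passed in would be something like boto3.client("athena").

```python
import time

QUERY = (
    "SELECT FirstName, LastName, Position, Team "
    "FROM nba_players "
    "WHERE Position = 'PG';"
)


def run_query(athena_client, database_name, output_location,
              query=QUERY, poll_seconds=1.0):
    """Start the query, poll until it finishes, and return the first page of rows."""
    execution_id = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database_name},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=execution_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    # Only the first page of results is returned here; the first row is the header.
    results = athena_client.get_query_results(QueryExecutionId=execution_id)
    return results["ResultSet"]["Rows"]
```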
- You can clean up the resources by running:
python3 src/delete-resources.py
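For readers curious about what such a cleanup involves, here is a minimal sketch. It is not the contents of src/delete-resources.py: the function names are hypothetical, the bucket is assumed to be unversioned, and the clients passed in would be something like boto3.client("s3") and boto3.client("glue"). A bucket has to be emptied before S3 will delete it, which is why the objects are removed first.

```python
def empty_and_delete_bucket(s3_client, bucket_name):
    """Delete every object in an unversioned bucket, then the bucket itself."""
    while True:
        # Each call lists at most 1000 keys, which is also the delete_objects limit.
        response = s3_client.list_objects_v2(Bucket=bucket_name)
        objects = [{"Key": obj["Key"]} for obj in response.get("Contents", [])]
        if not objects:
            break
        s3_client.delete_objects(Bucket=bucket_name, Delete={"Objects": objects})
    s3_client.delete_bucket(Bucket=bucket_name)


def delete_resources(s3_client, glue_client, bucket_name, database_name):
    """Remove the Glue database (and with it its tables) and the S3 bucket."""
    glue_client.delete_database(Name=database_name)
    empty_and_delete_bucket(s3_client, bucket_name)
```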
Conclusion
This project demonstrates the power of AWS services in creating a robust and automated data lake for NBA analytics. By integrating Amazon S3 for data storage, AWS Glue for data cataloging, and Amazon Athena for querying, we have built a scalable and efficient pipeline for managing and analyzing NBA data. The simplicity of the setup process ensures accessibility for developers, while the automated cleanup script emphasizes resource management and cost efficiency. This project not only highlights the capabilities of cloud-based data lakes but also serves as a foundation for exploring more advanced data analytics and insights, making it an invaluable asset for sports analytics enthusiasts and professionals.