Rahman Badru

NBA DATA LAKE

Project Description

This project automates the creation of a data lake for NBA analytics using AWS services. It uses Amazon S3, AWS Glue, and Amazon Athena to store and query NBA-related data.

This is the third project in the DevOps All Stars Challenge.

Project Architecture

[Figure: project architecture diagram]

Project Overview

The script data-lake.py creates an Amazon S3 bucket to store raw data, uploads NBA data (in JSON format) to the bucket, and creates an AWS Glue database and table for querying the data. It also configures Amazon Athena to query the data stored in the S3 bucket.
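
As a rough illustration of the Glue step, the sketch below builds a table definition that points at JSON files in the bucket. It is not the contents of data-lake.py: the column list is taken only from the Athena query used later in this post, and the raw-data/ prefix, the function names, and the choice of the OpenX JSON SerDe are assumptions. The Glue client is passed in as an argument, so a boto3 client such as boto3.client("glue") can be supplied.

```python
def build_table_input(table_name, bucket_name, prefix="raw-data/"):
    """Return a Glue TableInput dict for JSON files under s3://bucket/prefix."""
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            # Columns assumed from the SELECT statement shown later in the post.
            "Columns": [
                {"Name": "FirstName", "Type": "string"},
                {"Name": "LastName", "Type": "string"},
                {"Name": "Position", "Type": "string"},
                {"Name": "Team", "Type": "string"},
            ],
            # Hypothetical prefix; Athena reads every file under this location.
            "Location": f"s3://{bucket_name}/{prefix}",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
        },
    }


def create_database_and_table(glue_client, database_name, table_input):
    """Create the Glue database, then register the table inside it."""
    glue_client.create_database(DatabaseInput={"Name": database_name})
    glue_client.create_table(DatabaseName=database_name, TableInput=table_input)
```

Keeping the table definition in its own function makes it easy to inspect or adjust the schema before anything is created in AWS.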

Project Setup

  • Get your API key from sportsdata.io

  • Create a file at src/data-lake.py and copy the contents of the project's src/data-lake.py into it

  • Create the .env file:

    nano .env
    
  • Add the following variables to the .env file, replacing {SPORTS_DATA_API_KEY} with your API key:

    SPORTS_DATA_API_KEY="{SPORTS_DATA_API_KEY}"
    NBA_ENDPOINT="https://api.sportsdata.io/v3/nba/scores/json/Players"

  • Create a virtual environment and install the dependencies from requirements.txt:

    python3 -m venv venv              # create a virtual environment
    source venv/bin/activate          # activate the virtual environment
    pip install -r requirements.txt   # install the requirements

  • Set your bucket_name and glue_database_name in the data-lake.py file.

  • Run the code with python3 src/data-lake.py
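
Before the upload, a script like this typically reads the two .env variables and reshapes the API response. The sketch below is a guess at that flow, not the actual data-lake.py. It assumes the variables are already in the process environment (for example, loaded with python-dotenv's load_dotenv), that the API key is sent in the Ocp-Apim-Subscription-Key header, and that the player list is written as one JSON object per line, which is the layout Athena's JSON SerDes expect. All function names are hypothetical.

```python
import json
import os
import urllib.request


def load_config(env=os.environ):
    """Read the two settings defined in the .env file."""
    required = ("SPORTS_DATA_API_KEY", "NBA_ENDPOINT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {"api_key": env["SPORTS_DATA_API_KEY"], "endpoint": env["NBA_ENDPOINT"]}


def build_request(config):
    """Prepare (but do not send) the HTTP request for the player list."""
    return urllib.request.Request(
        config["endpoint"],
        # Assumed authentication header; check the API documentation.
        headers={"Ocp-Apim-Subscription-Key": config["api_key"]},
    )


def fetch_players(config):
    """Send the request and decode the JSON array of players."""
    with urllib.request.urlopen(build_request(config), timeout=30) as response:
        return json.load(response)


def to_json_lines(records):
    """Serialize a list of dicts as newline-delimited JSON."""
    return "\n".join(json.dumps(record) for record in records)
```

The string returned by to_json_lines could then be uploaded to the bucket with the S3 client's put_object call.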

  • Log in to the AWS Console and confirm that:
    • The S3 bucket has been created
    • The AWS Glue database has been created
    • The table can be queried: go to Amazon Athena and paste the following into the Query Editor:
 SELECT FirstName, LastName, Position, Team
 FROM nba_players
 WHERE Position = 'PG';
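
The same check can be scripted instead of pasted into the console. This is a minimal sketch built on the boto3 Athena client methods start_query_execution, get_query_execution, and get_query_results. The function name, the polling interval, and the output location are placeholders rather than anything taken from the project. The client is passed in as an argument, for example boto3.client("athena").

```python
import time

QUERY = """
SELECT FirstName, LastName, Position, Team
FROM nba_players
WHERE Position = 'PG'
"""


def run_athena_query(athena_client, database, output_location,
                     query=QUERY, poll_seconds=2):
    """Run a query, wait for it to finish, and return rows as lists of strings."""
    execution_id = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        # Athena writes result files here, e.g. "s3://<bucket>/athena-results/".
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=execution_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    # Only the first page of results is read; the first row holds column names.
    result = athena_client.get_query_results(QueryExecutionId=execution_id)
    rows = result["ResultSet"]["Rows"]
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]
```

The database argument would be the glue_database_name set in data-lake.py, and output_location any S3 path the caller is allowed to write to.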

  • You can clean up the resources by running python3 src/delete-resources.py
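
For readers curious what such a cleanup involves, here is a minimal sketch. It is not the contents of delete-resources.py: the function names are hypothetical, the bucket is assumed to be unversioned, and the clients are passed in as arguments (for example boto3.client("s3") and boto3.client("glue")). An S3 bucket must be empty before it can be deleted, so the objects are removed first.

```python
def empty_and_delete_bucket(s3_client, bucket_name):
    """Delete every object in the bucket, then the bucket itself."""
    while True:
        # Each call returns up to 1,000 keys; repeat until nothing is left.
        listing = s3_client.list_objects_v2(Bucket=bucket_name)
        objects = listing.get("Contents", [])
        if not objects:
            break
        for obj in objects:
            s3_client.delete_object(Bucket=bucket_name, Key=obj["Key"])
    s3_client.delete_bucket(Bucket=bucket_name)


def delete_glue_database(glue_client, database_name):
    """Delete the Glue database; Glue also removes the tables it contains."""
    glue_client.delete_database(Name=database_name)
```

Query result files that Athena wrote to the same bucket are removed along with the raw data, since the whole bucket is emptied.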

Conclusion

This project shows how AWS services can be combined into an automated data lake for NBA analytics: Amazon S3 stores the raw data, AWS Glue catalogs it, and Amazon Athena queries it. The setup takes only a few steps, and the cleanup script removes the resources afterwards so they do not keep incurring costs. The project can also serve as a starting point for more advanced analytics on NBA data.
