Stefen
End to end data engineering project with Spark, Mongodb, Minio, postgres and Metabase

Using open-source technologies to implement a data pipeline

Architecture

Source Code

All the source code demonstrated in this post is open-source and available on GitHub.

git clone https://github.com/Stefen-Taime/projet_data.git

Prerequisites

As a prerequisite for this post, you will need to create the following resources:

  1. Linux machine

  2. Docker

  3. Docker Compose

  4. Virtualenv

Setup

git clone https://github.com/Stefen-Taime/projet_data.git
cd projet_data/extractor

pip install -r requirements.txt
python main.py

or

docker build --tag=extractor .
docker-compose up

# This folder contains code used to create a downloads folder, iteratively download files from a list of URIs, unzip them, and delete the zip files.

At this point, the extractor directory should contain a new Downloads folder with two CSV files.
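For reference, the core of the extractor can be sketched with the standard library alone. This is a minimal sketch, not the repository's exact code; the URI list and folder name are assumptions:

```python
import os
import zipfile
import urllib.request

def download_and_extract(uris, dest="Downloads"):
    """Download each zip archive, extract it into dest, then delete the archive."""
    os.makedirs(dest, exist_ok=True)
    for uri in uris:
        zip_path = os.path.join(dest, os.path.basename(uri))
        urllib.request.urlretrieve(uri, zip_path)   # download the archive
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(dest)                # unzip into the downloads folder
        os.remove(zip_path)                         # delete the zip file
```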

then

cd ..
cd docker

docker-compose -f docker-compose-nosql.yml up -d  #for mongodb
docker-compose -f docker-compose-sql.yml up -d    #for postgres and adminer port 8085, metabase port 3000
docker-compose -f docker-compose-s3.yml up -d     #for minio port 9000
docker-compose -f docker-compose-spark.yml up -d  #for spark master and jupyter notebook port 8888


cd ..
cd loader
pip install -r requirements.txt

# ! modify the DATA and DATA_FOR_MONGODB path variables in .env

python loader.py mongodb  # upload data into the MongoDB database (if you get an error, manually create an auto-mpg database with an auto collection and try again)
python loader.py minio    # upload data into MinIO (if you get an error, manually create a landing bucket and try again)

You should now have an auto-mpg database in MongoDB containing an auto collection with data in it, as well as data in MinIO.
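Conceptually, the MongoDB load boils down to turning each CSV row into a document and inserting the documents into the collection. A minimal sketch of the row-to-document step, using only the standard library (the field names come from the auto-mpg dataset; the actual loader code may differ):

```python
import csv
import io

def csv_to_documents(csv_text):
    """Parse CSV text into a list of dicts ready for collection.insert_many()."""
    return list(csv.DictReader(io.StringIO(csv_text)))

sample = "mpg,cylinders,horsepower\n18.0,8,130\n15.0,8,165\n"
docs = csv_to_documents(sample)
# With pymongo (assumption: an `auto` collection in an `auto-mpg` database):
# client["auto-mpg"]["auto"].insert_many(docs)
```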

then

Go to localhost:8888; the password is "stefen". Once in the Jupyter notebook, run all cells.

Go to localhost:8085 to open Adminer.

Go to localhost:3000 to open Metabase.

Cleaning Up
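The post doesn't list the teardown commands; a reasonable sketch, assuming you run it from the docker directory against the same compose files used above:

```shell
# Stop and remove each stack; -v also removes the data volumes
docker-compose -f docker-compose-spark.yml down -v
docker-compose -f docker-compose-s3.yml down -v
docker-compose -f docker-compose-sql.yml down -v
docker-compose -f docker-compose-nosql.yml down -v
```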

https://medium.com/@stefentaime_10958/end-to-end-data-engineering-project-with-spark-mongodb-minio-postgres-and-metabase-2c400672b50d
