Extraction, transform, load is a process of extracting data from various data sources, transforming them to clean formats which then can be loaded in a database.
Extraction & transformation
in extraction we basically get data from various data sources, the data sources could include databases, api keys, images, videos and various files i.e json, xml.
i use python to extract the files form the data sources. i recently worked with api key where i get the url and assign a variable to it and used the requests library to get the data from the api. Then i change the data into a json format. i the used pandas library to create a data frame for the data i have. which covers for the transformation part.
load
loading the data into the database using the engine library where i had to input my database connection string, which includes the database name, the host plus port and the password.
Automation
say one is trying to get weather data for a particular place, now the weather changes every moment, one needs to run the etl process every now and then in order to gets the latest data. It might be hectic to keep running the file manually, that is where airflow comes in, airflow is an apache which automates an etl process A DAG (directed acyclic graph) is a mathematical structure consisting of nodes and edges. In Airflow, a DAG represents a data pipeline or workflow with a start and an end. The mathematical properties of DAGs make them useful for building data pipelines: Directed means there is a clear direction of flow between tasks.
all you need to know about airflow
first you need to create a virtual environment. In python we use the following command: python -m venv venv(name of you environment) the you need to activate it by the following command: source venv/bin/activate.
You the need to install airflow using the following command: pip install apache airflow==2.8.0(this is the version), if you're using a linux os or wsl you might need to use sudo. After installation is done, you need to adjust some config files to suit you to use. So you need to change path to your airflow folder i.e cd airflow, then you open the configuration file which is normally saved as airflow.cfg by using the following command: nano airflow.cfg. What we need to edit in this file first is the executor we change it to localexecutor because we want our processes to be executed locally, this works when you are not using sqlite as your database. We also edit to show examples and we set it to false so our airflow user interface won't show examples of dags.
the next is the db config where we need to add our connection string, for me i used a postgres database and this is the syntax:db_url = "postgresql://postgres:your password@localhost:5432/db name".
we need to save the file the exit back to our folder, where we can run airflow db migrate to allow the airflow to move from the default sqlite to your preferred database. we then run airflow db init to initialize our database and airflow webserver and follow the port to local host to enable us view our ui.
Top comments (0)