<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ali KHYAR</title>
    <description>The latest articles on DEV Community by Ali KHYAR (@alikhyar).</description>
    <link>https://dev.to/alikhyar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F939081%2F6c775f36-9339-4098-8f89-a5640a68ffe7.jpeg</url>
      <title>DEV Community: Ali KHYAR</title>
      <link>https://dev.to/alikhyar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alikhyar"/>
    <language>en</language>
    <item>
      <title>Apache Airflow - Deep Dive | All you need to know about Airflow</title>
      <dc:creator>Ali KHYAR</dc:creator>
      <pubDate>Fri, 24 Feb 2023 06:03:00 +0000</pubDate>
      <link>https://dev.to/alikhyar/apache-airflow-deep-dive-all-you-need-to-know-about-airflow-1pan</link>
      <guid>https://dev.to/alikhyar/apache-airflow-deep-dive-all-you-need-to-know-about-airflow-1pan</guid>
      <description>&lt;p&gt;This blog was originally published on ali-khyar.com, if you are interested in learning moreon similar subjects &lt;a href="https://ali-khyar.com/" rel="noopener noreferrer"&gt;visit here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;p&gt;What is a data pipeline?  &lt;/p&gt;

&lt;p&gt;Airflow for creating and orchestrating data pipelines&lt;/p&gt;

&lt;p&gt;DAGs and Operators&lt;/p&gt;

&lt;p&gt;Airflow's single-node architecture vs multi-node architecture&lt;/p&gt;

&lt;p&gt;Airflow Setup&lt;/p&gt;

&lt;p&gt;Airflow UI Views&lt;/p&gt;

&lt;p&gt;DAGs in Action&lt;/p&gt;

&lt;p&gt;DAGs Scheduling&lt;/p&gt;

&lt;p&gt;Backfilling And CatchUp&lt;/p&gt;

&lt;p&gt;Databases and Executors (Sequential, Local and Celery)&lt;/p&gt;

&lt;p&gt;Grouping tasks (SubDAGs and TaskGroups)&lt;/p&gt;

&lt;p&gt;Sharing data with XComs&lt;/p&gt;

&lt;p&gt;Tasks conditioning (BranchPythonOperator)&lt;/p&gt;

&lt;p&gt;Trigger rules&lt;/p&gt;

&lt;p&gt;Conclusion&lt;/p&gt;




&lt;h2&gt;
  
  
  What is a data pipeline?
&lt;/h2&gt;

&lt;p&gt;A data pipeline is a set of processes or tools that are used to move data from one place to another, and to transform and process that data along the way.&lt;br&gt;
A simple example of a data pipeline might involve extracting data from a source system, such as a database or a CSV file, and then using a series of transformation steps to clean and prepare the data for loading into a destination system, such as a data warehouse or a machine learning model.&lt;/p&gt;

&lt;p&gt;A data pipeline typically includes several stages, such as data extraction, data transformation, data validation, and data loading. These stages may involve a combination of manual and automated processes, and may include a variety of different tools and technologies.&lt;/p&gt;

&lt;p&gt;Data pipelines serve various use cases, such as Extract, Transform, Load (ETL); Extract, Transform, Load, Analyze (ETLA); Extract, Load, Transform (ELT); and many more.&lt;/p&gt;

&lt;p&gt;The complexity of the pipeline can vary depending on the scope and purpose; it can be as simple as gathering and combining data from multiple sources into a single, unified view or database. This can be done for a variety of reasons, such as to improve data quality, reduce data redundancy, or to make it easier to analyze and report on the data.&lt;/p&gt;

&lt;p&gt;For example, imagine a company that has been acquired by another company and now has multiple databases containing information about customers, sales, and inventory. In order to more easily analyze and report on the company's performance, the data from these multiple databases would need to be consolidated into a single database. This process would involve extracting the relevant data from each of the individual databases, cleaning and standardizing the data, and then loading it into the consolidated database.&lt;/p&gt;

&lt;p&gt;A simple example of an ETL data pipeline, using Python, is the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Import the necessary libraries: pandas to load data from the source, and SQLAlchemy's create_engine to connect to a PostgreSQL database&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_engine&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Read the sample file (Extract)&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Add a total column (Transform)&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Load data into destination (Load)&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql://username:password@host:port/database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Overall, data pipelines allow organizations to easily collect, process, and analyze large amounts of data, which helps them make data-driven decisions and improve business operations. A pipeline's architecture is typically based on the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data sources: These are the systems or sources from which data is extracted, such as databases, file systems, or external APIs.&lt;/li&gt;
&lt;li&gt;Data extraction: This step involves extracting the data from the sources and converting it into a format that can be used downstream in the pipeline.&lt;/li&gt;
&lt;li&gt;Data transformation: This step involves cleaning, formatting, and transforming the data to make it usable for the next step in the pipeline.&lt;/li&gt;
&lt;li&gt;Data loading: This step involves loading the transformed data into the target system, such as a data warehouse or a data lake.&lt;/li&gt;
&lt;li&gt;Data validation: This step involves validating the data to ensure that it meets the quality standards and requirements before it is loaded into the target system.&lt;/li&gt;
&lt;li&gt;Data monitoring: This step involves monitoring the pipeline to ensure that it is running smoothly and that data is flowing through it as expected.&lt;/li&gt;
&lt;li&gt;Error handling: This step involves handling any errors that may occur during the pipeline and alerting the appropriate parties.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some data pipeline architectures may also include additional steps such as data enrichment, or data warehousing for data analysis and reporting.&lt;/p&gt;
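&lt;p&gt;To make the validation stage above concrete, it can be sketched in plain Python. This is a minimal illustration, and the rules and field names (email, quantity) are hypothetical examples, not part of any specific pipeline:&lt;/p&gt;

```python
# A minimal data-validation step, as a plain-Python sketch.
# The rules and field names here are hypothetical examples.

def validate_rows(rows):
    """Split rows into valid and rejected, with a reason for each rejection."""
    valid, rejected = [], []
    for row in rows:
        if not row.get("email"):
            rejected.append((row, "missing email"))
        elif row.get("quantity", 0) < 0:
            rejected.append((row, "negative quantity"))
        else:
            valid.append(row)
    return valid, rejected

rows = [
    {"email": "a@example.com", "quantity": 2},
    {"email": "", "quantity": 1},
    {"email": "b@example.com", "quantity": -3},
]
valid, rejected = validate_rows(rows)
print(len(valid), len(rejected))  # 1 2
```

&lt;p&gt;In a real pipeline, rejected rows would typically be written to a quarantine location and surfaced through the monitoring and error-handling stages rather than silently dropped.&lt;/p&gt;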




&lt;h2&gt;
  
  
  Airflow for creating and orchestrating data pipelines
&lt;/h2&gt;

&lt;p&gt;As we saw, data pipelines are sets of tasks, run sequentially or in parallel, that move data between a source system and a target one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev9mavjifo59do3fe8fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev9mavjifo59do3fe8fo.png" alt="Airflow - data pipeline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Airflow is a popular open-source tool used to manage and schedule data pipeline tasks. It allows for the creation, management, and monitoring of workflows, which can include multiple tasks that are dependent on each other. These tasks can be defined as Python functions and can be scheduled to run on a specific schedule or triggered by certain events. Airflow also provides a web interface for monitoring the status of tasks and troubleshooting any issues that may arise.&lt;/p&gt;

&lt;p&gt;The tool is composed of five essential components/services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Webserver: a Flask server that serves the UI through Gunicorn&lt;/li&gt;
&lt;li&gt;Scheduler: The daemon in charge of workflows’ scheduling&lt;/li&gt;
&lt;li&gt;Metastore: a database where metadata is stored; any database is compatible as long as it is supported by SQLAlchemy (PostgreSQL recommended)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Executor: defines how tasks should be executed; the most commonly used ones are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sequential Executor: This is the simplest executor, which runs tasks sequentially on the same machine as the Airflow scheduler. It is the default executor and is suitable for small and simple use cases.&lt;/li&gt;
&lt;li&gt;Local Executor: This executor runs tasks concurrently on the same machine as the Airflow scheduler. It is similar to the Sequential Executor but allows for parallelism.&lt;/li&gt;
&lt;li&gt;Celery Executor: This executor runs tasks concurrently on a separate worker machine or a group of worker machines. It uses the Celery distributed task queue to manage the execution of tasks.&lt;/li&gt;
&lt;li&gt;Kubernetes Executor: This executor runs tasks within a Kubernetes cluster. It allows for scaling the number of worker nodes up or down based on task demands.&lt;/li&gt;
&lt;li&gt;Dask Executor: This executor runs tasks concurrently on a separate worker machine or a group of worker machines. It uses the Dask distributed task scheduler to manage the execution of tasks.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Worker: the process or subprocess that actually executes the task&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;
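&lt;p&gt;The executor is selected in &lt;code&gt;airflow.cfg&lt;/code&gt; (or via the &lt;code&gt;AIRFLOW__CORE__EXECUTOR&lt;/code&gt; environment variable). A minimal fragment, assuming a default install:&lt;/p&gt;

```ini
[core]
# Default; runs one task at a time on the scheduler machine.
executor = SequentialExecutor
# Alternatives: LocalExecutor, CeleryExecutor, KubernetesExecutor
```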




&lt;h2&gt;
  
  
  DAGs and Operators:
&lt;/h2&gt;

&lt;p&gt;When you start learning Airflow, you hear about DAGs a lot; everyone talks about DAGs like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz2vhbthhni55f3oce5d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz2vhbthhni55f3oce5d.png" alt="Airflow - the basics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DAG is an abbreviation for “Directed Acyclic Graph”. It is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. DAGs define how tasks are executed, what their dependencies are, and what the order of execution should be.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid4hsm03dqyu5xczwp23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid4hsm03dqyu5xczwp23.png" alt="Airflow - the graph"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;They are written in Python and can be scheduled to run at a specific interval or triggered by an external event. A &lt;code&gt;DAG has 2 main components&lt;/code&gt;: &lt;code&gt;Tasks&lt;/code&gt; and &lt;code&gt;Operators&lt;/code&gt;. Operators define what each task does, while the dependencies between tasks specify the order in which they should be executed. For example, a dependency can specify that one task should run only after another task has completed successfully. Here's an example of a simple DAG file in Apache Airflow that defines a single task using the bash operator:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw49wa9kc0vnwox4qxhvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw49wa9kc0vnwox4qxhvf.png" alt="Airflow - DAG and TASK"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can divide operators into 3 types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;action operators&lt;/code&gt;: the ones that execute bash commands or Python functions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;transfer operators&lt;/code&gt;: allow you to transfer data between systems; for example, the SftpOperator, which is used to transfer files over the SFTP protocol&lt;/li&gt;
&lt;li&gt; &lt;code&gt;sensor operators&lt;/code&gt;: used to check a criterion (a condition or state) before executing the next task(s), hence the word sense (wait/perceive). &lt;code&gt;AzureBlobStorageSensor&lt;/code&gt;, for instance, checks for the existence of a specific blob in an Azure Blob Storage container and waits until it appears or disappears.&lt;/li&gt;
&lt;/ul&gt;
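&lt;p&gt;Under the hood, a sensor is essentially a poke loop: check a condition, sleep, repeat until success or timeout. A simplified plain-Python sketch of that idea (not Airflow's actual implementation):&lt;/p&gt;

```python
import time

def wait_for(condition, poke_interval=1.0, timeout=10.0):
    """Poll `condition` until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True   # criterion met: downstream tasks may proceed
        time.sleep(poke_interval)
    return False          # timed out: the sensor fails

# Example: a condition that becomes true on the third check.
calls = {"n": 0}
def ready():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_for(ready, poke_interval=0.01))  # True
```

&lt;p&gt;Real Airflow sensors expose the check as a &lt;code&gt;poke&lt;/code&gt; method and let you configure the poke interval and timeout per task.&lt;/p&gt;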

&lt;p&gt;In addition, Airflow has the following concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task instance: what a task is called once it is executed (a specific run of a task)&lt;/li&gt;
&lt;li&gt;Workflow: the combination of DAGs, operators, tasks, and their dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;Important notes:&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Airflow is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a data streaming solution: if you need to process data every second, Airflow is not the right choice&lt;/li&gt;
&lt;li&gt;a data processing framework: if you have terabytes of data, go with Spark or another solution optimized for such workloads; if you push Airflow to do the processing itself, you may end up with out-of-memory errors. Still, you can use the SparkSubmitOperator to trigger a Spark job outside Airflow.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Airflow's single-node architecture vs multi-node architecture
&lt;/h2&gt;

&lt;p&gt;When starting with Airflow, you are probably using a single machine; this is called, in Airflow terms, a single-node architecture, in which the Airflow components interact as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w9hwwm03xoho15v0l4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w9hwwm03xoho15v0l4x.png" alt="Airflow - single node"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These components communicate with the help of the Metastore. The queue inside the executor may not be the best architecture, but it is suited to the single-node setup, dev environments, and a limited number of tasks.&lt;/p&gt;

&lt;p&gt;The following is how pipelines are executed in the single-node architecture (it also applies to the multi-node architecture):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You have a DAGs folder, for example &lt;em&gt;dags-folder&lt;/em&gt;, where the data pipeline code is stored.&lt;/li&gt;
&lt;li&gt;Both the webserver service and the scheduler parse &lt;em&gt;dags-folder&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;When it's time for a DAG to be executed, the scheduler creates a DagRun object (an instantiation of the DAG file) in the Metastore.&lt;/li&gt;
&lt;li&gt;When the DagRun state becomes Ready, the scheduler creates a TaskInstance object.&lt;/li&gt;
&lt;li&gt;The scheduler sends the TaskInstance to the executor.&lt;/li&gt;
&lt;li&gt;The executor runs the TaskInstance and updates its status in the Metastore.&lt;/li&gt;
&lt;li&gt;Once the TaskInstance is done, the executor updates its state.&lt;/li&gt;
&lt;li&gt;The scheduler checks the DagRun status; once done, the webserver updates the status in the UI.&lt;/li&gt;
&lt;/ol&gt;
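&lt;p&gt;The steps above boil down to a small state machine around the Metastore. A conceptual sketch, heavily simplified and not Airflow's real internals:&lt;/p&gt;

```python
from enum import Enum

# Conceptual sketch of the TaskInstance status handoff described above --
# simplified, NOT Airflow's actual scheduler/executor code.
class State(Enum):
    SCHEDULED = "scheduled"
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"

def run_task_instance(metastore, task_id):
    """Record each status transition, as the scheduler and executor would."""
    for state in (State.SCHEDULED,  # scheduler creates the TaskInstance
                  State.QUEUED,     # scheduler hands it to the executor
                  State.RUNNING,    # executor picks it up
                  State.SUCCESS):   # executor reports completion
        metastore.setdefault(task_id, []).append(state)
    return metastore[task_id][-1]

metastore = {}
print(run_task_instance(metastore, "extract"))  # State.SUCCESS
```

&lt;p&gt;The point is that every component reads and writes shared state in the Metastore rather than talking to the others directly, which is what makes swapping executors possible.&lt;/p&gt;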

&lt;p&gt;To &lt;code&gt;execute as many tasks as you want&lt;/code&gt;, you should use the &lt;code&gt;Multi-Node Architecture&lt;/code&gt; (a.k.a. the Celery setup), where the queue is an external third-party service like RabbitMQ or Redis. With Celery, you can have many tasks running, spread across different nodes (workers). The multi-node architecture looks like the figure below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0opkyeot76at6f2fny6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0opkyeot76at6f2fny6.png" alt="Airflow - multi node"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Airflow Setup
&lt;/h2&gt;

&lt;p&gt;You can install Airflow with pip/pip3 using the following command: &lt;code&gt;pip3 install apache-airflow==version --constraint path-to-constraints&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--constraint&lt;/code&gt; or &lt;code&gt;-c&lt;/code&gt; flag specifies the path to a file that contains version constraints for the packages being installed. This constraints file is a plain text file listing package names and the version number or version range allowed for each package, so that those pinned versions are installed rather than the latest versions available.&lt;/p&gt;

&lt;p&gt;To initialize the metastore, run &lt;code&gt;airflow db init&lt;/code&gt;; this command also creates some additional folders and files (logs, configuration files…). By default, &lt;code&gt;if you don’t specify another database&lt;/code&gt; to use, Airflow will create a &lt;code&gt;SQLite database&lt;/code&gt; named airflow.db.&lt;br&gt;
To start the webserver, run &lt;code&gt;airflow webserver&lt;/code&gt; and visit &lt;code&gt;localhost:8080&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;In Airflow, no user is created by default&lt;/code&gt;; you have to create one manually from the CLI. To create a user, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="nt"&gt;--------------------------------------------------------------&lt;/span&gt;
airflow &lt;span class="nb"&gt;users &lt;/span&gt;create &lt;span class="nt"&gt;-u&lt;/span&gt; admin &lt;span class="nt"&gt;-p&lt;/span&gt; admin &lt;span class="nt"&gt;-f&lt;/span&gt; Ali &lt;span class="nt"&gt;-l&lt;/span&gt; Khyar &lt;span class="nt"&gt;-r&lt;/span&gt; Admin &lt;span class="nt"&gt;-e&lt;/span&gt; admin@airflow.com
    &lt;span class="nt"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Airflow UI Views:
&lt;/h2&gt;

&lt;p&gt;Workflow visualization is crucial for understanding and managing workflows. The following are five views that can be used to visualize Airflow's workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Tree View&lt;/code&gt;: The Tree View is a hierarchical view of all the tasks within a DAG. It shows all the tasks and their dependencies in a tree structure. This view is useful for understanding the overall structure of a workflow and for identifying failed or skipped tasks.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Graph View&lt;/code&gt;: The Graph View displays a DAG and its tasks in a graphical view. This view is useful for visualizing the structure of a workflow and identifying dependencies between tasks. Users can zoom in and out and rearrange tasks to get a better understanding of the workflow.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Gantt Chart&lt;/code&gt; View: The Gantt Chart View displays the tasks and their dependencies in a timeline. This view is useful for identifying the start and end times of tasks and how they relate to each other.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Task Instance Details View&lt;/code&gt;: The Task Instance Details View displays detailed information about a specific task, including its start and end time, duration, and status. Users can also view logs for the task, which can help with debugging and troubleshooting.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Code View&lt;/code&gt;: The Code View displays the code that defines a DAG. This view is useful for understanding how a workflow is defined and for making changes to the code.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  DAGs in action:
&lt;/h2&gt;

&lt;p&gt;As we already said, a DAG represents a data pipeline, which consists of tasks (nodes) and dependencies (edges) between them; tasks are created using operators. There are many types of operators available in Airflow, including the PythonOperator, BashOperator, and SqliteOperator.&lt;/p&gt;

&lt;p&gt;Each operator represents a specific task in the pipeline. For example, let's say we have a data pipeline that involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;extracting user data from an API&lt;/li&gt;
&lt;li&gt;processing it using Python functions&lt;/li&gt;
&lt;li&gt;storing it in a SQLite database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We could create the following tasks using Airflow operators:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SqliteOperator: create the table&lt;/li&gt;
&lt;li&gt;HttpSensor: check if API is available&lt;/li&gt;
&lt;li&gt;PythonOperator: extract user data&lt;/li&gt;
&lt;li&gt;PythonOperator: process user data&lt;/li&gt;
&lt;li&gt;BashOperator: store user data in SQLite database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We would then define a DAG folder to store our DAGs, and create a DAG file that specifies the order in which these tasks should be executed. It's important to note that &lt;code&gt;combining cleaning and processing data into one Airflow operator is not a best practice&lt;/code&gt;, as it can lead to issues in the pipeline. Instead, &lt;code&gt;each task should be its own operator&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Code example of the above scenario:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.sqlite_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SQLiteOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.http_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SimpleHttpOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.bash_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;

&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;airflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;depends_on_past&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email_on_failure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email_on_retry&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retry_delay&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# creating an example dag 
&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;example_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A DAG to demonstrate the use of Airflow operators&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# task to create SQLite table
&lt;/span&gt;&lt;span class="n"&gt;create_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SQLiteOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;create_table&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT NOT NULL)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my_db&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# task to check if API is available
&lt;/span&gt;&lt;span class="n"&gt;check_api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleHttpOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;check_api&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;api/health&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GET&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;http_conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my_api&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# task to extract user data
&lt;/span&gt;&lt;span class="n"&gt;extract_user_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract_user_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_extraction_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;op_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;param1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;param2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# task to process user data
&lt;/span&gt;&lt;span class="n"&gt;process_user_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;process_user_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_processing_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;op_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;param1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;param2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# task to store user data in SQLite database
&lt;/span&gt;&lt;span class="n"&gt;store_user_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;store_user_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python /path/to/my_script.py --arg1 value1 --arg2 value2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# define task dependencies
&lt;/span&gt;&lt;span class="n"&gt;create_table&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;check_api&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;extract_user_data&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;process_user_data&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;store_user_data&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To test our DAG, we can use the &lt;code&gt;airflow tasks test&lt;/code&gt; command, which allows us to &lt;code&gt;test individual tasks&lt;/code&gt; within the DAG.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;To share data between tasks, we can use the XCom mechanism&lt;/code&gt;. XComs let us pass data between tasks by creating key-value pairs in the metastore.&lt;br&gt;
For example, the &lt;em&gt;extract user data&lt;/em&gt; task could create an XCom containing the extracted user data as a JSON object, which the &lt;em&gt;process user data&lt;/em&gt; task could then retrieve using the XCom API. We will see more about this later in this blog.&lt;/p&gt;
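As a sketch of that pattern (the task ids, keys, and sample data below are illustrative, not from the pipeline above), two python_callables can push and pull an XCom through the `ti` (TaskInstance) object that Airflow injects into the task context:

```python
# Hedged sketch: sharing data between tasks via XComs.
# `ti` is the TaskInstance Airflow passes into the task context;
# task ids, keys, and the sample record are illustrative.

def extract_user_data(ti, **kwargs):
    # Pretend this came from the API; push it to the metastore as an XCom
    users = [{"id": 1, "name": "ali", "email": "ali@example.com"}]
    ti.xcom_push(key="users", value=users)

def process_user_data(ti, **kwargs):
    # Pull the XCom written by the upstream task
    users = ti.xcom_pull(task_ids="extract_user_data", key="users")
    return [u["name"].upper() for u in users]
```

Note that the return value of a PythonOperator's callable is also pushed automatically as an XCom under the key `return_value`.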



&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  DAGS Scheduling:
&lt;/h2&gt;

&lt;p&gt;One of the key features of Airflow is its ability to schedule tasks based on a variety of criteria. You can define a task's start date and its schedule interval:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The start date determines when the task should begin running.&lt;/li&gt;
&lt;li&gt;The schedule interval determines how often the task should be executed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, let's say we have a task that needs to run every 10 minutes, starting on January 1, 2020 at 10:00 AM. We can define such a task as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;

&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;airflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;depends_on_past&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retry_delay&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dummy_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dummy_task&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dummy_task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dummy_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dummy_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code creates &lt;em&gt;dummy_dag&lt;/em&gt; with a start date of January 1, 2020 at 10:00 AM and a schedule interval of 10 minutes. We've also defined a PythonOperator called &lt;em&gt;dummy_task&lt;/em&gt; that will be executed every 10 minutes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;One thing to note&lt;/code&gt;&lt;/strong&gt;: the task won't start executing immediately at 10:00 AM. Instead, Airflow waits until the first schedule interval has elapsed before triggering the task. In this case, that means the task will be triggered at 10:10 AM on January 1, 2020. This is referred to as the &lt;code&gt;execution date&lt;/code&gt; in Airflow.&lt;/p&gt;
&lt;/blockquote&gt;
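The arithmetic behind that note can be sketched directly with `datetime` (a minimal illustration, not Airflow internals):

```python
from datetime import datetime, timedelta

start_date = datetime(2020, 1, 1, 10, 0)
schedule_interval = timedelta(minutes=10)

# The run covering the first interval fires only once that interval ends:
execution_date = start_date                      # the interval the run covers
first_trigger = start_date + schedule_interval   # when Airflow actually fires

print(first_trigger)  # 2020-01-01 10:10:00
```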



&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Backfilling And CatchUp
&lt;/h2&gt;

&lt;p&gt;Two super important concepts in DAGs are backfilling and catchup.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Backfilling&lt;/code&gt; is the process of running past instances of a DAG that were missed, either because the schedule was not set up at the time or because the DAG was paused. It can be achieved by setting the &lt;code&gt;start_date&lt;/code&gt; parameter of the DAG and using the &lt;code&gt;airflow backfill&lt;/code&gt; command. &lt;em&gt;This command can be used to manually trigger a DAG run for all instances between the start_date and the current date&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Catchup&lt;/code&gt; is a feature in Airflow that allows a DAG to process all missed DAG runs during a period when the DAG was inactive, either because the DAG was paused or the scheduler wasn't running. &lt;code&gt;By default, catchup is set to True&lt;/code&gt;, which means the scheduler will process any missed DAG runs when the DAG is restarted. This ensures that all historical data is processed and accounted for.&lt;br&gt;
Here's &lt;code&gt;an example to illustrate the use of backfilling and catchup&lt;/code&gt; in Airflow: &lt;/p&gt;

&lt;p&gt;Let's say we have a DAG &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scheduled to run @daily&lt;/li&gt;
&lt;li&gt;with a start date of January 1, 2023&lt;/li&gt;
&lt;li&gt;The DAG is paused for 3 days&lt;/li&gt;
&lt;li&gt;then restarted on January 5, 2023.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;code&gt;Since catchup is set to True by default&lt;/code&gt;, Airflow will automatically run DAG instances for January 2, 3, and 4, in addition to the January 5 instance.&lt;/em&gt;&lt;/p&gt;
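To see which runs catchup would create, we can enumerate the missed daily intervals ourselves (a plain-Python sketch of the scheduler's bookkeeping, using the dates from the example above):

```python
from datetime import date, timedelta

start = date(2023, 1, 1)    # DAG start date
resumed = date(2023, 1, 5)  # day the DAG is unpaused

# With catchup=True, Airflow schedules one run per missed daily interval
missed = []
d = start + timedelta(days=1)
while d <= resumed:
    missed.append(d)
    d += timedelta(days=1)

print(missed)  # runs for January 2, 3, 4, and 5
```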

&lt;p&gt;To deactivate catchup, set it to False in the DAG instantiation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;....)&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;: all dates in Airflow are in UTC, so don't get confused if DAGs are not executed in your local timezone.&lt;br&gt;
You can change this in airflow.cfg via the &lt;code&gt;default_timezone&lt;/code&gt; setting,&lt;br&gt;
but that's not recommended; keep everything in UTC.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Databases and Executors:
&lt;/h2&gt;

&lt;p&gt;Executors in Airflow define how many tasks you can execute in parallel.&lt;/p&gt;

&lt;p&gt;It's important to understand the order in which tasks will be executed. Specifically, if you have two tasks that are dependent on each other, which one will be executed first?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gs42lz29x9ufz0s01ky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gs42lz29x9ufz0s01ky.png" alt="Airflow - Tasks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the example above, we have a DAG with four tasks: task1, task2, task3, and task4. Task1 is the first task in the sequence, but which of the next two tasks - task2 or task3 - will be executed first?&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;SequentialExecutor:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The answer to the previous question is that they will be executed sequentially, one after the other. Even though the &lt;em&gt;bit shift&lt;/em&gt; operator (&amp;gt;&amp;gt;) only defines the dependency on task1, the SequentialExecutor runs a single task at a time, so task2 will be executed first, followed by task3. This &lt;code&gt;sequential execution is useful for debugging&lt;/code&gt;, as it allows you to see the output of each task before moving on to the next one. To configure your DAG for sequential execution, set the &lt;code&gt;executor&lt;/code&gt; parameter to &lt;code&gt;SequentialExecutor&lt;/code&gt; in your Airflow configuration file; this executor is the default if the configuration file is untouched. You'll also need to specify a &lt;code&gt;sql_alchemy_conn&lt;/code&gt; parameter, which tells Airflow where to store the metadata for your DAGs.&lt;/p&gt;

&lt;p&gt;You can discover the values for these parameters by running the following commands in your terminal (under pipenv):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="nt"&gt;--------------------------------------------------------------&lt;/span&gt;
airflow config get-value core sql_alchemy_conn
airflow config get-value core executor
    &lt;span class="nt"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;p&gt;&lt;strong&gt;&lt;code&gt;It's worth noting&lt;/code&gt;&lt;/strong&gt; that &lt;code&gt;if you're using a SQLite&lt;/code&gt; database to store your DAG metadata, &lt;code&gt;you won't be able to run multiple write operations at the same time&lt;/code&gt;. This means that if you have multiple tasks that are trying to write to the database simultaneously, you may run into issues. If you anticipate a high volume of write operations, you may want to consider using a different database backend that can handle concurrent writes more effectively.&lt;/p&gt;
&lt;h3&gt;
  
  
&lt;strong&gt;LocalExecutor:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As you can see, the &lt;code&gt;SequentialExecutor is not that useful if you want to run multiple tasks in parallel on a single machine&lt;/code&gt;. Here comes the &lt;code&gt;LocalExecutor&lt;/code&gt;, which helps increase the efficiency of workflows and reduce overall execution time.&lt;/p&gt;

&lt;p&gt;To change the executor to LocalExecutor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, you should have a PostgreSQL database.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Install the necessary packages by running the command: &lt;code&gt;pip install 'apache-airflow[postgres]'&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Update the Airflow configuration file (airflow.cfg) by changing the &lt;code&gt;sql_alchemy_conn&lt;/code&gt; parameter &lt;code&gt;in the [core] section&lt;/code&gt; to the Postgres connection string&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Verify that the database is set up correctly by running the command &lt;code&gt;airflow db check&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Change the executor to LocalExecutor by updating the &lt;code&gt;executor parameter in the [core]&lt;/code&gt; section of airflow.cfg.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Initialize the database by running the command &lt;code&gt;airflow db init&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Create a user account by running the command &lt;code&gt;airflow users create --username admin --password admin --role admin --firstname ali --lastname khyar --email admin@airflow.com&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start the Airflow webserver and scheduler by running the commands &lt;code&gt;airflow webserver&lt;/code&gt; and &lt;code&gt;airflow scheduler&lt;/code&gt;, respectively.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Run your DAG&lt;/code&gt; and &lt;code&gt;check the Gantt view&lt;/code&gt; to see parallel execution in action: &lt;code&gt;with the LocalExecutor, task2 and task3 run at the same time (as subprocesses)&lt;/code&gt; on a single machine.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The above steps will allow us to use the LocalExecutor instead of the SequentialExecutor, hence running tasks in parallel and improving execution time.&lt;/p&gt;
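Putting the steps together, the relevant airflow.cfg entries end up looking roughly like this (the connection string is a placeholder for your own Postgres host and credentials):

```ini
[core]
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow_db
```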

&lt;p&gt;OK, the LocalExecutor is nice: it allows us to run tasks in parallel on a single machine. &lt;code&gt;But what if our single machine runs out of resources? How can we scale Airflow?&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzakuew214ji40jb4xk0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzakuew214ji40jb4xk0.png" alt="Airflow - data pipeline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here come the KubernetesExecutor and the CeleryExecutor to the rescue.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;CeleryExecutor:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The CeleryExecutor allows Airflow to scale out its worker nodes, using the distributed task system provided by Celery to spread execution among multiple machines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpzpmo9s6qm5gr8hn5mf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpzpmo9s6qm5gr8hn5mf.png" alt="Airflow - celery"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To configure CeleryExecutor, we follow the below steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Install the necessary packages by running the command &lt;code&gt;pip install 'apache-airflow[celery]'&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Install Redis by running the command &lt;code&gt;sudo apt update &amp;amp;&amp;amp; sudo apt install redis-server&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Modify the Redis configuration file (&lt;code&gt;sudo nano /etc/redis/redis.conf&lt;/code&gt;) by adding the following lines:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;-------------------------&lt;/span&gt;
supervised systemd
&lt;span class="nt"&gt;-------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Restart the Redis service by running the commands &lt;code&gt;sudo systemctl restart redis.service&lt;/code&gt; and &lt;code&gt;sudo systemctl status redis.service&lt;/code&gt; to ensure that it is running correctly.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;In the Airflow configuration file (airflow.cfg), change the executor to &lt;code&gt;CeleryExecutor&lt;/code&gt;, &lt;code&gt;update the broker_url parameter&lt;/code&gt; to &lt;code&gt;redis://localhost:6379/0&lt;/code&gt; (where 0 is the Redis database number), and set the &lt;code&gt;result_backend&lt;/code&gt; parameter to &lt;code&gt;the same value as sql_alchemy_conn&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;To interact with Redis from Airflow, install the apache-airflow[redis] package: &lt;code&gt;pip install 'apache-airflow[redis]'&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;
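Collected in one place, the airflow.cfg changes from the steps above look roughly like this (connection strings are placeholders, and section names can vary slightly between Airflow versions):

```ini
[core]
executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow_db

[celery]
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow_user:airflow_pass@localhost:5432/airflow_db
```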

&lt;h3&gt;
  
  
  &lt;strong&gt;Celery  parameters (Good to Know):&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In order to optimize task execution with the CeleryExecutor, you can adjust the parameters below in the airflow.cfg file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;parallelism&lt;/code&gt;: This parameter specifies the maximum number of tasks that can be executed concurrently across the entire Airflow installation. The default value is 32; &lt;code&gt;if you set it to 1, Airflow will behave like a sequential executor&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;dag_concurrency&lt;/code&gt;: This parameter limits the maximum number of tasks that can be executed concurrently for a specific DAG. The default is 16, but &lt;code&gt;it can be overridden on a DAG level by setting the concurrency parameter&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;max_active_runs_per_dag&lt;/code&gt;: This parameter limits the maximum number of DAG runs that can be executed concurrently for a specific DAG. The default value is 16, but it &lt;code&gt;can be overridden on a DAG level by setting the max_active_runs parameter&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
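In airflow.cfg these settings sit under the `[core]` section; the fragment below simply restates the defaults described above:

```ini
[core]
parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 16
```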

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;Note:&lt;/code&gt;&lt;/strong&gt; the &lt;code&gt;priority&lt;/code&gt; of these parameters &lt;code&gt;is parallelism &amp;gt; dag_concurrency&lt;/code&gt;. If you set parallelism to a low value, it limits the number of tasks that can run concurrently across the entire Airflow installation, regardless of the value of dag_concurrency. If parallelism is set high enough, the maximum number of tasks that can run concurrently for a specific DAG is instead limited by dag_concurrency.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Grouping tasks:
&lt;/h2&gt;

&lt;p&gt;Sometimes tasks within a DAG can be hard to manage if there are many of them or if complex processing is involved, so you need to group tasks. You can either go with SubDAGs (not recommended, but good to know) or with TaskGroups.&lt;br&gt;
The idea is to move from something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw580c59e7cf4shvz4hwy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw580c59e7cf4shvz4hwy.png" alt="Airflow - from"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;to something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6iap70hp7j6vfm6o5zg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6iap70hp7j6vfm6o5zg.png" alt="Airflow - to"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;SubDAGs:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A SubDAG allows you to bundle related tasks within a DAG into a manageable DAG (a DAG within a DAG).&lt;br&gt;
You create SubDAGs by writing a function that returns a DAG object (encapsulating the tasks). Here's an example of a SubDAG named &lt;em&gt;subdag_task&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.bash_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.subdag_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SubDagOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;subdag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;task_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subdag_task_1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SubDAG task 1&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;task_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subdag_task_2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SubDAG task 2&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;task_1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;task_2&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;

&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;airflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parent_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;task_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parent_task_1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Parent DAG task 1&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;subdag_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SubDagOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subdag_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;subdag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;subdag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parent_dag.subdag_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;task_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parent_task_2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Parent DAG task 2&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;task_1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;subdag_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;task_2&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SubDAGs seem like a cool feature, but they have their dark side, like everything in life. Even if you change Airflow's executor, tasks within a SubDAG run with the SequentialExecutor by default, which slows down the total execution time. You may also run into deadlocks (DAGs waiting for each other to complete, causing a circular dependency that cannot be resolved).&lt;/p&gt;

&lt;p&gt;Hence, TaskGroups were introduced in Airflow 2.0.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;TaskGroups:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;TaskGroups allow you to group related tasks and manage them more easily. They are defined with the &lt;code&gt;TaskGroup&lt;/code&gt; class as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.utils.task_group&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TaskGroup&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;task_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;TaskGroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;group_1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;group_1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;task_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="n"&gt;task_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;task_4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you can set dependencies the usual way, either with the bitshift operator (&amp;gt;&amp;gt;) or with &lt;code&gt;set_upstream&lt;/code&gt;; for the example above you can use:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;task_4.set_upstream(group_1)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can go as deep as you want with TaskGroups and do things such as nesting groups, like in the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.utils.task_group&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TaskGroup&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;task_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;TaskGroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;group_1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;group_1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;task_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;TaskGroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subgroup_1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;subgroup_1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;task_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
            &lt;span class="n"&gt;task_4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="n"&gt;task_5&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;task_6&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Grouping tasks with TaskGroup makes your DAG code cleaner, more manageable, and easier to read. Powerful, innit?&lt;/p&gt;
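&lt;p&gt;One side effect worth knowing: by default, tasks inside a TaskGroup get their task_ids prefixed with the group id, and nested group ids are joined with dots. The helper below is a toy sketch of that naming convention (the &lt;code&gt;prefixed_task_id&lt;/code&gt; function is hypothetical, for illustration only; it is not part of Airflow's API):&lt;/p&gt;

```python
def prefixed_task_id(group_ids, task_id):
    # Toy sketch (not Airflow API): mimic default TaskGroup naming,
    # where enclosing group ids are joined with dots before the task id.
    return '.'.join(list(group_ids) + [task_id])

# task_2 inside group_1 from the first example above:
print(prefixed_task_id(['group_1'], 'task_2'))
# task_3 inside subgroup_1 nested in group_1 from the nested example:
print(prefixed_task_id(['group_1', 'subgroup_1'], 'task_3'))
```

&lt;p&gt;These prefixed ids are what you see in the Graph view and what you must return from branching callables when targeting a task inside a group.&lt;/p&gt;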

&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Sharing data with XComs:
&lt;/h2&gt;

&lt;p&gt;We came across XComs earlier without going into detail. XComs in Airflow are a way of exchanging data between tasks: they are essentially key-value pairs with a timestamp, used through push and pull operations. Suppose we have a task that downloads files from a storage account, and we need to pass the list of downloaded files to a downstream task. Here's how the push operation can be done from a task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="n"&gt;file_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file1.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file2.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file3.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;task_instance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_instance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;task_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file_list&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and a pull operation in another task can be done as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="n"&gt;task_instance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_instance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;file_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file_list&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;file_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# download the file
&lt;/span&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
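&lt;p&gt;To make the push/pull mechanics concrete outside a running Airflow deployment, here is a minimal sketch that mimics XCom semantics with a plain dictionary (the &lt;code&gt;XComStore&lt;/code&gt; class is hypothetical, for illustration only; real XComs are persisted in Airflow's metadata database):&lt;/p&gt;

```python
from datetime import datetime, timezone

class XComStore:
    """Toy stand-in for Airflow's XCom table: key-value pairs plus a timestamp."""
    def __init__(self):
        self._data = {}

    def xcom_push(self, key, value):
        # Each entry records the value and the moment it was pushed.
        self._data[key] = (value, datetime.now(timezone.utc))

    def xcom_pull(self, key):
        value, _timestamp = self._data[key]
        return value

# The upstream task pushes the list of downloaded files...
store = XComStore()
store.xcom_push(key='file_list', value=['file1.txt', 'file2.txt', 'file3.txt'])

# ...and the downstream task pulls it back and processes each file.
for file in store.xcom_pull(key='file_list'):
    print(file)
```

&lt;p&gt;Keep in mind that real XComs go through the metadata database, so they are meant for small pieces of data (ids, paths, short lists), not for passing large files between tasks.&lt;/p&gt;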



&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tasks conditioning:
&lt;/h2&gt;

&lt;p&gt;I don't know if this is officially called task conditioning XD, but the idea is to choose which downstream task to execute based on an upstream value pushed to XComs. Such branching can be done using &lt;code&gt;BranchPythonOperator&lt;/code&gt; to define task execution rules. Here's an example where the &lt;code&gt;choose_next_task&lt;/code&gt; function returns the id of the next task to execute, based on the XCom value of &lt;code&gt;data_type&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;choose_next_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_instance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;branching_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BranchPythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;branching_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;choose_next_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;provide_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;task_A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data type A processed&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;task_B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data type B processed&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;task_C&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data type not recognized&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;branching_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task_A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_C&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
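&lt;p&gt;The branching decision itself is plain Python, so you can exercise it without Airflow. Below is a sketch that replays the same logic against a fake context (the &lt;code&gt;FakeTaskInstance&lt;/code&gt; class is an assumption for illustration, standing in for the real task instance Airflow injects into the context):&lt;/p&gt;

```python
class FakeTaskInstance:
    """Minimal stand-in for the task_instance object Airflow puts in the context."""
    def __init__(self, xcoms):
        self._xcoms = xcoms

    def xcom_pull(self, key):
        return self._xcoms.get(key)

def choose_next_task(**context):
    # Same branching logic as in the DAG above: route on the pushed data_type.
    data_type = context['task_instance'].xcom_pull(key='data_type')
    if data_type == 'A':
        return 'task_A'
    elif data_type == 'B':
        return 'task_B'
    else:
        return 'task_C'

# Replay the three branches: 'A' and 'B' are routed directly,
# anything else falls through to task_C.
for pushed, expected in [('A', 'task_A'), ('B', 'task_B'), ('Z', 'task_C')]:
    ti = FakeTaskInstance({'data_type': pushed})
    assert choose_next_task(task_instance=ti) == expected
```

&lt;p&gt;The string returned by the callable must match the task_id of an existing downstream task; the tasks that are not chosen get skipped rather than failed.&lt;/p&gt;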



&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Trigger rules:
&lt;/h2&gt;

&lt;p&gt;Sometimes we don't need all upstream tasks to succeed before running a downstream task, or we may want to react as soon as one of them fails. This can be done through trigger rules, which let you run a downstream task based on the final execution status of its upstream tasks. The available trigger rules include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;all_success: (default) all parents have succeeded&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;all_failed: all parents are in a failed or upstream_failed state&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;all_done: all parents are done with their execution&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;one_failed: fires as soon as at least one parent has failed, it does not wait for all parents to be done&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;one_success: fires as soon as at least one parent succeeds, it does not wait for all parents to be done&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;none_failed: all parents have not failed (failed or upstream_failed) i.e. all parents have succeeded or been skipped&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;none_skipped: no parent is in a skipped state, i.e. all parents are in a success, failed, or upstream_failed state&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;dummy: dependencies are just for show, trigger at will&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
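&lt;p&gt;Each rule is just a predicate over the states of the parent tasks. The sketch below expresses a few of them in plain Python (the &lt;code&gt;fires&lt;/code&gt; helper is hypothetical, for illustration only; it is not Airflow's scheduler logic):&lt;/p&gt;

```python
def fires(rule, parent_states):
    """Toy evaluation (not Airflow API) of trigger rules over parent task states."""
    if rule == 'all_success':
        return all(s == 'success' for s in parent_states)
    if rule == 'all_failed':
        return all(s in ('failed', 'upstream_failed') for s in parent_states)
    if rule == 'all_done':
        done = {'success', 'failed', 'upstream_failed', 'skipped'}
        return all(s in done for s in parent_states)
    if rule == 'one_failed':
        return any(s == 'failed' for s in parent_states)
    if rule == 'one_success':
        return any(s == 'success' for s in parent_states)
    if rule == 'none_failed':
        return all(s not in ('failed', 'upstream_failed') for s in parent_states)
    if rule == 'none_skipped':
        return all(s != 'skipped' for s in parent_states)
    raise ValueError(f'unknown rule: {rule}')

assert fires('all_success', ['success', 'success'])      # default behavior
assert fires('one_failed', ['success', 'failed'])        # reacts to the first failure
assert not fires('none_skipped', ['success', 'skipped']) # a skip blocks this rule
```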

&lt;p&gt;Below is an example where the downstream tasks fire as soon as any upstream task fails; this is achieved using &lt;code&gt;trigger_rule='one_failed'&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.bash_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;

&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;airflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;depends_on_past&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;example_alerting&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;task_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello World from Task 1&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;trigger_rule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;one_failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;task_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello World from Task 2&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;trigger_rule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;one_failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;task_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello World from Task 3&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;trigger_rule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;one_failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;task_4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello World from Task 4&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;trigger_rule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;one_failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;task_1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task_2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;task_4&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion:
&lt;/h3&gt;

&lt;p&gt;Apache Airflow is an awesome tool for creating and managing data pipelines; thanks to AkumenIA for introducing such a great tool to me. In the next blog about Airflow, we are going to see how to use it in the cloud with Kubernetes (AKS), and how to monitor its cluster using the ELK stack. Thank you for reading.&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>apacheairflow</category>
    </item>
    <item>
      <title>Terraform 101 - Part 3/3: Modules, Built-in Functions, Type Constraints, and Dynamic Blocks | By Ali KHYAR</title>
      <dc:creator>Ali KHYAR</dc:creator>
      <pubDate>Sat, 08 Oct 2022 02:32:09 +0000</pubDate>
      <link>https://dev.to/alikhyar/terraform-101-part-33-modules-built-in-functions-type-constraints-and-dynamic-blocks-by-ali-khyar-58i3</link>
      <guid>https://dev.to/alikhyar/terraform-101-part-33-modules-built-in-functions-type-constraints-and-dynamic-blocks-by-ali-khyar-58i3</guid>
      <description>&lt;p&gt;Part 1/3: History, Workflow, and Resource Addressing: &lt;a href="https://dev.to/alikhyar/terraform-101-part-13-history-workflow-and-resource-addressing-by-ali-khyar-4m23"&gt;link here&lt;/a&gt;&lt;br&gt;
Part 2/3: State, Variables, Outputs, and provision: &lt;a href="https://dev.to/alikhyar/terraform-101-part-23-state-variables-outputs-and-provisioners-by-ali-khyar-lno"&gt;link here&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Modules:
&lt;/h2&gt;

&lt;p&gt;Simply put, a module is a container for multiple resources that are used together; modules help you avoid reinventing the wheel. Modules can take inputs (optional) and return outputs (optional).&lt;br&gt;
One module you have already interacted with is the root module, which embodies the code files from the main working directory. When modules are called from another one, the called modules are considered child modules.&lt;br&gt;
Modules can be downloaded and called from:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Terraform public registry: contains collections of publicly available modules, which get downloaded (when you reference them) into a hidden folder on your local system.&lt;/li&gt;
&lt;li&gt;Private registry: you will probably go with this for closed-source code or security reasons.&lt;/li&gt;
&lt;li&gt;Local system: when you have module folders saved on your local system, either in the configuration code folder or elsewhere, you reference them using an absolute or relative path.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's look at the snippet below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HCKYe6zM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c4nnu9jjg5xsoe2hog92.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HCKYe6zM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c4nnu9jjg5xsoe2hog92.png" alt="Image description" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Defining a module requires the reserved keyword &lt;code&gt;module&lt;/code&gt; followed by the module's name, which in the example above is vpc_module. The two main parameters that should be inside every module block are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;source: path to module folder.&lt;/li&gt;
&lt;li&gt;version: a best practice to always keep track of the module's version, so you can avoid any unwanted side effects when deploying/redeploying the resource.&lt;/li&gt;
&lt;/ul&gt;
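&lt;p&gt;As a minimal sketch of a module call (the module name, registry address, and version below are illustrative, not taken from the snippet above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "vpc_module" {
  # source: registry address (or path) of the module folder
  source  = "terraform-aws-modules/vpc/aws"

  # version: pin the module version to avoid unwanted side effects on redeploys
  version = "3.14.0"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;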

&lt;p&gt;Other allowed parameters in modules are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;built-in functions&lt;/strong&gt; like max, count, tolist, for_each, …, which we will discuss later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;providers:&lt;/strong&gt; which bind the module to a certain provider.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;depends_on:&lt;/strong&gt; sets up dependencies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As mentioned before, modules can optionally take inputs and return outputs. Outputs are defined in the module's output block and can be referenced as in the snippet below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3MQvfAQL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a560fkfav1jvh7dby4lm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3MQvfAQL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a560fkfav1jvh7dby4lm.png" alt="Image description" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at the snippet above, you should know that an output named &lt;strong&gt;subnet_id&lt;/strong&gt; is defined inside the &lt;strong&gt;vpc-module&lt;/strong&gt; module. When referencing module outputs, we always start with the &lt;code&gt;module&lt;/code&gt; keyword.&lt;br&gt;
In the snippet already seen:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9o5yx5-j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/arpnimytwbxs1x5nmat5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9o5yx5-j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/arpnimytwbxs1x5nmat5.png" alt="Image description" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;we have &lt;strong&gt;region&lt;/strong&gt;, which is considered an input for this module. It is arbitrarily named: we only define it in the module block so we can use it later inside the module's code with the syntax &lt;code&gt;var.region&lt;/code&gt;.&lt;/p&gt;
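&lt;p&gt;Putting inputs and outputs together, a hedged sketch (the resource and attribute names are illustrative) might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "vpc_module" {
  source = "./vpc-module"
  region = "us-east-1"   # input, read inside the module as var.region
}

resource "aws_instance" "web" {
  ami           = "ubuntu-focal-20.04-amd64-server"
  instance_type = "t3.micro"
  # reference a module output: module.MODULE_NAME.OUTPUT_NAME
  subnet_id     = module.vpc_module.subnet_id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;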

&lt;h2&gt;
  
  
  Built-In Functions:
&lt;/h2&gt;

&lt;p&gt;Built-in functions are expressions that allow you to take a value from somewhere and transform, evaluate, or convert it. Users cannot define custom functions, but the list of already defined functions is extensive.&lt;br&gt;
Calling built-in functions in Terraform is like calling functions in any programming language: &lt;strong&gt;funcName(arg1, arg2, … )&lt;/strong&gt;. Let's take a look at the &lt;strong&gt;join&lt;/strong&gt; function, which produces a string by concatenating all elements of a given list of strings with a given delimiter:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FHJBnu9G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tl7xvh4o2m52ntmob7os.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FHJBnu9G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tl7xvh4o2m52ntmob7os.png" alt="Image description" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The delimiter in the above snippet is a hyphen (-), and the elements between brackets are the strings to concatenate, which results in the string &lt;strong&gt;my-project-name-preprod&lt;/strong&gt;.&lt;br&gt;
Terraform happily provides a console to test things such as built-in functions and expressions; to access it, use the command &lt;code&gt;terraform console&lt;/code&gt;.&lt;/p&gt;
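&lt;p&gt;For example, you can try &lt;strong&gt;join&lt;/strong&gt; directly inside &lt;code&gt;terraform console&lt;/code&gt; (the list elements here are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; join("-", ["my", "project", "name", "preprod"])
"my-project-name-preprod"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;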

&lt;h2&gt;
  
  
  Type Constraints:
&lt;/h2&gt;

&lt;p&gt;So far, we have seen primitive types like string, number, and boolean, which control the type of given variable values.&lt;br&gt;
Another kind of type is the complex type, which is created by combining multiple types; examples of complex types are list, tuple, map, and object. Complex types themselves can be divided into two categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collection: multiple values of one primitive type grouped together in one variable, for example:
&lt;code&gt;list(type)
map(type)
set(type)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y1Beg5Yy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/anzyxwy28eq8fuhbtvii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y1Beg5Yy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/anzyxwy28eq8fuhbtvii.png" alt="Image description" width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structural: multiple values of different primitive types grouped together.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1ufZd2xy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8rspkvbfbq57yeg5ybkx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1ufZd2xy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8rspkvbfbq57yeg5ybkx.png" alt="Image description" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another constraint type is the &lt;code&gt;any&lt;/code&gt; constraint which serves as a placeholder for a primitive type not decided yet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K5_3UQSO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1xihzise4utbtpczk2f7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K5_3UQSO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1xihzise4utbtpczk2f7.png" alt="Image description" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Terraform does its best to figure out which primitive type to assign to &lt;code&gt;any&lt;/code&gt;; in the example above, it will assign the string type.&lt;/p&gt;
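&lt;p&gt;A minimal sketch of the &lt;code&gt;any&lt;/code&gt; constraint (the variable name and default are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "environments" {
  type    = list(any)
  default = ["dev", "staging", "prod"]   # Terraform infers string as the element type
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;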

&lt;h2&gt;
  
  
  Dynamic Blocks:
&lt;/h2&gt;

&lt;p&gt;Dynamic blocks allow the construction of repeatable nested configuration blocks inside the following Terraform blocks: resource, data, provisioner, and provider.&lt;br&gt;
Imagine the following scenario, where you need to create a security group that contains many rules:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9PDARCex--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pfd3pi6xa9agnoj0fp3i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9PDARCex--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pfd3pi6xa9agnoj0fp3i.png" alt="Image description" width="800" height="650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the ingress rules add up, the security group becomes hard to manage, and the code doesn't look clean either. A way to clean up the above code is to use dynamic blocks. First, we can extract the data from the ingress blocks into one variable that looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EaffHGMh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x7rowsl45hon7vomk5al.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EaffHGMh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x7rowsl45hon7vomk5al.png" alt="Image description" width="800" height="716"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;then by using the following snippet:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--d7seK7Wb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/51rrm8gnjhdjlj9h6dce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--d7seK7Wb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/51rrm8gnjhdjlj9h6dce.png" alt="Image description" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the &lt;strong&gt;dynamic&lt;/strong&gt; keyword, we tell Terraform which block we want to replicate, in this case &lt;strong&gt;ingress&lt;/strong&gt;. Then we assign our variable to loop through; and inside &lt;strong&gt;content&lt;/strong&gt;, Terraform implicitly provides an &lt;strong&gt;ingress&lt;/strong&gt; object whose values we access with the &lt;strong&gt;value&lt;/strong&gt; keyword. The object name matches the dynamic block's label, &lt;strong&gt;ingress&lt;/strong&gt;.&lt;/p&gt;
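&lt;p&gt;The pattern described above can be sketched as follows (the variable name and port values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "ingress_ports" {
  type    = list(number)
  default = [22, 80, 443]
}

resource "aws_security_group" "web_sg" {
  name = "web-sg"

  # replicate the nested ingress block once per element of var.ingress_ports
  dynamic "ingress" {
    for_each = var.ingress_ports
    content {
      # ingress.value holds the current element of the list
      from_port   = ingress.value
      to_port     = ingress.value
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;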




&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;Hope this blog gave you an idea of how modules, built-in functions, type constraints, and dynamic blocks work. In the next article, we're going to look at some hacks and tricks that can be used in Terraform.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Terraform 101 - Part 2/3: State, Variables, Outputs, and provisioners | By Ali KHYAR</title>
      <dc:creator>Ali KHYAR</dc:creator>
      <pubDate>Sat, 08 Oct 2022 00:41:24 +0000</pubDate>
      <link>https://dev.to/alikhyar/terraform-101-part-23-state-variables-outputs-and-provisioners-by-ali-khyar-lno</link>
      <guid>https://dev.to/alikhyar/terraform-101-part-23-state-variables-outputs-and-provisioners-by-ali-khyar-lno</guid>
      <description>&lt;p&gt;Part 1/3: History, Workflow, and Resource Addressing: &lt;a href="https://dev.to/alikhyar/terraform-101-part-13-history-workflow-and-resource-addressing-by-ali-khyar-4m23"&gt;link here&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  State:
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Concepts and local storage:
&lt;/h2&gt;

&lt;p&gt;State in Terraform is the mechanism that makes Terraform act the way it does: it's what helps it map real-world resources to your configuration. Why is it essential? Because with it Terraform can track which resources are deployed, so that next time you apply new configuration code, Terraform can decide which resources need to be created, updated, or destroyed; this is done by comparing the state file with the configuration code.&lt;/p&gt;

&lt;p&gt;Terraform state is tracked through a flat file named &lt;code&gt;terraform.tfstate&lt;/code&gt; by default, a JSON dump that contains metadata and data about deployed resources. If no backend is specified in the configuration code, the state is stored locally, but as a better practice the state should be stored remotely for integrity and availability across teams.&lt;br&gt;
Besides storing the state remotely, it is also recommended not to lose it, because without it you have no way to know which resources were previously deployed. You also wouldn't want the state file to fall into the wrong hands, because it may contain sensitive data about your resources.&lt;/p&gt;

&lt;p&gt;Terraform state has 3 common sub-commands:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--B36swHxc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/98sljijbnf2dau1k08d2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--B36swHxc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/98sljijbnf2dau1k08d2.png" alt="Image description" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first command is used to list tracked resources by Terraform state. The second one is used to show details of a tracked resource. The last command is used to remove resources from the state file so they won't be tracked anymore.&lt;/p&gt;
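&lt;p&gt;On the CLI, the three sub-commands look like this (the resource address is the one used later in this example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform state list                              # list all tracked resources
terraform state show docker_image.busybox-image   # show details of one resource
terraform state rm docker_image.busybox-image     # stop tracking a resource
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;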

&lt;p&gt;Let's see the configuration below which provisions a docker image resource and spin up a container of that image locally:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z2c2ErP2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x84pz2lh2hd372p8034s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z2c2ErP2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x84pz2lh2hd372p8034s.png" alt="Image description" width="630" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, we will initialize the state with &lt;code&gt;terraform init&lt;/code&gt;, a command that creates a folder named &lt;code&gt;.terraform&lt;/code&gt;, a local cache where Terraform retains files it will need for subsequent operations against this configuration (providers …). &lt;code&gt;terraform init&lt;/code&gt; also creates &lt;code&gt;.terraform.lock.hcl&lt;/code&gt;, a dependency lock file that gets created or updated whenever &lt;code&gt;terraform init&lt;/code&gt; is run.&lt;br&gt;
After initializing the working directory, we can run &lt;code&gt;terraform plan&lt;/code&gt; (not mandatory) to see which resources will get deployed, then run &lt;code&gt;terraform apply&lt;/code&gt; to deploy the actual resources. Running &lt;code&gt;terraform apply&lt;/code&gt; will create a state file named &lt;code&gt;terraform.tfstate&lt;/code&gt; which keeps track of managed resources.&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;terraform state list&lt;/code&gt; will show tracked resources which in this case will return:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G-sZZRMZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/urox2xez67ue3g3unodk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G-sZZRMZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/urox2xez67ue3g3unodk.png" alt="Image description" width="665" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;which are the two resources we deployed. We can see more information about the state of each resource by running &lt;code&gt;terraform state show &amp;lt;resource_type.resource_name&amp;gt;&lt;/code&gt; that will return in the case of the docker image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--p7x45zaC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/psnive4im1o5onozx05z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--p7x45zaC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/psnive4im1o5onozx05z.png" alt="Image description" width="800" height="152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's remove the docker image resource from being tracked with &lt;code&gt;terraform state rm docker_image.busybox-image&lt;/code&gt;, then destroy the resources with &lt;code&gt;terraform destroy&lt;/code&gt;. Since the image is no longer tracked, it won't be destroyed, and Terraform will only destroy the container, as the command output shows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_zemPg3---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9l0nk8ntxfzpnkgajnhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_zemPg3---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9l0nk8ntxfzpnkgajnhh.png" alt="Image description" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  State Storage:
&lt;/h2&gt;

&lt;p&gt;The default behavior of Terraform state is to be stored locally, but for better availability, security, and visibility across teams, it's a better practice to store state remotely, such as in HashiCorp Consul, AWS S3, or an Azure Storage Account. Remote state storage allows, among many other things, sharing outputs with other code elsewhere. You can set up where the state file is stored in the &lt;code&gt;terraform&lt;/code&gt; block, using the &lt;code&gt;backend&lt;/code&gt; attribute. In AWS S3, the configuration will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z_va7ynK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6xphc09li8ma9paj5uhf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z_va7ynK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6xphc09li8ma9paj5uhf.png" alt="Image description" width="503" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This assumes we have a bucket created called &lt;code&gt;mybucket&lt;/code&gt;. The Terraform state is written to the key &lt;code&gt;path/to/my/key&lt;/code&gt;.&lt;br&gt;
Using Azure, you can also store the state as a Blob with the given Key within the Blob Container within the Blob Storage Account. Those are some configuration examples:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VXusHvQY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xnamq0cx2u9k1t5m8fbr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VXusHvQY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xnamq0cx2u9k1t5m8fbr.png" alt="Image description" width="775" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Variables:
&lt;/h2&gt;

&lt;p&gt;Variables in Terraform serve the same purpose as variables in programming languages: they are a way of storing data so that your configuration code stays clean and reusable. Variables are declared within Terraform using the following syntax:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ycpkfs9_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/csh84rdon7brb3wxvfty.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ycpkfs9_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/csh84rdon7brb3wxvfty.png" alt="Image description" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Variable types fall into two groups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;base: string ("anything between double quotes"), number (15, 0.15…), bool (true, false)&lt;/li&gt;
&lt;li&gt;complex: list(["same", "type"]), set, map ({name = "Mabel", age = 52}), object({ port = number service = string }), tuple.
We can define a variable type that combines one or more types, for instance:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BmSNErEX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mu1lrugymjmgjlvrfk0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BmSNErEX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mu1lrugymjmgjlvrfk0y.png" alt="Image description" width="780" height="850"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Terraform variables are referenced in configuration code with &lt;code&gt;var.name_of_var&lt;/code&gt;. When a variable gets values from several places, values supplied through OS environment variables (&lt;code&gt;TF_VAR_name&lt;/code&gt;) are overridden by the &lt;code&gt;terraform.tfvars&lt;/code&gt; file, and a default declared in the main configuration code is used only if no other value is supplied.&lt;br&gt;
Other parameters that are useful when declaring variables are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validation: useful for catching errors before the configuration starts running. A common example is below: testing an IP address against a &lt;code&gt;regex&lt;/code&gt; expression using the built-in regex function (we will take a look at built-in functions in part 3/3).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H_XQx7sB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vdegj1xp8c333f1gd1ot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H_XQx7sB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vdegj1xp8c333f1gd1ot.png" alt="Image description" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sensitive data: often you need to configure your infrastructure using sensitive or secret information such as usernames, passwords, API tokens, or Personally Identifiable Information (PII). When you do so, you need to ensure that this data is not accidentally exposed in CLI output, log output, or source control, a common solution is to set &lt;code&gt;sensitive&lt;/code&gt; flag to be &lt;code&gt;true&lt;/code&gt; within the variable configuration.&lt;/li&gt;
&lt;/ul&gt;
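&lt;p&gt;A hedged sketch combining both parameters (the variable name, condition, and error message are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "db_password" {
  type      = string
  sensitive = true   # value is redacted in CLI and log output

  validation {
    condition     = length(var.db_password) &amp;gt;= 12
    error_message = "The password must be at least 12 characters long."
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;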

&lt;h2&gt;
  
  
  Outputs:
&lt;/h2&gt;

&lt;p&gt;Output values give you back information in the CLI about deployed resources; they are like &lt;code&gt;return&lt;/code&gt; values in programming-language functions. Below is an output that gives back the private IP address of a deployed EC2 instance (resource type: aws_instance) named my-ec2:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Fe8SbR0s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0vd58mwrl8spqwhcnq4e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Fe8SbR0s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0vd58mwrl8spqwhcnq4e.png" alt="Image description" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Output variable values are shown in the CLI after a successful &lt;code&gt;terraform apply&lt;/code&gt;. You can still set &lt;code&gt;sensitive = true&lt;/code&gt; on outputs, in case they contain sensitive values.&lt;/p&gt;
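&lt;p&gt;As a text sketch of the output described above (following the my-ec2 resource name from the example; the output name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output "ec2_private_ip" {
  description = "Private IP address of the deployed instance"
  value       = aws_instance.my-ec2.private_ip
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;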

&lt;h2&gt;
  
  
  Provisioners:
&lt;/h2&gt;

&lt;p&gt;Provisioners give users the ability to execute commands and scripts through Terraform resources. You can run those commands/scripts either on the machine where Terraform is installed or on the resources that were created with Terraform. Each provisioner is attached to a certain resource and has the ability to connect to it (if it needs to) via protocols such as SSH.&lt;/p&gt;

&lt;p&gt;There are two provisioner types: "creation time" and "destroy-time" provisioners which you can set to run when a resource is being created or destroyed.&lt;/p&gt;

&lt;p&gt;Although provisioners look like a good feature, HashiCorp recommends not using them unless the cloud provider doesn't offer a mechanism for running commands or scripts on resources. One con of provisioners is that what they do is not tracked by Terraform state.&lt;/p&gt;

&lt;p&gt;Provisioners are recommended only when Terraform's declarative model doesn't already offer the action to be taken. If, while applying the configuration, a provisioner exits with a non-zero code, it is considered failed and the resource is tainted.&lt;/p&gt;

&lt;p&gt;The configuration below runs two provisioners on a null resource:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S2fyW-jm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rv9tk1xcsqqsfk479ma6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S2fyW-jm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rv9tk1xcsqqsfk479ma6.png" alt="Image description" width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One runs at resource creation and another (the one containing destroy) runs when the resource is destroyed; the two provisioners append 0 and 1 respectively to a file named status.txt. So, as you can tell, when you first provision the null_resource there will be 0 in status.txt, and after destroying it, there will be 01 in that file.&lt;/p&gt;
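&lt;p&gt;A minimal sketch of such a null resource (the exact commands are illustrative; the original snippet is in the image above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "null_resource" "demo" {
  # creation-time provisioner: runs on terraform apply
  provisioner "local-exec" {
    command = "printf 0 &amp;gt;&amp;gt; status.txt"
  }

  # destroy-time provisioner: runs on terraform destroy
  provisioner "local-exec" {
    when    = destroy
    command = "printf 1 &amp;gt;&amp;gt; status.txt"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;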

&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;I hope this blog helps you understand a bit more about Terraform state, variables, outputs, and provisioners. In the next blog, I'm going to talk about Terraform modules, built-in functions, and dynamic blocks.&lt;/p&gt;




&lt;p&gt;Part 3/3: Modules, Built-in Functions, Type Constraints, and Dynamic Blocks: link here&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>iac</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Terraform 101 - Part 1/3: History, Workflow, and Resource Addressing | By Ali KHYAR</title>
      <dc:creator>Ali KHYAR</dc:creator>
      <pubDate>Fri, 07 Oct 2022 23:11:00 +0000</pubDate>
      <link>https://dev.to/alikhyar/terraform-101-part-13-history-workflow-and-resource-addressing-by-ali-khyar-4m23</link>
      <guid>https://dev.to/alikhyar/terraform-101-part-13-history-workflow-and-resource-addressing-by-ali-khyar-4m23</guid>
      <description>&lt;h2&gt;
  
  
  About Terraform:
&lt;/h2&gt;

&lt;p&gt;Terraform is an open-source Infrastructure as Code (IaC) software tool, which simply means it enables you to write resource deployments, usually for the cloud, in a human-readable way. IaC is one of the better DevOps practices: infrastructure code is tracked and deployed in a repeatable, predictable manner.&lt;/p&gt;

&lt;p&gt;Back in 2011, when AWS CloudFormation appeared, one of the creators of Terraform saw the need for an open-source, cloud-agnostic tool that is not bound to one cloud provider and has the same functionality as CloudFormation. The idea of Terraform appeared in 2011, but the first lines of Golang code weren't written until July 2014, and version 0.1 only had support for AWS and DigitalOcean.&lt;/p&gt;

&lt;p&gt;Terraform uses its own language, known as the HashiCorp Configuration Language (HCL), which was created to have both a human- and machine-friendly syntax: it has a native syntax intended to be pleasant for humans to read and write, and a JSON-based variant that is easier for machines to generate and parse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terraform Workflow:
&lt;/h2&gt;

&lt;p&gt;The core terraform workflow has three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;write&lt;/strong&gt;: writing your code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;plan&lt;/strong&gt;: reads the code and previews changes; basically, it makes Terraform mock what the code will apply. You can do any number of iterations between the write and plan phases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;apply&lt;/strong&gt;: tells Terraform to provision real infrastructure and update the state file.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One other command you will need to know is &lt;code&gt;terraform destroy&lt;/code&gt;, which looks at the state file recorded during deployment and destroys all the resources that were created. It is a non-reversible command, so it should be used with caution.&lt;/p&gt;
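&lt;p&gt;The core workflow maps onto the CLI like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform init      # prepare the working directory (plugins, modules, backend)
terraform plan      # preview the changes the code would make
terraform apply     # provision real infrastructure and update the state file
terraform destroy   # destroy everything recorded in the state file (use with caution)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;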

&lt;h2&gt;
  
  
  Terraform Init:
&lt;/h2&gt;

&lt;p&gt;Terraform expects to be invoked from a working directory that contains configuration files written in the Terraform language. It uses the configuration content from this directory, and also uses the directory to store settings, cached plugins and modules, and sometimes state data. Hence, if the working directory hasn't been initialized yet, we should do so with the command &lt;code&gt;terraform init&lt;/code&gt;, which is like &lt;code&gt;git init&lt;/code&gt; for Terraform: it downloads modules and plugins (I will cover modules in part 3/3) and sets up the backend for storing the Terraform state file, the mechanism with which Terraform tracks resources. Note that if you run a command that relies on initialization without first initializing, the command will fail with an error explaining that you need to run init.&lt;/p&gt;

&lt;p&gt;When initializing the working directory, two items appear alongside the Terraform configuration files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;.terraform: a hidden directory, used to manage cached provider plugins and modules, a record of active workspace, and a record of backend configuration.&lt;/li&gt;
&lt;li&gt;State data file, if the configuration uses the default local backend. This is managed by Terraform in a &lt;code&gt;terraform.tfstate&lt;/code&gt; file (if the directory only uses the default workspace) or a &lt;code&gt;terraform.tfstate.d&lt;/code&gt; directory (if the directory uses multiple workspaces).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Terraform Configuration:
&lt;/h2&gt;

&lt;p&gt;A Terraform configuration typically starts with a block like the one below, which tells Terraform which provider we will interact with and defines its configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provider "aws"{
    region = "us-east-1"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In the configuration above, &lt;code&gt;provider&lt;/code&gt; is a reserved keyword; the name that follows it, here &lt;code&gt;aws&lt;/code&gt;, tells Terraform which provider to download and configure. Between the braces are the configuration arguments for that provider; the available arguments vary from provider to provider.&lt;/p&gt;
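&lt;p&gt;On recent Terraform versions it is also common to pin the provider's source and version in a &lt;code&gt;terraform&lt;/code&gt; block alongside the provider configuration (the version constraint below is an arbitrary example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~&gt; 5.0"   # example constraint: any 5.x release
    }
  }
}

provider "aws" {
  region = "us-east-1"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;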

&lt;p&gt;- - - - - - -&lt;/p&gt;

&lt;p&gt;The most important thing you'll configure with Terraform is resources. Resources are the components of your infrastructure. A resource might be a low-level component such as a physical server, virtual machine, or container, or a higher-level component such as an email provider, DNS record, or database. Let's look at the example below, which deploys an AWS EC2 instance:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_instance" "web" {
    ami           = "ami-0abcdef1234567890" # placeholder Ubuntu 20.04 AMI ID
    instance_type = "t3.micro"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;resource&lt;/code&gt; is a reserved keyword that tells Terraform to treat the block as a resource block. &lt;code&gt;"aws_instance"&lt;/code&gt; is a resource type implemented by the AWS provider; every provider is a plugin that implements resource types (for AWS, see &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest"&gt;https://registry.terraform.io/providers/hashicorp/aws/latest&lt;/a&gt;). &lt;code&gt;"web"&lt;/code&gt; is an arbitrary local name chosen by the user. The arguments between braces configure the resource; here we only specify the AMI and the instance type. For the full set of available arguments, see &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance"&gt;https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance&lt;/a&gt;.&lt;br&gt;
- - - - - - -&lt;br&gt;
Another block you need to know about in Terraform is the data source block. The main difference from a resource block is that a data source fetches and tracks details of an already existing resource, whereas a resource block creates one from scratch. The following example looks up an already deployed VM by its instance ID:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data "aws_instance" "apache-server" {
   instance_id = "some-random-id"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Resource Addressing:
&lt;/h2&gt;

&lt;p&gt;Let's imagine a scenario where you need to reference the deployed EC2 instance elsewhere in the configuration. You do that with the resource type, &lt;code&gt;aws_instance&lt;/code&gt;, followed by a dot and the arbitrary name you gave it, so the resource above is addressed as &lt;code&gt;aws_instance.web&lt;/code&gt;. Data source blocks work the same way, with a &lt;code&gt;data.&lt;/code&gt; prefix: &lt;code&gt;data.aws_instance.apache-server&lt;/code&gt;.&lt;/p&gt;
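&lt;p&gt;As a sketch, output blocks that read attributes through these addresses could look like the following (the attribute names come from the AWS provider's &lt;code&gt;aws_instance&lt;/code&gt; documentation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output "web_public_ip" {
  # resource address: &lt;type&gt;.&lt;name&gt;.&lt;attribute&gt;
  value = aws_instance.web.public_ip
}

output "apache_server_az" {
  # data source address: data.&lt;type&gt;.&lt;name&gt;.&lt;attribute&gt;
  value = data.aws_instance.apache-server.availability_zone
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;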

&lt;p&gt;If you need to reference a property of a resource from inside that same resource (for example, in a provisioner), use the &lt;code&gt;self&lt;/code&gt; attribute instead of the resource's own address, which would create a dependency cycle. For instance, &lt;code&gt;self.public_ip&lt;/code&gt; refers to the IP address of the EC2 instance being deployed.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
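&lt;p&gt;A minimal sketch of this pattern (the AMI ID, SSH user, and command are placeholder assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_instance" "web" {
  ami           = "ami-0abcdef1234567890" # placeholder AMI ID
  instance_type = "t3.micro"

  provisioner "remote-exec" {
    inline = ["echo connected"]

    connection {
      type = "ssh"
      user = "ubuntu"
      host = self.public_ip  # "self" refers to this aws_instance
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;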



&lt;p&gt;- - - - - -&lt;br&gt;
In &lt;a href="https://dev.to/alikhyar/terraform-101-part-23-state-variables-outputs-and-provisioners-by-ali-khyar-lno"&gt;the next blog&lt;/a&gt;, we will talk more about state, variables, provisioners, and modules.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>iac</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
