<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ali KHYAR</title>
    <description>The latest articles on DEV Community by Ali KHYAR (@alikhyar).</description>
    <link>https://dev.to/alikhyar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F939081%2F6c775f36-9339-4098-8f89-a5640a68ffe7.jpeg</url>
      <title>DEV Community: Ali KHYAR</title>
      <link>https://dev.to/alikhyar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alikhyar"/>
    <language>en</language>
    <item>
      <title>Apache Airflow - Deep Dive | All you need to know about Airflow</title>
      <dc:creator>Ali KHYAR</dc:creator>
      <pubDate>Fri, 24 Feb 2023 06:03:00 +0000</pubDate>
      <link>https://dev.to/alikhyar/apache-airflow-deep-dive-all-you-need-to-know-about-airflow-1pan</link>
      <guid>https://dev.to/alikhyar/apache-airflow-deep-dive-all-you-need-to-know-about-airflow-1pan</guid>
      <description>&lt;p&gt;This blog was originally published on ali-khyar.com, if you are interested in learning moreon similar subjects &lt;a href="https://ali-khyar.com/" rel="noopener noreferrer"&gt;visit here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;p&gt;What is a data pipeline?  &lt;/p&gt;

&lt;p&gt;Airflow for creating and orchestrating data pipelines&lt;/p&gt;

&lt;p&gt;DAGs and Operators&lt;/p&gt;

&lt;p&gt;Airflow's single-node architecture vs multi-node architecture&lt;/p&gt;

&lt;p&gt;Airflow Setup&lt;/p&gt;

&lt;p&gt;Airflow UI Views&lt;/p&gt;

&lt;p&gt;DAGs in Action&lt;/p&gt;

&lt;p&gt;DAGs Scheduling&lt;/p&gt;

&lt;p&gt;Backfilling And CatchUp&lt;/p&gt;

&lt;p&gt;Databases and Executors (Sequential, Local and Celery)&lt;/p&gt;

&lt;p&gt;Grouping tasks (SubDAGs and TaskGroups)&lt;/p&gt;

&lt;p&gt;Sharing data with XComs&lt;/p&gt;

&lt;p&gt;Tasks conditioning (BranchPythonOperator)&lt;/p&gt;

&lt;p&gt;Trigger rules&lt;/p&gt;

&lt;p&gt;Conclusion&lt;/p&gt;




&lt;h2&gt;
  
  
  What is a data pipeline?
&lt;/h2&gt;

&lt;p&gt;A data pipeline is a set of processes or tools that are used to move data from one place to another, and to transform and process that data along the way.&lt;br&gt;
A simple example of a data pipeline might involve extracting data from a source system, such as a database or a CSV file, and then using a series of transformation steps to clean and prepare the data for loading into a destination system, such as a data warehouse or a machine learning model.&lt;/p&gt;

&lt;p&gt;A data pipeline typically includes several stages, such as data extraction, data transformation, data validation, and data loading. These stages may involve a combination of manual and automated processes, and may include a variety of different tools and technologies.&lt;/p&gt;

&lt;p&gt;Data pipelines serve various use cases, such as Extract, Transform, Load (ETL); Extract, Transform, Load, Analyze (ETLA); Extract, Load, Transform (ELT); and many more.&lt;/p&gt;

&lt;p&gt;The complexity of the pipeline can vary depending on the scope and purpose; it can be as simple as gathering and combining data from multiple sources into a single, unified view or database. This can be done for a variety of reasons, such as to improve data quality, reduce data redundancy, or to make it easier to analyze and report on the data.&lt;/p&gt;

&lt;p&gt;For example, imagine a company that has been acquired by another company and now has multiple databases containing information about customers, sales, and inventory. In order to more easily analyze and report on the company's performance, the data from these multiple databases would need to be consolidated into a single database. This process would involve extracting the relevant data from each of the individual databases, cleaning and standardizing the data, and then loading it into the consolidated database.&lt;/p&gt;

&lt;p&gt;A simple example of an ETL data pipeline, using Python, is the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Import the necessary libraries: pandas to load data from the source, and SQLAlchemy's create_engine to connect to a PostgreSQL database&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_engine&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Read the sample file (Extract)&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Add a total column (Transform)&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Load data into destination (Load)&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql://username:password@host:port/database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Overall, data pipelines allow organizations to easily collect, process, and analyze large amounts of data, which helps them make data-driven decisions and improve business operations. A pipeline's architecture is typically based on the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data sources: These are the systems or sources from which data is extracted, such as databases, file systems, or external APIs.&lt;/li&gt;
&lt;li&gt;Data extraction: This step involves extracting the data from the sources and converting it into a format that can be used downstream in the pipeline.&lt;/li&gt;
&lt;li&gt;Data transformation: This step involves cleaning, formatting, and transforming the data to make it usable for the next step in the pipeline.&lt;/li&gt;
&lt;li&gt;Data loading: This step involves loading the transformed data into the target system, such as a data warehouse or a data lake.&lt;/li&gt;
&lt;li&gt;Data validation: This step involves validating the data to ensure that it meets the quality standards and requirements before it is loaded into the target system.&lt;/li&gt;
&lt;li&gt;Data monitoring: This step involves monitoring the pipeline to ensure that it is running smoothly and that data is flowing through it as expected.&lt;/li&gt;
&lt;li&gt;Error handling: This step involves handling any errors that may occur during the pipeline and alerting the appropriate parties.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some data pipeline architectures may also include additional steps such as data enrichment, or data warehousing for data analysis and reporting.&lt;/p&gt;
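&lt;p&gt;To make the validation stage above concrete, it can be sketched in plain Python. This is a minimal illustration, and the rules and field names (email, quantity) are hypothetical examples, not part of any specific pipeline:&lt;/p&gt;

```python
# A minimal data-validation step, as a plain-Python sketch.
# The rules and field names here are hypothetical examples.

def validate_rows(rows):
    """Split rows into valid and rejected, with a reason for each rejection."""
    valid, rejected = [], []
    for row in rows:
        if not row.get("email"):
            rejected.append((row, "missing email"))
        elif row.get("quantity", 0) < 0:
            rejected.append((row, "negative quantity"))
        else:
            valid.append(row)
    return valid, rejected

rows = [
    {"email": "a@example.com", "quantity": 2},
    {"email": "", "quantity": 1},
    {"email": "b@example.com", "quantity": -3},
]
valid, rejected = validate_rows(rows)
print(len(valid), len(rejected))  # 1 2
```

&lt;p&gt;In a real pipeline, rejected rows would typically be written to a quarantine location and surfaced through the monitoring and error-handling stages rather than silently dropped.&lt;/p&gt;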




&lt;h2&gt;
  
  
  Airflow for creating and orchestrating data pipelines
&lt;/h2&gt;

&lt;p&gt;As we saw, data pipelines are sets of tasks, run sequentially or in parallel, that move data between a source system and a target one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev9mavjifo59do3fe8fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev9mavjifo59do3fe8fo.png" alt="Airflow - data pipeline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Airflow is a popular open-source tool used to manage and schedule data pipeline tasks. It allows for the creation, management, and monitoring of workflows, which can include multiple tasks that are dependent on each other. These tasks can be defined as Python functions and can be scheduled to run on a specific schedule or triggered by certain events. Airflow also provides a web interface for monitoring the status of tasks and troubleshooting any issues that may arise.&lt;/p&gt;

&lt;p&gt;The tool is composed of five essential components/services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Webserver: a Flask server that serves the UI through Gunicorn&lt;/li&gt;
&lt;li&gt;Scheduler: The daemon in charge of workflows’ scheduling&lt;/li&gt;
&lt;li&gt;Metastore: a database where metadata is stored; any database is compatible as long as it is supported by SQLAlchemy (PostgreSQL recommended)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Executor: defines how tasks should be executed; the most commonly used ones are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sequential Executor: This is the simplest executor, which runs tasks sequentially on the same machine as the Airflow scheduler. It is the default executor and is suitable for small and simple use cases.&lt;/li&gt;
&lt;li&gt;Local Executor: This executor runs tasks concurrently on the same machine as the Airflow scheduler. It is similar to the Sequential Executor but allows for parallelism.&lt;/li&gt;
&lt;li&gt;Celery Executor: This executor runs tasks concurrently on a separate worker machine or a group of worker machines. It uses the Celery distributed task queue to manage the execution of tasks.&lt;/li&gt;
&lt;li&gt;Kubernetes Executor: This executor runs tasks within a Kubernetes cluster. It allows for scaling the number of worker nodes up or down based on task demands.&lt;/li&gt;
&lt;li&gt;Dask Executor: This executor runs tasks concurrently on a separate worker machine or a group of worker machines. It uses the Dask distributed task scheduler to manage the execution of tasks.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Worker: the process or subprocess that actually executes the task&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;
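&lt;p&gt;The executor is selected in &lt;code&gt;airflow.cfg&lt;/code&gt; (or via the &lt;code&gt;AIRFLOW__CORE__EXECUTOR&lt;/code&gt; environment variable). A minimal fragment, assuming a default install:&lt;/p&gt;

```ini
[core]
# Default; runs one task at a time on the scheduler machine.
executor = SequentialExecutor
# Alternatives: LocalExecutor, CeleryExecutor, KubernetesExecutor
```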




&lt;h2&gt;
  
  
  DAGs and Operators:
&lt;/h2&gt;

&lt;p&gt;When you start learning Airflow, you hear about DAGs a lot; everyone talks about DAGs like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz2vhbthhni55f3oce5d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz2vhbthhni55f3oce5d.png" alt="Airflow - the basics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DAG is an abbreviation for “Directed Acyclic Graph”. It is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. DAGs define how tasks are executed, what their dependencies are, and what the order of execution should be.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid4hsm03dqyu5xczwp23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid4hsm03dqyu5xczwp23.png" alt="Airflow - the graph"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;They are written in Python and can be scheduled to run at a specific interval or triggered by an external event. A &lt;code&gt;DAG has 2 main components&lt;/code&gt;: &lt;code&gt;Tasks&lt;/code&gt; and &lt;code&gt;Operators&lt;/code&gt;. Operators define what each task does, while the dependencies between tasks specify the order in which they should be executed. For example, a dependency can specify that one task should run only after another task has completed successfully. Here's an example of a simple DAG file in Apache Airflow that defines a single task using the bash operator:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw49wa9kc0vnwox4qxhvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw49wa9kc0vnwox4qxhvf.png" alt="Airflow - DAG and TASK"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can divide operators into 3 types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;action operators&lt;/code&gt;: the ones that execute bash commands or Python functions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;transfer operators&lt;/code&gt;: allow you to transfer data between systems; for example, the SftpOperator, which is used to transfer files over the SFTP protocol&lt;/li&gt;
&lt;li&gt; &lt;code&gt;sensor operators&lt;/code&gt;: used to check a criterion (a condition or state) before executing the next task(s), hence the word sense (wait/perceive). &lt;code&gt;AzureBlobStorageSensor&lt;/code&gt;, for instance, checks for the existence of a specific blob in an Azure Blob Storage container and waits until it appears or disappears.&lt;/li&gt;
&lt;/ul&gt;
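&lt;p&gt;Under the hood, a sensor is essentially a poke loop: check a condition, sleep, repeat until success or timeout. A simplified plain-Python sketch of that idea (not Airflow's actual implementation):&lt;/p&gt;

```python
import time

def wait_for(condition, poke_interval=1.0, timeout=10.0):
    """Poll `condition` until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True   # criterion met: downstream tasks may proceed
        time.sleep(poke_interval)
    return False          # timed out: the sensor fails

# Example: a condition that becomes true on the third check.
calls = {"n": 0}
def ready():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_for(ready, poke_interval=0.01))  # True
```

&lt;p&gt;Real Airflow sensors expose the check as a &lt;code&gt;poke&lt;/code&gt; method and let you configure the poke interval and timeout per task.&lt;/p&gt;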

&lt;p&gt;In addition, Airflow has the following concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task instance: what a task is called once it is executed (a specific run of a task)&lt;/li&gt;
&lt;li&gt;Workflow: the combination of DAGs, operators, tasks, and their dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;Important notes:&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Airflow is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a data streaming solution: if you need to process data every second, Airflow is not the right choice&lt;/li&gt;
&lt;li&gt;a data processing framework: if you have terabytes of data, go with Spark or another solution optimized for such workloads; if you push Airflow to do the processing itself, you may end up with out-of-memory errors. Still, you can use the SparkSubmitOperator to trigger a Spark job outside Airflow.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Airflow's single-node architecture vs multi-node architecture
&lt;/h2&gt;

&lt;p&gt;When starting with Airflow, you are probably using a single machine; this is called, in Airflow terms, a single-node architecture, in which the Airflow components interact as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w9hwwm03xoho15v0l4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w9hwwm03xoho15v0l4x.png" alt="Airflow - single node"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These components communicate with the help of the Metastore. The queue inside the executor may not be the best architecture, but it is suited to the single-node setup, dev environments, and a limited number of tasks.&lt;/p&gt;

&lt;p&gt;The following is how pipelines are executed in the single-node architecture (it also applies to the multi-node architecture):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You have a DAGs folder, for example &lt;em&gt;dags-folder&lt;/em&gt;, where the data pipeline code is stored.&lt;/li&gt;
&lt;li&gt;Both the webserver service and the scheduler parse &lt;em&gt;dags-folder&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;When it's time for a DAG to be executed, the scheduler creates a DagRun object (an instantiation of the DAG file) in the Metastore.&lt;/li&gt;
&lt;li&gt;When the DagRun state becomes Ready, the scheduler creates a TaskInstance object.&lt;/li&gt;
&lt;li&gt;The scheduler sends the TaskInstance to the executor.&lt;/li&gt;
&lt;li&gt;The executor runs the TaskInstance and updates its status in the Metastore.&lt;/li&gt;
&lt;li&gt;Once the TaskInstance is done, the executor updates its state.&lt;/li&gt;
&lt;li&gt;The scheduler checks the DagRun status; once done, the webserver updates the status in the UI.&lt;/li&gt;
&lt;/ol&gt;
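&lt;p&gt;The steps above boil down to a small state machine around the Metastore. A conceptual sketch, heavily simplified and not Airflow's real internals:&lt;/p&gt;

```python
from enum import Enum

# Conceptual sketch of the TaskInstance status handoff described above --
# simplified, NOT Airflow's actual scheduler/executor code.
class State(Enum):
    SCHEDULED = "scheduled"
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"

def run_task_instance(metastore, task_id):
    """Record each status transition, as the scheduler and executor would."""
    for state in (State.SCHEDULED,  # scheduler creates the TaskInstance
                  State.QUEUED,     # scheduler hands it to the executor
                  State.RUNNING,    # executor picks it up
                  State.SUCCESS):   # executor reports completion
        metastore.setdefault(task_id, []).append(state)
    return metastore[task_id][-1]

metastore = {}
print(run_task_instance(metastore, "extract"))  # State.SUCCESS
```

&lt;p&gt;The point is that every component reads and writes shared state in the Metastore rather than talking to the others directly, which is what makes swapping executors possible.&lt;/p&gt;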

&lt;p&gt;To &lt;code&gt;execute as many tasks as you want&lt;/code&gt;, you should use the &lt;code&gt;Multi-Node Architecture&lt;/code&gt; (a.k.a. the Celery setup), where the queue is an external third-party service like RabbitMQ or Redis. With Celery, you can have many tasks running, spread across different nodes (workers). The multi-node architecture looks like the figure below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0opkyeot76at6f2fny6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0opkyeot76at6f2fny6.png" alt="Airflow - multi node"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Airflow Setup
&lt;/h2&gt;

&lt;p&gt;You can install Airflow with pip/pip3 using the following command: &lt;code&gt;pip3 install apache-airflow==version --constraint path-to-constraints&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--constraint&lt;/code&gt; or &lt;code&gt;-c&lt;/code&gt; flag specifies the path to a file that contains version constraints for the packages being installed. This constraints file is a plain text file listing package names and the version number or version range allowed for each package, so that those pinned versions are installed rather than the latest versions available.&lt;/p&gt;

&lt;p&gt;To initialize the metastore, run &lt;code&gt;airflow db init&lt;/code&gt;; this command also creates some additional folders and files (logs, configuration files…). By default, &lt;code&gt;if you don’t specify another database&lt;/code&gt; to use, Airflow will create a &lt;code&gt;SQLite database&lt;/code&gt; named airflow.db.&lt;br&gt;
To start the webserver, run &lt;code&gt;airflow webserver&lt;/code&gt; and visit &lt;code&gt;localhost:8080&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;In Airflow, no user is created by default&lt;/code&gt;; you have to create one manually from the CLI. To create a user, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="nt"&gt;--------------------------------------------------------------&lt;/span&gt;
airflow &lt;span class="nb"&gt;users &lt;/span&gt;create &lt;span class="nt"&gt;-u&lt;/span&gt; admin &lt;span class="nt"&gt;-p&lt;/span&gt; admin &lt;span class="nt"&gt;-f&lt;/span&gt; Ali &lt;span class="nt"&gt;-l&lt;/span&gt; Khyar &lt;span class="nt"&gt;-r&lt;/span&gt; Admin &lt;span class="nt"&gt;-e&lt;/span&gt; admin@airflow.com
    &lt;span class="nt"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Airflow UI Views:
&lt;/h2&gt;

&lt;p&gt;Workflow visualization is crucial for understanding and managing workflows. The following are five views that can be used to visualize Airflow's workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Tree View&lt;/code&gt;: The Tree View is a hierarchical view of all the tasks within a DAG. It shows all the tasks and their dependencies in a tree structure. This view is useful for understanding the overall structure of a workflow and for identifying failed or skipped tasks.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Graph View&lt;/code&gt;: The Graph View displays a DAG and its tasks in a graphical view. This view is useful for visualizing the structure of a workflow and identifying dependencies between tasks. Users can zoom in and out and rearrange tasks to get a better understanding of the workflow.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Gantt Chart&lt;/code&gt; View: The Gantt Chart View displays the tasks and their dependencies in a timeline. This view is useful for identifying the start and end times of tasks and how they relate to each other.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Task Instance Details View&lt;/code&gt;: The Task Instance Details View displays detailed information about a specific task, including its start and end time, duration, and status. Users can also view logs for the task, which can help with debugging and troubleshooting.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Code View&lt;/code&gt;: The Code View displays the code that defines a DAG. This view is useful for understanding how a workflow is defined and for making changes to the code.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  DAGs in action:
&lt;/h2&gt;

&lt;p&gt;As we already said, a DAG represents a data pipeline, which consists of tasks (nodes) and dependencies (edges) between them; tasks are created using operators. There are many types of operators available in Airflow, including the PythonOperator, BashOperator, and SqliteOperator.&lt;/p&gt;

&lt;p&gt;Each operator represents a specific task in the pipeline. For example, let's say we have a data pipeline that involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;extracting user data from an API&lt;/li&gt;
&lt;li&gt;processing it using Python functions&lt;/li&gt;
&lt;li&gt;storing it in a SQLite database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We could create the following tasks using Airflow operators:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SqliteOperator: create the table&lt;/li&gt;
&lt;li&gt;HttpSensor: check if API is available&lt;/li&gt;
&lt;li&gt;PythonOperator: extract user data&lt;/li&gt;
&lt;li&gt;PythonOperator: process user data&lt;/li&gt;
&lt;li&gt;BashOperator: store user data in SQLite database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We would then define a DAG folder to store our DAGs, and create a DAG file that specifies the order in which these tasks should be executed. It's important to note that &lt;code&gt;combining cleaning and processing data into one Airflow operator is not a best practice&lt;/code&gt;, as it can lead to issues in the pipeline. Instead, &lt;code&gt;each task should be its own operator&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Code example of the above scenario:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.sqlite_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SQLiteOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.http_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SimpleHttpOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.bash_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;

&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;airflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;depends_on_past&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email_on_failure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email_on_retry&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retry_delay&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# creating an example dag 
&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;example_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A DAG to demonstrate the use of Airflow operators&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# task to create SQLite table
&lt;/span&gt;&lt;span class="n"&gt;create_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SQLiteOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;create_table&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT NOT NULL)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my_db&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# task to check if API is available
&lt;/span&gt;&lt;span class="n"&gt;check_api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleHttpOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;check_api&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;api/health&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GET&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;http_conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my_api&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# task to extract user data
&lt;/span&gt;&lt;span class="n"&gt;extract_user_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extract_user_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_extraction_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;op_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;param1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;param2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# task to process user data
&lt;/span&gt;&lt;span class="n"&gt;process_user_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;process_user_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_processing_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;op_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;param1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;param2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# task to store user data in SQLite database
&lt;/span&gt;&lt;span class="n"&gt;store_user_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;store_user_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python /path/to/my_script.py --arg1 value1 --arg2 value2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# define task dependencies
&lt;/span&gt;&lt;span class="n"&gt;create_table&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;check_api&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;extract_user_data&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;process_user_data&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;store_user_data&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To test our DAG, we can use the &lt;code&gt;airflow tasks test&lt;/code&gt; command, which allows us to &lt;code&gt;test individual tasks&lt;/code&gt; within the DAG.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;To share data between tasks, we can use the XCom mechanism&lt;/code&gt;. XComs let us pass data between tasks by creating key-value pairs in the metastore.&lt;br&gt;
For example, the &lt;em&gt;extract user data&lt;/em&gt; task could create an XCom containing the extracted user data as a JSON object, which the &lt;em&gt;process user data&lt;/em&gt; task could then retrieve using the XCom API. We will see more about this later in this blog.&lt;/p&gt;
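As a sketch of that pattern (the task ids, keys, and sample data below are illustrative, not from the pipeline above), two python_callables can push and pull an XCom through the `ti` (TaskInstance) object that Airflow injects into the task context:

```python
# Hedged sketch: sharing data between tasks via XComs.
# `ti` is the TaskInstance Airflow passes into the task context;
# task ids, keys, and the sample record are illustrative.

def extract_user_data(ti, **kwargs):
    # Pretend this came from the API; push it to the metastore as an XCom
    users = [{"id": 1, "name": "ali", "email": "ali@example.com"}]
    ti.xcom_push(key="users", value=users)

def process_user_data(ti, **kwargs):
    # Pull the XCom written by the upstream task
    users = ti.xcom_pull(task_ids="extract_user_data", key="users")
    return [u["name"].upper() for u in users]
```

Note that the return value of a PythonOperator's callable is also pushed automatically as an XCom under the key `return_value`.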



&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  DAGS Scheduling:
&lt;/h2&gt;

&lt;p&gt;One of the key features of Airflow is its ability to schedule tasks based on a variety of criteria. You can define a task's start date and its schedule interval:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The start date determines when the task should begin running.&lt;/li&gt;
&lt;li&gt;The schedule interval determines how often the task should be executed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, let's say we have a task that needs to run every 10 minutes, starting on January 1, 2020 at 10:00 AM. We can define such a task as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;

&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;airflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;depends_on_past&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retry_delay&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dummy_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dummy_task&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dummy_task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dummy_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dummy_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code creates &lt;em&gt;dummy_dag&lt;/em&gt; with a start date of January 1, 2020 at 10:00 AM and a schedule interval of 10 minutes. We've also defined a PythonOperator called &lt;em&gt;dummy_task&lt;/em&gt; that will be executed every 10 minutes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;One thing to note&lt;/code&gt;&lt;/strong&gt;: the task won't start executing immediately at 10:00 AM. Instead, Airflow waits until the first schedule interval has elapsed before triggering the task. In this case, that means the task will be triggered at 10:10 AM on January 1, 2020. This is referred to as the &lt;code&gt;execution date&lt;/code&gt; in Airflow.&lt;/p&gt;
&lt;/blockquote&gt;
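The arithmetic behind that note can be sketched directly with `datetime` (a minimal illustration, not Airflow internals):

```python
from datetime import datetime, timedelta

start_date = datetime(2020, 1, 1, 10, 0)
schedule_interval = timedelta(minutes=10)

# The run covering the first interval fires only once that interval ends:
execution_date = start_date                      # the interval the run covers
first_trigger = start_date + schedule_interval   # when Airflow actually fires

print(first_trigger)  # 2020-01-01 10:10:00
```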



&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Backfilling And CatchUp
&lt;/h2&gt;

&lt;p&gt;Two super important concepts in DAGs are backfilling and catchup.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Backfilling&lt;/code&gt; is the process of running past instances of a DAG that were missed, either because the schedule was not set up at the time or because the DAG was paused. It can be achieved by setting the &lt;code&gt;start_date&lt;/code&gt; parameter of the DAG and using the &lt;code&gt;airflow backfill&lt;/code&gt; command. &lt;em&gt;This command can be used to manually trigger a DAG run for all instances between the start_date and the current date&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Catchup&lt;/code&gt; is a feature in Airflow that allows a DAG to process all missed DAG runs during a period when the DAG was inactive, either because the DAG was paused or the scheduler wasn't running. &lt;code&gt;By default, catchup is set to True&lt;/code&gt;, which means the scheduler will process any missed DAG runs when the DAG is restarted. This ensures that all historical data is processed and accounted for.&lt;br&gt;
Here's &lt;code&gt;an example to illustrate the use of backfilling and catchup&lt;/code&gt; in Airflow: &lt;/p&gt;

&lt;p&gt;Let's say we have a DAG &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scheduled to run @daily&lt;/li&gt;
&lt;li&gt;with a start date of January 1, 2023&lt;/li&gt;
&lt;li&gt;The DAG is paused for 3 days&lt;/li&gt;
&lt;li&gt;then restarted on January 5, 2023.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;code&gt;Since catchup is set to True by default&lt;/code&gt;, Airflow will automatically run DAG instances for January 2, 3, and 4, in addition to the January 5 instance.&lt;/em&gt;&lt;/p&gt;
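To see which runs catchup would create, we can enumerate the missed daily intervals ourselves (a plain-Python sketch of the scheduler's bookkeeping, using the dates from the example above):

```python
from datetime import date, timedelta

start = date(2023, 1, 1)    # DAG start date
resumed = date(2023, 1, 5)  # day the DAG is unpaused

# With catchup=True, Airflow schedules one run per missed daily interval
missed = []
d = start + timedelta(days=1)
while d <= resumed:
    missed.append(d)
    d += timedelta(days=1)

print(missed)  # runs for January 2, 3, 4, and 5
```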

&lt;p&gt;To deactivate catchup, set it to False in the DAG instantiation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;....)&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;: all dates in Airflow are in UTC, so don't get confused if DAGs are not executed in your local timezone.&lt;br&gt;
You can change this in airflow.cfg via the &lt;code&gt;default_timezone&lt;/code&gt; setting,&lt;br&gt;
but that's not recommended; keep everything in UTC.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Databases and Executors:
&lt;/h2&gt;

&lt;p&gt;Executors in Airflow define how many tasks you can execute in parallel.&lt;/p&gt;

&lt;p&gt;It's important to understand the order in which tasks will be executed. Specifically, if you have two tasks that are dependent on each other, which one will be executed first?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gs42lz29x9ufz0s01ky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gs42lz29x9ufz0s01ky.png" alt="Airflow - Tasks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the example above, we have a DAG with four tasks: task1, task2, task3, and task4. Task1 is the first task in the sequence, but which of the next two tasks - task2 or task3 - will be executed first?&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;SequentialExecutor:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The answer to the previous question is that they will be executed sequentially, one after the other. Even though the &lt;em&gt;bit shift&lt;/em&gt; operator (&amp;gt;&amp;gt;) only defines the dependency on task1, the SequentialExecutor runs a single task at a time, so task2 will be executed first, followed by task3. This &lt;code&gt;sequential execution is useful for debugging&lt;/code&gt;, as it allows you to see the output of each task before moving on to the next one. To configure your DAG for sequential execution, set the &lt;code&gt;executor&lt;/code&gt; parameter to &lt;code&gt;SequentialExecutor&lt;/code&gt; in your Airflow configuration file; this executor is the default if the configuration file is untouched. You'll also need to specify a &lt;code&gt;sql_alchemy_conn&lt;/code&gt; parameter, which tells Airflow where to store the metadata for your DAGs.&lt;/p&gt;

&lt;p&gt;You can discover the values for these parameters by running the following commands in your terminal (under pipenv):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    &lt;span class="nt"&gt;--------------------------------------------------------------&lt;/span&gt;
airflow config get-value core sql_alchemy_conn
airflow config get-value core executor
    &lt;span class="nt"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;p&gt;&lt;strong&gt;&lt;code&gt;It's worth noting&lt;/code&gt;&lt;/strong&gt; that &lt;code&gt;if you're using a SQLite&lt;/code&gt; database to store your DAG metadata, &lt;code&gt;you won't be able to run multiple write operations at the same time&lt;/code&gt;. This means that if you have multiple tasks that are trying to write to the database simultaneously, you may run into issues. If you anticipate a high volume of write operations, you may want to consider using a different database backend that can handle concurrent writes more effectively.&lt;/p&gt;
&lt;h3&gt;
  
  
&lt;strong&gt;LocalExecutor:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As you can see, the &lt;code&gt;SequentialExecutor is not that useful if you want to run multiple tasks in parallel on a single machine&lt;/code&gt;. Here comes the &lt;code&gt;LocalExecutor&lt;/code&gt;, which helps increase the efficiency of workflows and reduce overall execution time.&lt;/p&gt;

&lt;p&gt;To change the executor to LocalExecutor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, you should have a PostgreSQL database.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Install the necessary packages by running the command: &lt;code&gt;pip install 'apache-airflow[postgres]'&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Update the Airflow configuration file (airflow.cfg) by changing the &lt;code&gt;sql_alchemy_conn&lt;/code&gt; parameter &lt;code&gt;in the [core] section&lt;/code&gt; to the Postgres connection string&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Verify that the database is set up correctly by running the command &lt;code&gt;airflow db check&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Change the executor to LocalExecutor by updating the &lt;code&gt;executor parameter in the [core]&lt;/code&gt; section of airflow.cfg.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Initialize the database by running the command &lt;code&gt;airflow db init&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Create a user account by running the command &lt;code&gt;airflow users create --username admin --password admin --role admin --firstname ali --lastname khyar --email admin@airflow.com&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start the Airflow webserver and scheduler by running the commands &lt;code&gt;airflow webserver&lt;/code&gt; and &lt;code&gt;airflow scheduler&lt;/code&gt;, respectively.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Run your DAG&lt;/code&gt; and &lt;code&gt;check the Gantt view&lt;/code&gt; to see parallel execution in action: &lt;code&gt;with the LocalExecutor, task2 and task3 run at the same time (as subprocesses)&lt;/code&gt; on a single machine.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The above steps will allow us to use the LocalExecutor instead of the SequentialExecutor, hence running tasks in parallel and improving execution time.&lt;/p&gt;
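Putting the steps together, the relevant airflow.cfg entries end up looking roughly like this (the connection string is a placeholder for your own Postgres host and credentials):

```ini
[core]
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow_db
```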

&lt;p&gt;OK, the LocalExecutor is nice: it allows us to run tasks in parallel on a single machine. &lt;code&gt;But what if our single machine runs out of resources? How can we scale Airflow?&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzakuew214ji40jb4xk0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzakuew214ji40jb4xk0.png" alt="Airflow - data pipeline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here come the KubernetesExecutor and the CeleryExecutor to the rescue.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;CeleryExecutor:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The CeleryExecutor allows Airflow to scale out its worker nodes, using the distributed task system provided by Celery to spread execution among multiple machines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpzpmo9s6qm5gr8hn5mf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpzpmo9s6qm5gr8hn5mf.png" alt="Airflow - celery"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To configure CeleryExecutor, we follow the below steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Install the necessary packages by running the command &lt;code&gt;pip install 'apache-airflow[celery]'&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Install Redis by running the command &lt;code&gt;sudo apt update &amp;amp;&amp;amp; sudo apt install redis-server&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Modify the Redis configuration file (&lt;code&gt;sudo nano /etc/redis/redis.conf&lt;/code&gt;) by adding the following lines:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;-------------------------&lt;/span&gt;
supervised systemd
&lt;span class="nt"&gt;-------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Restart the Redis service by running the commands &lt;code&gt;sudo systemctl restart redis.service&lt;/code&gt; and &lt;code&gt;sudo systemctl status redis.service&lt;/code&gt; to ensure that it is running correctly.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;In the Airflow configuration file (airflow.cfg), change the executor to &lt;code&gt;CeleryExecutor&lt;/code&gt;, &lt;code&gt;update the broker_url parameter&lt;/code&gt; to &lt;code&gt;redis://localhost:6379/0&lt;/code&gt; (where 0 is the Redis database number), and set the &lt;code&gt;result_backend&lt;/code&gt; parameter to &lt;code&gt;the same value as sql_alchemy_conn&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;To interact with Redis from Airflow, install the apache-airflow[redis] package: &lt;code&gt;pip install 'apache-airflow[redis]'&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;
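Collected in one place, the airflow.cfg changes from the steps above look roughly like this (connection strings are placeholders, and section names can vary slightly between Airflow versions):

```ini
[core]
executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow_db

[celery]
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow_user:airflow_pass@localhost:5432/airflow_db
```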

&lt;h3&gt;
  
  
  &lt;strong&gt;Celery  parameters (Good to Know):&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In order to optimize task execution with the CeleryExecutor, you can adjust the parameters below in the airflow.cfg file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;parallelism&lt;/code&gt;: This parameter specifies the maximum number of tasks that can be executed concurrently across the entire Airflow installation. The default value is 32; &lt;code&gt;if you set it to 1, Airflow will behave like a sequential executor&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;dag_concurrency&lt;/code&gt;: This parameter limits the maximum number of tasks that can be executed concurrently for a specific DAG. The default is 16, but &lt;code&gt;it can be overridden on a DAG level by setting the concurrency parameter&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;max_active_runs_per_dag&lt;/code&gt;: This parameter limits the maximum number of DAG runs that can be executed concurrently for a specific DAG. The default value is 16, but it &lt;code&gt;can be overridden on a DAG level by setting the max_active_runs parameter&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
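In airflow.cfg these settings sit under the `[core]` section; the fragment below simply restates the defaults described above:

```ini
[core]
parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 16
```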

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;Note:&lt;/code&gt;&lt;/strong&gt; the &lt;code&gt;priority&lt;/code&gt; of these parameters &lt;code&gt;is parallelism &amp;gt; dag_concurrency&lt;/code&gt;. If you set parallelism to a low value, it limits the number of tasks that can run concurrently across the entire Airflow installation, regardless of the value of dag_concurrency. If parallelism is set high enough, the maximum number of tasks that can run concurrently for a specific DAG is instead limited by dag_concurrency.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Grouping tasks:
&lt;/h2&gt;

&lt;p&gt;Sometimes tasks within a DAG can be hard to manage if there are many of them or if complex processing is involved, so you need to group tasks. You can either go with SubDAGs (not recommended, but good to know) or with TaskGroups.&lt;br&gt;
The idea is to move from something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw580c59e7cf4shvz4hwy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw580c59e7cf4shvz4hwy.png" alt="Airflow - from"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;to something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6iap70hp7j6vfm6o5zg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6iap70hp7j6vfm6o5zg.png" alt="Airflow - to"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;SubDAGs:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A SubDAG allows you to bundle related tasks within a DAG into a manageable DAG (a DAG within a DAG).&lt;br&gt;
You create SubDAGs by writing a function that returns a DAG object (encapsulating the tasks). Here's an example of a SubDAG named &lt;em&gt;subdag_task&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.bash_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.subdag_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SubDagOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;subdag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;task_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subdag_task_1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SubDAG task 1&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;task_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subdag_task_2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SubDAG task 2&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;task_1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;task_2&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;

&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;airflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parent_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;task_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parent_task_1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Parent DAG task 1&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;subdag_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SubDagOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subdag_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;subdag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;subdag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parent_dag.subdag_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;task_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parent_task_2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Parent DAG task 2&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;task_1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;subdag_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;task_2&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SubDAGs seem like a cool feature, but they have their dark side, like everything in life. Even if you change Airflow's executor, tasks within a SubDAG run with the SequentialExecutor by default, which slows down the total execution time. You may also run into deadlocks (DAGs waiting for each other to complete, causing a circular dependency that cannot be resolved).&lt;/p&gt;

&lt;p&gt;Hence, TaskGroups were introduced in Airflow 2.0.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;TaskGroups:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;TaskGroups allow you to group related tasks and manage them more easily. They are defined with the &lt;code&gt;TaskGroup&lt;/code&gt; class as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.utils.task_group&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TaskGroup&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;task_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;TaskGroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;group_1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;group_1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;task_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="n"&gt;task_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;task_4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you can set dependencies the usual way, either with the bitshift operator (&amp;gt;&amp;gt;) or with &lt;code&gt;set_upstream&lt;/code&gt;; for the example above you can use:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;task_4.set_upstream(group_1)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can go as deep as you want with TaskGroups and do things such as nesting groups, like in the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.utils.task_group&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TaskGroup&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;task_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;TaskGroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;group_1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;group_1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;task_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;TaskGroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subgroup_1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;subgroup_1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;task_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
            &lt;span class="n"&gt;task_4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="n"&gt;task_5&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;task_6&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Grouping tasks with TaskGroup makes your DAG code cleaner, more manageable, and easier to read. Powerful, innit?&lt;/p&gt;
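&lt;p&gt;One side effect worth knowing: by default, tasks inside a TaskGroup get their task_ids prefixed with the group id, and nested group ids are joined with dots. The helper below is a toy sketch of that naming convention (the &lt;code&gt;prefixed_task_id&lt;/code&gt; function is hypothetical, for illustration only; it is not part of Airflow's API):&lt;/p&gt;

```python
def prefixed_task_id(group_ids, task_id):
    # Toy sketch (not Airflow API): mimic default TaskGroup naming,
    # where enclosing group ids are joined with dots before the task id.
    return '.'.join(list(group_ids) + [task_id])

# task_2 inside group_1 from the first example above:
print(prefixed_task_id(['group_1'], 'task_2'))
# task_3 inside subgroup_1 nested in group_1 from the nested example:
print(prefixed_task_id(['group_1', 'subgroup_1'], 'task_3'))
```

&lt;p&gt;These prefixed ids are what you see in the Graph view and what you must return from branching callables when targeting a task inside a group.&lt;/p&gt;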

&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Sharing data with XComs:
&lt;/h2&gt;

&lt;p&gt;We came across XComs earlier without going into detail. XComs in Airflow are a way of exchanging data between tasks: they are essentially key-value pairs with a timestamp, used through push and pull operations. Suppose we have a task that downloads files from a storage account, and we need to pass the list of downloaded files to a downstream task. Here's how the push operation can be done from a task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="n"&gt;file_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file1.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file2.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file3.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;task_instance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_instance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;task_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file_list&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and a pull operation in another task can be done as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="n"&gt;task_instance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_instance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;file_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file_list&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;file_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# download the file
&lt;/span&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
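&lt;p&gt;To make the push/pull mechanics concrete outside a running Airflow deployment, here is a minimal sketch that mimics XCom semantics with a plain dictionary (the &lt;code&gt;XComStore&lt;/code&gt; class is hypothetical, for illustration only; real XComs are persisted in Airflow's metadata database):&lt;/p&gt;

```python
from datetime import datetime, timezone

class XComStore:
    """Toy stand-in for Airflow's XCom table: key-value pairs plus a timestamp."""
    def __init__(self):
        self._data = {}

    def xcom_push(self, key, value):
        # Each entry records the value and the moment it was pushed.
        self._data[key] = (value, datetime.now(timezone.utc))

    def xcom_pull(self, key):
        value, _timestamp = self._data[key]
        return value

# The upstream task pushes the list of downloaded files...
store = XComStore()
store.xcom_push(key='file_list', value=['file1.txt', 'file2.txt', 'file3.txt'])

# ...and the downstream task pulls it back and processes each file.
for file in store.xcom_pull(key='file_list'):
    print(file)
```

&lt;p&gt;Keep in mind that real XComs go through the metadata database, so they are meant for small pieces of data (ids, paths, short lists), not for passing large files between tasks.&lt;/p&gt;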



&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tasks conditioning:
&lt;/h2&gt;

&lt;p&gt;I don't know if this is officially called task conditioning XD, but the idea is to choose which downstream task to execute based on an upstream value pushed to XComs. Such branching can be done using &lt;code&gt;BranchPythonOperator&lt;/code&gt; to define task execution rules. Here's an example where the &lt;code&gt;choose_next_task&lt;/code&gt; function returns the id of the next task to execute, based on the XCom value of &lt;code&gt;data_type&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;choose_next_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_instance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;data_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;branching_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BranchPythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;branching_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;choose_next_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;provide_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;task_A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data type A processed&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;task_B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data type B processed&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;task_C&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data type not recognized&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;branching_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task_A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_C&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
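&lt;p&gt;The branching decision itself is plain Python, so you can exercise it without Airflow. Below is a sketch that replays the same logic against a fake context (the &lt;code&gt;FakeTaskInstance&lt;/code&gt; class is an assumption for illustration, standing in for the real task instance Airflow injects into the context):&lt;/p&gt;

```python
class FakeTaskInstance:
    """Minimal stand-in for the task_instance object Airflow puts in the context."""
    def __init__(self, xcoms):
        self._xcoms = xcoms

    def xcom_pull(self, key):
        return self._xcoms.get(key)

def choose_next_task(**context):
    # Same branching logic as in the DAG above: route on the pushed data_type.
    data_type = context['task_instance'].xcom_pull(key='data_type')
    if data_type == 'A':
        return 'task_A'
    elif data_type == 'B':
        return 'task_B'
    else:
        return 'task_C'

# Replay the three branches: 'A' and 'B' are routed directly,
# anything else falls through to task_C.
for pushed, expected in [('A', 'task_A'), ('B', 'task_B'), ('Z', 'task_C')]:
    ti = FakeTaskInstance({'data_type': pushed})
    assert choose_next_task(task_instance=ti) == expected
```

&lt;p&gt;The string returned by the callable must match the task_id of an existing downstream task; the tasks that are not chosen get skipped rather than failed.&lt;/p&gt;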



&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Trigger rules:
&lt;/h2&gt;

&lt;p&gt;Sometimes we don't need all upstream tasks to succeed before running a downstream task, or we may want to react as soon as one of them fails. This can be done through trigger rules, which let you run a downstream task based on the final execution status of its upstream tasks. The available trigger rules include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;all_success: (default) all parents have succeeded&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;all_failed: all parents are in a failed or upstream_failed state&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;all_done: all parents are done with their execution&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;one_failed: fires as soon as at least one parent has failed, it does not wait for all parents to be done&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;one_success: fires as soon as at least one parent succeeds, it does not wait for all parents to be done&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;none_failed: all parents have not failed (failed or upstream_failed) i.e. all parents have succeeded or been skipped&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;none_skipped: no parent is in a skipped state, i.e. all parents are in a success, failed, or upstream_failed state&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;dummy: dependencies are just for show, trigger at will&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
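&lt;p&gt;Each rule is just a predicate over the states of the parent tasks. The sketch below expresses a few of them in plain Python (the &lt;code&gt;fires&lt;/code&gt; helper is hypothetical, for illustration only; it is not Airflow's scheduler logic):&lt;/p&gt;

```python
def fires(rule, parent_states):
    """Toy evaluation (not Airflow API) of trigger rules over parent task states."""
    if rule == 'all_success':
        return all(s == 'success' for s in parent_states)
    if rule == 'all_failed':
        return all(s in ('failed', 'upstream_failed') for s in parent_states)
    if rule == 'all_done':
        done = {'success', 'failed', 'upstream_failed', 'skipped'}
        return all(s in done for s in parent_states)
    if rule == 'one_failed':
        return any(s == 'failed' for s in parent_states)
    if rule == 'one_success':
        return any(s == 'success' for s in parent_states)
    if rule == 'none_failed':
        return all(s not in ('failed', 'upstream_failed') for s in parent_states)
    if rule == 'none_skipped':
        return all(s != 'skipped' for s in parent_states)
    raise ValueError(f'unknown rule: {rule}')

assert fires('all_success', ['success', 'success'])      # default behavior
assert fires('one_failed', ['success', 'failed'])        # reacts to the first failure
assert not fires('none_skipped', ['success', 'skipped']) # a skip blocks this rule
```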

&lt;p&gt;Below is an example where the downstream tasks fire as soon as any upstream task fails; this is achieved using &lt;code&gt;trigger_rule='one_failed'&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.bash_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;

&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;airflow&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;depends_on_past&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;example_alerting&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;task_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello World from Task 1&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;trigger_rule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;one_failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;task_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello World from Task 2&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;trigger_rule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;one_failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;task_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello World from Task 3&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;trigger_rule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;one_failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;task_4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;echo &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello World from Task 4&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;trigger_rule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;one_failed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;task_1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task_2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;task_4&lt;/span&gt;
    &lt;span class="o"&gt;--------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion:
&lt;/h3&gt;

&lt;p&gt;Apache Airflow is an awesome tool for creating and managing data pipelines; thanks to AkumenIA for introducing such a great tool to me. In the next blog about Airflow, we are going to see how to use it in the cloud with Kubernetes (AKS), and how to monitor its cluster using the ELK stack. Thank you for reading.&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>apacheairflow</category>
    </item>
    <item>
      <title>Terraform 101 - Part 3/3: Modules, Built-in Functions, Type Constraints, and Dynamic Blocks | By Ali KHYAR</title>
      <dc:creator>Ali KHYAR</dc:creator>
      <pubDate>Sat, 08 Oct 2022 02:32:09 +0000</pubDate>
      <link>https://dev.to/alikhyar/terraform-101-part-33-modules-built-in-functions-type-constraints-and-dynamic-blocks-by-ali-khyar-58i3</link>
      <guid>https://dev.to/alikhyar/terraform-101-part-33-modules-built-in-functions-type-constraints-and-dynamic-blocks-by-ali-khyar-58i3</guid>
      <description>&lt;p&gt;Part 1/3: History, Workflow, and Resource Addressing: &lt;a href="https://dev.to/alikhyar/terraform-101-part-13-history-workflow-and-resource-addressing-by-ali-khyar-4m23"&gt;link here&lt;/a&gt;&lt;br&gt;
Part 2/3: State, Variables, Outputs, and provision: &lt;a href="https://dev.to/alikhyar/terraform-101-part-23-state-variables-outputs-and-provisioners-by-ali-khyar-lno"&gt;link here&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Modules:
&lt;/h2&gt;

&lt;p&gt;Simply put, a module is a container for multiple resources that are used together; modules help you avoid reinventing the wheel. Modules can take inputs (optional) and return outputs (optional).&lt;br&gt;
One module you have already interacted with is the root module, which embodies the code files from the main working directory. When modules are called from another one, the called modules are considered child modules.&lt;br&gt;
Modules can be downloaded and called from:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Terraform public registry: contains collections of publicly available modules, which get downloaded (when you reference them) into a hidden folder on your local system.&lt;/li&gt;
&lt;li&gt;Private registry: you will probably go with this for closed-source code or security reasons.&lt;/li&gt;
&lt;li&gt;Local system: when you have module folders saved on your local system, either in the configuration code folder or elsewhere, you reference them using an absolute or relative path.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's look at the snippet below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HCKYe6zM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c4nnu9jjg5xsoe2hog92.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HCKYe6zM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c4nnu9jjg5xsoe2hog92.png" alt="Image description" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Defining a module requires the reserved keyword &lt;code&gt;module&lt;/code&gt; followed by the module's name, which in the example above is vpc_module. The two main parameters that should be inside every module block are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;source: path to module folder.&lt;/li&gt;
&lt;li&gt;version: a best practice to always keep track of the module's version, so you can avoid any unwanted side effects when deploying/redeploying the resource.&lt;/li&gt;
&lt;/ul&gt;
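&lt;p&gt;As a minimal sketch of a module call (the module name, registry address, and version below are illustrative, not taken from the snippet above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "vpc_module" {
  # source: registry address (or path) of the module folder
  source  = "terraform-aws-modules/vpc/aws"

  # version: pin the module version to avoid unwanted side effects on redeploys
  version = "3.14.0"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;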

&lt;p&gt;Other allowed parameters in modules are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;built-in functions&lt;/strong&gt; like max, count, tolist, for_each, …, which we will discuss later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;providers:&lt;/strong&gt; which bind the module to a certain provider.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;depends_on:&lt;/strong&gt; sets up dependencies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As mentioned before, modules can optionally take inputs and return outputs. Outputs are defined in the module's output block and can be referenced as in the snippet below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3MQvfAQL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a560fkfav1jvh7dby4lm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3MQvfAQL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a560fkfav1jvh7dby4lm.png" alt="Image description" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at the snippet above, you should know that an output named &lt;strong&gt;subnet_id&lt;/strong&gt; is defined inside the &lt;strong&gt;vpc-module&lt;/strong&gt; module. When referencing module outputs, we always start with the &lt;code&gt;module&lt;/code&gt; keyword.&lt;br&gt;
In the snippet already seen:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9o5yx5-j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/arpnimytwbxs1x5nmat5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9o5yx5-j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/arpnimytwbxs1x5nmat5.png" alt="Image description" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;we have &lt;strong&gt;region&lt;/strong&gt;, which is considered an input for this module. It is arbitrarily named: we only define it in the module block so we can use it later inside the module's code with the syntax &lt;code&gt;var.region&lt;/code&gt;.&lt;/p&gt;
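&lt;p&gt;Putting inputs and outputs together, a hedged sketch (the resource and attribute names are illustrative) might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "vpc_module" {
  source = "./vpc-module"
  region = "us-east-1"   # input, read inside the module as var.region
}

resource "aws_instance" "web" {
  ami           = "ubuntu-focal-20.04-amd64-server"
  instance_type = "t3.micro"
  # reference a module output: module.MODULE_NAME.OUTPUT_NAME
  subnet_id     = module.vpc_module.subnet_id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;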

&lt;h2&gt;
  
  
  Built-In Functions:
&lt;/h2&gt;

&lt;p&gt;Built-in functions are expressions that allow you to take a value from somewhere and transform, evaluate, or convert it. Users cannot define custom functions, but the list of already defined functions is extensive.&lt;br&gt;
Calling built-in functions in Terraform is like calling functions in any programming language: &lt;strong&gt;funcName(arg1, arg2, … )&lt;/strong&gt;. Let's take a look at the &lt;strong&gt;join&lt;/strong&gt; function, which produces a string by concatenating all elements of a given list of strings with a given delimiter:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FHJBnu9G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tl7xvh4o2m52ntmob7os.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FHJBnu9G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tl7xvh4o2m52ntmob7os.png" alt="Image description" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The delimiter in the above snippet is a hyphen (-), and the elements between brackets are the strings to concatenate, which results in the string &lt;strong&gt;my-project-name-preprod&lt;/strong&gt;.&lt;br&gt;
Terraform happily provides a console to test things such as built-in functions and expressions; to access it, use the command &lt;code&gt;terraform console&lt;/code&gt;.&lt;/p&gt;
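&lt;p&gt;For example, you can try &lt;strong&gt;join&lt;/strong&gt; directly inside &lt;code&gt;terraform console&lt;/code&gt; (the list elements here are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; join("-", ["my", "project", "name", "preprod"])
"my-project-name-preprod"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;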

&lt;h2&gt;
  
  
  Type Constraints:
&lt;/h2&gt;

&lt;p&gt;So far, we have seen primitive types like string, number, and boolean, which control the type of given variable values.&lt;br&gt;
Another kind of type is the complex type, which is created by combining multiple types; examples of complex types are list, tuple, map, and object. Complex types themselves can be divided into two categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collection: multiple values of one primitive type grouped together in one variable, for example:
&lt;code&gt;list(type)
map(type)
set(type)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y1Beg5Yy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/anzyxwy28eq8fuhbtvii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y1Beg5Yy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/anzyxwy28eq8fuhbtvii.png" alt="Image description" width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structural: multiple values of different primitive types grouped together.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1ufZd2xy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8rspkvbfbq57yeg5ybkx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1ufZd2xy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8rspkvbfbq57yeg5ybkx.png" alt="Image description" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another constraint type is the &lt;code&gt;any&lt;/code&gt; constraint which serves as a placeholder for a primitive type not decided yet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K5_3UQSO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1xihzise4utbtpczk2f7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K5_3UQSO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1xihzise4utbtpczk2f7.png" alt="Image description" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Terraform does its best to figure out which primitive type to assign to &lt;code&gt;any&lt;/code&gt;; in the example above, it will assign the string type.&lt;/p&gt;
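&lt;p&gt;A minimal sketch of the &lt;code&gt;any&lt;/code&gt; constraint (the variable name and default are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "environments" {
  type    = list(any)
  default = ["dev", "staging", "prod"]   # Terraform infers string as the element type
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;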

&lt;h2&gt;
  
  
  Dynamic Blocks:
&lt;/h2&gt;

&lt;p&gt;Dynamic blocks allow the construction of repeatable nested configuration blocks inside the following Terraform blocks: resource, data, provisioner, and provider.&lt;br&gt;
Imagine the following scenario, where you need to create a security group that contains many rules:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9PDARCex--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pfd3pi6xa9agnoj0fp3i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9PDARCex--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pfd3pi6xa9agnoj0fp3i.png" alt="Image description" width="800" height="650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the ingress rules add up, the security group becomes hard to manage, and the code doesn't look clean either. A way to clean up the above code is to use dynamic blocks. First, we can extract the data from the ingress blocks into one variable that looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EaffHGMh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x7rowsl45hon7vomk5al.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EaffHGMh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x7rowsl45hon7vomk5al.png" alt="Image description" width="800" height="716"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;then by using the following snippet:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--d7seK7Wb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/51rrm8gnjhdjlj9h6dce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--d7seK7Wb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/51rrm8gnjhdjlj9h6dce.png" alt="Image description" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the &lt;strong&gt;dynamic&lt;/strong&gt; keyword, we tell Terraform which block we want to replicate, in this case &lt;strong&gt;ingress&lt;/strong&gt;. Then we assign our variable to loop through; and inside &lt;strong&gt;content&lt;/strong&gt;, Terraform implicitly provides an &lt;strong&gt;ingress&lt;/strong&gt; object whose values we access with the &lt;strong&gt;value&lt;/strong&gt; keyword. The object name matches the dynamic block's label, &lt;strong&gt;ingress&lt;/strong&gt;.&lt;/p&gt;
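&lt;p&gt;The pattern described above can be sketched as follows (the variable name and port values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "ingress_ports" {
  type    = list(number)
  default = [22, 80, 443]
}

resource "aws_security_group" "web_sg" {
  name = "web-sg"

  # replicate the nested ingress block once per element of var.ingress_ports
  dynamic "ingress" {
    for_each = var.ingress_ports
    content {
      # ingress.value holds the current element of the list
      from_port   = ingress.value
      to_port     = ingress.value
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;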




&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;Hope this blog gave you an idea of how modules, built-in functions, type constraints, and dynamic blocks work. In the next article, we're going to look at some hacks and tricks that can be used in Terraform.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Terraform 101 - Part 2/3: State, Variables, Outputs, and provisioners | By Ali KHYAR</title>
      <dc:creator>Ali KHYAR</dc:creator>
      <pubDate>Sat, 08 Oct 2022 00:41:24 +0000</pubDate>
      <link>https://dev.to/alikhyar/terraform-101-part-23-state-variables-outputs-and-provisioners-by-ali-khyar-lno</link>
      <guid>https://dev.to/alikhyar/terraform-101-part-23-state-variables-outputs-and-provisioners-by-ali-khyar-lno</guid>
      <description>&lt;p&gt;Part 1/3: History, Workflow, and Resource Addressing: &lt;a href="https://dev.to/alikhyar/terraform-101-part-13-history-workflow-and-resource-addressing-by-ali-khyar-4m23"&gt;link here&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  State:
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Concepts and local storage:
&lt;/h2&gt;

&lt;p&gt;State in Terraform is the mechanism that makes Terraform act the way it does: it's what helps it map real-world resources to your configuration. Why is it essential? Because with it Terraform can track which resources are deployed, so that next time you apply new configuration code, Terraform can decide which resources need to be created, updated, or destroyed; this is done by comparing the state file with the configuration code.&lt;/p&gt;

&lt;p&gt;Terraform state is tracked through a flat file named &lt;code&gt;terraform.tfstate&lt;/code&gt; by default, a JSON dump that contains metadata and data about deployed resources. If no backend is specified in the configuration code, the state is stored locally, but as a better practice the state should be stored remotely for integrity and availability across teams.&lt;br&gt;
Besides storing the state remotely, it is also recommended not to lose it, because without it you have no way to know which resources were previously deployed. You also wouldn't want the state file to fall into the wrong hands, because it may contain sensitive data about your resources.&lt;/p&gt;

&lt;p&gt;Terraform state has 3 common sub-commands:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--B36swHxc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/98sljijbnf2dau1k08d2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--B36swHxc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/98sljijbnf2dau1k08d2.png" alt="Image description" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first command is used to list tracked resources by Terraform state. The second one is used to show details of a tracked resource. The last command is used to remove resources from the state file so they won't be tracked anymore.&lt;/p&gt;
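&lt;p&gt;On the CLI, the three sub-commands look like this (the resource address is the one used later in this example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform state list                              # list all tracked resources
terraform state show docker_image.busybox-image   # show details of one resource
terraform state rm docker_image.busybox-image     # stop tracking a resource
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;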

&lt;p&gt;Let's see the configuration below which provisions a docker image resource and spin up a container of that image locally:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z2c2ErP2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x84pz2lh2hd372p8034s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z2c2ErP2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x84pz2lh2hd372p8034s.png" alt="Image description" width="630" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, we will initialize the state with &lt;code&gt;terraform init&lt;/code&gt;, a command that creates a folder named &lt;code&gt;.terraform&lt;/code&gt;, a local cache where Terraform retains files it will need for subsequent operations against this configuration (providers …). &lt;code&gt;terraform init&lt;/code&gt; also creates &lt;code&gt;.terraform.lock.hcl&lt;/code&gt;, a dependency lock file that gets created or updated whenever &lt;code&gt;terraform init&lt;/code&gt; is run.&lt;br&gt;
After initializing the working directory, we can run &lt;code&gt;terraform plan&lt;/code&gt; (not mandatory) to see which resources will get deployed, then run &lt;code&gt;terraform apply&lt;/code&gt; to deploy the actual resources. Running &lt;code&gt;terraform apply&lt;/code&gt; will create a state file named &lt;code&gt;terraform.tfstate&lt;/code&gt; which keeps track of managed resources.&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;terraform state list&lt;/code&gt; will show tracked resources which in this case will return:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G-sZZRMZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/urox2xez67ue3g3unodk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G-sZZRMZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/urox2xez67ue3g3unodk.png" alt="Image description" width="665" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;which are the two resources we deployed. We can see more information about the state of each resource by running &lt;code&gt;terraform state show &amp;lt;resource_type.resource_name&amp;gt;&lt;/code&gt; that will return in the case of the docker image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--p7x45zaC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/psnive4im1o5onozx05z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--p7x45zaC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/psnive4im1o5onozx05z.png" alt="Image description" width="800" height="152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's remove the docker image resource from being tracked with &lt;code&gt;terraform state rm docker_image.busybox-image&lt;/code&gt;, then destroy the resources with &lt;code&gt;terraform destroy&lt;/code&gt;. Since the image is no longer tracked, it won't be destroyed, and Terraform will only destroy the container, as the command output shows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_zemPg3---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9l0nk8ntxfzpnkgajnhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_zemPg3---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9l0nk8ntxfzpnkgajnhh.png" alt="Image description" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  State Storage:
&lt;/h2&gt;

&lt;p&gt;The default behavior of Terraform state is to be stored locally, but for better availability, security, and visibility across teams, it's a better practice to store state remotely, such as in HashiCorp Consul, AWS S3, or an Azure Storage Account. Remote state storage allows, among many other things, sharing outputs with other code elsewhere. You can set up where the state file is stored in the &lt;code&gt;terraform&lt;/code&gt; block, using the &lt;code&gt;backend&lt;/code&gt; attribute. In AWS S3, the configuration will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z_va7ynK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6xphc09li8ma9paj5uhf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z_va7ynK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6xphc09li8ma9paj5uhf.png" alt="Image description" width="503" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This assumes we have a bucket created called &lt;code&gt;mybucket&lt;/code&gt;. The Terraform state is written to the key &lt;code&gt;path/to/my/key&lt;/code&gt;.&lt;br&gt;
Using Azure, you can also store the state as a Blob with the given Key within the Blob Container within the Blob Storage Account. Those are some configuration examples:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VXusHvQY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xnamq0cx2u9k1t5m8fbr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VXusHvQY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xnamq0cx2u9k1t5m8fbr.png" alt="Image description" width="775" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Variables:
&lt;/h2&gt;

&lt;p&gt;Variables in Terraform serve the same purpose as variables in programming languages: they are a way of storing data so that your configuration code stays clean and reusable. Variables are declared within Terraform using the following syntax:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ycpkfs9_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/csh84rdon7brb3wxvfty.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ycpkfs9_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/csh84rdon7brb3wxvfty.png" alt="Image description" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Variable types fall into two groups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;base: string ("anything between double quotes"), number (15, 0.15…), bool (true, false)&lt;/li&gt;
&lt;li&gt;complex: list(["same", "type"]), set, map ({name = "Mabel", age = 52}), object({ port = number service = string }), tuple.
We can define a variable type that combines one or more types, for instance:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BmSNErEX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mu1lrugymjmgjlvrfk0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BmSNErEX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mu1lrugymjmgjlvrfk0y.png" alt="Image description" width="780" height="850"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Terraform variables are referenced in configuration code with &lt;code&gt;var.name_of_var&lt;/code&gt;. When a variable gets values from several places, values supplied through OS environment variables (&lt;code&gt;TF_VAR_name&lt;/code&gt;) are overridden by the &lt;code&gt;terraform.tfvars&lt;/code&gt; file, and a default declared in the main configuration code is used only if no other value is supplied.&lt;br&gt;
Other parameters that are useful when declaring variables are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validation: useful for catching errors before the configuration starts running. A common example is below: testing an IP address against a &lt;code&gt;regex&lt;/code&gt; expression using the built-in regex function (we will take a look at built-in functions in part 3/3).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H_XQx7sB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vdegj1xp8c333f1gd1ot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H_XQx7sB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vdegj1xp8c333f1gd1ot.png" alt="Image description" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sensitive data: often you need to configure your infrastructure using sensitive or secret information such as usernames, passwords, API tokens, or Personally Identifiable Information (PII). When you do so, you need to ensure that this data is not accidentally exposed in CLI output, log output, or source control, a common solution is to set &lt;code&gt;sensitive&lt;/code&gt; flag to be &lt;code&gt;true&lt;/code&gt; within the variable configuration.&lt;/li&gt;
&lt;/ul&gt;
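&lt;p&gt;A hedged sketch combining both parameters (the variable name, condition, and error message are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "db_password" {
  type      = string
  sensitive = true   # value is redacted in CLI and log output

  validation {
    condition     = length(var.db_password) &amp;gt;= 12
    error_message = "The password must be at least 12 characters long."
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;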

&lt;h2&gt;
  
  
  Outputs:
&lt;/h2&gt;

&lt;p&gt;Output values give you back information in the CLI about deployed resources; they are like &lt;code&gt;return&lt;/code&gt; values in programming-language functions. Below is an output that gives back the private IP address of a deployed EC2 instance (resource type: aws_instance) named my-ec2:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Fe8SbR0s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0vd58mwrl8spqwhcnq4e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Fe8SbR0s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0vd58mwrl8spqwhcnq4e.png" alt="Image description" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Output variable values are shown in the CLI after a successful &lt;code&gt;terraform apply&lt;/code&gt;. You can still set &lt;code&gt;sensitive = true&lt;/code&gt; on outputs, in case they contain sensitive values.&lt;/p&gt;
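&lt;p&gt;As a text sketch of the output described above (following the my-ec2 resource name from the example; the output name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output "ec2_private_ip" {
  description = "Private IP address of the deployed instance"
  value       = aws_instance.my-ec2.private_ip
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;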

&lt;h2&gt;
  
  
  Provisioners:
&lt;/h2&gt;

&lt;p&gt;Provisioners give users the ability to execute commands and scripts through Terraform resources. You can run those commands/scripts either on the machine where Terraform is installed or on the resources that were created with Terraform. Each provisioner is attached to a certain resource and has the ability to connect to it (if it needs to) via protocols such as SSH.&lt;/p&gt;

&lt;p&gt;There are two provisioner types: "creation time" and "destroy-time" provisioners which you can set to run when a resource is being created or destroyed.&lt;/p&gt;

&lt;p&gt;Although provisioners look like a good feature, HashiCorp recommends not using them unless the cloud provider doesn't offer a mechanism for running commands or scripts on resources. One con of provisioners is that what they do is not tracked by Terraform state.&lt;/p&gt;

&lt;p&gt;Provisioners are recommended only when Terraform's declarative model doesn't already offer the action to be taken. If, while applying the configuration, a provisioner exits with a non-zero code, it is considered failed and the resource is tainted.&lt;/p&gt;

&lt;p&gt;The configuration below runs two provisioners on a null resource:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S2fyW-jm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rv9tk1xcsqqsfk479ma6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S2fyW-jm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rv9tk1xcsqqsfk479ma6.png" alt="Image description" width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One runs at resource creation and another (the one containing destroy) runs when the resource is destroyed; the two provisioners append 0 and 1 respectively to a file named status.txt. So, as you can tell, when you first provision the null_resource there will be 0 in status.txt, and after destroying it, there will be 01 in that file.&lt;/p&gt;
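&lt;p&gt;A minimal sketch of such a null resource (the exact commands are illustrative; the original snippet is in the image above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "null_resource" "demo" {
  # creation-time provisioner: runs on terraform apply
  provisioner "local-exec" {
    command = "printf 0 &amp;gt;&amp;gt; status.txt"
  }

  # destroy-time provisioner: runs on terraform destroy
  provisioner "local-exec" {
    when    = destroy
    command = "printf 1 &amp;gt;&amp;gt; status.txt"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;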

&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;I hope this blog helps you understand a bit more about Terraform state, variables, outputs, and provisioners. In the next blog, I'm going to talk about Terraform modules, built-in functions, and dynamic blocks.&lt;/p&gt;




&lt;p&gt;Part 3/3: Modules, Built-in Functions, Type Constraints, and Dynamic Blocks: link here&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>iac</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Terraform 101 - Part 1/3: History, Workflow, and Resource Addressing | By Ali KHYAR</title>
      <dc:creator>Ali KHYAR</dc:creator>
      <pubDate>Fri, 07 Oct 2022 23:11:00 +0000</pubDate>
      <link>https://dev.to/alikhyar/terraform-101-part-13-history-workflow-and-resource-addressing-by-ali-khyar-4m23</link>
      <guid>https://dev.to/alikhyar/terraform-101-part-13-history-workflow-and-resource-addressing-by-ali-khyar-4m23</guid>
      <description>&lt;h2&gt;
  
  
  About Terraform:
&lt;/h2&gt;

&lt;p&gt;Terraform is an open-source Infrastructure as Code (IaC) software tool, which simply means it enables you to write resource deployments, usually for the cloud, in a human-readable way. IaC is one of the better DevOps practices: infrastructure code is tracked and deployed in a repeatable, predictable manner.&lt;/p&gt;

&lt;p&gt;Back in 2011, when AWS CloudFormation appeared, one of the creators of Terraform saw the need for an open-source, cloud-agnostic tool that is not bound to one cloud provider and has the same functionality as CloudFormation. The idea of Terraform appeared in 2011, but the first lines of Golang code weren't written until July 2014, and version 0.1 only had support for AWS and DigitalOcean.&lt;/p&gt;

&lt;p&gt;Terraform uses its own language, known as the HashiCorp Configuration Language (HCL), which was created to have both a human- and machine-friendly syntax: it has a native syntax intended to be pleasant for humans to read and write, and a JSON-based variant that is easier for machines to generate and parse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terraform Workflow:
&lt;/h2&gt;

&lt;p&gt;The core terraform workflow has three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;write&lt;/strong&gt;: writing your code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;plan&lt;/strong&gt;: reads the code and previews changes; basically, it makes Terraform mock what the code will apply. You can do any number of iterations between the write and plan phases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;apply&lt;/strong&gt;: tells Terraform to provision real infrastructure and update the state file.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One other command you will need to know is &lt;code&gt;terraform destroy&lt;/code&gt;, which looks at the state file recorded during deployment and destroys all the resources that were created. It is a non-reversible command, so it should be used with caution.&lt;/p&gt;
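&lt;p&gt;The core workflow maps onto the CLI like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform init      # prepare the working directory (plugins, modules, backend)
terraform plan      # preview the changes the code would make
terraform apply     # provision real infrastructure and update the state file
terraform destroy   # destroy everything recorded in the state file (use with caution)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;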

&lt;h2&gt;
  
  
  Terraform Init:
&lt;/h2&gt;

&lt;p&gt;Terraform expects to be invoked from a working directory that contains configuration files written in the Terraform language. It uses the configuration content from this directory, and also uses the directory to store settings, cached plugins and modules, and sometimes state data. Hence, if the working directory hasn't been initialized yet, we should do so with the command &lt;code&gt;terraform init&lt;/code&gt;, which is like &lt;code&gt;git init&lt;/code&gt; for Terraform: it downloads modules and plugins (I will cover modules in part 3/3) and sets up the backend for storing the Terraform state file, the mechanism with which Terraform tracks resources. Note that if you run a command that relies on initialization without first initializing, the command will fail with an error explaining that you need to run init.&lt;/p&gt;

&lt;p&gt;When initializing the working directory, two items appear alongside the Terraform configuration files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;.terraform: a hidden directory, used to manage cached provider plugins and modules, a record of active workspace, and a record of backend configuration.&lt;/li&gt;
&lt;li&gt;State data file, if the configuration uses the default local backend. This is managed by Terraform in a &lt;code&gt;terraform.tfstate&lt;/code&gt; file (if the directory only uses the default workspace) or a &lt;code&gt;terraform.tfstate.d&lt;/code&gt; directory (if the directory uses multiple workspaces).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Terraform Configuration:
&lt;/h2&gt;

&lt;p&gt;A Terraform configuration typically starts with a block like the one below, which tells Terraform which provider we will interact with and defines its configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provider "aws"{
    region = "us-east-1"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In the configuration above, &lt;code&gt;provider&lt;/code&gt; is a reserved keyword; the name that follows it, here &lt;code&gt;aws&lt;/code&gt;, tells Terraform which provider to download and configure. Between the braces are the configuration arguments for that provider; the available arguments vary from provider to provider.&lt;/p&gt;
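&lt;p&gt;On recent Terraform versions it is also common to pin the provider's source and version in a &lt;code&gt;terraform&lt;/code&gt; block alongside the provider configuration (the version constraint below is an arbitrary example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~&gt; 5.0"   # example constraint: any 5.x release
    }
  }
}

provider "aws" {
  region = "us-east-1"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;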

&lt;p&gt;- - - - - - -&lt;/p&gt;

&lt;p&gt;The most important thing you'll configure with Terraform is resources. Resources are the components of your infrastructure. A resource might be a low-level component such as a physical server, virtual machine, or container, or a higher-level component such as an email provider, DNS record, or database. Let's look at the example below, which deploys an AWS EC2 instance:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_instance" "web" {
    ami           = "ami-0abcdef1234567890" # placeholder Ubuntu 20.04 AMI ID
    instance_type = "t3.micro"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;resource&lt;/code&gt; is a reserved keyword that tells Terraform to treat the block as a resource block. &lt;code&gt;"aws_instance"&lt;/code&gt; is a resource type implemented by the AWS provider; every provider is a plugin that implements resource types (for AWS, see &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest"&gt;https://registry.terraform.io/providers/hashicorp/aws/latest&lt;/a&gt;). &lt;code&gt;"web"&lt;/code&gt; is an arbitrary local name chosen by the user. The arguments between braces configure the resource; here we only specify the AMI and the instance type. For the full set of available arguments, see &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance"&gt;https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance&lt;/a&gt;.&lt;br&gt;
- - - - - - -&lt;br&gt;
Another block you need to know about in Terraform is the data source block. The main difference from a resource block is that a data source fetches and tracks details of an already existing resource, whereas a resource block creates one from scratch. The following example looks up an already deployed VM by its instance ID:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data "aws_instance" "apache-server" {
   instance_id = "some-random-id"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Resource Addressing:
&lt;/h2&gt;

&lt;p&gt;Let's imagine a scenario where you need to reference the deployed EC2 instance elsewhere in the configuration. You do that with the resource type, &lt;code&gt;aws_instance&lt;/code&gt;, followed by a dot and the arbitrary name you gave it, so the resource above is addressed as &lt;code&gt;aws_instance.web&lt;/code&gt;. Data source blocks work the same way, with a &lt;code&gt;data.&lt;/code&gt; prefix: &lt;code&gt;data.aws_instance.apache-server&lt;/code&gt;.&lt;/p&gt;
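&lt;p&gt;As a sketch, output blocks that read attributes through these addresses could look like the following (the attribute names come from the AWS provider's &lt;code&gt;aws_instance&lt;/code&gt; documentation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output "web_public_ip" {
  # resource address: &lt;type&gt;.&lt;name&gt;.&lt;attribute&gt;
  value = aws_instance.web.public_ip
}

output "apache_server_az" {
  # data source address: data.&lt;type&gt;.&lt;name&gt;.&lt;attribute&gt;
  value = data.aws_instance.apache-server.availability_zone
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;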

&lt;p&gt;If you need to reference a property of a resource from inside that same resource (for example, in a provisioner), use the &lt;code&gt;self&lt;/code&gt; attribute instead of the resource's own address, which would create a dependency cycle. For instance, &lt;code&gt;self.public_ip&lt;/code&gt; refers to the IP address of the EC2 instance being deployed.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
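&lt;p&gt;A minimal sketch of this pattern (the AMI ID, SSH user, and command are placeholder assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_instance" "web" {
  ami           = "ami-0abcdef1234567890" # placeholder AMI ID
  instance_type = "t3.micro"

  provisioner "remote-exec" {
    inline = ["echo connected"]

    connection {
      type = "ssh"
      user = "ubuntu"
      host = self.public_ip  # "self" refers to this aws_instance
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;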



&lt;p&gt;- - - - - -&lt;br&gt;
In &lt;a href="https://dev.to/alikhyar/terraform-101-part-23-state-variables-outputs-and-provisioners-by-ali-khyar-lno"&gt;the next blog&lt;/a&gt;, we will talk more about state, variables, provisioners, and modules.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>iac</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
