Michael Salata

The NoFluff Cheatsheet for the Airflow 3 Fundamentals


Curated Info for the Airflow 3 Fundamentals Certification

If you’re looking to get up to speed on Airflow 3 or master the essentials while earning a certification, the Airflow Fundamentals Certification is a solid option. That said, the existing study guides can be outdated or over-scoped. This article is an updated cheatsheet, validated for correctness and curated for the Airflow topics that directly helped answer questions on the certification.

My Background

I’m a Software Engineer who used Airflow to complete the data-engineering-zoomcamp by Datatalks.club and build my Fitbit ETL pipeline.


I aced the Airflow 3 Fundamentals Certification after completing the Astronomer Airflow 3 Learning Path and watching Marc Lamberti’s live Airflow 3 Crash Course.

You can learn more about me on my GitHub.

Material in BOLD

Pay close attention to what’s in BOLD. The info in bold was specifically asked about in the exam, often word-for-word.

Topics that are not bold are indirectly relevant and often necessary background for identifying problems or potential solutions.


The Cheatsheet

Airflow Architecture and Purpose

  • DAG Parser, API Server, Scheduler, Executor, Worker (ref)

Life of a DAG & Task

  1. DAG Parser

    • parses DAGs in the DAGs folder every 5 minutes
    • serializes DAGs into the Metadata DB
  2. Scheduler

    1. reads DAGs and state from the Metadata DB
    2. schedules Task Instances (ref) with the Executor
  3. Executor (component of the Scheduler) (Executors)

    1. pushes Task Instances to the queue
  4. Worker

    1. picks up Task Instances from the queue
    2. updates Task Instance status through the API Server
      • Previously, Airflow 2 workers updated the metadata DB directly.
    3. executes the Task Instance

Key Properties

  • The Airflow 2 Webserver is now the Airflow 3 API Server.

  • The default time zone for Airflow is UTC (Coordinated Universal Time).

  • Refresh interval to display NEW DAGs is dag_dir_list_interval and defaults to 5 minutes (ref).

  • Refresh interval to update MODIFIED DAGs is min_file_process_interval and defaults to 30 seconds (ref).

  • “The Executor determines where and how tasks are run.”

  • TransferOperator moves or copies data.

CLI

  • airflow db init (ref)

    • initializes the metadata database
  • airflow users create (ref)

  • airflow standalone (ref ref2)

    • initializes the DB and starts the API server and scheduler
  • airflow info (ref)

    • prints Airflow environment info:

      • providers installed + provider versions
      • paths
      • tools
      • system info (Python version, OS)
  • airflow cheat-sheet (ref)

    • quick reference for common commands
  • airflow * export (ref)

    • paired with airflow * import
    • exports connections, pools, users, variables
    • note: environment variables are NOT exported/imported
  • airflow tasks test <dag_id> <task_id> <logical_date> (ref)

    • runs a single task without checking dependencies or recording its state in the database
  • airflow dags backfill --start-date <START_DATE> --end-date <END_DATE> <DAG_ID>

Airflow Connections (ref)

  • conn_id — the required unique ID for a connection

  • parameters — specific to the connection (login, password, hostname, compute_type, etc)

  • Stored encrypted by default

Connection Creation Options

  • UI, CLI, environment variables, API Server REST API, Python code, Secrets Backend

Connections Created from Environment

  • environment connection variables start with AIRFLOW_CONN_... (Env connections)

    • append a custom unique conn_id to the end.
  • Know the URI format (ref)

AIRFLOW_CONN_MY_HTTP=my-conn-type://login:password@host:port/schema?param1=val1&param2=val2
  • Connections created via environment variables have special visibility (ref)

    • NOT stored in the metadata DB
    • NOT shown in the UI
    • YES — still accessible from tasks
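
For example, a task can look up that environment-defined connection by its conn_id even though it never appears in the UI. A minimal sketch, assuming the classic BaseHook.get_connection API (the conn_id my_http matches the hypothetical AIRFLOW_CONN_MY_HTTP variable above; the import path may differ under the Airflow 3 Task SDK):

from airflow.hooks.base import BaseHook  # import path may differ in Airflow 3
from airflow.sdk import task

@task
def use_connection():
    # resolves AIRFLOW_CONN_MY_HTTP even though it is not in the metadata DB or UI
    conn = BaseHook.get_connection("my_http")
    print(conn.host, conn.login, conn.schema, conn.extra_dejson)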

Airflow Variables (ref)

  • JSON key–value store

  • composed of:

    • Unique Key/ID
    • Value (JSON-serializable)
    • Description (optional)
  • Use cases:

    • API URLs & keys
    • values that change across environments (dev, staging, prod)
  • example usage:

Variable.get("api", deserialize_json=True)

Creation Options

  • Airflow REST API, Airflow CLI, Python inside Airflow (not advised), Airflow UI, environment variables

  • Creation via environment variable:

    • AIRFLOW_VAR_...
    • AIRFLOW_VAR_MY_VAR='{"my_params": [1,2,3,4]}'
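
To read that environment-created variable back in Python (a minimal sketch; my_var is the hypothetical key defined by AIRFLOW_VAR_MY_VAR above, and the Variable import path may differ in Airflow 3):

from airflow.models import Variable  # import path may differ in Airflow 3

params = Variable.get("my_var", deserialize_json=True)  # -> {"my_params": [1, 2, 3, 4]}
print(params["my_params"])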

Certain Keywords will hide a Variable from the UI & Logs

  • variables containing certain keywords (access_token, api_key, password, etc.) are hidden from the UI & Logs (ref).

DAG Setup

  • dag_id is the only REQUIRED parameter.

Valid DAG declaration syntax (ref)

from airflow.models import DAG
dag = DAG(...)
PythonOperator(dag=dag, ...)


from airflow.sdk import DAG
with DAG(...):
    PythonOperator(...)


from airflow.sdk import dag, task
@dag(...)
def my_dag(...):
    @task
    def my_task():
        ...
    my_task()

my_dag()

default_args (ref)

  • Purpose: avoid repetition

  • It is a dict of default task parameters applied to all tasks in the DAG.

  • Task-level args override the default_args.
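
A minimal sketch of how default_args interacts with a task-level override (names and values are illustrative):

from datetime import timedelta
from airflow.sdk import DAG, task

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(dag_id="example_defaults", default_args=default_args):

    @task
    def standard():        # inherits retries=2 from default_args
        ...

    @task(retries=5)       # the task-level value overrides default_args
    def important():
        ...

    standard() >> important()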

DAG runs (ref)

  • Created by the scheduler

  • properties: state, dag_id, logical_date, start_date, end_date, duration, run_id

    • logical_date is the timestamp associated with the run
    • run_id is a timestamp-based identifier
    • start_date: earliest logical time from which the Scheduler considers creating runs
    • end_date: latest logical time to create runs for
  • State transitions: queued → running → [success or failed]

    • A DAG run is marked success when its final (leaf) tasks succeed.
  • How many runs will happen when unpausing a DAG with certain parameters & scenarios?

    • often asked in the context of backfilling
    • start_date, end_date, schedule, and catchup are varied
    • The behavior around logical_date changed from Airflow 2 to 3, so older resources may be outdated.
  • dag_id is the only mandatory parameter,

    • but it’s good practice to set description, tags, schedule, start_date, end_date, catchup, default_args, max_active_runs and max_active_tasks.
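
Putting those recommended parameters together, an illustrative sketch of a typical DAG declaration (all names, dates, and limits are made up):

from datetime import datetime
from airflow.sdk import dag, task

@dag(
    dag_id="daily_example",
    description="Illustrative DAG with the commonly recommended parameters",
    tags=["example"],
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    end_date=datetime(2026, 1, 1),
    catchup=False,
    max_active_runs=1,
    max_active_tasks=4,
)
def daily_example():
    @task
    def hello():
        print("hello")

    hello()

daily_example()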

catchup (DAG parameter)

  • catchup=False is the default in Airflow 3

    • Even so, the most recently missed DAG run still executes immediately after unpausing.

acceptable values for schedule (DAG parameter)

  • None (only manual/API-triggered runs),

  • cron expressions,

  • datetime.timedelta objects,

  • presets: @once, @hourly, @daily (aka @midnight), @weekly, @monthly, @quarterly, @yearly (ref),

  • @continuous = run as soon as the previous run finishes
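
As a quick illustration, each of these is a valid value for schedule (the DAG ids, cron string, and interval are arbitrary examples):

from datetime import timedelta
from airflow.sdk import DAG

DAG(dag_id="manual_only", schedule=None)                    # manual / API-triggered runs only
DAG(dag_id="cron_daily", schedule="0 6 * * *")              # cron: every day at 06:00 UTC
DAG(dag_id="every_4_hours", schedule=timedelta(hours=4))    # fixed interval
DAG(dag_id="preset_daily", schedule="@daily")               # preset (same as "@midnight")
DAG(dag_id="back_to_back", schedule="@continuous", max_active_runs=1)  # back-to-back runs, one at a time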

XCOMs (ref)

  • for passing small metadata between tasks

  • must be JSON-serializable

  • written to the metadata DB via the API Server

  • XCom pull requirements — example:

    • Task1 pushes a value and key to the Metadata DB
    • Task2 pulls the value by key and one of:

      • run_id, task_id, dag_id
      • Usually key + task_id are sufficient.
from airflow.sdk import task

@task
def task1(**context):
    val = 10
    # push under an explicit key
    context["ti"].xcom_push(key="my_key", value=val)

@task
def task2(**context):
    # pull by key + upstream task_id
    val = context["ti"].xcom_pull(task_ids="task1", key="my_key")
  • XCom size limits depend on the metadata database (Astronomer ref):

    • SQLite = 2GB
    • Postgres = 1GB
    • MySQL = 64KB
  • XComs are for passing metadata necessary for the pipeline, not the pipeline’s bulk data

  • Tasks tested with the airflow tasks test command still store their XComs in the Metadata DB and may need to be cleared manually using airflow xcom clear.
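
With the TaskFlow API, return values are pushed to XCom automatically (under the key return_value) and pulled by passing them as arguments, so explicit xcom_push/xcom_pull calls are often unnecessary. A minimal sketch:

from airflow.sdk import dag, task

@dag()
def xcom_taskflow_example():

    @task
    def extract():
        return {"rows": 42}        # pushed to XCom automatically

    @task
    def report(payload):
        print(payload["rows"])     # pulled from XCom automatically

    report(extract())

xcom_taskflow_example()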

Task dependency orchestration (Task relationships)

Examples:

  • task1 >> [task2, task3] >> task4 = task1 runs, then task2 & task3 in parallel, then task4

  • task1 << [task2, task3] << task4 = reverse dependency notation: task4 runs first, then task2 & task3 in parallel, then task1

  • [t1, t2] >> [t3, t4] — raises an error; dependencies cannot be set directly between two lists (see the loop sketch below the chain example)

chain

chain([t1, t2], [t3, t4])  # pairwise dependencies: t1 >> t3 and t2 >> t4 (lists must be the same length)
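
If the goal behind [t1, t2] >> [t3, t4] is a full cross-dependency (every task in the first list upstream of every task in the second), a simple loop with the >> operator already shown does it; Airflow also provides a cross_downstream helper for the same pattern. A sketch using the t1–t4 tasks from the examples above:

# full cross-dependency: t1 >> t3, t1 >> t4, t2 >> t3, t2 >> t4
for upstream in [t1, t2]:
    upstream >> [t3, t4]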

Options to Backfill a DAG

  • CLI, Airflow UI, REST API call

Sensors (ref)

  • checks a condition and waits poke_interval seconds before checking again
PythonSensor(
    task_id="waiting_for_condition",
    python_callable=_condition,   # callable that returns True once the condition is met
    poke_interval=60,             # seconds between checks
    timeout=7 * 24 * 60 * 60,     # 1 week (the default)
    mode="poke",
)
  • timeout and poke_interval are specified in seconds

  • Default timeout is 1 week

    • Setting a meaningful timeout is important because the default can stall a worker

Sensor modes

  • mode="poke" is the default

    • Live poke Sensors hold worker control and consume a worker slot.

      • It’s easy to freeze an entire Airflow instance like this (tasks are scheduled but not started).
      • Use poke Sensors when the poke_interval <= 5 minutes.
  • mode="reschedule"

    • allows workers to do other tasks between checks
    • Sensor Task Instance is put in up_for_reschedule state between condition checks.
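
A sketch of the same sensor switched to reschedule mode, where the worker slot is released between checks (values are illustrative):

PythonSensor(
    task_id="waiting_for_condition",
    python_callable=_condition,
    poke_interval=10 * 60,     # check every 10 minutes
    timeout=24 * 60 * 60,      # give up after 1 day
    mode="reschedule",         # task sits in up_for_reschedule between checks
)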

Airflow Providers

  • Airflow Providers are third-party packages and integrations.

  • often include Connections, Operators, Hooks, Python modules, etc

  • Registry of Providers: https://registry.astronomer.io/

Best practices

  • Always define a meaningful timeout parameter.

    • Default is seven days and can block a DAG.
  • If poke_interval ≤ 5 minutes, set mode="poke".

  • Define a meaningful poke_interval.

Task lifecycle states (ref)

  • scheduled: Task Instance created and waiting for a slot

  • queued: handed to the executor; waiting for a worker

  • running: executing on a worker

  • success: finished successfully

  • failed: finished with error and no retries left (or retries exhausted)

  • up_for_retry: failed but will retry after retry_delay

  • up_for_reschedule: Sensor in reschedule mode, sleeping until the next check

  • deferred: deferrable operator yielded to a trigger; not using a worker slot

  • skipped: bypassed by branching/short-circuit/trigger rules

  • upstream_failed: did not run because upstream tasks failed and the trigger rule wasn’t met
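
For instance, the up_for_retry state is driven by the retries and retry_delay settings on a task; a minimal sketch (values are illustrative):

from datetime import timedelta
from airflow.sdk import task

@task(retries=3, retry_delay=timedelta(minutes=5))
def flaky():
    # on failure this task goes to up_for_retry, waits retry_delay,
    # and only becomes failed once all retries are exhausted
    ...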

DAG Debugging

  • Deleting a DAG from the UI removes all run history & task instances from the metadata database and temporarily hides the DAG until it is re-parsed. It does not remove the DAG file itself.

  • Always import using full paths starting from the dags folder.

    • avoid relative imports.

DAG not showing

  • Wait for the UI refresh interval for new DAGs: dag_dir_list_interval (default 5 min).

  • Wait for the UI refresh interval for modified DAGs: min_file_process_interval (default 30 sec).

  • Ensure the dag_id is unique.

    • When two DAGs share the same dag_id, the one that’s displayed will be random.
  • Check if the DAG is in .airflowignore.

  • By default, Airflow only recognizes files that contain both the strings "airflow" and "dag" as DAG files.

DAG not running

  • Check that the DAG is unpaused.

  • Ensure start_date is in the past.

  • Confirm end_date is in the future.

  • Allow multiple versions to run at the same time if intended.

  • Check max_active_runs_per_dag (defaults to 16)

  • Check max_active_tasks_per_dag (defaults to 16)

  • Set parallelism (max Task Instances that can run per scheduler; default 32)

Validate Airflow Connections

  • Airflow UI → Admin → Connections → enter password → click TEST

New Changes moving from Airflow 2 to 3

  • start_date=None is acceptable and now the default

  • The logical date now matches when the DAG run starts:

    • the run executes immediately instead of waiting for the data interval to end
  • airflow db init initializes the Metadata DB.

  • catchup=False by default

  • Airflow 2 Webserver is now the Airflow 3 API Server.

  • When CREATE_CRON_DATA_INTERVALS=True, DAG scheduling behaves like Airflow 2.

    • Airflow 2: DAGs execute after the interval ends.
    • Airflow 3: DAGs execute at the start of the interval.
  • schedule_interval is now named schedule (Scheduling API).

Certification Topics NOT covered here

Final Certification Tips

  • If it’s a multi-select problem, always select more than one box.

  • The industry is currently migrating from Airflow 2 to 3, so differences between the two versions were highlighted on this certification. They may not be in the future.

  • Good luck → Certification Exam: Apache Airflow 3 Fundamentals
