Curated Info for the Airflow 3 Fundamentals Certification
If you’re looking to get up to speed on Airflow 3 or master the essentials while earning a certification, the Airflow Fundamentals Certification is a solid option. That said, the existing study guides can be outdated or over-scoped. This article is an updated cheatsheet, validated for correctness and curated for the Airflow topics that directly helped answer questions on the certification.
My Background
I’m a Software Engineer who used Airflow to complete the data-engineering-zoomcamp by Datatalks.club and build my Fitbit ETL pipeline.
I aced the Airflow 3 Fundamentals Certification after completing the Astronomer Airflow 3 Learning Path and watching Marc Lamberti’s live Airflow 3 Crash Course.
You can learn more about me on my GitHub.
Material in BOLD
Pay close attention to what’s in BOLD. The info in bold was specifically asked about in the exam, often word-for-word.
Topics that are not bold are indirectly relevant and often necessary background for identifying problems or potential solutions.
The Cheatsheet
Airflow Architecture and Purpose
- DAG Parser, API Server, Scheduler, Executor, Worker (ref)
Life of a DAG & Task
- DAG Parser
  - parses DAGs in the DAGs folder every 5 minutes
  - serializes DAGs into the Metadata DB
- Scheduler
  - reads DAGs and state from the Metadata DB
  - schedules Task Instances (ref) with the Executor
- Executor (component of the Scheduler) (Executors)
  - pushes Task Instances to the queue
- Worker
  - picks up Task Instances from the queue
  - updates Task Instance status through the API Server
    - Previously, Airflow 2 workers updated the metadata DB directly.
  - executes the Task Instance
Key Properties
- The Airflow 2 Webserver is now the Airflow 3 API Server.
- The default time zone for Airflow is UTC (Coordinated Universal Time).
- Refresh interval to display NEW DAGs is `dag_dir_list_interval` and defaults to 5 minutes (ref).
- Refresh interval to update MODIFIED DAGs is `min_file_process_interval` and defaults to 30 seconds (ref).
- "The Executor determines where and how tasks are run."
- A `TransferOperator` moves or copies data.
CLI
- `airflow db init` (ref) - initializes the metadata database
- `airflow users create` (ref) - creates an Airflow user account
- `airflow standalone` - initializes the DB, creates a user, and starts the API server and scheduler
- `airflow info` (ref) - prints Airflow environment info:
  - providers installed + provider versions
  - paths
  - tools
  - system info (Python version, OS)
- `airflow cheat-sheet` (ref) - quick reference for common commands
- `airflow * export` (ref) - paired with `airflow * import`
  - exports `connections`, `pools`, `users`, `variables`
  - note: environment variables are NOT exported/imported
- `airflow tasks test <dag_id> <task_id> <logical_date>` (ref) - runs a single task without checking dependencies or recording its state in the database
- `airflow dags backfill --start-date <START_DATE> --end-date <END_DATE> <DAG_ID>` - backfills DAG runs for the given date range
Airflow Connections (ref)
- `conn_id` — the required unique ID for a connection
- parameters — specific to the connection (login, password, hostname, compute_type, etc.)
- Stored encrypted by default
Connection Creation Options
- UI, CLI, environment variables, API Server REST API, Python code, Secrets Backend
Connections Created from Environment
- environment connection variables start with `AIRFLOW_CONN_...` (Env connections)
  - append a custom unique `conn_id` to the end.
- Know the URI format (ref):
  `AIRFLOW_CONN_MY_HTTP=my-conn-type://login:password@host:port/schema?param1=val1&param2=val2`
- Connections created via environment variables have special visibility (ref)
  - NOT stored in the metadata DB
  - NOT shown in the UI
  - YES — still accessible from tasks (see the sketch below)
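A minimal sketch of that behavior, with a made-up connection named `my_http`; the `BaseHook` import is the Airflow 2-style path and may have moved in Airflow 3. The connection exists only as an environment variable, yet a task can still resolve it by `conn_id`.

```python
import os

# Normally set in the deployment environment, not in the DAG file.
os.environ["AIRFLOW_CONN_MY_HTTP"] = "http://login:password@example.com:443/api"

from airflow.hooks.base import BaseHook  # assumed import path; may differ in Airflow 3
from airflow.sdk import dag, task


@dag(schedule=None, catchup=False)
def env_connection_demo():
    @task
    def use_connection():
        # Resolved from AIRFLOW_CONN_MY_HTTP even though it is not in the metadata DB or UI.
        conn = BaseHook.get_connection("my_http")
        print(conn.conn_type, conn.host, conn.login)

    use_connection()


env_connection_demo()
```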
Airflow Variables (ref)
- JSON key–value store
- composed of:
  - Unique Key/ID
  - Value (JSON-serializable)
  - Description (optional)
- Use cases:
  - API URLs & keys
  - values that change across environments (dev, staging, prod)
- example usage: `Variable.get("api", deserialize_json=True)`
Creation Options
- Airflow REST API, Airflow CLI, Python inside Airflow (not advised), Airflow UI, environment variables
- Creation via environment variable: names start with `AIRFLOW_VAR_...`, e.g. `AIRFLOW_VAR_MY_VAR='{"my_params": [1,2,3,4]}'` (see the sketch below)
Certain Keywords will hide a Variable from the UI & Logs
- variables containing certain keywords (`access_token`, `api_key`, `password`, etc.) are hidden from the UI & Logs (ref).
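A small sketch tying those pieces together, with a made-up variable named `api`; the import assumes the Airflow 3 Task SDK exposes `Variable` (Airflow 2 uses `from airflow.models import Variable`).

```python
import os

# Normally set in the deployment environment, not in the DAG file.
os.environ["AIRFLOW_VAR_API"] = '{"url": "https://api.example.com", "key": "dev-key"}'

from airflow.sdk import Variable, dag, task  # assumed Airflow 3 import path


@dag(schedule=None, catchup=False)
def variable_demo():
    @task
    def call_api():
        # deserialize_json=True turns the stored JSON string into a dict.
        api = Variable.get("api", deserialize_json=True)
        print(api["url"])

    call_api()


variable_demo()
```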
DAG Setup
- `dag_id` is the only REQUIRED parameter.
Valid DAG declaration syntax (ref)
```python
# 1. Classic: create a DAG object and pass it to each operator
from airflow.models import DAG

dag = DAG(...)
PythonOperator(dag=dag, ...)
```

```python
# 2. Context manager: operators defined inside the block are attached automatically
from airflow.sdk import DAG

with DAG(...):
    PythonOperator(...)
```

```python
# 3. TaskFlow decorators
from airflow.sdk import dag, task


@dag(...)
def my_dag(...):
    @task
    def my_task():
        ...

    my_task()


my_dag()
```
default_args (ref)
- Purpose: avoid repetition
- It is a dict of default task parameters applied to all tasks in the DAG.
- Task-level args override the `default_args` (see the sketch below).
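An illustrative sketch (DAG and task names are made up): every task inherits `retries=2` from `default_args` unless it sets its own value.

```python
from datetime import timedelta

from airflow.sdk import dag, task

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}


@dag(default_args=default_args, schedule=None, catchup=False)
def default_args_demo():
    @task
    def inherits_defaults():
        print("runs with retries=2 from default_args")

    @task(retries=0)  # task-level argument wins over default_args
    def overrides_defaults():
        print("runs with retries=0")

    inherits_defaults() >> overrides_defaults()


default_args_demo()
```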
DAG runs (ref)
- Created by the scheduler
- properties: state, dag_id, logical_date, start_date, end_date, duration, run_id
  - `logical_date` is the timestamp associated with the run
  - `run_id` is a timestamp-based identifier
  - `start_date`: earliest logical time from which the Scheduler considers creating runs
  - `end_date`: latest logical time to create runs for
- State transitions: queued → running → [success or failed]
  - A DAG run is `success` if its last task succeeds.
  - A DAG run is `failed` if its last task fails.
- How many runs will happen when unpausing a DAG with certain parameters & scenarios?
  - often asked in the context of backfilling
  - `start_date`, `end_date`, `schedule`, and `catchup` are varied
  - The behavior around `logical_date` changed from Airflow 2 to 3, so older resources may be outdated.
- `dag_id` is the only mandatory parameter, but it's good practice to set `description`, `tags`, `schedule`, `start_date`, `end_date`, `catchup`, `default_args`, `max_active_runs` and `max_active_tasks`.
catchup (DAG parameter)
- `catchup=False` is the default in Airflow 3
  - The most recently scheduled and missed DAG run still executes immediately after unpausing.
acceptable values for schedule (DAG parameter)
- `None` (only manual/API-triggered runs)
- cron expressions
- `datetime.timedelta` objects
- presets: `@once`, `@hourly`, `@daily` (aka `@midnight`), `@weekly`, `@monthly`, `@quarterly`, `@yearly` (ref)
- `@continuous` = run as soon as the previous run finishes
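Putting the scheduling parameters together, a hedged sketch (DAG name and dates are made up): with `catchup=True` the scheduler creates one run per missed daily interval when the DAG is unpaused; with the Airflow 3 default `catchup=False`, only the most recently missed interval runs.

```python
from datetime import datetime

from airflow.sdk import dag, task


@dag(
    schedule="@daily",                 # could also be a cron string or a datetime.timedelta
    start_date=datetime(2025, 1, 1),
    catchup=True,
    max_active_runs=1,
)
def backfill_demo():
    @task
    def process_interval():
        print("processing one daily interval")

    process_interval()


backfill_demo()
```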
XCOMs (ref)
- for passing small metadata between tasks
- must be JSON-serializable
- written to the metadata DB via the API Server
- XCom `pull` requirements — example:
  - Task1 pushes a value and key to the Metadata DB
  - Task2 pulls the value by key and one of: `run_id`, `task_id`, `dag_id`
    - Usually `key` + `task_id` are sufficient.
```python
from airflow.sdk import Context, task  # assumed Airflow 3 import path


@task
def task1(**context: Context):
    val = 10
    context["ti"].xcom_push(key="my_key", value=val)


@task
def task2(**context: Context):
    val = context["ti"].xcom_pull(task_ids="task1", key="my_key")
```
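A TaskFlow-style equivalent: returning a value from a task stores it as an XCom (under the key `return_value`), and passing it to a downstream task pulls it automatically.

```python
from airflow.sdk import dag, task


@dag(schedule=None, catchup=False)
def xcom_taskflow_demo():
    @task
    def produce() -> int:
        return 10  # stored as an XCom

    @task
    def consume(val: int):
        print(val)  # pulled automatically from the upstream XCom

    consume(produce())


xcom_taskflow_demo()
```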
- XCom size limits depend on the metadata database (Astronomer ref):
  - SQLite = 2GB
  - Postgres = 1GB
  - MySQL = 64KB
- XComs are for passing metadata necessary for the pipeline, not the pipeline's bulk data
- Tasks tested with the `airflow tasks test` command still store their XComs in the Metadata DB and may need to be cleared manually using `airflow xcom clear`.
Task dependency orchestration (Task relationships)
Examples:
- `task1 >> [task2, task3] >> task4` = task1 runs, then task2 & task3 in parallel, then task4
- `task1 << [task2, task3] << task4` = reverse dependency notation
- `[t1, t2] >> [t3, t4]` — errors (a list cannot be set directly against another list)
- `chain([t1, t2], [t3, t4])` — sets pairwise dependencies across equal-length lists (t1 >> t3, t2 >> t4); see the sketch below
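A runnable sketch of those patterns, assuming `chain` is importable from `airflow.sdk` (in Airflow 2 it lives in `airflow.models.baseoperator`); the task names are made up.

```python
from airflow.sdk import chain, dag, task  # assumed import path for chain


@dag(schedule=None, catchup=False)
def dependency_demo():
    @task
    def work(name: str):
        print(name)

    t1 = work.override(task_id="t1")("t1")
    t2 = work.override(task_id="t2")("t2")
    t3 = work.override(task_id="t3")("t3")
    t4 = work.override(task_id="t4")("t4")

    # pairwise dependencies across equal-length lists: t1 >> t3 and t2 >> t4
    chain([t1, t2], [t3, t4])


dependency_demo()
```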
Options to Backfill a DAG
- CLI, Airflow UI, REST API call
Sensors (ref)
- checks a condition and waits `poke_interval` seconds before checking again
```python
from airflow.providers.standard.sensors.python import PythonSensor  # Airflow 2: airflow.sensors.python


def _condition() -> bool:
    # return True when the condition is met, False to keep waiting
    return True


PythonSensor(
    task_id="waiting_for_condition",
    python_callable=_condition,
    poke_interval=60,            # seconds between checks
    timeout=7 * 24 * 60 * 60,    # give up after 1 week (the default)
    mode="poke",
)
```
- `timeout` and `poke_interval` are specified in seconds
- Default `timeout` is 1 week
  - Setting a meaningful `timeout` is important because the default can stall a worker
Sensor modes
- `mode="poke"` is the default
  - Live `poke` Sensors hold worker control and consume a worker slot.
    - It's easy to freeze an entire Airflow instance like this (tasks are scheduled but not started).
  - Use `poke` Sensors when the `poke_interval` is <= 5 minutes.
- `mode="reschedule"` (sketch below)
  - allows workers to do other tasks between checks
  - the Sensor Task Instance is put in the `up_for_reschedule` state between condition checks.
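A reschedule-mode variant of the earlier sensor, as a sketch (same assumed import path, placeholder condition):

```python
from airflow.providers.standard.sensors.python import PythonSensor

wait_for_condition = PythonSensor(
    task_id="wait_for_condition",
    python_callable=lambda: False,   # placeholder: False means "not ready yet"
    poke_interval=10 * 60,           # check every 10 minutes (> 5 min, so reschedule fits)
    timeout=60 * 60,                 # fail after 1 hour instead of the 1-week default
    mode="reschedule",               # worker slot is freed between checks
)
```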
Airflow Providers
Airflow Providers are third-party packages and integrations.
often include Connections, Operators, Hooks, Python modules, etc
Registry of Providers: https://registry.astronomer.io/
Best practices
- Always define a meaningful `timeout` parameter.
  - Default is seven days and can block a DAG.
- If `poke_interval` ≤ 5 minutes, set `mode="poke"`.
- Define a meaningful `poke_interval`.
Task lifecycle states (ref)
- `scheduled`: Task Instance created and waiting for a slot
- `queued`: handed to the executor; waiting for a worker
- `running`: executing on a worker
- `success`: finished successfully
- `failed`: finished with an error and retries exhausted
- `up_for_retry`: failed but will retry after `retry_delay`
- `up_for_reschedule`: Sensor in reschedule mode, sleeping until the next check
- `deferred`: deferrable operator yielded to a trigger; not using a worker slot
- `skipped`: bypassed by branching/short-circuit/trigger rules
- `upstream_failed`: did not run because upstream tasks failed and the trigger rule wasn't met
DAG Debugging
Deleting a DAG from the UI removes all run history & task instances from the metadata database and temporarily hides the DAG until it is re-parsed. It does not remove the DAG file itself.
- Always `import` using full paths starting from the `dags` folder.
  - avoid relative imports.
DAG not showing
- Wait for the UI refresh interval for new DAGs: `dag_dir_list_interval` (default 5 min).
- Wait for the UI refresh interval for modified DAGs: `min_file_process_interval` (default 30 sec).
- Ensure the `dag_id` is unique.
  - When two DAGs share the same `dag_id`, the one that's displayed will be random.
- Check if the DAG is in `.airflowignore`.
- Airflow only recognizes files with "DAG" and "airflow" inside them as DAGs.
DAG not running
- Check that the DAG is unpaused.
- Ensure `start_date` is in the past.
- Confirm `end_date` is in the future.
- Allow multiple versions to run at the same time if intended.
- Check `max_active_runs_per_dag` (defaults to 16).
- Check `max_active_tasks_per_dag` (defaults to 16).
- Set `parallelism` (max Task Instances that can run per scheduler; default 32).
Validate Airflow Connections
- Airflow UI → Admin → Connections → enter password → click `TEST`
New Changes moving from Airflow 2 to 3
- `start_date=None` is acceptable and now the default
- Logical date is when the DAG starts running:
  - runs immediately; doesn't wait for the interval to end
- `airflow db init` initializes the Metadata DB.
- `catchup=False` by default
- Airflow 2 Webserver is now the Airflow 3 API Server.
- When `CREATE_CRON_DATA_INTERVALS=True`, DAG scheduling behaves like Airflow 2.
  - Airflow 2: DAGs execute after the interval ends.
  - Airflow 3: DAGs execute at the start of the interval.
- `schedule_interval` is now named `schedule` (Scheduling API).
Certification Topics NOT covered here
- Identify the most helpful Airflow UI view for real-world scenarios
- Given a specific scenario, identify if Airflow is an applicable solution
- cron expressions
Final Certification Tips
If it’s a multi-select problem, always select more than one box.
Currently, the industry is migrating from Airflow 2 to 3, so the differences between the two versions were highlighted in this certification. They may not be in the future.
Good luck → Certification Exam: Apache Airflow 3 Fundamentals