Aligning Timeouts in Distributed Orchestration: Why Equal Airflow and Spark Limits Lead to Race Conditions

#dataengineering #apacheairflow #apachespark #dataplatform

Recently, I reviewed an Airflow DAG where each task submits a single Spark job. I found this configuration:

execution_timeout_minutes: 60 
spark-job-timeout-minutes: 60

At first glance, it looks redundant. Two timeouts, same value. Why do both exist? The answer reveals something important about how Airflow and Spark interact.

Different Layers, Different Clocks

execution_timeout_minutes is the Airflow task timeout. Its clock starts when the task enters the running state and covers everything the task does: job submission to the cluster, cluster queue wait time, Spark execution, status polling, and cleanup.
spark-job-timeout-minutes is the timeout applied only to the Spark application running in the cluster. It says: "if this application runs longer than X, abort it." It does not include submission overhead, queueing time, or any work the Airflow task performs before or after Spark execution.

Key Takeaway
These are two different clocks measuring two different things, and the Airflow clock starts ticking before the Spark application even exists.

The Problem with Setting Them Equal

With a 60/60 configuration, which timeout fires first depends on timing. Because the Airflow clock starts earlier, Airflow usually hits its limit first in practice.

That is the worst-case scenario: Airflow terminates the task before Spark shuts down properly. Depending on the integration being used, the Spark job may keep running in the cluster as an orphan, consuming resources until someone notices. Orphaned jobs are a hidden cost driver in shared clusters: they hold CPU and memory, and sometimes keep autoscaled nodes alive, long after the orchestrator has given up.

The desired behavior is the opposite: Spark should hit its own timeout first, fail cleanly, and allow the Airflow task to receive that failure within its own execution window. In distributed systems, the layer responsible for the actual processing should ideally detect and terminate problematic execution first.
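
To make the race concrete, here is a minimal sketch that places both limits on a single timeline measured from the start of the Airflow task. It models a Spark job that never finishes and asks which limit fires first. The function name and the two overhead parameters are hypothetical, and the 3-minute and 1-minute overheads are assumed values for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutOutcome:
    first_to_fire: str          # "spark", "airflow", or "tie"
    airflow_deadline: float     # minutes since the Airflow task started
    spark_failure_seen: float   # minutes since task start when Airflow would see Spark's failure


def hung_job_outcome(airflow_limit, spark_limit, pre_overhead, post_overhead):
    """Model a Spark job that hangs forever.

    pre_overhead:  minutes of submission and queueing before Spark starts (assumed).
    post_overhead: minutes for polling to notice the failure and clean up (assumed).
    """
    # The Spark clock only starts after the pre-Spark overhead has elapsed.
    spark_failure_seen = pre_overhead + spark_limit + post_overhead
    if spark_failure_seen < airflow_limit:
        first = "spark"
    elif spark_failure_seen > airflow_limit:
        first = "airflow"
    else:
        first = "tie"
    return TimeoutOutcome(first, airflow_limit, spark_failure_seen)


equal = hung_job_outcome(60, 60, pre_overhead=3, post_overhead=1)
adjusted = hung_job_outcome(20, 15, pre_overhead=3, post_overhead=1)
print(equal.first_to_fire)     # airflow
print(adjusted.first_to_fire)  # spark
```

With equal limits, any nonzero overhead pushes the Spark failure past the Airflow deadline, so the orchestrator gives up first.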

A Practical Rule
execution_timeout_minutes > spark-job-timeout-minutes

The gap between them must absorb submission time, queueing, polling, and cleanup: components that typically add a few minutes even for small jobs.

Since this overhead tends to vary little within the same environment, size the gap in absolute minutes, not as a percentage of the job timeout:

Warm, fixed cluster: +5 min
Livy/REST submission with moderate queueing: +10 min
Ephemeral clusters (EMR on-demand, Databricks job clusters): +15 to 20 min
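
The rule and the margins above can be checked mechanically, for example in a CI step that validates DAG configuration. The following is only a sketch: the environment keys and the function name are hypothetical, and for ephemeral clusters it assumes the upper end of the 15 to 20 minute range.

```python
# Hypothetical environment keys; margins in minutes taken from the list above.
# For ephemeral clusters the upper end of the 15-20 minute range is assumed.
MARGIN_MINUTES = {
    "warm_fixed_cluster": 5,
    "livy_rest": 10,
    "ephemeral_cluster": 20,
}


def check_timeouts(execution_timeout_minutes, spark_job_timeout_minutes, environment):
    """Return a list of problems; an empty list means the pair respects the rule."""
    margin = MARGIN_MINUTES[environment]
    gap = execution_timeout_minutes - spark_job_timeout_minutes
    problems = []
    if gap <= 0:
        problems.append(
            "execution_timeout_minutes must be greater than spark-job-timeout-minutes"
        )
    elif gap < margin:
        problems.append(
            f"gap of {gap} min is below the {margin} min margin for {environment}"
        )
    return problems


print(check_timeouts(60, 60, "warm_fixed_cluster"))
print(check_timeouts(20, 15, "warm_fixed_cluster"))  # []
```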

The Adjustment

Looking at the execution history, this DAG usually completed in 4 to 5 minutes. The original 60-minute limits were simply inherited defensive defaults nobody had revisited.

I reduced them to:

spark-job-timeout-minutes: 15 
execution_timeout_minutes: 20

This is roughly three to four times the observed average runtime, enough to absorb normal variance and occasional spikes without masking real hangs.
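
One way to arrive at numbers like these from execution history is sketched below. This is an illustration, not necessarily how the values above were chosen: the multiplier of 3, the rounding up to 5-minute steps, the 5-minute margin, and the sample runtimes are all assumptions.

```python
import math
from statistics import mean


def suggest_timeouts(runtimes_minutes, multiplier=3, margin_minutes=5, step=5):
    """Suggest both limits from observed Spark runtimes (all defaults are assumptions).

    The Spark limit is the average runtime times `multiplier`, rounded up to the
    next multiple of `step`. The Airflow limit adds a fixed margin on top.
    """
    average = mean(runtimes_minutes)
    spark_limit = math.ceil(average * multiplier / step) * step
    return {
        "spark-job-timeout-minutes": spark_limit,
        "execution_timeout_minutes": spark_limit + margin_minutes,
    }


# Hypothetical history for a DAG that usually finishes in 4 to 5 minutes.
history = [4, 5, 4.5, 4, 5]
print(suggest_timeouts(history))
# {'spark-job-timeout-minutes': 15, 'execution_timeout_minutes': 20}
```

Using a high percentile instead of the average would be a reasonable variation if the runtime distribution has a long tail.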

Inflated timeouts do not protect anything: they only delay alerts when something is genuinely stuck.

Final Thoughts

Timeouts are not arbitrary numbers. Each exists at a different layer (the orchestrator and the execution engine) with different responsibilities.
When they are aligned correctly (with the orchestrator given some margin over the execution engine), failures become predictable.
When they are equal, you create a race that hides real problems.
And the correct value is rarely the one someone set two years ago and never reviewed again.
Timeouts are not safety nets: they are alarms. And alarms only work when they ring at the right time.
