In case you weren't already a fan of the 1986 Transformers movie, Unicron was a giant, planet-sized robot, also known as the God of Chaos.
To me the analogy is obvious. DAG schedulers like Airflow (crons) often become bloated, fragile monoliths (uni-crons). And just like that planet-eating monster, they bring all sorts of chaos to the engineers who maintain and operate them.
There have been quite a few great articles written on breaking up the Airflow monorepo, and for context I'll cover those approaches quickly. However, this alone does not defeat Unicron. In a world of increasingly decentralized data development, we need to seriously question whether a single scheduler is the right choice.
Airflow too often runs into the limits of project dependencies, multi-team collaboration, scalability, and overall complexity. It's not Airflow's fault; it's the way we use it. Luckily, there are a couple of great approaches to these problems:
1) Use multiple project repos - Airflow will deploy any DAG you put in front of it, so with a little effort you can build a deployment pipeline that ships DAGs from separate, project-specific repos into a single Airflow instance. Techniques here range from DAGFactory (good article here), to leveraging Git submodules, to simply moving files around programmatically in your pipeline.
2) Containerize your code - reduce the complexity of your Airflow project by packaging your code in separate, containerized repositories, then use the pod operator to execute those processes. Ideally Airflow becomes a pure orchestrator with very few project dependencies.
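As a concrete (and purely illustrative) sketch of the first technique, a CI step might gather each project's DAGs into namespaced subfolders of the one directory Airflow scans. The repo names and paths here are hypothetical stand-ins:

```shell
#!/usr/bin/env sh
set -eu

# Stand-in project checkouts; in a real pipeline these would be
# `git clone` (or submodule checkout) steps for each team's repo.
for repo in project_a project_b; do
  mkdir -p "$repo/dags"
  echo "# ${repo} DAG definition" > "$repo/dags/${repo}_pipeline.py"
done

# Gather each project's dags/ folder into a namespaced subfolder of
# the single directory the Airflow scheduler scans.
for repo in project_a project_b; do
  mkdir -p "airflow/dags/$repo"
  cp "$repo"/dags/*.py "airflow/dags/$repo/"
done
```

Namespacing per repo avoids filename collisions between teams, and deploys stay independent because each repo's CI only rewrites its own subfolder.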
Using both of these techniques, especially in combination, will make your Unicron less formidable, perhaps only moon-sized. In fact, for many organizations this approach, coupled with a managed Airflow environment such as AWS Managed Workflows, is a really great sweet spot.
As organizations grow, and data responsibilities become more federated, we need to ask ourselves an important question - do we really need a single scheduler? I would wholeheartedly say no; in fact, it becomes a liability.
The most obvious problem: a single point of failure. Having a single scheduler, even with resiliency measures, is dangerous. An infrastructure failure, or even a bad deployment, could cause an outage for every team. In modern architectures we avoid SPOFs wherever possible, so why create one if we don't need to?
Another issue is excessive multi-team collaboration on a single project. It's possible, especially if we mitigate with the techniques above, but not ideal. You might still run into dependency issues, and of course Git conflicts.
And then the most obvious question - what is the benefit? In my experience the majority of DAGs in an organization are self-contained. In other words, they don't use cross-DAG dependencies via External Task Sensors. And when they do, there is a good chance the upstream data product is owned and maintained by another team. So beyond observing whether that product is done or not, there is little utility in being in the same environment.
My recommendation is to have multiple Airflow environments, either at the team or application level.
My secret sauce (well, one way to accomplish this): implement a lightweight messaging layer to communicate dependencies between the multiple Airflow environments. The implementation details can vary, but here is a quick and simple approach:
- At the end of each DAG, publish to an SNS topic.
- Dependent DAGs subscribe via SQS.
- The first step in the dependent DAG is a simple poller (similar to an External Task Sensor) that iterates and sleeps until a message is received.
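The steps above can be sketched with two small helpers. This is a hedged sketch, not a drop-in implementation: in a real deployment the clients would be `boto3.client("sns")` and `boto3.client("sqs")` and these functions would run inside Airflow tasks; here they take the client as a parameter, and the function names and message shape are my own invention.

```python
import json
import time


def publish_completion(sns_client, topic_arn, dag_id):
    """Call from the final task of the upstream DAG: announce it finished."""
    message = json.dumps({"dag_id": dag_id, "status": "success"})
    return sns_client.publish(TopicArn=topic_arn, Message=message)


def wait_for_upstream(sqs_client, queue_url, expected_dag_id,
                      poll_seconds=30, timeout_seconds=3600):
    """First task of the dependent DAG: poll SQS until the upstream event arrives."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        resp = sqs_client.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])
            # SNS -> SQS delivery wraps the payload in an envelope; unwrap it.
            payload = json.loads(body["Message"]) if "Message" in body else body
            if payload.get("dag_id") == expected_dag_id:
                sqs_client.delete_message(
                    QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
                return True
        time.sleep(poll_seconds)
    raise TimeoutError(f"No completion event from {expected_dag_id}")
```

Because neither environment knows anything about the other beyond the topic and queue names, either side can be redeployed, or even swapped to a different scheduler, without breaking the contract.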
Obviously the implementation details are malleable, and SQS could be substituted with DynamoDB, Redis, or any other resilient way to notify and exchange information.
You could even have your poller run against the API of other Airflow instances, although that couples you to another project's implementation details (i.e. its specific Airflow infrastructure and DAG, rather than its data product). Perhaps that team changes the DAG that builds a specific product, replaces Airflow with Prefect, or moves to Step Functions. In general we want to design so that components can evolve independently, i.e. loose coupling.
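For illustration, polling another instance directly could lean on Airflow 2's stable REST API endpoint for listing DAG runs (`/api/v1/dags/{dag_id}/dagRuns`). The helper names and the split between URL-building and response-parsing are my own, and auth plus the actual HTTP call are deliberately omitted:

```python
import json
from urllib.parse import urlencode


def build_dag_runs_url(base_url, dag_id, limit=1):
    """URL for the most recent successful runs of a DAG on a remote Airflow 2.x."""
    query = urlencode({"state": "success",
                       "order_by": "-execution_date",
                       "limit": limit})
    return f"{base_url}/api/v1/dags/{dag_id}/dagRuns?{query}"


def latest_success_after(response_body, cutoff_iso):
    """True if any returned successful run started after the given ISO timestamp."""
    runs = json.loads(response_body).get("dag_runs", [])
    return any(run["execution_date"] > cutoff_iso for run in runs)
```

Note how much this sketch has to know about the other environment: its base URL, its DAG id, its API version. That is exactly the coupling the messaging layer avoids.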
One of my very first implementations of this concept was a simple client library named Batchy, backed by Redis and later DynamoDB. I created it long before Data Mesh was a thing, but it was guided by the same pain points highlighted above. This simple system has been in place for years, integrating multiple scheduler instances (primarily Rundeck) with few complaints and great benefit.