One of the more difficult tasks within the Data Science community is not designing a model for a well-constructed business problem, or developing the code base to operate in a scalable environment, but rather arranging the tasks in the ETL or Data Science pipeline, executing the model on a periodic basis, and automating everything in between.
This is where Apache Airflow comes to the rescue! With the Airflow UI displaying tasks in graph form, and with the ability to define your workflow programmatically for better traceability, it becomes much easier to define and configure a Data Science workflow in production.
One difficulty still remains, though. There are circumstances where the same monolithic modelling process is applied to several different data sources. For better performance, it is preferable to run each of these processes concurrently rather than adding them all to the same DAG.
No problem, let us simply create a DAG for each process, all with similar tasks, and schedule them to run at the same time. But if we follow the software development principle of DRY (Don't Repeat Yourself), is there a way to create multiple different DAGs with the same type of tasks without having to write each one by hand?
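One common pattern for this is a DAG factory: a function that builds a DAG for a given data source, called in a loop and registered in the module's `globals()` so the Airflow scheduler can discover each one. The sketch below assumes Airflow 2.x; the source names and the `run_model` task are hypothetical placeholders, not taken from the linked article.

```python
# A minimal sketch of dynamic DAG generation, assuming Airflow 2.x.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_model(source: str) -> None:
    """Placeholder for the shared modelling step applied to one source."""
    print(f"Running model against {source}")


def create_dag(dag_id: str, source: str) -> DAG:
    """Build one DAG instance for a given data source."""
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="run_model",
            python_callable=run_model,
            op_kwargs={"source": source},
        )
    return dag


# One DAG per data source, all sharing the same task definitions.
for source in ["sales", "marketing", "support"]:
    dag_id = f"model_pipeline_{source}"
    # Assigning into globals() is what lets the scheduler pick up each DAG.
    globals()[dag_id] = create_dag(dag_id, source)
```

Because each generated DAG is independent, Airflow can schedule and run them concurrently, and adding a new data source becomes a one-line change to the list rather than a copy-pasted DAG file.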
To understand more about how to scale your Apache Airflow DAGs, please continue reading here: https://towardsdatascience.com/scaling-dag-creation-with-apache-airflow-a7b34ba486ac