DEV Community: David Espejo

Understanding Flyte's Dynamic Workflows: How They Power Scalable ML Pipelines

David Espejo — Thu, 08 May 2025 10:37:15 +0000

In this post, I’ll explain how Flyte’s Dynamic Workflows work and what they mean for your ML pipelines.

Flyte Workflows define a pipeline as a Directed Acyclic Graph (DAG). This abstraction allows the representation of complex processes with branches, but without loops. Consequently, each new execution of the DAG must start from the beginning or from an intermediate point, which can impact compute efficiency—more on that in future posts.

Flyte’s workflow lifecycle has two main phases: compile time and run time.

During compile time, Flyte serializes the code, packages it, and uploads it to blob storage, making it accessible to the control plane. This process is called registration. The focus here is on speed and efficiency, primarily by performing type checking without executing the code. Instead of evaluating function inputs, Flyte simply records an “intent to execute.”

In Python, calling an async function returns a coroutine object, whereas calling a task inside a Flyte workflow returns a Promise object. Flyte uses promises as a temporary mechanism to prevent function evaluation during compile time.

Since the Promise is asynchronous, it must be awaited at runtime to be resolved and executed. This approach enables async evaluation but differs from native Python mechanisms, though that may change in future Flyte versions.
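
To make the "intent to execute" idea concrete, here is a minimal pure-Python sketch. It is not how flytekit is implemented: the `Promise` and `Recorder` classes and the `double` function are hypothetical names invented for illustration. Calling the decorated function records a node and hands back a placeholder, and the function body only runs when it is invoked explicitly later.

```python
from dataclasses import dataclass, field


@dataclass
class Promise:
    """Toy placeholder for a value that only exists at run time."""
    node_id: str


@dataclass
class Recorder:
    """Toy registry that stores calls instead of executing them."""
    nodes: list = field(default_factory=list)

    def task(self, fn):
        def wrapper(**kwargs):
            node_id = f"n{len(self.nodes)}"
            # Record the call; the function body is NOT executed here.
            self.nodes.append((node_id, fn.__name__, kwargs))
            return Promise(node_id)

        wrapper.run = fn  # keep the real function around for "run time"
        return wrapper


recorder = Recorder()
calls = []  # filled only when the real body runs


@recorder.task
def double(x: int) -> int:
    calls.append(x)
    return 2 * x


# "Compile time": we get a Promise back and nothing has been computed.
out = double(x=3)
print(type(out).__name__, recorder.nodes, calls)

# "Run time": the stored function is finally executed.
print(double.run(x=3))
```

The `calls` list stays empty until `double.run` is invoked, which mirrors the split between registering a pipeline and executing it.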

Tasks vs workflows

At compile time, workflows are evaluated, but Tasks are only lazily evaluated because we don’t need to perform computations at that point.

How so? Well, workflows are used to structure tasks, NOT to perform computations, so they can be safely materialized at registration time.

Tasks, on the other hand, DO perform computations, so to enable fast iterations over the DAG, they need to be lazily evaluated at compile time and evaluated ONLY at run time.

Dynamic workflows

Dynamic workflows sit between a Task and a Workflow. They build the structure of tasks (the DAG) as a workflow would, but they are only evaluated at runtime, just like Tasks. In other words, both the computation and the construction of the DAG happen at runtime. Their return object is a Promise, so outputs cannot be accessed directly, but you can write a task that consumes them.
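
A rough way to picture this two-step behaviour, using plain Python rather than flytekit: one function plans the graph from inputs that only exist at run time, and a second one executes that plan. `build_graph` and `run_graph` are hypothetical helpers, and the arithmetic inside `run_graph` is a placeholder for real work.

```python
def build_graph(hyperparameters: list[float]) -> list[tuple[str, float]]:
    """Toy stand-in for a dynamic workflow body: plans nodes, runs nothing."""
    nodes = []
    for hp in hyperparameters:
        # One node per runtime value; the count is unknown until inputs arrive.
        nodes.append(("train_model", hp))
    return nodes


def run_graph(nodes: list[tuple[str, float]]) -> list[float]:
    """Second step: execute the plan that build_graph produced."""
    return [round(1.0 / (1.0 + hp), 3) for _, hp in nodes]  # placeholder work


plan = build_graph([0.1, 1.0, 10.0])
print(len(plan), "nodes planned")
print(run_graph(plan))
```

The number of nodes depends entirely on the list passed in, which is why a graph like this cannot be fixed at registration time.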

Let's use an example to understand this better:

from flytekit import task, workflow, dynamic, ImageSpec
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

custom_image = ImageSpec(
    packages=["pandas",
              "scikit-learn",
              "pyarrow",],
    registry="localhost:30000"
)

@task(container_image=custom_image)
def preprocess_data() -> list[pd.DataFrame]:
    # Simulate data loading and preprocessing
    data = [{'feature1': [1, 2, 3, 4], 'feature2': [5, 6, 7, 8], 'label': [0, 1, 0, 1]}]
    dfs = [pd.DataFrame(data_dict) for data_dict in data]
    return dfs

@task(container_image=custom_image)
def train_model(data: pd.DataFrame, hyperparameter: float) -> float:
    # Split data into features and labels
    X = data[['feature1', 'feature2']]
    y = data['label']

    # Train a simple logistic regression model
    model = LogisticRegression(C=hyperparameter)
    model.fit(X, y)

    # Predict and calculate accuracy
    predictions = model.predict(X)
    accuracy = accuracy_score(y, predictions)
    return accuracy

@dynamic(container_image=custom_image)
def dynamic_workflow(hyperparameters: list[float]) -> list[float]:
    # Preprocess data
    data = preprocess_data()

    # Use a for loop to train models with different hyperparameters
    accuracies = []
    for hyperparameter in hyperparameters:
        for df in data:
            accuracy = train_model(data=df, hyperparameter=hyperparameter)
            accuracies.append(accuracy)

    return accuracies

@workflow
def ml_pipeline(hyperparameters: list[float]) -> list[float]:
    return dynamic_workflow(hyperparameters=hyperparameters)

In a nutshell, this workflow takes a list of hyperparameters whose values are only known at runtime, trains one logistic regression model per hyperparameter and DataFrame, and returns the list of resulting accuracies.

When we run it on a Flyte cluster, we can see that it involves both concurrent and parallel execution, all within the same context (same execution ID). In that regard, it behaves like a workflow, but the fact that inputs are only materialized at runtime makes it look like a task.

It’s also interesting to confirm that the DAG doesn’t build until the tasks have run:

Fun facts:

@dynamic works locally because flytekit treats it as a regular task.
When a dynamic task is executed, it generates the entire DAG as output. You will find it as the futures.pb file, meaning the workflow is yet to be executed.

When to use Dynamic Workflows?

When you want more flexibility: because inputs are only lazily evaluated at registration time, you can define them dynamically at runtime.
When you work with inputs that are unknown from the beginning or are designed to change in response to other system metrics. Hyperparameter optimization is a good example.

Combining dynamism and parallelism

If you read my previous post, you may have had a similar reaction to mine with the example I used in this post: why use a for loop when I can use MapTask?

Let's look at the following example that transforms our ML pipeline with a combination of dynamic workflows and MapTasks:

from flytekit import task, workflow, dynamic, map_task, ImageSpec
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

custom_image = ImageSpec(
    packages=["pandas",
              "scikit-learn",
              "pyarrow",],
    registry="localhost:30000"
)

@task(container_image=custom_image)
def preprocess_data() -> list[pd.DataFrame]:
    # Simulate data loading and preprocessing
    data = [{'feature1': [1, 2, 3, 4], 'feature2': [5, 6, 7, 8], 'label': [0, 1, 0, 1]}]
    dfs = [pd.DataFrame(data_dict) for data_dict in data]
    return dfs

@task(container_image=custom_image)
def train_model(data: pd.DataFrame, hyperparameter: float) -> float:
    # Split data into features and labels
    X = data[['feature1', 'feature2']]
    y = data['label']

    # Train a simple logistic regression model
    model = LogisticRegression(C=hyperparameter)
    model.fit(X, y)

    # Predict and calculate accuracy
    predictions = model.predict(X)
    accuracy = accuracy_score(y, predictions)
    return accuracy

@dynamic(container_image=custom_image)
def dynamic_workflow(hyperparameters: list[float]) -> list[float]:
    # Preprocess data
    data = preprocess_data()

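    # Use map_task to train models in parallel with different hyperparameters.
    # Assumption: map_task pairs its list inputs element-wise, so `data` and
    # `hyperparameters` are expected to have the same length.
    accuracies = map_task(train_model)(data=data, hyperparameter=hyperparameters)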

    return accuracies

@workflow
def ml_pipeline(hyperparameters: list[float]) -> list[float]:
    return dynamic_workflow(hyperparameters=hyperparameters)

You can run this example in the local Flyte sandbox cluster.

The result is a 20% faster execution, thanks to the power of parallel processing:

Conclusion

Modern use cases like LLMs, where the user defines the input at runtime via prompting, demand the level of dynamism Flyte can offer with Dynamic Workflows. This post shared some insights on how they work and how to combine them with parallelism to achieve both flexible and efficient computation.

Learn more about Flyte Dynamic Workflows in the docs

Questions? Join the Flyte Slack community!

Boost your ML pipeline performance with efficient parallelism

David Espejo — Wed, 09 Apr 2025 11:40:02 +0000

I'll show you different levels of parallelism you can achieve to improve the performance of complex Machine Learning workflows.

Parallelism in ML pipelines

Map is a functional programming pattern that applies a function to each element of a collection (e.g., a list) and returns the outputs in a collection of the same size and order.
This can be very useful in ML pipelines, where multiple transformation steps are applied iteratively to the input dataset. Say you need to preprocess a text dataset represented as a list:

raw_texts = [
    "Hello! This is a SAMPLE text with UPPERCASE words.",
    "  Extra   spaces   AND 123 numbers...  ",
    "Another EXAMPLE with Punctuation!!!"
]

You may want to define a preprocessing function:

import re


def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

Then, you need to apply this function to the input text. Doing it sequentially would likely involve writing something like a for loop. While the output is the same, this can be time-consuming for large datasets, and processing it in parallel may be preferable.

Enter the map pattern.

# Apply preprocessing to all elements in the inputs list using map
preprocessed = list(map(preprocess_text, raw_texts))

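Python's built-in map still applies the function one element at a time, but because each element is processed independently, this pattern makes the work straightforward to run in parallel. That can be beneficial in ML pipelines, where this step may happen multiple times.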
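
As a small illustration of how the map pattern carries over to concurrent execution, the sketch below hands the same function and list to `concurrent.futures.ThreadPoolExecutor.map` from the standard library. The pool size of 3 is an arbitrary choice, and for CPU-bound work a `ProcessPoolExecutor` would usually be the better fit; threads are used here only to keep the example short.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def preprocess_text(text: str) -> str:
    # Lowercase, drop anything that is not a letter or whitespace,
    # then collapse runs of whitespace.
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


raw_texts = [
    "Hello! This is a SAMPLE text with UPPERCASE words.",
    "  Extra   spaces   AND 123 numbers...  ",
    "Another EXAMPLE with Punctuation!!!",
]

# Sequential version: built-in map, one element after another.
sequential = list(map(preprocess_text, raw_texts))

# Concurrent version: same call shape, elements handled by a worker pool.
with ThreadPoolExecutor(max_workers=3) as pool:
    concurrent = list(pool.map(preprocess_text, raw_texts))

print(concurrent == sequential)
```

`pool.map` preserves input order, so both versions produce identical lists.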

But how can you do it more efficiently and distribute the load among multiple compute nodes? Enter Flyte Map Tasks.

Flyte MapTask

Flyte is a distributed computation framework that uses a Kubernetes Pod as the fundamental execution environment for each task in a pipeline. When you use MapTasks, Flyte automatically distributes the load among multiple Pods that run in parallel and limits each Pod to downloading and processing only a specific index from the inputs list, preventing inefficient duplicate data movement.
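
The "one index per Pod" idea can be pictured with a toy model in plain Python. `InputStore` is a hypothetical stand-in for remote storage that counts reads, and `run_worker` plays the role of a single Pod. None of this is Flyte code; it only shows why fetching by index avoids every worker downloading the full list.

```python
class InputStore:
    """Hypothetical blob store that counts how many elements were read."""

    def __init__(self, items):
        self._items = list(items)
        self.reads = 0

    def fetch(self, index: int) -> int:
        self.reads += 1
        return self._items[index]


def run_worker(index: int, store: InputStore, threshold: int = 11) -> bool:
    """Toy stand-in for one Pod: reads only the element at its own index."""
    data_point = store.fetch(index)
    return data_point > threshold


store = InputStore([10, 12, 11, 10, 13])
results = [run_worker(i, store) for i in range(5)]

print(results)
print(store.reads, "element reads for 5 workers")
```

With index-based access the total number of reads grows linearly with the input size, whereas each worker pulling the whole list would grow quadratically.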

But how many Pods are created?

Let’s see.

In this simple example:

from flytekit import map_task, task, workflow

threshold = 11


@task
def detect_anomalies(data_point: int) -> bool:
    return data_point > threshold


@workflow
def map_workflow(data: list[int] = [10, 12, 11, 10, 13, 12, 100, 11, 12, 10]) -> list[bool]:
    # Use the map task to apply the anomaly detection function to each data point
    return map_task(detect_anomalies)(data_point=data)

We can see it creates one Pod per element in the inputs list:

We said it’s an iterative process, so next time you have to do this, MapTasks again spin up 10 Pods. What if there are thousands or millions of inputs (like in many LLM input datasets)? What is the impact of booting up one container per element in the inputs list?

With the example above, this is the time it takes to complete the execution:

20 seconds.
If we double the list size, time starts piling up:

That is, for a 100% increase in the input dataset size, there was a 50% increase in total execution time.

What if you could reuse a specific number of Pods to mitigate this penalty?

Union Actors

This feature allows you to declare an execution environment and then reuse it through multiple executions to mitigate the impact of container boot-up times.

We define a slightly modified version of the previous example to run it in the Union platform using an input dataset with 100 items.
As you may imagine at this point, it would create 100 Pods.

To keep this workflow from becoming a "noisy neighbour" in your cluster, you can specify the number of concurrent executions and, hence, the number of Pods that exist simultaneously at any point in time. In this example, we limit this number to 10:

from flytekit import map_task
import union
import random
threshold = 31

# Declare a container image
image = union.ImageSpec(
    packages=["union==0.1.168"],
    builder="union",
)

@union.task(container_image=image)
def detect_anomalies(data_point: int) -> bool:
    return data_point > threshold

@union.workflow
def map_workflow(data: list[int] = random.sample(range(1,101), k=100)) -> list[bool]:
    # Use the map task to apply the anomaly detection function to each data point
    return map_task(detect_anomalies, concurrency=10)(data_point=data)

Execution takes 1 minute:

Now, if we define an Actors environment with 10 replicas, or 10 "reusable" Pods:

from flytekit import map_task
import union
import random

threshold = 101

image = union.ImageSpec(
    #registry="ghcr.io/davidmirror-ops/images",
    packages=["union==0.1.168"],
    builder="union",
)
# Here we define the Actors settings
actor = union.ActorEnvironment(
    name="my-actor",
    replica_count=10,
    ttl_seconds=30,
    requests=union.Resources(
        cpu="125m",
        mem="256Mi",
    ),
    container_image=image,
)

@actor.task
def detect_anomalies(data_point: int) -> bool:
    return data_point > threshold


@union.workflow
def map_workflow(data: list[int] = random.sample(range(1,101), k=100)) -> list[bool]:
    # Use the map task to apply the anomaly detection function to each data point
    return map_task(detect_anomalies)(data_point=data)

Execution time is 22 seconds:

That's a 63% decrease in execution time, enabling much higher iteration velocity and more efficient resource consumption.
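
The percentage follows directly from the two timings reported above. The sketch below checks that arithmetic and adds a deliberately simplified cost model for why reusing containers helps. The 4-second boot time and 0.5 seconds of work per item are hypothetical values chosen for illustration, not measurements, and the model ignores scheduling and image pull time.

```python
import math


def percent_decrease(before_s: float, after_s: float) -> float:
    """Relative reduction in wall-clock time, in percent."""
    return 100.0 * (before_s - after_s) / before_s


def cold_start_seconds(items: int, concurrency: int, boot_s: float, work_s: float) -> float:
    """Toy model: every item boots its own container, `concurrency` at a time."""
    waves = math.ceil(items / concurrency)
    return waves * (boot_s + work_s)


def reused_seconds(items: int, replicas: int, boot_s: float, work_s: float) -> float:
    """Toy model: each replica boots once, then handles its share back to back."""
    per_replica = math.ceil(items / replicas)
    return boot_s + per_replica * work_s


# Timings reported above: 1 minute without Actors, 22 seconds with them.
print(round(percent_decrease(60, 22)))

# Hypothetical costs, NOT measurements: 4 s boot, 0.5 s of work per item.
print(cold_start_seconds(100, 10, boot_s=4, work_s=0.5))
print(reused_seconds(100, 10, boot_s=4, work_s=0.5))
```

In this model the boot cost is paid once per replica instead of once per item, so the gap widens as boot time grows relative to the work done per item.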

Conclusion

Modern ML workloads are typically designed to process big datasets in parallel. That carries a performance penalty due to the latency of booting up a container. Flyte lets you achieve efficient parallel processing, but Union Actors take it to the next level by also removing the limitation of one Pod per input item, allowing faster and more efficient executions.

Sign up for Union's free tier
Check out the repo
Questions? Join us on Slack!