<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ZenML</title>
    <description>The latest articles on DEV Community by ZenML (@zenml).</description>
    <link>https://dev.to/zenml</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F5030%2Fea35aa11-ac11-44ba-92c8-b1e0f99341dc.png</url>
      <title>DEV Community: ZenML</title>
      <link>https://dev.to/zenml</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zenml"/>
    <language>en</language>
    <item>
      <title>How to improve your experimentation workflows with MLflow Tracking and ZenML</title>
      <dc:creator>Alex Strick van Linschoten</dc:creator>
      <pubDate>Thu, 24 Feb 2022 15:41:46 +0000</pubDate>
      <link>https://dev.to/zenml/how-to-improve-your-experimentation-workflows-with-mlflow-tracking-and-zenml-2dhe</link>
      <guid>https://dev.to/zenml/how-to-improve-your-experimentation-workflows-with-mlflow-tracking-and-zenml-2dhe</guid>
      <description>&lt;p&gt;Most professional or so-called 'citizen' data scientists will be familiar with the scenario that sees you spending a day trying out a dozen different model training configurations in which you experiment with various hyper parameters or perhaps different pre-trained models. As evening falls, you emerge from the haze of experimentation and you ask yourself: which of my experiments offered the best results for the problem I'm trying to solve?&lt;/p&gt;

&lt;p&gt;At this point, especially for smaller use cases or where you were unsure if a hunch was worth pursuing and just wanted to try a few things out, you might be left empty-handed, unable to give an answer one way or another beyond some hunch that there &lt;em&gt;was&lt;/em&gt; one set of parameters that really performed well, if only you could remember what they were… And if someone asked you to reproduce the steps it took you to create a particular model, would you even be able to do that?&lt;/p&gt;

&lt;p&gt;This would be one of those times when it's worth reminding ourselves that data science includes the word 'science', and that we need to be careful about how we track and reason about models. The workflows and practice of machine learning are sufficiently complicated (and often non-deterministic) that we need rigorous ways of ensuring that we really are doing what we think we are doing, and that we can reproduce our work. (It's not for nothing that 'reproducibility' is &lt;a href="https://petewarden.com/2018/03/19/the-machine-learning-reproducibility-crisis/" rel="noopener noreferrer"&gt;often&lt;/a&gt; &lt;a href="https://www.technologyreview.com/2019/02/18/137357/machine-learning-is-contributing-to-a-reproducibility-crisis-within-science/" rel="noopener noreferrer"&gt;paired&lt;/a&gt; with 'crisis'.)&lt;/p&gt;

&lt;p&gt;There are manual approaches you could use to help address this problem, but they're unlikely to be sufficient. Will your spreadsheet experiment tracker really capture &lt;em&gt;everything&lt;/em&gt; you need to produce a particular model? (Think about how the particular configuration or random split of data is central to how your model performs.) What you really want is something that handles all this tracking of data and parameters as automatically as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use MLflow Tracking?
&lt;/h2&gt;

&lt;p&gt;Enter &lt;a href="https://mlflow.org/docs/latest/tracking.html" rel="noopener noreferrer"&gt;MLflow Tracking&lt;/a&gt;, part of &lt;a href="https://mlflow.org/docs/latest/concepts.html" rel="noopener noreferrer"&gt;a wider ecosystem&lt;/a&gt; of tooling offered by MLflow to help you train robust and reproducible models. Other commonly-used pieces are &lt;a href="https://mlflow.org/docs/latest/model-registry.html" rel="noopener noreferrer"&gt;the model registry&lt;/a&gt; (which stores model artifacts created during the training process) as well as a flexible suite of plugins and integrations that let you &lt;a href="https://mlflow.org/docs/latest/models.html#built-in-deployment-tools" rel="noopener noreferrer"&gt;deploy the models&lt;/a&gt; you create.&lt;/p&gt;

&lt;p&gt;MLflow Tracking is what allows you to track all those little parts of your model training workflow. Not only does it hook into an artifact store of your choosing (such as that offered by ZenML), but it also offers a really useful web UI which you can use to inspect pipeline runs and the experiments you conduct. If you want to compare the performance or accuracy of several experiments (i.e. pipeline runs), diagrams and charts are only a few clicks away. This flexible interface goes a long way towards solving some of the problems mentioned earlier.&lt;/p&gt;

&lt;p&gt;One really useful feature offered by MLflow Tracking is &lt;a href="https://mlflow.org/docs/latest/tracking.html#automatic-logging" rel="noopener noreferrer"&gt;automatic logging&lt;/a&gt;. Many commonly-used machine learning libraries (such as &lt;code&gt;scikit-learn&lt;/code&gt;, PyTorch, &lt;code&gt;fastai&lt;/code&gt; and TensorFlow / Keras) support this. You either call &lt;code&gt;mlflow.autolog()&lt;/code&gt; just before your training code, or you use a library-specific version (e.g. &lt;code&gt;mlflow.sklearn.autolog()&lt;/code&gt;). MLflow will then log metrics, parameters and models without the need for explicit log statements. (Note that you can also include the &lt;a href="https://mlflow.org/docs/latest/tracking.html#logging-data-to-runs" rel="noopener noreferrer"&gt;non-automated logging&lt;/a&gt; of whatever custom properties are important to you.)&lt;/p&gt;

&lt;h2&gt;
  
  
  ZenML + MLflow Tracking = 🚀
&lt;/h2&gt;

&lt;p&gt;If you're using ZenML to bring together the various tools in your machine learning stack, you'll probably be eager to use some of this tracking goodness and make your own experiments more robust. ZenML actually &lt;em&gt;already&lt;/em&gt; partly supported what MLflow Tracking does, in the sense that any artifacts going in or out of the steps of your ZenML pipeline were being tracked, stored and versioned in your artifact and metadata store. (You're welcome!) But until now we didn't have a great non-programmatic, visual way for you to interact with the metadata about your experiments and pipeline runs.&lt;/p&gt;

&lt;p&gt;MLflow Tracking gives you the ability to inspect your various experiments and pipeline runs in the (local) web interface, and is probably a friendlier way of interacting with and reasoning about your machine learning experiments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm695n5ozqsakpbm2vb8u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm695n5ozqsakpbm2vb8u.png" alt="Tracking machine learning training runs with MLFlow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You could have used MLflow Tracking in the past, too, but with our latest integration updates ZenML handles some of the complicated boilerplate setup that comes with using MLflow. There are &lt;a href="https://mlflow.org/docs/latest/tracking.html#where-runs-are-recorded" rel="noopener noreferrer"&gt;different ways&lt;/a&gt; of deploying the tracking infrastructure and servers, and it isn't a completely painless task to set all this up and get going with MLflow Tracking. This is where we make your life a bit easier: we set up everything you need to use it on your (currently: local) machine, connecting the MLflow Tracking interface to your ZenML artifact store. It can be tricky to configure the connections between the various modular pieces that talk to each other, and we hide this from you beneath an abstraction.&lt;/p&gt;

&lt;p&gt;We think that this ability to converse between the MLflow universe and the ZenML universe is extremely powerful, and this approach is at the heart of what we are trying to build with our tool to help you work with reproducible and robust machine learning pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Just tell me how to use it already!
&lt;/h2&gt;

&lt;p&gt;The best place to see MLflow Tracking and ZenML being used together in a simple use case is &lt;a href="https://github.com/zenml-io/zenml/tree/main/examples/mlflow_tracking" rel="noopener noreferrer"&gt;our example&lt;/a&gt; that showcases the integration. It builds on the quickstart example, but shows how you can add in MLflow to handle the tracking. In order to enable MLflow to track artifacts inside a particular step, all you need is to decorate the step with &lt;code&gt;@enable_mlflow&lt;/code&gt; and then to specify what you want logged within the step. Here you can see how this is employed in a model training step that uses the &lt;code&gt;autolog&lt;/code&gt; feature I mentioned above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Define the step and enable mlflow - order of decorators is important here
&lt;/span&gt;&lt;span class="nd"&gt;@enable_mlflow&lt;/span&gt;
&lt;span class="nd"&gt;@step&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tf_trainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TrainerConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Train a neural net from scratch to recognize MNIST digits return our
    model or the learner&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Flatten&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
            &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;losses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SparseCategoricalCrossentropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_logits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tensorflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;autolog&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# write model
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If, for any reason, you need to access the global environment parameters used by ZenML to automatically configure MLflow (which define where and how experiments and runs are displayed and stored in the MLflow Tracking UI/system), we've got you covered. These global parameters can be easily accessed through the &lt;code&gt;Environment&lt;/code&gt; singleton object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;zenml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;integrations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mlflow_environment&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MLFLOW_ENVIRONMENT_NAME&lt;/span&gt;
&lt;span class="n"&gt;mlflow_env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="n"&gt;MLFLOW_ENVIRONMENT_NAME&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check out &lt;a href="https://apidocs.zenml.io/0.6.1/api_docs/environment/" rel="noopener noreferrer"&gt;the API docs&lt;/a&gt; to learn more about the &lt;code&gt;Environment&lt;/code&gt; object and watch this space for a blog post where we explain more about why we chose to add this recently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Over to you now!
&lt;/h2&gt;

&lt;p&gt;If you're inspired by this illustration of how you can make your machine learning workflow that little bit more reproducible and robust, check out &lt;a href="https://github.com/zenml-io/zenml/tree/main/examples/mlflow_tracking" rel="noopener noreferrer"&gt;the full example&lt;/a&gt; that illustrates the integration. If you use it in your own code base, please do let us know — &lt;a href="https://zenml.io/slack-invite/" rel="noopener noreferrer"&gt;say hi on Slack&lt;/a&gt;! — and as always if you have any questions, we're here for you.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>machinelearning</category>
      <category>experimentation</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Taking on the ML pipeline challenge: why data scientists need to own their ML workflows in production</title>
      <dc:creator>Hamza Tahir</dc:creator>
      <pubDate>Mon, 06 Dec 2021 12:35:02 +0000</pubDate>
      <link>https://dev.to/zenml/taking-on-the-ml-pipeline-challenge-why-data-scientists-need-to-own-their-ml-workflows-in-production-1gla</link>
      <guid>https://dev.to/zenml/taking-on-the-ml-pipeline-challenge-why-data-scientists-need-to-own-their-ml-workflows-in-production-1gla</guid>
      <description>&lt;p&gt;This article discusses the benefits of giving data scientists ownership of their workflows in a production environment. We also discuss ZenML, an open-source framework that is built specifically to facilitate this ownership.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do we need ML pipelines?
&lt;/h2&gt;

&lt;p&gt;Let’s start with the most obvious question: Why do we need ML pipelines in the first place? Here are some solid reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ML pipelines are the best way to &lt;strong&gt;automate ML workflows&lt;/strong&gt; in a repeatable, robust and &lt;strong&gt;reproducible&lt;/strong&gt; manner.&lt;/li&gt;
&lt;li&gt;Organizations can centralize, collaborate, and manage workflows with a &lt;strong&gt;standardized&lt;/strong&gt; interface.&lt;/li&gt;
&lt;li&gt;ML pipelines can be seen as a means of &lt;strong&gt;coordination&lt;/strong&gt; between different departments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, awesome as ML pipelines are, they raise some inherent questions that can prove problematic.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ownership Dilemma
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ojmFC3Sk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1gjpjq9um0scfml5vslj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ojmFC3Sk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1gjpjq9um0scfml5vslj.png" alt="ML In Production is confusing" width="704" height="700"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One question that organizations developing machine learning need to answer is &lt;strong&gt;who owns ML pipelines in production&lt;/strong&gt;? Is it the data scientist who creates the model? Is it the data engineer who deploys it in production? Is it someone else altogether?&lt;/p&gt;

&lt;p&gt;Note: For an overview of the roles involved in the ML process, check out this part of &lt;a href="https://fall2019.fullstackdeeplearning.com/course-content/ml-teams/roles"&gt;UC Berkeley’s Full Stack Deep Learning Course&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The data dimension of ML makes it &lt;a href="https://research.google/pubs/pub46555/"&gt;significantly more complex&lt;/a&gt; than other deployments in production. More often than not, data scientists do not have the skill set to reliably guide a model into production. Therefore, it is only natural for teams to organize in such a way that there is a shift of ownership from the training phase of development to the deployment phase in the direction of the engineering department (&lt;a href="https://wiki.c2.com/?ThrownOverTheWall"&gt;thrown over the wall&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Organizing this way makes it easy enough to get a model into production &lt;strong&gt;once&lt;/strong&gt;. However, the ownership question really comes into play when things (inevitably) &lt;strong&gt;go wrong&lt;/strong&gt;. This could be when the model or concept drifts, something times out, the data format changes, or any of a million other things that the engineering department would know nothing about, as they &lt;strong&gt;didn't produce the model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Also, the farther away you push the data scientist from a production setting, the harder it is to establish a &lt;a href="https://blog.modyo.com/posts/data-flywheel-scaling-a-world-class-data-strategy"&gt;data-flywheel effect&lt;/a&gt; of consistent improvement of your models in production. The data scientist needs this wider context in order to make the right calls when training the model. For example: should we sacrifice a few percentage points of AUROC if that means the model runs faster in production and yields higher revenue? There is no chance one can even ask that question when looking only at model metrics like AUROC.&lt;/p&gt;

&lt;p&gt;Indeed, the further we push the data scientist from production, &lt;a href="https://multithreaded.stitchfix.com/blog/2019/03/11/FullStackDS-Generalists/#back-1"&gt;the less productive the overall ML process will get, and the higher the coordination costs and wait times will get&lt;/a&gt;. That's a no-go for ML, which is supposed to be a fast and iterative process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solving the Ownership Dilemma
&lt;/h2&gt;

&lt;p&gt;Recently, there has been a lot of talk about &lt;a href="https://huyenchip.com/2021/09/13/data-science-infrastructure.html"&gt;how data scientists shouldn't need to know Kubernetes&lt;/a&gt;, and how the underlying production infrastructure necessarily needs to be abstracted away from them if they are to perform at a high level. I would not only agree with this, but also posit that we have to go one step further. We not only need to abstract the infrastructure from data scientists, but also help them take &lt;strong&gt;ownership&lt;/strong&gt; of their models all the way to production.&lt;/p&gt;

&lt;p&gt;If a team leaves writing ML pipelines for later, they will &lt;a href="https://towardsdatascience.com/avoiding-technical-debt-with-ml-pipelines-3e5b6e0c1c93?source=your_stories_page-------------------------------------&amp;amp;gi=9118ab490b18"&gt;quickly accrue technical debt&lt;/a&gt;. With the right abstractions, an organization can incentivize their data scientists to start &lt;a href="https://towardsdatascience.com/why-ml-should-be-written-as-pipelines-from-the-get-go-b2d95003f998"&gt;writing end-to-end ML pipelines early in the development process&lt;/a&gt;. Once data scientists start writing their training/development workflows with easily ‘transferable’ ML pipelines, they would find themselves in a familiar environment once these same pipelines go into production. They would then have the tools necessary to also go into production systems and fix problems as they occur, or make improvements as time goes on.&lt;/p&gt;

&lt;p&gt;Ideally, the pipeline becomes a mechanism through which the producers of the models (i.e. the data scientists) can take ownership of their models all the way to production.&lt;/p&gt;

&lt;p&gt;Note that this does &lt;strong&gt;NOT&lt;/strong&gt; mean that data scientists should now know every toolkit in the engineering toolbox (like complex deployments of Kubernetes clusters). Rather, the argument is that we need to set data scientists up so that they can take their code to production themselves, with the help required from necessary tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter ZenML: A framework designed for modern MLOps
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/zenml-io/zenml"&gt;ZenML&lt;/a&gt; is an open-source MLOps Pipeline Framework built specifically to address the problems above. Let’s break it down what a MLOps Pipeline Framework means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MLOps&lt;/strong&gt;: It operates in the domain of operationalizing machine learning, i.e., putting ML workflows in production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline&lt;/strong&gt;: It does this by helping you create pipelines, i.e., a sequence of steps performed in order, in this case specifically for an ML setting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework&lt;/strong&gt;: Finally, it creates these pipelines as software with abstractions providing generic functionality, which can be selectively changed by additional user-written code.&lt;/li&gt;
&lt;/ul&gt;
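
&lt;p&gt;The 'Pipeline' and 'Framework' terms above can be made concrete with a toy, plain-Python sketch (deliberately &lt;em&gt;not&lt;/em&gt; ZenML's actual API): user-written steps supply the specifics, while the framework supplies the generic run behaviour around them.&lt;/p&gt;

```python
from typing import Callable, List

def step(fn: Callable) -> Callable:
    """Mark a function as a pipeline step (toy stand-in for a framework decorator)."""
    fn.is_step = True
    return fn

class Pipeline:
    """A pipeline is an ordered sequence of steps; the framework owns the
    generic functionality (running, and in a real tool: caching, tracking)."""
    def __init__(self, steps: List[Callable]):
        self.steps = steps

    def run(self, data):
        # Feed each step's output into the next step.
        for s in self.steps:
            data = s(data)
        return data

@step
def ingest(_):
    return [1.0, 2.0, 3.0]

@step
def train(values):
    # The "model" here is just the mean of the ingested values.
    return sum(values) / len(values)

result = Pipeline([ingest, train]).run(None)  # → 2.0
```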

&lt;p&gt;So it's a tool that lets you define pipelines, but how is it different from the others? Here is what sets it apart:&lt;/p&gt;

&lt;h3&gt;
  
  
  Accommodating the Exploding ML Tooling Landscape
&lt;/h3&gt;

&lt;p&gt;Everybody knows that we are now &lt;a href="https://huyenchip.com/2020/06/22/mlops.html"&gt;in the midst of an explosion in the ML/MLOps tooling landscape&lt;/a&gt;. ZenML is explicitly designed to have no opinions about the underlying infrastructure or tooling that you would like to use. Rather, it exposes higher-level concepts like Metadata Stores, Artifact Stores, and Orchestrators that share common interfaces. An ML team can then swap out individual backend components of their pipelines and everything will ‘just work’.&lt;/p&gt;
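
&lt;p&gt;The idea of common interfaces with swappable backends can be sketched in plain Python (a toy illustration, not ZenML's actual classes): pipeline code talks only to the interface, so exchanging one backend for another requires no pipeline changes.&lt;/p&gt;

```python
from abc import ABC, abstractmethod

class ArtifactStore(ABC):
    """Common interface; concrete backends (local disk, S3, GCS, ...) plug in."""
    @abstractmethod
    def write(self, name: str, blob: bytes) -> None: ...

    @abstractmethod
    def read(self, name: str) -> bytes: ...

class InMemoryArtifactStore(ArtifactStore):
    """A trivial backend; a cloud-bucket backend would implement the same two methods."""
    def __init__(self):
        self._blobs = {}

    def write(self, name: str, blob: bytes) -> None:
        self._blobs[name] = blob

    def read(self, name: str) -> bytes:
        return self._blobs[name]

def run_pipeline(store: ArtifactStore) -> bytes:
    # The pipeline only sees the interface, never the concrete backend.
    store.write("model", b"weights")
    return store.read("model")

artifact = run_pipeline(InMemoryArtifactStore())
```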

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bCeAwOJr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3629woe2qow92mt4einq.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bCeAwOJr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3629woe2qow92mt4einq.jpeg" alt="Look closer: This isn’t the Hidden Technical Debt Diagram ;-)" width="880" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So if you want to use &lt;a href="https://mlflow.org/"&gt;MLflow&lt;/a&gt; to track your experiments, run the pipeline on &lt;a href="https://airflow.apache.org/"&gt;Airflow&lt;/a&gt;, and then deploy a model to a &lt;a href="https://neptune.ai/"&gt;Neptune&lt;/a&gt; model registry, ZenML will facilitate this MLOps stack for you. This decision can be made jointly by the data scientists and engineers. As ZenML is a framework, custom pieces of the puzzle can also be added to accommodate legacy infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Focus on Machine Learning Workflows (at every phase)
&lt;/h3&gt;

&lt;p&gt;There are many tools that let you define workflows as pipelines but few that focus &lt;strong&gt;explicitly on machine learning use-cases&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="p"&gt;...&lt;/span&gt; 
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As an example, in the above code, ZenML will understand that this is not just a step in the pipeline but a trainer step. It can use that information to aid ML-specific use cases such as storing the result in a model registry, running this step on GPUs, hyperparameter tuning, and so on.&lt;/p&gt;

&lt;p&gt;Also notice that ZenML fully supports objects from common ML frameworks like &lt;code&gt;torch.Dataset&lt;/code&gt; and &lt;code&gt;torch.nn.Module&lt;/code&gt;. These objects can be passed between steps, and their results cached to enable faster experimentation.&lt;/p&gt;
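
&lt;p&gt;The caching idea can be sketched in plain Python (a toy stand-in, not ZenML's actual caching mechanism): step outputs are keyed by the step name and a hash of its inputs, so re-running a step with identical inputs skips recomputation.&lt;/p&gt;

```python
import hashlib
import pickle

_cache = {}

def cached_step(fn):
    """Return a wrapper that memoizes the step's output by (name, input-hash)."""
    def wrapper(*args):
        key = (fn.__name__, hashlib.sha256(pickle.dumps(args)).hexdigest())
        if key not in _cache:
            _cache[key] = fn(*args)
        return _cache[key]
    return wrapper

calls = {"n": 0}

@cached_step
def preprocess(data):
    calls["n"] += 1  # count how often the step body actually executes
    return [x * 2 for x in data]

preprocess((1, 2, 3))
preprocess((1, 2, 3))  # identical inputs: served from the cache, body not re-run
```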

&lt;h3&gt;
  
  
  Incentivizing the data scientist to write these pipelines
&lt;/h3&gt;

&lt;p&gt;ZenML understands that pipelines change over time. Therefore, it encourages you to run these pipelines locally to begin with and experiment with the results as they are produced. You can query pipelines in a local Jupyter notebook and materialize their results with pre-made visualizations such as statistics overviews and schema anomalies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AJvHScpc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jib8x5dymzrb0njmbsad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AJvHScpc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jib8x5dymzrb0njmbsad.png" alt="After running pipelines, one can fetch them and see results easily, no matter if run locally or not" width="685" height="643"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a different approach to pipeline development, and one more representative of how a data scientist would like to work in the earlier phases of a project: fast iterations and visualizations that help them make informed decisions about experiments. We call this approach &lt;strong&gt;Pipelines As Experiments (PaE)&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;In short, ZenML allows you to create automated ML workflows with simple, extensible abstractions that take the burden of these common patterns off of you. In doing so, it remains unopinionated about the underlying infrastructure and aims to be cloud- and tooling-agnostic.&lt;/p&gt;

&lt;p&gt;By helping the target audience, i.e. the data scientist, to write their code in ZenML pipelines &lt;strong&gt;early in the development lifecycle&lt;/strong&gt;, the transition from the experimentation phase to the production phase is made much easier. The goal is that by the time the experimentation phase of the ML lifecycle is over, the data scientists can flip a switch to the production ML stack and get their pipelines running in production. At this moment, they would have complete ownership of these pipelines and can manage, update, and debug them as they please.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Story So Far
&lt;/h2&gt;

&lt;p&gt;To date, ZenML has received:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1336 &lt;a href="https://github.com/zenml-io/zenml"&gt;GitHub Stars&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;10 &lt;a href="https://github.com/zenml-io/zenml/graphs/contributors"&gt;Contributors&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;~200 &lt;a href="https://zenml.io/slack-invite"&gt;Slack&lt;/a&gt; Members&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It has been fantastic to see people’s interest in what was initially just a simple idea. We’re now building &lt;a href="https://zenml.io/newsletter"&gt;ZenML out in the open (no stealth here)&lt;/a&gt;, and just made &lt;a href="https://github.com/zenml-io/zenml/releases"&gt;a major release&lt;/a&gt; with a complete refactor of the codebase. So, if any of the above appealed to you, it would be lovely if you gave ZenML a spin with an &lt;a href="https://docs.zenml.io/guides/low-level-api"&gt;end-to-end example of deploying a pipeline in production&lt;/a&gt;. Feedback and contributions are welcome!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;My sincerest thanks to Alex Strick and Adam Probst for helping edit this article.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>pipelines</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Why ML should be written as pipelines from the get-go</title>
      <dc:creator>Hamza Tahir</dc:creator>
      <pubDate>Mon, 06 Dec 2021 12:31:51 +0000</pubDate>
      <link>https://dev.to/zenml/why-ml-should-be-written-as-pipelines-from-the-get-go-22m2</link>
      <guid>https://dev.to/zenml/why-ml-should-be-written-as-pipelines-from-the-get-go-22m2</guid>
      <description>&lt;p&gt;Today, Machine Learning powers the top 1% of the most valuable organizations in the world (FB, ALPH, AMZ, N etc). However, 99% of enterprises struggle to productionalize ML, even with the possession of hyper-specific datasets and exceptional data science departments.&lt;/p&gt;

&lt;p&gt;Going one layer further into how ML propagates through an organization reveals the problem in more depth. The graphic below shows an admittedly simplified representation of a typical setup for machine learning:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7nSKKCVz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2nszg1xyuzv1fqjdoc0c.png" alt="Why it’s hard to reproduce ML models" width="700" height="393"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Figure 1: Why it’s hard to reproduce ML models&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
There are three stages to the above process:&lt;/p&gt;

&lt;h2&gt;
  
  
  Experimenting &amp;amp; PoCs:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Technologies&lt;/strong&gt;: Jupyter notebooks, Python scripts, experiment tracking tools, data exploration tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persona&lt;/strong&gt;: Data scientists&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Description&lt;/strong&gt;: Quick and scientific experiments define this phase. The team wants to increase their understanding of the data and machine learning objective as rapidly as possible.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conversion:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Technologies&lt;/strong&gt;: ETL pipelining tools such as Airflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persona&lt;/strong&gt;: Data Engineers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Description&lt;/strong&gt;: Converting finalized experiments into automated, repeatable processes is the aim of this code. Sometimes this starts before the next phase, sometimes after, but the essence is the same: take the code from the data scientists and try to put it into some form of automated framework.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Productionalization &amp;amp; Maintenance:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Technologies&lt;/strong&gt;: Flask/FastAPI, Kubernetes, Docker, &lt;a href="http://cortex.dev/"&gt;Cortex&lt;/a&gt;, &lt;a href="https://www.seldon.io/"&gt;Seldon&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persona&lt;/strong&gt;: ML Engineers / Ops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Description&lt;/strong&gt;: This is the phase that starts at the deployment of the model, and spans monitoring, retraining, and maintenance. The core focus of this phase is to keep the model healthy and serving at any scale, all the while accounting for drift.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these stages requires different skills, tooling, and organization. Therefore, it is only natural that there are many potholes an organization can run into along the way. Inevitably, things that matter downstream are not accounted for in the earlier stages. For example, if training happens in isolation from the deployment strategy, it will never translate well to production scenarios, leading to inconsistencies, silent failures, and eventually failed model deployments.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Solution
&lt;/h1&gt;

&lt;p&gt;Looking at the above multi-phase process in Figure 1, it seems like a no-brainer to simply reduce the steps involved and thereby eliminate the friction between them. However, given the different requirements and skillsets at each step, this is easier said than done. Data scientists are not trained or equipped to care about production concepts such as reproducibility; they are &lt;strong&gt;trained to iterate and experiment&lt;/strong&gt;. They don’t usually prioritize code quality, and at an early stage it is probably not in the company’s best interest to be strict about enforcing such standards, given the trade-off between speed and overhead.&lt;/p&gt;

&lt;p&gt;Therefore, what is required is a framework that is &lt;strong&gt;flexible but enforces production standards&lt;/strong&gt; from the get-go. A very natural way of implementing this is via some form of pipeline framework that exposes an automated, standardized way to run ML experiments in a controlled environment. ML is inherently a process that can be broken down into individual, concrete steps (e.g. preprocessing, training, evaluating, etc.), so a pipeline is a good fit here. Critically, by standardizing the development of these pipelines at the early stages, organizations can break the cycle of destroying and recreating ML models across multiple tools and stages, and speed the path from research to deployment.&lt;/p&gt;

&lt;p&gt;If an organization can incentivize their data scientists to buy into such a framework, &lt;strong&gt;then they have won half the battle of productionalization&lt;/strong&gt;. However, the devil is really in the details — how do you give data scientists the flexibility they need for experimentation in a framework that is robust enough to be taken all the way to production?&lt;/p&gt;

&lt;h1&gt;
  
  
  An exercise in finding the right abstractions
&lt;/h1&gt;

&lt;p&gt;Having made the case for writing pipelines from the get-go, it is only fair that I give concrete examples of frameworks that achieve this. However, in my opinion, the current tooling landscape is split into ML tools for ML people and Ops tools for Ops people, neither of which ticks all the boxes I mentioned in the last section. What is missing is an Ops (read: pipelines) tool for ML people, with &lt;strong&gt;higher-order abstractions at the right level for a data scientist&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In order to understand why this is important, we can cast an eye towards how web development has matured from raw PHP/jQuery-based scripts (the Jupyter notebooks of web development) with the LAMP stack to the powerful React/Angular/Vue-based modern web development stacks of today. Looking at these modern frameworks, their success has been dictated by providing higher-order abstractions that are easier to consume and digest for a larger audience. They did not change the fundamentals of how the underlying web technology worked. They simply re-purposed it in a way that is understandable and accessible to a larger audience. Specifically, by providing components as first-class citizens, these frameworks have ushered in a new mechanism of breaking down, utilizing, and resharing the HTML and Javascript that powers the modern web. However, ML(Ops) does not have an equivalent movement to figure out the right order of abstraction to have a similar effect.&lt;/p&gt;

&lt;p&gt;To showcase a more concrete example of my more abstract thoughts above, I’ll use &lt;a href="https://github.com/maiot-io/zenml"&gt;ZenML&lt;/a&gt;, an open-source MLOps framework to create iterative, reproducible pipelines.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Disclaimer: I am one of the core maintainers of ZenML.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://github.com/maiot-io/zenml"&gt;ZenML&lt;/a&gt; is an exercise in finding the right layer of abstraction for ML. Here, we treat pipelines as first-class citizens. This means that data scientists are exposed to pipelines directly in the framework, but not in the same manner as the data pipelines from the ETL space (&lt;a href="https://www.prefect.io/"&gt;Prefect&lt;/a&gt;, &lt;a href="https://airflow.apache.org/"&gt;Airflow&lt;/a&gt; et al.). Pipelines are treated as experiments — meaning they can be compared and analyzed directly. Only when it is time to flip over to productionalization, can they be converted to classical data pipelines.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JErD6H63--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e7zozf732n2foq9lygu3.png" alt="ZenML abstract pipelines with familiar language to increase ownership of model deployments" width="700" height="393"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Figure 2: ZenML abstract pipelines with familiar language to increase ownership of model deployments.&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
Within pipelines are steps, which are abstracted in ML language familiar to the data scientist: there is a &lt;code&gt;TokenizerStep&lt;/code&gt;, a &lt;code&gt;TrainerStep&lt;/code&gt;, an &lt;code&gt;EvaluatorStep&lt;/code&gt;, and so on. These paradigms are far more understandable than plugging scripts into some form of orchestrator wrapper.&lt;/p&gt;
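&lt;p&gt;To make the step abstraction concrete, here is a minimal sketch in plain Python. The class names and interfaces below are illustrative only, not the actual ZenML API: the point is simply that each step is a named, reusable unit, and the pipeline records what happened as data flows through.&lt;/p&gt;

```python
# Illustrative sketch of a step-based pipeline (NOT the real ZenML API):
# each step is a named, reusable unit; the pipeline runs them in order
# and records what each step produced.

class Step:
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def run(self, data):
        return self.fn(data)


class Pipeline:
    def __init__(self, steps):
        self.steps = steps

    def run(self, data):
        log = []  # per-run record: which step ran, what it produced
        for step in self.steps:
            data = step.run(data)
            log.append((step.name, type(data).__name__))
        return data, log


# Steps named in familiar ML language, as described above
tokenizer = Step("TokenizerStep", lambda text: text.split())
trainer = Step("TrainerStep", lambda tokens: {"vocab_size": len(set(tokens))})
evaluator = Step("EvaluatorStep", lambda model: model["vocab_size"])

pipeline = Pipeline([tokenizer, trainer, evaluator])
result, log = pipeline.run("the quick brown fox the")
```

&lt;p&gt;The per-run log is what makes two runs directly comparable, which is the essence of treating pipelines as experiments.&lt;/p&gt;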

&lt;p&gt;Each pipeline run tracks its metadata and parameters and can be compared to other runs. The data for each pipeline is automatically versioned and tracked as it flows through. Each run is linked to a git commit and compiled into an easy-to-read YAML file, which can optionally be compiled to other DSLs such as Airflow or Kubeflow Pipelines. This is necessary to satisfy the other stakeholders in the value chain, such as data engineers and ML engineers.&lt;/p&gt;

&lt;p&gt;Additionally, the interfaces exposed for individual steps are designed to be easy to extend in an idempotent, and therefore distributable, manner. Data scientists can thus scale out to different processing backends (like Dataflow or Spark) when dealing with larger datasets.&lt;br&gt;
All in all, ZenML is trying to get to the following scenario:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MZYrhvuy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/voux52y49jvib4odcigg.png" alt="Figure 3: ZenML unifies the ML process." width="700" height="393"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Figure 3: ZenML unifies the ML process.&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
Of course, &lt;a href="https://github.com/maiot-io/zenml"&gt;ZenML&lt;/a&gt; is not the only way to achieve the above: many companies build their own home-grown abstraction frameworks for their specific needs, often on top of some of the other tools mentioned above. Regardless of how you get there, the goal should be clear: get data scientists &lt;strong&gt;as close to production as possible&lt;/strong&gt; with as little friction as possible, incentivizing them to increase their ownership of the models after deployment.&lt;/p&gt;

&lt;p&gt;This is a win-win-win for every persona involved, and ultimately a big win for any organization that aims to make it to the top 1% using ML as a core driver for their business growth.&lt;/p&gt;

&lt;h1&gt;
  
  
  Plug
&lt;/h1&gt;

&lt;p&gt;If you like the thoughts here, we’d love to hear your feedback on ZenML. It is &lt;a href="https://github.com/maiot-io/zenml"&gt;open-source&lt;/a&gt; and we are looking for early adopters and &lt;a href="https://github.com/maiot-io/zenml"&gt;contributors&lt;/a&gt;! And if you find it is the right order of abstraction for you/your data scientists, then let us know as well via &lt;a href="http://zenml.io/slack-invite"&gt;our Slack&lt;/a&gt; — looking forward to hearing from you!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Spot the difference in ML costs</title>
      <dc:creator>Hamza Tahir</dc:creator>
      <pubDate>Mon, 06 Dec 2021 12:28:00 +0000</pubDate>
      <link>https://dev.to/zenml/spot-the-difference-in-ml-costs-4e1d</link>
      <guid>https://dev.to/zenml/spot-the-difference-in-ml-costs-4e1d</guid>
      <description>&lt;p&gt;Every organization at any scale understands that leveraging the public cloud is a trade-off between convenience and cost. While cloud providers like Google, Amazon and Microsoft have immensely reduced the barrier of entry for machine learning, GPU costs are still at a premium.&lt;/p&gt;

&lt;p&gt;There is an &lt;a href="https://venturebeat.com/2020/06/01/ai-machine-learning-openai-gpt-3-size-isnt-everything/"&gt;increasing fear in the machine learning community&lt;/a&gt; that the true power of machine learning is still within the hands of the few. The flagship example of this is OpenAI's massive GPT-3 model containing 175 billion parameters, a memory footprint of 350GB and reportedly costing at least &lt;a href="https://lambdalabs.com/blog/demystifying-gpt-3/"&gt;$4.6 million&lt;/a&gt; to train. The trend also looks set to continue: Rumours consistently circulate regarding the next generation GPT-4's size, with some estimates ranging in the order of &lt;a href="https://www.metaculus.com/questions/4852/how-many-parameters-will-gpt-4-have-if-it-is-released-in-billions-of-parameters/"&gt;&lt;strong&gt;trillions&lt;/strong&gt; of parameters&lt;/a&gt;. Even with more efficient training techniques, these models will still cost in the order of millions to train.&lt;/p&gt;

&lt;p&gt;For the rest of the ML community, there is now an increasing reliance on their secret weapon: Transfer learning. Just recently, the excellent &lt;a href="https://huggingface.co/"&gt;HuggingFace&lt;/a&gt; library &lt;a href="https://twitter.com/huggingface/status/1351560093658198022"&gt;announced a simple method&lt;/a&gt; to fine-tune large-scale parameter models on a single cloud GPU. This gives hope to ML practitioners that even if they are unable to train models from scratch, utilizing the immense power of modern-day machine learning models is still within reach.&lt;/p&gt;

&lt;h2&gt;
  
  
  The need for the cloud
&lt;/h2&gt;

&lt;p&gt;Whether training from scratch or fine-tuning, it is clear that the public cloud providers offer the most convenient path to provision and utilize compute resources for most ML practitioners out there. However, even for fine-tuning tasks or smaller models, these costs can quickly grow and become unmanageable. For example, here is a simple breakdown of how much it costs to train a machine learning model on a relatively mid-tier configuration suitable for many machine learning tasks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Provider | Configuration           | Cost ($/h)    |
|----------|-------------------------|---------------|
| GCP      | n1-standard-16 with P4  | 1.36          |
| Azure    | NV6 with M60            | 1.14          |
| AWS      | g3.4xlarge with M60     | 1.14          |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bear in mind that the above costs are just for one training run. Most machine learning projects go through many experimentation phases, and these numbers add up quickly. Therefore, most ML teams who do not have vast budgets on their hands usually resort to sampling their datasets and portioning out big training runs for when they are sure. This can be slow and tedious, not to mention hard to coordinate. It can also lead teams to converge on the wrong results: if the smaller, sampled datasets are not representative of the full dataset, models can produce frustrating, diverging results as they develop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Spot instances: A perfect fit for ML experimentation
&lt;/h2&gt;

&lt;p&gt;Would it not be nice if ML practitioners had the luxury of launching experiments without fretting so much about costs exploding over time? There might just be a solution, offered by all major cloud providers, and severely underutilized by the machine learning community: &lt;strong&gt;Preemptible/Spot instances.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;“Preemptible instance” is largely a Google Cloud Platform term, while “spot instance” is used by AWS and Azure. Whatever you call it, the concept is the same: these instances cost a fraction of the price of normal instances, and the only catch is that there is no guarantee the instance stays up all the time. Usually, this means the provider shuts the instance down within 24 hours.&lt;/p&gt;

&lt;p&gt;These sorts of instances are a mechanism for the cloud providers to maximize the utilization of all their resources at any given time. They are intended for batched, non-critical workloads. Most training jobs for the majority of use-cases out there take less than 24 hours to complete. And even if a job is interrupted before it finishes, it can almost always be restarted from a checkpoint.&lt;/p&gt;
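&lt;p&gt;In practice, making a training job preemption-safe mostly means checkpointing periodically and resuming from the last checkpoint on restart. Here is a minimal sketch; the checkpoint file name and the &amp;quot;training&amp;quot; itself are stand-ins, not part of any particular framework:&lt;/p&gt;

```python
import json
import os

CKPT = "checkpoint.json"  # hypothetical checkpoint path

def load_checkpoint():
    # Resume from whatever an earlier (possibly interrupted) run left behind
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"epoch": 0, "weights": 0.0}

def save_checkpoint(state):
    # Write to a temp file and rename, so a preemption mid-write
    # cannot leave a corrupt checkpoint behind
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CKPT)

def train(total_epochs=10):
    state = load_checkpoint()
    for epoch in range(state["epoch"], total_epochs):
        state["weights"] += 0.1  # stand-in for one epoch of real training
        state["epoch"] = epoch + 1
        save_checkpoint(state)   # cheap insurance against interruption
    return state

final = train()
```

&lt;p&gt;If the instance is reclaimed at epoch 7, the next run starts at epoch 7 instead of epoch 0, which is what makes spot pricing workable for training.&lt;/p&gt;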

&lt;h2&gt;
  
  
  Cost comparison: 80% cost reduction
&lt;/h2&gt;

&lt;p&gt;Therefore, machine learning training fits the intended use of spot instances perfectly. By using these instances, practitioners stand to gain massive cost reductions. We have conducted a rough analysis across the three major cloud providers to showcase the cost benefits. The raw data can be found &lt;a href="https://docs.google.com/spreadsheets/d/1wErQviA3sI22fh3BscO4CMJyg6w1Qqi468O1bCxUFhc/edit?usp=sharing"&gt;here&lt;/a&gt;. Feel free to share the doc and leave a comment if you find something to add. Here is a snapshot, with the same configurations as before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Provider | Configuration           | Cost ($/h) | Spot Cost ($/h) | Savings |
|----------|-------------------------|------------|-----------------|---------|
| GCP      | n1-standard-16 with P4  | 1.36       | 0.38            | 72%     |
| Azure    | NV6 with M60            | 1.14       | 0.20            | 82%     |
| AWS      | g3.4xlarge with M60     | 1.14       | 0.34            | 70%     |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: All costs in the US region, AWS instance pricing as of January 28, 14:00 CET.&lt;/p&gt;

&lt;p&gt;As can be seen, depending on the configuration, spot instances offer up to an &lt;strong&gt;82%&lt;/strong&gt; cost reduction, with the average savings across multiple clouds and configurations being roughly &lt;strong&gt;74%&lt;/strong&gt;. This can equate to hundreds of dollars’ worth of savings. Especially for hobbyists, smaller companies, or smaller departments experimenting with machine learning, this may mean the difference between getting a model deployed and crashing and burning before lift-off.&lt;/p&gt;
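&lt;p&gt;The savings column follows directly from the on-demand and spot rates in the table above; a quick sanity check of the arithmetic:&lt;/p&gt;

```python
# On-demand and spot hourly rates from the table above
rates = {
    "GCP": (1.36, 0.38),
    "Azure": (1.14, 0.20),
    "AWS": (1.14, 0.34),
}

# Savings = (on-demand - spot) / on-demand, as a rounded percentage
savings = {
    provider: round(100 * (on_demand - spot) / on_demand)
    for provider, (on_demand, spot) in rates.items()
}
average = sum(savings.values()) / len(savings)  # roughly 74%
```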

&lt;p&gt;This technique is not new: way back in 2018, the FastAI team &lt;a href="https://www.fast.ai/2018/08/10/fastai-diu-imagenet/"&gt;trained ImageNet in 18 minutes with 16 AWS spot instances&lt;/a&gt;. This cost $40 at the time and was the most public demonstration of the staggering cost benefits of spot instances in the community.&lt;/p&gt;

&lt;p&gt;However, given the trend of increasingly big models, and the increasing adoption of AI worldwide, I can only see the need for spot instance training increasing over time. Given the dramatic difference in costs, it is almost a no-brainer to use spot instance training as a primary mechanism for training, at least in the experimentation phase.&lt;/p&gt;

&lt;h2&gt;
  
  
  ZenML: A simple tool to orchestrate spot instance training
&lt;/h2&gt;

&lt;p&gt;If you're looking for a head start with spot instance training, check out &lt;a href="https://github.com/zenml-io/zenml"&gt;ZenML&lt;/a&gt;, an open-source MLOps framework for reproducible machine learning. Running a spot-instance pipeline in ZenML is as easy as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;training_pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;OrchestratorGCPBackend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;preemptible&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# reduce costs by using preemptible (spot) instances
&lt;/span&gt;        &lt;span class="n"&gt;machine_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'n1-standard-4'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'nvidia-tesla-k80'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;gpu_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://gist.github.com/htahir1/62dc4baa12560e8b88ce156f76aaab5f"&gt;See gist here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ZenML not only zips your code up to the instance, it also makes sure the right CUDA drivers are enabled so you can take advantage of the accelerator of your choice. It provisions the instance and spins it down when the pipeline is done. Not to mention the other benefits of experiment tracking, versioning, and metadata management, which the framework provides anyway. Give it a spin yourself: a full code example can be found &lt;a href="https://github.com/zenml-io/zenml/tree/main/examples"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;AWS and Azure support is on the horizon, and we'd love your feedback on the current setup. If you like what you see, leave us a &lt;a href="https://github.com/zenml-io/zenml"&gt;star at the GitHub repo&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Is your Machine Learning Reproducible?</title>
      <dc:creator>Hamza Tahir</dc:creator>
      <pubDate>Mon, 06 Dec 2021 12:26:19 +0000</pubDate>
      <link>https://dev.to/zenml/is-your-machine-learning-reproducible-jfk</link>
      <guid>https://dev.to/zenml/is-your-machine-learning-reproducible-jfk</guid>
      <description>&lt;p&gt;It is now widely agreed that &lt;a href="https://blog.ml.cmu.edu/2020/08/31/5-reproducibility/"&gt;reproducibility is an important aspect of any scientific endeavor&lt;/a&gt;. With Machine Learning being a scientific discipline, as well as an engineering one, reproducibility is equally important here.&lt;/p&gt;

&lt;p&gt;There is widespread fear in the ML community that we are living through a &lt;a href="https://www.wired.com/story/artificial-intelligence-confronts-reproducibility-crisis"&gt;reproducibility crisis&lt;/a&gt;. Efforts like the &lt;a href="https://paperswithcode.com/rc2020"&gt;Papers with Code Reproducibility Challenge&lt;/a&gt;, signaled a clear call-to-action for practitioners, after a 2016 Nature survey revealed that &lt;a href="https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970"&gt;70% of results are non-reproducible&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While a lot of the talk amongst the community has centered on &lt;a href="https://www.sciencedirect.com/science/article/pii/S2666389920300933"&gt;reproducing machine learning results in research&lt;/a&gt;, there has been less focus on the production side of things. Therefore, today let’s focus more on the topic of reproducible ML in production and create a larger conversation around it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is reproducibility important?
&lt;/h2&gt;

&lt;p&gt;“If you can’t repeat it, you can’t trust it” - All Ops Teams&lt;/p&gt;

&lt;p&gt;A good question to start with is why exactly reproducibility is important, for machine learning in particular. Here is a list of benefits one gains by ensuring reproducibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increases &lt;strong&gt;trust&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Promotes &lt;strong&gt;explainability&lt;/strong&gt; of ML results&lt;/li&gt;
&lt;li&gt;Increases &lt;strong&gt;reliability&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Fulfills &lt;strong&gt;ethical&lt;/strong&gt;, &lt;strong&gt;legal&lt;/strong&gt;, and &lt;strong&gt;regulatory&lt;/strong&gt; requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Concretely, ML models tend to go through a lifecycle of being destroyed, forged anew and re-created as &lt;a href="https://blog.zenml.io/technical_debt/"&gt;development evolves from rudimentary notebook snippets to a testable, productionized codebase&lt;/a&gt;. Therefore, we better make sure that every time a model is (re-) trained, the results are what we expect them to be.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's the big deal?
&lt;/h2&gt;

&lt;p&gt;One would think that reproducibility in production ML should be easy. After all, most machine learning is scripting. How hard can it be to simply execute a bunch of scripts again at a later stage and come to the exact same result, right?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Reproducibility of machine learning is hard because it spans many different disciplines, from understanding non-deterministic algorithmic behaviors, to software engineering best practices. Leaving aside the fact that most machine learning code quality tends to err towards the low side (due to the experimental nature of the work), there is an inherent complexity to ML which makes things even harder.&lt;/p&gt;

&lt;p&gt;For example, just training a model on the same data with the same configuration does not mean the same model is produced. Perhaps one could achieve a similar &lt;em&gt;overall&lt;/em&gt; accuracy (or whatever other metric), but even a slight change in parameters might skew metrics for slices of your data, leading to &lt;a href="https://www.theverge.com/2018/1/12/16882408/google-racist-gorillas-photo-recognition-algorithm-ai"&gt;sometimes very unpleasant results&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So, how can we ensure that things like this do not happen? In my opinion, one can break reproducibility down into the following aspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The code&lt;/li&gt;
&lt;li&gt;The configuration&lt;/li&gt;
&lt;li&gt;The environment&lt;/li&gt;
&lt;li&gt;The data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's look at each of these in turn.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code
&lt;/h3&gt;

&lt;p&gt;Checking code into a version control system like Git ensures a clean trace of how code evolves, and the ability to rollback to any point in history. However, Git alone is not a fix for reproducibility, but only for one aspect of it.&lt;/p&gt;

&lt;p&gt;In reality, reproducibility in production is achieved through version control, testing of code as well as integrations, and idempotent deployment automation. This is hard to apply in practice. For example, the main tool for ML is the Jupyter notebook, which is notoriously difficult to check into version control. Even worse, most notebook code is not executed sequentially and can have an arbitrary, impossible-to-reproduce execution order.&lt;/p&gt;

&lt;p&gt;But even if ML practitioners follow a pattern of refactoring their code into separate modules, simply checking modules into source control is still not enough to ensure reproducibility. One also needs to link the commit history to model training runs and models. This can be achieved, for example, by enforcing a standard in your team that pins a git SHA to every experiment run. That way there is a globally unique ID that ties the code and configuration (see below) to the results they produced.&lt;/p&gt;
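&lt;p&gt;Such a standard can be as simple as capturing the current commit at training time and writing it into the run record. A sketch of the idea (the record format here is hypothetical, and &lt;code&gt;current_git_sha&lt;/code&gt; assumes the code runs inside a git checkout):&lt;/p&gt;

```python
import json
import subprocess
import time

def current_git_sha():
    # Ask git which commit the working tree is based on;
    # only works when run inside a git checkout
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()

def record_run(params, metrics, sha):
    # One record per run: the SHA ties results back to the exact code
    return json.dumps({
        "git_sha": sha,
        "params": params,
        "metrics": metrics,
        "timestamp": time.time(),
    })

# In a real run you would pass current_git_sha(); a fixed SHA is used here
record = record_run({"lr": 0.001}, {"accuracy": 0.91}, sha="abc123")
```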

&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;

&lt;p&gt;Software Engineering preaches the separation of application code and application configuration to allow for predictable and deterministic software behavior across environments. This translates well to machine learning code: for example, one can separate the model definition and training-loop code from the associated hyper-parameters that define the configuration.&lt;/p&gt;

&lt;p&gt;The first step toward unlocking reproducibility is to actually separate configuration from code. For me, this means the code itself should NOT define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Features&lt;/li&gt;
&lt;li&gt;Labels&lt;/li&gt;
&lt;li&gt;Split parameters (e.g. 80-20 split)&lt;/li&gt;
&lt;li&gt;Preprocessing parameters (e.g. the fact that data was normalized)&lt;/li&gt;
&lt;li&gt;Training hyper-parameters (including pre-processing parameters)&lt;/li&gt;
&lt;li&gt;Evaluation criteria, e.g. metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ideally all these are tracked separately in a &lt;a href="https://blog.zenml.io/declarative_configs_for_mlops/"&gt;declarative config&lt;/a&gt; that is human readable.&lt;/p&gt;
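&lt;p&gt;Here is what such a declarative config might look like, and how code consumes it without defining any of the values itself. The schema below is illustrative; in practice the config would live in its own versioned YAML or JSON file:&lt;/p&gt;

```python
import json

# Illustrative declarative run config; a YAML file would work the same way
CONFIG = json.loads("""
{
  "features": ["age", "income"],
  "label": "churned",
  "split": {"train": 0.8, "eval": 0.2},
  "preprocessing": {"normalize": true},
  "training": {"learning_rate": 0.001, "epochs": 20},
  "evaluation": {"metrics": ["accuracy", "auc"]}
}
""")

def train(config):
    # The code only reads the config; it hardcodes none of these values
    hp = config["training"]
    return f"trained for {hp['epochs']} epochs at lr={hp['learning_rate']}"

summary = train(CONFIG)
```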

&lt;h3&gt;
  
  
  Environment
&lt;/h3&gt;

&lt;p&gt;If an ML result is produced on a developer's local machine, there is a high chance it will not be reproducible. Why? Because developers, especially relatively inexperienced ones, are not always diligent about creating and maintaining proper virtual environments.&lt;/p&gt;

&lt;p&gt;The obvious solution here is containerizing applications with, say, &lt;a href="https://docker.com/"&gt;Docker&lt;/a&gt;. However, this is another example of where the skills of ML practitioners diverge from those of conventional software engineers. Most data scientists are not trained in these matters and require proper organizational support to help and encourage them to produce containerized applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data
&lt;/h3&gt;

&lt;p&gt;And finally, we arrive at the data. Data versioning has become one of the most discussed topics in the production machine learning community. Unlike code, you can't simply check data into version control easily (although tools like &lt;a href="https://dvc.org"&gt;DVC&lt;/a&gt; are attempting just that).&lt;/p&gt;

&lt;p&gt;As with code, basic versioning of data does not by itself ensure reproducibility. There is a whole bunch of metadata associated with how data is utilized in machine learning development, all of which must be persisted to make training runs reproducible.&lt;/p&gt;

&lt;p&gt;Here is a simple, but common, example that illustrates this point. If you have ever worked with machine learning, have you ever created a folder or storage bucket somewhere that holds random files in varying preprocessing states? Something like &lt;code&gt;normalized_1.json&lt;/code&gt;, or perhaps even a timestamped &lt;code&gt;12_02_19.csv&lt;/code&gt;? Technically, a timestamped file is versioned data, but that does not mean the runs associated with it are reproducible: one would have to know how, when, and where (i.e. the aforementioned metadata) these versioned files were used to ensure reproducibility.&lt;/p&gt;
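&lt;p&gt;A minimal improvement over timestamped filenames is to identify data by a content hash and persist the usage metadata next to it. A sketch (the record fields are hypothetical, not any particular tool's schema):&lt;/p&gt;

```python
import hashlib
import json

def fingerprint(data_bytes):
    # A content hash identifies the exact bytes,
    # unlike a timestamp in a filename
    return hashlib.sha256(data_bytes).hexdigest()

def data_record(data_bytes, preprocessing, used_by_run):
    # Persist how and where this exact version of the data was used
    return json.dumps({
        "sha256": fingerprint(data_bytes),
        "preprocessing": preprocessing,
        "run_id": used_by_run,
    })

raw = b"age,income\n31,52000\n44,71000\n"
rec = json.loads(
    data_record(raw, preprocessing=["normalized"], used_by_run="run-042")
)
```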

&lt;h3&gt;
  
  
  Concrete Example
&lt;/h3&gt;

&lt;p&gt;While it may fall outside the scope of this blog, the open-source MLOps framework ZenML showcases a clear example&lt;br&gt;
of putting these principles into action &lt;a href="https://docs.zenml.io/benefits/ensuring-ml-reproducibility.html"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Reproducibility in machine learning is not trivial, and ensuring it in production is even harder. ML teams need to be fully aware of the precise aspects to track in their processes to unlock reproducibility.&lt;/p&gt;

&lt;p&gt;If you’re looking for a head start on enabling reproducibility, check out &lt;a href="https://github.com/zenml-io/zenml"&gt;ZenML&lt;/a&gt;, an open-source MLOps framework for reproducible machine learning - and leave a star while you're there.&lt;/p&gt;

&lt;p&gt;Also, hop on over to our &lt;a href="https://zenml.io/slack-invite"&gt;Slack Channel&lt;/a&gt; if you want to continue the discussion.&lt;/p&gt;

&lt;p&gt;I’ll be back in a few days to talk about using Git in a machine learning setting - stay tuned!&lt;/p&gt;

</description>
      <category>reproducibility</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Can you do the splits?</title>
      <dc:creator>Hamza Tahir</dc:creator>
      <pubDate>Mon, 06 Dec 2021 12:24:45 +0000</pubDate>
      <link>https://dev.to/zenml/can-you-do-the-splits-3kdh</link>
      <guid>https://dev.to/zenml/can-you-do-the-splits-3kdh</guid>
      <description>&lt;p&gt;One attempt to ensure that ML models generalize in unknown settings is splitting data. This can be done in many ways, from 3-way (train, test, eval) splits to k-splits with cross-validation. The underlying reasoning is that by training a ML model on a subset of the data, and evaluating on &lt;code&gt;unknown&lt;/code&gt; data, one can reason much better if the model has underfit or overfit in training.&lt;/p&gt;

&lt;p&gt;For me, splitting data is the most underrated task in all of data science. It is understandable that for most jobs, a simple 3-way split suffices. However, I have stumbled across many problems where more complicated splits are needed to ensure generalization. These splits are more complex because they are derived from the actual data, rather than from the structure of the data that the previously mentioned split methods are based on. This post attempts to break down some of the more &lt;code&gt;unconventional&lt;/code&gt; ways to split data in ML development, and the reasoning behind them.&lt;/p&gt;

&lt;h1&gt;
  
  
  Let's start with a dataset
&lt;/h1&gt;

&lt;p&gt;In order to illustrate the split mechanisms, it helps to start with a sample dataset to perform the splits on. To keep things easy, let's use a simple multivariate time-series dataset represented in tabular format. The data consists of 3 numerical features, 1 categorical feature, and 1 timestamp feature. It is visualized below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W8Jdj7sr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o2mqt96nwo9fym8zhqxv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W8Jdj7sr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o2mqt96nwo9fym8zhqxv.png" alt="whole dataset" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This type of dataset is common across many machine learning use cases and industries. A concrete example would be multiple timestreams transmitted from different machines with multiple sensors on a factory floor. The categorical variable would then be the ID of the machine, the numerical features would be what the sensors record (e.g., pressure, temperature), and the timestamp would be when the data was transmitted and recorded in the database.&lt;/p&gt;

&lt;h1&gt;
  
  
  Doing the splits
&lt;/h1&gt;

&lt;p&gt;Imagine you receive this dataset as a CSV file from your data engineering department and are tasked with writing a classification or a regression model. The label in such a case could be any of the features or an additional column. Regardless, the first thing to do would be to split up the data into sets that are meaningful.&lt;/p&gt;

&lt;p&gt;To keep things easy, you decide to make a simple split into &lt;code&gt;train&lt;/code&gt; and &lt;code&gt;eval&lt;/code&gt;. You know immediately that a naive random split with shuffling won't fly here - the data does consist of multiple sensor streams indexed by time, after all. So how do you split the data so that order is maintained and subsequent models are sufficiently generalizable?&lt;/p&gt;

&lt;h2&gt;
  
  
  Another view of the data
&lt;/h2&gt;

&lt;p&gt;The most straightforward transformation we can do is to represent the data per categorical class (in our running example, visualize the data per machine). This would yield the following result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SgVIw16P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tsmx9gh2jgompoe1gk2y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SgVIw16P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tsmx9gh2jgompoe1gk2y.png" alt="grouped" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Horizontal Split
&lt;/h2&gt;

&lt;p&gt;Grouping the data together suddenly makes the issue of splitting a bit simpler, and largely dependent on your hypothesis. If the machines are running under similar conditions, one question you might ask is: &lt;code&gt;How would an ML model trained on one group generalize to other groups?&lt;/code&gt; That is, if trained on the &lt;code&gt;class_1&lt;/code&gt;, &lt;code&gt;class_2&lt;/code&gt; and &lt;code&gt;class_3&lt;/code&gt; timestreams, how would the model fare on the &lt;code&gt;class_4&lt;/code&gt; and &lt;code&gt;class_5&lt;/code&gt; timestreams? Here is a visualization of that split:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AvSEsK6s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zt7wpt9hlai674o3r012.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AvSEsK6s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zt7wpt9hlai674o3r012.png" alt="horizontal" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I call this the &lt;code&gt;Horizontal&lt;/code&gt; split due to the nature of the cut line in the visualization above. This split can be easily achieved in most ML libraries by simply grouping by the categorical feature and partitioning along it. A successful training with this split would show evidence that the model has picked up signals that generalize across previously unseen groups. However, it would not show that the model is able to predict the future behavior of any one group.&lt;/p&gt;
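&lt;p&gt;With pandas, for instance, the horizontal split boils down to a membership test on the categorical column. A rough sketch - the toy frame and column names below are made up for illustration:&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in for the dataset above: 5 machines, 4 readings each.
df = pd.DataFrame({
    "class": ["class_%d" % i for i in (1, 2, 3, 4, 5) for _ in range(4)],
    "ts": list(range(4)) * 5,
    "sensor": range(20),
})

# Train on three groups, evaluate on the two groups the model never sees.
train_groups = {"class_1", "class_2", "class_3"}
train = df[df["class"].isin(train_groups)]
eval_ = df[~df["class"].isin(train_groups)]
```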

&lt;p&gt;It's important to note that this split decision did &lt;code&gt;NOT&lt;/code&gt; account for time as a basis of the split itself. One can assume, however, that you would also sort each timestream by time to maintain that relationship in your data. Which brings us to the next split.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Vertical Split
&lt;/h2&gt;

&lt;p&gt;But what if you want to split across time itself? For most time-series modelling, a common way to split the data is into &lt;code&gt;past&lt;/code&gt; and &lt;code&gt;future&lt;/code&gt;. That is, the training set takes in historical data relative to the data in the eval set. The hypothesis in this case would be: &lt;code&gt;How would an ML model trained on historical data per group generalize to future data for each group?&lt;/code&gt; This question might be answered by the so-called &lt;code&gt;Vertical&lt;/code&gt; split:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6x9JvNCF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l8rj2cwbhewrb28tqn2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6x9JvNCF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l8rj2cwbhewrb28tqn2g.png" alt="vertical" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A successful training with this split would showcase that the model is able to pick up patterns across timestreams it has already seen, and make accurate predictions of behavior in the future. However, this itself would not show that this model will generalize well to other timestreams from different groups.&lt;/p&gt;

&lt;p&gt;Of course, your multiple timestreams now have to be sorted individually, so we still need to group. However, this time, rather than cutting across groups, we take a sample of the &lt;code&gt;past&lt;/code&gt; of each group and put it in train, and the &lt;code&gt;future&lt;/code&gt; of each group in eval. In this idealized example, all the timestreams are of the same length, i.e., each timestream has exactly the same number of data points. In the real world, however, this may not be the case - so you would require a system that builds an index across each group to make this split.&lt;/p&gt;
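&lt;p&gt;In pandas terms, the vertical split is a per-group cut along the sorted time axis. A rough sketch with made-up toy data, deliberately using groups of unequal length:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["a"] * 4 + ["b"] * 6,        # two timestreams of unequal length
    "ts": [1, 2, 3, 4, 1, 2, 3, 4, 5, 6],  # per-group timestamps
    "sensor": range(10),
})

def vertical_split(frame, frac=0.75):
    # Sort each group by time, then send the earliest `frac` of each
    # group's rows to train and the remainder to eval.
    train_parts, eval_parts = [], []
    for _, group in frame.sort_values("ts").groupby("class"):
        cut = int(len(group) * frac)
        train_parts.append(group.iloc[:cut])
        eval_parts.append(group.iloc[cut:])
    return pd.concat(train_parts), pd.concat(eval_parts)

train, eval_ = vertical_split(df)
```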

&lt;h2&gt;
  
  
  The Hybrid Split
&lt;/h2&gt;

&lt;p&gt;An inquisitive ML researcher might at this point wonder if they could produce a model that would generalize under both constraints of the &lt;code&gt;Horizontal&lt;/code&gt; and the &lt;code&gt;Vertical&lt;/code&gt; split. The hypothesis in that case would be: &lt;code&gt;How would a model trained on historical data for SOME groups generalize to future data of these groups AND all data from other groups?&lt;/code&gt;. A visualization of this &lt;code&gt;Hybrid&lt;/code&gt; split would look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gAfwM1Z3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sxkllhbflfbt4hs5nl98.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gAfwM1Z3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sxkllhbflfbt4hs5nl98.png" alt="hybrid" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Naturally, if model training is successful, this model would surely be more robust than the others in a real-world setting. It would have displayed evidence not only of learning the patterns of some of the groups it has already seen, but also of having picked up signals that generalize across groups. This might be useful if we are to add more similar machines to the factory in the future.&lt;/p&gt;
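&lt;p&gt;Sketched in pandas, the hybrid split holds out whole groups entirely, and additionally holds out the future of the groups the model does see. Toy data and arbitrary fractions again, for illustration only:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["a"] * 4 + ["b"] * 4 + ["c"] * 4,
    "ts": [1, 2, 3, 4] * 3,
    "sensor": range(12),
})

held_out = {"c"}                          # groups the model never trains on
seen = df[~df["class"].isin(held_out)]

# Eval gets every row of the held-out groups plus the "future" of each
# seen group; train gets only the "past" of each seen group.
train_parts = []
eval_parts = [df[df["class"].isin(held_out)]]
for _, group in seen.sort_values("ts").groupby("class"):
    cut = int(len(group) * 0.75)
    train_parts.append(group.iloc[:cut])
    eval_parts.append(group.iloc[cut:])

train = pd.concat(train_parts)
eval_ = pd.concat(eval_parts)
```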

&lt;h2&gt;
  
  
  Multi-dimensional splits
&lt;/h2&gt;

&lt;p&gt;The notion of horizontal and vertical splits can be generalized to many dimensions. For example, one might want to group by two categorical features rather than one to isolate sub-groups in the data even further, and sort within each sub-group. There might also be complex logic in the middle to filter out groups with a low number of samples, plus other business-level logic pertaining to the domain.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;This hypothetical example illustrates the endless variety of machine learning splits that can be created by an astute data scientist. Just as it is important to &lt;a href="https://developers.google.com/machine-learning/fairness-overview"&gt;ensure ML fairness&lt;/a&gt; whilst evaluating your models, it is equally important to spend sufficient time thinking about how a dataset is split and the consequences this has for biasing the model downstream.&lt;/p&gt;

&lt;p&gt;One easy way to do the &lt;code&gt;Horizontal&lt;/code&gt;, &lt;code&gt;Vertical&lt;/code&gt; and &lt;code&gt;Hybrid&lt;/code&gt; splits is via ZenML, by writing just a &lt;a href="https://docs.zenml.io/docs/developer_guide/pipelines_config_yaml#main-key-split"&gt;few lines of YAML&lt;/a&gt;. ZenML is an &lt;a href="https://zenml.io"&gt;MLOps framework&lt;/a&gt; we developed while deploying models to production, for datasets with characteristics similar to the example above. If the content above interests you and you would like to try ZenML, please feel free to reach out to me at &lt;a href="//mailto:hamza@zenml.io"&gt;hamza@zenml.io&lt;/a&gt;. Head over to &lt;a href="https://docs.zenml.io"&gt;our docs&lt;/a&gt; to understand how it works in more detail.&lt;/p&gt;

&lt;p&gt;Thank you and happy splitting!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>devops</category>
      <category>machinelearning</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Avoiding technical debt with ML pipelines</title>
      <dc:creator>Hamza Tahir</dc:creator>
      <pubDate>Mon, 06 Dec 2021 12:21:14 +0000</pubDate>
      <link>https://dev.to/zenml/avoiding-technical-debt-with-ml-pipelines-9ma</link>
      <guid>https://dev.to/zenml/avoiding-technical-debt-with-ml-pipelines-9ma</guid>
      <description>&lt;p&gt;Okay, lets make it clear at the start: This post is NOT intended for people who are doing one-off, silo-ed projects like participating in Kaggle competitions, or doing hobby projects on jupyter notebooks to learn the trade. The value of throw-away, quick, diry script code is obvious there - and has its place. Rather, it is intended for ML practitioners working in a &lt;em&gt;production&lt;/em&gt; setting. So if you're working in a ML team that is struggling to manage technical debt while pumping out ML models, this one's for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  A typical workflow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iqRfoI_i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hmnpstj7flbq73v4ml26.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iqRfoI_i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hmnpstj7flbq73v4ml26.jpg" alt='Image Source: &amp;lt;a href="https://www.flickr.com/photos/michael_mayer/8701850930"&amp;gt;Michael Mayer on Flickr' width="880" height="755"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the premise: You're an ML/DL/AI engineer/analyst/scientist working in a startup/SME/enterprise. It's your job to take a bunch of random data from all over the place and produce value. What do you do? You sit down, somehow get the data onto your local machine, and inevitably do something along the lines of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jupyter notebook
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or you go to a Colab, if you're fancy and your team's privacy rules allow it.&lt;/p&gt;

&lt;p&gt;What follows is a story I've seen many times before - in pseudo-code, of course.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;xyzlibraryforml&lt;/span&gt;

&lt;span class="c1"&gt;# CELL 1: Read
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/path/to/file.*"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# INSERT HERE: a 100 more cells deleted and updated to explore data.
&lt;/span&gt;
&lt;span class="c1"&gt;# CELL 2: Split
&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;split_the_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# basic
&lt;/span&gt;
&lt;span class="c1"&gt;# INSERT HERE: trying to figure out if the split worked
&lt;/span&gt;
&lt;span class="c1"&gt;# CELL 3: Preprocess
# nice, oh lets normalize
&lt;/span&gt;&lt;span class="n"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# exploring preprocessed data, same drill as before
&lt;/span&gt;
&lt;span class="c1"&gt;# CELL 4: Train
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;some_obfuscated_library&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# not being petty at all
&lt;/span&gt;
&lt;span class="c1"&gt;# if youre lucky here, just look at some normal metrics like accuracy. otherwise:
&lt;/span&gt;
&lt;span class="c1"&gt;# CELL 5: Evaluate
&lt;/span&gt;&lt;span class="n"&gt;complicated_evaluation_methods&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# INSERT HERE: do this a 1000 times
&lt;/span&gt;
&lt;span class="c1"&gt;# CELL 6: Export (i.e. pickle it)
&lt;/span&gt;&lt;span class="n"&gt;export_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So you're done, right? That's it - boom. Test set results are great. Let's give it to the ops guy to deploy in production. Lunch break and Reddit for the rest of the day!&lt;/p&gt;

&lt;p&gt;Okay, am I grossly exaggerating? &lt;strong&gt;Yes&lt;/strong&gt;. Is this scarily close to the truth for some businesses? &lt;strong&gt;Also yes&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, what's the problem?
&lt;/h2&gt;

&lt;p&gt;The problem is that the notebook above is a ball of technical debt that will keep growing if not culled early enough. Let's break down what was wrong with it:&lt;/p&gt;

&lt;h3&gt;
  
  
  Not moving towards a bigger picture
&lt;/h3&gt;

&lt;p&gt;When you put generalizable logic into non-versioned, throw-away notebook blocks, you take away your team's ability to take advantage of it. Take, for example, the logic for loading/extracting your data from your static datasource. Sure, a &lt;code&gt;pd.read_json&lt;/code&gt; is easy right now, but what happens if the format changes? Worse, what happens if the data grows and is split into multiple files? Even worse, what happens if it won't fit into memory any more? And what happens when your colleague runs into the same problem - she'll probably go through the same loops as you did, not even knowing it's an already-solved problem. Sure, there are solutions to all these problems, but are you going to keep solving them in your local notebook?&lt;/p&gt;

&lt;p&gt;The answer is probably no (&lt;a href="https://netflixtechblog.com/open-sourcing-polynote-an-ide-inspired-polyglot-notebook-7f929d3f447"&gt;unless you're Netflix or something&lt;/a&gt;). Usually, the logical thing to do is to extract the loading into a logically separate service. This way, you abstract the actual extraction of data away from yourself and into a maintainable layer transparent to everyone. For example, this could be some form of &lt;code&gt;feature store&lt;/code&gt; that collects the various data streams in your organization into one point, which can then be consumed through defined APIs by everyone on your team.&lt;/p&gt;
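&lt;p&gt;The shape of such a layer can be as small as a registry that hides paths and formats behind names. A deliberately naive sketch - the class and dataset names are invented for illustration:&lt;/p&gt;

```python
import pandas as pd

class DataSource:
    # Callers ask for a dataset by name and never touch file paths,
    # formats, or credentials. Swapping the backing store later only
    # changes the registered loader, not the notebooks that consume it.
    def __init__(self):
        self._loaders = {}

    def register(self, name, loader):
        self._loaders[name] = loader

    def load(self, name):
        return self._loaders[name]()

source = DataSource()
source.register("trips", lambda: pd.DataFrame({"tripduration": [1012, 530]}))
df = source.load("trips")
```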

&lt;p&gt;The same applies to the pre-processing, training and evaluation part of the script above.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building logic locally (and therefore again and again)
&lt;/h3&gt;

&lt;p&gt;As in the above example, when you write code to explore data, you generate tons of great stuff - visualizations, statistical dataframes, cleaned and preprocessed data, etc. The random, arbitrary order of execution of Jupyter notebooks ensures that the path taken to create these artifacts is forever lost in overridden local variables and spontaneous kernel restarts. Even worse, the logic has complex, easily overridden configurations embedded deep inside the code itself - making it much harder to recreate artifacts.&lt;/p&gt;

&lt;p&gt;Look, I understand - I do it all the time. Data wrangling is a random, grinding process, and it's going to be messy. But setting up some framework to keep track of your data exploration pipelines will pay off big time. As with commenting your code, the biggest beneficiary of keeping track is going to be yourself. Your team will also be faster and avoid redundant work if these artifacts and mini-answers to questions are made transparent and clear to everyone automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Not owning the deployment
&lt;/h3&gt;

&lt;p&gt;The last part of this notebook is probably the most frustrating. I really do not believe that the job of the person writing that notebook ends with exporting the model for ops. That does not make any sense.&lt;/p&gt;

&lt;p&gt;First of all, that &lt;code&gt;preprocess()&lt;/code&gt; function has to go with the model, otherwise training-serving skew is going to crash your model from the get-go. Secondly, how on earth is the ops person supposed to know what assumptions you made while building the model? Are you going to write extensive documentation of what data goes in, how it should be preprocessed, and which shape it should be in when deploying to an endpoint? There are now automated ways of doing this - so own the deployment!&lt;/p&gt;

&lt;p&gt;Another aspect lacking in the script above is any measurement of the performance characteristics of the model. Most data scientists I know do not care how big the model is, how much memory it consumes for predictions, or how fast and efficient it is in deployment. But a model does not produce value if it does not fulfill the performance criteria of the end application. Again, the person developing the model should have ownership of its end deployment.&lt;/p&gt;

&lt;h1&gt;
  
  
  Suggestions
&lt;/h1&gt;

&lt;p&gt;The easiest way to fix the above is to develop a framework in which an ML team can balance throw-away exploratory code development with maintainable, easy-to-share code. If you are to do this, you might want to keep the following in mind:&lt;/p&gt;

&lt;h3&gt;
  
  
  Create well-defined interfaces (i.e. decompose into pipelines)
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;split&lt;/code&gt;, &lt;code&gt;transform&lt;/code&gt;, &lt;code&gt;train&lt;/code&gt;, &lt;code&gt;evaluate&lt;/code&gt; and &lt;code&gt;deploy&lt;/code&gt; components of your workflow are logically independent entities/services. Architect a setup where the individual components of your ML workflow are abstracted away from concrete implementations. This could mean defining actual interfaces, object-oriented style, or simply ensuring that your repo has some form of structure that is easy for everyone to contribute to and extend. This does not have to be rocket science at the start, but it will help enormously.&lt;/p&gt;

&lt;p&gt;This is where the notion of ML pipelines comes into play: pipelines are abstract representations that define a series of data processing tasks. Thinking in &lt;code&gt;pipelines&lt;/code&gt; helps your team separate out the logical entities in their workflow and have data flow through them independently. This inevitably yields a more robust, maintainable codebase. Defining ML pipelines like this also lets you automate continuous training of your stale models on new data as it comes in. However, you need to track your data metadata for that (see below).&lt;/p&gt;
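&lt;p&gt;At its simplest, "thinking in pipelines" just means giving each step one narrow contract and chaining them. Here is a toy sketch in plain Python where each step maps a context dict to a new context dict - the step names mirror the workflow above, and the "model" is just a mean, for brevity:&lt;/p&gt;

```python
def split(ctx):
    # Deterministic 80/20 cut of the raw rows.
    rows = ctx["raw"]
    cut = int(len(rows) * 0.8)
    return {**ctx, "train": rows[:cut], "eval": rows[cut:]}

def transform(ctx):
    # Scale both sets by a statistic computed on train only,
    # so the eval set cannot leak into preprocessing.
    scale = max(ctx["train"]) or 1
    return {**ctx,
            "train": [x / scale for x in ctx["train"]],
            "eval": [x / scale for x in ctx["eval"]]}

def train_step(ctx):
    # Stand-in "model": the mean of the transformed training data.
    model = sum(ctx["train"]) / len(ctx["train"])
    return {**ctx, "model": model}

def run_pipeline(steps, ctx):
    # The orchestrator is trivial; the value is in the shared contract.
    for step in steps:
        ctx = step(ctx)
    return ctx

result = run_pipeline([split, transform, train_step],
                      {"raw": list(range(10))})
```

&lt;p&gt;Because every step is a plain function with one input and one output, each one can be tested, swapped, or re-run in isolation - which is exactly what the notebook version makes impossible.&lt;/p&gt;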

&lt;h3&gt;
  
  
  Make a plan for your ML Metadata
&lt;/h3&gt;

&lt;p&gt;Every experiment you run yields ML metadata: who ran it, when it was run, what data went in, where the results are stored, etc. Make sure you map these out and provide a convenient way to add to this store. Important to note: I am not just talking about experiment tracking. There are many wonderful libraries out there that will help in tracking the model-centric metadata, i.e., metrics and the like. What is often neglected, however, is the data-centric metadata - especially if the data keeps changing. Things like data versioning, statistics, visualizations, and which seed was used for random splitting. There should be an easy way to trace the various routes your data takes in your development.&lt;/p&gt;
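&lt;p&gt;Even before adopting a dedicated tool, an append-only log is better than nothing. A bare-bones sketch - all field names and values here are invented, and a real metadata store offers far more:&lt;/p&gt;

```python
import datetime
import json
import os

def record_run(store_path, *, data_version, params, metrics):
    # One JSON line per experiment: who/when, the model-centric
    # metadata (params, metrics) AND the data-centric metadata
    # (which version of the data went in).
    entry = {
        "who": os.environ.get("USER", "unknown"),
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "data_version": data_version,
        "params": params,
        "metrics": metrics,
    }
    with open(store_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

record_run("runs.jsonl",
           data_version="raw_2021_12_06.csv",
           params={"seed": 42, "lr": 0.001},
           metrics={"accuracy": 0.87})
```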

&lt;h3&gt;
  
  
  Ensure that your workloads can run in any environment
&lt;/h3&gt;

&lt;p&gt;Running a workload on a single machine is always going to be easier than making the same code run in any arbitrary environment. I know Docker seems unapproachable and hard for many ML people, but at the very least make a &lt;code&gt;requirements.txt&lt;/code&gt; and add an &lt;code&gt;__init__.py&lt;/code&gt;! Ideally, containerize your code and run experiments on some form of orchestration framework. Doing this one step now will save a lot of pain when you scale and automate this whole thing to work on bigger data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do not separate deployment from training
&lt;/h3&gt;

&lt;p&gt;This is perhaps the most no-brainer of all the no-brainer suggestions so far. End-to-end ownership led to the whole DevOps revolution 20 years ago, and that has not gone away in ML development either. Provide a smooth mechanism to transfer a trained model to an endpoint, and make sure the ops people are sitting in the same room (not always literally) as your ML developers. Put processes in place so that everyone understands the end goal in production. Automate things when possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do not compromise on repeatability and traceability
&lt;/h3&gt;

&lt;p&gt;You know when people start coding in Python and then move to C++ or Java and do not understand concepts like pointers and static typing? They think: "What is the use of giving a variable a type? &lt;strong&gt;I&lt;/strong&gt; know what it is, so why am I &lt;strong&gt;forced&lt;/strong&gt; to do this?" Well, sorry to break it to you, but pointers and static typing have a purpose - knowing them protects your code from your own failings and ensures high-quality, robust output. Ultimate flexibility can be a bad thing, especially with developers who tend to err towards laziness (like me).&lt;/p&gt;

&lt;p&gt;Something very similar happens in Jupyter notebooks - running arbitrary code in any order gives you freedom, but makes you lose the very important notions of repeatability and traceability, two cornerstones of any robust, production-ready engineering discipline. Please, at least ensure that your notebooks are executable top-down in a repeatable manner. Cluttered and randomly ordered Jupyter notebooks should be punished with long rants like this one in code review meetings.&lt;/p&gt;

&lt;p&gt;One way of ensuring both traits is to extract the 'settings' of your code from the implementation. Which brings me to my next point...&lt;/p&gt;

&lt;h3&gt;
  
  
  Separate configuration from implementation
&lt;/h3&gt;

&lt;p&gt;Separating configuration from the actual code implementation is definitely a pain. Yet this is another one of those 'pays off in the long run' sorts of things. &lt;a href="https://dev.to/declarative_configs_for_mlops"&gt;We've written about it before&lt;/a&gt;, but to summarize: separating your configuration allows you to automate repetitive tasks, increases the predictability of results, and ensures reproducibility. Ideally, configuration should be treated as code: versioned and maintained.&lt;/p&gt;
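&lt;p&gt;Concretely, the moment the knobs live in a parsed config rather than in the function body, the same code path can be re-run, diffed, and versioned. A toy sketch using JSON for brevity - a YAML file would work the same way, and the keys shown are made up:&lt;/p&gt;

```python
import json
import random

# In practice this text lives in a versioned config file, not in the code.
CONFIG_TEXT = '{"split": {"train_fraction": 0.8, "seed": 42}}'
cfg = json.loads(CONFIG_TEXT)

def split(rows, cfg):
    # All behavior is driven by the config: same config in, same split out.
    rows = list(rows)
    random.Random(cfg["split"]["seed"]).shuffle(rows)
    cut = int(len(rows) * cfg["split"]["train_fraction"])
    return rows[:cut], rows[cut:]

train, eval_ = split(range(10), cfg)
```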

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;ML practitioners in many organizations are heavily incentivized to go for quick wins and produce early results. However, this leads to accumulated technical debt that eventually slows the team down over time. The solution is to follow proper software engineering principles from the get-go, and to lean on guidelines that strike a balance between rapid results and high-quality software development.&lt;/p&gt;

&lt;p&gt;The thoughts above are very personal lessons I have learnt over the last 3 years of deploying models in production. As a result, we have created ZenML to provide a framework for ML developers to solve some, if not all, of the problems noted above. Reach out to me at &lt;a href="//mailto:hamza@zenml.io"&gt;hamza@zenml.io&lt;/a&gt; if you are interested in giving the platform a go. Head over to &lt;a href="https://docs.zenml.io"&gt;our docs&lt;/a&gt; to understand how it works in more detail.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>pipelines</category>
    </item>
    <item>
      <title>Deep Learning on 33,000,000 data points using a few lines of YAML</title>
      <dc:creator>Hamza Tahir</dc:creator>
      <pubDate>Mon, 06 Dec 2021 12:11:38 +0000</pubDate>
      <link>https://dev.to/zenml/deep-learning-on-33000000-data-points-using-a-few-lines-of-yaml-1ie0</link>
      <guid>https://dev.to/zenml/deep-learning-on-33000000-data-points-using-a-few-lines-of-yaml-1ie0</guid>
      <description>&lt;p&gt;Over the last few years at &lt;a href="//https:/zenml.io"&gt;zenml&lt;/a&gt;, we have regularly dealt with datasets that contain millions of data points. Today, I want to write about how the we use our machine learning platform, &lt;a href="https://zenml.io"&gt;ZenML&lt;/a&gt;, to build production-ready distributed training pipelines. These pipelines are capable of dealing with millions of datapoints in a matter of hours. If you also want to build large-scale deep learning pipelines, sign up for &lt;a href="https://zenml.io/signup/"&gt;ZenML for free here&lt;/a&gt; and follow along.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Datasource&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A good way to get a hold of a dataset of the size we want is &lt;a href="https://cloud.google.com/bigquery/public-data"&gt;public Google BigQuery tables&lt;/a&gt;.&lt;br&gt;
The one I chose for today's example is the &lt;a href="https://console.cloud.google.com/marketplace/details/city-of-new-york/nyc-citi-bike"&gt;New York Citi Bike dataset&lt;/a&gt;, which contains 33 million data points, holding information about various bike sharing trips in New York City. Here is a snippet of what the datasource looks like (*only relevant columns shown):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   birth_year | gender   |   end_station_id |   start_station_id |   tripduration | usertype
--------------+----------+------------------+--------------------+----------------+------------
         1977 | Female   |              103 |                100 |           1012 | Subscriber
         1991 | Male     |             1089 |                 23 |            530 | Customer
... etc. etc. 33 million more times
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our mission (if we choose to accept it) is to see if we can infer the &lt;code&gt;birth_year&lt;/code&gt; of the person,&lt;br&gt;
given all the rest of the data in this table.&lt;/p&gt;

&lt;p&gt;Sound interesting? Alright, let's begin.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Building the Pipeline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When dealing with a dataset this large, it's difficult to do some Pandas magic in a Jupyter notebook to wrangle our data - I won't subject my poor ThinkPad to that punishment. That's why we created &lt;a href="https://zenml.io/signup/"&gt;ZenML&lt;/a&gt; to deal with this problem (&lt;a href="https://dev.to/zenml/why-deep-learning-development-in-production-is-still-broken-3ah9"&gt;amongst others&lt;/a&gt;).&lt;br&gt;
For this post, I will assume you have the &lt;code&gt;cengine&lt;/code&gt; CLI &lt;a href="https://docs.zenml.io/docs/installation"&gt;installed&lt;/a&gt; and ready to go.&lt;/p&gt;

&lt;p&gt;In short, the &lt;code&gt;cengine&lt;/code&gt; CLI creates, registers, and executes training pipelines,&lt;br&gt;
which are managed by us on our cloud platform. You create a pipeline declaratively by&lt;br&gt;
specifying a YAML configuration file.&lt;/p&gt;

&lt;p&gt;For this example, I created a &lt;strong&gt;simple feedforward neural network&lt;/strong&gt; pipeline. Here's how I did it:&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Step 0: Add the datasource&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The first thing to do is create a datasource. As the BigQuery table is public, it can be added by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cengine datasource create bq --name citibike_trips \
  --project "bigquery-public-data" \
  --dataset new_york \
  --table citibike_trips \
  --table_type public
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that you can run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cengine datasource list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And see the following details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Selection   |   ID | Name               |     Rows |   Cols |   Size (MB)
-------------+------+--------------------+----------+--------+-------------
 *           |   16 | citibike_trips     | 33319019 |     15 |        4689

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data contains 33,319,019 rows with 15 columns.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Configure YAML - Features&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now we can build our YAML config. Usually I would use the easy-to-follow&lt;br&gt;
&lt;a href="https://docs.zenml.io/docs/developer_guide/pipelines_configure"&gt;configure command&lt;/a&gt; to create this, but for this post it's easier to go section by section and build it manually. So open up a text editor&lt;br&gt;
(I'm a &lt;a href="https://www.sublimetext.com/"&gt;Sublime Text&lt;/a&gt; guy, but do it in &lt;a href="https://www.vim.org/"&gt;VIM&lt;/a&gt; if you wish, whatever floats your boat):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;features&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;end_station_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="na"&gt;gender&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="na"&gt;start_station_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="na"&gt;tripduration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="na"&gt;usertype&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will define the features we want to use for our pipeline. I dropped some features that I thought were redundant or could bias the model (like &lt;code&gt;Bike ID&lt;/code&gt;). I mean, the model should have a challenge, right?&lt;/p&gt;

&lt;p&gt;Also note that I didn't do any fancy embedding of start and end stations.&lt;br&gt;
As Andrew Ng says: &lt;em&gt;"Don’t start off trying to design and build the perfect system. Instead, build&lt;br&gt;
and train a basic system quickly"&lt;/em&gt;. So let's get to a baseline first.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Configure YAML - Label&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;OK, the next part is the label. That's also easy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;birth_year&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;loss&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mse&lt;/span&gt;
    &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;mae&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So we define &lt;code&gt;birth_year&lt;/code&gt; as the label, and say we want an &lt;code&gt;mse&lt;/code&gt; (mean squared error) loss on the model. The metric I'll be tracking is &lt;code&gt;mae&lt;/code&gt; (mean absolute error).&lt;/p&gt;
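&lt;p&gt;For intuition, here is a quick pure-Python sketch (illustrative only, not ZenML code) of what those two numbers measure: &lt;code&gt;mse&lt;/code&gt; punishes large misses quadratically, while &lt;code&gt;mae&lt;/code&gt; reads directly as "years off on average":&lt;/p&gt;

```python
# Plain-Python sketch of the two metrics from the YAML above
# (illustrative only -- ZenML/Keras compute these for you).

def mse(y_true, y_pred):
    """Mean squared error: penalizes large errors quadratically."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: average miss, in the label's own units (years)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

actual = [1977, 1991, 1985]
predicted = [1980, 1990, 1979]
print(mse(actual, predicted))  # ~15.33
print(mae(actual, predicted))  # ~3.33
```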

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Configure YAML - Split&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We need to split our data for this to make any sense. ZenML lets you split the data into &lt;code&gt;train&lt;/code&gt; and &lt;code&gt;eval&lt;/code&gt; in a variety of ways (support for more splits is on the way!). Let's write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;split&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;categorize_by&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;start_station_name&lt;/span&gt;
  &lt;span class="na"&gt;index_ratio&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;train&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;0.9&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;eval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;0.1&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three lines of YAML, but they pack a punch. ZenML will let you &lt;a href="https://docs.zenml.io/docs/developer_guide/pipelines_yaml"&gt;categorize your data before splitting it&lt;/a&gt;.&lt;br&gt;
For our case, we want all start stations to be equally represented to avoid any biases. So we group by &lt;code&gt;start_station_name&lt;/code&gt; and divide each group in a 90-10 split. For you SQL folk, this is similar to doing a &lt;code&gt;GROUP BY&lt;/code&gt; and then taking a partition over an index. This way both our training and eval data contain trips from every station.&lt;/p&gt;
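&lt;p&gt;To make the grouped split concrete, here is a minimal pure-Python sketch of the idea (field names mirror the dataset, but the function is made up for illustration and is not how ZenML implements it): group the rows by station, then cut each group 90-10 so every station lands in both splits:&lt;/p&gt;

```python
import random

def grouped_split(rows, key, train_frac=0.9, seed=42):
    """Split rows into train/eval so that every group (e.g. every start
    station) is represented in both splits at roughly 90/10."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    train, eval_ = [], []
    for members in groups.values():
        rng.shuffle(members)
        cut = int(len(members) * train_frac)
        train.extend(members[:cut])
        eval_.extend(members[cut:])
    return train, eval_

# Toy data: 30 trips across 3 stations.
trips = [{"start_station_name": f"station_{i % 3}", "tripduration": 60 * i}
         for i in range(30)]
train, eval_ = grouped_split(trips, "start_station_name")
print(len(train), len(eval_))  # 27 3
```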

&lt;p&gt;I feel like splitting up data is a very under-appreciated part of machine learning and plays an important part in ML fairness, so I tried to make an appropriate split here.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Configure YAML - Trainer (Model definition)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We have arrived at, undoubtedly, the most interesting part of our YAML. The trainer, i.e., the actual model definition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;trainer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;layers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;dense&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;units&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;64&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;# a dense layer 64 units&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;dense&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;units&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;32&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;# a dense layer with 32 units&lt;/span&gt;
  &lt;span class="na"&gt;architecture&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;feedforward&lt;/span&gt; &lt;span class="c1"&gt;# can be feedforward or sequential&lt;/span&gt;
  &lt;span class="na"&gt;last_activation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;linear&lt;/span&gt; &lt;span class="c1"&gt;# last layer: we can take relu, but linear should also be fine&lt;/span&gt;
  &lt;span class="na"&gt;num_output_units&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;# How many units in the last layer? We choose 1 because we want to regress one number (i.e. date_of_birth)&lt;/span&gt;
  &lt;span class="na"&gt;optimizer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;adam&lt;/span&gt; &lt;span class="c1"&gt;# optimizer for loss function&lt;/span&gt;
  &lt;span class="na"&gt;save_checkpoints_steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15000&lt;/span&gt; &lt;span class="c1"&gt;# how many steps before we do a checkpoint evaluation for our Tensorboard logs&lt;/span&gt;
  &lt;span class="na"&gt;eval_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;256&lt;/span&gt; &lt;span class="c1"&gt;# batch size for evalulation that happens at every checkpoint&lt;/span&gt;
  &lt;span class="na"&gt;train_batch_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;256&lt;/span&gt; &lt;span class="c1"&gt;# batch size for training&lt;/span&gt;
  &lt;span class="na"&gt;train_steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;230000&lt;/span&gt; &lt;span class="c1"&gt;# two epochs&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;regression&lt;/span&gt; &lt;span class="c1"&gt;# choose from [regression, classification, autoencoder]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's quite straightforward really - we define 2 dense layers, set the optimizer, and a few more nuts and bolts. The whole trainer follows the &lt;a href="https://www.tensorflow.org/guide/keras"&gt;Keras&lt;/a&gt; API quite closely, so it should feel familiar to most people. The interesting bit about this trainer is the &lt;code&gt;train_steps&lt;/code&gt; and &lt;code&gt;batch_size&lt;/code&gt;. One step is one whole batch passing through the network, so with a &lt;strong&gt;33 million datapoint dataset&lt;/strong&gt;, &lt;strong&gt;230,000&lt;/strong&gt; steps of &lt;strong&gt;256&lt;/strong&gt; works out to roughly &lt;strong&gt;2&lt;/strong&gt; epochs of the data. Trust me, I did the math.&lt;/p&gt;
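&lt;p&gt;If you want to check that arithmetic yourself (one step consumes one batch, so steps times batch size gives the total samples seen; strictly it lands a bit under two full passes, which is close enough for a baseline):&lt;/p&gt;

```python
rows = 33_319_019      # rows in citibike_trips (from `cengine datasource list`)
batch_size = 256       # train_batch_size in the YAML
train_steps = 230_000  # train_steps in the YAML

samples_seen = train_steps * batch_size  # one step consumes one full batch
epochs = samples_seen / rows
print(f"{samples_seen:,} samples ~ {epochs:.2f} epochs")  # 58,880,000 samples ~ 1.77 epochs
```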

&lt;p&gt;At this point you might be wondering what types of models you can create with this &lt;code&gt;trainer&lt;/code&gt; key - go ahead and read the developer &lt;a href="https://docs.zenml.io/docs/developer_guide/pipelines_yaml"&gt;docs&lt;/a&gt; for it. This is a part we're really trying to nail down, and support for different sorts of models is always a priority.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 5: Configure YAML - Evaluation (Splitting Metrics)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Almost there! One last thing we might want to do is add some evaluator slices. What does that mean? It means that we may not just want to look at the overall metrics (i.e. overall &lt;code&gt;mae&lt;/code&gt;) of the model, but also at the &lt;code&gt;mae&lt;/code&gt; across each value of a categorical column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;evaluator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;birth_year&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt; &lt;span class="c1"&gt;# I'd like to see how I did across each year&lt;/span&gt;
  &lt;span class="na"&gt;gender&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt; &lt;span class="c1"&gt;# I'd like to see if the model biases because of gender&lt;/span&gt;
  &lt;span class="na"&gt;start_station_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt; &lt;span class="c1"&gt;# I'd like to see how I did across each station&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I defined three such columns which I was interested in seeing sliced metrics across. You'll see how this plays into the evaluation part of our pipeline in a bit.&lt;/p&gt;

&lt;h3&gt;
  
  
  The full config YAML
&lt;/h3&gt;

&lt;p&gt;There are some things that I have intentionally skipped in the config for the sake of brevity. For reference, you can find the pipeline configuration ready to download &lt;a href="///assets/posts/train_30_mil_few_lines_yaml/citibike.yaml"&gt;here&lt;/a&gt;. I tried to annotate it with comments for a clearer explanation, and there are always the &lt;a href="https://docs.zenml.io/docs/developer_guide/pipelines_yaml"&gt;docs&lt;/a&gt; to refer to. Most notably, the &lt;code&gt;default&lt;/code&gt; key is worth a look, as it defines the preprocessing steps we took to normalize the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Run the pipeline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Ok now I can register a pipeline called &lt;code&gt;nyc_citibike_experiment&lt;/code&gt; like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cengine pipeline push my_config.yaml nyc_citibike_experiment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ZenML will check your active datasource and suggest an ops configuration that it deems suitable for the size of the job you're about to run. For this experiment, ZenML registered the pipeline with 4 &lt;code&gt;workers&lt;/code&gt; at 96 &lt;code&gt;cpus_per_worker&lt;/code&gt;. You can always change this if you want, but I decided to go for this configuration and ran the pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cengine pipeline run &amp;lt;pipeline_id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enter &lt;code&gt;Y&lt;/code&gt; for the safety prompt that appears, and let it run!&lt;/p&gt;

&lt;p&gt;You should see a success message with your chosen configuration. The platform will provision these resources in the cloud, connect automatically to the datasource, and create a machine learning pipeline to train the model. All preprocessing steps of the pipeline will be distributed across the workers and cpus. The training will happen on a &lt;a href="https://www.nvidia.com/en-gb/data-center/tesla-k80/"&gt;Tesla K80&lt;/a&gt; (distributed training coming soon!).&lt;/p&gt;

&lt;p&gt;So now, you can sit back and relax. You don't need to watch dying Jupyter kernels or stare at your terminal as the steps go by. Just grab a coffee, browse Reddit, and chill.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Evaluate the results&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;While running, the status of a pipeline can be checked with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cengine pipeline status &lt;span class="nt"&gt;--pipeline_id&lt;/span&gt; &amp;lt;pipeline_id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sample output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   ID | Name                              | Pipeline Status   | Completion   |   Compute Cost (€) |   Training Cost (€) |   Total Cost (€) | Execution Time
-----------+-----------------------------------+-------------------+--------------+--------------------+---------------------+------------------+------------------
   1  | nyc_citibike_experiment           | Running           | 13%          |                  0 |                   0 |                0 | 0:14:21.187081
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the pipeline hits the 100% completion mark, I can see the compute (preprocessing + evaluation) cost and training cost it incurred. For me, this pipeline took &lt;strong&gt;74 minutes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Preprocessing and training 33 million datapoints in just over an hour. Not too bad.&lt;/p&gt;

&lt;p&gt;At that point, I can also evaluate it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cengine pipeline evaluate &amp;lt;pipeline_id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This opens up a pre-configured Jupyter notebook where I can view &lt;a href="https://www.tensorflow.org/tensorboard"&gt;Tensorboard&lt;/a&gt; logs, along with the excellent &lt;a href="https://github.com/tensorflow/model-analysis"&gt;Tensorflow Model Analysis (TFMA)&lt;/a&gt; plugin. Both of these will show me different things about the pipeline.&lt;/p&gt;

&lt;p&gt;Tensorboard will show the usual Tensorboard things: the model graph, the train and eval loss, etc. Here's what mine looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--McJ1nz2R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/utx50s5dp787ae1rvezl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--McJ1nz2R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/utx50s5dp787ae1rvezl.png" alt="tensorboardlogs" width="880" height="725"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That is pretty cool - maybe we overtrained it around the 180,000th step, as the loss took a jump there, but the &lt;code&gt;mae&lt;/code&gt; seems to keep decreasing. We're close to 9.6 &lt;code&gt;mae&lt;/code&gt; overall, which isn't bad at all for this baseline model.&lt;/p&gt;

&lt;p&gt;How about a deeper dive into the metrics? That's where TFMA comes into play.&lt;br&gt;
TFMA will show the metrics defined in the YAML and add the ability to slice each metric across the columns defined in the &lt;code&gt;evaluator&lt;/code&gt; key. E.g. let's slice it across &lt;code&gt;birth_year&lt;/code&gt; to see how well it did for each year.&lt;/p&gt;
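&lt;p&gt;Conceptually, slicing a metric is simple. Here is a small pure-Python sketch (not the TFMA API, and the row shape is made up for illustration) of what "mae per value of a categorical column" means:&lt;/p&gt;

```python
def sliced_mae(rows, column):
    """Group absolute errors by a categorical column and average per slice --
    conceptually what TFMA does for the columns under the `evaluator` key."""
    slices = {}
    for r in rows:
        slices.setdefault(r[column], []).append(abs(r["y_true"] - r["y_pred"]))
    return {k: sum(v) / len(v) for k, v in slices.items()}

rows = [
    {"gender": "Female", "y_true": 1977, "y_pred": 1979},
    {"gender": "Female", "y_true": 1980, "y_pred": 1984},
    {"gender": "Male",   "y_true": 1991, "y_pred": 1990},
]
print(sliced_mae(rows, "gender"))  # {'Female': 3.0, 'Male': 1.0}
```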

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LuCcLqOe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/18eyr84v47sw0qmtjlb2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LuCcLqOe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/18eyr84v47sw0qmtjlb2.png" alt="Image description" width="880" height="887"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: If you want to replicate this step, just add &lt;code&gt;birth_year&lt;/code&gt; in the generated notebook code where it's specified.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A deeper dive reveals that the model actually guessed the year of people born in 1977 pretty well (that's tested on ~11,000 samples from that year). So it's definitely learning something. We can now dig into which years it did worse on, and into other slices, and see if we can gain anything from that when we iterate on our model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;Now that we have the baseline model, it's very simple to iterate on different sorts of models quickly. The cool thing is that ZenML has stored all &lt;a href="https://docs.zenml.io/docs/developer_guide/caching"&gt;intermediate states of the pipeline&lt;/a&gt; (i.e. the preprocessed data) in an efficient and compressed binary format. Subsequent pipeline runs will &lt;strong&gt;warmstart&lt;/strong&gt; straight at the training part, given that everything else stays the same. This caching mechanism is quite powerful and can save up to 80% on time and cost. But I'll leave that for a separate post, where we take the same pipeline and iterate on it quickly to arrive at a more accurate model. So stay tuned for that!&lt;/p&gt;
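&lt;p&gt;The general idea behind this kind of caching can be sketched in a few lines. This is a hypothetical illustration of fingerprint-based step caching, not ZenML's actual implementation: each step is keyed by a hash of its own config plus everything upstream, so an unchanged preprocessing step is a cache hit and the run skips straight to training.&lt;/p&gt;

```python
import hashlib
import json

def step_fingerprint(step_name, config, upstream_fingerprints):
    """Hypothetical sketch: key a pipeline step by a hash of its own config
    plus the fingerprints of everything upstream. If nothing changed, the
    key matches and the cached output is reused (the 'warmstart')."""
    payload = json.dumps(
        {"step": step_name, "config": config, "upstream": upstream_fingerprints},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

cache = {}
fp = step_fingerprint("preprocess", {"features": ["tripduration", "gender"]}, [])
cache[fp] = "materialized preprocessed data"

# A re-run with an identical config hits the cache and skips preprocessing:
print(step_fingerprint("preprocess", {"features": ["tripduration", "gender"]}, []) in cache)  # True
```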

&lt;p&gt;If you liked this post, please make sure to follow us on &lt;a href="https://twitter.com/zenml_io"&gt;Twitter&lt;/a&gt;, &lt;a href="https://www.linkedin.com/company/zenml/"&gt;LinkedIn&lt;/a&gt; or just chat with us on our &lt;a href="https://discord.gg/HPBUKru"&gt;Discord&lt;/a&gt; server.&lt;/p&gt;

&lt;p&gt;We're actively looking for beta testers to test the platform, and we have a whole bunch of features coming up, including distributed training, automatic model serving, hyper-parameter tuning and image support.&lt;br&gt;&lt;br&gt;
Please visit the &lt;a href="https://docs.zenml.io"&gt;docs&lt;/a&gt; for details about&lt;br&gt;
the platform, and if interested, &lt;a href="mailto:support@zenml.io"&gt;contact us&lt;/a&gt; directly!&lt;/p&gt;

&lt;p&gt;In the meantime, stay safe and hope to see you all soon!&lt;/p&gt;

</description>
      <category>pipelines</category>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Why deep learning development in production is (still) broken</title>
      <dc:creator>Hamza Tahir</dc:creator>
      <pubDate>Mon, 06 Dec 2021 12:07:28 +0000</pubDate>
      <link>https://dev.to/zenml/why-deep-learning-development-in-production-is-still-broken-3ah9</link>
      <guid>https://dev.to/zenml/why-deep-learning-development-in-production-is-still-broken-3ah9</guid>
      <description>&lt;p&gt;Around 87% of machine learning projects do not survive to make it to production. There is a disconnect between machine learning being done in Jupyter notebooks on local machines and actually being served to end-users to provide some actual value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2Ku6f075--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5695tlrasfub63h591bj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2Ku6f075--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5695tlrasfub63h591bj.png" alt="MLOps" width="880" height="288"&gt;&lt;/a&gt; Source: Hidden Technical Debt in Machine Learning Systems (Sculley et al.)&lt;/p&gt;

&lt;p&gt;The oft-quoted Hidden Technical Debt paper, whose diagram can be seen here, has been in circulation since 2015, yet deep learning in production still has a ways to go to catch up to the quality standards attained by more conventional software development. Here is one take on what is broken:&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Data is not treated as a first-class citizen&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In traditional software development, code is (rightly so) a first-class citizen. In ML development, there is a further need for data to be a first-class citizen as well: data has to be treated with the same care that most developers give to the code they write.&lt;/p&gt;

&lt;p&gt;Right now, in most organizations, data is spread everywhere and inaccessible. This is not just about raw data either: even if an organization spends a lot of money on centralizing its data into lakes, critical data remains spread across the organization in Colabs, notebooks, scripts, and pre-processed flat files. This causes, amongst other things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wasted compute on redundant transformations of data&lt;/li&gt;
&lt;li&gt;No transparency and accountability of what data trains what models&lt;/li&gt;
&lt;li&gt;Inability to carry over important training-phase preprocessing to the serving phase (see below)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Different requirements in training and serving&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Teams often find it surprising when a well-trained model starts to give spurious results in the real world. The transition from training a model to serving it is far from trivial.&lt;/p&gt;

&lt;p&gt;For example, there is skew between training and production data that needs to be taken into account. Secondly, one has to be very careful to make sure that production data goes through the same preprocessing steps in serving as in training. Lastly, while training involves running experiments and iterating quickly, serving has further requirements at the application level, e.g. inference time and cost at scale. All of these need to be taken into account to avoid unnecessary surprises when the transition from training to serving happens.&lt;/p&gt;
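&lt;p&gt;One common mitigation for the preprocessing point is to keep a single source of truth for feature transforms. A minimal sketch (the function and field names are made up for illustration):&lt;/p&gt;

```python
import math

def preprocess(trip):
    """The one place feature transforms live, imported by both the
    training pipeline and the serving endpoint -- so production data
    goes through exactly the same steps as training data."""
    return {
        "tripduration_log": math.log1p(trip["tripduration"]),
        "is_subscriber": 1.0 if trip["usertype"] == "Subscriber" else 0.0,
    }

# Training side: applied in bulk over the dataset.
training_trips = [{"tripduration": 1012, "usertype": "Subscriber"}]
train_features = [preprocess(t) for t in training_trips]

# Serving side: the SAME function applied to each incoming request,
# leaving no room for the two code paths to drift apart.
request_features = preprocess({"tripduration": 530, "usertype": "Customer"})
```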

&lt;h2&gt;
  
  
  &lt;strong&gt;No gold standard yet for ML Ops&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Applying DevOps principles to ML development (or MLOps) is all the rage right now. However, there is as yet no gold standard for it. The field is in its infancy and still needs to tackle the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resources (compute, GPU etc) are scattered and not being used efficiently across teams&lt;/li&gt;
&lt;li&gt;No proper CI/CD pipelines&lt;/li&gt;
&lt;li&gt;No proper monitoring in production (change in data quality etc.)&lt;/li&gt;
&lt;li&gt;Scaling is hard - in training and in serving&lt;/li&gt;
&lt;li&gt;Machine learning compute works in spikes, so systems need to be equipped to deal with that&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Collaboration is hard&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In conventional software development, we use workflows that integrate tickets and version control to make collaboration as seamless and transparent as possible. Unfortunately, ML development still lags behind on this front, largely because ML developers tend to create silos of glue-code scripts, preprocessed data pickles, and Jupyter notebooks. While all of these are useful for research and experimentation, they do not translate well into a robust, long-running production environment.&lt;/p&gt;

&lt;p&gt;In short, in the ML world, there is largely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-transparency coupled with individual experimentation&lt;/li&gt;
&lt;li&gt;Notebook Hell with glue-code scripts&lt;/li&gt;
&lt;li&gt;No versioning, in data, code or configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Most of the problems highlighted above can be solved by paying proper attention to machine learning development in production, from the first training onwards. The field is catching up slowly but surely, and it seems inevitable that machine learning will reach the standards of traditional software engineering. Will we see new, ever-improving, and exciting ML products in our lives at that point? Let's hope so!&lt;/p&gt;

&lt;p&gt;Our attempt to solve these problems is ZenML, an extensible, open-source MLOps framework. We recently launched and are now looking for practitioners to try it out on their production use-cases! So head over to &lt;a href="https://github.com/zenml-io/zenml"&gt;GitHub&lt;/a&gt;, and don't forget to leave us a star if you like what you see!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>mlops</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
