<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Strick van Linschoten</title>
    <description>The latest articles on DEV Community by Alex Strick van Linschoten (@alexzenml).</description>
    <link>https://dev.to/alexzenml</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F749580%2F3ec69467-11c2-4dce-97ca-db380c09ee3d.jpeg</url>
      <title>DEV Community: Alex Strick van Linschoten</title>
      <link>https://dev.to/alexzenml</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alexzenml"/>
    <language>en</language>
    <item>
      <title>How to improve your experimentation workflows with MLflow Tracking and ZenML</title>
      <dc:creator>Alex Strick van Linschoten</dc:creator>
      <pubDate>Thu, 24 Feb 2022 15:41:46 +0000</pubDate>
      <link>https://dev.to/zenml/how-to-improve-your-experimentation-workflows-with-mlflow-tracking-and-zenml-2dhe</link>
      <guid>https://dev.to/zenml/how-to-improve-your-experimentation-workflows-with-mlflow-tracking-and-zenml-2dhe</guid>
      <description>&lt;p&gt;Most professional or so-called 'citizen' data scientists will be familiar with the scenario that sees you spending a day trying out a dozen different model training configurations in which you experiment with various hyper parameters or perhaps different pre-trained models. As evening falls, you emerge from the haze of experimentation and you ask yourself: which of my experiments offered the best results for the problem I'm trying to solve?&lt;/p&gt;

&lt;p&gt;At this point, especially for smaller use cases or where you were unsure whether a hunch was worth pursuing and just wanted to try a few things out, you might be left empty-handed, unable to give an answer one way or another beyond some feeling that there &lt;em&gt;was&lt;/em&gt; one set of parameters that really performed well, if only you could remember what they were… And if someone asked you to reproduce the steps you took to create a particular model, would you even be able to do that?&lt;/p&gt;

&lt;p&gt;This would be one of those times where it's worth reminding ourselves that data science includes the word 'science', and that we need to be careful about how we track and reason about models. The workflows and practice of machine learning are sufficiently complicated (and often non-deterministic) that we need rigorous ways of ensuring that we really are doing what we think we are doing, and that we can reproduce our work. (It's not for nothing that 'reproducibility' is &lt;a href="https://petewarden.com/2018/03/19/the-machine-learning-reproducibility-crisis/" rel="noopener noreferrer"&gt;often&lt;/a&gt; &lt;a href="https://www.technologyreview.com/2019/02/18/137357/machine-learning-is-contributing-to-a-reproducibility-crisis-within-science/" rel="noopener noreferrer"&gt;paired&lt;/a&gt; with 'crisis'.)&lt;/p&gt;

&lt;p&gt;There are manual ways you could use to help address this problem, but they're unlikely to be sufficient. Will your spreadsheet experiment tracker really capture &lt;em&gt;everything&lt;/em&gt; you needed to produce a particular model? (Think about how the particular configuration or random split of data is so central to how your model performs.) What you really want is something that handles all this tracking of data and parameters as automatically as possible.&lt;/p&gt;

&lt;h2&gt;Why use MLflow Tracking?&lt;/h2&gt;

&lt;p&gt;Enter &lt;a href="https://mlflow.org/docs/latest/tracking.html" rel="noopener noreferrer"&gt;MLflow Tracking&lt;/a&gt;, part of &lt;a href="https://mlflow.org/docs/latest/concepts.html" rel="noopener noreferrer"&gt;a wider ecosystem&lt;/a&gt; of tooling offered by MLflow to help you train robust and reproducible models. Other commonly-used pieces are &lt;a href="https://mlflow.org/docs/latest/model-registry.html" rel="noopener noreferrer"&gt;the model registry&lt;/a&gt; (which stores any model artifacts created during the training process) as well as a flexible suite of plugins and integrations allowing you to &lt;a href="https://mlflow.org/docs/latest/models.html#built-in-deployment-tools" rel="noopener noreferrer"&gt;deploy the models&lt;/a&gt; you create.&lt;/p&gt;

&lt;p&gt;MLflow Tracking is what allows you to track all those little parts of your model training workflow. Not only does it hook into an artifact store of your choosing (such as that offered by ZenML), but it also offers a really useful UI which you can use to inspect pipeline runs and the experiments you conduct. If you want to compare the performance or accuracy of several experiments (i.e. pipeline runs), some diagrams and charts are only a few clicks away. This flexible interface goes a long way toward solving some of the problems mentioned earlier.&lt;/p&gt;

&lt;p&gt;One really useful feature offered by MLflow Tracking is &lt;a href="https://mlflow.org/docs/latest/tracking.html#automatic-logging" rel="noopener noreferrer"&gt;automatic logging&lt;/a&gt;. Many commonly-used machine learning libraries (such as &lt;code&gt;scikit-learn&lt;/code&gt;, PyTorch, &lt;code&gt;fastai&lt;/code&gt; and TensorFlow / Keras) support this. You either call &lt;code&gt;mlflow.autolog()&lt;/code&gt; just before your training code, or you use a library-specific version of it (e.g. &lt;code&gt;mlflow.sklearn.autolog()&lt;/code&gt;). MLflow will then handle logging metrics, parameters and models without the need for explicit log statements. (Note that you can also include the &lt;a href="https://mlflow.org/docs/latest/tracking.html#logging-data-to-runs" rel="noopener noreferrer"&gt;non-automated logging&lt;/a&gt; of whatever custom properties are important for you.)&lt;/p&gt;
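
&lt;p&gt;As a minimal sketch of what autologging looks like outside of any pipeline (the model and data here are purely illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import mlflow
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# enable autologging for scikit-learn before any training code runs
mlflow.sklearn.autolog()

X, y = make_regression(n_samples=100, n_features=4, random_state=42)
model = RandomForestRegressor(n_estimators=10)

# parameters, metrics and the fitted model are all logged without
# a single explicit log statement
model.fit(X, y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;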

&lt;h2&gt;ZenML + MLflow Tracking = 🚀&lt;/h2&gt;

&lt;p&gt;If you're using ZenML to bring together the various tools in your machine learning stack, you'll probably be eager to use some of this tracking goodness and make your own experiments more robust. ZenML actually &lt;em&gt;already&lt;/em&gt; partly supported what MLflow Tracking does, in the sense that any artifacts going in or out of the steps of your ZenML pipeline were being tracked, stored and versioned in your artifact and metadata store. (You're welcome!) But until now we didn't have a great way for you to interact with that metadata about your experiments and pipeline runs, one that was non-programmatic and visual.&lt;/p&gt;

&lt;p&gt;MLflow Tracking gives you exactly that: the ability to inspect your various experiments and pipeline runs in a (local) web interface, which is probably going to be a friendlier way of interacting with and reasoning about your machine learning experiments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm695n5ozqsakpbm2vb8u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm695n5ozqsakpbm2vb8u.png" alt="Tracking machine learning training runs with MLFlow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You could have used MLflow Tracking in the past, too, but with our latest integration updates ZenML handles some of the complicated boilerplate setup that comes with using MLflow. There are &lt;a href="https://mlflow.org/docs/latest/tracking.html#where-runs-are-recorded" rel="noopener noreferrer"&gt;different ways&lt;/a&gt; of deploying the tracking infrastructure and servers, and it isn't a completely painless task to set all this up and get going with MLflow Tracking. This is where we make your life a bit easier: we set up everything you need to use it on your (currently: local) machine, connecting the MLflow Tracking interface to your ZenML artifact store. It can be a bit tricky to configure the relevant connections between the various modular pieces that talk to each other, and we hide this from you beneath an abstraction.&lt;/p&gt;

&lt;p&gt;We think that this ability to converse between the MLflow universe and the ZenML universe is extremely powerful, and this approach is at the heart of what we are trying to build with our tool to help you work with reproducible and robust machine learning pipelines.&lt;/p&gt;

&lt;h2&gt;Just tell me how to use it already!&lt;/h2&gt;

&lt;p&gt;The best place to see MLflow Tracking and ZenML being used together in a simple use case is &lt;a href="https://github.com/zenml-io/zenml/tree/main/examples/mlflow_tracking" rel="noopener noreferrer"&gt;our example&lt;/a&gt; that showcases the integration. It builds on the quickstart example, but shows how you can add in MLflow to handle the tracking. To enable MLflow to track artifacts inside a particular step, all you need to do is decorate the step with &lt;code&gt;@enable_mlflow&lt;/code&gt; and then specify what you want logged within the step. Here you can see how this is employed in a model training step that uses the &lt;code&gt;autolog&lt;/code&gt; feature I mentioned above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Define the step and enable mlflow - order of decorators is important here
&lt;/span&gt;&lt;span class="nd"&gt;@enable_mlflow&lt;/span&gt;
&lt;span class="nd"&gt;@step&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tf_trainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TrainerConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Train a neural net from scratch to recognize MNIST digits return our
    model or the learner&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Flatten&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
            &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;losses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SparseCategoricalCrossentropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_logits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tensorflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;autolog&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# write model
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
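
&lt;p&gt;To give a feel for how such a step slots into a full run, here's a rough sketch of the wiring, loosely modelled on the quickstart. The importer and evaluator steps here are placeholders rather than the example's actual code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from zenml.pipelines import pipeline

@pipeline
def mnist_pipeline(importer, trainer, evaluator):
    """Connect a data importer, the MLflow-enabled trainer and an evaluator."""
    x_train, y_train, x_test, y_test = importer()
    model = trainer(x_train=x_train, y_train=y_train)
    evaluator(x_test=x_test, y_test=y_test, model=model)

# wire in concrete step instances and run; the config values are illustrative
pipeline_instance = mnist_pipeline(
    importer=importer(),
    trainer=tf_trainer(config=TrainerConfig(epochs=5, lr=0.001)),
    evaluator=evaluator(),
)

# every run now shows up as a tracked run in the MLflow Tracking UI
pipeline_instance.run()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;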



&lt;p&gt;If, for any reason, you need to access the global environment parameters used by ZenML to automatically configure MLflow (which define where and how experiments and runs are displayed and stored in the MLflow Tracking UI/system), we've got you covered. These global parameters can be easily accessed through the &lt;code&gt;Environment&lt;/code&gt; singleton object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;zenml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;integrations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mlflow_environment&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MLFLOW_ENVIRONMENT_NAME&lt;/span&gt;
&lt;span class="n"&gt;mlflow_env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="n"&gt;MLFLOW_ENVIRONMENT_NAME&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check out &lt;a href="https://apidocs.zenml.io/0.6.1/api_docs/environment/" rel="noopener noreferrer"&gt;the API docs&lt;/a&gt; to learn more about the &lt;code&gt;Environment&lt;/code&gt; object and watch this space for a blog post where we explain more about why we chose to add this recently.&lt;/p&gt;

&lt;h2&gt;Over to you now!&lt;/h2&gt;

&lt;p&gt;If you're inspired by this illustration of how you can make your machine learning workflow that little bit more reproducible and robust, check out &lt;a href="https://github.com/zenml-io/zenml/tree/main/examples/mlflow_tracking" rel="noopener noreferrer"&gt;the full example&lt;/a&gt; that demonstrates the integration. If you use it in your own code base, please do let us know — &lt;a href="https://zenml.io/slack-invite/" rel="noopener noreferrer"&gt;say hi on Slack&lt;/a&gt;! — and as always, if you have any questions, we're here for you.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>machinelearning</category>
      <category>experimentation</category>
      <category>mlops</category>
    </item>
    <item>
      <title>10 Ways To Level Up Your Testing with Python</title>
      <dc:creator>Alex Strick van Linschoten</dc:creator>
      <pubDate>Wed, 10 Nov 2021 12:01:59 +0000</pubDate>
      <link>https://dev.to/alexzenml/10-ways-to-level-up-your-testing-with-python-25ni</link>
      <guid>https://dev.to/alexzenml/10-ways-to-level-up-your-testing-with-python-25ni</guid>
      <description>&lt;p&gt;There's nothing like working on testing to get you familiar with a codebase. I've been &lt;a href="https://github.com/zenml-io/zenml/pull/118"&gt;working&lt;/a&gt; on &lt;a href="https://github.com/zenml-io/zenml/pull/149"&gt;adding back&lt;/a&gt; in &lt;a href="https://github.com/zenml-io/zenml/pull/130"&gt;some testing&lt;/a&gt; to &lt;a href="https://github.com/zenml-io/zenml"&gt;the ZenML codebase&lt;/a&gt; this past couple of weeks and as a relatively new employee here, it has been a really useful way to dive into how things work under the hood.&lt;/p&gt;

&lt;p&gt;This being my first time working seriously with Python, there were a few things that I had to learn along the way. What follows is an initial set of lessons I took away from the experience.&lt;/p&gt;

&lt;h2&gt;1. One-size-fits-all won't cut it&lt;/h2&gt;

&lt;p&gt;Looking at things from a higher level, it's important to realise that there are lots of different approaches you could take to testing. It's a truism that you should 'test intent, not implementation', but I imagine that in some scenarios, like software being deployed on a space shuttle, you'd want to test the implementation as well.&lt;/p&gt;

&lt;p&gt;Similarly, different companies and projects have different needs for testing. If you're a huge company, testing is a way of ensuring reliability and preventing catastrophic failures along the way. If you're a small company, where the speed of creation and the pace of development are frantic, having too rigid a set of tests may actually end up hurting you by stifling your ability to iterate through ideas and changes quickly.&lt;/p&gt;

&lt;p&gt;I found it helped to take a step back early on in my testing to really think through what I was doing, why I was doing it, and what larger goal it was there to support.&lt;/p&gt;

&lt;h2&gt;2. 'Don't be that person': testing to crush the spirits of your team&lt;/h2&gt;

&lt;p&gt;It's worth reiterating the previous remark about testing intent and not implementation.&lt;/p&gt;

&lt;p&gt;If you test every last conditional statement, checking that the code is built in exactly that specific way, changing anything in the original codebase is going to become incredibly tiresome. Moreover, your testing library will start to resemble a kind of byzantine twin replica of your original code.&lt;/p&gt;

&lt;p&gt;To prevent this, it helps if everyone on the team is testing as much as they are writing new code. That way, testing is just part of the development process and not a separate add-on owned by a QA-like team. At ZenML, we're small enough that the expectation is that if you work on a new feature, you should also be responsible for writing the tests that go alongside it.&lt;/p&gt;

&lt;h2&gt;3. Pytest, O Pytest!&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.pytest.org/en/latest/"&gt;Pytest&lt;/a&gt; is amazing. It has everything you need to write your tests, is easy to understand, and has great documentation of even the slightly more niche features. Can you tell I really enjoyed getting to know this open-source library?&lt;/p&gt;

&lt;p&gt;For now, I'll mention some of the combinations of CLI commands and flags that I found most useful.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# make the test output verbose&lt;/span&gt;
pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;

&lt;span class="c"&gt;# stop testing whenever you get to a test that fails&lt;/span&gt;
pytest tests/ &lt;span class="nt"&gt;-x&lt;/span&gt;

&lt;span class="c"&gt;# run only a single test&lt;/span&gt;
pytest tests/test_base.py::test_initialization

&lt;span class="c"&gt;# run only tests tagged with a particular word&lt;/span&gt;
pytest tests/ &lt;span class="nt"&gt;-m&lt;/span&gt; specialword

&lt;span class="c"&gt;# print out all the output of tests to the console&lt;/span&gt;
pytest tests/ &lt;span class="nt"&gt;-s&lt;/span&gt;

&lt;span class="c"&gt;# run all the tests, but run the last failures first&lt;/span&gt;
pytest tests/ &lt;span class="nt"&gt;--ff&lt;/span&gt;

&lt;span class="c"&gt;# see which tests will be run with the given options and config&lt;/span&gt;
pytest tests/ &lt;span class="nt"&gt;--collect-only&lt;/span&gt;

&lt;span class="c"&gt;# show local variables in tracebacks&lt;/span&gt;
pytest tests/ &lt;span class="nt"&gt;--showlocals&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And there are so many more! The flexibility of the CLI tool allows you to be really nimble and ensures you don't have to hang around for already-passing tests to run.&lt;/p&gt;

&lt;h2&gt;4. Temp Files &amp;amp; Temp Directory Choice Paralysis&lt;/h2&gt;

&lt;p&gt;At a certain point I needed to test that certain functions were having side effects out in the real world of a filesystem. I didn't want to pollute my hard drive or that of whatever random CI server was running the tests, so I started looking around for options for creating temporary files and directories.&lt;/p&gt;

&lt;p&gt;It turns out that between the Python standard library, Pytest and some library-specific features, we're spoiled for choice when it comes to convenience helpers for creating temporary files and directories. Python has &lt;a href="https://docs.python.org/3/library/tempfile.html"&gt;&lt;code&gt;tempfile&lt;/code&gt;&lt;/a&gt;, a platform-agnostic way of creating temporary files and directories. Pytest has &lt;code&gt;tmp_path&lt;/code&gt;, which you can pass as an argument into your test function to get a convenient location you can use to your heart's content. (There are also &lt;a href="https://docs.pytest.org/en/latest/how-to/tmp_path.html#tmp-path"&gt;several other options&lt;/a&gt; with Pytest.) Other libraries you're using may also have specific testing capabilities. We use &lt;a href="https://click.palletsprojects.com/en/8.0.x/"&gt;&lt;code&gt;click&lt;/code&gt;&lt;/a&gt; for our CLI functionality and there's a &lt;a href="https://click.palletsprojects.com/en/8.0.x/testing/#file-system-isolation"&gt;useful convenience pattern&lt;/a&gt; for running commands from a temporary directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_something&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
   &lt;span class="n"&gt;runner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CliRunner&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
   &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isolated_filesystem&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
      &lt;span class="c1"&gt;# do something here in your new temporary directory
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
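
&lt;p&gt;And for comparison, a minimal sketch of the Pytest &lt;code&gt;tmp_path&lt;/code&gt; fixture mentioned above (the file name and contents are just illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def test_writes_config(tmp_path):
    # tmp_path is a fresh pathlib.Path directory that Pytest cleans up for you
    config_file = tmp_path / "config.yaml"
    config_file.write_text("epochs: 5\n")
    assert config_file.read_text() == "epochs: 5\n"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;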



&lt;h2&gt;5. Decorate your way to clearer test code&lt;/h2&gt;

&lt;p&gt;Pytest has a bunch of helper functions which enhance the test code you already have. For instance, if you want to iterate over a series of values and pass them in as arguments to a test function, you can just use the &lt;code&gt;parametrize&lt;/code&gt; functionality:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parametrize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"test_input,expected"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="s"&gt;"3+5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"2+4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"6*9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nb"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the last of these would fail, because 6*9 does not equal 42.&lt;/p&gt;

&lt;p&gt;If you have a test that you know is failing right now, but you want to put it to the side for the moment, you can mark it down as being expected to fail with &lt;code&gt;xfail&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xfail&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_something&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# whatever code you have here doesn't work
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I find marking tests this way more useful than just commenting them out, since you keep a full sense of which tests aren't working.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;mark&lt;/code&gt; mechanism is, more generally, a great way of creating custom ways to run your tests. Using a &lt;code&gt;@pytest.mark.no_async_call_required&lt;/code&gt; decorator, for example, you could distinguish between tests that take a bit longer to run and tests that are more or less instantaneous, as sketched below.&lt;/p&gt;
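
&lt;p&gt;As a minimal sketch (reusing the hypothetical marker name from above), you'd register the marker and then tag tests with it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pytest

# register custom markers (e.g. in pytest.ini) to avoid 'unknown marker' warnings:
# [pytest]
# markers =
#     no_async_call_required: fast tests that make no slow external calls

@pytest.mark.no_async_call_required
def test_addition_is_instant():
    assert 1 + 1 == 2

# run only these fast tests with: pytest tests/ -m no_async_call_required
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;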

&lt;h2&gt;6. Use &lt;code&gt;hypothesis&lt;/code&gt; for random arguments&lt;/h2&gt;

&lt;p&gt;Hypothesis is a Python library for checking that functions work the way you think they do. You specify the conditions under which a function should work, and Hypothesis checks that it actually does.&lt;/p&gt;

&lt;p&gt;For example, you can say that a function should be able to accept any &lt;code&gt;datetime&lt;/code&gt; value without any problem. Instead of you trying to come up with a list of different possible edge cases, Hypothesis will generate and run a whole series of values to check that this is actually the case. As the docs state:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"It works by generating arbitrary data matching your specification and checking that your guarantee still holds in that case. If it finds an example where it doesn’t, it takes that example and cuts it down to size, simplifying it until it finds a much smaller example that still causes the problem. It then saves that example for later, so that once it has found a problem with your code it will not forget it in the future." (&lt;a href="https://hypothesis.readthedocs.io/en/latest/"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These custom ways of testing certain kinds of inputs are called 'strategies', and Hypothesis has &lt;a href="https://hypothesis.readthedocs.io/en/latest/data.html#core-strategies"&gt;a whole bunch&lt;/a&gt; of them to choose from. The ones I most often use are text, integers, decimals and &lt;code&gt;datetime&lt;/code&gt;.&lt;/p&gt;
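
&lt;p&gt;A minimal sketch of a property-based test using the text strategy (the round-trip property here is just an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from hypothesis import given
from hypothesis import strategies as st

# property: encoding to UTF-8 and decoding again should round-trip any string
@given(st.text())
def test_utf8_roundtrip(s):
    assert s.encode("utf-8").decode("utf-8") == s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;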

&lt;h2&gt;7. Use &lt;code&gt;tox&lt;/code&gt; to test multiple versions of Python&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://tox.wiki/en/latest/"&gt;&lt;code&gt;tox&lt;/code&gt;&lt;/a&gt; allows you to automate running your test suite through multiple versions of Python. It's likely that your CI process does this as well, so in order to test that these are passing locally as well, you can use &lt;code&gt;tox&lt;/code&gt;. It creates new virtual environments using the versions you specify and runs your test suite through each of them.&lt;/p&gt;

&lt;p&gt;Note that if you're using &lt;code&gt;pyenv&lt;/code&gt; as your overall Python version manager, you may have to use something like the following command to make sure that all the various Python versions are available to &lt;code&gt;tox&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pyenv &lt;span class="nb"&gt;local &lt;/span&gt;zenml-dev-3.8.6 3.6.9 3.7.11 3.8.11 3.9.6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first argument passed in is the development environment I usually work in; the other Python versions / environments are there to make those versions available to &lt;code&gt;tox&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;8. Debug your failing tests with &lt;code&gt;pdb&lt;/code&gt;&lt;/h2&gt;

&lt;p&gt;Pytest has a bunch of handy ways of inspecting exactly what's going on at the point where a test fails. I showed some of these above; &lt;code&gt;--showlocals&lt;/code&gt;, for example, prints whatever local variables were initialized alongside the traceback.&lt;/p&gt;

&lt;p&gt;Another really useful feature is the &lt;code&gt;--pdb&lt;/code&gt; flag, which you can pass in along with your CLI command. This will deposit you inside a &lt;code&gt;pdb&lt;/code&gt; debugging session at exactly the moment your test fails. It's super useful that we get all this convenience functionality out of the box with Pytest.&lt;/p&gt;
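
&lt;p&gt;For instance, with a (deliberately broken) test like the sketch below, running &lt;code&gt;pytest tests/ --pdb&lt;/code&gt; drops you into the debugger right at the failing assertion, with the local variables available to inspect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def test_normalization():
    values = [1, 2, 3]
    total = sum(values)
    normalized = [v / total for v in values]
    # this assertion is wrong on purpose; at the pdb prompt you can
    # print values, total and normalized to see what actually happened
    assert sum(normalized) == 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;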

&lt;h2&gt;9. Linting: before and beyond testing&lt;/h2&gt;

&lt;p&gt;At ZenML we use &lt;a href="https://pre-commit.com/"&gt;&lt;code&gt;pre-commit&lt;/code&gt;&lt;/a&gt; hooks that kick into action whenever you try to commit code. (Check out &lt;a href="https://github.com/zenml-io/zenml/blob/main/pyproject.toml"&gt;our &lt;code&gt;pyproject.toml&lt;/code&gt; configuration&lt;/a&gt; and &lt;a href="https://github.com/zenml-io/zenml/tree/main/scripts"&gt;our &lt;code&gt;scripts/&lt;/code&gt; directory&lt;/a&gt; to see how we handle this!) This keeps a level of consistency throughout our codebase, making sure that all &lt;a href="https://interrogate.readthedocs.io/en/latest/index.html"&gt;our functions have docstrings&lt;/a&gt;, for example, or enforcing a standard order for &lt;code&gt;import&lt;/code&gt; statements.&lt;/p&gt;

&lt;p&gt;Some of this — the &lt;a href="https://mypy.readthedocs.io/en/stable/index.html"&gt;&lt;code&gt;mypy&lt;/code&gt;&lt;/a&gt; hook, for example — starts to verge into what feels like testing territory. By ensuring that functions all have type annotations you sometimes are doing more than just enforcing a particular coding style. When you add &lt;code&gt;mypy&lt;/code&gt; into your development workflow, you get up close and personal with exactly how different types are passed around in your codebase.&lt;/p&gt;
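
&lt;p&gt;As a tiny (hypothetical) illustration of the kind of bug this catches before any test even runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def get_accuracy(correct: int, total: int) -&amp;gt; float:
    return correct / total

# mypy rejects this call before a single test runs, with an error along
# the lines of: Argument 2 to "get_accuracy" has incompatible type "str"
accuracy = get_accuracy(90, "100")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;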

&lt;h2&gt;10. …and remember, coverage is just a number!&lt;/h2&gt;

&lt;p&gt;It's always good to have a number to chase. It gives you something to work towards and a feeling of progress. Tools like &lt;a href="https://codecov.io"&gt;Codecov&lt;/a&gt; offer fancy visualizations of just which parts of your codebase still need some attention. Automating all this as part of the CI process can highlight when you've just added a series of features but no accompanying tests.&lt;/p&gt;

&lt;p&gt;Bearing all these positives in mind, you should still always remember that your tests are there to serve your broader goals. If your goal is to rapidly iterate and create new features, maybe 100% test coverage at all times is an unrealistic expectation. 100% test coverage does not necessarily mean your code is bug-free and robust; it just means that every line was invoked during the testing process.&lt;/p&gt;

&lt;p&gt;Similarly, different kinds of codebases will have different test weightings. We didn't really talk much about the different types of tests (from unit to integration to usability), but some systems or types of designs will require more focus on particular pieces of this bigger picture.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Alex Strick van Linschoten is a Machine Learning Engineer at ZenML.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>testing</category>
      <category>beginners</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
