<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michael Schuster</title>
    <description>The latest articles on DEV Community by Michael Schuster (@schustmi).</description>
    <link>https://dev.to/schustmi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F754714%2F33f851e9-4dc4-4dca-b2a7-27fb55c54f89.png</url>
      <title>DEV Community: Michael Schuster</title>
      <link>https://dev.to/schustmi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/schustmi"/>
    <language>en</language>
    <item>
      <title>How we track our todo comments using GitHub Actions</title>
      <dc:creator>Michael Schuster</dc:creator>
      <pubDate>Wed, 01 Dec 2021 12:37:43 +0000</pubDate>
      <link>https://dev.to/schustmi/how-we-track-our-todo-comments-using-github-actions-2bei</link>
      <guid>https://dev.to/schustmi/how-we-track-our-todo-comments-using-github-actions-2bei</guid>
      <description>&lt;p&gt;If you're a software developer, you're probably familiar with the following scenario: You're working on a new feature or trying to fix a bug, and while reading through some code existing code you notice that there's a nicer way to write it, or maybe a potential edge case isn't handled.&lt;br&gt;
But where to go from here? Write a todo comment and let your future self handle it of course!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zoNlYZl8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aqveaauub88s9ddp7uch.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zoNlYZl8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aqveaauub88s9ddp7uch.jpg" alt="Problems for future me" width="702" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While this might not be the optimal solution, I still regularly use todo comments if the fix is too complicated to implement right away as I find it can get quite distracting to repeatedly switch to my browser and create an issue with a meaningful description.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to keep todo comments in sync with Jira issues
&lt;/h2&gt;

&lt;p&gt;This, however, brings a problem with it: these todos are separated from our Jira board, so we did not take them into account when planning our sprints. &lt;br&gt;
Keeping the comments in code in sync with our Jira issues manually would require a considerable amount of effort. We would have to periodically go over the entire codebase and create issues for new todos as well as delete issues and todos if their counterpart was removed.&lt;br&gt;
Instead, we looked at multiple GitHub integrations in the Jira marketplace but couldn't find an existing solution with similar features, so we decided to implement a GitHub Action that helps us track todos automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T5m7ixGd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vnmn5jv2ggx3r0xv0s2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T5m7ixGd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vnmn5jv2ggx3r0xv0s2m.png" alt="GitHub Action" width="880" height="287"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  GitHub Actions to the rescue
&lt;/h2&gt;

&lt;p&gt;Each time something is pushed to the main branch, a GitHub workflow is triggered which simply calls a Python script to do the heavy lifting. &lt;br&gt;
The script itself uses the following regular expression to find todo comments in our Python files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s"&gt;"(^[ \t]*#) TODO ?\[(LOWEST|LOW|MEDIUM|HIGH|HIGHEST|[A-Z]*?-[0-9]*?)\]:(.*$\n(\1 {2}.*$\n)*)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
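&lt;p&gt;To see the pattern in action, here's a minimal, self-contained sketch (the sample source and the printed fields are illustrative, not taken from the actual script):&lt;/p&gt;

```python
import re

# The TODO-detection regex from the post, verbatim. It matches a comment
# line with a capital TODO, a priority or issue key in square brackets,
# a colon, and any continuation lines indented by two extra spaces.
PATTERN = re.compile(
    r"(^[ \t]*#) TODO ?\[(LOWEST|LOW|MEDIUM|HIGH|HIGHEST|[A-Z]*?-[0-9]*?)\]:"
    r"(.*$\n(\1 {2}.*$\n)*)",
    flags=re.MULTILINE,
)

sample = (
    "x = 1\n"
    "# TODO [HIGH]: Do something very important here\n"
    "#  This continuation line belongs to the same todo\n"
    "y = 2\n"
    "# TODO [ENG-123]: Already linked to a Jira issue\n"
)

for match in PATTERN.finditer(sample):
    # group(2) is either a priority or an already-assigned issue key,
    # group(3) is the todo body including continuation lines.
    print(match.group(2), "->", match.group(3).splitlines()[0].strip())
```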



&lt;p&gt;Don't worry, I won't bore you with the details of how this expression works, but it essentially means that our todo comments have to conform to a certain syntax (a comment starting with a capital TODO followed by a priority in square brackets and a colon) in order for the script to detect them.&lt;br&gt;
Once all syntactically correct todos are found, they are processed as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create issues for new todos:&lt;/strong&gt; Each time new code gets merged into the main branch of our repository, our script detects all new todos and creates Jira issues with the specified priority and description. The created issues include a GitHub link to the actual comment for more context and are tagged with a separate label so we can quickly find them later. Additionally, we modify the comments to include a reference to the created issue, which not only avoids creating duplicate issues but also comes in quite handy if you come across a comment and want to check, for example, whether someone is already working on it.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# before
# TODO [HIGH]: Do something very important here
&lt;/span&gt;
&lt;span class="c1"&gt;# after
# TODO [ENG-123]: Do something very important here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Delete todos for closed issues:&lt;/strong&gt; Our codebase is evolving quite quickly at the moment, and we close some obsolete issues from time to time. To automatically keep the todo comments and issues in sync, the script also deletes todo comments when the corresponding issue was closed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tag issues when a todo is deleted:&lt;/strong&gt; Now there is just one case left to handle: what if a todo comment gets deleted and the corresponding issue is still open? We decided to handle this with caution and not close the issue automatically to guard against accidentally deleted comments. Instead, our script adds a separate label to these "orphan" issues so we can easily discuss whether they should actually be closed during our planning meetings. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
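&lt;p&gt;The three cases above can be pictured as one reconciliation function. This is a simplified, hypothetical sketch of the decision logic only; the real script also talks to the Jira API and rewrites the source files:&lt;/p&gt;

```python
def reconcile(todos_in_code, open_issue_keys, closed_issue_keys):
    """Decide what to do for each todo comment and each tracked issue.

    todos_in_code: dict mapping an issue key (or None for a new, not yet
                   linked todo) to the todo's description.
    Returns a list of (action, payload) tuples; a real implementation
    would perform these actions instead of just returning them.
    """
    actions = []
    for issue_key, description in todos_in_code.items():
        if issue_key is None:
            # Case 1: new todo without an issue reference -> create an issue
            actions.append(("create_issue", description))
        elif issue_key in closed_issue_keys:
            # Case 2: the linked issue was closed -> delete the todo comment
            actions.append(("delete_todo", issue_key))
    for issue_key in open_issue_keys:
        if issue_key not in todos_in_code:
            # Case 3: todo deleted but issue still open -> label as orphan
            actions.append(("label_orphan", issue_key))
    return actions

actions = reconcile(
    {None: "Handle empty inputs", "ENG-7": "Old todo"},
    open_issue_keys={"ENG-9"},
    closed_issue_keys={"ENG-7"},
)
```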

&lt;p&gt;If you're interested in more details or want something similar in your own projects, check out the &lt;a href="https://github.com/zenml-io/zenml/blob/f5e7f688e102db80d87a6d4ba4513fcff84a242d/scripts/update_todos.py"&gt;script&lt;/a&gt; and the accompanying &lt;a href="https://github.com/zenml-io/zenml/blob/f5e7f688e102db80d87a6d4ba4513fcff84a242d/.github/workflows/update_todos.yml"&gt;GitHub workflow&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Michael Schuster is a Machine Learning Engineer at ZenML.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>tooling</category>
      <category>github</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Introducing the revamped ZenML 0.5.x</title>
      <dc:creator>Michael Schuster</dc:creator>
      <pubDate>Wed, 17 Nov 2021 14:03:32 +0000</pubDate>
      <link>https://dev.to/schustmi/introducing-the-revamped-zenml-05x-22ka</link>
      <guid>https://dev.to/schustmi/introducing-the-revamped-zenml-05x-22ka</guid>
      <description>&lt;p&gt;We've been hard at work for the last few months to finalize the 0.5.0 release and we're super excited to finally share some details regarding this all-new ZenML version with you!&lt;/p&gt;

&lt;p&gt;We'll go over the main new features in this blog post but if you're looking for a detailed list make sure to take a look at our &lt;a href="https://github.com/zenml-io/zenml/blob/main/RELEASE_NOTES.md"&gt;release notes&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Completely reworked API
&lt;/h2&gt;

&lt;p&gt;If you're familiar with previous versions of ZenML, you'll be in for a huge surprise. &lt;br&gt;
No more tedious subclassing for every step in your machine learning pipeline: the new ZenML functional API allows you to simply decorate your existing functions in order to run them in a ZenML pipeline.&lt;br&gt;
As long as the inputs and outputs of your functions are part of the continuously expanding set of supported datatypes, ZenML automatically takes care of serializing and deserializing your step outputs.&lt;br&gt;
And if a datatype is currently not supported, ZenML enables you to easily create a custom &lt;a href="https://docs.zenml.io/framework-design#using-materializers-to-abstract-away-serialization-and-deserialization-logic"&gt;materializer&lt;/a&gt; to run your code anyway.&lt;/p&gt;

&lt;p&gt;Let's take a look at a simple step that normalizes images for training to see how the new API looks in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="s"&gt;"""Normalize images so the values are between 0 and 1."""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the &lt;code&gt;@step&lt;/code&gt; above the normalization function? That's all that was needed to transform this into a ZenML step that can be used in all your pipelines.&lt;br&gt;
Now all that's left to do is create a pipeline that uses this step and run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;

&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_and_normalize_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;load_data_step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;normalize_step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Connect the inputs and outputs of our pipeline steps
&lt;/span&gt;    &lt;span class="n"&gt;images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_data_step&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;normalize_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create and run our pipeline
&lt;/span&gt;&lt;span class="n"&gt;load_and_normalize_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our &lt;a href="https://docs.zenml.io/quickstart-guide"&gt;quickstart&lt;/a&gt; and &lt;a href="https://docs.zenml.io/guides/low-level-api"&gt;low-level guide&lt;/a&gt; are the perfect place if you want to learn more about our new API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stacks
&lt;/h2&gt;

&lt;p&gt;Stacks are one of ZenML's new &lt;a href="https://docs.zenml.io/core-concepts"&gt;core concepts&lt;/a&gt;. A stack consists of three components that define where to store data and run ZenML pipelines:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A metadata store: Stores metadata like pipeline names and parameters used to execute steps of a pipeline.&lt;/li&gt;
&lt;li&gt;An artifact store: Stores output data of all steps executed as part of a pipeline.&lt;/li&gt;
&lt;li&gt;An orchestrator: Executes a pipeline locally or in a cloud environment.&lt;/li&gt;
&lt;/ul&gt;
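&lt;p&gt;Conceptually, a stack is just a bundle of these three choices. The following dataclass is a purely illustrative sketch (not the actual ZenML class, and the component values are made up):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Stack:
    """Illustrative only: a stack bundles the three component choices."""
    metadata_store: str  # e.g. a local SQLite file or a cloud metadata service
    artifact_store: str  # e.g. a local directory or a GCS bucket
    orchestrator: str    # e.g. "local" or "airflow"

# A local development stack and a cloud production stack side by side:
local_stack = Stack("sqlite:///metadata.db", "/tmp/artifacts", "local")
production_stack = Stack("cloudsql", "gs://my-bucket/artifacts", "airflow")
```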

&lt;p&gt;The diagrams below show two example stacks and their components:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---kDFQoL3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j0g6l5ielshh7iamb38t.png" alt="Development and production stack" width="880" height="498"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Figure 1: Example stacks for local development (left) and production using Apache Airflow and GCP (right)&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;While the development stack uses your local machine to execute pipelines and store data, the production stack runs pipelines using Apache Airflow and stores their resulting data in GCP.&lt;br&gt;
In future versions of ZenML we will integrate many popular tools for each of these components so you can easily create stacks that match your requirements.&lt;/p&gt;

&lt;p&gt;After setting up multiple stacks for development and production, it is as easy as calling&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  zenml stack set production_stack
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to switch from executing pipelines locally to running them in the cloud!&lt;br&gt;
Check out our &lt;a href="https://docs.zenml.io/guides/low-level-api"&gt;low-level guide&lt;/a&gt; to learn more about the remaining core concepts or skip straight to &lt;a href="https://docs.zenml.io/guides/low-level-api/chapter-7"&gt;chapter 7&lt;/a&gt; to see the magic of stacks in action. &lt;/p&gt;
&lt;h2&gt;
  
  
  New post-execution workflow
&lt;/h2&gt;

&lt;p&gt;Inspecting and comparing pipelines after they have been executed is an essential part of working with machine learning pipelines.&lt;br&gt;
That is why we've added a completely new &lt;a href="https://docs.zenml.io/guides/post-execution-workflow"&gt;post-execution workflow&lt;/a&gt; that allows you to easily &lt;strong&gt;query metadata&lt;/strong&gt; like the parameters used to execute a step and &lt;strong&gt;read artifact data&lt;/strong&gt; like the evaluation accuracy of your model.&lt;br&gt;
This is how it works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get a pipeline from our ZenML repository
&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Repository&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;get_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipeline_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"my_pipeline"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Get the latest run of our pipeline
&lt;/span&gt;&lt;span class="n"&gt;pipeline_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# Get a specific step of the pipeline run
&lt;/span&gt;&lt;span class="n"&gt;evaluation_step&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipeline_run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"evaluation_step"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use the step parameters or outputs
&lt;/span&gt;&lt;span class="n"&gt;class_weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluation_step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"class_weights"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;evaluation_accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluation_step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In future versions, this will be the basis on which we will build visualizations that allow you to easily compare different runs of a pipeline, catch data drift and so much more!&lt;/p&gt;

&lt;h2&gt;
  
  
  Type hints
&lt;/h2&gt;

&lt;p&gt;Starting with version 0.5.1, ZenML now has type hints for the entire codebase! &lt;br&gt;
Apart from helping us make the codebase more robust, type hints in combination with unit tests allow us to implement new features and integrations quickly and confidently.&lt;br&gt;
Type hints also &lt;strong&gt;increase code comprehensibility&lt;/strong&gt; and &lt;strong&gt;improve autocompletion&lt;/strong&gt; in many places so working with ZenML is now even easier and quicker!&lt;/p&gt;
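&lt;p&gt;As a generic illustration of why this matters (a hypothetical helper, not code from the ZenML codebase), a fully annotated function lets tools like mypy flag type mistakes before anything runs:&lt;/p&gt;

```python
from typing import List

def split_batches(items: List[int], batch_size: int) -> List[List[int]]:
    """Split items into consecutive batches of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# A type checker rejects this call before runtime, and an IDE can
# autocomplete the parameters from the annotations:
# split_batches("not a list", batch_size=2)  # mypy: incompatible type
batches = split_batches([1, 2, 3, 4, 5], batch_size=2)
```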

&lt;h2&gt;
  
  
  What lies ahead
&lt;/h2&gt;

&lt;p&gt;It has been a huge undertaking to rework the entire ZenML API but we're super happy with how it turned out (join our &lt;a href="https://zenml.io/slack-invite/"&gt;Slack&lt;/a&gt; to let us know if you agree or have some suggestions on how to improve it)!&lt;/p&gt;

&lt;p&gt;A few features from previous versions of ZenML are, however, still missing, but now that we have a solid foundation to work on, it should be a quick process to reintegrate them. So keep your eyes open for future releases and make sure to &lt;a href="https://github.com/zenml-io/zenml/discussions/categories/roadmap"&gt;vote&lt;/a&gt; on your favorite feature of our &lt;a href="https://zenml.io/roadmap"&gt;roadmap&lt;/a&gt; to make sure it gets implemented as soon as possible.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Michael Schuster is a Machine Learning Engineer at ZenML.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>pipelines</category>
    </item>
  </channel>
</rss>
