Patrick Titzler for IBM Developer

Posted on Oct 20, 2020 • Edited on Mar 19, 2021 • Originally published at developer.ibm.com

Automate your machine learning workflow tasks using Elyra and Kubeflow Pipelines

#kubernetes #jupyter #machinelearning #kubeflow

If you are a data scientist, Jupyter Notebooks are probably among the "tools" you use most frequently to get the job done. Whether you are loading or processing data, analyzing data, using data to train a model, or performing other tasks of the data science workflow, notebooks are usually key.

Let's say you've created a set of notebooks that load, cleanse, and analyze time-series data, which is made available periodically. Instead of running each notebook manually (or performing all tasks in a single notebook, which limits reusability of task-specific code), you could create and run a reusable machine learning pipeline.

With Elyra you can do this in JupyterLab or leverage Kubeflow Pipelines, a popular platform for building and deploying portable, scalable machine learning workflows based on Docker containers.

In Elyra 2.1 (and later releases) you can also run pipelines on Apache Airflow, as outlined in this blog post.

Before I outline how you can create and run pipelines, here is a bit of background information.

JupyterLab can be extended using extensions, making it possible for anybody to customize the user experience by providing new functionality (like a CSV file editor or a visualization), integrating services (like git for sharing and version control), or themes.

Elyra is a set of AI-centric extensions for JupyterLab that aim to simplify and streamline day-to-day activities. Its main feature is the Visual Pipeline Editor, which enables you to create workflows from Python notebooks or scripts and run them locally in JupyterLab, on Kubeflow Pipelines, or on Apache Airflow.

Assembling a pipeline

Pipelines are assembled in Elyra using the Visual Pipeline Editor. The pipeline assembly process generally involves

creating a new pipeline,
adding Python notebooks or Python scripts and defining their runtime properties, and
connecting the notebooks and scripts to define execution dependencies.

Creating a pipeline

Pipelines are created by opening the Elyra Pipeline Editor from the Launcher.

Adding Python notebooks and scripts to the pipeline

Python notebooks and scripts are added to the pipeline by dragging them from the JupyterLab File Browser onto the canvas.

Each notebook or file is represented by a node that includes an input port and an output port.

Runtime properties, which can be accessed via the context menu, define the execution environment (Docker image) in which the notebook or script is run during remote execution, inputs (file dependencies and environment variables), and output files.
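
To make the inputs and outputs more concrete, here is a minimal sketch of how a Python script node might read its settings from environment variables and write an output file that would be declared in the node's runtime properties. The variable names DATASET_URL and OUTPUT_FILE, the default values, and both helper functions are hypothetical and chosen for illustration only; Elyra does not define them.

```python
import os
import tempfile
from pathlib import Path


def load_settings(env=os.environ):
    """Read node settings from environment variables, with fallbacks.

    DATASET_URL and OUTPUT_FILE are hypothetical names used for illustration.
    """
    return {
        "dataset_url": env.get("DATASET_URL", "https://example.com/data.csv"),
        "output_file": env.get("OUTPUT_FILE", "processed.csv"),
    }


def write_output(settings, rows, work_dir="."):
    """Write rows to the output file named in the settings and return its path."""
    target = Path(work_dir) / settings["output_file"]
    target.write_text("\n".join(rows) + "\n")
    return target


if __name__ == "__main__":
    settings = load_settings()
    with tempfile.TemporaryDirectory() as tmp:
        path = write_output(settings, ["a,b", "1,2"], work_dir=tmp)
        print(f"Wrote {path.name} for {settings['dataset_url']}")
```

Reading settings with a fallback keeps the script runnable outside of a pipeline, while the environment variables defined on the node can override the defaults during a pipeline run.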

Nodes can optionally be associated with comments to describe their purpose.

Defining dependencies between notebooks and scripts

Dependencies between notebooks or scripts are defined by connecting output ports (on the right side of a node) to input ports (on the left side of a node).

Dependencies are used to determine the order in which the nodes will be executed during a pipeline run.

The following rules are applied:

circular dependencies are not allowed
if two nodes are not connected (directly or indirectly), they can be executed in parallel
if two nodes are connected, the node producing the inputs for the other node is executed first
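
The rules above describe a topological ordering of a directed acyclic graph. The following sketch shows one way such an ordering could be computed with Python's standard graphlib module (Python 3.9 or later). It is an illustration of the rules, not Elyra's actual implementation, and the function name and notebook file names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def execution_batches(dependencies):
    """Group nodes into batches; nodes within a batch do not depend on each other.

    `dependencies` maps each node to the set of nodes it depends on.
    Raises ValueError if the graph contains a circular dependency.
    """
    sorter = TopologicalSorter(dependencies)
    try:
        sorter.prepare()  # raises CycleError on circular dependencies
    except CycleError as err:
        raise ValueError(f"circular dependency: {err.args[1]}") from err

    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to get stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical pipeline: two independent loaders feed one analysis notebook.
pipeline = {
    "load_a.ipynb": set(),
    "load_b.ipynb": set(),
    "analyze.ipynb": {"load_a.ipynb", "load_b.ipynb"},
}
print(execution_batches(pipeline))
# [['load_a.ipynb', 'load_b.ipynb'], ['analyze.ipynb']]
```

The two loader notebooks land in the same batch because neither depends on the other, so they could run in parallel; the analysis notebook has to wait for both.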

There are some distinct differences between how pipelines are executed in JupyterLab and on a third-party workflow orchestration framework, such as Kubeflow Pipelines.

Running pipelines in JupyterLab

Pipelines can be executed in JupyterLab as long as the environment provides access to the pipeline's prerequisites. For example, the kernels that the notebooks are associated with must already be installed, as must any required packages that the notebooks do not install themselves.

Running pipelines in the JupyterLab environment should be a viable approach if

you are assembling a new pipeline and are testing it using relatively small data volumes
pipeline tasks don't require hardware resources in excess of what's available
pipeline tasks complete in an acceptable amount of time, given existing resource constraints.

Each node is executed as a sub-process in the JupyterLab environment, and nodes are always processed sequentially.

Output files (like processed data files or training artifacts) are stored in the local file system and can be accessed using the JupyterLab File Browser.

Processed notebooks are updated in place, meaning their output cells reflect the execution results.

Python script output, such as messages sent to STDOUT or STDERR, is displayed in the JupyterLab console.
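
As a rough illustration of sequential sub-process execution, the sketch below runs Python scripts one after another and captures what each writes to STDOUT and STDERR. It is a simplified stand-in, not how Elyra executes nodes internally; the function name, the throwaway scripts, and the choice to stop at the first failure are assumptions made for this example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_sequentially(script_paths):
    """Run each script in its own Python sub-process, one after another.

    Returns a list of (path, returncode, stdout, stderr) tuples.
    """
    results = []
    for path in script_paths:
        proc = subprocess.run(
            [sys.executable, str(path)],
            capture_output=True,
            text=True,
        )
        results.append((str(path), proc.returncode, proc.stdout, proc.stderr))
        if proc.returncode != 0:
            break  # assumption for this sketch: a failed node stops the run
    return results


# Two throwaway scripts standing in for pipeline nodes (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "load.py"
    second = Path(tmp) / "analyze.py"
    first.write_text("print('loaded')\n")
    second.write_text(
        "import sys\nprint('analyzed')\nprint('warning', file=sys.stderr)\n"
    )
    demo_results = run_sequentially([first, second])

for path, code, out, err in demo_results:
    print(Path(path).name, code, out.strip(), err.strip())
```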

Elyra currently does not provide pipeline monitoring capability in the JupyterLab UI aside from a message after processing has completed. However, the relevant information is contained in the JupyterLab console output.

To learn more about how to create a pipeline and run it in JupyterLab take a look at the Running notebook pipelines in JupyterLab tutorial.

Running pipelines on Kubeflow Pipelines

While running pipelines locally might be feasible in some scenarios, it's rather impractical if large data volumes need to be processed or if compute tasks require special purpose hardware like GPUs or TPUs to perform resource intensive calculations.

Elyra can be configured to access Kubeflow Pipelines instances (secured and unsecured) by defining a runtime configuration. Once you have defined one, you select it when you run the pipeline.

The main difference between local pipeline execution and execution on Kubeflow Pipelines is that each node is processed in an isolated Docker container, allowing for better portability, scalability, and manageability.

The following chart illustrates this for two dependent notebook nodes.

Data is exchanged between nodes using S3-compatible cloud storage. Before a notebook or Python script is executed the declared input file dependencies are automatically downloaded from cloud storage into the container. After processing has completed the declared output files are automatically uploaded from the container to cloud storage.
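
The download, run, upload cycle described above can be sketched as follows. To keep the example self-contained, an in-memory dictionary stands in for the S3-compatible storage and a temporary directory stands in for the container file system; the class, function, and file names are all hypothetical, and this is not Elyra's actual implementation.

```python
import tempfile
from pathlib import Path


class InMemoryObjectStore:
    """Stand-in for S3-compatible storage: maps object keys to bytes."""

    def __init__(self):
        self._objects = {}

    def upload(self, key, path):
        self._objects[key] = Path(path).read_bytes()

    def download(self, key, path):
        Path(path).write_bytes(self._objects[key])


def run_node(store, inputs, outputs, task):
    """Download declared inputs, run the task, then upload declared outputs."""
    # The temporary directory plays the role of the container file system.
    with tempfile.TemporaryDirectory() as work_dir:
        work = Path(work_dir)
        for name in inputs:
            store.download(name, work / name)
        task(work)
        for name in outputs:
            store.upload(name, work / name)


def load(work):
    """First node: produce a raw data file."""
    (work / "raw.csv").write_text("3\n1\n2\n")


def cleanse(work):
    """Second node: read the raw file and write a sorted copy."""
    values = sorted((work / "raw.csv").read_text().split())
    (work / "clean.csv").write_text("\n".join(values) + "\n")


store = InMemoryObjectStore()
run_node(store, inputs=[], outputs=["raw.csv"], task=load)
run_node(store, inputs=["raw.csv"], outputs=["clean.csv"], task=cleanse)
```

Because each node starts in an empty working directory, the second node only sees raw.csv if the first node declared it as an output and the second declared it as an input, which is why declaring file dependencies matters for remote execution.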

Since Elyra is not yet a mature project, it currently relies on the Kubeflow Pipelines UI for pipeline execution monitoring.

Additional details, along with step-by-step instructions, can be found in the Running notebook pipelines on Kubeflow Pipelines tutorial.

Ways to try Elyra (and pipelines)

The referenced tutorials are a great way to get started with pipelines. If you are looking for a more complex example, this COVID-19 pipeline might fit the bill.

If you'd like to try out Elyra and start building your own pipelines, you have three options, outlined in the sections below:

Use a sandbox environment hosted on the cloud
Use Elyra Docker images
Install JupyterLab and Elyra on your local machine

Kubeflow Pipelines is not included in any of the Elyra installation options.

Running Elyra in a sandbox environment on the cloud

You can test drive Elyra on mybinder.org, without having to install anything. Follow this link to try out the latest stable release or the latest development version (if you feel adventurous) in a sandbox environment.

The sandbox environment contains a getting_started markdown document, which provides a short tour of the Elyra features.

A couple of things to note:

Performance can sometimes be sluggish since this is a shared environment.
The sandbox environment is not persisted and any changes you make will be lost when it is shut down.

If you have Docker installed on your machine consider using one of the Docker images instead.

Running Elyra container images

The Elyra community publishes ready-to-use container images on Docker Hub, which have JupyterLab and the Elyra extension pre-installed. The images are approximately one GB in size and are tagged as follows:

elyra/elyra:latest is the latest stable release
elyra/elyra:x.y.z has the x.y.z release installed
elyra/elyra:dev is automatically built each time a change is committed.

Once you've decided which image to use (elyra/elyra:latest is always an excellent choice because you won't miss out on the latest features!), you can spin up a sandbox container as follows:

docker run -it -p 8888:8888 elyra/elyra:latest jupyter lab

Open your web browser to the displayed URL and you are ready to start.

To access the notebook, open this file in a browser:
        file:///home/jovyan/.local/share/jupyter/runtime/nbserver-6-open.html
    Or copy and paste one of these URLs:
        http://4d17829ecd4c:8888/?token=d690bde267ec75d6f88c64a39825f8b05b919dd084451f82
     or http://127.0.0.1:8888/?token=d690bde267ec75d6f88c64a39825f8b05b919dd084451f82

The caveat is that in sandbox mode you cannot access existing files (such as notebooks) on your local machine, and all changes you make are discarded when you shut down the container.

Therefore, it's better to launch the Docker container like so, replacing ${HOME}/jupyter-notebooks/ and ${HOME}/jupyter-data-dir with the names of existing local directories:

docker run -it -p 8888:8888\
 -v ${HOME}/jupyter-notebooks/:/home/jovyan/work\
 -w /home/jovyan/work\
 -v ${HOME}/jupyter-data-dir:/home/jovyan/.local/share/jupyter\
 elyra/elyra:latest jupyter lab

This way all changes are preserved when you shut down the container and you won't have to start from scratch when you bring it up again.

Installing Elyra locally

If your local environment meets the prerequisites, you can install JupyterLab and Elyra using pip, conda, or from source code, following the instructions in the installation guide.

Closing thoughts

Elyra is a community-driven effort. We welcome contributions of any kind: bug reports, feature requests, and of course pull requests. Head on over to https://github.com/elyra-ai/elyra and let's get started.

Originally published on IBM Developer.
