Jesse Williams for KitOps

Originally published at jozu.com

A step-by-step guide to building an MLOps pipeline

MLOps pipelines arose because traditional DevOps pipelines struggle to manage machine learning (ML) development workflows. ML workflows demand iterative interaction with, and management of, numerous components and processes across development and production. MLOps pipelines tick these boxes by coordinating workflows that span multiple pipelines, whereas DevOps handles a workflow through a single pipeline.

Nonetheless, the MLOps approach of using multiple pipelines brings substantial development and integration complexity. Simplifying the multi-pipeline approach into something more unified, like a DevOps pipeline, can mitigate those complexities and make ML workflow execution less challenging.

This blog post walks you through the steps and processes of building an MLOps pipeline. Then, it introduces you to an open source tool that allows you to package models and model assets as an OCI-compatible artifact that seamlessly integrates with your existing DevOps pipeline. With this approach, you can avoid the difficulty of maintaining multiple pipelines and continue using the tools you already have in place.

What is an MLOps pipeline?

An MLOps pipeline automates the interconnected set of processes in the machine learning lifecycle. It closes the gaps in the transition from development to production through automation and, in essence, aligns ML components with the rest of the system's components and workflows.

Existing MLOps tools provide methods to automate, orchestrate, and manage ML workflows within MLOps pipelines, but they each have unique mechanisms for delivering their solutions. This lack of standard delivery formats and other inherent ML system shortcomings, such as dynamic data and model dependencies, hidden feedback loops, and workflow entanglements, can make it challenging to manage and integrate the MLOps pipeline with the rest of the system.

Building an MLOps pipeline

The maturity level of your MLOps pipeline significantly influences its robustness, design, and architecture. High maturity levels often imply more streamlined workflows and less friction at the operation and production handoff points. However, optimizing MLOps pipelines has less to do with maturity level, and more to do with the ML workflows and tools of choice.

A combination of sub-pipelines (data, model, and code pipelines) makes up the MLOps pipeline. Managing each of these sub-pipelines and their intertwined processes—data validation, model versioning, experiment tracking, model serving, monitoring, and retraining—makes the MLOps pipeline a tedious endeavor without automation. The MLOps pipeline comprises two phases:

  • Development or experimentation phase
  • Automation and deployment phase

Phase 1: Development or experimentation phase

The MLOps pipeline typically starts in a notebook where you build and optimize an ML model that gives you an empirical solution to your problem based on patterns learned from training data. This phase commonly involves two processes:

  • Data preparation
  • Model development and training

Step 1: Data preparation

Data preparation is the bedrock of ML development. It entails tailoring your data to suit your ML task. The processes involved in data preparation and transformation include:

  • Data ingestion
  • Data cleaning/preprocessing
  • Feature engineering

Data ingestion: You need reliable data sources that align with the ML project's goals. Typical sources include databases (data warehouses), feature stores, or APIs. ML data comes in different forms, some structured, like tabular data, and others unstructured, like images and videos. The type of ML task also determines the best data preprocessing and feature engineering decisions.
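As a rough illustration, here is a minimal Python ingestion sketch for a tabular dataset; the connection string, table, file paths, and column names are placeholders, not part of any specific project setup.

# Hedged ingestion sketch: pull tabular data from a warehouse table and a CSV export,
# then combine them into one raw dataset. All names below are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@warehouse-host/analytics")  # placeholder DSN
measurements = pd.read_sql("SELECT * FROM wine_measurements", engine)
labels = pd.read_csv("data/quality_labels.csv")

raw = measurements.merge(labels, on="sample_id", how="inner")
raw.to_parquet("data/raw/wine_quality.parquet", index=False)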

Data preprocessing: Ingested data is rarely in the best condition for you to train your model. Your data should be comprehensive and unbiased to avoid skewed model outcomes. The type of ML data determines the best data preprocessing to employ. For instance, standard techniques for preprocessing tabular data include normalizing numerical features to ensure they have a similar scale or encoding categorical features into numerical representations. Preprocessing contributes to improving model convergence and performance.
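For tabular data, a minimal preprocessing sketch with scikit-learn might look like the following; the file path and column names are assumptions for illustration only.

# Hedged preprocessing sketch: scale numeric features and encode categorical ones.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

raw = pd.read_parquet("data/raw/wine_quality.parquet")   # hypothetical path
numeric_cols = ["fixed_acidity", "residual_sugar", "alcohol"]
categorical_cols = ["region"]

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), numeric_cols),                              # similar scale for numeric features
    ("encode", OneHotEncoder(handle_unknown="ignore"), categorical_cols),   # numeric representation of categories
])

X = preprocess.fit_transform(raw[numeric_cols + categorical_cols])
y = raw["quality"]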

Feature engineering: To capture relevant aspects of the problem, you create new data features or transform existing ones to suit the ML task. Well-engineered features increase your chances of producing trained models with good performance. Some techniques include aggregating features, removing irrelevant features, or leveraging domain knowledge to extract valuable insights.
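Continuing the illustrative dataset above, feature engineering might derive a ratio column and drop an identifier before the preprocessing transform is fitted; the ratio feature is purely hypothetical.

# Hedged feature-engineering sketch (same illustrative columns as the preprocessing example).
raw["sugar_to_alcohol"] = raw["residual_sugar"] / raw["alcohol"]   # domain-inspired ratio feature
raw = raw.drop(columns=["sample_id"])                              # identifiers add no predictive value
# Any derived columns would then be added to numeric_cols before fitting the preprocessing step.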

Step 2: Model development and training

Combining the data preparation steps forms a data transformation pipeline that automatically produces training data when applied to data from the source. This ensures consistent training data without affecting the source data and allows you to focus on the model training pipeline.

Training an ML model is the most experimental and iterative process of the ML development lifecycle. The aim is to identify the best model by conducting several training experiments on the transformed data using various model architectures and hyper-parameters.

Experiment tracking tools like MLflow, Weights & Biases, and Neptune.ai provide a pipeline that automatically tracks the metadata and artifacts generated from each experiment you run. Although their features and functionality vary, experiment tracking tools provide a systematic structure that supports this iterative model development approach.

Experiment tracking pipelines make it easier to compare different model versions, hyper-parameters, and results against the model evaluation metric. After evaluating the experiments, the pipeline sends the best-performing models and artifacts to the model repository for deployment.
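As one possible shape for this, here is a minimal sketch using MLflow (one of the trackers named above); the model family, hyper-parameter grid, and metric are assumptions, not a prescribed setup.

# Hedged experiment-tracking sketch: run a few training experiments and log each one to MLflow.
import mlflow
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

mlflow.set_experiment("wine-quality")
for n_estimators in (100, 300):
    with mlflow.start_run():
        model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
        model.fit(X_train, y_train)
        val_mae = mean_absolute_error(y_val, model.predict(X_val))

        mlflow.log_param("n_estimators", n_estimators)   # hyper-parameter for this run
        mlflow.log_metric("val_mae", val_mae)            # metric used to compare runs
        mlflow.sklearn.log_model(model, "model")         # candidate model artifact

Comparing runs then comes down to sorting by the logged metric in the tracking UI and promoting the best run's model to the registry.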

Phase 2: Automation and deployment phase

This phase focuses on streamlining the process of delivering machine learning models from the development phase into a production pipeline in a reliable, scalable, and repeatable manner. It also aims to facilitate collaboration between data scientists, engineers, and operations teams through standardized processes and tools. It comprises the following:

  • Version control and reproducibility
  • Deployment and infrastructure
  • Continuous Integration, Continuous Training, and Continuous Delivery/Deployment (CI/CT/CD)
  • Monitoring and performance tracking

Step 3: Version control and reproducibility

Version control enables you to build reproducible, reliable, and auditable MLOps pipelines that deliver high-quality machine learning models. However, if done incorrectly, versioning can get messy since pipelines populate the code, data, and model in different locations.

Versioning ushers in the operational layer of the MLOps pipeline and facilitates team collaboration during development. The pipelines are, technically, source code that produces data and models. Versioning the code pipelines with a version control system like Git makes it easy to establish lineage between the code, the data, and the models, because after running the transformation and model training pipelines, the models can be stored in a model registry and the data in a feature store.

The metadata and model artifacts from experiment tracking can contain large amounts of data, such as trained model files, data files, metrics and logs, visualizations, configuration files, checkpoints, etc. In cases where the experiment tracking tool doesn't support data storage, an alternative is to track the training and validation data versions per experiment. Teams typically use remote storage systems such as S3 buckets, MinIO, or Google Cloud Storage, or data versioning tools like Data Version Control (DVC) or Git LFS (Large File Storage), to version and persist the data. These options facilitate collaboration but have artifact-to-model traceability, storage cost, and data privacy implications.
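One lightweight way to keep that model-to-data lineage is to record a content hash of the dataset with each run; the sketch below assumes MLflow is the tracker and the training data lives in a single file, both of which are assumptions for illustration.

# Hedged lineage sketch: hash the training data and attach the hash to the experiment run,
# so any model version can be traced back to the exact data it was trained on.
import hashlib
import mlflow

def file_digest(path: str) -> str:
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha.update(chunk)
    return sha.hexdigest()

with mlflow.start_run():
    mlflow.set_tag("training_data_sha256", file_digest("data/raw/wine_quality.parquet"))
    # ...train and log the model as in the experiment-tracking sketch above...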

Step 4: Deployment and infrastructure

ML development and deployment pipelines have different requirements that dictate the choice of infrastructure, deployment targets, and strategies. For instance, the data transformation pipeline may require distributed processing in development but not in production, and the model pipeline may need GPUs for training in development but only CPUs for inference in production. Team skill level also drives the choice of deployment targets and strategies.

ML deployment pipelines usually consist of a model, data transformation and validation pipelines, endpoints, and other dependencies. It is therefore essential to keep the deployment configuration and environment lightweight and efficient, which makes packaging with containers ideal for abstracting deployment complexities.

After packaging these components for production, it becomes easier to scale them for your deployment mode on your infrastructure. ML deployment is either offline or online. In offline (batch) inference, the model doesn't need to be running continuously, so infrastructure pipelines don't prioritize auto-scaling, low latency, or high availability. In online deployment, the model is always up and running, deployed as either a web or streaming service, so the infrastructure setup must address auto-scaling, low latency, and high availability.
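For the online case, a minimal serving sketch with FastAPI might look like this; the model file, feature names, and route are placeholders rather than a prescribed layout.

# Hedged online-serving sketch: load a trained model and expose a prediction endpoint.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("models/wine_model.joblib")   # hypothetical model file baked into the container image

class WineFeatures(BaseModel):
    fixed_acidity: float
    residual_sugar: float
    alcohol: float

@app.post("/predict")
def predict(features: WineFeatures):
    row = [[features.fixed_acidity, features.residual_sugar, features.alcohol]]
    return {"quality": float(model.predict(row)[0])}

Packaged in a container behind an auto-scaler, an endpoint like this covers the low-latency, high-availability requirements described above.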

Step 5: Continuous Integration, Continuous Training, and Continuous Delivery/Deployment (CI/CT/CD)

Manually executing source code changes, data validation, model retraining, model evaluation, and deployment workflows for data and models is tiring and error-prone. The ML components interact in such a way that changes occurring in any of them affect the rest. For example, if there is a feature update in the data transformation pipeline, the model weights will also change, producing a new model.

MLOps slightly modifies the traditional DevOps CI/CD practice with an additional pipeline called continuous training (CT). The CI/CT/CD pipeline for MLOps orchestrates a series of automated steps to streamline the development, training, testing, and deployment of machine learning models. Automating these processes enables efficient model deployment. Standard automation tools include Jenkins, GitLab CI, Travis CI, and GitHub Actions. You typically set up the MLOps CI/CT/CD pipeline around triggers that drive the automation strategy:

Continuous Integration (CI) automates the testing and validation of changes to the codebase. A run is triggered by new commits or merges that change code, model versions, the data transformation pipeline, training scripts, or configuration files. The pipeline then builds the project environment and runs unit tests to verify code quality and correctness, and the model is validated on a validation dataset to ensure it meets the desired criteria. Finally, the model artifacts, including model files, dependencies, and metadata, are pulled from the model registry and packaged into a deployable format, such as a Docker container.

The Continuous Delivery/Deployment (CD) pipeline deploys the trained model to a staging environment for further testing and validation under near-production conditions, encompassing comprehensive performance, load, and stress testing. Once approved, the model is deployed to the production pipeline.

The Continuous Training (CT) pipeline triggers retraining based on feedback from the deployed model and production data monitoring logs. The CT automation strategy initiates runs on specific criteria, such as new user data availability, model performance degradation, or scheduled intervals. It validates the latest data against the expected schema and values, then retrains and validates the model.
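To make the model-validation step concrete, here is a hedged sketch of a gate script a CI or CT job could run; the paths, metric, and threshold are assumptions you would tune for your own pipeline.

# Hedged validation-gate sketch: score the candidate model on held-out data and
# fail the build (non-zero exit) if it misses the acceptance threshold.
import sys
import joblib
import pandas as pd
from sklearn.metrics import mean_absolute_error

MAX_VAL_MAE = 0.55   # hypothetical acceptance threshold

model = joblib.load("models/wine_model.joblib")
val = pd.read_parquet("data/validation/wine_quality.parquet")
mae = mean_absolute_error(val["quality"], model.predict(val.drop(columns=["quality"])))

print(f"validation MAE: {mae:.3f}")
if mae > MAX_VAL_MAE:
    sys.exit(1)   # blocks the deployment stage of the pipeline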

Step 6: Monitoring and performance tracking

Model monitoring is crucial for maintaining the performance and reliability of machine learning models in production. It involves tracking key model performance metrics such as accuracy, precision, recall, etc., to identify deviations from expected behavior, detect data and model drift, and ensure consistent model performance. This process provides actionable insights into the model's real-world impact and helps you make informed decisions about retraining or updating models.

You specify thresholds and metrics to detect anomalies and potential issues in data quality, feature distributions, and model outputs. Regular monitoring and analysis let you proactively address performance degradation, ensuring that machine learning models continue to deliver accurate predictions and drive business value. Because this holistic assessment of model performance differs from tracking business metrics, effective model monitoring relies on specialized tools and techniques. Some of these techniques include (one of them is sketched after the list):

  • Use techniques like SHAP or LIME to interpret model predictions and understand influencing factors.
  • Monitor changes in input data distribution using statistical tests or drift detection algorithms.
  • Track changes in feature-target relationships using methods like comparing predictions or monitoring feature importance.
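As a minimal example of the statistical-test approach from the list above, the sketch below compares a live feature distribution against the training distribution with a Kolmogorov-Smirnov test; the file paths, columns, and significance threshold are illustrative.

# Hedged drift-check sketch: flag features whose recent distribution diverges from training data.
import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_parquet("data/raw/wine_quality.parquet")
live = pd.read_parquet("data/monitoring/last_24h.parquet")   # hypothetical log of recent inputs

for column in ["alcohol", "residual_sugar"]:
    stat, p_value = ks_2samp(train[column], live[column])
    if p_value < 0.01:   # threshold is a judgment call; treat it as an alert, not an auto-retrain
        print(f"possible drift in '{column}' (KS statistic={stat:.3f}, p={p_value:.4f})")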

Introducing KitOps: an open source solution to ease MLOps workflows

One of the main reasons teams struggle to build and maintain their MLOps pipelines is vendor-specific packaging. As a model is handed off between data science, app development, and SRE/DevOps teams, each team has to repackage the model to work with its own toolset. This is tedious, and it stands in contrast to well-established development processes, where teams have standardized on containers to ensure that project definitions, dependencies, and artifacts are shared in a consistent format. KitOps is a robust and flexible tool that addresses exactly these shortcomings in the MLOps pipeline. It packages the entire ML project in an OCI-compliant artifact called a ModelKit. ModelKits are designed with flexible development attributes to accommodate ML workflows, offering more convenient processes for ML development than DevOps pipelines alone. Some of these benefits include:

  • Simplified versioning and sharing of large, unstructured datasets, making them easier to manage.
  • Synchronized data, model, and code versioning to mitigate reproducibility issues during AI/ML development.
  • Packaging of key ML components in standard formats that enhance compatibility and efficient deployment.
  • Openness and interoperability that avoid vendor lock-in by leveraging OCI standards (i.e., a format native to container technology).

If you've found this post helpful, support us with a GitHub Star!

Use DevOps pipelines for MLOps with KitOps

Existing ML/AI tools focus on providing unique AI project solutions that then have to be integrated back into the MLOps pipeline. KitOps, by contrast, focuses on packaging your AI project using open standards, so all your AI project artifacts are compatible with your existing DevOps pipeline.

DevOps pipelines utilize container technologies to streamline development workflows by promoting a code-centric approach and enabling reuse across environments. These technologies have become industry standards, facilitating seamless integration, version control, collaboration, and efficient rollbacks for streamlined development processes.

Efficiently adapting them for ML workflows can deliver greater benefits than managing multiple pipelines for your code, data, and model workflows. KitOps' OCI-compliant format allows you to integrate your ML workflow seamlessly into your existing DevOps pipelines.

Package MLOps pipeline steps with KitOps

Let's say your team has an existing DevOps pipeline and is ready to execute and scale an AI initiative. Transitioning your DevOps pipelines and principles into MLOps pipelines to develop the AI solution has certain implications, such as:

  • Increased complexity
  • Integration challenges
  • Lack of standardization
  • Skillset gap
  • Higher costs
  • Knowledge silos
  • Increased time to market (TTM)

Instead, you can leverage KitOps' workflow to develop and collaborate on all your ML components from a single ModelKit package. The ModelKit abstracts the implementation of the ML development pipeline: while maintaining standards, you develop and manage all your ML components (i.e., code, data, and models) in the same location. Using a ModelKit for your MLOps workflow takes only a few steps.

Creating a ModelKit for your AI project starts with initializing its native configuration document, called a Kitfile. Create a Kitfile for your ModelKit in your development directory:

touch Kitfile 

Open the Kitfile and specify all the folders relevant to your ML development so KitOps knows what to track when you package your ModelKit image. This lets you maintain the structure and workflows of your local development environment. The Kitfile is a YAML file; here is a sample:

manifestVersion: v1.0.0
package:
  author: 
  - Nwoke
  description: This project is used to predict the quality of red wine
  name: WineQualityML

code:
- description: Jupyter notebook with model training code in Python
  path: ./code

datasets:
- description: Red wine quality dataset (tabular)
  name: data
  path: ./datasets

model:
  framework: scikit-learn
  name: Wine_model
  path: ./models
  version: 0.0.1
  description: Red wine quality model using Scikit-learn

Then, package the project from your development directory with an AI project name and a tag for the current development stage to create the ModelKit image in your local KitOps registry. The version and tag workflows enable consistency across code, data, and model, since they all exist in one location.

kit pack . -t "MODELKIT_IMAGE_NAME":"TAG"

The ModelKit automatically tracks and stores updates of the directories you specified in the Kitfile. At every development stage, you only need to repackage the ModelKit to track updates. Packaging your ModelKit from the development phase minimizes errors, ensures uniform practices, and even enables easy rollback if needed.

After packaging locally, you can push your ModelKit image from the local repository to your configured remote container registry. But first, you have to tag the ModelKit with the name of your configured remote registry to create a clear reference and destination for the image in the remote registry.

Tag your ModelKit with the remote registry reference:

kit tag "SOURCE_MODELKIT":"TAG" "TARGET_MODELKIT":"TAG" 

Then push it to your remote container registry:

kit push "REMOTE_REGISTRY"/"REPOSITORY_USERNAME"/"MODELKIT_IMAGE_NAME":"TAG"

Now, developers can reproduce your ML workflows or extract only the relevant assets for further development, testing, integration, or deployment, rather than collaborating on MLOps pipeline components scattered across different locations because code, data, and models don't share a supported format.

ModelKit assets have an efficient caching system. When you pack or unpack, KitOps creates a reference link to the model assets, so if you or other developers already have a copy of an asset locally, it reuses the reference instead of repacking or unpacking the asset. This referencing avoids duplicating model assets and keeps your ModelKit fast and light, even with large assets. KitOps also supports common MLOps tools, so you can use your local development environment as the persistent location for your artifacts, metadata, etc.

Anyone with container registry access can use your ModelKit image in the remote container registry for further development or deployment. They can directly pull the entire ModelKit.

kit pull "REMOTE_REGISTRY"/"REPOSITORY_USERNAME"/"MODELKIT_IMAGE_NAME":"TAG"

Or extract the model, data, code, or artifacts they need into their development or production environment.

kit unpack . "REMOTE_REGISTRY"/"REPOSITORY_USERNAME"/"MODELKIT_IMAGE_NAME":"TAG"

Without separate development pipelines, all your ML components live in one remote location, with versions at different development stages. The ML development pipeline becomes an image in a remote container registry.

At this point, you can implement a deployment strategy that extracts these components from the ModelKit into the staging or production environment whenever you push a new ModelKit. One recommended approach is to use KitOps tags to automate triggers for your production workflow.

KitOps' ModelKits let you seamlessly integrate your machine learning workflows into your existing DevOps pipeline. Packaging models, code, data, artifacts, etc., into a portable, OCI-compatible artifact eliminates the need for separate ML development pipelines while letting you keep familiar tools and processes. This unified approach simplifies AI development and deployment and accelerates the delivery of AI-powered applications, allowing you to focus on innovation rather than pipeline management. Embrace the power of ModelKits and unlock the full potential of your MLOps initiatives - try KitOps today!

Top comments (5)

Yasir Rehman

Interesting article, thanks for sharing it. :)

Ayush Thakur

kitOps sounds super interesting. Will try it out

Jesse Williams

Thanks! We hope you enjoy it and find value

Aravind Putrevu

Interesting, how it differs from Kubeflow?

Gorkem Ercan

KitOps simplifies MLOps by integrating ML workflows into existing DevOps best practices using a single, standardized artifact, reducing complexity and enhancing compatibility. Kubeflow, on the other hand, offers a comprehensive, Kubernetes-native platform for managing multiple ML pipelines and components. While comparing Kubeflow directly with KitOps may not be entirely appropriate, KitOps’ flexibility allows it to be used alongside different tools and methods for implementing ML pipelines, including Kubeflow.