Introduction
Machine learning model deployment often hits roadblocks when moving between environments. Version mismatches, file structure changes, and environment differences can derail even the best-planned deployments.
KitOps (a CNCF project backed by Jozu) offers a solution: ModelKits, standardized artifacts that declaratively package an ML model together with its dependencies and configuration. This open-source toolkit lets organizations, developers, and data scientists bundle their models (manually or in a CI/CD pipeline) into versionable, signable, and portable ModelKits, complete with YAML files for seamless deployment to Kubernetes and other container platforms. The result is consistent version tracking and reliable model artifacts across all environments.
Learning Objectives
- Understand what KitOps is and how it makes ML model packaging scalable
- Learn why pairing KitOps with Kubernetes is an obvious choice for deployment
- See how you can easily package a Hugging Face model into a ModelKit using KitOps
- Explore how Jozu, a registry built for ModelKits, simplifies Kubernetes deployments
- See why KitOps + Kubernetes is a game changer
What is KitOps?
KitOps is an open-source packaging toolkit that bundles your model, data, code, config, and prompt files into one portable artifact. Data scientists and developers can collaborate on the same project across different environments without worrying about model file structure changes, platform engineers can run the same artifact in Kubernetes, and nobody has to chase "it works on my machine" bugs or wonder whether they are using the correct dependencies.
KitOps is composed of three simple pieces:
1. Kitfile: It's a small YAML file that lists your code paths, datasets, runtime commands, and dependencies. You can see at a glance what your model needs.
2. ModelKit: This is the packaged artifact that includes code, weights, data, and Kitfile. It can be pushed to any OCI container registry like Docker Hub, Jozu Hub, GHCR, ECR, or Artifactory. Developers can treat it just like a Docker Image. You can tag it, version it, roll it back, sign it, and scan it like any other container.
3. Kit CLI: It allows you to pack, sign, push, and run ModelKits locally or in a CI/CD pipeline. The same commands work on macOS, Linux, or the build runner in your pipeline.
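To make that concrete, a typical Kit CLI round trip looks roughly like this (the registry path and tag are placeholders, and flags can vary slightly between Kit versions, so check kit <command> --help if anything differs):
# Package the current directory as described by its Kitfile
kit pack . -t jozu.ml/<user>/<repo>:v1
# Push the ModelKit to an OCI registry, then pull it somewhere else
kit push jozu.ml/<user>/<repo>:v1
kit pull jozu.ml/<user>/<repo>:v1
# Unpack the model, code, and datasets onto the local filesystem
kit unpack jozu.ml/<user>/<repo>:v1 -d ./my-workspace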
Why Use KitOps?
KitOps solves most of the problems software engineers hit when moving a model to production: it provides version control for model artifacts, a clear record of how they change, and consistency across environments.
Here are a few reasons why using KitOps' ModelKits can be a scalable option:
- Easy Collaboration: Back-end devs, data scientists, ML engineers, and SREs all pull the same ModelKit. No one wastes time rewriting paths or copying secret .env files.
- Reproducibility: The Kitfile pins code, data checksums, and even the Python entry point. So if the build says flan-t5-small @ sha256:..., that exact checkpoint is what runs in prod.
- Version Control: ModelKits stay in your container registry, so tags (0.3.1, qa-candidate, rollback-hotfix) work exactly like they do for Docker images.
- Data Protection: Cosign signing and provenance files keep tampered weights from sneaking in. Also, kitops-init can verify signatures before a pod ever starts.
- Cloud Agnostic Deployments: Whether you run Kind on a laptop, EKS in AWS, or an on-prem GPU node, the workflow is identical.
- Cost Effectiveness: Because weights stay in the ModelKit rather than the container image, rebuilding your inference image is faster, reducing overhead.
Exploring 2 Use Cases with KitOps + Jozu
The standout feature of KitOps is how easily it wraps your model, code, data, and config into a single ModelKit. From there, you can roll that same artifact straight into production, whether you prefer a quick Docker run on your laptop or a full Kubernetes rollout in the cloud with services like GKE or EKS. Let's walk through both sides of the story: first, packaging a ModelKit, then deploying it with just a couple of commands.
What you need:
- Latest KitOps CLI: Packs, pushes, signs, and unpacks ModelKits. Keep it current so you get signature verification and OCI-layout fixes.
- Jozu Hub account: It's your personal OCI registry for both ModelKits and the runtime images that Jozu builds for you (Jozu Rapid Inference Containers). Tags and Cosign signing are all built into the ecosystem.
- A model in Jozu Hub or Hugging Face: KitOps is source agnostic. Point the Kitfile at a local directory, or pull a pre-built ModelKit from Jozu, merge LoRA adapters, convert to GGUF, whatever you need before kit pack.
Install & check KitOps:
Head to the install page (https://kitops.org/docs/cli/installation/). Choose the guide for your OS (macOS, Linux, or Windows).
Verify the CLI is on your PATH: Once you follow the guide above and install KitOps, you can verify that the Kit CLI is up and running using the kit version command. The output shows the version details.
Sign Up for a Jozu Hub Sandbox Account: Once you have KitOps installed, it's time to create an account in Jozu—note that this is a sandbox account, and that Jozu Hub is typically installed on-prem for secure model development. Head to jozu.ml and click on Sign Up to get registered.
Once you are done with onboarding, you are ready to push your ModelKit. The official Jozu workflow is straightforward: pack → push → see it in your repo. No need to create a repository manually beforehand.
Log in from your terminal: Open a shell where the Kit CLI is installed and run kit login jozu.ml. It prompts you for your username (the email you registered with) and the password you created. When successful, it returns "Login successful."
Time to package your first ModelKit and ship it to Jozu Hub.
Part 1: Packaging Models with KitOps on Jozu
Before we think about Kubernetes or autoscaling, we need one clean, reproducible artifact that someone can pull locally or in the cloud, or in a Kubernetes cluster. That artifact is a ModelKit, and we will use KitOps to build it. Make sure you have Python installed locally on your system.
Here's a minimal folder layout we'll work from:
kitops-demo/
├── data/ # tiny.csv - 20-50 spam/ham examples
├── src/
│ ├── train.py # LoRA fine-tune script
│ └── app.py # FastAPI inference server (for local test)
├── requirements.txt # Python deps
└── (Kitfile) # written by `kit init` in a minute
That's all we need for now. One data file, two Python scripts, a requirements.txt, and soon a Kitfile. In the next steps, we'll (1) fine-tune the model, (2) package everything into a ModelKit, and (3) push it to Jozu Hub so anyone can pull the exact same artifact.
1. Set up a clean Python environment
Let's start with a Python environment and a requirements.txt file where we will define all our dependencies.
To create a virtual env use these commands:
python -m venv .venv && source .venv/bin/activate
Then create a requirements.txt file:
fastapi==0.104.1
uvicorn==0.24.0
pydantic==2.5.0
transformers==4.41.0
torch>=2.2.0
peft==0.7.0
datasets==2.14.0
accelerate==0.21.0
huggingface-hub==0.19.0
Then use:
pip install -r requirements.txt
to install all the dependencies. You now have everything needed to train a tiny FLAN-T5 model in a few minutes on the CPU.
2. Create a tiny demo dataset
Make a data/ folder and drop in a tiny.csv file with two columns:
text,label
"Free entry in 2 a wkly comp to win FA Cup final tkts ...",spam
"Hey how are you doing today?",ham
"WINNER!! As a valued network customer you have been selected ...",spam
"Can you pick up some milk on your way home?",ham
3. Fine-tune the Model with LoRA
We will then create our training script. Create a src folder that will contain the Python logic for training and running the model:
src/train.py
import datasets
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments
from peft import get_peft_model, LoraConfig, TaskType

BASE = "google/flan-t5-small"

# Load the tiny CSV and turn each row into a prompt/answer pair
ds = datasets.load_dataset("csv", data_files="data/tiny.csv")["train"]

def add_prompt(r):
    r["prompt"] = f"Classify as spam or ham: {r['text']}"
    r["answer"] = f"Answer: {r['label']}"
    return r

ds = ds.map(add_prompt)

tok = AutoTokenizer.from_pretrained(BASE)

def tok_fn(b):
    # Prompts become the encoder inputs, answers become the labels
    src = tok(b["prompt"], truncation=True, padding="max_length", max_length=128)
    tgt = tok(text_target=b["answer"], truncation=True, padding="max_length", max_length=8)
    src["labels"] = tgt["input_ids"]
    return src

ds = ds.map(tok_fn, batched=True).remove_columns(["text", "label", "prompt", "answer"])
ds.set_format("torch")

# Wrap the base model with a small LoRA adapter
model = AutoModelForSeq2SeqLM.from_pretrained(BASE)
model = get_peft_model(model, LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, r=8))

args = Seq2SeqTrainingArguments("ft-run", num_train_epochs=1,
                                per_device_train_batch_size=4)
trainer = Seq2SeqTrainer(
    model, args, train_dataset=ds,
    data_collator=DataCollatorForSeq2Seq(tok, model))
trainer.train()

# Merge the LoRA adapter back into the base weights so model-root/ holds a
# plain Transformers model that app.py (and the ModelKit) can load directly
model = model.merge_and_unload()
model.save_pretrained("model-root")   # flattened folder
tok.save_pretrained("model-root")
print("✅ LoRA fine-tune complete - weights in ./model-root")
In a nutshell, we take a tiny CSV of text messages, fine-tune Google's FLAN-T5 with LoRA, and save the new weights. We will use KitOps to bundle those weights + our code + a one-page YAML recipe into a ModelKit.
4. Training Our Model
We will run our script once:
python src/train.py
The command fine-tunes FLAN-T5 on the CSV, drops the new weights into model-root/, and prints a "finished" message when it's done.
5. Create a simple FastAPI inference
To run our model, we will create a simple FastAPI inference server so that we can interact with it via endpoints:
src/app.py
import os
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# MODEL_PATH lets the same script load weights from another location
# (e.g. /model/model-root inside the Kubernetes pod)
MODEL_DIR = os.getenv("MODEL_PATH", "model-root")
tok = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_DIR)
predict = pipeline("text2text-generation", model=model, tokenizer=tok)

app = FastAPI()

class Item(BaseModel):
    text: str

@app.post("/predict")
def _p(i: Item):
    out = predict(i.text, max_length=32)[0]["generated_text"]
    return {"input": i.text, "prediction": out}

@app.get("/health")
def _h():
    return {"ok": True}

if __name__ == "__main__":
    # Pass the app object directly so the script runs from any working directory
    uvicorn.run(app, host="0.0.0.0", port=8000)
6. Quick Local Smoke Test of our model
Before we pack or push anything, let's check if the model works. Run python src/app.py
The FastAPI server starts on http://localhost:8000. We will use this curl command to test out the endpoint:
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{"text": "Classify this text as spam or ham: FREE tickets just for you!"}'
If that works, the weights, tokenizer, and inference code are all in sync, exactly what we'll package with KitOps and ship to Jozu in the next step.
7. Create a Kitfile
Run this command in your terminal, from the project root (kitops-demo/):
kit init .
Open the generated Kitfile and edit the model path so it points at model-root/, checking that the code and dataset paths are listed too.
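For reference, a minimal Kitfile for this project could look like the sketch below; kit init may generate slightly different fields or extra metadata, so adjust the generated file rather than copying this verbatim (the package values are placeholders):
manifestVersion: "1.0"
package:
  name: text-classifier
  version: 0.1.0
  description: FLAN-T5-small fine-tuned to label SMS messages as spam or ham
model:
  name: flan-t5-small-spam
  path: ./model-root
  description: Merged LoRA fine-tune of google/flan-t5-small
code:
  - path: ./src
  - path: ./requirements.txt
datasets:
  - name: tiny-sms
    path: ./data/tiny.csv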
And we are good to go for the next step.
8. Pack and push to Jozu Hub
Before pushing your ModelKit to Jozu, make sure you have a Kitfile in place. We will package everything (code + weights + Kitfile) into a ModelKit layer using this command:
kit pack . -t jozu.ml/<user>/text-classifier:<Version_Tag>
Once we have successfully packed the ModelKit, we are ready to upload that layer to the Jozu repository:
kit push jozu.ml/<user>/text-classifier:<Version_Tag>
To understand what we did, let's break the push command down. A fully-qualified destination tag has four parts:
[registry address] / [user-or-org] / [repository name] : [tag]
     jozu.ml      / arnabchat2001 /  text-classifier   : 0.2.0
And once it's pushed successfully, your image will be visible in your Jozu Repository.
Like other OCI Images, we can sign our ModelKit as well. Signing your uploaded ModelKit with Cosign adds an extra layer of security, proving the model came from you and hasn't been tampered with.
It's optional, but highly recommended for any collaborative or production use. Run:
cosign generate-key-pair
then:
cosign sign jozu.ml/<user>/<repo>:<tag> --key cosign.key
You should do this after every push to make your ModelKit verifiable by others. In your repository in Jozu, it will now show a signed badge.
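Anyone with your public key can then verify the signature before using the ModelKit, for example:
cosign verify jozu.ml/<user>/<repo>:<tag> --key cosign.pub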
And it's all done. To do a sanity check, run:
kit inspect jozu.ml/<user>/text-classifier:<tag>
You should see your Kitfile along with the model-root/ contents: config.json, the tokenizer files, and the model weights (model.safetensors or pytorch_model.bin, depending on your transformers version).
If successful, you've built a beginner-sized ModelKit that is version-controlled, shareable, and ready for any runtime. Next, we will deploy that project using Kubernetes.
Part 2: Deploying a KitOps ModelKit on Kubernetes
Once your ModelKit is packaged and uploaded to Jozu Hub, the next step is to deploy it in a scalable, production environment. Jozu's deploy to Kubernetes feature makes this possible by orchestrating containers, automating deployments, and allowing seamless updates.
Before moving to Kubernetes, it's worth doing a quick local test to make sure your ModelKit works as expected. In Jozu Hub, open your ModelKit's page, select Deploy, under that select Docker, choose the appropriate runtime (e.g., Basic, Llama.cpp, vLLM), and copy the provided command. It will look like:
docker run -it --rm jozu.ml/arnabchat2001/text-classifier/basic:0.6.0
If your model serves an API, you can add -p 8000:8000 to map the port and then send a request to http://localhost:8000/predict to confirm it's working. This quick check ensures the ModelKit itself runs fine before you scale it up on Kubernetes.
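Putting those pieces together, the local check might look like this; copy the exact image reference from your own Deploy tab (the tag below is illustrative), and note that the port and path assume the FastAPI server we wrote earlier:
docker run -it --rm -p 8000:8000 jozu.ml/<user>/text-classifier/basic:<tag>
# In another terminal
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "Classify this text as spam or ham: FREE tickets just for you!"}'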
Here's a step-by-step walkthrough to deploy your ModelKit on Kubernetes.
1. Prerequisites
- A running Kubernetes cluster (we will use minikube locally for this tutorial)
- kubectl CLI configured and connected
- (Optional) Docker installed for a local cluster
- A ModelKit hosted on Jozu Hub
2. Installing the Requirements
Depending on your device, there are several ways to install these requirements. Follow the official Kubernetes documentation to install kubectl, and the minikube documentation to set up a local cluster.
Then verify the installations with kubectl version --client and minikube version.
3. Create a Kubernetes Namespace (Optional but Recommended)
Namespaces help keep things isolated, especially if you're running multiple models. If you create one, append -n kitops-demo to the kubectl commands that follow (the rest of this walkthrough assumes the default namespace for brevity):
kubectl create namespace kitops-demo
4. Prepare Deployment and Service YAML
This example follows the KitOps init-container pattern. Jozu Hub can generate ready-to-apply Kubernetes YAML for every ModelKit you push.
The exact manifest depends on the Deployment platform and Container type you choose.
Open your model's repository on Jozu and select the Deploy tab → Kubernetes. Pick a container type (e.g., KitOps Init Container for a lightweight custom runtime, or Basic / Llama.cpp / vLLM for prebuilt images), and copy the YAML.
Tweak only the app-specific bits instead of writing a manifest from scratch.
Note: If you choose a prebuilt image like Basic, you won't need the initContainers and volumes shown below.
For this example, we're using the KitOps init container and will create two YAML files inside the k8s folder:
- deployment.yaml – tells Kubernetes how to start your model
- service.yaml – exposes your API for access
k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: text-classifier
  labels:
    app: text-classifier
spec:
  replicas: 1
  selector:
    matchLabels:
      app: text-classifier
  template:
    metadata:
      labels:
        app: text-classifier
    spec:
      # --- Shared volume for model/code (init → app) ---
      volumes:
        - name: model-store
          emptyDir: {}
      # --- Comes from Jozu's init-container template ---
      initContainers:
        - name: kitops-init            # ← copy this value from Jozu Hub
          image: ghcr.io/kitops-ml/kitops-init:latest
          env:
            - name: MODELKIT_REF
              value: "jozu.ml/arnabchat2001/text-classifier:0.4.0"
            - name: UNPACK_PATH
              value: "/model"
            - name: UNPACK_FILTER
              value: "model,code"
          volumeMounts:
            - name: model-store
              mountPath: /model
      # ---------- Demo API Container ----------
      containers:
        - name: api
          image: python:3.9-slim
          command: ["/bin/bash"]
          args:
            - -c
            - |
              echo "Installing dependencies..."
              pip install --no-cache-dir fastapi uvicorn pydantic transformers torch peft datasets
              echo "Starting application..."
              cd /model/src
              python3 app.py
          env:
            - name: MODEL_PATH
              value: "/model/model-root"
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: model-store
              mountPath: /model
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 15
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          resources:
            requests: { cpu: 200m, memory: 1Gi }
            limits: { cpu: 1000m, memory: 2Gi }
k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: text-classifier
spec:
  selector:
    app: text-classifier
  ports:
    - port: 80
      targetPort: 8000
The deployment.yaml spins up a pod with two containers. First is an init container (kitops-init) that grabs the tagged ModelKit from Jozu Hub and unpacks both the model weights and the inference code into a shared volume.
Once that finishes, the main api container boots a light Python image, installs the required libraries, and launches the FastAPI server, reading the model files straight from that same volume. Readiness probes, CPU/memory limits, and a single replica keep the deployment predictable and easy to scale later.
The service.yaml turns that pod into an addressable endpoint inside the cluster. It selects any pod with app: text-classifier and forwards traffic from port 80 to the FastAPI port 8000. Internally, other workloads can hit http://text-classifier/; for local debugging, you simply run:
kubectl port-forward service/text-classifier 8080:80
and call http://localhost:8080/predict (or /health) just as you did locally.
5. Deploy to Kubernetes
Now, check that your local Kubernetes environment is up and running with the minikube status command. If it's not started, you can start it with minikube start.
Once we verify it's up and running, we will apply our manifests by running:
kubectl apply -f k8s/
This will apply both files from the directory.
Kubernetes will now start your pods; you can check the progress using:
minikube kubectl -- get pods
After a few minutes (the demo api container installs its Python dependencies, including torch, at startup), you should see READY 1/1.
If needed, you can check logs to ensure everything is running by using:
minikube kubectl -- logs <POD Name> -c api --tail=10
6. Expose Your Model with Port Forwarding
Once the service is running, we will enable port forwarding to access the API locally:
minikube kubectl -- port-forward deployment/text-classifier 8080:8000
Then test our deployed model at http://localhost:8080/. You can send requests to your model, just as if it were running locally.
7. Test the Deployed Endpoint
We will run a curl command to send a test payload to our running FastAPI server and check that the model is working properly:
curl -X POST "http://localhost:8080/predict" \
-H "Content-Type: application/json" \
-d '{"text":"Free money! Click here to win $1000 now!"}'
And we should get back a JSON response along these lines, with the prediction field flagging the message as spam:
{"input": "Free money! Click here to win $1000 now!", "prediction": "spam"}
which confirms the model is running correctly.
[Image 20: Terminal showing successful API response]
We can see that the model is able to correctly identify spam and ham, which confirms our entire workflow, from local training to packaging to remote deployment and live inference, is working as intended.
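To exercise the other class, you can send an obviously benign message as well; with such a tiny training set the output won't be perfect, but the prediction field should generally come back as ham:
curl -X POST "http://localhost:8080/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "Can you pick up some milk on your way home?"}'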
Why Use KitOps + Kubernetes?
Having tested other deployment options, you can see what makes the KitOps and Kubernetes combination different.
- Scalability: When KitOps is paired with Kubernetes, you can easily scale your model, so anyone can go from prototyping new features to pushing them live without hassle or downtime (see the sketch after this list).
- Version Control for Models: KitOps brings true version control to your ML workflow. Rolling back to an older model or rolling out a new one is as simple as switching a tag.
- Consistency Across Environments: KitOps packages everything your model needs into a ModelKit, so the artifact behaves the same whether you deploy locally or in the cloud.
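As a rough sketch of what that looks like day to day (the names match the Deployment from Part 2; the tag is whatever you have pushed):
# Scale out the inference Deployment
kubectl scale deployment/text-classifier --replicas=3
# Roll back to the previous Deployment revision
kubectl rollout undo deployment/text-classifier
# Or pin a different ModelKit version: edit MODELKIT_REF in k8s/deployment.yaml
# (e.g. jozu.ml/<user>/text-classifier:0.3.0) and re-apply
kubectl apply -f k8s/deployment.yaml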
Wrapping Up
KitOps provides a lightweight and flexible way to package machine learning models into deployable units, and it eliminates the usual headaches around versioning, file structures, and differences between environments. Paired with Kubernetes, it makes scalable ML deployments simple.
This article gives you a blueprint for using KitOps and Kubernetes to deploy your model: pulling a base model from Hugging Face, fine-tuning and packaging it as a ModelKit, pushing it to Jozu Hub, and deploying it to a Kubernetes cluster. KitOps makes the whole process seamless.
You can apply this process across various models even more easily with the KitOps feature that allows you to import Hugging Face models.
Finally, make sure your Kit CLI, Kubernetes, and all other tools are kept up to date for the best experience. And don't be afraid to experiment—KitOps and Kubernetes together can seriously upgrade your ML deployment experience. You might be surprised how much simpler your workflow becomes!