Jesse Williams for Jozu

Posted on • Originally published at jozu.com

Serving LLMs at Scale with KitOps, Kubeflow, and KServe

Introduction

Over the past few years, large language models (LLMs) have transformed how we build intelligent applications. From chatbots to code assistants, these models are used to power production systems across industries. But while training LLMs has become more accessible, deploying them at scale remains a challenge. Models generally come with gigabyte-sized weight files, depend on specific library versions, require careful GPU or CPU resource allocation, and need constant versioning as new checkpoints roll out. More often than not, a model that works in a data scientist's notebook can fail in production because of a mismatched dependency, a missing tokenizer file, or an environment variable that wasn't set.

KitOps (a CNCF project backed by Jozu) offers a solution: ModelKits, standardized artifacts that package an ML model with its dependencies and configuration. This open-source toolkit lets organizations, developers, and data scientists bundle their models into versionable, signable, and portable ModelKits that can be pushed to any OCI-compliant registry. The result is consistent version tracking and reliable model artifacts across all environments, bringing the same level of control we expect from software development to machine learning deployments.

In this guide, we'll show you how to combine KitOps with Kubeflow and KServe to serve large language models at scale. You'll learn how to package an LLM into a ModelKit, deploy it with KServe's inference endpoints, and let Jozu handle the orchestration, all without needing dedicated GPU hardware to follow along. For an even deeper dive into production ML on Kubernetes, you can download our full technical guide to Kubernetes ML.

Learning Objectives

  • Build and package a TensorFlow LLM model into a ModelKit using KitOps
  • Pack and push the ModelKit to Jozu, an OCI-compliant registry built for ModelKits
  • Set up Kubeflow and KServe to serve your model in production
  • Scale and secure your model deployments in production environments

Prerequisites and Setup

Before we start deploying LLMs at scale, let's make sure you have the right tools installed and configured. This section walks through everything you need: Python for running your model code, the KitOps CLI for packaging ModelKits, and a Jozu sandbox account for storing and managing your artifacts.

Install Python

For this project, you'll need Python 3.10 or above installed on your system. This ensures compatibility with modern ML libraries like TensorFlow and the dependencies we'll use throughout this guide. If you don't have Python installed yet, grab it from python.org and follow the installation steps for your operating system.
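
If you're not sure which version is installed, a quick check from your terminal confirms you're on 3.10 or newer before continuing:

python3 --version  # should print Python 3.10 or higher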

Install the KitOps CLI

The Kit CLI is what we'll use to pack, push, and manage ModelKits. Head over to the KitOps installation page and follow the installation method that matches your OS (macOS, Linux, or Windows).

Once you've installed the CLI, verify it's working by running:

kit version  

The output should show the version details.

Sign Up for Jozu

Jozu is your OCI-compliant registry for ModelKits. It's where you'll push packaged models and pull them during deployment. To get started with Jozu, head over to jozu.ml and click Sign Up to create an account. Make sure to note your username and password as you'll need them in the next step to authenticate your CLI.

Authenticate with Jozu

Now let's connect your local Kit CLI to your Jozu account. Open a terminal and run:

kit login jozu.ml  

You'll be prompted to enter your username (the email you registered with) and the password you created. If everything is set up correctly, you'll see a confirmation that the login succeeded.

Building a TensorFlow LLM Model

TensorFlow is one of the most popular open-source frameworks for building and training machine learning models. It was developed by Google, and it's particularly well-suited for production environments where you need scalable, efficient model serving across CPUs, GPUs, and TPUs.

TensorFlow shines in enterprise deployments, mobile applications, and in scenarios where you need tight integration with serving infrastructure. In this guide, we'll use TensorFlow to fine-tune a small T5 model that translates corporate jargon into plain language.

Set Up Your Project Directory

Let's start by creating a clean workspace for our model. Run these commands in your terminal to create your project directory:

mkdir corporate-speak  
cd corporate-speak  

Now create a Python virtual environment to keep dependencies isolated. Using a virtual environment is essential: it isolates the project's dependencies from your global Python installation, preventing conflicts with other projects and ensuring reproducible results:

python3 -m venv env  
source env/bin/activate  

Install Dependencies

Create a requirements.txt file in your project root with the following libraries:

tensorflow==2.19.1   
transformers==4.49.0  
huggingface-hub==0.26.0   
tf-keras  
fastapi  
uvicorn  
sentencepiece  

Install everything with:

pip install -r requirements.txt  

This pulls in TensorFlow for training, Transformers for the T5 model, FastAPI for serving later, and all the supporting libraries we'll need.
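
As an optional sanity check, assuming the install finished without errors, you can confirm the two heaviest libraries import cleanly before moving on:

python3 -c "import tensorflow as tf, transformers; print(tf.__version__, transformers.__version__)"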

Create the Training Data

Before we can train our model, we need some data. Create a data directory in your project root:

mkdir data  

Inside the data directory, create a file called corporate_speak.json and paste this training dataset:

[  
  {  
    "term": "Circle back",  
    "meaning": "We'll talk about this later because we don't want to deal with it right now."  
  },  
  {  
    "term": "Synergy",  
    "meaning": "Making two teams do one team's job, but with extra meetings."  
  },  
  {  
    "term": "Bandwidth",  
    "meaning": "How much energy or patience a person has left."  
  },  
  {  
    "term": "Low-hanging fruit",  
    "meaning": "The easiest task that still lets us look productive."  
  },  
  {  
    "term": "Touch base",  
    "meaning": "Talk briefly to pretend progress is being made."  
  },  
  {  
    "term": "Pivot",  
    "meaning": "Our original idea failed; let's rename it and try again."  
  },  
  { "term": "Going forward", "meaning": "Forget what we said last time." },  
  { "term": "Alignment", "meaning": "Make sure no one disagrees publicly." }  
]  

This small dataset gives the model eight examples of corporate jargon and their plain-language meanings. It's just enough to fine-tune T5 for our demonstration without requiring heavy compute resources.

Create the Training Script

Next, make a directory for your application code:

mkdir app  

Inside the app directory, create a file called train_llm.py and add this code:

import os
import json
import tensorflow as tf
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
DATA_PATH = os.path.join(BASE_DIR, "data", "corporate_speak.json")

print(f"Base Directory: {BASE_DIR}")
print(f"Data Path: {DATA_PATH}")

def load_data(file_path):
    """Loads JSON data from the specified file path."""
    try:
        with open(file_path, 'r') as f:
            data = json.load(f)
        print(f"Successfully loaded {len(data)} records from data file.")
        return data
    except FileNotFoundError:
        print(f"ERROR: Data file not found at {file_path}")
        print("Please ensure you have created the file 'corporate_speak.json' and the 'data' folder.")
        return None
    except json.JSONDecodeError:
        print(f"ERROR: Could not decode JSON from {file_path}. Check file format.")
        return None

DATA = load_data(DATA_PATH)
if DATA is None:
    exit()  # Stop if data loading failed

prompts = [f"term: {item['term']}" for item in DATA]
responses = [f"meaning: {item['meaning']}" for item in DATA]

MODEL_NAME = 't5-small'
MAX_LENGTH = 128
BATCH_SIZE = 4
LEARNING_RATE = 1e-5
EPOCHS = 15

print(f"\nLoading T5 model and tokenizer: {MODEL_NAME}...")
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = TFT5ForConditionalGeneration.from_pretrained(MODEL_NAME)

tokenized_inputs = tokenizer(
    prompts,
    return_tensors='tf',
    max_length=MAX_LENGTH,
    padding='max_length',
    truncation=True
)

tokenized_targets = tokenizer(
    responses,
    return_tensors='tf',
    max_length=MAX_LENGTH,
    padding='max_length',
    truncation=True
)

labels = tokenized_targets['input_ids']

dataset = tf.data.Dataset.from_tensor_slices(
    (
        {'input_ids': tokenized_inputs['input_ids'],
         'attention_mask': tokenized_inputs['attention_mask']},
        labels
    )
).shuffle(buffer_size=len(DATA)).batch(BATCH_SIZE)

print("\n--- Starting Fine-Tuning ---")

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)

model.compile(optimizer=optimizer)

history = model.fit(
    dataset,
    epochs=EPOCHS,
    verbose=1
)

print("--- Fine-Tuning Complete ---")

print("\n--- Testing Model Generation ---")

test_term_1 = "term: Touch base"
test_input_1 = tokenizer(test_term_1, return_tensors='tf').input_ids

output_tokens_1 = model.generate(test_input_1, max_length=MAX_LENGTH)
decoded_meaning_1 = tokenizer.decode(output_tokens_1[0], skip_special_tokens=True)

print(f"Input: '{test_term_1}'")
print(f"Output: '{decoded_meaning_1}'")

test_term_2 = "term: Alignment"
test_input_2 = tokenizer(test_term_2, return_tensors='tf').input_ids
output_tokens_2 = model.generate(test_input_2, max_length=MAX_LENGTH)
decoded_meaning_2 = tokenizer.decode(output_tokens_2[0], skip_special_tokens=True)

print(f"\nInput: '{test_term_2}'")
print(f"Output: '{decoded_meaning_2}'")

MODEL_SAVE_PATH = os.path.join(BASE_DIR, "1")
os.makedirs(MODEL_SAVE_PATH, exist_ok=True)

model.save(MODEL_SAVE_PATH, save_format='tf')
tokenizer.save_pretrained(MODEL_SAVE_PATH)
print(f"\nModel saved to: {MODEL_SAVE_PATH}")

This script does four things: it loads your training data from a JSON file, tokenizes the inputs and targets for T5, fine-tunes the model for 15 epochs, and saves the trained weights along with the tokenizer to a directory called 1 in your project root.

It is important to save your model in a numbered version directory like this, because KServe's TensorFlow serving runtime expects to find models in that layout. Anything that deviates from it will prevent your KServe inference service from working.

Train the Model

To train your model, run the following command from the root directory:

python3 app/train_llm.py

The training process will kick off, and you'll see output showing the model loading, training progress across epochs, test predictions, and finally confirmation that the model has been saved. When complete, you'll have a new directory called 1 containing your model's saved weights (saved_model.pb), variables, tokenizer config files, and all the assets TensorFlow needs to reload and serve your model later.
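
If you'd like to peek at what a TensorFlow serving runtime will see later, the saved_model_cli utility that ships with the TensorFlow pip package can print the SavedModel's serving signature, including the input tensor names the model expects. This step is optional and assumes saved_model_cli is on your PATH:

saved_model_cli show --dir 1 --tag_set serve --signature_def serving_default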

Testing the Model with FastAPI

Before we package our model for production, let's make sure it actually works. We'll build a simple FastAPI inference server that loads the trained model and exposes an endpoint for predictions.

Create the Inference Server

In your app directory, create a file called inference.py and add this code:

import os
import tensorflow as tf
from transformers import T5Tokenizer, TFT5ForConditionalGeneration
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uvicorn

app = FastAPI(
    title="Jargon Decoder LLM API",
    description="A service to translate corporate jargon using a fine-tuned T5 model.",
    version="1.0.0"
)

tokenizer = None
model = None
MAX_LENGTH = 128

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
MODEL_SAVE_PATH = os.path.join(BASE_DIR, "1")

@app.on_event("startup")
async def load_model_on_startup():
    """Loads the fine-tuned T5 model and tokenizer when the FastAPI application starts."""
    global tokenizer, model

    print(f"Base Directory: {BASE_DIR}")
    print(f"Attempting to load model from: {MODEL_SAVE_PATH}")

    try:
        tokenizer = T5Tokenizer.from_pretrained(MODEL_SAVE_PATH)
        model = TFT5ForConditionalGeneration.from_pretrained(MODEL_SAVE_PATH)
        print("Model and tokenizer loaded successfully! 🚀")
    except Exception as e:
        print(f"FATAL ERROR: Could not load model from {MODEL_SAVE_PATH}.")
        print(f"Details: {e}")

class JargonRequest(BaseModel):
    """Schema for the input request."""
    term: str = "Circle back"

class JargonResponse(BaseModel):
    """Schema for the output response."""
    original_term: str
    decoded_meaning: str

def decode_jargon(term: str, tokenizer, model) -> str:
    """
    Core function to run inference on the loaded LLM.
    """
    if not tokenizer or not model:
        raise HTTPException(status_code=503, detail="Model is not loaded or ready.")

    prompt = f"term: {term}"

    input_ids = tokenizer(
        prompt,
        return_tensors='tf',
        max_length=MAX_LENGTH,
        padding='max_length',
        truncation=True
    ).input_ids

    output_tokens = model.generate(
        input_ids,
        max_length=MAX_LENGTH
    )

    decoded_meaning = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

    if decoded_meaning.startswith("meaning: "):
        return decoded_meaning[9:].strip()

    return decoded_meaning.strip()

@app.post("/decode/", response_model=JargonResponse)
async def decode(request: JargonRequest):
    """
    API endpoint to translate a corporate jargon term into plain meaning.
    """
    try:
        meaning = decode_jargon(request.term, tokenizer, model)
        return JargonResponse(
            original_term=request.term,
            decoded_meaning=meaning
        )
    except HTTPException as e:
        # Re-raise explicit HTTP exceptions
        raise e
    except Exception as e:
        # Handle unexpected errors
        print(f"Inference Error: {e}")
        raise HTTPException(status_code=500, detail=f"Internal server error during inference: {e}")

if __name__ == "__main__":
    uvicorn.run("inference:app", host="0.0.0.0", port=8000, reload=True)

This inference script sets up a FastAPI application that loads your fine-tuned T5 model on startup. The load_model_on_startup function pulls the tokenizer and model from the saved directory, making them available globally. The decode_jargon function handles the actual inference: it takes a corporate term, formats it as a prompt, runs it through the model, and returns the decoded meaning.

The /decode/ endpoint accepts POST requests with a jargon term and responds with the plain-language translation. Pydantic models ensure type safety for requests and responses, while error handling catches issues like missing models or inference failures.

Start the Server

Run the inference server from your project root:

python3 app/inference.py  

You'll see output showing the model loading and a confirmation that the FastAPI server is running on http://0.0.0.0:8000. The startup event will trigger immediately, pulling your trained weights into memory so they're ready for inference requests.

Test the Endpoint

To test the endpoint, open a new terminal and send a test request with curl:

curl -X POST "http://localhost:8000/decode/" \
     -H "Content-Type: application/json" \
     -d '{"term": "Synergy"}'

If everything is working, you should see a JSON response with the decoded meaning:

{
    "original_term": "Synergy",
    "decoded_meaning": "Synergy"
}

The code and model are working and producing output as expected. Now that we've confirmed everything works locally, we can package the entire application code, model, and dependencies into a ModelKit for production deployment.

Packaging with KitOps

To make the workflow repeatable and production-ready, we'll use KitOps to bundle our trained model, inference code, and training data into a single ModelKit.

Initialize the Kitfile

From your project root directory, run:

kit init .  

This creates a Kitfile in your current directory. A Kitfile is a YAML manifest that describes everything needed to reproduce your ML project—model weights, code paths, datasets, and metadata. Think of it like a Dockerfile, but designed specifically for machine learning artifacts. It tells KitOps what to bundle into your ModelKit and how those pieces fit together.

Edit the Kitfile

The generated Kitfile is a good starting point, but it doesn't capture the full structure of our project. Open the Kitfile and replace its contents with this:

manifestVersion: 1.2.0

package:  
  name: corporate-speak-model  
  description: A lightweight language model fine-tuned on corporate jargon to explain complex corporate terms in simple English.  
  authors: [Thoren Oakenshield]

code:  
  - path: .   
    description: All necessary scripts, configurations, and application logic

model:  
  name: T5  
  path: ./1/  
  framework: Tensorflow  
  version: 1.2.0  
  description: A lightweight language model fine-tuned on corporate jargon to explain complex corporate terms in simple English.

datasets:  
  - name: corporate-jargon-data  
    path: ./data/  
    description: A small JSON dataset containing corporate terms and their real-world meanings.  

Let's break down what this Kitfile does. The package section holds metadata: the model name, a description, and the author. Next, the code section points to your entire project directory, capturing all your scripts, configuration files, and application logic.

Then, the model section specifies where your trained T5 weights live (the ./1/ directory we created during training), what framework they use, and the version. Finally, the datasets section references your training data in ./data/, so anyone pulling this ModelKit knows exactly what data was used to train the model. This single file gives you a complete snapshot of your ML project.

Pack the ModelKit

Now let's bundle everything into a ModelKit, similar to how you build a Docker image. To pack your ModelKit, run:

kit pack . -t jozu.ml/<username>/<model-kit-name>:<version>  

Replace <username> with your Jozu username, and <model-kit-name>:<version> with your ModelKit's name and version tag. This command reads your Kitfile, collects all the referenced files (code, model weights, data), and packages them into a single OCI-compliant artifact. You'll see output showing KitOps compressing and layering your files.
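
For example, with a hypothetical username alice, a repository called corporate-speak-model, and a v1 tag, the command would look like this (your values will differ):

kit pack . -t jozu.ml/alice/corporate-speak-model:v1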

Push to Jozu

Once the pack completes, push your ModelKit to Jozu by running:

kit push jozu.ml/<username>/<model-kit-name>:<version>  

The CLI uploads your ModelKit layers to the registry. When it finishes, head to your Jozu account at jozu.ml, click on My Repositories, and you should see your newly pushed package listed.
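
If you want to double-check what the CLI has stored locally before or after pushing, you can list the ModelKits it knows about:

kit list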

Setting Up the Serving Infrastructure

Before we can deploy our model with KServe, we need to set up the complete infrastructure stack. This includes Docker for containerization, Kubernetes for orchestration, Kubeflow for ML workflows, and KServe for model serving. Let's walk through each installation step by step.

Install Docker

Docker is the container runtime that Minikube will use. If you're on Linux, run:

sudo apt-get update && sudo apt-get install docker.io -y  
sudo groupadd docker  
sudo usermod -aG docker $USER  
newgrp docker  

For macOS or Windows users, head to the official Docker website and follow the installation instructions for your operating system.

Install kubectl

kubectl is the command-line tool for interacting with Kubernetes clusters. It lets you deploy applications, inspect resources, and manage cluster operations.

To install it, run:

sudo snap install kubectl --classic
kubectl version --client  # Verify installation

Install Minikube

Next is Minikube. It runs a local Kubernetes cluster on your machine, which is perfect for development and testing without needing cloud resources. To download and install it, run:

curl -LO https://github.com/kubernetes/minikube/releases/latest/download/minikube-linux-amd64  
sudo install minikube-linux-amd64 /usr/local/bin/minikube && rm minikube-linux-amd64  
minikube version  

Start Minikube

It's important to start your local Kubernetes cluster with enough resources to handle model serving; otherwise, it will fail while serving your model. To start Minikube, run:

minikube start --cpus=4 --memory=10240 --driver=docker  
kubectl get nodes  
kubectl cluster-info  

This spins up a single-node cluster with 4 CPUs and 10GB of memory. The kubectl get nodes command confirms your cluster is running, and kubectl cluster-info shows the control plane endpoint.

Install Kubeflow Pipelines

Kubeflow is an open-source platform for running ML workflows on Kubernetes. It provides tools for orchestrating complex pipelines, tracking experiments, and managing model training. We'll install Kubeflow Pipelines, which handles the deployment and serving orchestration:

export PIPELINE_VERSION=2.4.0
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=$PIPELINE_VERSION"

This installation can take a few minutes. To check if all components are ready, run:

kubectl get pods -n kubeflow  

Wait until all pods show Running status. You should see output similar to this:

NAME                                               READY   STATUS    RESTARTS      AGE  
cache-deployer-deployment-85b76bcb6-fmslx          1/1     Running   0             21h  
cache-server-66bd9b7875-rxdvl                      1/1     Running   0             21h  
metadata-envoy-deployment-746744dfb8-zdgtx         1/1     Running   0             21h  
metadata-grpc-deployment-54654fc5bb-9cvdg          1/1     Running   6 (21h ago)   21h  
metadata-writer-68658fdf4b-7zpbn                   1/1     Running   1 (20h ago)   21h  
minio-85cd46c575-gt7kp                             1/1     Running   0             21h  
ml-pipeline-6978d6f776-p4zt9                       1/1     Running   3 (20h ago)   21h  
ml-pipeline-persistenceagent-7d4c675666-28qnz      1/1     Running   1 (20h ago)   21h  
ml-pipeline-scheduledworkflow-695b7b8988-swzdj     1/1     Running   0             21h  
ml-pipeline-ui-88467988b-4c6md                     1/1     Running   0             21h  
ml-pipeline-viewer-crd-bf5dc64dd-5xqv9             1/1     Running   0             21h  
ml-pipeline-visualizationserver-5584ff64d7-jr686   1/1     Running   0             21h  
mysql-6745b5984c-dn4r6                             1/1     Running   0             21h  
workflow-controller-5b84568b94-tjjcz               1/1     Running   0             21h  

Install KServe

KServe is a Kubernetes-native platform for serving ML models. It handles autoscaling, canary rollouts, and provides a unified inference protocol across different model frameworks. You can install it with:

curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.14/hack/quick_install.sh" | bash

Once the installation completes, verify that KServe and its dependencies are running with the following commands:

kubectl get pods -n kserve  
kubectl get pods -n istio-system  
kubectl get pods -n knative-serving  

You should see output confirming all components are operational:

NAME                                        READY   STATUS    RESTARTS   AGE  
kserve-controller-manager-86869697f-mcgrd   2/2     Running   0          20h

NAME                                    READY   STATUS    RESTARTS   AGE  
istio-ingressgateway-698fff54fb-bbqh7   1/1     Running   0          20h  
istiod-7fdcb55c9c-qtwf5                 1/1     Running   0          20h

NAME                                    READY   STATUS    RESTARTS   AGE  
activator-5967d4d645-fgfhw              1/1     Running   0          20h  
autoscaler-598c65f5bc-9pdt4             1/1     Running   0          20h  
autoscaler-hpa-5b45c655dc-hx4qd         1/1     Running   0          20h  
controller-7cf55b567b-x45bn             1/1     Running   0          20h  
knative-operator-76b6894f45-58xlt       1/1     Running   0          20h  
net-istio-controller-54b458f57b-7cqj7   1/1     Running   0          20h  
net-istio-webhook-7bc64cfff6-mslz9      1/1     Running   0          20h  
operator-webhook-565c994ff9-f7hzq       1/1     Running   0          20h  
webhook-7f575896d6-gc4qc                1/1     Running   0          20h  

Create Registry Credentials

KServe needs credentials to pull your ModelKit from Jozu. To set up these credentials, create a file called kitops-jozu-secret.yaml in your project directory and add the following:

apiVersion: v1
kind: Secret
metadata:
  name: jozu-registry-secret
type: Opaque
data:
  KIT_USER: <YOUR USERNAME ENCODED IN BASE 64>
  KIT_PASSWORD: <YOUR PASSWORD ENCODED IN BASE 64>

Replace the base64-encoded values with your own Jozu credentials. You can encode your username and password by running:

echo -n "your-username" | base64  
echo -n "your-password" | base64  
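
As a purely hypothetical example, the username alice and password s3cret encode to YWxpY2U= and czNjcmV0, so the data section of the secret would look like this (substitute your own values):

data:
  KIT_USER: YWxpY2U=
  KIT_PASSWORD: czNjcmV0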

Serving the Model with KServe

Now that our infrastructure is ready and our ModelKit is in the registry, let's deploy it with KServe. This section walks through configuring KServe to pull ModelKits, defining the inference service, and making predictions against the deployed endpoint.

Configure the Storage Initializer

KServe uses storage initializers to fetch model artifacts from registries before starting the inference container. We need to tell KServe how to pull ModelKits using the KitOps storage initializer. To do this, create a file called kitops-storage-initializer.yaml:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterStorageContainer
metadata:
  name: kitops
spec:
  container:
    name: storage-initializer
    image: ghcr.io/kitops-ml/kitops-kserve:latest
    imagePullPolicy: Always
    env:
      - name: KIT_UNPACK_FLAGS
        value: ""
      - name: KIT_USER
        valueFrom:
          secretKeyRef:
            name: jozu-registry-secret
            key: KIT_USER
            optional: true
      - name: KIT_PASSWORD
        valueFrom:
          secretKeyRef:
            name: jozu-registry-secret
            key: KIT_PASSWORD
            optional: true
    resources:
      requests:
        memory: 100Mi
        cpu: 100m
      limits:
        memory: 1Gi
  supportedUriFormats:
    - prefix: kit://

This ClusterStorageContainer defines a custom storage initializer that understands kit:// URIs. When KServe sees a storageUri starting with kit://, it uses this initializer to authenticate with Jozu (via the credentials in jozu-registry-secret), pull the ModelKit, unpack it, and mount the model artifacts into the inference container. The resource limits ensure the initializer doesn't consume too much memory during the download and unpacking phase.

Create the InferenceService

An InferenceService is KServe's core resource for deploying models. It handles routing, autoscaling, canary deployments, and connects your model to a scalable serving runtime. Create a file called kitops-kserve-inference.yaml:

apiVersion: serving.kserve.io/v1beta1  
kind: InferenceService  
metadata:  
  name: corporate-speak-model-tensorflow  
spec:  
  predictor:  
    model:  
      modelFormat:  
        name: tensorflow  
      resources:  
        requests:  
          cpu: "250m"  
          memory: "1Gi"  
        limits:  
          cpu: "500m"  
          memory: "2Gi"  
      storageUri: kit://jozu.ml/<username>/<model-kit-name>:<version>  

Replace the storageUri with your actual ModelKit reference from Jozu (username, repository name, and tag). The modelFormat: tensorflow tells KServe to use the TensorFlow serving runtime, while the resource requests and limits ensure your model has enough CPU and memory to handle inference without monopolizing cluster resources.
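
For instance, sticking with the hypothetical alice/corporate-speak-model:v1 ModelKit from earlier, the line would read:

storageUri: kit://jozu.ml/alice/corporate-speak-model:v1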

Deploy the Service

Apply all three manifests to your cluster:

kubectl apply -f kitops-jozu-secret.yaml  
kubectl apply -f kitops-storage-initializer.yaml  
kubectl apply -f kitops-kserve-inference.yaml  

If successful, you'll see:

secret/jozu-registry-secret created
clusterstoragecontainer.serving.kserve.io/kitops created
inferenceservice.serving.kserve.io/corporate-speak-model-tensorflow created

The deployment takes a few minutes as KServe pulls the ModelKit, unpacks it, and starts the inference pod. You can monitor the progress with:

kubectl get pods  

Wait until you see your predictor pod running:

NAME                                                              READY   STATUS    RESTARTS   AGE  
corporate-speak-model-tensorflow-predictor-00001-deploymenwcc2n   2/2     Running   0          2m  
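
You can also check the InferenceService resource itself; once its READY column reports True, KServe has finished wiring up the route (the exact URL will vary with your cluster setup):

kubectl get inferenceservice corporate-speak-model-tensorflow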

Access the Inference Endpoint

Once the pod is running, find the service endpoint. You can do this by running:

kubectl get services | grep corporate-speak-model-tensorflow  

You'll see several services created by KServe:

corporate-speak-model-tensorflow                           ExternalName   <none>           knative-local-gateway.istio-system.svc.cluster.local   <none>                                               20h  
corporate-speak-model-tensorflow-predictor                 ExternalName   <none>           knative-local-gateway.istio-system.svc.cluster.local   80/TCP                                               20h  
corporate-speak-model-tensorflow-predictor-00001           ClusterIP      10.103.234.235   <none>                                                 80/TCP,443/TCP                                       20h  
corporate-speak-model-tensorflow-predictor-00001-private   ClusterIP      10.104.180.43    <none>                                                 80/TCP,443/TCP,9090/TCP,9091/TCP,8022/TCP,8012/TCP   20h  

For local testing, forward the private service to your machine:

kubectl port-forward service/corporate-speak-model-tensorflow-predictor-00001-private 8080:80  

You should see:

Forwarding from 127.0.0.1:8080 -> 8012  
Forwarding from [::1]:8080 -> 8012  

Now you can test your inference service.

Testing the Deployment with Tokenized Input

Before testing, it's important to know that KServe's standard TensorFlow serving runtime expects numerical tensors that correspond to the model's signature. Since our T5 model was fine-tuned using token IDs, we must tokenize the input locally before sending the request.

First, you'll need a quick script to generate the correct numerical payload. Create a temporary Python script called generate_payload.py in your project root to handle the tokenization and write the JSON payload to a file:


import tensorflow as tf  # Required for Tensors
from transformers import T5Tokenizer
import json
import os

# This script lives in the project root, so the "1" model directory sits right next to it
MODEL_SAVE_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), "1")
tokenizer = T5Tokenizer.from_pretrained(MODEL_SAVE_PATH)
MAX_LENGTH = 128
term = "Synergy"  # You can change the term here
prompt = f"term: {term}"  # T5 was trained to expect this prefix

inputs = tokenizer(
    prompt,
    return_tensors='tf',
    max_length=MAX_LENGTH,
    padding='max_length',
    truncation=True
)

input_ids_list = inputs['input_ids'][0].numpy().tolist()
attention_mask_list = inputs['attention_mask'][0].numpy().tolist()

payload = {
    "instances": [
        {
            "input_ids": input_ids_list,
            "attention_mask": attention_mask_list  # KServe needs both for attention
        }
    ]
}

with open('test_payload.json', 'w') as f:
    json.dump(payload, f, indent=2)

In a new terminal, run the script to create the file:

python3 generate_payload.py

Now, use curl to send the generated test_payload.json file to the KServe endpoint.

curl -X POST http://localhost:8080/v1/models/corporate-speak-model-tensorflow:predict \
  -H "Content-Type: application/json" \
  -d @test_payload.json

KServe will route the request containing the numerical IDs to the TensorFlow serving runtime, which passes it directly to the T5 model's generation function. You should see a JSON response with the decoded meaning:

{  
  "predictions": [  
    {  
      "output": "Synergy"  
    }  
  ]  
}  

Scaling and Securing Your Deployment

Running a model in production requires thinking beyond basic functionality. As time goes on you will need autoscaling to handle traffic spikes, resource limits to prevent runaway costs, and security measures to protect your models and data. KServe and KitOps give you the tools to handle all of this without the need to build custom infrastructure.

Autoscaling with KServe

KServe integrates with Knative Serving to provide automatic scaling based on request load. By default, your InferenceService will scale down to zero replicas when idle and scale up as traffic increases. You can customize this behavior by adding autoscaling annotations to your InferenceService manifest.

To do this, edit your kitops-kserve-inference.yaml to include autoscaling configuration:

apiVersion: serving.kserve.io/v1beta1  
kind: InferenceService  
metadata:  
  name: corporate-speak-model-tensorflow  
  annotations:  
    autoscaling.knative.dev/target: "10"  
    autoscaling.knative.dev/minScale: "1"  
    autoscaling.knative.dev/maxScale: "5"  
spec:  
  predictor:  
    model:  
      modelFormat:  
        name: tensorflow  
      resources:  
        requests:  
          cpu: "250m"  
          memory: "1Gi"  
        limits:  
          cpu: "500m"  
          memory: "2Gi"  
      storageUri: kit://jozu.ml/<username>/<model-kit-name>:<version>  

The target annotation sets the concurrency target per pod (10 requests), minScale ensures at least one pod is always running for faster response times, and maxScale caps the maximum number of replicas to 5, preventing runaway scaling costs. Knative will automatically add or remove pods based on incoming traffic patterns.
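
One rough way to watch the autoscaler react, reusing the port-forward and payload from the earlier test (adjust the service name if the revision number has changed), is to watch the pods in one terminal while firing a burst of requests from another:

# Terminal 1: watch pods come and go as Knative scales
kubectl get pods -w

# Terminal 2: fire 100 concurrent requests through the forwarded port
for i in $(seq 1 100); do
  curl -s -X POST http://localhost:8080/v1/models/corporate-speak-model-tensorflow:predict \
    -H "Content-Type: application/json" \
    -d @test_payload.json > /dev/null &
done
wait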

Resource Management

The resource limits in your InferenceService prevent a single model from consuming all cluster resources. The requests section tells Kubernetes how much CPU and memory to reserve, while limits sets the maximum the pod can use. For production deployments, you can tune these values based on your model's actual memory footprint and inference latency requirements.
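
To ground those numbers in real usage, you can check what the predictor pod actually consumes under load. This assumes the Minikube metrics-server addon is enabled and relies on KServe's default pod labels:

minikube addons enable metrics-server
kubectl top pod -l serving.kserve.io/inferenceservice=corporate-speak-model-tensorflow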

If you're running multiple models, consider creating separate namespaces for isolation:

kubectl create namespace production-models  
kubectl apply -f kitops-kserve-inference.yaml -n production-models  

This keeps production models separate from staging or experimental deployments and makes it easier to apply different resource quotas and network policies per environment.
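
To make the quota idea concrete, here is a minimal ResourceQuota sketch (the name and numbers are illustrative) that caps the total CPU and memory everything in production-models can claim:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-models-quota
  namespace: production-models
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi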

Securing ModelKits with Cosign

ModelKit signing ensures that the artifacts you deploy haven't been tampered with between packaging and deployment. You can use Cosign to sign your ModelKits immediately after pushing them to Jozu:

cosign generate-key-pair  
cosign sign jozu.ml/<username>/<model-kit-name>:<version> --key cosign.key  

This creates a cryptographic signature attached to your ModelKit. In production, you can configure KServe to verify signatures before pulling models, rejecting any unsigned or tampered artifacts. The signature verification happens during the storage initialization phase, before the model ever loads into memory.
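
Anyone pulling the ModelKit, or a CI gate in your deployment pipeline, can then check that signature against the public key generated alongside cosign.key:

cosign verify --key cosign.pub jozu.ml/<username>/<model-kit-name>:<version>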

Model Versioning and Rollback

One of KitOps' biggest advantages is version control for models. Every ModelKit you push to Jozu is immutable and tagged. If a new model version causes issues in production, rolling back is as simple as updating the storageUri in your InferenceService:

storageUri: kit://jozu.ml/<username>/<model-kit-name>:<the-previous-version>  

Note: When a ModelKit is pushed to Jozu, it is automatically run through 5 different vulnerability scanning tools to ensure that your model is safe and secure. Jozu also creates a downloadable audit log, consisting of the model’s complete lineage.

Apply the change, and KServe will perform a blue-green deployment, spinning up new pods with the old model version while draining traffic from the problematic version. You can also use KServe's canary deployment features to test new model versions with a percentage of traffic before fully rolling out:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: corporate-speak-model-tensorflow
spec:
  predictor:
    canaryTrafficPercent: 20
    model:
      modelFormat:
        name: tensorflow
      storageUri: kit://jozu.ml/<username>/<model-kit-name>:<a-new-version>

This routes 20% of traffic to the new model while keeping 80% on the stable version. Monitor metrics, and if everything looks good, increase the percentage until you're confident enough to promote the canary to full production.

Wrapping Up

Having a good model isn't enough to serve machine learning applications at scale. The combination of KitOps, Kubeflow, KServe, and Jozu brings software development best practices, like containerization, version control, and automated scaling, into the ML workflow. KitOps standardizes your LLM into a portable ModelKit for reproducible packaging and security, while KServe handles reliable, production-grade serving and automated scaling on Kubernetes, eliminating the need for custom engineering.

This guide demonstrated how to build a TensorFlow LLM, package it with KitOps, push it to an OCI registry, and deploy it using KServe on Kubernetes. The steps covered key operational patterns like configuring autoscaling, securing ModelKits with signatures, managing resource allocation across environments, and performing deployment rollbacks. This consistent methodology scales effortlessly from development environments like Minikube to high-volume production clusters like EKS, GKE, or on-premises systems.

To learn more about KitOps visit kitops.org. To try Jozu Hub in your private environment, you can contact the Jozu team to start a free two-week POC.
