Stop Using CI Scripts to Validate Jupyter Notebooks. Use a Kubernetes Operator Instead.

#kubernetes #mlops #devops #jupyter

jupyter nbconvert --execute tells you the notebook ran. It doesn't tell you:

Whether it ran with the right GPU, memory limits, or node type
Whether the secrets it needs are actually accessible
Whether the model endpoint it calls is returning correct predictions
Whether cell outputs regressed from last week's golden baseline

That gap is why notebooks keep breaking in production despite green CI. The Jupyter Notebook Validator Operator closes it by running validation inside Kubernetes — same environment, same resources, same model endpoints as production.

Quick Start

# Clone the repo and deploy the operator (CRDs and controller)
git clone https://github.com/tosin2013/jupyter-notebook-validator-operator.git
cd jupyter-notebook-validator-operator
make deploy IMG=quay.io/tosin2013/jupyter-notebook-validator-operator:latest

# Verify the controller is running
kubectl get pods -n jupyter-notebook-validator-operator-system

Then submit a validation job:

apiVersion: mlops.mlops.dev/v1alpha1
kind: NotebookValidationJob
metadata:
  name: my-notebook-validation
  namespace: mlops-staging
spec:
  notebook:
    git:
      url: "https://github.com/your-org/ml-models"
      ref: "main"
      path: "notebooks/inference.ipynb"
  podConfig:
    containerImage: "quay.io/jupyter/scipy-notebook:latest"
    resources:
      limits:
        memory: "8Gi"
  validation:
    goldenNotebook:
      enabled: true

kubectl apply -f validation-job.yaml
kubectl get notebookvalidationjob my-notebook-validation -n mlops-staging -w

How It Works

When you submit a NotebookValidationJob, the controller:

Clones your repo into an ephemeral volume (no manual file staging)
Schedules a validation pod with your exact resource spec — GPU nodes, memory limits, node selectors
Executes the notebook via Papermill — cell-by-cell, full output capture
Diffs outputs against the golden baseline (catches silent regressions)
Calls your model serving endpoint and validates predictions
Writes results back to the CR status — queryable with kubectl, Prometheus metrics included
Lets the validation pod terminate once the run is finished

No persistent services. No management overhead.
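
To make the golden-baseline step concrete, here is a minimal sketch of a cell-by-cell output comparison. It is not the operator's implementation (this post does not show its diff logic). It assumes both files are standard nbformat v4 JSON and compares only text outputs; the function names are illustrative.

```python
import json


def load_notebook(path: str) -> dict:
    """Read an .ipynb file (nbformat v4 JSON) into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def cell_text_outputs(notebook: dict) -> list[str]:
    """Collect the text output of each code cell, in order."""
    texts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        parts = []
        for out in cell.get("outputs", []):
            if out.get("output_type") == "stream":
                text = out.get("text", "")
            else:
                # execute_result / display_data keep text under data["text/plain"]
                text = out.get("data", {}).get("text/plain", "")
            # nbformat may store multi-line strings as a list of lines
            parts.append("".join(text) if isinstance(text, list) else text)
        texts.append("".join(parts).strip())
    return texts


def diff_against_golden(executed: dict, golden: dict) -> list[int]:
    """Return indices of code cells whose text output differs from the golden run."""
    got, want = cell_text_outputs(executed), cell_text_outputs(golden)
    mismatched = [i for i, (g, w) in enumerate(zip(got, want)) if g != w]
    # If the code-cell counts differ, report the unmatched trailing indices too
    mismatched.extend(range(min(len(got), len(want)), max(len(got), len(want))))
    return mismatched
```

An exact string match like this is strict on purpose. Notebooks that print timestamps or floating-point metrics would need normalization or a tolerance before comparison.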

Model-Aware Validation

💡 This is the differentiator. Most validators check for exceptions. This checks whether your model is actually behaving correctly.

Connect a validation job directly to your serving infrastructure:

validation:
  goldenNotebook:
    enabled: true
  modelEndpoints:
    - name: "fraud-model"
      type: "kserve"
      url: "http://fraud-model.kserve-inference.svc.cluster.local/v1/models/fraud-model:predict"

Supported serving platforms: KServe, OpenShift AI, vLLM, TorchServe, TensorFlow Serving, Triton Inference Server, Ray Serve, Seldon, BentoML.

Authentication goes through Kubernetes Secrets, External Secrets Operator, or HashiCorp Vault, so no plaintext credentials end up in notebooks.
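
For a sense of what a prediction check against a KServe endpoint involves, here is a rough sketch using the KServe v1 protocol: a POST with an "instances" array that returns a "predictions" array. How the operator itself compares predictions is not covered in this post, so the tolerance, the function names, and the assumption that predictions are plain numbers are all illustrative.

```python
import json
import math
import urllib.request


def build_predict_request(url: str, instances: list) -> urllib.request.Request:
    """Build a KServe v1 predict request: POST {"instances": [...]} as JSON."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predictions_match(response_body: bytes, expected: list[float], rel_tol: float = 1e-3) -> bool:
    """Compare the "predictions" array of a v1 response with expected values.

    Assumes each prediction is a single number (hypothetical model output shape).
    """
    predictions = json.loads(response_body).get("predictions", [])
    if len(predictions) != len(expected):
        return False
    return all(math.isclose(p, e, rel_tol=rel_tol) for p, e in zip(predictions, expected))
```

Inside the cluster you would send the request with urllib.request.urlopen(req, timeout=10) and pass the response bytes to predictions_match. Keeping the comparison separate from the network call makes it easy to test against recorded responses.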

GPU Workloads

For GPU-dependent notebooks, request the GPU in the resource limits and, if you need a specific card, add a node selector. The Kubernetes scheduler handles placement:

podConfig:
  containerImage: "quay.io/jupyter/pytorch-notebook:cuda12-latest"
  resources:
    limits:
      memory: "32Gi"
      nvidia.com/gpu: "1"
  nodeSelector:
    nvidia.com/gpu.product: "A100-SXM4-80GB"

The validation pod lands on a GPU node. Your CI runner doesn't need CUDA anywhere near it.

Debugging Failed Validations

When a job fails, check the CR status first:

kubectl describe notebookvalidationjob my-notebook-validation -n mlops-staging

The status.conditions block will show where it failed — clone, execution, golden diff, or model endpoint check.
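
If you would rather script this than read describe output, a small helper can pull the first failing condition out of the CR's JSON. This is a sketch under assumptions: it expects the standard Kubernetes condition shape (type, status, reason, message) and treats status "False" as a failure. The condition type names in the usage note are made up for illustration and may not match what the operator emits.

```python
import json


def first_failed_condition(job_json: str):
    """Return the first condition whose status is "False", or None if there is none.

    job_json is the output of `kubectl get notebookvalidationjob <name> -o json`.
    """
    job = json.loads(job_json)
    for cond in job.get("status", {}).get("conditions", []):
        if cond.get("status") == "False":
            return cond
    return None
```

For a job whose conditions were, say, a hypothetical GitCloneSucceeded with status "True" followed by ExecutionSucceeded with status "False", the helper returns the second entry, and its reason and message fields tell you where to look next.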

For deeper inspection, grab the validation pod's logs before the pod terminates (or increase the TTL in the spec):

# Get the pod name from the CR status
kubectl logs -n mlops-staging <validation-pod-name> -c notebook-executor

💡 Tip: If the model endpoint check fails but execution passes, check your Kubernetes Secret first. The operator surfaces auth errors in the CR status under modelValidation.error.

Security

RBAC with minimal required permissions — controller only has the API access it needs
Pod Security Standards-compliant validation pods
External Secrets Operator integration for secret rotation
Resource quotas to keep runaway notebooks from consuming cluster capacity
Runs safely in multi-tenant clusters without granting data scientists cluster-admin

Try It

⭐ Star the repo: github.com/tosin2013/jupyter-notebook-validator-operator

This is v0.1.0 — early days. If you're running notebooks in production and have opinions on the CRD design, serving platform integrations, or validation patterns, open an issue or drop a comment below.

Drop a ❤️ or 🦄 if this was useful — helps more platform engineers find it.