Tosin Akinosho

Stop Using CI Scripts to Validate Jupyter Notebooks. Use a Kubernetes Operator Instead.

jupyter nbconvert --execute tells you the notebook ran. It doesn't tell you:

  • Whether it ran with the right GPU, memory limits, or node type
  • Whether the secrets it needs are actually accessible
  • Whether the model endpoint it calls is returning correct predictions
  • Whether cell outputs regressed from last week's golden baseline

That gap is why notebooks keep breaking in production despite green CI. The Jupyter Notebook Validator Operator closes it by running validation inside Kubernetes — same environment, same resources, same model endpoints as production.

Quick Start

# Clone the repo and deploy the operator (CRDs and controller)
git clone https://github.com/tosin2013/jupyter-notebook-validator-operator.git
cd jupyter-notebook-validator-operator
make deploy IMG=quay.io/tosin2013/jupyter-notebook-validator-operator:latest

# Verify the controller is running
kubectl get pods -n jupyter-notebook-validator-operator-system

Then submit a validation job:

apiVersion: mlops.mlops.dev/v1alpha1
kind: NotebookValidationJob
metadata:
  name: my-notebook-validation
  namespace: mlops-staging
spec:
  notebook:
    git:
      url: "https://github.com/your-org/ml-models"
      ref: "main"
      path: "notebooks/inference.ipynb"
  podConfig:
    containerImage: "quay.io/jupyter/scipy-notebook:latest"
    resources:
      limits:
        memory: "8Gi"
  validation:
    goldenNotebook:
      enabled: true

Save the manifest as validation-job.yaml, then apply it and watch the job:

kubectl apply -f validation-job.yaml
kubectl get notebookvalidationjob my-notebook-validation -n mlops-staging -w

How It Works

When you submit a NotebookValidationJob, the controller:

  1. Clones your repo into an ephemeral volume (no manual file staging)
  2. Schedules a validation pod with your exact resource spec — GPU nodes, memory limits, node selectors
  3. Executes the notebook via Papermill — cell-by-cell, full output capture
  4. Diffs outputs against the golden baseline (catches silent regressions)
  5. Calls your model serving endpoint and validates predictions
  6. Writes results back to the CR status — queryable with kubectl, Prometheus metrics included
  7. Lets the validation pod terminate

Validation pods are ephemeral. Apart from the controller itself, there is no long-running service to manage.
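
To make step 4 concrete, here is a minimal sketch of what a golden-baseline diff can look like. It is not the operator's actual comparison logic, which this post doesn't detail. The function names `cell_outputs` and `diff_against_golden` are made up for illustration, and the only assumption is the standard nbformat v4 JSON layout (`cells`, `cell_type`, `outputs`).

```python
import json


def cell_outputs(notebook: dict) -> list[str]:
    """Collect the text output of each code cell, in order."""
    collected = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        parts = []
        for out in cell.get("outputs", []):
            kind = out.get("output_type")
            if kind == "stream":
                text = out.get("text", "")
            elif kind in ("execute_result", "display_data"):
                text = out.get("data", {}).get("text/plain", "")
            else:
                # "error" outputs: keep only the exception name
                text = out.get("ename", "")
            # nbformat stores multi-line text as a string or a list of strings
            parts.append("".join(text) if isinstance(text, list) else text)
        collected.append("".join(parts))
    return collected


def diff_against_golden(executed: dict, golden: dict) -> list[int]:
    """Return indices of code cells whose outputs differ from the golden notebook."""
    got, want = cell_outputs(executed), cell_outputs(golden)
    mismatched = [i for i, (g, w) in enumerate(zip(got, want)) if g != w]
    # Extra or missing code cells count as mismatches too
    mismatched.extend(range(min(len(got), len(want)), max(len(got), len(want))))
    return mismatched


# Tiny in-memory notebooks (illustrative data, not real results)
golden = {"cells": [
    {"cell_type": "code", "outputs": [
        {"output_type": "stream", "name": "stdout", "text": ["accuracy: 0.94\n"]}]},
    {"cell_type": "markdown", "source": ["notes"]},
    {"cell_type": "code", "outputs": [
        {"output_type": "execute_result", "data": {"text/plain": "42"}}]},
]}
executed = json.loads(json.dumps(golden))  # deep copy via JSON round trip
executed["cells"][0]["outputs"][0]["text"] = ["accuracy: 0.71\n"]

print(diff_against_golden(executed, golden))  # [0]
```

A strict string comparison like this will flag timestamps, memory addresses, and other nondeterministic output, so a real baseline check usually needs some normalization or tolerance on top.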


Model-Aware Validation

💡 This is the differentiator. Most validators check for exceptions. This checks whether your model is actually behaving correctly.

Connect a validation job directly to your serving infrastructure:

validation:
  goldenNotebook:
    enabled: true
  modelEndpoints:
    - name: "fraud-model"
      type: "kserve"
      url: "http://fraud-model.kserve-inference.svc.cluster.local/v1/models/fraud-model:predict"

Supported serving platforms: KServe, OpenShift AI, vLLM, TorchServe, TensorFlow Serving, Triton Inference Server, Ray Serve, Seldon, BentoML.

Authentication is handled through Kubernetes Secrets, External Secrets Operator, or HashiCorp Vault, so notebooks never contain plaintext credentials.
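
For a sense of what a prediction check against the KServe URL above involves, here is a rough sketch using the KServe v1 inference protocol, where the request body carries `instances` and the response carries `predictions`. The helper names, the bearer-token handling, and the tolerance value are assumptions for illustration, not the operator's implementation, and the sketch assumes the model returns one scalar per instance.

```python
import json
import urllib.request


def build_predict_request(url, instances, token=None):
    """Build a POST request in the KServe v1 format: {"instances": [...]}."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:
        # In a cluster the token would come from a mounted Secret,
        # never from the notebook source
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def check_predictions(response_body, expected, tolerance=1e-3):
    """Compare the v1 response's "predictions" with expected scalar values."""
    predictions = json.loads(response_body)["predictions"]
    if len(predictions) != len(expected):
        return False
    return all(abs(p - e) <= tolerance for p, e in zip(predictions, expected))


def validate_endpoint(url, instances, expected, token=None, timeout=10.0):
    """Call the endpoint and report whether its predictions match."""
    request = build_predict_request(url, instances, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return check_predictions(response.read(), expected)


# Offline check against a canned response body (illustrative numbers)
sample = b'{"predictions": [0.02, 0.97]}'
print(check_predictions(sample, [0.02, 0.97]))  # True
```

`validate_endpoint` only works from somewhere that can resolve the in-cluster service name, which is exactly why this kind of check belongs in a pod rather than on a CI runner.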


GPU Workloads

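For GPU-dependent notebooks, request the GPU in the resource limits and, if you need a specific GPU model, add a node selector. The Kubernetes scheduler then places the pod on a matching node: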

podConfig:
  containerImage: "quay.io/jupyter/pytorch-notebook:cuda12-latest"
  resources:
    limits:
      memory: "32Gi"
      nvidia.com/gpu: "1"
  nodeSelector:
    nvidia.com/gpu.product: "A100-SXM4-80GB"

The validation pod lands on a GPU node. Your CI runner doesn't need CUDA anywhere near it.
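
If you want the notebook itself to fail fast when it lands on the wrong hardware, a first-cell guard is one option. This is a sketch, not something the operator provides: `visible_gpus` and `require_gpu` are hypothetical helpers, and it assumes the container image ships the `nvidia-smi` CLI.

```python
import shutil
import subprocess


def visible_gpus() -> list[str]:
    """Return one line per GPU reported by `nvidia-smi -L`, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def require_gpu(expected_substring: str, gpus: list[str]) -> None:
    """Raise if no visible GPU description contains the expected model name."""
    if not any(expected_substring in gpu for gpu in gpus):
        raise RuntimeError(
            f"expected a GPU matching {expected_substring!r}, found: {gpus or 'none'}"
        )


# In the notebook's first cell you might call:
#   require_gpu("A100", visible_gpus())
```

Because the guard raises an exception, a misplaced pod shows up as an ordinary execution failure instead of a slow or silently CPU-bound run.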


Debugging Failed Validations

When a job fails, check the CR status first:

kubectl describe notebookvalidationjob my-notebook-validation -n mlops-staging

The status.conditions block will show where it failed — clone, execution, golden diff, or model endpoint check.
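
If you'd rather scan the conditions programmatically than read `kubectl describe` output, a small helper can flatten them. The sketch below relies only on the standard Kubernetes condition fields (`type`, `status`, `reason`, `message`). The condition types in the sample are placeholders, since the operator's real condition names aren't listed in this post.

```python
def summarize_conditions(job: dict) -> list[str]:
    """Turn a custom resource's status.conditions into one line per condition."""
    lines = []
    for cond in job.get("status", {}).get("conditions", []):
        lines.append(
            f'{cond.get("type", "?")}={cond.get("status", "?")} '
            f'({cond.get("reason", "")}): {cond.get("message", "")}'
        )
    return lines


# Placeholder object shaped like a CR with two conditions (hypothetical types)
sample_job = {"status": {"conditions": [
    {"type": "ExampleStepA", "status": "True",
     "reason": "Succeeded", "message": "step finished"},
    {"type": "ExampleStepB", "status": "False",
     "reason": "Failed", "message": "step reported an error"},
]}}

for line in summarize_conditions(sample_job):
    print(line)
```

To use it on a real job, parse the output of `kubectl get notebookvalidationjob my-notebook-validation -n mlops-staging -o json` with `json.loads` and pass the resulting dict in.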

For deeper inspection, grab the validation pod's logs before the pod terminates (or increase the TTL in spec):

# Get the pod name from the CR status
kubectl logs -n mlops-staging <validation-pod-name> -c notebook-executor

💡 Tip: If the model endpoint check fails but execution passes, check your Kubernetes Secret first. The operator surfaces auth errors in the CR status under modelValidation.error.


Security

  • RBAC with minimal required permissions — controller only has the API access it needs
  • Pod Security Standards compliant validation pods
  • External Secrets Operator integration for secret rotation
  • Resource quotas to prevent runaway notebooks consuming cluster capacity
  • Runs safely in multi-tenant clusters without granting data scientists cluster-admin

Try It

Star the repo: github.com/tosin2013/jupyter-notebook-validator-operator

This is v0.1.0 — early days. If you're running notebooks in production and have opinions on the CRD design, serving platform integrations, or validation patterns, open an issue or drop a comment below.

Drop a ❤️ or 🦄 if this was useful — helps more platform engineers find it.
