Jesse Williams for Jozu

Top Open Source Tools for Kubernetes ML: From Development to Production

Running machine learning on Kubernetes has evolved from experimental curiosity to production necessity. But with hundreds of tools claiming to solve ML (machine learning) deployment, which ones should you consider? This guide cuts through the noise, presenting the essential open source tools that real teams use to build, package, deploy, and monitor ML models on Kubernetes. Most of these tools are fairly well known; however, I've also tried to incorporate a few emerging and lesser-known tools.

This post covers the complete lifecycle, from notebook experimentation to production serving, with battle-tested tools for each stage.

Timing Note: With KubeCon + CloudNativeCon North America 2025 kicking off November 10-13 in Atlanta, GA (celebrating the CNCF's 10th anniversary), Kubernetes ML is hotter than ever. Sessions on AI/ML workflows, scalable inference, and secure model deployment are packed, reflecting the explosive growth in cloud-native AI. If you're attending, don't miss the talks on emerging standards like KitOps, ModelPack, and Jozu, where our team will dive deep into packaging AI artifacts for Kubernetes at scale. It's the perfect spot to see how these tools fit into real-world MLOps stacks.

Why Kubernetes for ML?

Before diving into tools, let's address the elephant in the room: why is Kubernetes so popular for ML?

The answer is simple: production reality. Your models need to scale, recover from failures, integrate with existing systems, and meet security requirements. Kubernetes already handles this for your applications. Why build a parallel infrastructure for ML when you can leverage what you already have?

The challenge is that ML workloads differ from traditional applications. Models need GPUs, datasets require versioning, experiments demand reproducibility, and deployments need specialized serving infrastructure. Generic Kubernetes won't cut it; you need ML-specific tools that understand these requirements.


Stage 1: Model Sourcing & Foundation Models

Most organizations won't train foundation models from scratch; instead, they need reliable sources for pre-trained models and ways to adapt them for specific use cases.

Hugging Face Hub

What it does: Provides access to thousands of pre-trained models with standardized APIs for downloading, fine-tuning, and deployment. Hugging Face has become the go-to starting point for most AI/ML projects.

Why it matters: Training GPT-scale models costs millions. Hugging Face gives you immediate access to state-of-the-art models like Llama, Mistral, and Stable Diffusion that you can fine-tune for your specific needs. The standardized model cards and licenses help you understand what you're deploying.
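
As a minimal sketch (in Python, using the transformers library), here's what pulling a Hub checkpoint and sanity-checking it locally looks like before you commit to a fine-tuning pipeline. The model ID is just an example; swap in whichever checkpoint and license fits your use case, and note that `device_map="auto"` assumes the accelerate extra is installed.

```python
# Sketch: download a pre-trained model from the Hugging Face Hub and run a
# quick generation. The checkpoint below is illustrative, not a recommendation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # example public checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer(
    "Summarize why Kubernetes suits ML workloads:", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```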

Model Garden (GCP) / Model Zoo (AWS) / Model Catalog (Azure)

What it does: Cloud-provider catalogs of pre-trained and optimized models ready for deployment on their platforms. The platforms themselves aren't open source; however, they do host open source models and don't typically charge for accessing them.

Why it matters: These catalogs provide optimized versions of open source models with guaranteed performance on specific cloud infrastructure. If you're reading this post, you're likely planning to deploy your model on Kubernetes, and these models are optimized for vendor-specific Kubernetes distributions like AKS, EKS, and GKE. They handle the complexity of model optimization and hardware acceleration. However, be aware of indirect costs like compute for running models, data egress fees if exporting, and potential vendor lock-in through proprietary optimizations (e.g., AWS Neuron or GCP TPUs). Use them as escape hatches if you're already committed to that cloud ecosystem and need immediate SLAs; otherwise, prioritize neutral sources to maintain flexibility.


Stage 2: Development & Experimentation

Data scientists need environments that support interactive development while capturing experiment metadata for reproducibility.

Kubeflow Notebooks

What it does: Provides managed Jupyter environments on Kubernetes with automatic resource allocation and persistent storage.

Why it matters: Data scientists get familiar Jupyter interfaces without fighting for GPU resources or losing work when pods restart. Notebooks automatically mount persistent volumes, connect to data lakes, and scale resources based on workload.

NBDev

What it does: A framework for literate programming in Jupyter notebooks, turning them into reproducible packages with automated testing, documentation, and deployment.

Why it matters: Traditional notebooks suffer from hidden state and execution order problems. NBDev enforces determinism by treating notebooks as source code, enabling clean exports to Python modules, CI/CD integration, and collaborative development without the chaos of ad-hoc scripting.
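
To make that concrete, here's a rough sketch of what an nbdev-style notebook cell looks like (nbdev v2 directive syntax as I understand it; the function and module names are hypothetical).

```python
# Sketch of nbdev directives inside a notebook (e.g., notebooks/00_core.ipynb).
# The first cell declares which module the notebook exports to:
#| default_exp core

# Cells tagged with `#| export` become part of the generated Python package:
#| export
def clean_features(df):
    "Drop rows with missing labels and normalize column names."
    df = df.dropna(subset=["label"])
    df.columns = [c.lower().strip() for c in df.columns]
    return df

# Running `nbdev_export` writes this function into your_package/core.py,
# so CI and production import plain Python modules, not notebooks.
```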

Pluto.jl

What it does: Reactive notebooks in Julia that automatically re-execute cells based on dependency changes, with seamless integration to scripts and web apps.

Why it matters: For Julia-based ML workflows (common in scientific computing), Pluto eliminates execution order issues and hidden state, making experiments truly reproducible. It's lightweight and excels in environments where performance and reactivity are key, bridging notebooks to production Julia pipelines.

MLflow

What it does: Tracks experiments, parameters, and metrics across training runs with a centralized UI for comparison.

Why it matters: When you're running hundreds of experiments, you need to know which hyperparameters produced which results. MLflow captures this automatically, making it trivial to reproduce winning models months later.
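
A minimal tracking sketch, assuming a self-hosted MLflow server (the tracking URI, experiment name, and logged values below are placeholders):

```python
# Sketch: log parameters, metrics, and an artifact for one training run.
import mlflow

mlflow.set_tracking_uri("http://mlflow.example.internal:5000")  # assumed endpoint
mlflow.set_experiment("churn-model")

with mlflow.start_run(run_name="baseline-xgb"):
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("learning_rate", 0.1)
    # ... train your model here ...
    mlflow.log_metric("val_auc", 0.87)
    mlflow.log_artifact("model_config.yaml")  # any local file you want attached to the run
```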

DVC (Data Version Control)

What it does: Versions large datasets and model files using git-like semantics while storing actual data in object storage.

Why it matters: Git can't handle 50GB datasets. DVC tracks data versions in git while storing files in S3/GCS/Azure, giving you reproducible data pipelines without repository bloat.
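
Beyond the CLI (`dvc add` / `dvc push`), the Python API lets training code read an exact data version. A minimal sketch, with a hypothetical repo URL and path:

```python
# Sketch: read a DVC-tracked dataset pinned to a specific git revision.
import pandas as pd
import dvc.api

with dvc.api.open(
    "data/training.csv",                     # path tracked by DVC in that repo
    repo="https://github.com/acme/ml-repo",  # hypothetical repository
    rev="v1.2.0",                            # git tag/commit = data version
) as f:
    df = pd.read_csv(f)

print(df.shape)
```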


Stage 3: Training & Orchestration

Training jobs need to scale across multiple nodes, handle failures gracefully, and optimize resource utilization.

Kubeflow Training Operators

What it does: Provides Kubernetes-native operators for distributed training with TensorFlow, PyTorch, XGBoost, and MPI.

Why it matters: Distributed training is complex: it means managing worker coordination, failure recovery, and gradient synchronization. Training operators handle this complexity through simple YAML declarations.
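
The same declaration can be submitted programmatically. Here's a sketch that creates a PyTorchJob custom resource through the Kubernetes Python client; the image, namespace, and resource sizes are placeholders for your environment.

```python
# Sketch: submit a 1-master / 3-worker distributed PyTorchJob.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

worker_template = {"spec": {"containers": [{
    "name": "pytorch",  # the training operator expects this container name
    "image": "registry.example.com/train:latest",  # assumed training image
    "resources": {"limits": {"nvidia.com/gpu": 1}},
}]}}

pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "resnet-train", "namespace": "ml-training"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {"replicas": 1, "restartPolicy": "OnFailure",
                       "template": worker_template},
            "Worker": {"replicas": 3, "restartPolicy": "OnFailure",
                       "template": worker_template},
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="ml-training",
    plural="pytorchjobs", body=pytorch_job,
)
```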

Volcano

What it does: Batch scheduling system for Kubernetes optimized for AI/ML workloads with gang scheduling and fair-share policies.

Why it matters: Default Kubernetes scheduling doesn't understand ML needs. Volcano ensures distributed training jobs get all required resources simultaneously, preventing deadlock and improving GPU utilization.
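
For illustration, here's a sketch of creating a Volcano PodGroup so a four-pod training job is only placed once all four pods fit; the queue name and sizes are assumptions, and field names follow the scheduling.volcano.sh/v1beta1 CRD as I understand it.

```python
# Sketch: gang-schedule a training job with a Volcano PodGroup.
from kubernetes import client, config

config.load_kube_config()

pod_group = {
    "apiVersion": "scheduling.volcano.sh/v1beta1",
    "kind": "PodGroup",
    "metadata": {"name": "resnet-train-pg", "namespace": "ml-training"},
    "spec": {
        "minMember": 4,       # all 4 pods must be schedulable before any start
        "queue": "training",  # hypothetical Volcano queue
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="scheduling.volcano.sh", version="v1beta1",
    namespace="ml-training", plural="podgroups", body=pod_group,
)
# The training pods then set schedulerName: volcano and reference this PodGroup,
# so Volcano places them as a unit instead of one at a time.
```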

Argo Workflows

What it does: Orchestrates complex ML pipelines as DAGs with conditional logic, retries, and artifact passing.

Why it matters: Real ML pipelines aren't linear; they involve data validation, model training, evaluation, and conditional deployment. Argo handles this complexity while maintaining visibility into pipeline state.
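
As a sketch, here's a two-step Workflow (validate, then train) submitted through the Kubernetes API; images, commands, and the namespace are placeholders.

```python
# Sketch: submit an Argo Workflow with two sequential steps.
from kubernetes import client, config

config.load_kube_config()

workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "train-pipeline-", "namespace": "ml-pipelines"},
    "spec": {
        "entrypoint": "pipeline",
        "templates": [
            {"name": "pipeline", "steps": [
                [{"name": "validate", "template": "validate-data"}],
                [{"name": "train", "template": "train-model"}],
            ]},
            {"name": "validate-data", "container": {
                "image": "registry.example.com/validate:latest",
                "command": ["python", "validate.py"]}},
            {"name": "train-model", "container": {
                "image": "registry.example.com/train:latest",
                "command": ["python", "train.py"]}},
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1", namespace="ml-pipelines",
    plural="workflows", body=workflow,
)
```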

Flyte

What it does: A strongly-typed workflow orchestration platform for complex data and ML pipelines, with built-in caching, versioning, and data lineage.

Why it matters: Flyte simplifies authoring pipelines in Python (or other languages) with type safety and automatic retries, reducing boilerplate compared to raw Argo YAML. It's ideal for teams needing reproducible, versioned workflows without sacrificing flexibility.
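
A minimal flytekit sketch showing the typed task/workflow style; the task bodies are stand-ins for real data prep and training logic.

```python
# Sketch: a typed Flyte pipeline with cached data preparation.
from flytekit import task, workflow

@task(cache=True, cache_version="1.0")
def prepare_data(rows: int) -> int:
    # pretend we materialize `rows` examples and return how many pass validation
    return int(rows * 0.95)

@task
def train_model(valid_rows: int) -> float:
    # pretend we train and return a validation metric
    return 0.87 if valid_rows > 1000 else 0.5

@workflow
def training_pipeline(rows: int = 10_000) -> float:
    valid = prepare_data(rows=rows)
    return train_model(valid_rows=valid)

if __name__ == "__main__":
    # Runs locally; `pyflyte run` or registration handles cluster execution.
    print(training_pipeline(rows=20_000))
```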

Kueue

What it does: Kubernetes-native job queuing and resource management for batch workloads, with quota enforcement and workload suspension.

Why it matters: For smaller teams or simpler setups, Kueue provides lightweight gang scheduling and queuing without Volcano's overhead, integrating seamlessly with Kubeflow for efficient resource sharing in multi-tenant clusters.
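
A sketch of how a job opts into Kueue, assuming a LocalQueue named "team-a-queue" already exists in the namespace: the job is created suspended with the queue-name label, and Kueue un-suspends it once quota is available.

```python
# Sketch: submit a batch Job managed by Kueue.
from kubernetes import client, config

config.load_kube_config()

job = client.V1Job(
    metadata=client.V1ObjectMeta(
        name="finetune-bert",
        namespace="ml-training",
        labels={"kueue.x-k8s.io/queue-name": "team-a-queue"},  # hypothetical queue
    ),
    spec=client.V1JobSpec(
        suspend=True,  # Kueue un-suspends the job once it is admitted
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="trainer",
                    image="registry.example.com/finetune:latest",  # assumed image
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "4", "memory": "16Gi"},
                        limits={"nvidia.com/gpu": "1"},
                    ),
                )],
            ),
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="ml-training", body=job)
```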


Stage 4: Packaging & Registry

Models aren't standalone; they need code, data references, configurations, and dependencies packaged together for reproducible deployment. The classic Kubernetes ML stack (Kubeflow for orchestration, KServe for serving, and MLflow for tracking) excels here but often leaves packaging as an afterthought, leading to brittle handoffs between data science and DevOps. Enter KitOps, a CNCF Sandbox project that's emerging as the missing link: it standardizes AI/ML artifacts as OCI-compliant ModelKits, integrating seamlessly with Kubeflow's pipelines, MLflow's registries, and KServe's deployments. Backed by Jozu, KitOps bridges the gap, enabling secure, versioned packaging that fits right into your existing stack without disrupting workflows.

KitOps

What it does: Packages complete ML projects (models, code, datasets, configs) as OCI artifacts called ModelKits that work with any container registry. It now supports signing ModelKits with Cosign, generating Software Bill of Materials (SBOMs) for dependency tracking, and monthly releases for stability.

Why it matters: Instead of tracking "which model version, which code commit, which config file" separately, you get one immutable reference with built-in security features like signing and SBOMs for vulnerability scanning. Your laptop, staging, and production all pull the exact same project state, and the project has over 1,100 GitHub stars and CNCF backing behind it for enterprise adoption. In the Kubeflow-KServe-MLflow triad, KitOps handles the "pack" step, pushing ModelKits to OCI registries for direct consumption in Kubeflow jobs or KServe inference services, reducing deployment friction by 80% in teams we've seen.
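
As a rough sketch of the workflow, the snippet below writes a minimal Kitfile and drives the `kit` CLI from Python. The Kitfile field names reflect the schema as I understand it at the time of writing (check `kit init` or the KitOps docs for your version), and the registry, tag, and paths are placeholders.

```python
# Sketch: pack and push a ModelKit with the kit CLI.
import subprocess
from pathlib import Path

kitfile = """\
manifestVersion: "1.0"
package:
  name: churn-model
  version: 1.0.0
  description: Churn prediction model with training code and data references
model:
  name: churn-xgb
  path: ./models/churn.onnx
code:
  - path: ./src
datasets:
  - name: training
    path: ./data/train.csv
"""
Path("Kitfile").write_text(kitfile)

tag = "registry.example.com/ml/churn-model:v1.0.0"  # hypothetical registry/tag
subprocess.run(["kit", "pack", ".", "-t", tag], check=True)
subprocess.run(["kit", "push", tag], check=True)
```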

ORAS (OCI Registry As Storage)

What it does: Extends OCI registries to store arbitrary artifacts beyond containers, enabling unified artifact management.

Why it matters: You already have container registries with authentication, scanning, and replication. ORAS lets you store models there too, avoiding separate model registry infrastructure.
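
For illustration, here's a sketch that pushes and pulls a model file with the `oras` CLI from Python; the registry reference and media type are placeholders.

```python
# Sketch: store a model artifact in an existing OCI registry with ORAS.
import subprocess

ref = "registry.example.com/ml-artifacts/churn-model:v1.0.0"  # hypothetical repo

# Push the file with an explicit media type (file:mediaType pairs).
subprocess.run(
    ["oras", "push", ref, "churn.onnx:application/octet-stream"],
    check=True,
)

# Pull it back anywhere the registry is reachable.
subprocess.run(["oras", "pull", ref], check=True)
```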

BentoML

What it does: Packages models with serving code into "bentos": standardized bundles optimized for cloud deployment.

Why it matters: Models need serving infrastructure: API endpoints, batch processing, monitoring. BentoML bundles everything together with automatic containerization and optimization.
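
A minimal sketch of a BentoML 1.2-style service; the class name, resource settings, and scoring logic are placeholders for your own model code.

```python
# Sketch: a BentoML service exposing one prediction endpoint.
import bentoml

@bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 30})
class ChurnPredictor:
    def __init__(self) -> None:
        # load your trained model here (e.g., from the BentoML model store)
        self.threshold = 0.5

    @bentoml.api
    def predict(self, features: list[float]) -> dict:
        # stand-in scoring logic
        score = sum(features) / (len(features) or 1)
        return {"churn": score > self.threshold, "score": score}

# `bentoml serve service:ChurnPredictor` runs it locally;
# `bentoml build` packages it into a bento for containerization.
```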


Stage 5: Serving & Inference

Models need to serve predictions at scale with low latency, high availability, and automatic scaling.

KServe

What it does: Provides serverless inference on Kubernetes with automatic scaling, canary deployments, and multi-framework support.

Why it matters: Production inference isn't just loading a model; it's handling traffic spikes, A/B testing, and gradual rollouts. KServe handles this complexity while maintaining sub-second latency.
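
Here's a sketch of deploying an InferenceService through the Kubernetes API; the storage URI, model format, and namespace are placeholders for your own setup.

```python
# Sketch: create a KServe InferenceService that scales between 1 and 5 replicas.
from kubernetes import client, config

config.load_kube_config()

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "churn-model", "namespace": "ml-serving"},
    "spec": {
        "predictor": {
            "minReplicas": 1,
            "maxReplicas": 5,  # scale out under load, back down when idle
            "model": {
                "modelFormat": {"name": "sklearn"},
                "storageUri": "s3://models/churn/v1",  # hypothetical bucket
            },
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io", version="v1beta1", namespace="ml-serving",
    plural="inferenceservices", body=inference_service,
)
```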

Seldon Core

What it does: Advanced ML deployment platform with explainability, outlier detection, and multi-armed bandits built-in.

Why it matters: Production models need more than predictions; they need explanation, monitoring, and feedback loops. Seldon provides these capabilities without custom development.
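
As a sketch, a basic SeldonDeployment using one of Seldon's pre-packaged model servers; the model URI and namespace are placeholders.

```python
# Sketch: deploy a scikit-learn model with a SeldonDeployment.
from kubernetes import client, config

config.load_kube_config()

seldon_deployment = {
    "apiVersion": "machinelearning.seldon.io/v1",
    "kind": "SeldonDeployment",
    "metadata": {"name": "income-classifier", "namespace": "ml-serving"},
    "spec": {
        "predictors": [{
            "name": "default",
            "replicas": 2,
            "graph": {
                "name": "classifier",
                "implementation": "SKLEARN_SERVER",
                "modelUri": "s3://models/income/v3",  # hypothetical bucket
            },
        }],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="machinelearning.seldon.io", version="v1", namespace="ml-serving",
    plural="seldondeployments", body=seldon_deployment,
)
```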

NVIDIA Triton Inference Server

What it does: High-performance inference serving optimized for GPUs with support for multiple frameworks and dynamic batching.

Why it matters: GPU inference is expensive, so you need maximum throughput. Triton optimizes model execution, shares GPUs across models, and provides metrics for capacity planning.
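
On the client side, querying a running Triton server looks roughly like this sketch; the server address, model name, and tensor names are placeholders that must match your model's config.pbtxt.

```python
# Sketch: send a batch of inputs to Triton over HTTP.
import numpy as np
import tritonclient.http as httpclient

triton = httpclient.InferenceServerClient(url="triton.ml-serving.svc:8000")  # assumed service

batch = np.random.rand(8, 3, 224, 224).astype(np.float32)  # example image batch
infer_input = httpclient.InferInput("input__0", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

result = triton.infer(model_name="resnet50", inputs=[infer_input])
print(result.as_numpy("output__0").shape)
```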

llm-d

What it does: A Kubernetes-native framework for distributed LLM inference, supporting wide expert parallelism, disaggregated serving with vLLM, and multi-accelerator compatibility (NVIDIA GPUs, AMD GPUs, TPUs, XPUs).

Why it matters: For large-scale LLM deployments, llm-d excels in reducing latency and boosting throughput via advanced features like predicted latency balancing and prefix caching over fast networks. It's ideal for MoE models like DeepSeek, offering a production-ready path for high-scale serving without vendor lock-in.


Stage 6: Monitoring & Governance

Production models drift, fail, and misbehave. You need visibility into model behavior and automated response to problems.

Evidently AI

What it does: Monitors data drift, model performance, and data quality with interactive dashboards and alerts.

Why it matters: Models trained on last year's data won't work on today's. Evidently detects when input distributions change, performance degrades, or data quality issues emerge.
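
A drift-check sketch using the Evidently 0.4-style API; the reference and current data frames are placeholders for your own feature logs.

```python
# Sketch: compare training-time features against recent production inputs.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.read_parquet("features_train.parquet")     # data the model was trained on
current = pd.read_parquet("features_last_week.parquet")   # recent production inputs

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")  # or feed the results into your alerting pipeline
```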

Prometheus + Grafana

What it does: Collects and visualizes metrics from ML services with customizable dashboards and alerting.

Why it matters: You need unified monitoring across infrastructure and models. Prometheus already monitors your Kubernetes cluster; extending it to ML metrics gives you single-pane-of-glass visibility.
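
Instrumenting a model service is straightforward with the prometheus_client library; this sketch exposes a prediction counter and latency histogram on a metrics port (the port and metric names are placeholders), which the Prometheus already scraping your cluster can pick up.

```python
# Sketch: expose prediction metrics for Prometheus to scrape.
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Total predictions served", ["model"])
LATENCY = Histogram("model_inference_seconds", "Inference latency in seconds", ["model"])

def predict(features):
    with LATENCY.labels(model="churn-v1").time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real inference
        PREDICTIONS.labels(model="churn-v1").inc()
        return {"score": random.random()}

if __name__ == "__main__":
    start_http_server(8001)  # metrics served at :8001/metrics
    while True:
        predict([0.1, 0.2, 0.3])
```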

Kyverno

What it does: Kubernetes-native policy engine for enforcing declarative rules on resources, including model deployments and access controls.

Why it matters: Simpler than general-purpose tools, Kyverno integrates directly with Kubernetes admission controllers to enforce policies like "models must pass scanning" or "restrict deployments to approved namespaces," without the overhead of external services.
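
For illustration, here's a sketch of a Kyverno ClusterPolicy (expressed as a dict and applied through the Kubernetes API) that only allows serving Deployments in a namespace to pull images from an approved registry; the registry, namespace, and policy name are placeholders.

```python
# Sketch: enforce an approved-registry policy for model deployments.
from kubernetes import client, config

config.load_kube_config()

policy = {
    "apiVersion": "kyverno.io/v1",
    "kind": "ClusterPolicy",
    "metadata": {"name": "restrict-model-registries"},
    "spec": {
        "validationFailureAction": "Enforce",
        "rules": [{
            "name": "approved-registry-only",
            "match": {"any": [{"resources": {
                "kinds": ["Deployment"],
                "namespaces": ["ml-serving"],
            }}]},
            "validate": {
                "message": "Model images must come from the approved registry.",
                "pattern": {"spec": {"template": {"spec": {
                    "containers": [{"image": "registry.example.com/*"}],
                }}}},
            },
        }],
    },
}

client.CustomObjectsApi().create_cluster_custom_object(
    group="kyverno.io", version="v1", plural="clusterpolicies", body=policy,
)
```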

Fiddler Auditor

What it does: Open-source robustness library for red-teaming LLMs, evaluating prompts for hallucinations, bias, safety, and privacy before production.

Why it matters: For LLM-heavy workflows, Fiddler Auditor provides pre-deployment testing with metrics on correctness and robustness, helping catch issues early in the pipeline.

Model Cards (via MLflow or Hugging Face)

What it does: Standardized documentation for models, including performance metrics, ethical considerations, intended use, and limitations.

Why it matters: Model cards promote transparency and governance by embedding metadata directly in your ML artifacts, enabling audits and compliance without custom tooling.
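
A small sketch using the huggingface_hub ModelCard helper; the repo ID, metrics, and card text are placeholders for your own documentation.

```python
# Sketch: author a model card programmatically and save it alongside the model.
from huggingface_hub import ModelCard

content = """\
---
license: apache-2.0
tags:
  - churn
  - tabular-classification
---

# Churn Model v1

Intended use: scoring existing customers for retention offers.
Not for: credit or employment decisions.

| Metric | Value |
|--------|-------|
| AUC    | 0.87  |
| F1     | 0.74  |
"""

card = ModelCard(content)
card.save("MODEL_CARD.md")
# card.push_to_hub("acme/churn-model")  # if the model lives on the Hub
```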


Putting It All Together: A Production ML Platform

Here's how these tools combine into a complete platform, now with a clearer separation of concerns for data science and platform teams. At its core, the go-to Kubernetes ML stack (Kubeflow for end-to-end orchestration, KServe for scalable serving, and MLflow for experiment tracking) provides a solid foundation. But to close the loop on packaging and secure artifact management, KitOps slots in perfectly as the OCI-standardized "glue," bundling MLflow-tracked models into verifiable ModelKits for seamless Kubeflow pipelines and KServe rollouts. For teams scaling to production, Jozu's open-source contributions (including KitOps and the new ModelPack spec) add enterprise-grade registry and orchestration layers without lock-in.

Development: Data scientists work in Kubeflow Notebooks or NBDev/Pluto.jl for reproducible experiments, tracking runs with MLflow while DVC manages their datasets.

Training: Flyte or Argo Workflows orchestrates training pipelines, using Kubeflow Training Operators for distributed training and Volcano or Kueue for intelligent scheduling.

Model Sourcing: Teams pull foundation models from Hugging Face Hub for fine-tuning or run them locally with Ollama for testing.

Packaging: Trained models get packaged as KitOps ModelKits (with signing and SBOMs) or BentoML bundles, pushed to registries via ORAS, now interoperable with the ModelPack spec for broader ecosystem compatibility.

Serving: KServe handles standard deployments, llm-d or Triton optimizes LLM/GPU inference, and Seldon Core adds explainability where needed.

Monitoring: Evidently AI watches for drift, Prometheus/Grafana tracks metrics, Fiddler Auditor evaluates LLMs pre-prod, and Kyverno enforces governance policies with Model Cards for documentation.

This isn't theoretical; it's how leading organizations run ML in production today, often splitting into a "sandbox" for data scientists (e.g., Notebooks + MLflow) and a hardened platform for engineers (e.g., Flyte + KServe). A European logistics company managing 400+ models uses exactly this stack, reducing deployment time from weeks to hours while maintaining 99.95% availability.


Security Considerations

Open source doesn't mean insecure, but it does mean you're responsible for security. Critical considerations:

Supply Chain Security: Models can contain malicious code. Scan model artifacts for embedded exploits before deployment. Tools like ModelScan detect serialization attacks in pickle files. Leverage KitOps for built-in SBOM generation to track dependencies and vulnerabilities.

Access Control: Use Kubernetes RBAC to control who can deploy models. Integrate with enterprise identity providers for authentication, and enforce via Kyverno policies.

Audit Trails: Log all model deployments, updates, and access. Immutable artifacts like ModelKits provide natural audit points; sign them with Cosign and record in Rekor for verifiable provenance.

Vulnerability Scanning: Scan model dependencies for CVEs using tools like Trivy or Grype on SBOMs. For runtime protection, use sandboxing with gVisor or Firecracker. Block unsigned or unscanned ModelKits at admission with Kyverno or Gatekeeper.

Model Signing and Attestations: Always sign ModelKits with Cosign and add in-toto attestations (e.g., dataset hashes, framework versions). This prevents RCE risks from untrusted loads.
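
As a sketch of the signing step, here's a key-pair-based flow driven from Python with the cosign CLI; the artifact reference and key paths are placeholders (keyless signing and in-toto attestations follow a similar pattern).

```python
# Sketch: sign and verify a pushed ModelKit (an OCI artifact) with cosign.
import subprocess

ref = "registry.example.com/ml/churn-model:v1.0.0"  # hypothetical ModelKit reference

# Sign the artifact in the registry with a private key from `cosign generate-key-pair`.
subprocess.run(["cosign", "sign", "--key", "cosign.key", ref], check=True)

# Verification (e.g., in CI or an admission webhook) fails if the signature is missing or invalid.
subprocess.run(["cosign", "verify", "--key", "cosign.pub", ref], check=True)
```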


Anti-Patterns to Avoid

Building Everything Yourself: These tools exist because hundreds of teams already learned these lessons. Don't rebuild MLflow because you want "something simpler."

Ignoring Kubernetes Patterns: ML on Kubernetes works best when you follow Kubernetes patterns. Use operators, not custom scripts. Use persistent volumes, not local storage.

Treating Models Like Code: Models aren't code; they're data plus code plus configuration. Tools that treat them as pure code artifacts will frustrate your team.

Premature Optimization: Start simple. You don't need Triton's GPU optimization for your first model. You don't need distributed training for datasets under 10GB.

Golden Stack Syndrome: Adopting 15 tools because "FAANG does it." Result: 6-month integration hell, $500k burned, 0 models in prod. Pick a minimal viable path and iterate based on real pain.


Getting Started

Pick one model, one use case, and four tools:

  1. Track it with MLflow
  2. Package it with KitOps
  3. Deploy it with KServe
  4. Monitor it with Prometheus

Get this working end-to-end before adding more tools. Each tool you add should solve a specific problem you're actually experiencing, not a theoretical concern.

The beauty of open source is iteration without lock-in. Start small, learn what works for your team, and evolve your platform based on real needs rather than vendor roadmaps.


Conclusion

Kubernetes ML has matured from science experiment to production reality. The tools listed here aren't just technically sound; they're proven in production by organizations betting billions on ML outcomes.

The key insight: you don't need to choose between data science productivity and production reliability. Modern open source tools deliver both, letting data scientists experiment freely while platform engineers sleep soundly.

Your ML platform should leverage your existing Kubernetes investment, not replace it. These tools integrate with the Kubernetes ecosystem you already trust, extending it with ML-specific capabilities rather than building parallel infrastructure.

Start with the basics: development, packaging, and serving. Add training orchestration and monitoring as you scale. Let your platform grow with your ML maturity rather than building for requirements you might never have.

The path from notebook to production doesn't have to be painful. With the right open source tools on Kubernetes, it can be as straightforward as deploying any other application, just with better math.
