DEV Community: Mohammad Heydari

Kubernetes in LLMOps (Part 2): GPU Efficiency, Cost Engineering, and Real-World Failure Modes

Mohammad Heydari — Tue, 23 Jun 2026 10:43:05 +0000

Introduction: Scaling Is Easy, Efficiency Is Not

By the time a team reaches Kubernetes in their LLM journey, they have usually solved one class of problems: orchestration.

Services restart automatically. Deployments become safer. Scaling becomes possible.

But a new class of problems emerges, one that is subtler, more expensive, and far more difficult to fix:

GPUs are allocated but underutilized
Costs grow faster than traffic
Latency improves… until it suddenly doesn’t
Systems appear healthy but perform poorly

At this stage, the challenge is no longer making the system work.

The challenge is making it efficient, predictable, and economically viable.

The GPU Utilization Paradox

One of the most counterintuitive realities in LLM systems is this:

You can have 100% GPU allocation and still have terrible efficiency.

This happens because allocation is not utilization.

A typical inference workload behaves like this:

It processes requests in bursts
It waits for new requests
It suffers from memory fragmentation
It is constrained by batch size and token generation speed

As a result, a GPU may be “busy” from a scheduler’s perspective but idle from a compute perspective.

This is the GPU utilization paradox.
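
To make the gap between allocation and utilization concrete, here is a minimal sketch that compares the two numbers for a single GPU. The function name and the burst timings are illustrative assumptions, not measurements from a real cluster.

```python
def compute_utilization(busy_intervals, window_start, window_end):
    """Fraction of the window during which the GPU was actually computing.

    busy_intervals: list of (start, end) times in seconds, assumed non-overlapping.
    """
    window = window_end - window_start
    if window <= 0:
        raise ValueError("window must have positive length")
    busy = 0.0
    for start, end in busy_intervals:
        # Clip each interval to the observation window.
        clipped = min(end, window_end) - max(start, window_start)
        busy += max(0.0, clipped)
    return busy / window


# Hypothetical 60-second window: the pod holds the GPU the whole time
# (allocation = 100%), but requests arrive in three short bursts.
bursts = [(0, 5), (20, 28), (50, 55)]
allocation = 1.0
utilization = compute_utilization(bursts, 0, 60)
print(f"allocated: {allocation:.0%}, computing: {utilization:.0%}")
```

In this toy window the scheduler sees a fully allocated device while the hardware computes for less than a third of the time.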

Batching: The Most Powerful (and Misused) Optimization

Batching is often introduced as a simple idea: process multiple requests together to maximize GPU throughput.

In practice, batching is one of the most delicate trade-offs in LLM systems.

Larger batches:

Increase throughput
Improve GPU efficiency
Reduce cost per request

But they also:

Increase latency for individual users
Introduce queuing delays
Complicate scheduling

The real challenge is not enabling batching—it is controlling it dynamically.

A production system must continuously balance:

Queue length
Latency targets
GPU saturation

This often leads to adaptive batching strategies, where batch size changes based on real-time conditions.

Kubernetes does not implement batching—but it enables architectures (queue + workers) where batching becomes possible.
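
One way to picture adaptive batching is a small controller that grows the batch slowly while there is backlog and shrinks it quickly when latency exceeds the target. The sketch below is one possible policy under assumed inputs (queue length and a measured p95 latency); the function name, parameters, and limits are hypothetical and not taken from any specific serving framework.

```python
def next_batch_size(current, queue_len, p95_latency_ms, target_latency_ms,
                    min_size=1, max_size=32):
    """Pick the batch size for the next step (hypothetical policy)."""
    if p95_latency_ms > target_latency_ms:
        # Latency budget exceeded: back off quickly by halving.
        proposed = current // 2
    elif queue_len > current:
        # Backlog is building and latency is fine: grow by one.
        proposed = current + 1
    else:
        # No pressure in either direction: hold steady.
        proposed = current
    return max(min_size, min(max_size, proposed))


print(next_batch_size(8, queue_len=20, p95_latency_ms=100, target_latency_ms=200))
print(next_batch_size(8, queue_len=20, p95_latency_ms=300, target_latency_ms=200))
```

Growing additively and shrinking multiplicatively is a deliberately conservative choice here: it favors stable latency over peak throughput.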

Model Multiplexing: Running More with Less

Another advanced optimization is model multiplexing—running multiple models on a single GPU.

At first glance, this seems like an obvious way to improve utilization. But in practice, it introduces significant complexity:

Memory contention between models
Unpredictable latency due to shared resources
Difficult debugging when performance degrades

The key insight is that multiplexing is not just a technical problem—it is a scheduling problem.

You must decide:

Which models can safely share a GPU
How to isolate workloads
How to prioritize requests

Kubernetes can assist through node-level isolation and resource constraints, but the logic of multiplexing often lives at the application layer.
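
The "which models can share a GPU" decision can be approximated as a bin-packing problem. The following sketch uses first-fit decreasing on peak memory with a safety headroom. The model names, memory figures, and the 20% headroom are made-up assumptions, and a real placement would also need to account for compute contention and latency priorities.

```python
def place_models(model_mem_gb, gpu_mem_gb, headroom=0.2):
    """Assign models to GPUs with first-fit decreasing, leaving memory headroom.

    model_mem_gb: dict of model name -> assumed peak memory in GB.
    Returns a list of GPUs, each a list of model names.
    """
    usable = gpu_mem_gb * (1 - headroom)
    gpus = []  # each entry: [free_gb, [model names]]
    for name, need in sorted(model_mem_gb.items(), key=lambda kv: -kv[1]):
        if need > usable:
            raise ValueError(f"{name} does not fit on one GPU")
        for gpu in gpus:
            if gpu[0] >= need:
                gpu[0] -= need
                gpu[1].append(name)
                break
        else:
            # No existing GPU has room: open a new one.
            gpus.append([usable - need, [name]])
    return [names for _, names in gpus]


# Hypothetical models and sizes.
models = {"chat-7b": 18, "embedder": 4, "reranker": 6, "summarizer": 16}
print(place_models(models, gpu_mem_gb=40))
```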

MIG: Hardware-Level Partitioning

For workloads that require stronger isolation, Multi-Instance GPU (MIG) provides a hardware-level solution.

Instead of sharing a GPU dynamically, MIG partitions it into smaller, independent units.

This allows you to:

Run multiple inference workloads in isolation
Reduce contention
Improve predictability

However, MIG introduces its own trade-offs:

Reduced flexibility compared to full GPUs
Fixed partition sizes
More complex scheduling requirements

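In Kubernetes, each MIG partition can be advertised as its own schedulable resource (for example, nvidia.com/mig-1g.5gb when the NVIDIA device plugin runs with its mixed MIG strategy), allowing more granular scheduling.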

Cost Engineering: The Missing Discipline in LLMOps

Most teams think about scaling before they think about cost.

This is a mistake.

In LLM systems, cost is not a byproduct—it is a first-class constraint.

A poorly optimized system can easily cost 5–10x more than necessary without delivering better performance.

Key cost drivers include:

GPU idle time
Over-provisioned replicas
Inefficient batching
Redundant computation (lack of caching)

Cost engineering requires visibility and control.

You need to understand:

Cost per request
Cost per token
Cost per user session

And then design your system to optimize these metrics.
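
As a rough illustration of how idle time feeds into unit cost, the sketch below derives cost per million tokens from a GPU's hourly price, its peak throughput, and the fraction of paid time it spends serving traffic. All numbers are hypothetical placeholders, not real prices or benchmarks.

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_second, utilization):
    """Cost of generating one million tokens on one GPU.

    utilization: fraction of paid time spent serving traffic, in (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    effective_tps = tokens_per_second * utilization
    tokens_per_hour = effective_tps * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical GPU: $2.50 per hour, 1,000 tokens/s when busy.
for u in (0.25, 0.75):
    cost = cost_per_million_tokens(2.50, 1000, u)
    print(f"utilization {u:.0%}: ${cost:.2f} per million tokens")
```

Under these assumptions, the same hardware serving the same model is three times more expensive per token at 25% utilization than at 75%.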

Kubernetes helps by enabling:

Fine-grained scaling
Resource limits
Workload isolation

But cost efficiency ultimately depends on system design decisions.

Failure Modes You Only See in Production

Some failures only emerge at scale. They do not appear in testing or staging environments.

1. Silent Latency Degradation

The system does not crash. It does not throw errors.

It just becomes slower.

This is often caused by:

Retriever bottlenecks
Cache inefficiencies
Suboptimal batching

These issues are difficult to detect without proper observability.

2. GPU Memory Fragmentation

Over time, repeated allocations and deallocations lead to memory fragmentation.

Even if total memory is sufficient, large contiguous blocks may not be available.

This results in:

Unexpected OOM errors
Pod crashes under seemingly safe conditions

Restarting pods temporarily fixes the issue—but does not solve the root cause.

3. Thundering Herd Problem

A sudden spike in traffic (or a burst of simultaneous cache misses) causes a flood of requests to hit the system at once.

This leads to:

Queue explosion
Increased latency
Cascading failures

Mitigation strategies include:

Rate limiting
Request deduplication
Better caching
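
Of the mitigations above, rate limiting is the simplest to sketch. Below is a minimal token bucket that allows short bursts and then throttles to a steady rate. The class name and parameters are illustrative, and the clock is passed in explicitly so the behavior is easy to follow; a real service would typically pass time.monotonic().

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now):
        # Refill based on elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=2, capacity=3)
burst = [bucket.allow(0.0) for _ in range(5)]
print(burst)              # first three pass, the rest are rejected
print(bucket.allow(0.5))  # half a second later, one token has refilled
```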

4. Cold Start Amplification

When scaling up, new pods need time to load models.

During this time:

Existing pods become overloaded
Latency increases
Autoscaling may overreact

This creates a feedback loop that destabilizes the system.
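
One way to dampen this loop is to count pods that are still loading a model toward the scaling target, so the scaler does not keep adding replicas while earlier ones warm up. The sketch below shows the idea only; the function and its parameters are hypothetical, and the Kubernetes Horizontal Pod Autoscaler has its own handling of not-yet-ready pods.

```python
import math


def pods_to_add(queue_depth, per_pod_capacity, ready_pods, warming_pods, max_pods):
    """How many new pods to request, counting pods that are still loading a model."""
    needed = math.ceil(queue_depth / per_pod_capacity)
    needed = min(needed, max_pods)
    # Warming pods will serve traffic soon, so they count toward the target.
    already_provisioned = ready_pods + warming_pods
    return max(0, needed - already_provisioned)


# First evaluation: 2 ready pods, none warming, a backlog of 120 requests.
print(pods_to_add(120, per_pod_capacity=20, ready_pods=2, warming_pods=0, max_pods=10))
# Next evaluation: same backlog, but 4 pods are already warming up.
print(pods_to_add(120, per_pod_capacity=20, ready_pods=2, warming_pods=4, max_pods=10))
```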

Debugging in a Distributed LLM System

Debugging LLM systems is fundamentally different from debugging traditional applications.

You are not just debugging code—you are debugging interactions between services.

A typical debugging workflow might involve:

Tracing a request across API, retriever, and model
Inspecting queue delays
Analyzing GPU utilization patterns
Correlating logs across multiple pods

This requires:

Structured logging
Distributed tracing
Time-synchronized metrics

Kubernetes provides the environment—but effective debugging requires discipline in instrumentation.
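
As a small illustration of what that discipline looks like, the sketch below emits one JSON log line per event and threads the same trace ID through every service a request touches. The field names and service names are assumptions chosen for the example, not a standard schema.

```python
import json
import time
import uuid


def log_event(trace_id, service, event, **fields):
    """Build and print one JSON log line; every service reuses the same trace_id."""
    record = {
        "ts": time.time(),
        "trace_id": trace_id,
        "service": service,
        "event": event,
        **fields,
    }
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line


trace_id = uuid.uuid4().hex
log_event(trace_id, "api", "request_received", route="/chat")
log_event(trace_id, "queue", "dequeued", wait_ms=412)
log_event(trace_id, "llm-worker", "generation_done", tokens=256, gpu_ms=1830)
```

With lines like these, filtering by trace_id across pods shows where a slow request actually spent its time (queue wait versus generation, in this example).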

Designing for Predictability, Not Just Performance

A common mistake is optimizing purely for peak performance.

But in production systems, predictability is often more valuable than raw speed.

Users tolerate slightly slower responses.
They do not tolerate inconsistent behavior.

Designing for predictability means:

Avoiding extreme batching strategies
Isolating workloads when necessary
Prioritizing stable latency over maximum throughput

Kubernetes helps enforce these constraints through resource limits and isolation mechanisms.

The Evolution of an LLM System

Most LLM systems evolve through stages:

Prototype (single service, no orchestration)
Early production (basic scaling, manual fixes)
Orchestrated system (Kubernetes, microservices)
Optimized system (cost-aware, efficient, observable)

Many teams reach stage 3 and stop.

But real competitive advantage lies in stage 4.

Conclusion: The Real Work Begins After Deployment

Kubernetes solves the problem of orchestration.

But orchestration is only the beginning.

The real challenges in LLMOps are:

Efficient GPU utilization
Cost control
Failure handling at scale
System predictability

These are not problems you solve once.

They are problems you continuously manage.

And the teams that do this well are the ones that turn AI capabilities into real, sustainable products.

What’s Next (Part 3)

In Part 3, we will explore:

Real-world architecture patterns (RAG at scale, streaming inference)
Advanced scheduling strategies
Hybrid cloud and on-prem GPU setups
Lessons learned from production incidents

Because at scale, every design decision becomes an operational decision.

Kubernetes in LLMOps (Part 1): Building Production-Grade AI Systems on Top of Chaos

Mohammad Heydari — Tue, 23 Jun 2026 10:35:00 +0000

Introduction: The Day Your Demo Dies

Every LLM engineer has a moment like this.

Your demo works flawlessly. A clean API, a responsive model, maybe even a RAG pipeline that feels “intelligent.” You deploy it, share it, and everything looks promising.

Then real users arrive.

Requests start piling up. Latency becomes unpredictable. Some responses take seconds, others time out. GPU memory spikes. One of your services crashes, and suddenly the entire pipeline stops responding.

Nothing fundamentally changed in your code.

What changed is the environment.

You moved from a controlled, single-user system into a concurrent, distributed, failure-prone reality.

And this is where most LLM systems break—not because the model is weak, but because the system around it is fragile.

The Hidden Complexity of “Simple” LLM Apps

At a glance, an LLM application feels like a linear pipeline:

User → API → Model → Response

But in production, this abstraction collapses.

What you actually have is a graph of interdependent services:

An API gateway handling authentication, rate limiting, and routing
A model inference service constrained by GPU memory and throughput
An embedding service generating vectors under heavy load
A retriever querying a vector database with unpredictable latency
A cache layer attempting to mask inefficiencies
A storage layer maintaining session state

Each of these components behaves differently under stress. More importantly, they fail differently.

A retriever slowdown doesn’t crash your system—it silently increases latency.
A GPU OOM doesn’t degrade performance—it kills the pod.
A cache miss storm doesn’t look like an error—it looks like increased compute cost.

This is not an application anymore.

It is a distributed system with non-linear failure modes.

Why Traditional Deployment Models Collapse

Before Kubernetes, teams often rely on a mix of Docker, VMs, and manual scaling strategies. This approach works at small scale but introduces systemic issues as complexity grows.

The first problem is that infrastructure becomes imperative rather than declarative. Engineers manually decide where things run, how they restart, and how they scale. Over time, this leads to configuration drift and unpredictable behavior.

The second issue is resource fragmentation. GPU workloads, in particular, suffer from inefficient allocation. A model that uses half a GPU still blocks the entire device. Multiply this across services, and your infrastructure cost explodes without a corresponding increase in throughput.

The third issue is deployment risk. Updating a model or service becomes a high-stakes operation. Since model initialization is expensive, even a small deployment can introduce noticeable downtime.

Finally, failure recovery is inconsistent. Some services restart automatically, others require manual intervention. Observability is partial at best, making root-cause analysis slow and unreliable.

At this stage, the system is no longer simple. It is just unmanaged complexity masquerading as simplicity.

Kubernetes as a Control Plane, Not Just a Scheduler

Kubernetes is often thought of as nothing more than a container orchestrator. In the context of LLMOps, it is more accurate to think of it as a control plane for distributed AI systems.

Instead of managing processes, you define desired system behavior:

How many instances of each service should run
What resources they require
How they communicate
How they recover from failure

Kubernetes continuously reconciles the actual state of the system with this desired state.

This reconciliation loop is what transforms fragile infrastructure into resilient systems.

In LLM workloads, where failures are frequent and resource demands are dynamic, this control loop becomes essential.

A Realistic LLMOps Architecture on Kubernetes

Below is a simplified but realistic architecture for a production LLM system:

                ┌──────────────┐
                │   API Layer  │
                └──────┬───────┘
                       │
                ┌──────▼───────┐
                │ Request Queue│ (Kafka / Redis)
                └──────┬───────┘
                       │
        ┌──────────────┼──────────────┐
        │              │              │
 ┌──────▼──────┐ ┌─────▼─────┐ ┌──────▼──────┐
 │ LLM Workers │ │ Retriever │ │ Embeddings  │
 │  (GPU Pods) │ │  Service  │ │  Service    │
 └──────┬──────┘ └─────┬─────┘ └──────┬──────┘
        │              │              │
        └──────┬───────┴───────┬──────┘
               │               │
        ┌──────▼──────┐ ┌──────▼──────┐
        │  Vector DB  │ │    Cache    │
        └─────────────┘ └─────────────┘

What matters here is not the components themselves, but their independence.

Each box can scale, fail, and recover independently.

This is the core principle Kubernetes enables.

GPU Scheduling: Where Most Systems Fail

GPU management is the hardest unsolved problem in many LLM systems.

Unlike CPUs, GPUs are:

Scarce
Expensive
Difficult to share
Highly sensitive to workload patterns

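Kubernetes introduces a way to treat GPUs as schedulable resources. With the NVIDIA device plugin installed, you can request GPUs in a pod spec through the nvidia.com/gpu resource, much as you request CPU and memory. Unlike CPU, GPUs are requested in whole units and are not overcommitted.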

However, naive GPU allocation leads to severe inefficiencies.

If each model instance claims a full GPU, but only uses a fraction of it, you end up with a cluster that is “fully allocated” but underutilized.

To address this, advanced techniques are required:

Multi-Instance GPU (MIG) for partitioning hardware
Request batching to increase utilization
Running multiple models per GPU with careful isolation

Kubernetes does not solve these problems automatically—but it provides the framework in which these optimizations become possible.

Example: GPU-Aware Deployment in Kubernetes

A simplified deployment might look like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm-container
        image: your-llm-image
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000

This ensures that each pod is scheduled only onto a node with an unallocated GPU. If no GPU is free, the pod stays Pending until one becomes available.

But in practice, this is only the starting point. Real systems require tuning for batching, concurrency, and memory management.

Autoscaling: Why CPU Metrics Lie

One of the most subtle problems in LLMOps is autoscaling.

Traditional systems scale based on CPU usage. But in LLM workloads, CPU is rarely the bottleneck.

Instead, performance is constrained by:

GPU saturation
Request queue length
Token generation latency

A system can have low CPU usage while users experience high latency.

This leads to a critical insight:

Autoscaling must be driven by user experience, not infrastructure metrics.

This often requires custom metrics pipelines, where scaling decisions are based on queue depth or response time rather than CPU percentage.
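
To show what a queue-driven scaling decision looks like, the sketch below applies the proportional rule that the Kubernetes Horizontal Pod Autoscaler documents (desired = ceil(current replicas × current metric / target metric)) to a per-worker queue depth. The metric, target, and replica limits are hypothetical, and the real controller also applies a tolerance band and stabilization windows that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """desired = ceil(current_replicas * current_metric / target_metric), clamped."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))


# 3 workers, each with 45 queued requests on average, target of 15 per worker.
print(desired_replicas(3, current_metric=45, target_metric=15))
```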

Decoupling for Throughput: The Queue Pattern

One of the most impactful architectural decisions in LLM systems is introducing a queue between request ingestion and processing.

Without a queue, each request is handled synchronously, tying API latency directly to model performance.

With a queue:

Requests are buffered
Workers pull tasks asynchronously
Batching becomes possible

This transforms the system from request-driven to throughput-optimized.

Kubernetes enables this pattern by allowing independent scaling of API pods and worker pods, ensuring that each layer can be optimized separately.
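
A minimal sketch of the worker side of this pattern: block until one request arrives, then keep pulling until the batch is full or a short wait budget runs out. An in-process queue.Queue stands in for Kafka or Redis here, and the function name and limits are assumptions for illustration.

```python
import queue
import time


def collect_batch(q, max_batch, max_wait_s):
    """Block for the first item, then gather more until the batch is full
    or max_wait_s has elapsed since the first item arrived."""
    batch = [q.get()]  # wait for at least one request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


requests = queue.Queue()
for i in range(5):
    requests.put(i)

first = collect_batch(requests, max_batch=3, max_wait_s=0.05)
second = collect_batch(requests, max_batch=3, max_wait_s=0.05)
print(first, second)
```

The wait budget is the knob that trades latency for throughput: a longer wait fills batches more often but delays the first request in each batch.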

Observability: The Only Way to Stay Sane

In complex LLM systems, most issues are emergent. They arise from interactions between components rather than isolated failures.

This makes observability critical.

You need to answer questions like:

Is latency caused by the retriever or the model?
Are GPUs underutilized or saturated?
Are requests being queued or dropped?

Without metrics, logs, and tracing, these questions are unanswerable.

Kubernetes provides the foundation for integrating observability tools, but the responsibility of instrumenting the system remains with the engineers.

Conclusion: Kubernetes as the Boundary Between Chaos and Control

LLM systems are inherently complex. They combine heavy compute workloads, distributed architectures, and unpredictable traffic patterns.

Without orchestration, this complexity manifests as instability, inefficiency, and operational risk.

Kubernetes does not eliminate complexity—but it gives you a way to manage it.

It introduces structure where there was chaos.
It enables scaling where there was limitation.
It provides resilience where there was fragility.

And most importantly, it allows you to think in terms of systems rather than scripts.

In the world of LLMOps, that shift is everything.

What’s Next (Part 2)

In the next part, we will go deeper into:

Advanced GPU utilization strategies (MIG, multiplexing)
Cost optimization patterns for LLM workloads
Real-world failure case studies
Production debugging strategies

Because building the system is only half the challenge.

Operating it is where the real engineering begins.

Designing a Synthetic Data Pipeline for Persian LLM Fine Tuning: From Topic Graphs to QLoRA Evaluation

Mohammad Heydari — Mon, 22 Jun 2026 17:44:08 +0000

Introduction: Why This Project Matters

Training instruction-following LLMs is no longer just about scaling models. It is about scaling data quality.
In high-resource languages like English, datasets such as Alpaca and OpenAssistant already exist. However, in low-resource languages like Persian, high-quality instruction datasets are extremely limited.

Most available Persian corpora suffer from:

• lack of instruction structure
• Arabic language contamination
• low diversity
• poor alignment quality

As a result, even strong base models fail to:

• follow instructions consistently
• generate fluent Persian
• maintain coherent structure

The core bottleneck is not model capacity but data scarcity.

This project addresses that problem through a full synthetic data generation and fine tuning pipeline.

System Overview: End to End Pipeline
The system is designed as a modular data engine:

Topic Tree → LLM Generation → Deduplication → Quality Scoring → Dataset Export → QLoRA Fine-Tuning → Evaluation

Each component is independent, allowing scalability and reproducibility.

Core Design Philosophy: Controlled Diversity
Instead of free form generation, a structured topic tree is used with:

• 51 domains
• approximately 350 subtopics

This ensures balanced coverage and prevents mode collapse.

Multi-layer Filtering
Raw synthetic data is inherently noisy. The system applies multiple filtering stages:

• semantic deduplication
• LLM based quality scoring

This transforms raw outputs into curated training data.

Model-Agnostic Design
The pipeline supports multiple models across stages:

• GPT-4.1 mini and GPT-4.1 nano for generation
• a second LLM for evaluation
• Qwen2.5-3B-Instruct for fine-tuning

This makes the system reusable across languages and domains.

Data Generation Engine
Prompting Strategy

Each generation call produces structured instruction data:

{
  "instruction": "How can I prepare for university entrance exams?",
  "input": "",
  "output": "To prepare for entrance exams, you should...",
  "topic": "Education",
  "subtopic": "Entrance Exams"
}

Generation Configuration

Key parameters include:

• pairs per call: 3
• calls per subtopic: 2
• max tokens: 1500
• delay between calls: 0.3 seconds

These parameters balance cost, diversity, and stability.
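
To see how these parameters translate into work, the sketch below enumerates one generation call per (subtopic, call index) pair and counts the raw pairs produced. The tiny topic tree and the function name are stand-ins for illustration; only the two constants come from the configuration above.

```python
from itertools import product

PAIRS_PER_CALL = 3
CALLS_PER_SUBTOPIC = 2


def plan_calls(topic_tree):
    """Yield one (domain, subtopic, call_index) tuple per generation call."""
    for domain, subtopics in topic_tree.items():
        for subtopic, call_index in product(subtopics, range(CALLS_PER_SUBTOPIC)):
            yield domain, subtopic, call_index


# Tiny hypothetical tree; the real one has 51 domains and about 350 subtopics.
tree = {
    "Education": ["Entrance Exams", "Study Skills"],
    "Health": ["Stress Management"],
}
calls = list(plan_calls(tree))
print(len(calls), "calls,", len(calls) * PAIRS_PER_CALL, "raw pairs")
```

By the same arithmetic, about 350 subtopics would give roughly 700 calls and 2,100 raw pairs for one pass over the tree with one model, before deduplication and scoring.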

Multi model generation

Using multiple models reduces bias and increases diversity:

• GPT-4.1 mini provides structured reasoning
• GPT-4.1 nano increases variation and reduces cost

Deduplication Layer: Semantic Filtering

Synthetic datasets often contain semantically similar entries.
Example:

• “How to reduce stress?”
• “Methods for anxiety control”

Although different in wording, both represent the same intent.
To address this, embedding based similarity is used:

if similarity(instruction_a, instruction_b) > 0.75: drop instruction_b as a duplicate

This step preserves semantic diversity and prevents overfitting on repetitive patterns.
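
A minimal runnable sketch of this filter, assuming cosine similarity over instruction embeddings and a greedy keep-first policy. The toy 3-dimensional vectors stand in for real sentence embeddings, and the helper names are hypothetical; the project's actual embedding model and implementation may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(items, embeddings, threshold=0.75):
    """Greedy dedup: keep an item only if it is not too similar to any kept item."""
    kept = []  # indices of retained items
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return [items[i] for i in kept]


# Toy vectors standing in for real sentence embeddings.
instructions = ["How to reduce stress?", "Methods for anxiety control", "How to bake bread?"]
vectors = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.0], [0.0, 0.1, 0.9]]
print(deduplicate(instructions, vectors))
```

The pairwise comparison is quadratic in the number of kept items, which is acceptable for a few thousand samples but would need an approximate nearest-neighbor index at larger scale.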

Quality Scoring: LLM as a Judge

After deduplication, data is evaluated using a second LLM.
Each sample is scored based on:

• Fluency: naturalness and grammatical correctness of the language
• Relevance: whether the response correctly addresses the instruction
• Completeness: whether the answer is sufficiently detailed and useful

Only samples with an average score above 3.5 out of 5 are retained.
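
The retention rule can be sketched as a simple filter over the judge's ratings. The dictionary layout and function name are assumptions for illustration; only the three criteria and the 3.5 threshold come from the description above.

```python
def passes_quality(scores, min_average=3.5):
    """scores: dict with 1-5 ratings for fluency, relevance, and completeness."""
    criteria = ("fluency", "relevance", "completeness")
    average = sum(scores[c] for c in criteria) / len(criteria)
    return average > min_average


# Hypothetical judge outputs for two samples.
good = {"fluency": 4, "relevance": 4, "completeness": 3}   # average 3.67
weak = {"fluency": 4, "relevance": 3, "completeness": 3}   # average 3.33
print(passes_quality(good), passes_quality(weak))
```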

Dataset Outcome
The final dataset contains:

• approximately 4,000 instruction pairs
• 51 domains
• around 350 subtopics

However, the key value is not size but structured diversity and filtering quality.

Fine-Tuning Phase: QLoRA on Qwen2.5-3B

Setup:

• Base model: Qwen2.5-3B-Instruct
• Method: QLoRA
• Framework: Unsloth
• Hardware: Google Colab T4
• Training: 3 epochs, 714 steps

Why QLoRA

QLoRA enables efficient fine-tuning by keeping the base model frozen in 4-bit quantized form and training small low-rank adapters instead of the full model weights. This reduces memory usage enough to train on a single T4 while maintaining strong performance.
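
To give a sense of how small the trainable part is, the sketch below counts the parameters a low-rank adapter adds to one weight matrix and compares that to the full matrix. The 2048 × 2048 shape and rank 16 are hypothetical values chosen for the arithmetic, not the project's actual settings.

```python
def lora_trainable_params(d_in, d_out, rank):
    """A rank-r adapter adds two matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Number of weights in the original dense matrix."""
    return d_in * d_out


# Hypothetical projection size and rank.
d, r = 2048, 16
adapter = lora_trainable_params(d, d, r)
full = full_params(d, d)
print(f"adapter: {adapter:,} params, {adapter / full:.2%} of the full matrix")
```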

Training Behavior

The training loss converges steadily with no visible instability or signs of overfitting, which suggests:

• high dataset consistency
• low noise after filtering
• stable learning dynamics

Evaluation

Key observations, base vs. fine-tuned model:

The base model exhibits:

• occasional language switching to Arabic
• incomplete or repetitive responses
• weak instruction adherence

The fine-tuned model shows:

• fluent and consistent Persian output
• structured reasoning
• improved instruction following behavior

Key Insight

The improvement is not driven by model scaling but by data engineering. This highlights a central principle in modern LLM systems: data quality is often more important than model size.

Key Technical Insights

Insight 1: Data quality is the primary bottleneck
Even a small dataset (4,000 samples) can significantly improve performance when properly curated.

Insight 2: Dual filtering is essential
Both semantic deduplication and LLM based scoring are required to maintain dataset quality.

Insight 3: Structured topic graphs outperform free-form prompting
Controlled topic distribution leads to better coverage and diversity.

Insight 4: LLM as a judge is a core system component
Automated evaluation is necessary for scalable dataset construction.

What This Project Demonstrates

This system is not just a dataset generator. It is a complete synthetic data engine for low resource LLM alignment, consisting of:

• structured generation
• semantic filtering
• quality evaluation
• fine tuning integration
• performance benchmarking

Future Work

Potential improvements include:

• scaling dataset size beyond 50,000 samples
• integrating preference optimization (DPO)
• adding multilingual support
• incorporating human feedback loops (RLHF style training)

Conclusion

This project demonstrates a shift in LLM development:
performance improvements are increasingly driven by data systems rather than model scaling. By combining structured generation, filtering, and lightweight fine-tuning, significant improvements can be achieved even in low-resource language settings.

Links:
GitHub Repository
Dataset on Hugging Face