DEV Community

Cover image for I Tested GPU Time-Slicing With Real LLMs So You Don't Have To ๐Ÿš€
Abraham Arellano Tavara
Abraham Arellano Tavara

Posted on • Originally published at myitbasics.com on

I Tested GPU Time-Slicing With Real LLMs So You Don't Have To ๐Ÿš€

I Tested GPU Time-Slicing With Real LLMs So You Don't Have To ๐Ÿš€

๐ŸŽฏ TL;DR - The Numbers Don't Lie

I spent a week testing NVIDIA time-slicing on AWS EKS with real LLM workloads (not toy examples). Here's what actually happens:

  • โœ… Time-slicing overhead: Only ~1% (NVIDIA crushed this)
  • โŒ Concurrent workloads: 50-100% performance degradation (physics can't be cheated)
  • ๐Ÿ’ฐ Cost savings: 50% reduction for sequential workloads
  • ๐ŸŽฏ Best use: Dev/test environments, time-shifted workloads

Bottom line: Time-slicing is brilliant for isolation, terrible for concurrent performance.

๐Ÿ“ฆ Full code, configs, and test scripts: GitHub Repository


๐Ÿ”‘ Quick Reference - Key Terms

Before we dive deep, here's your decoder ring:

Term What It Means Why You Care
Time-Slicing GPU virtualization creating multiple virtual GPUs from one physical GPU Lets multiple apps share a GPU
OOM Out Of Memory - when GPU runs out of VRAM Your pods crash mysteriously
TGI Text Generation Inference - HuggingFace's LLM serving engine Industry standard for serving models
Concurrent Multiple workloads running simultaneously Where performance degradation happens
Sequential Workloads running one after another Where time-slicing shines

๐Ÿ’ธ The $500 Question That Started This

Picture this: You're running two LLM models in production. That's $2/hour for two GPU instances. Over a month, that's $1,440. Your CFO is asking why the GPU bill is so high.

Then someone mentions NVIDIA time-slicing: "Just share one GPU between both models!"

The question everyone asks: Does this actually work without destroying performance?

The answer everyone gives: "It depends..." (not helpful)

So I decided to test it with real production workloads and actual performance measurement. No toy examples. No theoretical benchmarks. Just two real LLMs hammering a shared GPU.

Spoiler: The results surprised me.


๐Ÿ—๏ธ The Test Lab Setup

Here's what I built for this experiment:
Test Lab Setup

๐ŸŽฎ The Hardware

  • GPU: NVIDIA L40S (46GB VRAM) - The new hotness
  • Instance: g6e.2xlarge (~$1.01/hour in us-west-2)
  • Cost: Much cheaper than p3.8xlarge ($12.24/hour)
  • Kubernetes: EKS 1.32 with NVIDIA GPU Operator

๐Ÿค– The Contenders

Model A: Microsoft Phi-3.5-mini-instruct

  • Size: ~4GB memory footprint
  • Speed: Fast inference (< 1 second)
  • Use case: Quick responses, high throughput

Model B: DeepSeek-R1-Distill-Llama-8B

  • Size: ~8GB memory footprint
  • Speed: Slower but more thoughtful (~1 second)
  • Use case: Complex reasoning, detailed outputs

Both running: HuggingFace Text Generation Inference (TGI) 3.3.4

๐Ÿ’ก Why these models? They represent real production workloads - different sizes, different performance profiles, and combined they use ~12GB (26% of available 46GB).


๐Ÿ”ฅ The 3 Mistakes I Made (So You Don't Have To)

Mistake #1: "GPUs Just Workโ„ข" (They Don't)

What I expected: Spin up g6e.2xlarge, GPU drivers already installed (like p3 instances)

What actually happened: No GPU detected. Pods stuck in Pending. Panic.

kubectl describe pod
# Events: 0/1 nodes available: insufficient nvidia.com/gpu
Enter fullscreen mode Exit fullscreen mode

The plot twist: Unlike p3 instances, g6e.2xlarge doesn't come with pre-installed NVIDIA drivers in EKS managed node groups.

The fix that saved the day:

# NVIDIA GPU Operator does ALL the heavy lifting
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set nodeSelector.eks-node=gpu \
  --wait
Enter fullscreen mode Exit fullscreen mode

This magical operator automatically:

  • โœ… Installs NVIDIA drivers
  • โœ… Configures container toolkit
  • โœ… Deploys device plugin
  • โœ… Sets up GPU feature discovery

๐Ÿ’ก Pro tip: Always use GPU Operator for modern EKS setups. Manual driver installation is pain.


Mistake #2: "Just Deploy Both Models" (OOM Speedrun)

What I tried: Deploy both models with default settings

What happened: Both pods started... then crashed with cryptic errors

RuntimeError: CUDA out of memory. Tried to allocate 20.00 GiB
Enter fullscreen mode Exit fullscreen mode

The problem: Each model tried to grab ~80% of GPU memory. Math doesn't work:

  • Model A: 80% ร— 46GB = 36.8GB
  • Model B: 80% ร— 46GB = 36.8GB
  • Total needed: 73.6GB
  • Available: 46GB โŒ

The fix: Aggressive memory limits per model

args:
  - "--cuda-memory-fraction"
  - "0.4"  # ๐ŸŽฏ Only use 40% GPU memory per model
  - "--max-batch-prefill-tokens"
  - "4096"  # โš ๏ธ Reduced from default 8192
  - "--max-input-length"
  - "256"  # ๐Ÿ”’ Limit input size
  - "--max-total-tokens"
  - "512"  # ๐Ÿ”’ Limit output size
Enter fullscreen mode Exit fullscreen mode

The math that works:

  • Model A: 40% ร— 46GB = 18.4GB โœ…
  • Model B: 40% ร— 46GB = 18.4GB โœ…
  • Total: 36.8GB (80% utilization) โœ…
  • System overhead: 20% buffer โœ…

๐Ÿšจ Critical setting: Without cuda-memory-fraction, models will OOM during warmup. This isn't optional!


Mistake #3: "Time-Slicing Config Is Obvious" (It's Not)

What the docs say: Create a ConfigMap

What they don't say: You need TWO ConfigMaps and an operator upgrade

The complete configuration:

# ConfigMap 1: Time-slicing configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 10  # ๐ŸŽฏ 10 virtual GPUs from 1 physical

---
# ConfigMap 2: Device plugin config
apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 10
Enter fullscreen mode Exit fullscreen mode

Then upgrade the operator:

helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set devicePlugin.config.name=device-plugin-config \
  --wait
Enter fullscreen mode Exit fullscreen mode

Verify it worked:

kubectl describe node <gpu-node> | grep nvidia.com/gpu

# Before:  nvidia.com/gpu: 1  โŒ
# After:   nvidia.com/gpu: 10 โœ…
Enter fullscreen mode Exit fullscreen mode

๐ŸŽ‰ Success: Your cluster now advertises 10 virtual GPUs instead of 1!

What this means: You can now schedule 10 pods requesting nvidia.com/gpu: 1 on a single physical GPU.


๐Ÿ“Š The Results (Prepare to Be Surprised)

Test Scenario 1: Individual Performance (No Competition)

First, I tested each model alone with time-slicing enabled. Would time-slicing itself add overhead?

Phi-3.5-Mini Flying Solo

Configuration Avg Latency Throughput Success Rate
Time-sliced GPU 0.609s 98.44 req/min 100% โœ…
Exclusive GPU 0.603s 99.46 req/min 100% โœ…
Overhead +0.006s -1.02 req/min 0%

Overhead: ~1% ๐ŸŽ‰

DeepSeek-R1 Flying Solo

Configuration Avg Latency Throughput Success Rate
Time-sliced GPU 1.135s 52.84 req/min 100% โœ…
Exclusive GPU 1.142s 52.49 req/min 100% โœ…
Overhead -0.007s +0.35 req/min 0%

Overhead: ~1% (actually slightly faster!) ๐Ÿคฏ

๐Ÿ’ก Key Insight #1: NVIDIA time-slicing overhead is negligible. The virtualization layer is incredibly efficient. This is exceptional engineering.


Test Scenario 2: Concurrent Performance (The Real Test)

Now both models hitting the GPU simultaneously. Every request from both models at the same time.

This is where reality hits.

Phi-3.5-Mini Under Fire

Metric Baseline Concurrent Impact
Latency 0.609s 1.227s ๐Ÿ”ด +101.4%
Throughput 98.44 req/min 48.89 req/min ๐Ÿ”ด -50.3%
Success Rate 100% 100% โœ… Still stable

DeepSeek-R1 Under Fire

Metric Baseline Concurrent Impact
Latency 1.135s 1.778s ๐Ÿ”ด +56.6%
Throughput 52.84 req/min 33.74 req/min ๐Ÿ”ด -36.1%
Success Rate 100% 100% โœ… Still stable

๐Ÿšจ Key Insight #2: Resource competition is BRUTAL. When both models compete for the same GPU, performance tanks by 50-100%.


๐Ÿ“ˆ Visual Performance Comparison

Individual Performance (Time-Slicing Overhead)
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Exclusive GPU:    โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 100%
Time-Sliced GPU:  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘ 99%
                  โ†‘ Only 1% difference!

Concurrent Performance (Resource Competition)  
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Baseline:         โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 100%
Concurrent:       โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘ 50%
                  โ†‘ Ouch. Physics can't be cheated.
Enter fullscreen mode Exit fullscreen mode

๐Ÿค” Why This Happens (The Physics)

Time-slicing overhead (~1%):

  • โœ… Context switching is fast
  • โœ… Memory isolation is efficient
  • โœ… Scheduling overhead is minimal

Resource competition (50-100% degradation):

  • โŒ Both models fight for GPU cores
  • โŒ Memory bandwidth saturation
  • โŒ L2 cache thrashing
  • โŒ Shared memory contention

The verdict: Time-slicing technology is brilliant. GPU resource sharing is expensive.


๐ŸŽฏ The Decision Framework (Should YOU Use Time-Slicing?)

โœ… Perfect Use Cases - Deploy With Confidence

1. Development & Testing Environments ๐Ÿงช

Scenario: QA team needs to test 3 model versions
Cost without time-slicing: $3/hour (3 GPUs)
Cost with time-slicing: $1/hour (1 GPU)
Savings: $1,440/month
Performance impact: None (sequential testing)
Verdict: Slam dunk โœ…
Enter fullscreen mode Exit fullscreen mode

2. Time-Shifted Workloads โฐ

Scenario: Model A (business hours), Model B (batch processing at night)
Overlap: < 10% of time
Performance: 99% (negligible overhead when not competing)
Savings: 50% GPU costs
Verdict: Perfect fit โœ…
Enter fullscreen mode Exit fullscreen mode

3. Demo & POC Deployments ๐ŸŽฌ

Scenario: Sales demo with multiple model comparisons
Requirements: Not production, occasional use
Budget: Limited
Performance needs: "Good enough"
Verdict: Ideal use case โœ…
Enter fullscreen mode Exit fullscreen mode

4. CI/CD Model Testing ๐Ÿ”„

Scenario: Automated model validation pipelines
Pattern: Sequential test runs
Peak load: One test at a time
Cost optimization: Critical
Verdict: Great match โœ…
Enter fullscreen mode Exit fullscreen mode

โŒ Terrible Use Cases - Avoid These

1. Production Inference Serving ๐Ÿ’ผ

Scenario: Customer-facing API with SLA requirements
Requirement: < 100ms response time
Concurrent load: Unpredictable spikes
Impact: 50-100% degradation = SLA violations
Verdict: Don't even think about it โŒ
Enter fullscreen mode Exit fullscreen mode

2. High-Throughput Concurrent Workloads ๐Ÿš€

Scenario: Multiple models serving real-time traffic
Load pattern: Constant concurrent requests
Performance impact: Immediate 50% throughput loss
Business impact: Lost revenue, poor UX
Verdict: Hard pass โŒ
Enter fullscreen mode Exit fullscreen mode

3. Latency-Sensitive Applications โšก

Scenario: Real-time chat, autocomplete, voice assistants
SLA: Sub-second responses required
Concurrent degradation: Doubles latency
User impact: Frustrated users, high churn
Verdict: Nope โŒ
Enter fullscreen mode Exit fullscreen mode

4. Auto-Scaling Production Workloads ๐Ÿ“ˆ

Scenario: Traffic scales unpredictably
Problem: Can't predict when models compete
Risk: Performance collapse during peak times
Business impact: Revenue loss during high-traffic
Verdict: Too risky โŒ
Enter fullscreen mode Exit fullscreen mode

๐Ÿค” Decision Tree - Find Your Path

Start Here
    โ”‚
    โ”œโ”€ Is this production? โ”€โ”€โ”€ YES โ”€โ”€โ†’ Will workloads overlap?
    โ”‚                                       โ”‚
    โ”‚                                       โ”œโ”€ YES โ”€โ”€โ†’ โŒ Don't use time-slicing
    โ”‚                                       โ”‚
    โ”‚                                       โ””โ”€ NO โ”€โ”€โ”€โ†’ โœ… Consider time-slicing
    โ”‚
    โ””โ”€ NO (Dev/Test) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’ โœ… Use time-slicing
                                                 (perfect use case!)
Enter fullscreen mode Exit fullscreen mode

๐Ÿ’ฐ ROI Calculator - Your Break-Even Analysis

Scenario Without Time-Slicing With Time-Slicing Monthly Savings
2 Models, Sequential $1,440 $720 $720 โœ…
2 Models, 30% Overlap $1,440 $720 $720 (but some degradation) โš ๏ธ
2 Models, 50% Overlap $1,440 $720 $720 (significant degradation) โŒ
2 Models, Always Concurrent $1,440 $720 Not worth it โŒ

Break-even point: If your workloads overlap < 30% of the time, time-slicing typically provides net positive value.

๐Ÿ’ก Pro Tip: Monitor actual workload overlap in production before deciding. Use CloudWatch metrics to track GPU utilization patterns.


๐Ÿงช How I Tested This (Reproducible Science)

The Testing Strategy

I built an automated framework to eliminate human error and ensure reproducible results:

Test Protocol:

  1. โ˜๏ธ Test each model individually (establish baseline)
  2. โœŒ๏ธ Test both models concurrently (measure degradation)
  3. ๐Ÿ” Repeat 3 times with 5 different prompts (45 requests total)
  4. ๐Ÿ“Š Calculate statistical averages and impact percentages

The Automation Script

Here's the core testing logic (simplified):

#!/bin/bash
# Complete performance testing framework

test_individual_model() {
    local endpoint=$1
    local model_name=$2

    # Test prompts covering different complexity levels
    local prompts=(
        "Explain machine learning"
        "What is Python programming"
        "Describe cloud computing"
        "How does AI work"
        "What are automation benefits"
    )

    # Run 3 iterations for statistical accuracy
    for iteration in $(seq 1 3); do
        for prompt in "${prompts[@]}"; do
            # Measure with millisecond precision
            start_time=$(date +%s.%N)

            response=$(curl -s -X POST "$endpoint/generate" \
                -H "Content-Type: application/json" \
                -d "{
                    \"inputs\": \"$prompt\",
                    \"parameters\": {
                        \"max_new_tokens\": 50,
                        \"temperature\": 0.7
                    }
                }")

            end_time=$(date +%s.%N)
            duration=$(echo "$end_time - $start_time" | bc)

            # Record results
            echo "$duration" >> "${model_name}_results.txt"
        done
    done

    # Calculate statistics
    calculate_stats "${model_name}_results.txt"
}

test_concurrent_models() {
    # Fire both requests simultaneously using background jobs
    for prompt in "${prompts[@]}"; do
        # Model A request
        {
            measure_latency "$PHI35_ENDPOINT" "$prompt" >> phi_concurrent.txt
        } &

        # Model B request  
        {
            measure_latency "$DEEPSEEK_ENDPOINT" "$prompt" >> deepseek_concurrent.txt
        } &

        # Wait for both to complete
        wait
    done
}
Enter fullscreen mode Exit fullscreen mode

Kubernetes Scaling for Test Control

The genius part: Using Kubernetes to control test scenarios:

# Test Phi-3.5 alone
kubectl scale deployment deepseek-r1-baseline --replicas=0 -n llm-testing
# Wait 30 seconds for graceful shutdown
./load_test.sh

# Test DeepSeek alone
kubectl scale deployment mistral-7b-baseline --replicas=0 -n llm-testing
kubectl scale deployment deepseek-r1-baseline --replicas=1 -n llm-testing
# Wait 30 seconds for startup
./load_test.sh

# Test both concurrently
kubectl scale deployment mistral-7b-baseline --replicas=1 -n llm-testing
# Wait 30 seconds for startup
./load_test.sh
Enter fullscreen mode Exit fullscreen mode

๐Ÿ’ก Why this works: Scaling deployments ensures clean test isolation without manual intervention or pod management.

What Made This Scientific

โœ… Controlled environment: No other GPU workloads running
โœ… Multiple iterations: 3 runs ร— 5 prompts = statistical validity
โœ… Standardized prompts: Same inputs across all tests
โœ… Consistent parameters: Same token limits, temperature
โœ… Automated execution: Eliminates human timing errors
โœ… Millisecond precision: Accurate latency measurement

Sample Output

=== Phi-3.5-Mini (Individual Baseline) ===
Total Requests: 15
Successful: 15 (100%)
Average Latency: 0.609s
Throughput: 98.44 req/min

=== Phi-3.5-Mini (Concurrent) ===
Average Latency: 1.227s (+101.4% ๐Ÿ”ด)
Throughput: 48.89 req/min (-50.3% ๐Ÿ”ด)

Report saved: test_results/GPU_SLICING_FULL_performance_report_20250725_095710.txt
Enter fullscreen mode Exit fullscreen mode

๐Ÿ“ฆ Get the complete testing framework: GitHub Repository


๐Ÿ’ฐ The Money Talk - Real ROI Analysis

Let's talk dollars and cents. Because at the end of the day, your CFO cares about the bottom line.

Scenario 1: Traditional Approach (Separate GPUs)

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Model A: g6e.2xlarge           โ”‚
โ”‚  Cost: $1.01/hour               โ”‚
โ”‚  Performance: 100% โœ…            โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Model B: g6e.2xlarge           โ”‚
โ”‚  Cost: $1.01/hour               โ”‚
โ”‚  Performance: 100% โœ…            โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Total: $2.02/hour = $1,454/month
Enter fullscreen mode Exit fullscreen mode

Scenario 2: Time-Slicing (Sequential Workloads)

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Single g6e.2xlarge             โ”‚
โ”‚                                 โ”‚
โ”‚  Model A (9am-5pm)  โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”‚
โ”‚  Model B (6pm-8am)  โ”€โ”€โ”€โ”€โ”€โ”€โ”ค    โ”‚
โ”‚                                 โ”‚
โ”‚  Cost: $1.01/hour               โ”‚
โ”‚  Performance: 99% โœ…             โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Total: $1.01/hour = $727/month
Savings: $727/month (50% reduction! ๐ŸŽ‰)
Enter fullscreen mode Exit fullscreen mode

When this works: Workloads naturally time-shifted (batch processing, different timezones, dev/staging)


Scenario 3: Time-Slicing (Concurrent Workloads)

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Single g6e.2xlarge             โ”‚
โ”‚                                 โ”‚
โ”‚  Model A + Model B (competing)  โ”‚
โ”‚                                 โ”‚
โ”‚  Cost: $1.01/hour               โ”‚
โ”‚  Performance: 50% โš ๏ธ             โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Total: $1.01/hour = $727/month
Savings: $727/month
Trade-off: 50% performance loss ๐Ÿ’€
Enter fullscreen mode Exit fullscreen mode

When this fails: Production inference, customer-facing APIs, latency-sensitive applications


The Financial Break-Even Matrix

Workload Overlap Cost Savings Performance Recommended?
0-10% (mostly sequential) 50% โœ… 99% โœ… Yes ๐ŸŽฏ
10-30% (occasional overlap) 50% โœ… 80-90% โš ๏ธ Maybe ๐Ÿค”
30-50% (frequent overlap) 50% โœ… 60-80% โš ๏ธ Risky ๐Ÿ˜ฌ
50%+ (mostly concurrent) 50% โŒ 50% โŒ No ๐Ÿšซ

Real-World Cost Example (My Consulting Client)

Their Setup:

  • Dev environment: 2 models for A/B testing
  • Usage pattern: Sequential (test Model A, then Model B)
  • Previous cost: $1,440/month (2 GPUs)

After Time-Slicing:

  • New cost: $720/month (1 GPU)
  • Performance: 99% (negligible overhead)
  • Savings: $8,640/year ๐Ÿ’ฐ

CFO's reaction: "Why weren't we doing this before?"


The Hidden Costs of Getting It Wrong

Mistake: Using time-slicing for production inference

Scenario: E-commerce chatbot with strict SLA (< 500ms response)

Before time-slicing:
Response time: 400ms โœ…
Conversion rate: 12% โœ…
Revenue impact: $0

After time-slicing (concurrent load):
Response time: 800ms โŒ (SLA breach)
Conversion rate: 8% โŒ (users bounce)
Revenue impact: -$50,000/month ๐Ÿ’€
Enter fullscreen mode Exit fullscreen mode

Lesson: The $720/month GPU savings cost them $50,000/month in revenue. Not worth it.


Your ROI Decision Tree

Question 1: Are your workloads production-facing?
    โ”‚
    โ”œโ”€ NO โ”€โ”€โ†’ Question 2: Do workloads overlap?
    โ”‚           โ”‚
    โ”‚           โ”œโ”€ NO โ”€โ”€โ†’ โœ… Use time-slicing (50% savings!)
    โ”‚           โ”‚
    โ”‚           โ””โ”€ YES โ”€โ”€โ†’ โš ๏ธ Prototype and measure first
    โ”‚
    โ””โ”€ YES โ”€โ”€โ†’ Question 3: Can you tolerate 50% performance loss?
                โ”‚
                โ”œโ”€ NO โ”€โ”€โ†’ โŒ Don't use time-slicing
                โ”‚
                โ””โ”€ YES โ”€โ”€โ†’ ๐Ÿค” Are you SURE? Measure twice, deploy once.
Enter fullscreen mode Exit fullscreen mode

๐Ÿ’ก Pro Tip: Always prototype with time-slicing in staging before production. Measure actual performance impact with YOUR workloads, not theoretical benchmarks.


๐Ÿš€ Quick Start - Get Running in 30 Minutes

Want to try this yourself? Here's the exact path I followed.

Prerequisites Check โœ…

# Verify you have these tools installed
kubectl version --client
helm version
eksctl version
aws --version

# If any are missing, install from:
# kubectl: https://kubernetes.io/docs/tasks/tools/
# helm: https://helm.sh/docs/intro/install/
# eksctl: https://eksctl.io/installation/
# aws: https://aws.amazon.com/cli/
Enter fullscreen mode Exit fullscreen mode

Step 1: Create EKS Cluster (15 minutes)

# Create cluster configuration file
cat << 'EOF' > cluster-config.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: gpusharing-demo
  region: us-west-2
  version: "1.32"
nodeGroups:
  - name: main
    instanceType: t3.large
    desiredCapacity: 2
    minSize: 2
    maxSize: 4
  - name: gpu
    instanceType: g6e.2xlarge
    desiredCapacity: 1
    minSize: 1
    maxSize: 1
    labels:
      eks-node: gpu
EOF

# Create the cluster (takes ~15 minutes)
eksctl create cluster -f cluster-config.yaml

# Verify nodes are ready
kubectl get nodes
Enter fullscreen mode Exit fullscreen mode

What you'll see:

NAME                         STATUS   ROLE    AGE
ip-192-168-1-1...            Ready    <none>  5m    # t3.large
ip-192-168-1-2...            Ready    <none>  5m    # t3.large  
ip-192-168-1-3...            Ready    <none>  5m    # g6e.2xlarge (GPU!)
Enter fullscreen mode Exit fullscreen mode

Step 2: Install NVIDIA GPU Operator (5 minutes)

# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator (this does ALL the heavy lifting)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set nodeSelector.eks-node=gpu \
  --wait

# Verify installation (all pods should be Running)
kubectl get pods -n gpu-operator
Enter fullscreen mode Exit fullscreen mode

Wait for all pods to show 1/1 Running (takes 2-3 minutes)


Step 3: Enable Time-Slicing (3 minutes)

# Download complete configuration
wget https://raw.githubusercontent.com/AbrahamArellano/eks-shared-gpu-ai-performance/main/infra/time-slicing-config.yaml

# Apply time-slicing configuration
kubectl apply -f time-slicing-config.yaml

# Upgrade GPU operator with time-slicing
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set devicePlugin.config.name=device-plugin-config \
  --wait
Enter fullscreen mode Exit fullscreen mode

Verify it worked:

kubectl describe node $(kubectl get nodes -l eks-node=gpu -o jsonpath='{.items[0].metadata.name}') | grep "nvidia.com/gpu:"

# Expected output:
#  nvidia.com/gpu:     10  โœ… (not 1!)
Enter fullscreen mode Exit fullscreen mode

Step 4: Deploy Your Models (5 minutes)

# Create namespace
kubectl create namespace llm-testing

# Clone the complete repository
git clone https://github.com/AbrahamArellano/eks-shared-gpu-ai-performance.git
cd eks-shared-gpu-ai-performance

# Deploy both models with memory-optimized configs
kubectl apply -f models/mistral-memory-optimized.yaml
kubectl apply -f models/deepseek-memory-optimized.yaml

# Watch pods start (takes 2-3 minutes to download models)
kubectl get pods -n llm-testing -w
Enter fullscreen mode Exit fullscreen mode

Wait for both pods to show 1/1 Running


Step 5: Run Performance Tests (2 minutes)

# Port forward to access models locally
kubectl port-forward svc/mistral-7b-service 8081:8080 -n llm-testing &
kubectl port-forward svc/deepseek-r1-service 8082:8080 -n llm-testing &

# Run the complete test suite
cd tests
chmod +x load_test.sh
./load_test.sh
Enter fullscreen mode Exit fullscreen mode

Output you'll see:

=== Complete GPU Time-Slicing Performance Analysis ===
Testing Phi-3.5-Mini (Individual Baseline)...
  โœ“ Test 1: 0.610s
  โœ“ Test 2: 0.602s
  ...

Testing DeepSeek-R1 (Individual Baseline)...
  โœ“ Test 1: 1.142s
  ...

Testing Both Models Concurrently...
  โœ“ Both completed
  ...

Report saved: test_results/performance_report_YYYYMMDD_HHMMSS.txt
Enter fullscreen mode Exit fullscreen mode

Step 6: View Your Results

# View the latest report
cat tests/test_results/performance_report_*.txt | tail -30
Enter fullscreen mode Exit fullscreen mode

You'll see something like this:

=== Phi-3.5-Mini Individual Baseline ===
Average Latency: 0.609s
Throughput: 98.44 req/min

=== Phi-3.5-Mini Concurrent Performance ===
Average Latency: 1.227s
Performance Impact: +101.4% latency ๐Ÿ”ด
Enter fullscreen mode Exit fullscreen mode

๐ŸŽ‰ Success! You've Now:

โœ… Created an EKS cluster with GPU support
โœ… Enabled NVIDIA time-slicing (10 virtual GPUs)
โœ… Deployed two real LLM models
โœ… Measured actual performance impact
โœ… Generated comprehensive performance reports


Cleanup (Don't Forget!)

# Delete the entire cluster to avoid charges
eksctl delete cluster gpusharing-demo --region us-west-2

# Verify deletion
aws eks list-clusters --region us-west-2
Enter fullscreen mode Exit fullscreen mode

โš ๏ธ Important: Running this setup costs ~$1.20/hour. Don't forget to delete when done!


Troubleshooting Common Issues

Problem: Pods stuck in Pending

# Check if GPU is detected
kubectl describe node <gpu-node> | grep nvidia.com/gpu

# If shows 0, restart device plugin
kubectl rollout restart daemonset/nvidia-device-plugin-daemonset -n gpu-operator
Enter fullscreen mode Exit fullscreen mode

Problem: Models crash with OOM

# Check cuda-memory-fraction in deployment
kubectl describe deployment mistral-7b-baseline -n llm-testing

# Should see: --cuda-memory-fraction 0.4
# If not, update the YAML and reapply
Enter fullscreen mode Exit fullscreen mode

Problem: Can't access models via port-forward

# Check if services exist
kubectl get svc -n llm-testing

# Check if pods are ready
kubectl get pods -n llm-testing

# Restart port-forward
pkill -f port-forward
kubectl port-forward svc/mistral-7b-service 8081:8080 -n llm-testing &
Enter fullscreen mode Exit fullscreen mode

๐Ÿ“š Next Steps

  • Experiment: Try different models from HuggingFace
  • Optimize: Tune memory fractions for your workloads
  • Monitor: Set up CloudWatch for GPU metrics
  • Scale: Add more GPU nodes if needed

Complete implementation guide: GitHub Repository


๐Ÿ’ก 5 Things I Wish I Knew Before Starting

1. "Pre-installed Drivers" Doesn't Mean What You Think

What I assumed: g6e instances come with NVIDIA drivers like p3 instances

Reality check: Spent 2 hours debugging why pods couldn't see the GPU

The lesson: Always use GPU Operator for modern EKS setups. It's not optionalโ€”it's essential.

Time saved for you: 2 hours of confusion ๐Ÿ˜…


2. Memory Limits Are Not Suggestions

What I did first: Deployed models with default settings

What happened: Both models tried to grab 80% of GPU memory each

The crash: CUDA out of memory errors everywhere

The fix: cuda-memory-fraction: 0.4 is your best friend

Lesson: In GPU sharing, aggressive memory limits aren't pessimisticโ€”they're realistic.


3. Time-Slicing โ‰  Magic Performance Multiplier

Marketing says: "Share one GPU across multiple workloads!"

Reality says: "Share one GPU across multiple workloads... but not at full speed concurrently"

The truth: Time-slicing provides isolation, not performance multiplication.

Mental model: Think of it like time-sharing a CPU, not adding more cores.


4. Test Sequential Before Assuming Concurrent

My mistake: Assumed concurrent workloads would work "well enough"

The numbers: 50-100% performance degradation

The learning: Always measure YOUR workloads with YOUR patterns

Pro tip: Use Kubernetes scaling to isolate test scenarios cleanly


5. Production โ‰  Development (Obvious, But...)

Development: Time-slicing is perfect

  • Cost savings? Yes โœ…
  • Performance trade-offs? Acceptable โœ…
  • Stability? Excellent โœ…

Production: Time-slicing is risky

  • SLA requirements? Violated โŒ
  • Unpredictable performance? Dangerous โŒ
  • Customer experience? Compromised โŒ

The rule: If it touches paying customers, provision separate GPUs.


๐ŸŽฌ The Verdict - Should You Use Time-Slicing?

After a week of testing, thousands of inference requests, and countless hours of analysis, here's my honest take:

โœ… Time-Slicing Is Brilliant For:

  • Development environments where cost matters more than peak performance
  • Sequential workloads with natural time-shifting patterns
  • A/B testing where models don't compete simultaneously
  • POC/Demo environments with flexible requirements
  • Learning and experimentation without breaking the bank

ROI: 50% cost savings with 99% performance โœ…


โŒ Time-Slicing Is Terrible For:

  • Production inference serving customer traffic
  • Concurrent workloads with strict SLA requirements
  • Latency-sensitive applications where milliseconds matter
  • Revenue-generating systems where performance = money
  • Auto-scaling workloads with unpredictable patterns

Risk: 50-100% performance degradation = unhappy customers โŒ


The Technology Itself? ๐Ÿ† A+ Engineering

NVIDIA absolutely crushed the implementation:

  • Only ~1% overhead from time-slicing mechanism
  • Rock-solid stability (zero crashes in extensive testing)
  • Clean Kubernetes integration
  • Production-grade reliability

The performance degradation comes from physics, not technology.

You can't cheat the fundamental limitations of shared resources. Time-slicing doesn't create more GPU computeโ€”it manages access to existing compute.


๐Ÿš€ Your Next Steps

If You're Convinced (Dev/Test Use Case):

  1. โญ Star the repo: GitHub Repository
  2. ๐Ÿ”ง Follow the Quick Start: 30 minutes to working setup
  3. ๐Ÿ“Š Run your own tests: Measure YOUR workloads
  4. ๐Ÿ’ฐ Calculate YOUR ROI: Use the decision framework
  5. ๐ŸŽ‰ Deploy and save money: Start with dev environments

If You're Skeptical (Production Use Case):

  1. โœ… Provision separate GPUs: Safety first
  2. ๐Ÿงช Test time-slicing in staging: Validate with real traffic patterns
  3. ๐Ÿ“ˆ Monitor overlap patterns: Measure actual concurrent load
  4. ๐Ÿค” Reconsider for off-peak: Maybe time-slice during low-traffic hours?

If You're Curious (Learning Mode):

  1. ๐Ÿ“– Read the full guide: Complete blog post
  2. ๐ŸŽ“ Understand the concepts: Time-slicing vs MIG vs MPS
  3. ๐Ÿ› ๏ธ Experiment safely: Use the provided test framework
  4. ๐Ÿ’ฌ Share your findings: Comment below with your results

๐Ÿ“š Complete Resource Library

Code & Configuration

  • ๐Ÿ“ฆ GitHub Repository: eks-shared-gpu-ai-performance
    • Complete Kubernetes manifests
    • Automated testing framework
    • Performance analysis scripts
    • Troubleshooting guides

Deep Dive Content

  • ๐Ÿ“ Full Technical Analysis: MyITBasics.com
  • ๐Ÿ—๏ธ Architecture Patterns: Complete infrastructure setup guide
  • ๐Ÿ” Performance Analysis: Detailed metrics and methodology
  • ๐Ÿ’ก Best Practices: Production-ready recommendations

๐Ÿ’ฌ Let's Discuss - Your Turn!

I've shared my findings. Now I want to hear yours:

๐Ÿ’ญ Questions for the community:

  • Have you used GPU time-slicing in production? What was your experience?
  • What workload patterns are you trying to optimize?
  • Any other GPU sharing strategies you've found effective?
  • Found bugs or improvements in my testing methodology?

๐Ÿ› Found an issue in the code?
Open an issue or PR on GitHub

๐Ÿ’ก Want to discuss your specific use case?
Drop a comment belowโ€”I read and respond to all of them!

๐Ÿ“ง Need consulting help?
Visit MyITBasics.com for architecture guidance


๐Ÿ™ Thanks for Reading!

If you found this helpful:

  • โญ Star the GitHub repo to bookmark for later
  • ๐Ÿ’ฌ Comment below with your experiences or questions
  • ๐Ÿ”„ Share this post with your team
  • ๐Ÿ‘ค Follow me for more deep-dives into GPU architecture, AI infrastructure, and cloud-native engineering

Coming up next: Multi-GPU strategies, MIG vs time-slicing comparison, and cost optimization techniques for production AI workloads.

Stay tuned! ๐Ÿš€


Built with curiosity, tested with rigor, shared with the community.

โ€” Abraham Arellano
Cloud Architect & AI Infrastructure Engineer
MyITBasics.com | GitHub

Top comments (0)