<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Abraham Arellano Tavara</title>
    <description>The latest articles on DEV Community by Abraham Arellano Tavara (@abraham_arellanotavara_7).</description>
    <link>https://dev.to/abraham_arellanotavara_7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3508825%2Fe10ccee5-1db3-42e9-b295-86218ee7a6ed.png</url>
      <title>DEV Community: Abraham Arellano Tavara</title>
      <link>https://dev.to/abraham_arellanotavara_7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/abraham_arellanotavara_7"/>
    <language>en</language>
    <item>
      <title>Choosing Between ML-KEM and ML-DSA for Your Post-Quantum Migration [Part 2]</title>
      <dc:creator>Abraham Arellano Tavara</dc:creator>
      <pubDate>Sun, 09 Nov 2025 15:50:13 +0000</pubDate>
      <link>https://dev.to/abraham_arellanotavara_7/choosing-between-ml-kem-and-ml-dsa-for-your-post-quantum-migration-part-2-4dip</link>
      <guid>https://dev.to/abraham_arellanotavara_7/choosing-between-ml-kem-and-ml-dsa-for-your-post-quantum-migration-part-2-4dip</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Post-Quantum Cryptography Migration Series:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/abraham_arellanotavara_7/the-quantum-threat-nobodys-taking-seriously-but-should-dh5"&gt;Part 1: The Quantum Threat&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2: ML-KEM vs ML-DSA&lt;/strong&gt; (You are here)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Quick Recap from Part 1
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/abraham_arellanotavara_7/the-quantum-threat-nobodys-taking-seriously-but-should-dh5"&gt;Part 1&lt;/a&gt;, we established that the quantum threat isn't coming—it's already here through harvest-now-decrypt-later attacks. Adversaries are collecting encrypted data today to decrypt when quantum computers mature around 2030-2035.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The urgency:&lt;/strong&gt; If your data retention period + migration time &amp;gt; time until quantum computers, you've already run out of time to wait.&lt;/p&gt;

&lt;p&gt;Now comes the critical question: &lt;strong&gt;What do we actually migrate TO?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Question Every Architect Is Asking
&lt;/h2&gt;

&lt;p&gt;"Should we use ML-KEM or ML-DSA?"&lt;/p&gt;

&lt;p&gt;I've seen this question come up repeatedly in architecture discussions, and honestly, the confusion is understandable. The acronyms are overwhelming, and most documentation assumes you already know the difference.&lt;/p&gt;

&lt;p&gt;Here's the reality: &lt;strong&gt;you need both.&lt;/strong&gt; They solve fundamentally different problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Happened in August 2024
&lt;/h2&gt;

&lt;p&gt;NIST finalized three post-quantum cryptography standards after &lt;strong&gt;8 years&lt;/strong&gt; of global scrutiny:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FIPS 203 (ML-KEM)&lt;/strong&gt;: Key encapsulation for establishing shared secrets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FIPS 204 (ML-DSA)&lt;/strong&gt;: Digital signatures for authentication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FIPS 205 (SLH-DSA)&lt;/strong&gt;: Hash-based backup signatures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't experimental. They're production-ready and already deployed by AWS, Google, and Cloudflare.&lt;/p&gt;

&lt;h2&gt;
  
  
  ML-KEM: Your TLS Handshakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it replaces:&lt;/strong&gt; RSA/ECDH key exchange&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where you'll use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TLS 1.3 connections (HTTPS, APIs)&lt;/li&gt;
&lt;li&gt;VPN tunnels (IPsec, OpenVPN)
&lt;/li&gt;
&lt;li&gt;SSH sessions&lt;/li&gt;
&lt;li&gt;Any key establishment protocol&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key sizes:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ML-KEM-768 (recommended):
  Public key: 1,184 bytes (vs. 32 bytes for X25519)
  Ciphertext: 1,088 bytes

Performance overhead: ~150 microseconds per handshake
With connection reuse: effectively 0%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
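&lt;p&gt;To put those sizes in perspective, here is a back-of-the-envelope calculation (a Python sketch; the byte counts are the figures above, plus the assumption that classical ECDH sends one 32-byte X25519 key share in each direction):&lt;/p&gt;

```python
# Back-of-the-envelope: extra bytes ML-KEM-768 adds to one TLS handshake,
# using the sizes quoted above (public key out, ciphertext back).
mlkem_bytes = 1184 + 1088
x25519_bytes = 32 + 32      # one 32-byte key share in each direction

extra = mlkem_bytes - x25519_bytes
print(extra)  # 2208 extra bytes, roughly two additional MTU-sized packets
```

&lt;p&gt;A few kilobytes once per handshake is why the overhead disappears under connection reuse.&lt;/p&gt;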



&lt;p&gt;&lt;strong&gt;Real-world example (AWS KMS):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;KmsClient&lt;/span&gt; &lt;span class="n"&gt;kms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KmsClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;httpClient&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AwsCrtHttpClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;postQuantumTlsEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// ML-KEM-768 enabled&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benchmark results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0.05% throughput reduction with proper connection pooling&lt;/li&gt;
&lt;li&gt;0.3% latency increase on initial handshake&lt;/li&gt;
&lt;li&gt;Negligible impact with TLS reuse&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ML-DSA: Your Code Signing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it replaces:&lt;/strong&gt; RSA/ECDSA signatures&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where you'll use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Software distribution (binaries, containers)&lt;/li&gt;
&lt;li&gt;JWT/OAuth tokens&lt;/li&gt;
&lt;li&gt;Document signing&lt;/li&gt;
&lt;li&gt;Future TLS certificates (when CAs support it)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Signature sizes:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ML-DSA-65 (recommended):
  Public key: 1,952 bytes
  Signature: 3,309 bytes (vs. 64 bytes for ECDSA P-256)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
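&lt;p&gt;For token-based systems the size difference lands directly in your payloads. A quick sketch (Python; dummy zero bytes stand in for real signatures) of how long the base64url-encoded signature segment of a JWT would be:&lt;/p&gt;

```python
import base64

# JWT signature segment length: base64url-encoded, padding stripped.
# 3,309 and 64 bytes are the ML-DSA-65 and ECDSA P-256 sizes quoted above.
mldsa = base64.urlsafe_b64encode(bytes(3309)).rstrip(b"=")
ecdsa = base64.urlsafe_b64encode(bytes(64)).rstrip(b"=")
print(len(mldsa), len(ecdsa))  # 4412 vs 86 characters
```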



&lt;p&gt;&lt;strong&gt;The surprising part:&lt;/strong&gt;&lt;br&gt;
ML-DSA is actually &lt;strong&gt;10x faster&lt;/strong&gt; than RSA-2048 for signing operations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RSA-2048 signing: 2-5 milliseconds
ML-DSA-65 signing: 100-200 microseconds

Verification is similarly fast.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Critical Part: Hybrid Mode
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Never deploy pure post-quantum yet.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hybrid combines classical + PQC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TLS handshake:
  1. ECDH key exchange (classical)
  2. ML-KEM-768 key exchange (PQC)
  3. Combined: KDF(ECDH_secret || ML-KEM_secret)

Security: Attacker must break BOTH to compromise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
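<p>The combining step can be sketched in a few lines. This is an illustration only; real protocols define their own combiner (TLS 1.3 feeds both secrets into its key schedule), and here HKDF-Extract via HMAC-SHA256 stands in:</p>

```python
import hashlib, hmac, os

# Hybrid combiner sketch: one KDF call over both shared secrets.
# HMAC-SHA256 (HKDF-Extract) stands in for the protocol's real KDF.
ecdh_secret = os.urandom(32)    # output of the classical ECDH exchange
mlkem_secret = os.urandom(32)   # output of ML-KEM-768 decapsulation

session_secret = hmac.new(b"hybrid-salt", ecdh_secret + mlkem_secret,
                          hashlib.sha256).digest()
print(len(session_secret))  # 32 bytes; predicting it requires BOTH inputs
```

<p>If either input secret later turns out to be breakable, the combined output is still unpredictable to the attacker.</p>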



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ML-KEM was finalized only in 2024&lt;/li&gt;
&lt;li&gt;Implementation vulnerabilities might emerge&lt;/li&gt;
&lt;li&gt;Side-channel attacks could be discovered&lt;/li&gt;
&lt;li&gt;Hybrid provides insurance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Industry consensus:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NIST supports hybrid deployment during the transition&lt;/li&gt;
&lt;li&gt;NSA CNSA 2.0 allows hybrid through 2030&lt;/li&gt;
&lt;li&gt;IETF standardizing hybrid TLS specs&lt;/li&gt;
&lt;li&gt;AWS/Azure/GCP implement hybrid by default&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Decision Framework for Developers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For key exchange:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;95% of use cases → ML-KEM-768 (hybrid with X25519)
Government/NSS → ML-KEM-1024 (CNSA 2.0 requirement)
Future diversity → Plan for HQC (code-based, finalizing 2027)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For signatures:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;General purpose → ML-DSA-65
Long-term archival → SLH-DSA-256 (hash-based, conservative)
Embedded/IoT → FN-DSA-512 (compact, when FIPS 206 finalizes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
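<p>The two tables above boil down to a lookup. A minimal sketch (Python; the profile keys are my own shorthand, not standard terms):</p>

```python
# The decision tables above as a tiny lookup table.
RECOMMENDATIONS = {
    ("key_exchange", "general"): "ML-KEM-768 hybrid with X25519",
    ("key_exchange", "government"): "ML-KEM-1024 (CNSA 2.0)",
    ("signature", "general"): "ML-DSA-65",
    ("signature", "archival"): "SLH-DSA-256",
    ("signature", "embedded"): "FN-DSA-512 (once FIPS 206 finalizes)",
}

def recommend(purpose, profile="general"):
    """Return the suggested algorithm for a purpose/profile pair."""
    return RECOMMENDATIONS[(purpose, profile)]

print(recommend("signature", "archival"))  # SLH-DSA-256
```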



&lt;p&gt;&lt;strong&gt;Performance comparison:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7jz0ni18jz6ffe7wu2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7jz0ni18jz6ffe7wu2u.png" alt=" " width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Migration Timeline
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why this is urgent:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"Harvest now, decrypt later" attacks are active today. Adversaries are collecting encrypted data to decrypt when quantum computers arrive (~2030-2035).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mosca's Theorem:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If: data_shelf_life + migration_time &amp;gt; time_until_quantum
Then: Start migration NOW
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For most enterprises with sensitive data, that equation already fails.&lt;/p&gt;
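<p>The inequality is trivial to encode, and worth running with your own numbers (the values below are a hypothetical enterprise, not a benchmark):</p>

```python
def must_start_now(data_shelf_life, migration_time, time_until_quantum):
    """Mosca's inequality: True means waiting is no longer an option."""
    return data_shelf_life + migration_time > time_until_quantum

# Hypothetical: 10-year retention, 5-year migration, ~10 years to Q-Day
print(must_start_now(10, 5, 10))  # True
```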

&lt;h2&gt;
  
  
  Common Pitfalls
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ "We'll wait for better algorithms"
&lt;/h3&gt;

&lt;p&gt;These algorithms survived 8 years of attempted breaks. This IS the mature version.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❌ "We don't need hybrid"
&lt;/h3&gt;

&lt;p&gt;Even AWS, Google, and NIST recommend hybrid. Don't skip it.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❌ "Key sizes will break our protocols"
&lt;/h3&gt;

&lt;p&gt;A 1,184-byte public key is manageable on modern networks, and TLS handles record fragmentation transparently.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❌ "Performance will be terrible"
&lt;/h3&gt;

&lt;p&gt;With connection reuse, overhead is negligible. We've benchmarked it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Update your TLS libraries&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenSSL 3.2+ (experimental support)&lt;/li&gt;
&lt;li&gt;BoringSSL (Google's fork, deployed in Chrome)&lt;/li&gt;
&lt;li&gt;AWS-LC (FIPS validated, production-ready)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Test in non-production&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# OpenSSH 9.9+ with ML-KEM&lt;/span&gt;
ssh &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;KexAlgorithms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;mlkem768x25519-sha256 user@host
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Monitor performance&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Measure baseline (classical only)&lt;/li&gt;
&lt;li&gt;Enable hybrid mode&lt;/li&gt;
&lt;li&gt;Compare P50/P95/P99 latency&lt;/li&gt;
&lt;li&gt;Check for regressions&lt;/li&gt;
&lt;/ul&gt;
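<p>Step 3 can be as simple as comparing nearest-rank percentiles of two latency samples. A sketch (Python; the sample values are made up):</p>

```python
def percentile(samples, p):
    """Nearest-rank percentile; good enough for a regression check."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

baseline = [12.0, 13.1, 12.4, 15.0, 12.2, 40.0]  # ms, classical only
hybrid = [12.1, 13.3, 12.6, 15.2, 12.4, 40.5]    # ms, hybrid enabled

for p in (50, 95, 99):
    delta = percentile(hybrid, p) - percentile(baseline, p)
    print(f"P{p} delta: {delta:+.1f} ms")
```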

&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; Gradual rollout&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Canary deployment (5%)&lt;/li&gt;
&lt;li&gt;Monitor for 1 week&lt;/li&gt;
&lt;li&gt;Expand to 25%, 50%, 100%&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;I wrote a comprehensive deep-dive covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete algorithm comparison matrix&lt;/li&gt;
&lt;li&gt;Performance benchmarks&lt;/li&gt;
&lt;li&gt;Algorithm selection flowchart&lt;/li&gt;
&lt;li&gt;Hybrid deployment strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;📖 &lt;a href="https://myitbasics.com/post-quantum-algorithms-ml-kem-ml-dsa-guide/" rel="noopener noreferrer"&gt;Read the full guide: Post-Quantum Cryptography Algorithms Explained&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Also includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Downloadable algorithm selection flowchart (PDF)&lt;/li&gt;
&lt;li&gt;PQC migration checklist&lt;/li&gt;
&lt;li&gt;Links to NIST standards and AWS documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;The standards are ready. The implementations exist. Cloud providers are deploying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your move:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify systems using RSA/ECDH today&lt;/li&gt;
&lt;li&gt;Update SDKs to versions supporting ML-KEM&lt;/li&gt;
&lt;li&gt;Test hybrid mode in staging&lt;/li&gt;
&lt;li&gt;Plan production rollout&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The quantum threat isn't theoretical—it's operational today through harvest-now-decrypt-later attacks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's your migration strategy? Already testing ML-KEM, or still evaluating?&lt;/strong&gt; 👇&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next in This Series
&lt;/h2&gt;

&lt;p&gt;You now understand the threat (Part 1) and which algorithms to use (Part 2). In &lt;strong&gt;Part 3&lt;/strong&gt;, we'll get hands-on with AWS implementation—real code, performance benchmarks, and operational guidance for deploying ML-KEM in production.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 2 of a 6-part series on post-quantum cryptography migration. &lt;a href="https://dev.to/abraham_arellanotavara_7/the-quantum-threat-nobodys-taking-seriously-but-should-dh5"&gt;Read Part 1: The Quantum Threat&lt;/a&gt; if you haven't already.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;#cybersecurity #cryptography #quantum #devops&lt;/p&gt;

</description>
      <category>security</category>
      <category>cryptography</category>
      <category>devops</category>
      <category>architecture</category>
    </item>
    <item>
      <title>The Quantum Threat Nobody's Taking Seriously (But Should)</title>
      <dc:creator>Abraham Arellano Tavara</dc:creator>
      <pubDate>Sun, 02 Nov 2025 19:53:48 +0000</pubDate>
      <link>https://dev.to/abraham_arellanotavara_7/the-quantum-threat-nobodys-taking-seriously-but-should-dh5</link>
      <guid>https://dev.to/abraham_arellanotavara_7/the-quantum-threat-nobodys-taking-seriously-but-should-dh5</guid>
      <description>&lt;p&gt;&lt;em&gt;"We'll wait until quantum computers are actually here."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I hear this from security teams constantly. And every time, I cringe.&lt;/p&gt;

&lt;p&gt;Because they're missing the most dangerous part of the quantum threat: &lt;strong&gt;it's not coming—it's already here.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Attack That's Happening Right Now
&lt;/h2&gt;

&lt;p&gt;Adversaries aren't waiting for quantum computers to break your encryption. They're executing what's called &lt;strong&gt;"Harvest Now, Decrypt Later"&lt;/strong&gt; (HNDL) attacks—passively collecting your encrypted traffic today to decrypt in 2030-2035 when quantum computers mature.&lt;/p&gt;

&lt;p&gt;Your M&amp;amp;A negotiation emails from last month? Collected.&lt;/p&gt;

&lt;p&gt;Patient medical records from your healthcare system? Stored.&lt;/p&gt;

&lt;p&gt;Strategic defense communications? Archived.&lt;/p&gt;

&lt;p&gt;All waiting for Q-Day.&lt;/p&gt;

&lt;p&gt;The scary part? This is completely passive. No intrusion alerts. No failed login attempts. No evidence. Just silent collection of encrypted data that will become readable in a decade.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Math That Changes Everything
&lt;/h2&gt;

&lt;p&gt;Dr. Michele Mosca developed a simple formula that should terrify every security architect:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If X + Y &amp;gt; Z, you're at risk&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;X&lt;/strong&gt; = How long your data must stay secret&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Y&lt;/strong&gt; = How long migration takes
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Z&lt;/strong&gt; = Time until quantum computers arrive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's run this for a typical healthcare organization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;X = 30 years (HIPAA medical record retention)&lt;/li&gt;
&lt;li&gt;Y = 5 years (time to migrate complex systems)&lt;/li&gt;
&lt;li&gt;Z = 10 years (conservative quantum estimate)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;30 + 5 = 35 &amp;gt; 10&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They've already run out of time to wait.&lt;/p&gt;
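<p>The same check, spelled out as code (the numbers are the healthcare example above):</p>

```python
# The healthcare numbers above, plugged into Mosca's inequality.
X = 30  # years the data must stay secret (HIPAA retention)
Y = 5   # years the migration takes
Z = 10  # years until quantum computers (conservative)

at_risk = X + Y > Z
print(at_risk)  # True: 35 years of exposure against a 10-year runway
```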

&lt;h2&gt;
  
  
  The Financial Reality
&lt;/h2&gt;

&lt;p&gt;According to IBM's 2024 Data Breach Report, the average healthcare breach costs &lt;strong&gt;$9.77 million&lt;/strong&gt;. But that's for breaches discovered today.&lt;/p&gt;

&lt;p&gt;What about the quantum liability? Consider 10 years of patient data being harvested right now, then decrypted in 2035. At up to $50,000 per HIPAA violation, a mid-size healthcare provider could be looking at &lt;strong&gt;hundreds of millions in potential liability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And it's not just healthcare. Financial services process $500 billion daily. Government agencies hold state secrets that never expire. Even commercial enterprises have 5-10 year product roadmaps that competitors would pay millions to access.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Compliance Hammer
&lt;/h2&gt;

&lt;p&gt;The NSA's CNSA 2.0 isn't a suggestion—it's a mandate with hard deadlines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2025&lt;/strong&gt;: Software/firmware signing transition begins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2027&lt;/strong&gt;: New government systems must support post-quantum crypto&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2030&lt;/strong&gt;: VPNs, routers, firewalls must be compliant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2035&lt;/strong&gt;: Complete quantum-resistant transition required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlj1ns2w3srx9i3n5o2a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlj1ns2w3srx9i3n5o2a.jpg" alt=" " width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're in government, defense, or their supply chain, you must comply or lose contracts. And those requirements cascade down through vendors and subcontractors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "Wait for Standards" Fails
&lt;/h2&gt;

&lt;p&gt;The most common response I hear: &lt;em&gt;"We'll wait until the standards mature."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here's the problem with that strategy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standards ARE finalized.&lt;/strong&gt; NIST published FIPS 203, 204, and 205 in August 2024. The "wait for standards" excuse expired over a year ago.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Migration takes 5-10 years.&lt;/strong&gt; This isn't a weekend deployment. It's discovery, planning, pilot programs, production rollout, and legacy system transitions. For complex enterprises, that's easily a decade.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data is being harvested NOW.&lt;/strong&gt; Every day you wait is another day of encrypted traffic being collected for future decryption.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;This isn't about whether quantum computers will break RSA encryption. They will.&lt;/p&gt;

&lt;p&gt;It's not about whether post-quantum standards exist. They do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's about time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For most organizations with sensitive data, the calculation is clear: if your data must stay secret for longer than the time remaining until quantum computers arrive, minus the time your migration takes, you are already exposed.&lt;/p&gt;

&lt;p&gt;The question isn't whether to migrate to post-quantum cryptography. It's whether you'll start before or after your data gets harvested.&lt;/p&gt;




&lt;h2&gt;
  
  
  Want the Full Analysis?
&lt;/h2&gt;

&lt;p&gt;I've written a comprehensive deep-dive covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete three-phase HNDL attack patterns and how they work&lt;/li&gt;
&lt;li&gt;Industry-specific risk calculations (healthcare, financial, government, enterprise)&lt;/li&gt;
&lt;li&gt;Detailed CNSA 2.0 compliance timeline with specific deadlines&lt;/li&gt;
&lt;li&gt;Why the $4.88M average breach cost dramatically underestimates quantum-era exposure&lt;/li&gt;
&lt;li&gt;Strategic migration frameworks and vendor dependency management&lt;/li&gt;
&lt;li&gt;What's actually vulnerable vs. safe in your current crypto stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Read the full article:&lt;/strong&gt; &lt;a href="https://myitbasics.com/quantum-threat-harvest-now-decrypt-later/" rel="noopener noreferrer"&gt;The Quantum Threat: Why "Harvest Now, Decrypt Later" Means Your Data Is Already at Risk&lt;/a&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>architecture</category>
      <category>quantum</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why Your Authentication Architecture Is Your Biggest Security Blind Spot</title>
      <dc:creator>Abraham Arellano Tavara</dc:creator>
      <pubDate>Sat, 25 Oct 2025 18:20:56 +0000</pubDate>
      <link>https://dev.to/abraham_arellanotavara_7/why-your-authentication-architecture-is-your-biggest-security-blind-spot-2b3</link>
      <guid>https://dev.to/abraham_arellanotavara_7/why-your-authentication-architecture-is-your-biggest-security-blind-spot-2b3</guid>
      <description>&lt;p&gt;Every second, millions of authentication decisions are being made across global networks. Each one is a potential point of vulnerability—or a fortress of trust.&lt;/p&gt;

&lt;p&gt;After architecting authentication systems across diverse infrastructures for years, I've noticed something troubling: most technical teams focus on &lt;em&gt;implementing&lt;/em&gt; authentication methods while completely missing the architectural foundations that determine whether their systems will stand or fall under attack.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Authentication Paradox
&lt;/h2&gt;

&lt;p&gt;Here's the challenge that keeps security architects up at night: authentication seems deceptively simple at first. Verify the user is who they claim to be. Easy, right?&lt;/p&gt;

&lt;p&gt;But in practice, this spawns a complex web of technical decisions that ripple through every layer of your system. Like a medieval castle's defense system, modern authentication must protect multiple entry points while maintaining efficient access for legitimate users.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Pillars You Can't Ignore
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fza1c38c8oezedf0dd868.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fza1c38c8oezedf0dd868.webp" alt="Security 4 pillars" width="800" height="651"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram above illustrates something critical that many architects overlook: authentication isn't just about the login screen. It's a complete architectural layer that touches every component of your system.&lt;/p&gt;

&lt;p&gt;Modern authentication architecture rests on four interconnected pillars:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validation Mechanisms&lt;/strong&gt; - Gone are the days of simple password checks. Today's systems orchestrate a sophisticated ballet of verification methods, from biometric validation to behavioral analysis. Each mechanism must work in concert, creating harmony between security and usability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security Boundaries&lt;/strong&gt; - Think of these as fortified vaults within vaults. Each boundary must protect its contents &lt;em&gt;and&lt;/em&gt; resist attacks on its own infrastructure. Minor boundary breaches can cascade into major security incidents—I've seen it happen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trust Management&lt;/strong&gt; - Creating and maintaining trust states is like diplomatic relations between nations. Initial trust must be established through rigorous verification, then maintained through continuous validation that adapts to changing conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Handling&lt;/strong&gt; - Here's the counterintuitive part: how your system fails is as important as how it succeeds. Secure failure modes must prevent unauthorized access while maintaining availability and user experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Modern Threats Actually Look Like
&lt;/h2&gt;

&lt;p&gt;The authentication landscape in 2025 isn't what it was even two years ago. We're dealing with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distributed architecture complexity&lt;/strong&gt; where authentication must work seamlessly across microservices, multiple clouds, and hybrid environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sophisticated attack vectors&lt;/strong&gt; beyond simple password attacks—think credential stuffing, replay attacks, and AI-powered social engineering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The zero-trust imperative&lt;/strong&gt; where authentication serves as the new security perimeter, replacing outdated perimeter-based models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regulatory evolution&lt;/strong&gt; with GDPR, CCPA, and industry-specific requirements demanding more robust mechanisms&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Architect's Dilemma
&lt;/h2&gt;

&lt;p&gt;From the architect's perspective, every authentication decision creates a cascade of implications:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Architecture&lt;/strong&gt; - Authentication requirements fundamentally shape your entire stack, from database design to API structures. These decisions ripple through every layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance at Scale&lt;/strong&gt; - Authentication sits in the critical path of user interactions. Every millisecond matters. Modern systems must balance robust security with lightning-fast performance through sophisticated caching and optimized cryptographic operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security Defense-in-Depth&lt;/strong&gt; - Like a medieval castle with multiple walls and moats, your authentication must implement layered security with multiple validation checkpoints and separated security contexts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability Engineering&lt;/strong&gt; - As systems grow, authentication must scale proportionally. This isn't just about handling more users—it's about maintaining security and performance under increasing load.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Vulnerabilities
&lt;/h2&gt;

&lt;p&gt;Here's what keeps me up at night: even the most robustly designed authentication systems can be vulnerable to subtle, sophisticated attack vectors that exploit their &lt;em&gt;physical implementation&lt;/em&gt; rather than their logical design.&lt;/p&gt;

&lt;p&gt;Side-channel attacks, timing analysis, cache behaviors, and microarchitectural vulnerabilities can all compromise authentication implementations in ways that traditional security testing completely misses.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Should Do Next
&lt;/h2&gt;

&lt;p&gt;If you're architecting authentication systems (or inheriting one), here are my immediate recommendations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audit your authentication boundaries&lt;/strong&gt; - Map out every trust boundary in your system and test for cascade failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure your authentication latency&lt;/strong&gt; - If you're adding more than 50ms to user interactions, you need optimization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review your failure modes&lt;/strong&gt; - How does your system fail? Does it fail securely?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan for scale&lt;/strong&gt; - Can your authentication system handle 10x your current load? 100x?&lt;/li&gt;
&lt;/ol&gt;
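<p>For recommendation 2, the measurement can start as simply as wrapping the call (a Python sketch; <code>authenticate</code> here is a hypothetical stand-in for your real credential check):</p>

```python
import time

def authenticate(user, password):
    """Hypothetical stand-in for a real credential check."""
    time.sleep(0.01)  # placeholder work
    return True

# Time one authentication round-trip against the 50 ms budget.
start = time.perf_counter()
authenticate("alice", "s3cret")
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"auth latency: {elapsed_ms:.1f} ms")  # investigate anything over 50 ms
```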

&lt;h2&gt;
  
  
  The Deep Dive
&lt;/h2&gt;

&lt;p&gt;This barely scratches the surface of what modern authentication architecture demands. In my comprehensive guide, I break down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The technical intricacies of password-based authentication beyond simple storage&lt;/li&gt;
&lt;li&gt;How hardware tokens actually work at a cryptographic level&lt;/li&gt;
&lt;li&gt;Real-world implementation challenges and solutions from actual production systems&lt;/li&gt;
&lt;li&gt;Performance optimization techniques that maintain security&lt;/li&gt;
&lt;li&gt;The emerging world of biometric authentication and its architectural implications&lt;/li&gt;
&lt;li&gt;How side-channel attacks can compromise even well-designed systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://myitbasics.com/authentication-architecture/" rel="noopener noreferrer"&gt;Read the full technical deep dive on Authentication Architecture&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The guide includes detailed diagrams, code examples, and architectural patterns drawn from years of production experience. Whether you're building a new system or securing an existing one, understanding these foundations is crucial.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your Turn
&lt;/h2&gt;

&lt;p&gt;What authentication challenges are you facing in your architecture? Have you discovered any surprising vulnerabilities in your systems? Drop your experiences in the comments—I'd love to hear how other architects are tackling these problems.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Looking for more practical security architecture insights? Check out my blog at &lt;a href="https://myitbasics.com" rel="noopener noreferrer"&gt;myitbasics.com&lt;/a&gt; where I share technical deep dives on building secure, scalable systems.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>architecture</category>
      <category>authentication</category>
      <category>webdev</category>
    </item>
    <item>
      <title>After Asana's AI Breach: What It Takes to Deploy Production AI Agents Securely</title>
      <dc:creator>Abraham Arellano Tavara</dc:creator>
      <pubDate>Sat, 18 Oct 2025 19:53:10 +0000</pubDate>
      <link>https://dev.to/abraham_arellanotavara_7/after-asanas-ai-breach-what-it-takes-to-deploy-production-ai-agents-securely-2c84</link>
      <guid>https://dev.to/abraham_arellanotavara_7/after-asanas-ai-breach-what-it-takes-to-deploy-production-ai-agents-securely-2c84</guid>
      <description>&lt;p&gt;When Asana's Model Context Protocol server leaked data from ~1,000 organizations due to a session isolation flaw in May 2025, it crystallized a question I hear constantly from enterprise CTOs: &lt;strong&gt;"Can we actually deploy AI agents without creating the next security incident?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk93ns8k32hpxsw8v65rm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk93ns8k32hpxsw8v65rm.png" alt="Amazon Bedrock AgentCore Architecture" width="800" height="579"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After spending the past year deploying Amazon Bedrock AgentCore with customers across Europe—from 18-year-old SAP systems to regulated financial services—I've learned that moving AI agents from prototype to production isn't a framework problem. It's an &lt;strong&gt;infrastructure problem&lt;/strong&gt; that most teams discover too late.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-Month vs 6-Month Gap
&lt;/h2&gt;

&lt;p&gt;Here's the uncomfortable pattern I see repeatedly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3 months:&lt;/strong&gt; Build an impressive AI agent demo&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6 months:&lt;/strong&gt; Solve infrastructure problems you didn't know existed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gap isn't about choosing LangChain vs CrewAI or Claude vs GPT. It's about infrastructure challenges that traditional application architectures never had to solve.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4 Infrastructure Problems That Kill Production Deployments
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Session Isolation (The Asana Problem)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The issue:&lt;/strong&gt; Traditional stateless functions terminate after each request. AI agents maintain complex state across multiple interactions—conversation history, tool permissions, intermediate computations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world impact:&lt;/strong&gt; Cross-tenant data contamination when one user's agent context bleeds into another's session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production solution:&lt;/strong&gt; Each user session requires its own dedicated microVM with isolated compute, memory, and filesystem resources, terminated completely once the session ends.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What actually happens in production
&lt;/span&gt;&lt;span class="n"&gt;runtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;container_image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent:latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;protocol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGENT_CORE_RPC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;memory_size_mb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vcpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Each session gets its own isolated microVM
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;runtime_session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-12345&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Isolated session
&lt;/span&gt;    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze Q4 financials&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Long-Running Workflows
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The issue:&lt;/strong&gt; Research agents analyzing competitive intelligence or processing regulatory documents can't complete in Lambda's 15-minute window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world example:&lt;/strong&gt; A financial services agent analyzing SEC filings needs to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetch documents (5-10 min)&lt;/li&gt;
&lt;li&gt;Parse and extract data (15-20 min)&lt;/li&gt;
&lt;li&gt;Cross-reference with historical data (10-15 min)&lt;/li&gt;
&lt;li&gt;Generate compliance report (5-10 min)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total time:&lt;/strong&gt; 35-55 minutes&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you need:&lt;/strong&gt; Agent sessions lasting up to 8 hours for multi-step agentic workflows.&lt;/p&gt;
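&lt;p&gt;The arithmetic from the list above, as a quick sanity check against Lambda's 15-minute timeout and an 8-hour agent session (the stage durations are the article's estimates for the SEC-filing example):&lt;/p&gt;

```python
# Illustrative arithmetic only: stage durations (minutes) from the SEC-filing
# example, checked against Lambda's 15-minute limit and an 8-hour session.
stages = {
    "fetch_documents": (5, 10),
    "parse_and_extract": (15, 20),
    "cross_reference": (10, 15),
    "generate_report": (5, 10),
}

total_min = sum(lo for lo, hi in stages.values())   # 35 minutes
total_max = sum(hi for lo, hi in stages.values())   # 55 minutes

LAMBDA_LIMIT_MIN = 15              # hard Lambda timeout
AGENT_SESSION_LIMIT_MIN = 8 * 60   # 8-hour agent session

fits_in_lambda = total_max <= LAMBDA_LIMIT_MIN                 # False
fits_in_agent_session = total_max <= AGENT_SESSION_LIMIT_MIN   # True
print(total_min, total_max, fits_in_lambda, fits_in_agent_session)
```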

&lt;h3&gt;
  
  
  3. Identity Complexity
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The issue:&lt;/strong&gt; A single agent invocation might require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OAuth authentication from the user&lt;/li&gt;
&lt;li&gt;IAM roles for AWS resources&lt;/li&gt;
&lt;li&gt;API keys for third-party services&lt;/li&gt;
&lt;li&gt;All while maintaining proper permission boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The gotcha I see constantly:&lt;/strong&gt; OAuth token expiration during long-running sessions manifests as tool invocation failures after 60-90 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production fix:&lt;/strong&gt; Implement token refresh logic in your middleware rather than relying on cached credentials.&lt;/p&gt;
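&lt;p&gt;A minimal sketch of that middleware fix, assuming a hypothetical &lt;code&gt;OAuthToken&lt;/code&gt; wrapper (this is not an AgentCore API, just the expiry-skew pattern: check the token before every tool call instead of trusting a cached credential):&lt;/p&gt;

```python
# Hedged sketch: refresh OAuth tokens ahead of expiry in middleware, rather
# than caching a credential that silently dies after 60-90 minutes.
# OAuthToken and refresh_cb are hypothetical names, not a real AgentCore API.
import time

class OAuthToken:
    def __init__(self, access_token, expires_at, refresh_cb):
        self.access_token = access_token
        self.expires_at = expires_at    # epoch seconds
        self._refresh_cb = refresh_cb   # provider-specific refresh call

    def get(self, skew_seconds=120):
        """Return a valid access token, refreshing ahead of expiry."""
        if time.time() >= self.expires_at - skew_seconds:
            self.access_token, self.expires_at = self._refresh_cb()
        return self.access_token

def fake_refresh():
    # Stand-in for a real OAuth token-refresh request.
    return "new-token", time.time() + 3600

token = OAuthToken("old-token", expires_at=time.time() + 10,
                   refresh_cb=fake_refresh)
print(token.get())  # within the skew window, so the middleware refreshes first
```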

&lt;h3&gt;
  
  
  4. Observability for Non-Deterministic Systems
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The challenge:&lt;/strong&gt; When an agent produces unexpected results, you need to trace not just &lt;em&gt;what&lt;/em&gt; happened, but &lt;em&gt;why&lt;/em&gt; the foundation model made specific reasoning decisions across potentially dozens of tool invocations.&lt;/p&gt;

&lt;p&gt;Traditional APM tools don't capture this level of detail.&lt;/p&gt;
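&lt;p&gt;A minimal sketch of what agent-level tracing has to add on top of APM: recording each tool invocation together with the model's stated rationale, so a bad answer can be walked back through the chain of decisions. All names here are illustrative, not a real tracing API:&lt;/p&gt;

```python
# Hedged sketch: a decorator that logs every tool call with the rationale the
# model gave for choosing it. Illustrative only; not a real APM integration.
import json
import time

TRACE = []

def traced_tool(name):
    def decorator(fn):
        def wrapper(*args, rationale="", **kwargs):
            start = time.time()
            result = fn(*args, **kwargs)
            TRACE.append({
                "tool": name,
                "rationale": rationale,  # why the model chose this tool
                "args": args,
                "elapsed_s": round(time.time() - start, 3),
            })
            return result
        return wrapper
    return decorator

@traced_tool("check_order_status")
def check_order_status(order_number):
    return {"order": order_number, "status": "shipped"}

check_order_status("4711", rationale="User asked where order 4711 is")
print(json.dumps(TRACE, indent=2))
```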

&lt;h2&gt;
  
  
  The SAP Integration Reality
&lt;/h2&gt;

&lt;p&gt;Here's a question from a recent architecture review: &lt;em&gt;"Can AgentCore connect to our SAP ECC 6.0 system?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The system: 18 years old, custom ABAP code, no REST APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is enterprise reality.&lt;/strong&gt; Most production systems weren't designed for modern API consumption.&lt;/p&gt;

&lt;p&gt;The pattern that works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# AgentCore Gateway + Lambda middleware pattern
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentcore&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Gateway&lt;/span&gt;

&lt;span class="n"&gt;sap_order_tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_order_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieve SAP order status using order number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lambda_function_arn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:lambda:eu-central-1:123456:function:sap-rfc-connector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;input_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Lambda function becomes your translation layer between the agent's expectations and SAP's proprietary RFC/BAPI protocols.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What actually fails in production:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network timeouts between Lambda and on-premises SAP&lt;/li&gt;
&lt;li&gt;OAuth token refresh during long sessions&lt;/li&gt;
&lt;li&gt;SAP-specific error codes that agents can't interpret&lt;/li&gt;
&lt;/ul&gt;
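&lt;p&gt;The third failure mode, SAP error codes agents can't interpret, is worth handling inside the Lambda translation layer itself. A hedged sketch with invented error codes and a stubbed RFC call (a real connector would go through RFC/BAPI, e.g. via PyRFC):&lt;/p&gt;

```python
# Hedged sketch of the Lambda "translation layer": map SAP-style error codes
# to hints an agent can act on. Codes and handler shape are illustrative.
SAP_ERROR_HINTS = {
    "VL602": "Order number not found. Ask the user to re-check the number.",
    "RFC_TIMEOUT": "SAP did not answer in time. Retry once, then escalate.",
}

class SapError(Exception):
    def __init__(self, code):
        self.code = code

def query_sap(order_number):
    # Stand-in for the real RFC/BAPI call to the on-premises system.
    if order_number == "000000":
        raise SapError("VL602")
    return "DELIVERED"

def lambda_handler(event, context):
    order_number = event["order_number"]
    try:
        status = query_sap(order_number)
    except SapError as exc:
        hint = SAP_ERROR_HINTS.get(exc.code, "Unexpected SAP error; escalate.")
        return {"ok": False, "error": exc.code, "agent_hint": hint}
    return {"ok": True, "order_number": order_number, "status": status}

print(lambda_handler({"order_number": "000000"}, None))
```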

&lt;h2&gt;
  
  
  Cost Reality Check
&lt;/h2&gt;

&lt;p&gt;When a customer asked about costs for 1,000 conversations daily (5 messages each, 3 tool calls per message), here's what it looked like in Frankfurt region:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runtime&lt;/strong&gt; (2 vCPU, 4GB, 8-min avg sessions): ~$4,200/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gateway&lt;/strong&gt; (15,000 tool calls daily): ~$225/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; (5,000 events daily): ~$375/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; (CloudWatch): ~$100/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total: ~$4,900/month&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The comparison that matters:&lt;/strong&gt; Building equivalent infrastructure in-house requires a senior engineer (€90K annually = €7,500/month) for 3+ months of development, plus ongoing maintenance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Break-even point:&lt;/strong&gt; 3 months&lt;/p&gt;
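&lt;p&gt;The line items above as plain arithmetic (these are the article's Frankfurt-region estimates, not price quotes):&lt;/p&gt;

```python
# Reproducing the cost comparison as arithmetic. Figures are the article's
# estimates; euros and dollars are treated as roughly comparable, as the
# article does.
agentcore_monthly = {
    "runtime": 4200,
    "gateway": 225,
    "memory": 375,
    "observability": 100,
}
total = sum(agentcore_monthly.values())  # ~$4,900/month

engineer_monthly_eur = 90_000 / 12       # a senior engineer at €7,500/month

# The managed service runs below the ongoing cost of the engineer who would
# build and maintain the equivalent in-house, before even counting the
# 3-month initial build.
print(total, engineer_monthly_eur, total < engineer_monthly_eur)
```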

&lt;h2&gt;
  
  
  When AgentCore Makes Sense
&lt;/h2&gt;

&lt;p&gt;✅ &lt;strong&gt;Yes, use it when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-tenant applications where session isolation is critical&lt;/li&gt;
&lt;li&gt;Regulated industries with audit requirements (finance, healthcare)&lt;/li&gt;
&lt;li&gt;Complex integrations across SAP, Salesforce, ServiceNow&lt;/li&gt;
&lt;li&gt;OAuth identity requirements where agents act on behalf of users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;❌ &lt;strong&gt;No, don't use it when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-frequency, sub-100ms latency requirements&lt;/li&gt;
&lt;li&gt;Simple automation tasks (single database queries)&lt;/li&gt;
&lt;li&gt;Budget constraints below $3-5K monthly&lt;/li&gt;
&lt;li&gt;You need complete infrastructure control&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Architecture Insight That Changed Everything
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AgentCore isn't competing with LangChain, CrewAI, or LlamaIndex.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AgentCore is the infrastructure those frameworks run on. Think &lt;strong&gt;Kubernetes for AI agents&lt;/strong&gt;—you bring your framework and model, AgentCore provides production-grade runtime, security, and operational tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  GDPR Reality for European Markets
&lt;/h2&gt;

&lt;p&gt;The critical gotcha I've seen catch multiple organizations:&lt;/p&gt;

&lt;p&gt;AgentCore Memory supports both short-term event retention and long-term storage. &lt;strong&gt;By default, long-term memory persists indefinitely.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You must configure time-to-live policies to comply with GDPR's right to erasure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My recommendation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;90-day retention for short-term memory&lt;/li&gt;
&lt;li&gt;Explicit deletion workflows for long-term storage&lt;/li&gt;
&lt;li&gt;Deploy in Frankfurt region (eu-central-1) for data residency&lt;/li&gt;
&lt;/ul&gt;
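&lt;p&gt;A minimal sketch of the 90-day retention check behind those recommendations: given stored memory events with timestamps, find the ones past the window and due for deletion. The event shape is hypothetical, not the AgentCore Memory schema:&lt;/p&gt;

```python
# Hedged sketch: flag memory events older than a 90-day retention window as
# deletion candidates. Event dicts here are illustrative, not a real schema.
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)

def expired_events(events, now=None):
    """Return events older than the retention window (deletion candidates)."""
    now = now or datetime.now(timezone.utc)
    return [e for e in events if now - e["created_at"] > RETENTION]

now = datetime(2025, 10, 1, tzinfo=timezone.utc)
events = [
    {"id": "a", "created_at": datetime(2025, 9, 1, tzinfo=timezone.utc)},  # 30 days old
    {"id": "b", "created_at": datetime(2025, 6, 1, tzinfo=timezone.utc)},  # ~120 days old
]
stale = expired_events(events, now=now)
print([e["id"] for e in stale])  # only "b" is past the 90-day window
```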

&lt;h2&gt;
  
  
  The "Start Simple" Pattern That Works
&lt;/h2&gt;

&lt;p&gt;Based on successful deployments:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 1-3:&lt;/strong&gt; Prototype in free tier (until Sep 16, 2025)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build agent using your preferred framework&lt;/li&gt;
&lt;li&gt;Deploy to AgentCore Runtime&lt;/li&gt;
&lt;li&gt;Integrate 2-3 tools through Gateway&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Week 4-10:&lt;/strong&gt; Pilot with 100-500 users&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor costs and observability&lt;/li&gt;
&lt;li&gt;Refine tool integrations&lt;/li&gt;
&lt;li&gt;Gather user feedback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Week 11+:&lt;/strong&gt; Production rollout&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with one use case&lt;/li&gt;
&lt;li&gt;Expand based on ROI&lt;/li&gt;
&lt;li&gt;Implement memory strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The teams that struggled:&lt;/strong&gt; Tried to migrate entire application portfolios at once without understanding cost implications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Session isolation isn't optional&lt;/strong&gt; for multi-tenant agents. The Asana incident demonstrated what happens when isolation fails.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integration complexity compounds quickly.&lt;/strong&gt; Every additional backend system adds authentication layers, error handling, and monitoring. Gateway's automatic conversion eliminates months of work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Production agents require production infrastructure.&lt;/strong&gt; Memory management, observability, and identity controls aren't features you add later—they're foundational.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Discussion Questions
&lt;/h2&gt;

&lt;p&gt;I'd love to hear your perspective:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Have you deployed AI agents in production? What infrastructure challenges surprised you?&lt;/li&gt;
&lt;li&gt;For those running SAP or legacy systems—how are you handling integration?&lt;/li&gt;
&lt;li&gt;What's your biggest concern: security, cost, or complexity?&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Full technical deep-dive (with code examples, architecture diagrams, and cost breakdowns):&lt;/strong&gt;&lt;br&gt;
👉 &lt;a href="https://myitbasics.com/deploy-ai-agents-production-aws-agentcore/" rel="noopener noreferrer"&gt;Read the complete guide on MyITBasics&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete SAP integration architecture with authentication flows&lt;/li&gt;
&lt;li&gt;Regional deployment strategies for GDPR compliance&lt;/li&gt;
&lt;li&gt;Debugging common production issues&lt;/li&gt;
&lt;li&gt;Implementation quickstart with Dockerfile&lt;/li&gt;
&lt;li&gt;All 7 AgentCore services explained in detail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Abraham Arellano Tavara | Senior Strategic Solutions Architect, AWS Munich&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>machinelearning</category>
      <category>security</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>I Tested GPU Time-Slicing With Real LLMs So You Don't Have To 🚀</title>
      <dc:creator>Abraham Arellano Tavara</dc:creator>
      <pubDate>Mon, 29 Sep 2025 19:11:58 +0000</pubDate>
      <link>https://dev.to/abraham_arellanotavara_7/i-tested-gpu-time-slicing-with-real-llms-so-you-dont-have-to-2n9d</link>
      <guid>https://dev.to/abraham_arellanotavara_7/i-tested-gpu-time-slicing-with-real-llms-so-you-dont-have-to-2n9d</guid>
      <description>&lt;h2&gt;
  
  
  I Tested GPU Time-Slicing With Real LLMs So You Don't Have To 🚀
&lt;/h2&gt;

&lt;h2&gt;
  
  
  🎯 TL;DR - The Numbers Don't Lie
&lt;/h2&gt;

&lt;p&gt;I spent a week testing NVIDIA time-slicing on AWS EKS with &lt;strong&gt;real LLM workloads&lt;/strong&gt; (not toy examples). Here's what actually happens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Time-slicing overhead&lt;/strong&gt;: Only ~1% (NVIDIA crushed this)&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Concurrent workloads&lt;/strong&gt;: 50-100% performance degradation (physics can't be cheated)&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;Cost savings&lt;/strong&gt;: 50% reduction for sequential workloads&lt;/li&gt;
&lt;li&gt;🎯 &lt;strong&gt;Best use&lt;/strong&gt;: Dev/test environments, time-shifted workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottom line&lt;/strong&gt;: Time-slicing is brilliant for isolation, terrible for concurrent performance.&lt;/p&gt;

&lt;p&gt;📦 &lt;strong&gt;Full code, configs, and test scripts&lt;/strong&gt;: &lt;a href="https://github.com/AbrahamArellano/eks-shared-gpu-ai-performance" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🔑 Quick Reference - Key Terms
&lt;/h2&gt;

&lt;p&gt;Before we dive deep, here's your decoder ring:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;th&gt;Why You Care&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time-Slicing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPU virtualization creating multiple virtual GPUs from one physical GPU&lt;/td&gt;
&lt;td&gt;Lets multiple apps share a GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OOM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Out Of Memory - when GPU runs out of VRAM&lt;/td&gt;
&lt;td&gt;Your pods crash mysteriously&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TGI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Text Generation Inference - HuggingFace's LLM serving engine&lt;/td&gt;
&lt;td&gt;Industry standard for serving models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Concurrent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multiple workloads running simultaneously&lt;/td&gt;
&lt;td&gt;Where performance degradation happens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sequential&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Workloads running one after another&lt;/td&gt;
&lt;td&gt;Where time-slicing shines&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  💸 The $500 Question That Started This
&lt;/h2&gt;

&lt;p&gt;Picture this: You're running two LLM models in production. That's &lt;strong&gt;$2/hour&lt;/strong&gt; for two GPU instances. Over a month, that's &lt;strong&gt;$1,440&lt;/strong&gt;. Your CFO is asking why the GPU bill is so high.&lt;/p&gt;

&lt;p&gt;Then someone mentions NVIDIA time-slicing: "Just share one GPU between both models!"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The question everyone asks&lt;/strong&gt;: Does this actually work without destroying performance?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The answer everyone gives&lt;/strong&gt;: &lt;em&gt;"It depends..."&lt;/em&gt; (not helpful)&lt;/p&gt;

&lt;p&gt;So I decided to test it with &lt;strong&gt;real production workloads&lt;/strong&gt; and actual performance measurement. No toy examples. No theoretical benchmarks. Just two real LLMs hammering a shared GPU.&lt;/p&gt;

&lt;p&gt;Spoiler: The results surprised me.&lt;/p&gt;




&lt;h2&gt;
  
  
  🏗️ The Test Lab Setup
&lt;/h2&gt;

&lt;p&gt;Here's what I built for this experiment:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnw4myoha1tymw9m44r8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnw4myoha1tymw9m44r8.png" alt="Test Lab Setup" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  🎮 The Hardware
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: NVIDIA L40S (46GB VRAM) - The new hotness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instance&lt;/strong&gt;: g6e.2xlarge (~$1.01/hour in us-west-2)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: Much cheaper than p3.8xlarge ($12.24/hour)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes&lt;/strong&gt;: EKS 1.32 with NVIDIA GPU Operator&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  🤖 The Contenders
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Model A: Microsoft Phi-3.5-mini-instruct&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Size: ~4GB memory footprint&lt;/li&gt;
&lt;li&gt;Speed: Fast inference (&amp;lt; 1 second)&lt;/li&gt;
&lt;li&gt;Use case: Quick responses, high throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Model B: DeepSeek-R1-Distill-Llama-8B&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Size: ~8GB memory footprint
&lt;/li&gt;
&lt;li&gt;Speed: Slower but more thoughtful (~1 second)&lt;/li&gt;
&lt;li&gt;Use case: Complex reasoning, detailed outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Both running&lt;/strong&gt;: HuggingFace Text Generation Inference (TGI) 3.3.4&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Why these models?&lt;/strong&gt; They represent real production workloads - different sizes, different performance profiles, and combined they use ~12GB (26% of available 46GB).&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  🔥 The 3 Mistakes I Made (So You Don't Have To)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Mistake #1: "GPUs Just Work™" (They Don't)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What I expected&lt;/strong&gt;: Spin up g6e.2xlarge, GPU drivers already installed (like p3 instances)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What actually happened&lt;/strong&gt;: No GPU detected. Pods stuck in &lt;code&gt;Pending&lt;/code&gt;. Panic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe pod
&lt;span class="c"&gt;# Events: 0/1 nodes available: insufficient nvidia.com/gpu&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The plot twist&lt;/strong&gt;: Unlike p3 instances, g6e.2xlarge doesn't come with pre-installed NVIDIA drivers in EKS managed node groups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix that saved the day&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# NVIDIA GPU Operator does ALL the heavy lifting&lt;/span&gt;
helm &lt;span class="nb"&gt;install &lt;/span&gt;gpu-operator nvidia/gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; nodeSelector.eks-node&lt;span class="o"&gt;=&lt;/span&gt;gpu &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This magical operator automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Installs NVIDIA drivers&lt;/li&gt;
&lt;li&gt;✅ Configures container toolkit
&lt;/li&gt;
&lt;li&gt;✅ Deploys device plugin&lt;/li&gt;
&lt;li&gt;✅ Sets up GPU feature discovery&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Pro tip&lt;/strong&gt;: Always use GPU Operator for modern EKS setups. Manual driver installation is a pain.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Mistake #2: "Just Deploy Both Models" (OOM Speedrun)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What I tried&lt;/strong&gt;: Deploy both models with default settings&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened&lt;/strong&gt;: Both pods started... then crashed with cryptic errors&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RuntimeError: CUDA out of memory. Tried to allocate 20.00 GiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: Each model tried to grab ~80% of GPU memory. Math doesn't work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model A: 80% × 46GB = 36.8GB&lt;/li&gt;
&lt;li&gt;Model B: 80% × 46GB = 36.8GB
&lt;/li&gt;
&lt;li&gt;Total needed: 73.6GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Available: 46GB&lt;/strong&gt; ❌&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Aggressive memory limits per model&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--cuda-memory-fraction"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.4"&lt;/span&gt;  &lt;span class="c1"&gt;# 🎯 Only use 40% GPU memory per model&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--max-batch-prefill-tokens"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4096"&lt;/span&gt;  &lt;span class="c1"&gt;# ⚠️ Reduced from default 8192&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--max-input-length"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;256"&lt;/span&gt;  &lt;span class="c1"&gt;# 🔒 Limit input size&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--max-total-tokens"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512"&lt;/span&gt;  &lt;span class="c1"&gt;# 🔒 Limit output size&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The math that works&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model A: 40% × 46GB = 18.4GB ✅&lt;/li&gt;
&lt;li&gt;Model B: 40% × 46GB = 18.4GB ✅&lt;/li&gt;
&lt;li&gt;Total: 36.8GB (80% utilization) ✅&lt;/li&gt;
&lt;li&gt;System overhead: 20% buffer ✅&lt;/li&gt;
&lt;/ul&gt;
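&lt;p&gt;That allocation arithmetic generalizes into a quick check: will N models with a given &lt;code&gt;cuda-memory-fraction&lt;/code&gt; fit on a GPU with headroom to spare? Values below match the L40S example (46GB, two models at 0.4):&lt;/p&gt;

```python
# The memory math above as a reusable check: total VRAM claimed by each
# model's cuda-memory-fraction versus what the GPU actually has.
def fits(gpu_gb, fractions, min_headroom=0.10):
    """Return (used_gb, headroom_ratio, ok) for a set of memory fractions."""
    used = sum(f * gpu_gb for f in fractions)
    headroom = (gpu_gb - used) / gpu_gb
    return used, headroom, headroom >= min_headroom

# Default settings: two models each grabbing ~80% cannot fit (negative headroom)
print(fits(46, [0.8, 0.8]))

# Fixed settings: two models at 40% each use 36.8GB and leave a 20% buffer
print(fits(46, [0.4, 0.4]))
```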

&lt;blockquote&gt;
&lt;p&gt;🚨 &lt;strong&gt;Critical setting&lt;/strong&gt;: Without &lt;code&gt;cuda-memory-fraction&lt;/code&gt;, models will OOM during warmup. This isn't optional!&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Mistake #3: "Time-Slicing Config Is Obvious" (It's Not)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What the docs say&lt;/strong&gt;: Create a ConfigMap&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What they don't say&lt;/strong&gt;: You need TWO ConfigMaps and an operator upgrade&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The complete configuration&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ConfigMap 1: Time-slicing configuration&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;time-slicing-config&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-operator&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
    &lt;span class="s"&gt;version: v1&lt;/span&gt;
    &lt;span class="s"&gt;sharing:&lt;/span&gt;
      &lt;span class="s"&gt;timeSlicing:&lt;/span&gt;
        &lt;span class="s"&gt;resources:&lt;/span&gt;
        &lt;span class="s"&gt;- name: nvidia.com/gpu&lt;/span&gt;
          &lt;span class="s"&gt;replicas: 10  # 🎯 10 virtual GPUs from 1 physical&lt;/span&gt;

&lt;span class="s"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# ConfigMap 2: Device plugin config&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;device-plugin-config&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-operator&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
    &lt;span class="s"&gt;version: v1&lt;/span&gt;
    &lt;span class="s"&gt;flags:&lt;/span&gt;
      &lt;span class="s"&gt;migStrategy: none&lt;/span&gt;
    &lt;span class="s"&gt;sharing:&lt;/span&gt;
      &lt;span class="s"&gt;timeSlicing:&lt;/span&gt;
        &lt;span class="s"&gt;renameByDefault: false&lt;/span&gt;
        &lt;span class="s"&gt;failRequestsGreaterThanOne: false&lt;/span&gt;
        &lt;span class="s"&gt;resources:&lt;/span&gt;
        &lt;span class="s"&gt;- name: nvidia.com/gpu&lt;/span&gt;
          &lt;span class="s"&gt;replicas: 10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Then upgrade the operator&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm upgrade gpu-operator nvidia/gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; devicePlugin.config.name&lt;span class="o"&gt;=&lt;/span&gt;device-plugin-config &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify it worked&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe node &amp;lt;gpu-node&amp;gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;nvidia.com/gpu

&lt;span class="c"&gt;# Before:  nvidia.com/gpu: 1  ❌&lt;/span&gt;
&lt;span class="c"&gt;# After:   nvidia.com/gpu: 10 ✅&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;🎉 &lt;strong&gt;Success&lt;/strong&gt;: Your cluster now advertises 10 virtual GPUs instead of 1!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;What this means&lt;/strong&gt;: You can now schedule 10 pods requesting &lt;code&gt;nvidia.com/gpu: 1&lt;/code&gt; on a single physical GPU.&lt;/p&gt;
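To confirm scheduling works, a pod asks for one of those virtual GPUs exactly as it would ask for a dedicated one; the slicing is invisible to the workload. A minimal sketch (pod name and image tag are placeholders of mine, not from the article):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test        # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-check
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]   # prints the GPU if scheduling succeeded
    resources:
      limits:
        nvidia.com/gpu: 1     # consumes 1 of the 10 time-sliced replicas
```

Ten copies of this pod can land on the single physical GPU; an eleventh stays Pending.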




&lt;h2&gt;
  
  
  📊 The Results (Prepare to Be Surprised)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test Scenario 1: Individual Performance (No Competition)
&lt;/h3&gt;

&lt;p&gt;First, I tested each model alone with time-slicing enabled. &lt;strong&gt;Would time-slicing itself add overhead?&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Phi-3.5-Mini Flying Solo
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Success Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time-sliced GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.609s&lt;/td&gt;
&lt;td&gt;98.44 req/min&lt;/td&gt;
&lt;td&gt;100% ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exclusive GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.603s&lt;/td&gt;
&lt;td&gt;99.46 req/min&lt;/td&gt;
&lt;td&gt;100% ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+0.006s&lt;/td&gt;
&lt;td&gt;-1.02 req/min&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Overhead: ~1%&lt;/strong&gt; 🎉&lt;/p&gt;

&lt;h4&gt;
  
  
  DeepSeek-R1 Flying Solo
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Success Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time-sliced GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.135s&lt;/td&gt;
&lt;td&gt;52.84 req/min&lt;/td&gt;
&lt;td&gt;100% ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exclusive GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.142s&lt;/td&gt;
&lt;td&gt;52.49 req/min&lt;/td&gt;
&lt;td&gt;100% ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-0.007s&lt;/td&gt;
&lt;td&gt;+0.35 req/min&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Overhead: effectively zero&lt;/strong&gt; (actually slightly faster!) 🤯&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Key Insight #1&lt;/strong&gt;: NVIDIA time-slicing overhead is &lt;strong&gt;negligible&lt;/strong&gt;. The virtualization layer is incredibly efficient. This is exceptional engineering.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Test Scenario 2: Concurrent Performance (The Real Test)
&lt;/h3&gt;

&lt;p&gt;Now both models hit the GPU &lt;strong&gt;simultaneously&lt;/strong&gt;, with every request fired to both models at the same time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is where reality hits.&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Phi-3.5-Mini Under Fire
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;Concurrent&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.609s&lt;/td&gt;
&lt;td&gt;1.227s&lt;/td&gt;
&lt;td&gt;🔴 &lt;strong&gt;+101.4%&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;98.44 req/min&lt;/td&gt;
&lt;td&gt;48.89 req/min&lt;/td&gt;
&lt;td&gt;🔴 &lt;strong&gt;-50.3%&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Success Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;✅ Still stable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  DeepSeek-R1 Under Fire
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;Concurrent&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.135s&lt;/td&gt;
&lt;td&gt;1.778s&lt;/td&gt;
&lt;td&gt;🔴 &lt;strong&gt;+56.6%&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;52.84 req/min&lt;/td&gt;
&lt;td&gt;33.74 req/min&lt;/td&gt;
&lt;td&gt;🔴 &lt;strong&gt;-36.1%&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Success Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;✅ Still stable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;🚨 &lt;strong&gt;Key Insight #2&lt;/strong&gt;: Resource competition is BRUTAL. When both models compete for the same GPU, latency climbs by 50-100% and throughput drops by a third to a half.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  📈 Visual Performance Comparison
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Individual Performance (Time-Slicing Overhead)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Exclusive GPU:    ████████████████████ 100%
Time-Sliced GPU:  ███████████████████░ 99%
                  ↑ Only 1% difference!

Concurrent Performance (Resource Competition)  
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Baseline:         ████████████████████ 100%
Concurrent:       ██████████░░░░░░░░░░ 50%
                  ↑ Ouch. Physics can't be cheated.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  🤔 Why This Happens (The Physics)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Time-slicing overhead (~1%)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Context switching is fast&lt;/li&gt;
&lt;li&gt;✅ Memory isolation is efficient
&lt;/li&gt;
&lt;li&gt;✅ Scheduling overhead is minimal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Resource competition (50-100% degradation)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Both models fight for GPU cores&lt;/li&gt;
&lt;li&gt;❌ Memory bandwidth saturation&lt;/li&gt;
&lt;li&gt;❌ L2 cache thrashing&lt;/li&gt;
&lt;li&gt;❌ Shared memory contention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The verdict&lt;/strong&gt;: Time-slicing technology is brilliant. GPU resource sharing is expensive.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎯 The Decision Framework (Should YOU Use Time-Slicing?)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ Perfect Use Cases - Deploy With Confidence
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Development &amp;amp; Testing Environments&lt;/strong&gt; 🧪&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: QA team needs to test 3 model versions
Cost without time-slicing: $3/hour (3 GPUs)
Cost with time-slicing: $1/hour (1 GPU)
Savings: $1,440/month
Performance impact: None (sequential testing)
Verdict: Slam dunk ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Time-Shifted Workloads&lt;/strong&gt; ⏰&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Model A (business hours), Model B (batch processing at night)
Overlap: &amp;lt; 10% of time
Performance: 99% (negligible overhead when not competing)
Savings: 50% GPU costs
Verdict: Perfect fit ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Demo &amp;amp; POC Deployments&lt;/strong&gt; 🎬&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Sales demo with multiple model comparisons
Requirements: Not production, occasional use
Budget: Limited
Performance needs: "Good enough"
Verdict: Ideal use case ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. CI/CD Model Testing&lt;/strong&gt; 🔄&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Automated model validation pipelines
Pattern: Sequential test runs
Peak load: One test at a time
Cost optimization: Critical
Verdict: Great match ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  ❌ Terrible Use Cases - Avoid These
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Production Inference Serving&lt;/strong&gt; 💼&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Customer-facing API with SLA requirements
Requirement: &amp;lt; 100ms response time
Concurrent load: Unpredictable spikes
Impact: 50-100% degradation = SLA violations
Verdict: Don't even think about it ❌
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. High-Throughput Concurrent Workloads&lt;/strong&gt; 🚀&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Multiple models serving real-time traffic
Load pattern: Constant concurrent requests
Performance impact: Immediate 50% throughput loss
Business impact: Lost revenue, poor UX
Verdict: Hard pass ❌
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Latency-Sensitive Applications&lt;/strong&gt; ⚡&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Real-time chat, autocomplete, voice assistants
SLA: Sub-second responses required
Concurrent degradation: Doubles latency
User impact: Frustrated users, high churn
Verdict: Nope ❌
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Auto-Scaling Production Workloads&lt;/strong&gt; 📈&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Traffic scales unpredictably
Problem: Can't predict when models compete
Risk: Performance collapse during peak times
Business impact: Revenue loss during high-traffic
Verdict: Too risky ❌
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  🤔 Decision Tree - Find Your Path
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Start Here
    │
    ├─ Is this production? ─── YES ──→ Will workloads overlap?
    │                                       │
    │                                       ├─ YES ──→ ❌ Don't use time-slicing
    │                                       │
    │                                       └─ NO ───→ ✅ Consider time-slicing
    │
    └─ NO (Dev/Test) ─────────────────────→ ✅ Use time-slicing
                                                 (perfect use case!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  💰 ROI Calculator - Your Break-Even Analysis
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Without Time-Slicing&lt;/th&gt;
&lt;th&gt;With Time-Slicing&lt;/th&gt;
&lt;th&gt;Monthly Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2 Models, Sequential&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1,440&lt;/td&gt;
&lt;td&gt;$720&lt;/td&gt;
&lt;td&gt;$720 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2 Models, 30% Overlap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1,440&lt;/td&gt;
&lt;td&gt;$720&lt;/td&gt;
&lt;td&gt;$720 (but some degradation) ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2 Models, 50% Overlap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1,440&lt;/td&gt;
&lt;td&gt;$720&lt;/td&gt;
&lt;td&gt;$720 (significant degradation) ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2 Models, Always Concurrent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1,440&lt;/td&gt;
&lt;td&gt;$720&lt;/td&gt;
&lt;td&gt;$720 (not worth it) ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Break-even point&lt;/strong&gt;: If your workloads overlap &amp;lt; 30% of the time, time-slicing typically provides net positive value.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Pro Tip&lt;/strong&gt;: Monitor actual workload overlap in production before deciding. Use CloudWatch metrics to track GPU utilization patterns.&lt;/p&gt;
&lt;/blockquote&gt;
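Before committing, you can put a number on that overlap. The helper below is my own sketch, not part of the article's tooling: it assumes you export one busy/idle sample per minute for each model (1 when the model was serving, 0 when idle) and reports the fraction of minutes where both were busy at once.

```shell
#!/bin/sh
# Estimate workload overlap from two per-minute utilization logs.
# Each log has one line per minute: 1 = model was busy, 0 = idle.
overlap_fraction() {
    # Pair the logs minute by minute, then count minutes where BOTH were busy.
    paste "$1" "$2" | awk '
        $1 == 1 && $2 == 1 { both++ }
        { total++ }
        END { printf "%.2f\n", (total ? both / total : 0) }'
}
```

`overlap_fraction model_a.log model_b.log` prints a fraction; if it stays under roughly 0.30, the break-even guidance above favors time-slicing.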




&lt;h2&gt;
  
  
  🧪 How I Tested This (Reproducible Science)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Testing Strategy
&lt;/h3&gt;

&lt;p&gt;I built an automated framework to eliminate human error and ensure reproducible results:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test Protocol&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;☝️ Test each model individually (establish baseline)&lt;/li&gt;
&lt;li&gt;✌️ Test both models concurrently (measure degradation)&lt;/li&gt;
&lt;li&gt;🔁 Repeat 3 times with 5 different prompts (45 requests total)&lt;/li&gt;
&lt;li&gt;📊 Calculate statistical averages and impact percentages&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Automation Script
&lt;/h3&gt;

&lt;p&gt;Here's the core testing logic (simplified):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# Complete performance testing framework&lt;/span&gt;

test_individual_model&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;
    &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;

    &lt;span class="c"&gt;# Test prompts covering different complexity levels&lt;/span&gt;
    &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;prompts&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;
        &lt;span class="s2"&gt;"Explain machine learning"&lt;/span&gt;
        &lt;span class="s2"&gt;"What is Python programming"&lt;/span&gt;
        &lt;span class="s2"&gt;"Describe cloud computing"&lt;/span&gt;
        &lt;span class="s2"&gt;"How does AI work"&lt;/span&gt;
        &lt;span class="s2"&gt;"What are automation benefits"&lt;/span&gt;
    &lt;span class="o"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;# Run 3 iterations for statistical accuracy&lt;/span&gt;
    &lt;span class="k"&gt;for &lt;/span&gt;iteration &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;1 3&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
        for &lt;/span&gt;prompt &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
            &lt;span class="c"&gt;# Measure with millisecond precision&lt;/span&gt;
            &lt;span class="nv"&gt;start_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s.%N&lt;span class="si"&gt;)&lt;/span&gt;

            &lt;span class="nv"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$endpoint&lt;/span&gt;&lt;span class="s2"&gt;/generate"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
                &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
                &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"{
                    &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;inputs&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$prompt&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,
                    &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;parameters&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: {
                        &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;max_new_tokens&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: 50,
                        &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;temperature&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: 0.7
                    }
                }"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

            &lt;span class="nv"&gt;end_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s.%N&lt;span class="si"&gt;)&lt;/span&gt;
            &lt;span class="nv"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$end_time&lt;/span&gt;&lt;span class="s2"&gt; - &lt;/span&gt;&lt;span class="nv"&gt;$start_time&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | bc&lt;span class="si"&gt;)&lt;/span&gt;

            &lt;span class="c"&gt;# Record results&lt;/span&gt;
            &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$duration&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;model_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_results.txt"&lt;/span&gt;
        &lt;span class="k"&gt;done
    done&lt;/span&gt;

    &lt;span class="c"&gt;# Calculate statistics&lt;/span&gt;
    calculate_stats &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;model_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_results.txt"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

test_concurrent_models&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;# Fire both requests simultaneously using background jobs&lt;/span&gt;
    &lt;span class="k"&gt;for &lt;/span&gt;prompt &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
        &lt;span class="c"&gt;# Model A request&lt;/span&gt;
        &lt;span class="o"&gt;{&lt;/span&gt;
            measure_latency &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PHI35_ENDPOINT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$prompt&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; phi_concurrent.txt
        &lt;span class="o"&gt;}&lt;/span&gt; &amp;amp;

        &lt;span class="c"&gt;# Model B request  &lt;/span&gt;
        &lt;span class="o"&gt;{&lt;/span&gt;
            measure_latency &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DEEPSEEK_ENDPOINT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$prompt&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; deepseek_concurrent.txt
        &lt;span class="o"&gt;}&lt;/span&gt; &amp;amp;

        &lt;span class="c"&gt;# Wait for both to complete&lt;/span&gt;
        &lt;span class="nb"&gt;wait
    &lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
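The script calls a <code>calculate_stats</code> helper whose body isn't shown. A minimal version consistent with the reported output might look like this (an assumption on my part; the repository's actual implementation may differ):

```shell
#!/bin/sh
# Summarize a results file containing one latency (in seconds) per line:
# request count, average latency, and throughput in requests per minute.
calculate_stats() {
    awk '
        { sum += $1; n++ }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            avg = sum / n
            printf "Total Requests: %d\n", n
            printf "Average Latency: %.3fs\n", avg
            printf "Throughput: %.2f req/min\n", 60 / avg
        }' "$1"
}
```

The throughput formula assumes requests are issued sequentially, matching how the individual baseline drives one request at a time.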



&lt;h3&gt;
  
  
  Kubernetes Scaling for Test Control
&lt;/h3&gt;

&lt;p&gt;The key trick: using Kubernetes deployment scaling to switch cleanly between test scenarios:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Test Phi-3.5 alone&lt;/span&gt;
kubectl scale deployment deepseek-r1-baseline &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing
&lt;span class="nb"&gt;sleep &lt;/span&gt;30  &lt;span class="c"&gt;# Wait for graceful shutdown&lt;/span&gt;
./load_test.sh

&lt;span class="c"&gt;# Test DeepSeek alone&lt;/span&gt;
kubectl scale deployment phi35-mini-baseline &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing
kubectl scale deployment deepseek-r1-baseline &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing
&lt;span class="nb"&gt;sleep &lt;/span&gt;30  &lt;span class="c"&gt;# Wait for startup&lt;/span&gt;
./load_test.sh

&lt;span class="c"&gt;# Test both concurrently&lt;/span&gt;
kubectl scale deployment phi35-mini-baseline &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing
&lt;span class="nb"&gt;sleep &lt;/span&gt;30  &lt;span class="c"&gt;# Wait for startup&lt;/span&gt;
./load_test.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Why this works&lt;/strong&gt;: Scaling deployments ensures clean test isolation without manual intervention or pod management.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What Made This Scientific
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Controlled environment&lt;/strong&gt;: No other GPU workloads running&lt;br&gt;
✅ &lt;strong&gt;Multiple iterations&lt;/strong&gt;: 3 runs × 5 prompts per scenario for stable averages&lt;br&gt;
✅ &lt;strong&gt;Standardized prompts&lt;/strong&gt;: Same inputs across all tests&lt;br&gt;
✅ &lt;strong&gt;Consistent parameters&lt;/strong&gt;: Same token limits, temperature&lt;br&gt;
✅ &lt;strong&gt;Automated execution&lt;/strong&gt;: Eliminates human timing errors&lt;br&gt;
✅ &lt;strong&gt;Millisecond precision&lt;/strong&gt;: Accurate latency measurement&lt;/p&gt;
&lt;h3&gt;
  
  
  Sample Output
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;===&lt;/span&gt; Phi-3.5-Mini &lt;span class="o"&gt;(&lt;/span&gt;Individual Baseline&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt;
Total Requests: 15
Successful: 15 &lt;span class="o"&gt;(&lt;/span&gt;100%&lt;span class="o"&gt;)&lt;/span&gt;
Average Latency: 0.609s
Throughput: 98.44 req/min

&lt;span class="o"&gt;===&lt;/span&gt; Phi-3.5-Mini &lt;span class="o"&gt;(&lt;/span&gt;Concurrent&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt;
Average Latency: 1.227s &lt;span class="o"&gt;(&lt;/span&gt;+101.4% 🔴&lt;span class="o"&gt;)&lt;/span&gt;
Throughput: 48.89 req/min &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;-50&lt;/span&gt;.3% 🔴&lt;span class="o"&gt;)&lt;/span&gt;

Report saved: test_results/GPU_SLICING_FULL_performance_report_20250725_095710.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;📦 &lt;strong&gt;Get the complete testing framework&lt;/strong&gt;: &lt;a href="https://github.com/AbrahamArellano/eks-shared-gpu-ai-performance" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  💰 The Money Talk - Real ROI Analysis
&lt;/h2&gt;

&lt;p&gt;Let's talk dollars and cents, because at the end of the day your CFO cares about the bottom line.&lt;/p&gt;
&lt;h3&gt;
  
  
  Scenario 1: Traditional Approach (Separate GPUs)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────┐
│  Model A: g6e.2xlarge           │
│  Cost: $1.01/hour               │
│  Performance: 100% ✅            │
└─────────────────────────────────┘

┌─────────────────────────────────┐
│  Model B: g6e.2xlarge           │
│  Cost: $1.01/hour               │
│  Performance: 100% ✅            │
└─────────────────────────────────┘

Total: $2.02/hour = $1,454/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 2: Time-Slicing (Sequential Workloads)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────┐
│  Single g6e.2xlarge             │
│                                 │
│  Model A (9am-5pm)  ──────┐    │
│  Model B (6pm-8am)  ──────┘    │
│                                 │
│  Cost: $1.01/hour               │
│  Performance: 99% ✅             │
└─────────────────────────────────┘

Total: $1.01/hour = $727/month
Savings: $727/month (50% reduction! 🎉)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;When this works&lt;/strong&gt;: Workloads naturally time-shifted (batch processing, different timezones, dev/staging)&lt;/p&gt;
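One way to enforce that hand-off automatically is a CronJob that flips the deployments on schedule. A sketch only: the deployment names (<code>model-a</code>, <code>model-b</code>) and the <code>scaler</code> ServiceAccount are placeholders, and the ServiceAccount needs RBAC permission to scale deployments:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: handover-to-model-b      # placeholder name
spec:
  schedule: "0 18 * * *"         # 6pm: Model A down, Model B up
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler   # placeholder; needs "scale" RBAC
          restartPolicy: Never
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command: ["/bin/sh", "-c"]
            args:
            - >
              kubectl scale deployment model-a --replicas=0 &&
              kubectl scale deployment model-b --replicas=1
```

A mirror CronJob at 8am reverses the swap, so the two models never compete for the GPU.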


&lt;h3&gt;
  
  
  Scenario 3: Time-Slicing (Concurrent Workloads)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────┐
│  Single g6e.2xlarge             │
│                                 │
│  Model A + Model B (competing)  │
│                                 │
│  Cost: $1.01/hour               │
│  Performance: 50% ⚠️             │
└─────────────────────────────────┘

Total: $1.01/hour = $727/month
Savings: $727/month
Trade-off: 50% performance loss 💀
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;When this fails&lt;/strong&gt;: Production inference, customer-facing APIs, latency-sensitive applications&lt;/p&gt;


&lt;h3&gt;
  
  
  The Financial Break-Even Matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload Overlap&lt;/th&gt;
&lt;th&gt;Cost Savings&lt;/th&gt;
&lt;th&gt;Performance&lt;/th&gt;
&lt;th&gt;Recommended?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;0-10%&lt;/strong&gt; (mostly sequential)&lt;/td&gt;
&lt;td&gt;50% ✅&lt;/td&gt;
&lt;td&gt;99% ✅&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Yes&lt;/strong&gt; 🎯&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;10-30%&lt;/strong&gt; (occasional overlap)&lt;/td&gt;
&lt;td&gt;50% ✅&lt;/td&gt;
&lt;td&gt;80-90% ⚠️&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Maybe&lt;/strong&gt; 🤔&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;30-50%&lt;/strong&gt; (frequent overlap)&lt;/td&gt;
&lt;td&gt;50% ✅&lt;/td&gt;
&lt;td&gt;60-80% ⚠️&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Risky&lt;/strong&gt; 😬&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;50%+&lt;/strong&gt; (mostly concurrent)&lt;/td&gt;
&lt;td&gt;50% ❌&lt;/td&gt;
&lt;td&gt;50% ❌&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;No&lt;/strong&gt; 🚫&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  Real-World Cost Example (My Consulting Client)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Their Setup&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dev environment: 2 models for A/B testing&lt;/li&gt;
&lt;li&gt;Usage pattern: Sequential (test Model A, then Model B)&lt;/li&gt;
&lt;li&gt;Previous cost: $1,440/month (2 GPUs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After Time-Slicing&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New cost: $720/month (1 GPU)&lt;/li&gt;
&lt;li&gt;Performance: 99% (negligible overhead)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Savings: $8,640/year&lt;/strong&gt; 💰&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;CFO's reaction&lt;/strong&gt;: "Why weren't we doing this before?"&lt;/p&gt;


&lt;h3&gt;
  
  
  The Hidden Costs of Getting It Wrong
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mistake&lt;/strong&gt;: Using time-slicing for production inference&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: E-commerce chatbot with strict SLA (&amp;lt; 500ms response)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before time-slicing:
Response time: 400ms ✅
Conversion rate: 12% ✅
Revenue impact: $0

After time-slicing (concurrent load):
Response time: 800ms ❌ (SLA breach)
Conversion rate: 8% ❌ (users bounce)
Revenue impact: -$50,000/month 💀
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: The $720/month GPU savings cost them $50,000/month in revenue. Not worth it.&lt;/p&gt;




&lt;h3&gt;
  
  
  Your ROI Decision Tree
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Question 1: Are your workloads production-facing?
    │
    ├─ NO ──→ Question 2: Do workloads overlap?
    │           │
    │           ├─ NO ──→ ✅ Use time-slicing (50% savings!)
    │           │
    │           └─ YES ──→ ⚠️ Prototype and measure first
    │
    └─ YES ──→ Question 3: Can you tolerate 50% performance loss?
                │
                ├─ NO ──→ ❌ Don't use time-slicing
                │
                └─ YES ──→ 🤔 Are you SURE? Measure twice, deploy once.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Pro Tip&lt;/strong&gt;: Always prototype with time-slicing in staging before production. Measure actual performance impact with YOUR workloads, not theoretical benchmarks.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🚀 Quick Start - Get Running in 30 Minutes
&lt;/h2&gt;

&lt;p&gt;Want to try this yourself? Here's the exact path I followed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites Check ✅
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify you have these tools installed&lt;/span&gt;
kubectl version &lt;span class="nt"&gt;--client&lt;/span&gt;
helm version
eksctl version
aws &lt;span class="nt"&gt;--version&lt;/span&gt;

&lt;span class="c"&gt;# If any are missing, install from:&lt;/span&gt;
&lt;span class="c"&gt;# kubectl: https://kubernetes.io/docs/tasks/tools/&lt;/span&gt;
&lt;span class="c"&gt;# helm: https://helm.sh/docs/intro/install/&lt;/span&gt;
&lt;span class="c"&gt;# eksctl: https://eksctl.io/installation/&lt;/span&gt;
&lt;span class="c"&gt;# aws: https://aws.amazon.com/cli/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 1: Create EKS Cluster (15 minutes)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create cluster configuration file&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;' &amp;gt; cluster-config.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: gpusharing-demo
  region: us-west-2
  version: "1.32"
nodeGroups:
  - name: main
    instanceType: t3.large
    desiredCapacity: 2
    minSize: 2
    maxSize: 4
  - name: gpu
    instanceType: g6e.2xlarge
    desiredCapacity: 1
    minSize: 1
    maxSize: 1
    labels:
      eks-node: gpu
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Create the cluster (takes ~15 minutes)&lt;/span&gt;
eksctl create cluster &lt;span class="nt"&gt;-f&lt;/span&gt; cluster-config.yaml

&lt;span class="c"&gt;# Verify nodes are ready&lt;/span&gt;
kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What you'll see&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                         STATUS   ROLE    AGE
ip-192-168-1-1...            Ready    &amp;lt;none&amp;gt;  5m    # t3.large
ip-192-168-1-2...            Ready    &amp;lt;none&amp;gt;  5m    # t3.large  
ip-192-168-1-3...            Ready    &amp;lt;none&amp;gt;  5m    # g6e.2xlarge (GPU!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 2: Install NVIDIA GPU Operator (5 minutes)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add NVIDIA Helm repository&lt;/span&gt;
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

&lt;span class="c"&gt;# Install GPU Operator (this does ALL the heavy lifting)&lt;/span&gt;
helm &lt;span class="nb"&gt;install &lt;/span&gt;gpu-operator nvidia/gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; nodeSelector.eks-node&lt;span class="o"&gt;=&lt;/span&gt;gpu &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt;

&lt;span class="c"&gt;# Verify installation (all pods should be Running)&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; gpu-operator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Wait for all pods to show &lt;code&gt;1/1 Running&lt;/code&gt;&lt;/strong&gt; (takes 2-3 minutes)&lt;/p&gt;
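&lt;p&gt;Rather than polling &lt;code&gt;kubectl get pods&lt;/code&gt; by hand, you can block until the operator pods report Ready. The namespace matches the install step above; the timeout value is an assumption:&lt;/p&gt;

```shell
# Block until every pod in the gpu-operator namespace is Ready
NS=gpu-operator
TIMEOUT=300s   # assumed; adjust to your environment
kubectl wait --for=condition=Ready pods --all -n "$NS" --timeout="$TIMEOUT" || true
```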




&lt;h3&gt;
  
  
  Step 3: Enable Time-Slicing (3 minutes)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download complete configuration&lt;/span&gt;
wget https://raw.githubusercontent.com/AbrahamArellano/eks-shared-gpu-ai-performance/main/infra/time-slicing-config.yaml

&lt;span class="c"&gt;# Apply time-slicing configuration&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; time-slicing-config.yaml

&lt;span class="c"&gt;# Upgrade GPU operator with time-slicing&lt;/span&gt;
helm upgrade gpu-operator nvidia/gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; devicePlugin.config.name&lt;span class="o"&gt;=&lt;/span&gt;device-plugin-config &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify it worked&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe node &lt;span class="si"&gt;$(&lt;/span&gt;kubectl get nodes &lt;span class="nt"&gt;-l&lt;/span&gt; eks-node&lt;span class="o"&gt;=&lt;/span&gt;gpu &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.items[0].metadata.name}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"nvidia.com/gpu:"&lt;/span&gt;

&lt;span class="c"&gt;# Expected output:&lt;/span&gt;
&lt;span class="c"&gt;#  nvidia.com/gpu:     10  ✅ (not 1!)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 4: Deploy Your Models (5 minutes)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create namespace&lt;/span&gt;
kubectl create namespace llm-testing

&lt;span class="c"&gt;# Clone the complete repository&lt;/span&gt;
git clone https://github.com/AbrahamArellano/eks-shared-gpu-ai-performance.git
&lt;span class="nb"&gt;cd &lt;/span&gt;eks-shared-gpu-ai-performance

&lt;span class="c"&gt;# Deploy both models with memory-optimized configs&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; models/mistral-memory-optimized.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; models/deepseek-memory-optimized.yaml

&lt;span class="c"&gt;# Watch pods start (takes 2-3 minutes to download models)&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing &lt;span class="nt"&gt;-w&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Wait for both pods to show &lt;code&gt;1/1 Running&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 5: Run Performance Tests (2 minutes)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Port forward to access models locally&lt;/span&gt;
kubectl port-forward svc/mistral-7b-service 8081:8080 &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing &amp;amp;
kubectl port-forward svc/deepseek-r1-service 8082:8080 &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing &amp;amp;

&lt;span class="c"&gt;# Run the complete test suite&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;tests
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x load_test.sh
./load_test.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output you'll see&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== Complete GPU Time-Slicing Performance Analysis ===
Testing Phi-3.5-Mini (Individual Baseline)...
  ✓ Test 1: 0.610s
  ✓ Test 2: 0.602s
  ...

Testing DeepSeek-R1 (Individual Baseline)...
  ✓ Test 1: 1.142s
  ...

Testing Both Models Concurrently...
  ✓ Both completed
  ...

Report saved: test_results/performance_report_YYYYMMDD_HHMMSS.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 6: View Your Results
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# View the latest report&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;tests/test_results/performance_report_&lt;span class="k"&gt;*&lt;/span&gt;.txt | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== Phi-3.5-Mini Individual Baseline ===
Average Latency: 0.609s
Throughput: 98.44 req/min

=== Phi-3.5-Mini Concurrent Performance ===
Average Latency: 1.227s
Performance Impact: +101.4% latency 🔴
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  🎉 Success! You've Now:
&lt;/h3&gt;

&lt;p&gt;✅ Created an EKS cluster with GPU support&lt;br&gt;
✅ Enabled NVIDIA time-slicing (10 virtual GPUs)&lt;br&gt;
✅ Deployed two real LLM models&lt;br&gt;
✅ Measured actual performance impact&lt;br&gt;
✅ Generated comprehensive performance reports&lt;/p&gt;


&lt;h3&gt;
  
  
  Cleanup (Don't Forget!)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete the entire cluster to avoid charges&lt;/span&gt;
eksctl delete cluster gpusharing-demo &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2

&lt;span class="c"&gt;# Verify deletion&lt;/span&gt;
aws eks list-clusters &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Important&lt;/strong&gt;: Running this setup costs ~$1.20/hour. Don't forget to delete when done!&lt;/p&gt;
&lt;/blockquote&gt;
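&lt;p&gt;That hourly rate compounds quickly if you forget. A quick back-of-the-envelope, in integer cents:&lt;/p&gt;

```shell
# What ~1.20 USD/hour becomes if the cluster is left running
rate_cents=120                        # ~120 cents per hour
per_day=$(( rate_cents * 24 ))        # cents per day
per_month=$(( rate_cents * 24 * 30 )) # cents per 30-day month
echo "per day:   $(( per_day / 100 )) USD"    # prints: per day:   28 USD
echo "per month: $(( per_month / 100 )) USD"  # prints: per month: 864 USD
```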


&lt;h3&gt;
  
  
  Troubleshooting Common Issues
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Pods stuck in &lt;code&gt;Pending&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check if GPU is detected&lt;/span&gt;
kubectl describe node &amp;lt;gpu-node&amp;gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;nvidia.com/gpu

&lt;span class="c"&gt;# If shows 0, restart device plugin&lt;/span&gt;
kubectl rollout restart daemonset/nvidia-device-plugin-daemonset &lt;span class="nt"&gt;-n&lt;/span&gt; gpu-operator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Models crash with OOM&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check cuda-memory-fraction in deployment&lt;/span&gt;
kubectl describe deployment mistral-7b-baseline &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing

&lt;span class="c"&gt;# Should see: --cuda-memory-fraction 0.4&lt;/span&gt;
&lt;span class="c"&gt;# If not, update the YAML and reapply&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Can't access models via port-forward&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check if services exist&lt;/span&gt;
kubectl get svc &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing

&lt;span class="c"&gt;# Check if pods are ready&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing

&lt;span class="c"&gt;# Restart port-forward&lt;/span&gt;
pkill &lt;span class="nt"&gt;-f&lt;/span&gt; port-forward
kubectl port-forward svc/mistral-7b-service 8081:8080 &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  📚 Next Steps
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Experiment&lt;/strong&gt;: Try different models from HuggingFace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize&lt;/strong&gt;: Tune memory fractions for your workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor&lt;/strong&gt;: Set up CloudWatch for GPU metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale&lt;/strong&gt;: Add more GPU nodes if needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Complete implementation guide&lt;/strong&gt;: &lt;a href="https://github.com/AbrahamArellano/eks-shared-gpu-ai-performance" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 5 Things I Wish I Knew Before Starting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. "Pre-installed Drivers" Doesn't Mean What You Think
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What I assumed&lt;/strong&gt;: g6e instances come with NVIDIA drivers like p3 instances&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reality check&lt;/strong&gt;: Spent 2 hours debugging why pods couldn't see the GPU&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson&lt;/strong&gt;: Always use GPU Operator for modern EKS setups. It's not optional—it's essential.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time saved for you&lt;/strong&gt;: 2 hours of confusion 😅&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Memory Limits Are Not Suggestions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What I did first&lt;/strong&gt;: Deployed models with default settings&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened&lt;/strong&gt;: Both models tried to grab 80% of GPU memory each&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The crash&lt;/strong&gt;: &lt;code&gt;CUDA out of memory&lt;/code&gt; errors everywhere&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: &lt;code&gt;cuda-memory-fraction: 0.4&lt;/code&gt; is your best friend&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: In GPU sharing, aggressive memory limits aren't pessimistic—they're realistic.&lt;/p&gt;
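&lt;p&gt;As a sketch, that flag lives in the model container's args in the deployment manifest. The exact names below are assumptions patterned on the repo's manifests, not copied from them; adjust to your own deployment:&lt;/p&gt;

```shell
# Sketch of the relevant container args in a model-server deployment manifest.
# The 0.4 fraction caps each model at 40% of GPU memory so two can coexist.
printf '%s\n' \
  '        args:' \
  '          - "--model-id=YOUR_MODEL"        # placeholder' \
  '          - "--cuda-memory-fraction"' \
  '          - "0.4"' \
| tee args-fragment.yaml
```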




&lt;h3&gt;
  
  
  3. Time-Slicing ≠ Magic Performance Multiplier
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Marketing says&lt;/strong&gt;: "Share one GPU across multiple workloads!"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reality says&lt;/strong&gt;: "Share one GPU across multiple workloads... but not at full speed concurrently"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The truth&lt;/strong&gt;: Time-slicing provides isolation, not performance multiplication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mental model&lt;/strong&gt;: Think of it like time-sharing a CPU, not adding more cores.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Test Sequential Before Assuming Concurrent
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;My mistake&lt;/strong&gt;: Assumed concurrent workloads would work "well enough"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The numbers&lt;/strong&gt;: 50-100% performance degradation&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The learning&lt;/strong&gt;: Always measure YOUR workloads with YOUR patterns&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pro tip&lt;/strong&gt;: Use Kubernetes scaling to isolate test scenarios cleanly&lt;/p&gt;
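&lt;p&gt;Concretely, that isolation can be as simple as scaling one deployment to zero while you baseline the other. The deployment name below is an assumption patterned on the deploy step; substitute your own:&lt;/p&gt;

```shell
# Scale one model to zero for a clean single-model baseline,
# then restore it before the concurrent run (names are assumptions).
NS=llm-testing
kubectl scale deployment deepseek-r1 --replicas=0 -n "$NS" || true
# ...run the single-model baseline tests here...
kubectl scale deployment deepseek-r1 --replicas=1 -n "$NS" || true
```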




&lt;h3&gt;
  
  
  5. Production ≠ Development (Obvious, But...)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Development&lt;/strong&gt;: Time-slicing is perfect&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost savings? Yes ✅&lt;/li&gt;
&lt;li&gt;Performance trade-offs? Acceptable ✅
&lt;/li&gt;
&lt;li&gt;Stability? Excellent ✅&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Production&lt;/strong&gt;: Time-slicing is risky&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SLA requirements? Violated ❌&lt;/li&gt;
&lt;li&gt;Unpredictable performance? Dangerous ❌&lt;/li&gt;
&lt;li&gt;Customer experience? Compromised ❌&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The rule&lt;/strong&gt;: If it touches paying customers, provision separate GPUs.&lt;/p&gt;
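&lt;p&gt;In practice that means pinning production inference to its own GPU node and requesting the whole card. A sketch of the relevant pod-spec fragment; the &lt;code&gt;gpu-prod&lt;/code&gt; label is an assumption (the cluster config above uses &lt;code&gt;eks-node: gpu&lt;/code&gt;):&lt;/p&gt;

```shell
# Pod-spec fragment: dedicate a node and a full GPU to production inference
printf '%s\n' \
  'nodeSelector:' \
  '  eks-node: gpu-prod      # assumed label for a dedicated node group' \
  'resources:' \
  '  limits:' \
  '    nvidia.com/gpu: 1     # a whole physical GPU, no time-slicing' \
| tee dedicated-gpu-fragment.yaml
```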




&lt;h2&gt;
  
  
  🎬 The Verdict - Should You Use Time-Slicing?
&lt;/h2&gt;

&lt;p&gt;After a week of testing, thousands of inference requests, and countless hours of analysis, here's my honest take:&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ Time-Slicing Is Brilliant For:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Development environments&lt;/strong&gt; where cost matters more than peak performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential workloads&lt;/strong&gt; with natural time-shifting patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A/B testing&lt;/strong&gt; where models don't compete simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;POC/Demo environments&lt;/strong&gt; with flexible requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning and experimentation&lt;/strong&gt; without breaking the bank&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ROI&lt;/strong&gt;: 50% cost savings at ~99% of baseline performance (for sequential workloads) ✅&lt;/p&gt;




&lt;h3&gt;
  
  
  ❌ Time-Slicing Is Terrible For:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production inference&lt;/strong&gt; serving customer traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrent workloads&lt;/strong&gt; with strict SLA requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency-sensitive applications&lt;/strong&gt; where milliseconds matter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Revenue-generating systems&lt;/strong&gt; where performance = money&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-scaling workloads&lt;/strong&gt; with unpredictable patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Risk&lt;/strong&gt;: 50-100% performance degradation = unhappy customers ❌&lt;/p&gt;




&lt;h3&gt;
  
  
  The Technology Itself? 🏆 A+ Engineering
&lt;/h3&gt;

&lt;p&gt;NVIDIA absolutely crushed the implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only ~1% overhead from the time-slicing mechanism itself&lt;/li&gt;
&lt;li&gt;Rock-solid stability (zero crashes in extensive testing)&lt;/li&gt;
&lt;li&gt;Clean Kubernetes integration&lt;/li&gt;
&lt;li&gt;Production-grade reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The performance degradation comes from physics, not technology.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can't cheat the fundamental limitations of shared resources. Time-slicing doesn't create more GPU compute—it manages access to existing compute.&lt;/p&gt;
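&lt;p&gt;A toy round-robin model makes the point: with N workloads sharing one GPU, each sees roughly N times its solo latency (ignoring the ~1% slicing overhead). Using the baseline number from the report above:&lt;/p&gt;

```shell
# Toy model: expected latency when two workloads round-robin one GPU
solo_ms=609                             # solo baseline from the report above
workloads=2
expected_ms=$(( solo_ms * workloads ))
echo "expected concurrent latency: ${expected_ms} ms (measured above: 1227 ms)"
```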




&lt;h2&gt;
  
  
  🚀 Your Next Steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  If You're Convinced (Dev/Test Use Case):
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;⭐ &lt;strong&gt;Star the repo&lt;/strong&gt;: &lt;a href="https://github.com/AbrahamArellano/eks-shared-gpu-ai-performance" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔧 &lt;strong&gt;Follow the Quick Start&lt;/strong&gt;: 30 minutes to working setup&lt;/li&gt;
&lt;li&gt;📊 &lt;strong&gt;Run your own tests&lt;/strong&gt;: Measure YOUR workloads&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;Calculate YOUR ROI&lt;/strong&gt;: Use the decision framework&lt;/li&gt;
&lt;li&gt;🎉 &lt;strong&gt;Deploy and save money&lt;/strong&gt;: Start with dev environments&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  If You're Skeptical (Production Use Case):
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;✅ &lt;strong&gt;Provision separate GPUs&lt;/strong&gt;: Safety first&lt;/li&gt;
&lt;li&gt;🧪 &lt;strong&gt;Test time-slicing in staging&lt;/strong&gt;: Validate with real traffic patterns&lt;/li&gt;
&lt;li&gt;📈 &lt;strong&gt;Monitor overlap patterns&lt;/strong&gt;: Measure actual concurrent load&lt;/li&gt;
&lt;li&gt;🤔 &lt;strong&gt;Reconsider for off-peak&lt;/strong&gt;: Maybe time-slice during low-traffic hours?&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  If You're Curious (Learning Mode):
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;📖 &lt;strong&gt;Read the full guide&lt;/strong&gt;: &lt;a href="https://myitbasics.com/gpu-sharing-amazon-eks" rel="noopener noreferrer"&gt;Complete blog post&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🎓 &lt;strong&gt;Understand the concepts&lt;/strong&gt;: Time-slicing vs MIG vs MPS&lt;/li&gt;
&lt;li&gt;🛠️ &lt;strong&gt;Experiment safely&lt;/strong&gt;: Use the provided test framework&lt;/li&gt;
&lt;li&gt;💬 &lt;strong&gt;Share your findings&lt;/strong&gt;: Comment below with your results&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  📚 Complete Resource Library
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Code &amp;amp; Configuration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;📦 &lt;strong&gt;GitHub Repository&lt;/strong&gt;: &lt;a href="https://github.com/AbrahamArellano/eks-shared-gpu-ai-performance" rel="noopener noreferrer"&gt;eks-shared-gpu-ai-performance&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Complete Kubernetes manifests&lt;/li&gt;
&lt;li&gt;Automated testing framework&lt;/li&gt;
&lt;li&gt;Performance analysis scripts&lt;/li&gt;
&lt;li&gt;Troubleshooting guides&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Deep Dive Content
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;📝 &lt;strong&gt;Full Technical Analysis&lt;/strong&gt;: &lt;a href="https://myitbasics.com/gpu-sharing-amazon-eks" rel="noopener noreferrer"&gt;MyITBasics.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🏗️ &lt;strong&gt;Architecture Patterns&lt;/strong&gt;: Complete infrastructure setup guide&lt;/li&gt;
&lt;li&gt;🔍 &lt;strong&gt;Performance Analysis&lt;/strong&gt;: Detailed metrics and methodology&lt;/li&gt;
&lt;li&gt;💡 &lt;strong&gt;Best Practices&lt;/strong&gt;: Production-ready recommendations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  💬 Let's Discuss - Your Turn!
&lt;/h2&gt;

&lt;p&gt;I've shared my findings. Now I want to hear yours:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💭 Questions for the community:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have you used GPU time-slicing in production? What was your experience?&lt;/li&gt;
&lt;li&gt;What workload patterns are you trying to optimize?&lt;/li&gt;
&lt;li&gt;Any other GPU sharing strategies you've found effective?&lt;/li&gt;
&lt;li&gt;Found bugs or improvements in my testing methodology?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🐛 Found an issue in the code?&lt;/strong&gt;&lt;br&gt;
Open an issue or PR on &lt;a href="https://github.com/AbrahamArellano/eks-shared-gpu-ai-performance" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💡 Want to discuss your specific use case?&lt;/strong&gt;&lt;br&gt;
Drop a comment below—I read and respond to all of them!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📧 Need consulting help?&lt;/strong&gt;&lt;br&gt;
Visit &lt;a href="https://myitbasics.com" rel="noopener noreferrer"&gt;MyITBasics.com&lt;/a&gt; for architecture guidance&lt;/p&gt;




&lt;h2&gt;
  
  
  🙏 Thanks for Reading!
&lt;/h2&gt;

&lt;p&gt;If you found this helpful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⭐ &lt;strong&gt;Star the GitHub repo&lt;/strong&gt; to bookmark for later&lt;/li&gt;
&lt;li&gt;💬 &lt;strong&gt;Comment below&lt;/strong&gt; with your experiences or questions&lt;/li&gt;
&lt;li&gt;🔄 &lt;strong&gt;Share this post&lt;/strong&gt; with your team&lt;/li&gt;
&lt;li&gt;👤 &lt;strong&gt;Follow me&lt;/strong&gt; for more deep-dives into GPU architecture, AI infrastructure, and cloud-native engineering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Coming up next&lt;/strong&gt;: Multi-GPU strategies, MIG vs time-slicing comparison, and cost optimization techniques for production AI workloads.&lt;/p&gt;

&lt;p&gt;Stay tuned! 🚀&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with curiosity, tested with rigor, shared with the community.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Abraham Arellano&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Cloud Architect &amp;amp; AI Infrastructure Engineer&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://myitbasics.com" rel="noopener noreferrer"&gt;MyITBasics.com&lt;/a&gt; | &lt;a href="https://github.com/AbrahamArellano" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>gpu</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Clinical AI Engineering: Building Production-Ready Healthcare NLP Infrastructure</title>
      <dc:creator>Abraham Arellano Tavara</dc:creator>
      <pubDate>Sun, 14 Sep 2025 20:11:24 +0000</pubDate>
      <link>https://dev.to/abraham_arellanotavara_7/clinical-ai-engineering-building-production-ready-healthcare-nlp-infrastructure-2g54</link>
      <guid>https://dev.to/abraham_arellanotavara_7/clinical-ai-engineering-building-production-ready-healthcare-nlp-infrastructure-2g54</guid>
      <description>&lt;p&gt;Ever wondered what happens when you try to reproduce a healthcare AI research paper? We discovered that you end up building significantly more infrastructure than initially expected! &lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge: Research vs. Reality
&lt;/h2&gt;

&lt;p&gt;My colleague &lt;a href="https://github.com/umeshkumar235" rel="noopener noreferrer"&gt;Umesh Kumar&lt;/a&gt; and I set out to reproduce &lt;a href="https://arxiv.org/abs/2302.08091" rel="noopener noreferrer"&gt;"Do We Still Need Clinical Language Models?"&lt;/a&gt; for our UIUC Master's course Deep Learning for Healthcare. What started as a simple validation project turned into a deep dive into &lt;strong&gt;production-ready healthcare NLP infrastructure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The core question seemed straightforward:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Do specialized clinical models (BioClinicalBERT) still outperform general models (RoBERTa, T5) on medical NLP tasks?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But implementing a system to reliably answer this across &lt;strong&gt;3 clinical tasks&lt;/strong&gt;, &lt;strong&gt;multiple model architectures&lt;/strong&gt;, and &lt;strong&gt;25,000+ text samples&lt;/strong&gt; revealed the massive gap between research papers and production systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Built 🏗️
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Clinical NLP Battleground
&lt;/h3&gt;

&lt;p&gt;We evaluated models across three real-world healthcare tasks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Challenge&lt;/th&gt;
&lt;th&gt;Real-World Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MedNLI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Medical reasoning&lt;/td&gt;
&lt;td&gt;Clinical decision support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RadQA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Information extraction&lt;/td&gt;
&lt;td&gt;Finding answers in medical records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CLIP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-label classification&lt;/td&gt;
&lt;td&gt;Routing patient communications&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fer1vigwuesf6hy9b454g.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fer1vigwuesf6hy9b454g.webp" alt="Clinical NLP Data Pipeline Architecture" width="800" height="698"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Infrastructure Reality Check
&lt;/h3&gt;

&lt;p&gt;Here's what the papers don't tell you about building clinical NLP systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PhysioNet credentialing&lt;/strong&gt; for each dataset (regulatory compliance is real!)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory management&lt;/strong&gt; across different model architectures &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic batch sizing&lt;/strong&gt; to prevent OOM crashes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixed precision training&lt;/strong&gt; on Tesla T4 GPUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration management&lt;/strong&gt; for systematic hyperparameter exploration&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Findings That Matter 📊
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Fine-Tuning Still Wins (By A Lot)
&lt;/h3&gt;

&lt;p&gt;BioClinicalBERT Performance:&lt;br&gt;
├── Fine-tuned: 0.793 accuracy (MedNLI)&lt;br&gt;
└── In-Context Learning: 0.374 accuracy&lt;/p&gt;

&lt;p&gt;The hype around prompt-based learning? &lt;strong&gt;Our findings suggest it needs more development for clinical tasks.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Task-Specific Model Selection
&lt;/h3&gt;

&lt;p&gt;Models that performed excellently on medical reasoning didn't automatically excel at information extraction. &lt;strong&gt;One size doesn't fit all&lt;/strong&gt; in healthcare AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Production Efficiency Insights
&lt;/h3&gt;

&lt;p&gt;Clinical models like BioClinicalBERT needed &lt;strong&gt;fewer training epochs&lt;/strong&gt; to reach optimal performance compared to adapted general models. This translates to real cost savings in production!&lt;/p&gt;

&lt;h2&gt;
  
  
  The Engineering Deep Dive 🔧
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Modular Architecture That Actually Works
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Clean separation of concerns
&lt;/span&gt;&lt;span class="n"&gt;clinical_tasks&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;mednli&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;          &lt;span class="c1"&gt;# Medical reasoning
&lt;/span&gt;&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;radqa&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;           &lt;span class="c1"&gt;# Question answering  
&lt;/span&gt;&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;clip&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;            &lt;span class="c1"&gt;# Multi-label classification
&lt;/span&gt;&lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;          &lt;span class="c1"&gt;# Common infrastructure
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuration-Driven Everything
&lt;/h3&gt;

&lt;p&gt;YAML configs that handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model-specific parameters&lt;/li&gt;
&lt;li&gt;Task-specific preprocessing&lt;/li&gt;
&lt;li&gt;Environment-aware resource management&lt;/li&gt;
&lt;li&gt;Automatic batch size adjustment&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Error Handling for the Real World
&lt;/h3&gt;

&lt;p&gt;Because healthcare AI can't just crash when it hits an edge case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Graceful OOM recovery&lt;/li&gt;
&lt;li&gt;Comprehensive logging&lt;/li&gt;
&lt;li&gt;Resource monitoring&lt;/li&gt;
&lt;li&gt;Validation safeguards&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Matters for Healthcare AI 🎯
&lt;/h2&gt;

&lt;p&gt;This isn't just another research reproduction. We're talking about:&lt;br&gt;
✅ Reproducible research infrastructure that others can build on&lt;br&gt;
✅ Production-ready patterns for healthcare AI teams&lt;br&gt;
✅ Open-source implementation advancing the community&lt;br&gt;
✅ Regulatory-compliant data handling approaches&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Specialized clinical models still matter. General models aren't ready to replace domain-specific healthcare AI, especially when accuracy can impact patient care.&lt;/p&gt;

&lt;p&gt;But more importantly: the gap between research and production in healthcare AI is huge. Building bridges requires thinking about infrastructure, compliance, efficiency, and maintainability from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Want the Full Technical Deep Dive?
&lt;/h2&gt;

&lt;p&gt;I've written a comprehensive breakdown covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detailed architecture decisions&lt;/li&gt;
&lt;li&gt;Performance benchmarking across all models&lt;/li&gt;
&lt;li&gt;Computational efficiency analysis
&lt;/li&gt;
&lt;li&gt;Production deployment guidance&lt;/li&gt;
&lt;li&gt;Complete open-source implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;👉 &lt;a href="https://myitbasics.com/clinical-ai-engineering-building-production-ready-healthcare-nlp-infrastructure/" rel="noopener noreferrer"&gt;Read the full article: Clinical AI Engineering - Building Production-Ready Healthcare NLP Infrastructure&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔗 &lt;a href="https://github.com/AbrahamArellano/UIUC-DL4H-Clinical-LLM-Evaluation" rel="noopener noreferrer"&gt;Check out the complete implementation on GitHub&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What's your experience with healthcare AI in production? Have you faced similar challenges bridging research and deployment? Drop your thoughts in the comments! 👇&lt;/p&gt;

&lt;h1&gt;
  
  
  #HealthcareAI #ClinicalNLP #MachineLearning #ProductionAI
&lt;/h1&gt;

</description>
      <category>machinelearning</category>
      <category>healthcare</category>
      <category>python</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>How to build a Multi-Agent Financial Intelligence with AWS and SAP</title>
      <dc:creator>Abraham Arellano Tavara</dc:creator>
      <pubDate>Fri, 24 Jan 2025 19:31:43 +0000</pubDate>
      <link>https://dev.to/abraham_arellanotavara_7/how-to-build-a-multi-agent-financial-intelligence-with-aws-and-sap-1ic9</link>
      <guid>https://dev.to/abraham_arellanotavara_7/how-to-build-a-multi-agent-financial-intelligence-with-aws-and-sap-1ic9</guid>
      <description>&lt;p&gt;Three days. That's what it took to build a sophisticated financial intelligence demo orchestrating three specialized MCP servers using AWS Strands and SAP Generative AI Hub. The result? &lt;a href="https://myitbasics.com/how-build-agent-orchestration-ai-systems-aws-sap/" rel="noopener noreferrer"&gt;A complete demo for SAP TechEd&lt;/a&gt; to showcase a &lt;strong&gt;30% potential reduction in financial analysis time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Not because building agentic systems is trivial, but because integrating AWS and SAP's generative AI stacks with the right architectural decisions makes complex demo scenarios tractable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge: Demonstrating Enterprise AI Integration
&lt;/h2&gt;

&lt;p&gt;Most AI agent tutorials showcase simple, single-tool agents. But demonstrating enterprise-grade AWS and SAP integration requires more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multiple data sources&lt;/strong&gt; requiring specialized processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-system coordination&lt;/strong&gt; without hardcoded workflows
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-grade patterns&lt;/strong&gt; and governance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observable, maintainable&lt;/strong&gt; architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When creating our Devtoberfest session on building multi-tool research agents, we wanted to demonstrate &lt;em&gt;real&lt;/em&gt; enterprise integration patterns—showcasing how SAP's Generative AI Hub connects with AWS Bedrock through the AWS Strands SDK.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Foundation: Research Agent with AWS Strands
&lt;/h2&gt;

&lt;p&gt;We started with a deep research agent demo using the &lt;a href="https://aws.amazon.com/blogs/opensource/introducing-strands-agents-an-open-source-ai-agents-sdk/" rel="noopener noreferrer"&gt;AWS Strands Agents SDK&lt;/a&gt; and Tavily API for web intelligence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time_range&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Search the web and return ranked results&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tavily_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;time_range&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;time_range&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;format_search_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;  
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;web_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract full page content from URLs&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tavily_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;web_crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Crawl websites and discover nested links&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tavily_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;instructions&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create the agent
&lt;/span&gt;&lt;span class="n"&gt;deep_researcher_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bedrock_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;RESEARCH_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;web_extract&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;web_crawl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format_research_response&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What makes AWS Strands different?&lt;/strong&gt; It's model-driven, not workflow-driven. You provide tools and a system prompt—the LLM handles planning, reasoning, and orchestration. This shifts complexity from code into the model's weights.&lt;/p&gt;

&lt;h3&gt;
  
  
  Built-in Production Observability
&lt;/h3&gt;

&lt;p&gt;AWS Strands automatically tracks critical metrics using OpenTelemetry:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric Category&lt;/th&gt;
&lt;th&gt;What It Tracks&lt;/th&gt;
&lt;th&gt;Demo Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Token Usage&lt;/td&gt;
&lt;td&gt;Input/output/total tokens&lt;/td&gt;
&lt;td&gt;Cost estimation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Latency and execution times&lt;/td&gt;
&lt;td&gt;Benchmark tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool Usage&lt;/td&gt;
&lt;td&gt;Call counts and success rates&lt;/td&gt;
&lt;td&gt;Reliability assessment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event Loops&lt;/td&gt;
&lt;td&gt;Reasoning cycles&lt;/td&gt;
&lt;td&gt;Efficiency analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This integrates seamlessly with AWS X-Ray and CloudWatch for enterprise observability patterns.&lt;/p&gt;
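
&lt;p&gt;The token counts Strands records map directly onto cost estimation. The sketch below is plain arithmetic over a usage dict, not Strands API code—the key names and per-1K-token prices are illustrative assumptions, so substitute your model's actual pricing.&lt;/p&gt;

```python
def estimate_cost(usage, price_per_1k_input, price_per_1k_output):
    """Estimate model spend in USD from tracked token counts.

    usage is a dict like {"inputTokens": ..., "outputTokens": ...};
    key names and prices here are illustrative assumptions.
    """
    return (usage["inputTokens"] / 1000 * price_per_1k_input
            + usage["outputTokens"] / 1000 * price_per_1k_output)

# e.g. a request that consumed 12K input / 3K output tokens:
cost = estimate_cost({"inputTokens": 12000, "outputTokens": 3000},
                     price_per_1k_input=0.003, price_per_1k_output=0.015)
```

&lt;p&gt;Feeding these per-request estimates into CloudWatch custom metrics gives you a running cost dashboard alongside latency and tool-usage data.&lt;/p&gt;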

&lt;h2&gt;
  
  
  The Innovation: Multi-Server Financial Intelligence Demo
&lt;/h2&gt;

&lt;p&gt;Our demo showcases financial analysis requiring coordination of multiple specialized systems. That's where &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; becomes critical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding MCP: The USB-C for AI
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.anthropic.com/news/model-context-protocol" rel="noopener noreferrer"&gt;Anthropic open-sourced MCP in November 2024&lt;/a&gt; to solve the "N×M problem"—every model needing connectors to every data source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP provides a universal standard&lt;/strong&gt;: One protocol, any model, any data source. Major providers including OpenAI and Google DeepMind adopted it within months.&lt;/p&gt;

&lt;p&gt;The protocol uses JSON-RPC 2.0 with three primitives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt;: Executable functions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt;: Structured data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts&lt;/strong&gt;: Instruction templates&lt;/li&gt;
&lt;/ul&gt;
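
&lt;p&gt;On the wire, this is ordinary JSON-RPC 2.0. A &lt;code&gt;tools/call&lt;/code&gt; exchange might look like the pair below—the tool name, arguments, and quote text are illustrative placeholders, not part of the spec:&lt;/p&gt;

```python
import json

# Illustrative tools/call request an MCP client sends (JSON-RPC 2.0).
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_stock_quote",        # a tool advertised by the server
        "arguments": {"symbol": "SAP"},
    },
}

# A matching success response from the server, correlated by id.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "SAP: 234.10 EUR"}]},
}

wire = json.dumps(request)  # what actually travels over HTTP or stdio
```

&lt;p&gt;Because every server speaks this same envelope, the client needs exactly one connector regardless of how many data sources sit behind it—which is the whole point of the "USB-C" analogy.&lt;/p&gt;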

&lt;h2&gt;
  
  
  Architecture Overview: How Everything Connects
&lt;/h2&gt;

&lt;p&gt;Here's the complete system architecture showing how AWS Strands orchestrates multiple MCP servers through SAP GenAI Hub:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fok7r0lejwwpnkuwsmuem.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fok7r0lejwwpnkuwsmuem.webp" alt="Financial Intelligence MCP Agent Architecture" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Walking Through the Architecture (4 Key Stages)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Enterprise User Request&lt;/strong&gt;&lt;br&gt;
Enterprise users interact with the AWS Strands Agent through SAP GenAI Hub, which provides the secure gateway to Anthropic's Claude models via Amazon Bedrock.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: AI Agent Orchestration&lt;/strong&gt;&lt;br&gt;
The AWS Strands SDK handles multi-tool coordination. The MCP Client within Strands manages all communications with downstream servers, reasoning about which tools to invoke and when.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3: MCP Protocol Communications&lt;/strong&gt;&lt;br&gt;
The MCP Session Manager maintains persistent connections to all three specialized servers, aggregating 10+ financial tools into a unified interface. This eliminates connection overhead and provides seamless cross-server coordination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 4: Orchestrated Results&lt;/strong&gt;&lt;br&gt;
The system synthesizes data from all servers to produce comprehensive outputs: investment analysis reports, risk assessment matrices, sentiment analysis, and cross-server coordination reports.&lt;/p&gt;
&lt;h3&gt;
  
  
  Three Specialized MCP Servers (Demo Architecture)
&lt;/h3&gt;

&lt;p&gt;We built three demo servers, each handling distinct financial intelligence capabilities:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Server&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Financial Data&lt;/td&gt;
&lt;td&gt;8001&lt;/td&gt;
&lt;td&gt;FastAPI (Manual)&lt;/td&gt;
&lt;td&gt;Real-time market data&lt;/td&gt;
&lt;td&gt;Stock quotes, fundamentals, health scoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document Analysis&lt;/td&gt;
&lt;td&gt;8002&lt;/td&gt;
&lt;td&gt;FastMCP Framework&lt;/td&gt;
&lt;td&gt;Sentiment analysis&lt;/td&gt;
&lt;td&gt;PDF parsing, report analysis, metric extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analytics&lt;/td&gt;
&lt;td&gt;8003&lt;/td&gt;
&lt;td&gt;FastMCP Framework&lt;/td&gt;
&lt;td&gt;Advanced analytics&lt;/td&gt;
&lt;td&gt;Comparison charts, risk assessment, trend analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;
  
  
  Why Two Approaches?
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;FastAPI (Manual Implementation)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full control over JSON-RPC protocol&lt;/li&gt;
&lt;li&gt;~150-200 lines for basic server&lt;/li&gt;
&lt;li&gt;Deep MCP understanding required&lt;/li&gt;
&lt;li&gt;Best for learning fundamentals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;FastMCP Framework&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic protocol handling&lt;/li&gt;
&lt;li&gt;~50-75 lines for basic server
&lt;/li&gt;
&lt;li&gt;3-4x faster development&lt;/li&gt;
&lt;li&gt;Production-ready features built-in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both approaches demonstrate viable patterns. Your choice depends on control vs. velocity requirements.&lt;/p&gt;
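
&lt;p&gt;The "manual" route boils down to writing the JSON-RPC dispatcher yourself. Here is a pure-Python sketch of that core (in the demo it would sit behind a FastAPI POST endpoint; the tool name and registry are placeholders)—exactly the plumbing FastMCP generates for you:&lt;/p&gt;

```python
# Hypothetical tool registry; a real server would register actual handlers.
TOOLS = {
    "get_stock_quote": lambda args: {"symbol": args["symbol"], "price": 234.10},
}

def handle_jsonrpc(msg):
    """Dispatch one JSON-RPC 2.0 message to a registered tool.

    Minimal sketch of what a manual MCP endpoint does by hand:
    method routing, tool lookup, and standard error codes.
    """
    if msg.get("method") != "tools/call":
        return {"jsonrpc": "2.0", "id": msg.get("id"),
                "error": {"code": -32601, "message": "method not found"}}
    params = msg.get("params", {})
    tool = TOOLS.get(params.get("name"))
    if tool is None:
        return {"jsonrpc": "2.0", "id": msg.get("id"),
                "error": {"code": -32602, "message": "unknown tool"}}
    return {"jsonrpc": "2.0", "id": msg.get("id"),
            "result": tool(params.get("arguments", {}))}
```

&lt;p&gt;Multiply this by schema generation, error handling, and transport management, and the ~150-200 line count for a manual server is easy to reach.&lt;/p&gt;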

&lt;p&gt;Here's a FastMCP server example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;

&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document-analysis-server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_financial_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Analyze financial text for sentiment and insights&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;positive_keywords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;growth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;profit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;strong&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;improved&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;negative_keywords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;decline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;loss&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weak&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reduced&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Sentiment analysis logic
&lt;/span&gt;    &lt;span class="n"&gt;sentiment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_sentiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;positive_keywords&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;negative_keywords&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key_findings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;extract_findings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;identified_risks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;identify_risks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8002&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Session Manager Pattern
&lt;/h2&gt;

&lt;p&gt;Managing connections to three MCP servers in our demo called for persistent sessions without juggling nested context managers on every call—as shown in Stage 3 of the architecture diagram.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution&lt;/strong&gt;: A custom &lt;code&gt;MCPSessionManager&lt;/code&gt; using Python's &lt;code&gt;ExitStack&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;util.mcp_session_manager&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MCPSessionManager&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize manager
&lt;/span&gt;&lt;span class="n"&gt;mcp_manager&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MCPSessionManager&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Establish persistent connections (Stage 3)
&lt;/span&gt;&lt;span class="n"&gt;mcp_manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_sessions&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;financial_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:8001/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:8002/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analytics_reporting&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:8003/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Aggregate tools from all servers
&lt;/span&gt;&lt;span class="n"&gt;all_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mcp_manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_all_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Create unified agent (Stage 2)
&lt;/span&gt;&lt;span class="n"&gt;financial_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sap_genai_hub_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;all_tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;financial_expert_prompt&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern eliminates boilerplate while demonstrating enterprise requirements: connection pooling, error recovery, and audit logging.&lt;/p&gt;
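
&lt;p&gt;The idea behind &lt;code&gt;MCPSessionManager&lt;/code&gt; can be sketched with &lt;code&gt;contextlib.ExitStack&lt;/code&gt;: enter each client's context manager once, keep the sessions alive for the agent's lifetime, and unwind them all together. The client objects below are stand-ins, not the Strands MCP client API:&lt;/p&gt;

```python
from contextlib import ExitStack

class SessionManager:
    """Keep several context-managed client sessions open at once.

    Sketch of the ExitStack pattern only; real code would enter MCP
    client instances here and aggregate their listed tools.
    """
    def __init__(self):
        self._stack = ExitStack()
        self.sessions = {}

    def start_sessions(self, clients):
        # Enter each client's context manager; ExitStack remembers every
        # exit callback so close() can unwind them in reverse order.
        for name, client in clients.items():
            self.sessions[name] = self._stack.enter_context(client)

    def close(self):
        self._stack.close()  # closes all sessions, last-opened first
```

&lt;p&gt;The payoff of &lt;code&gt;ExitStack&lt;/code&gt; is that cleanup stays correct even if one session fails mid-setup: every context entered so far is still exited.&lt;/p&gt;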

&lt;h2&gt;
  
  
  Demo Results: AWS + SAP Integration in Action
&lt;/h2&gt;

&lt;p&gt;Following the architecture flow from Stage 1 → Stage 4, when a user asks: &lt;em&gt;"Provide comprehensive investment analysis for SAP"&lt;/em&gt;, the agent automatically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fetches stock data&lt;/strong&gt; (Financial Server) → Current metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyzes sentiment&lt;/strong&gt; (Document Server) → Report assessment
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calculates risk&lt;/strong&gt; (Analytics Server) → Investment scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesizes report&lt;/strong&gt; (Stage 4 Outputs) → Executive-ready recommendation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;No explicit orchestration&lt;/strong&gt;. No hardcoded workflows. The agent reasons about tool usage and coordinates automatically across all three MCP servers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Demo Performance Metrics
&lt;/h3&gt;

&lt;p&gt;From our Devtoberfest proof-of-concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;30% potential reduction&lt;/strong&gt; in comprehensive financial analysis time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10-20% efficiency gains&lt;/strong&gt; demonstrated for individual stock analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic metrics tracking&lt;/strong&gt; via AWS Strands observability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-ready monitoring patterns&lt;/strong&gt; through CloudWatch integration&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Enterprise Security: SAP GenAI Hub Integration (Stage 1)
&lt;/h2&gt;

&lt;p&gt;The demo showcases how &lt;a href="https://aws.amazon.com/blogs/awsforsap/power-your-business-with-secure-and-scalable-generative-ai-services-from-aws-and-sap/" rel="noopener noreferrer"&gt;SAP Generative AI Hub&lt;/a&gt; provides critical governance when integrating with AWS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Content filtering&lt;/strong&gt; on inputs and outputs&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Data masking&lt;/strong&gt; for sensitive information
&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Centralized policies&lt;/strong&gt; across SAP ecosystem&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Compliance support&lt;/strong&gt; for regulatory requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Hub orchestrates access to Amazon Bedrock models (Claude 3.5, Titan) while maintaining security boundaries essential for enterprise deployments—all happening at Stage 1 of our architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use This Architecture Pattern
&lt;/h2&gt;

&lt;p&gt;This demo architecture excels when you need to:&lt;/p&gt;

&lt;p&gt;✅ Coordinate 3+ specialized systems or data sources&lt;br&gt;
✅ Prototype rapidly with a clear path to production&lt;br&gt;
✅ Favor model-driven flexibility over explicit workflows&lt;br&gt;
✅ Adopt standard protocols (MCP) for future extensibility&lt;br&gt;
✅ Get built-in observability for production monitoring&lt;br&gt;
✅ Enforce enterprise security and governance&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next: From Demo to Production
&lt;/h2&gt;

&lt;p&gt;The demo system showcases integration possibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SAP Integration&lt;/strong&gt;: Connect MCP servers to SAP business processes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Tenant Deployments&lt;/strong&gt;: Shared MCP infrastructure for multiple organizations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Architectures&lt;/strong&gt;: On-premises SAP + cloud-native AI services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain-Specific Agents&lt;/strong&gt;: Specialized agents for procurement, finance, HR&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Both notebooks are available in our &lt;a href="https://github.com/AbrahamArellano/sample-sap-genai-hub-bedrock" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;. The progression from research agent to multi-server orchestration provides a practical learning path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start Simple&lt;/strong&gt;: Build single-agent systems first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learn MCP&lt;/strong&gt;: Understand the protocol fundamentals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale Thoughtfully&lt;/strong&gt;: Use frameworks and patterns for production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure by Design&lt;/strong&gt;: Implement proper auth, audit, monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observe Everything&lt;/strong&gt;: Leverage built-in observability&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Full Technical Deep-Dive + Video Tutorial
&lt;/h2&gt;

&lt;p&gt;Want the complete implementation with detailed architecture walkthroughs, video tutorial, and production deployment guidance?&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://myitbasics.com/how-build-agent-orchestration-ai-systems-aws-sap/" rel="noopener noreferrer"&gt;Watch the video tutorial and read the full guide on MyITBasics&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step-by-step video tutorial&lt;/strong&gt; walking through the entire demo&lt;/li&gt;
&lt;li&gt;Detailed MCP protocol implementation&lt;/li&gt;
&lt;li&gt;AWS and SAP integration patterns&lt;/li&gt;
&lt;li&gt;High-resolution architecture diagrams&lt;/li&gt;
&lt;li&gt;Cost analysis and ROI calculations
&lt;/li&gt;
&lt;li&gt;AgentCore platform integration&lt;/li&gt;
&lt;li&gt;Enterprise architecture considerations&lt;/li&gt;
&lt;li&gt;Complete code samples and notebooks&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Discussion Questions
&lt;/h2&gt;

&lt;p&gt;I'd love to hear your experiences:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;What challenges have you faced orchestrating multiple AI agents?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How do you approach AWS and SAP GenAI integration in your projects?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What's your strategy for securing enterprise AI integrations?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Abraham Arellano Tavara | Senior Solutions Architect, AWS Munich | &lt;a href="https://www.linkedin.com/in/abraham-arellano-tavara/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  #AWS #SAP #AIAgents #EnterpriseAI
&lt;/h1&gt;

</description>
      <category>aws</category>
      <category>sap</category>
      <category>ai</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
