<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Abraham Arellano Tavara</title>
    <description>The latest articles on DEV Community by Abraham Arellano Tavara (@abraham_arellanotavara_7).</description>
    <link>https://dev.to/abraham_arellanotavara_7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3508825%2Fe10ccee5-1db3-42e9-b295-86218ee7a6ed.png</url>
      <title>DEV Community: Abraham Arellano Tavara</title>
      <link>https://dev.to/abraham_arellanotavara_7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/abraham_arellanotavara_7"/>
    <language>en</language>
    <item>
      <title>Choosing Between ML-KEM and ML-DSA for Your Post-Quantum Migration [Part 2]</title>
      <dc:creator>Abraham Arellano Tavara</dc:creator>
      <pubDate>Sun, 09 Nov 2025 15:50:13 +0000</pubDate>
      <link>https://dev.to/abraham_arellanotavara_7/choosing-between-ml-kem-and-ml-dsa-for-your-post-quantum-migration-part-2-4dip</link>
      <guid>https://dev.to/abraham_arellanotavara_7/choosing-between-ml-kem-and-ml-dsa-for-your-post-quantum-migration-part-2-4dip</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Post-Quantum Cryptography Migration Series:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/abraham_arellanotavara_7/the-quantum-threat-nobodys-taking-seriously-but-should-dh5"&gt;Part 1: The Quantum Threat&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2: ML-KEM vs ML-DSA&lt;/strong&gt; (You are here)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Quick Recap from Part 1
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/abraham_arellanotavara_7/the-quantum-threat-nobodys-taking-seriously-but-should-dh5"&gt;Part 1&lt;/a&gt;, we established that the quantum threat isn't coming—it's already here through harvest-now-decrypt-later attacks. Adversaries are collecting encrypted data today to decrypt when quantum computers mature around 2030-2035.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The urgency:&lt;/strong&gt; If your data retention period + migration time &amp;gt; time until quantum computers, you've already run out of time to wait.&lt;/p&gt;

&lt;p&gt;Now comes the critical question: &lt;strong&gt;What do we actually migrate TO?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Question Every Architect Is Asking
&lt;/h2&gt;

&lt;p&gt;"Should we use ML-KEM or ML-DSA?"&lt;/p&gt;

&lt;p&gt;I've seen this question come up repeatedly in architecture discussions, and honestly, the confusion is understandable. The acronyms are overwhelming, and most documentation assumes you already know the difference.&lt;/p&gt;

&lt;p&gt;Here's the reality: &lt;strong&gt;you need both.&lt;/strong&gt; They solve fundamentally different problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Happened in August 2024
&lt;/h2&gt;

&lt;p&gt;NIST finalized three post-quantum cryptography standards after &lt;strong&gt;8 years&lt;/strong&gt; of global scrutiny:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FIPS 203 (ML-KEM)&lt;/strong&gt;: Key encapsulation for establishing shared secrets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FIPS 204 (ML-DSA)&lt;/strong&gt;: Digital signatures for authentication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FIPS 205 (SLH-DSA)&lt;/strong&gt;: Hash-based backup signatures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't experimental. They're production-ready and already deployed by AWS, Google, and Cloudflare.&lt;/p&gt;

&lt;h2&gt;
  
  
  ML-KEM: Your TLS Handshakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it replaces:&lt;/strong&gt; RSA/ECDH key exchange&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where you'll use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TLS 1.3 connections (HTTPS, APIs)&lt;/li&gt;
&lt;li&gt;VPN tunnels (IPsec, OpenVPN)
&lt;/li&gt;
&lt;li&gt;SSH sessions&lt;/li&gt;
&lt;li&gt;Any key establishment protocol&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key sizes:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ML-KEM-768 (recommended):
  Public key: 1,184 bytes (vs. 32 bytes for X25519)
  Ciphertext: 1,088 bytes

Performance overhead: ~150 microseconds per handshake
With connection reuse: effectively 0%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
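&lt;p&gt;To put those sizes in perspective, here is a back-of-the-envelope calculation (a Python sketch; the byte counts are the figures above, plus the assumption that classical ECDH sends one 32-byte X25519 key share in each direction):&lt;/p&gt;

```python
# Back-of-the-envelope: extra bytes ML-KEM-768 adds to one TLS handshake,
# using the sizes quoted above (public key out, ciphertext back).
mlkem_bytes = 1184 + 1088
x25519_bytes = 32 + 32      # one 32-byte key share in each direction

extra = mlkem_bytes - x25519_bytes
print(extra)  # 2208 extra bytes, roughly two additional MTU-sized packets
```

&lt;p&gt;A few kilobytes once per handshake is why the overhead disappears under connection reuse.&lt;/p&gt;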



&lt;p&gt;&lt;strong&gt;Real-world example (AWS KMS):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;KmsClient&lt;/span&gt; &lt;span class="n"&gt;kms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KmsClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;httpClient&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AwsCrtHttpClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;postQuantumTlsEnabled&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// ML-KEM-768 enabled&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benchmark results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0.05% throughput reduction with proper connection pooling&lt;/li&gt;
&lt;li&gt;0.3% latency increase on initial handshake&lt;/li&gt;
&lt;li&gt;Negligible impact with TLS reuse&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ML-DSA: Your Code Signing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it replaces:&lt;/strong&gt; RSA/ECDSA signatures&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where you'll use it:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Software distribution (binaries, containers)&lt;/li&gt;
&lt;li&gt;JWT/OAuth tokens&lt;/li&gt;
&lt;li&gt;Document signing&lt;/li&gt;
&lt;li&gt;Future TLS certificates (when CAs support it)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Signature sizes:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ML-DSA-65 (recommended):
  Public key: 1,952 bytes
  Signature: 3,309 bytes (vs. 64 bytes for ECDSA P-256)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
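&lt;p&gt;For token-based systems the size difference lands directly in your payloads. A quick sketch (Python; dummy zero bytes stand in for real signatures) of how long the base64url-encoded signature segment of a JWT would be:&lt;/p&gt;

```python
import base64

# JWT signature segment length: base64url-encoded, padding stripped.
# 3,309 and 64 bytes are the ML-DSA-65 and ECDSA P-256 sizes quoted above.
mldsa = base64.urlsafe_b64encode(bytes(3309)).rstrip(b"=")
ecdsa = base64.urlsafe_b64encode(bytes(64)).rstrip(b"=")
print(len(mldsa), len(ecdsa))  # 4412 vs 86 characters
```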



&lt;p&gt;&lt;strong&gt;The surprising part:&lt;/strong&gt;&lt;br&gt;
ML-DSA is actually &lt;strong&gt;10x faster&lt;/strong&gt; than RSA-2048 for signing operations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RSA-2048 signing: 2-5 milliseconds
ML-DSA-65 signing: 100-200 microseconds

Verification is similarly fast.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Critical Part: Hybrid Mode
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Never deploy pure post-quantum yet.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hybrid combines classical + PQC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TLS handshake:
  1. ECDH key exchange (classical)
  2. ML-KEM-768 key exchange (PQC)
  3. Combined: KDF(ECDH_secret || ML-KEM_secret)

Security: Attacker must break BOTH to compromise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
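<p>The combining step can be sketched in a few lines. This is an illustration only; real protocols define their own combiner (TLS 1.3 feeds both secrets into its key schedule), and here HKDF-Extract via HMAC-SHA256 stands in:</p>

```python
import hashlib, hmac, os

# Hybrid combiner sketch: one KDF call over both shared secrets.
# HMAC-SHA256 (HKDF-Extract) stands in for the protocol's real KDF.
ecdh_secret = os.urandom(32)    # output of the classical ECDH exchange
mlkem_secret = os.urandom(32)   # output of ML-KEM-768 decapsulation

session_secret = hmac.new(b"hybrid-salt", ecdh_secret + mlkem_secret,
                          hashlib.sha256).digest()
print(len(session_secret))  # 32 bytes; predicting it requires BOTH inputs
```

<p>If either input secret later turns out to be breakable, the combined output is still unpredictable to the attacker.</p>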



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ML-KEM was finalized only in 2024&lt;/li&gt;
&lt;li&gt;Implementation vulnerabilities might emerge&lt;/li&gt;
&lt;li&gt;Side-channel attacks could be discovered&lt;/li&gt;
&lt;li&gt;Hybrid provides insurance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Industry consensus:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NIST supports hybrid deployment during the transition&lt;/li&gt;
&lt;li&gt;NSA CNSA 2.0 allows hybrid through 2030&lt;/li&gt;
&lt;li&gt;IETF standardizing hybrid TLS specs&lt;/li&gt;
&lt;li&gt;AWS/Azure/GCP implement hybrid by default&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Decision Framework for Developers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For key exchange:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;95% of use cases → ML-KEM-768 (hybrid with X25519)
Government/NSS → ML-KEM-1024 (CNSA 2.0 requirement)
Future diversity → Plan for HQC (code-based, finalizing 2027)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For signatures:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;General purpose → ML-DSA-65
Long-term archival → SLH-DSA-256 (hash-based, conservative)
Embedded/IoT → FN-DSA-512 (compact, when FIPS 206 finalizes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
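<p>The two tables above boil down to a lookup. A minimal sketch (Python; the profile keys are my own shorthand, not standard terms):</p>

```python
# The decision tables above as a tiny lookup table.
RECOMMENDATIONS = {
    ("key_exchange", "general"): "ML-KEM-768 hybrid with X25519",
    ("key_exchange", "government"): "ML-KEM-1024 (CNSA 2.0)",
    ("signature", "general"): "ML-DSA-65",
    ("signature", "archival"): "SLH-DSA-256",
    ("signature", "embedded"): "FN-DSA-512 (once FIPS 206 finalizes)",
}

def recommend(purpose, profile="general"):
    """Return the suggested algorithm for a purpose/profile pair."""
    return RECOMMENDATIONS[(purpose, profile)]

print(recommend("signature", "archival"))  # SLH-DSA-256
```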



&lt;p&gt;&lt;strong&gt;Performance comparison:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7jz0ni18jz6ffe7wu2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7jz0ni18jz6ffe7wu2u.png" alt=" " width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Migration Timeline
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why this is urgent:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"Harvest now, decrypt later" attacks are active today. Adversaries are collecting encrypted data to decrypt when quantum computers arrive (~2030-2035).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mosca's Theorem:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If: data_shelf_life + migration_time &amp;gt; time_until_quantum
Then: Start migration NOW
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For most enterprises with sensitive data, that equation already fails.&lt;/p&gt;
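<p>The inequality is trivial to encode, and worth running with your own numbers (the values below are a hypothetical enterprise, not a benchmark):</p>

```python
def must_start_now(data_shelf_life, migration_time, time_until_quantum):
    """Mosca's inequality: True means waiting is no longer an option."""
    return data_shelf_life + migration_time > time_until_quantum

# Hypothetical: 10-year retention, 5-year migration, ~10 years to Q-Day
print(must_start_now(10, 5, 10))  # True
```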

&lt;h2&gt;
  
  
  Common Pitfalls
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ "We'll wait for better algorithms"
&lt;/h3&gt;

&lt;p&gt;These algorithms survived 8 years of attempted breaks. This IS the mature version.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❌ "We don't need hybrid"
&lt;/h3&gt;

&lt;p&gt;Even AWS, Google, and NIST recommend hybrid. Don't skip it.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❌ "Key sizes will break our protocols"
&lt;/h3&gt;

&lt;p&gt;A 1,184-byte public key is manageable on modern networks, and TLS handles record fragmentation transparently.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❌ "Performance will be terrible"
&lt;/h3&gt;

&lt;p&gt;With connection reuse, overhead is negligible. We've benchmarked it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Update your TLS libraries&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenSSL 3.2+ (experimental support)&lt;/li&gt;
&lt;li&gt;BoringSSL (Google's fork, deployed in Chrome)&lt;/li&gt;
&lt;li&gt;AWS-LC (FIPS validated, production-ready)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Test in non-production&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# OpenSSH 9.9+ with ML-KEM&lt;/span&gt;
ssh &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;KexAlgorithms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;mlkem768x25519-sha256 user@host
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Monitor performance&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Measure baseline (classical only)&lt;/li&gt;
&lt;li&gt;Enable hybrid mode&lt;/li&gt;
&lt;li&gt;Compare P50/P95/P99 latency&lt;/li&gt;
&lt;li&gt;Check for regressions&lt;/li&gt;
&lt;/ul&gt;
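<p>Step 3 can be as simple as comparing nearest-rank percentiles of two latency samples. A sketch (Python; the sample values are made up):</p>

```python
def percentile(samples, p):
    """Nearest-rank percentile; good enough for a regression check."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

baseline = [12.0, 13.1, 12.4, 15.0, 12.2, 40.0]  # ms, classical only
hybrid = [12.1, 13.3, 12.6, 15.2, 12.4, 40.5]    # ms, hybrid enabled

for p in (50, 95, 99):
    delta = percentile(hybrid, p) - percentile(baseline, p)
    print(f"P{p} delta: {delta:+.1f} ms")
```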

&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; Gradual rollout&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Canary deployment (5%)&lt;/li&gt;
&lt;li&gt;Monitor for 1 week&lt;/li&gt;
&lt;li&gt;Expand to 25%, 50%, 100%&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;I wrote a comprehensive deep-dive covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete algorithm comparison matrix&lt;/li&gt;
&lt;li&gt;Performance benchmarks&lt;/li&gt;
&lt;li&gt;Algorithm selection flowchart&lt;/li&gt;
&lt;li&gt;Hybrid deployment strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;📖 &lt;a href="https://myitbasics.com/post-quantum-algorithms-ml-kem-ml-dsa-guide/" rel="noopener noreferrer"&gt;Read the full guide: Post-Quantum Cryptography Algorithms Explained&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Also includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Downloadable algorithm selection flowchart (PDF)&lt;/li&gt;
&lt;li&gt;PQC migration checklist&lt;/li&gt;
&lt;li&gt;Links to NIST standards and AWS documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;The standards are ready. The implementations exist. Cloud providers are deploying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your move:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify systems using RSA/ECDH today&lt;/li&gt;
&lt;li&gt;Update SDKs to versions supporting ML-KEM&lt;/li&gt;
&lt;li&gt;Test hybrid mode in staging&lt;/li&gt;
&lt;li&gt;Plan production rollout&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The quantum threat isn't theoretical—it's operational today through harvest-now-decrypt-later attacks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's your migration strategy? Already testing ML-KEM, or still evaluating?&lt;/strong&gt; 👇&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next in This Series
&lt;/h2&gt;

&lt;p&gt;You now understand the threat (Part 1) and which algorithms to use (Part 2). In &lt;strong&gt;Part 3&lt;/strong&gt;, we'll get hands-on with AWS implementation—real code, performance benchmarks, and operational guidance for deploying ML-KEM in production.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 2 of a 6-part series on post-quantum cryptography migration. &lt;a href="https://dev.to/abraham_arellanotavara_7/the-quantum-threat-nobodys-taking-seriously-but-should-dh5"&gt;Read Part 1: The Quantum Threat&lt;/a&gt; if you haven't already.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;#cybersecurity #cryptography #quantum #devops&lt;/p&gt;

</description>
      <category>security</category>
      <category>cryptography</category>
      <category>devops</category>
      <category>architecture</category>
    </item>
    <item>
      <title>The Quantum Threat Nobody's Taking Seriously (But Should)</title>
      <dc:creator>Abraham Arellano Tavara</dc:creator>
      <pubDate>Sun, 02 Nov 2025 19:53:48 +0000</pubDate>
      <link>https://dev.to/abraham_arellanotavara_7/the-quantum-threat-nobodys-taking-seriously-but-should-dh5</link>
      <guid>https://dev.to/abraham_arellanotavara_7/the-quantum-threat-nobodys-taking-seriously-but-should-dh5</guid>
      <description>&lt;p&gt;&lt;em&gt;"We'll wait until quantum computers are actually here."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I hear this from security teams constantly. And every time, I cringe.&lt;/p&gt;

&lt;p&gt;Because they're missing the most dangerous part of the quantum threat: &lt;strong&gt;it's not coming—it's already here.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Attack That's Happening Right Now
&lt;/h2&gt;

&lt;p&gt;Adversaries aren't waiting for quantum computers to break your encryption. They're executing what's called &lt;strong&gt;"Harvest Now, Decrypt Later"&lt;/strong&gt; (HNDL) attacks—passively collecting your encrypted traffic today to decrypt in 2030-2035 when quantum computers mature.&lt;/p&gt;

&lt;p&gt;Your M&amp;amp;A negotiation emails from last month? Collected.&lt;/p&gt;

&lt;p&gt;Patient medical records from your healthcare system? Stored.&lt;/p&gt;

&lt;p&gt;Strategic defense communications? Archived.&lt;/p&gt;

&lt;p&gt;All waiting for Q-Day.&lt;/p&gt;

&lt;p&gt;The scary part? This is completely passive. No intrusion alerts. No failed login attempts. No evidence. Just silent collection of encrypted data that will become readable in a decade.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Math That Changes Everything
&lt;/h2&gt;

&lt;p&gt;Dr. Michele Mosca developed a simple formula that should terrify every security architect:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If X + Y &amp;gt; Z, you're at risk&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;X&lt;/strong&gt; = How long your data must stay secret&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Y&lt;/strong&gt; = How long migration takes
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Z&lt;/strong&gt; = Time until quantum computers arrive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's run this for a typical healthcare organization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;X = 30 years (HIPAA medical record retention)&lt;/li&gt;
&lt;li&gt;Y = 5 years (time to migrate complex systems)&lt;/li&gt;
&lt;li&gt;Z = 10 years (conservative quantum estimate)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;30 + 5 = 35 &amp;gt; 10&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They've already run out of time to wait.&lt;/p&gt;
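<p>The same check, spelled out as code (the numbers are the healthcare example above):</p>

```python
# The healthcare numbers above, plugged into Mosca's inequality.
X = 30  # years the data must stay secret (HIPAA retention)
Y = 5   # years the migration takes
Z = 10  # years until quantum computers (conservative)

at_risk = X + Y > Z
print(at_risk)  # True: 35 years of exposure against a 10-year runway
```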

&lt;h2&gt;
  
  
  The Financial Reality
&lt;/h2&gt;

&lt;p&gt;According to IBM's 2024 Data Breach Report, the average healthcare breach costs &lt;strong&gt;$9.77 million&lt;/strong&gt;. But that's for breaches discovered today.&lt;/p&gt;

&lt;p&gt;What about the quantum liability? Consider 10 years of patient data being harvested right now, then decrypted in 2035. At up to $50,000 per HIPAA violation, a mid-size healthcare provider could be looking at &lt;strong&gt;hundreds of millions in potential liability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And it's not just healthcare. Financial services process $500 billion daily. Government agencies hold state secrets that never expire. Even commercial enterprises have 5-10 year product roadmaps that competitors would pay millions to access.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Compliance Hammer
&lt;/h2&gt;

&lt;p&gt;The NSA's CNSA 2.0 isn't a suggestion—it's a mandate with hard deadlines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2025&lt;/strong&gt;: Software/firmware signing transition begins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2027&lt;/strong&gt;: New government systems must support post-quantum crypto&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2030&lt;/strong&gt;: VPNs, routers, firewalls must be compliant&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2035&lt;/strong&gt;: Complete quantum-resistant transition required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlj1ns2w3srx9i3n5o2a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlj1ns2w3srx9i3n5o2a.jpg" alt=" " width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're in government, defense, or their supply chain, you must comply or lose contracts. And those requirements cascade down through vendors and subcontractors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "Wait for Standards" Fails
&lt;/h2&gt;

&lt;p&gt;The most common response I hear: &lt;em&gt;"We'll wait until the standards mature."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here's the problem with that strategy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standards ARE finalized.&lt;/strong&gt; NIST published FIPS 203, 204, and 205 in August 2024. The "wait for standards" excuse expired over a year ago.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Migration takes 5-10 years.&lt;/strong&gt; This isn't a weekend deployment. It's discovery, planning, pilot programs, production rollout, and legacy system transitions. For complex enterprises, that's easily a decade.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data is being harvested NOW.&lt;/strong&gt; Every day you wait is another day of encrypted traffic being collected for future decryption.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;This isn't about whether quantum computers will break RSA encryption. They will.&lt;/p&gt;

&lt;p&gt;It's not about whether post-quantum standards exist. They do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's about time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For most organizations with sensitive data, the calculation is clear: if your data must stay secret for longer than the time remaining until quantum computers arrive, minus the time your migration takes, you are already exposed.&lt;/p&gt;

&lt;p&gt;The question isn't whether to migrate to post-quantum cryptography. It's whether you'll start before or after your data gets harvested.&lt;/p&gt;




&lt;h2&gt;
  
  
  Want the Full Analysis?
&lt;/h2&gt;

&lt;p&gt;I've written a comprehensive deep-dive covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete three-phase HNDL attack patterns and how they work&lt;/li&gt;
&lt;li&gt;Industry-specific risk calculations (healthcare, financial, government, enterprise)&lt;/li&gt;
&lt;li&gt;Detailed CNSA 2.0 compliance timeline with specific deadlines&lt;/li&gt;
&lt;li&gt;Why the $4.88M average breach cost dramatically underestimates quantum-era exposure&lt;/li&gt;
&lt;li&gt;Strategic migration frameworks and vendor dependency management&lt;/li&gt;
&lt;li&gt;What's actually vulnerable vs. safe in your current crypto stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Read the full article:&lt;/strong&gt; &lt;a href="https://myitbasics.com/quantum-threat-harvest-now-decrypt-later/" rel="noopener noreferrer"&gt;The Quantum Threat: Why "Harvest Now, Decrypt Later" Means Your Data Is Already at Risk&lt;/a&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>architecture</category>
      <category>quantum</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why Your Authentication Architecture Is Your Biggest Security Blind Spot</title>
      <dc:creator>Abraham Arellano Tavara</dc:creator>
      <pubDate>Sat, 25 Oct 2025 18:20:56 +0000</pubDate>
      <link>https://dev.to/abraham_arellanotavara_7/why-your-authentication-architecture-is-your-biggest-security-blind-spot-2b3</link>
      <guid>https://dev.to/abraham_arellanotavara_7/why-your-authentication-architecture-is-your-biggest-security-blind-spot-2b3</guid>
      <description>&lt;p&gt;Every second, millions of authentication decisions are being made across global networks. Each one is a potential point of vulnerability—or a fortress of trust.&lt;/p&gt;

&lt;p&gt;After architecting authentication systems across diverse infrastructures for years, I've noticed something troubling: most technical teams focus on &lt;em&gt;implementing&lt;/em&gt; authentication methods while completely missing the architectural foundations that determine whether their systems will stand or fall under attack.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Authentication Paradox
&lt;/h2&gt;

&lt;p&gt;Here's the challenge that keeps security architects up at night: authentication seems deceptively simple at first. Verify the user is who they claim to be. Easy, right?&lt;/p&gt;

&lt;p&gt;But in practice, this spawns a complex web of technical decisions that ripple through every layer of your system. Like a medieval castle's defense system, modern authentication must protect multiple entry points while maintaining efficient access for legitimate users.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Pillars You Can't Ignore
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fza1c38c8oezedf0dd868.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fza1c38c8oezedf0dd868.webp" alt="Security 4 pillars" width="800" height="651"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram above illustrates something critical that many architects overlook: authentication isn't just about the login screen. It's a complete architectural layer that touches every component of your system.&lt;/p&gt;

&lt;p&gt;Modern authentication architecture rests on four interconnected pillars:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validation Mechanisms&lt;/strong&gt; - Gone are the days of simple password checks. Today's systems orchestrate a sophisticated ballet of verification methods, from biometric validation to behavioral analysis. Each mechanism must work in concert, creating harmony between security and usability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security Boundaries&lt;/strong&gt; - Think of these as fortified vaults within vaults. Each boundary must protect its contents &lt;em&gt;and&lt;/em&gt; resist attacks on its own infrastructure. Minor boundary breaches can cascade into major security incidents—I've seen it happen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trust Management&lt;/strong&gt; - Creating and maintaining trust states is like diplomatic relations between nations. Initial trust must be established through rigorous verification, then maintained through continuous validation that adapts to changing conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure Handling&lt;/strong&gt; - Here's the counterintuitive part: how your system fails is as important as how it succeeds. Secure failure modes must prevent unauthorized access while maintaining availability and user experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Modern Threats Actually Look Like
&lt;/h2&gt;

&lt;p&gt;The authentication landscape in 2025 isn't what it was even two years ago. We're dealing with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distributed architecture complexity&lt;/strong&gt; where authentication must work seamlessly across microservices, multiple clouds, and hybrid environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sophisticated attack vectors&lt;/strong&gt; beyond simple password attacks—think credential stuffing, replay attacks, and AI-powered social engineering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The zero-trust imperative&lt;/strong&gt; where authentication serves as the new security perimeter, replacing outdated perimeter-based models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regulatory evolution&lt;/strong&gt; with GDPR, CCPA, and industry-specific requirements demanding more robust mechanisms&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Architect's Dilemma
&lt;/h2&gt;

&lt;p&gt;From the architect's perspective, every authentication decision creates a cascade of implications:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Architecture&lt;/strong&gt; - Authentication requirements fundamentally shape your entire stack, from database design to API structures. These decisions ripple through every layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance at Scale&lt;/strong&gt; - Authentication sits in the critical path of user interactions. Every millisecond matters. Modern systems must balance robust security with lightning-fast performance through sophisticated caching and optimized cryptographic operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security Defense-in-Depth&lt;/strong&gt; - Like a medieval castle with multiple walls and moats, your authentication must implement layered security with multiple validation checkpoints and separated security contexts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability Engineering&lt;/strong&gt; - As systems grow, authentication must scale proportionally. This isn't just about handling more users—it's about maintaining security and performance under increasing load.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Vulnerabilities
&lt;/h2&gt;

&lt;p&gt;Here's what keeps me up at night: even the most robustly designed authentication systems can be vulnerable to subtle, sophisticated attack vectors that exploit their &lt;em&gt;physical implementation&lt;/em&gt; rather than their logical design.&lt;/p&gt;

&lt;p&gt;Side-channel attacks, timing analysis, cache behaviors, and microarchitectural vulnerabilities can all compromise authentication implementations in ways that traditional security testing completely misses.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Should Do Next
&lt;/h2&gt;

&lt;p&gt;If you're architecting authentication systems (or inheriting one), here are my immediate recommendations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audit your authentication boundaries&lt;/strong&gt; - Map out every trust boundary in your system and test for cascade failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure your authentication latency&lt;/strong&gt; - If you're adding more than 50ms to user interactions, you need optimization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review your failure modes&lt;/strong&gt; - How does your system fail? Does it fail securely?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan for scale&lt;/strong&gt; - Can your authentication system handle 10x your current load? 100x?&lt;/li&gt;
&lt;/ol&gt;
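<p>For recommendation 2, the measurement can start as simply as wrapping the call (a Python sketch; <code>authenticate</code> here is a hypothetical stand-in for your real credential check):</p>

```python
import time

def authenticate(user, password):
    """Hypothetical stand-in for a real credential check."""
    time.sleep(0.01)  # placeholder work
    return True

# Time one authentication round-trip against the 50 ms budget.
start = time.perf_counter()
authenticate("alice", "s3cret")
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"auth latency: {elapsed_ms:.1f} ms")  # investigate anything over 50 ms
```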

&lt;h2&gt;
  
  
  The Deep Dive
&lt;/h2&gt;

&lt;p&gt;This barely scratches the surface of what modern authentication architecture demands. In my comprehensive guide, I break down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The technical intricacies of password-based authentication beyond simple storage&lt;/li&gt;
&lt;li&gt;How hardware tokens actually work at a cryptographic level&lt;/li&gt;
&lt;li&gt;Real-world implementation challenges and solutions from actual production systems&lt;/li&gt;
&lt;li&gt;Performance optimization techniques that maintain security&lt;/li&gt;
&lt;li&gt;The emerging world of biometric authentication and its architectural implications&lt;/li&gt;
&lt;li&gt;How side-channel attacks can compromise even well-designed systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://myitbasics.com/authentication-architecture/" rel="noopener noreferrer"&gt;Read the full technical deep dive on Authentication Architecture&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The guide includes detailed diagrams, code examples, and architectural patterns drawn from years of production experience. Whether you're building a new system or securing an existing one, understanding these foundations is crucial.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your Turn
&lt;/h2&gt;

&lt;p&gt;What authentication challenges are you facing in your architecture? Have you discovered any surprising vulnerabilities in your systems? Drop your experiences in the comments—I'd love to hear how other architects are tackling these problems.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Looking for more practical security architecture insights? Check out my blog at &lt;a href="https://myitbasics.com" rel="noopener noreferrer"&gt;myitbasics.com&lt;/a&gt; where I share technical deep dives on building secure, scalable systems.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>architecture</category>
      <category>authentication</category>
      <category>webdev</category>
    </item>
    <item>
      <title>After Asana's AI Breach: What It Takes to Deploy Production AI Agents Securely</title>
      <dc:creator>Abraham Arellano Tavara</dc:creator>
      <pubDate>Sat, 18 Oct 2025 19:53:10 +0000</pubDate>
      <link>https://dev.to/abraham_arellanotavara_7/after-asanas-ai-breach-what-it-takes-to-deploy-production-ai-agents-securely-2c84</link>
      <guid>https://dev.to/abraham_arellanotavara_7/after-asanas-ai-breach-what-it-takes-to-deploy-production-ai-agents-securely-2c84</guid>
      <description>&lt;p&gt;When Asana's Model Context Protocol server leaked data from ~1,000 organizations due to a session isolation flaw in May 2025, it crystallized a question I hear constantly from enterprise CTOs: &lt;strong&gt;"Can we actually deploy AI agents without creating the next security incident?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk93ns8k32hpxsw8v65rm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk93ns8k32hpxsw8v65rm.png" alt="Amazon Bedrock AgentCore Architecture" width="800" height="579"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After spending the past year deploying Amazon Bedrock AgentCore with customers across Europe—from 18-year-old SAP systems to regulated financial services—I've learned that moving AI agents from prototype to production isn't a framework problem. It's an &lt;strong&gt;infrastructure problem&lt;/strong&gt; that most teams discover too late.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-Month vs 6-Month Gap
&lt;/h2&gt;

&lt;p&gt;Here's the uncomfortable pattern I see repeatedly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3 months:&lt;/strong&gt; Build an impressive AI agent demo&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6 months:&lt;/strong&gt; Solve infrastructure problems you didn't know existed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gap isn't about choosing LangChain vs CrewAI or Claude vs GPT. It's about infrastructure challenges that traditional application architectures never had to solve.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4 Infrastructure Problems That Kill Production Deployments
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Session Isolation (The Asana Problem)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The issue:&lt;/strong&gt; Traditional stateless functions terminate after each request. AI agents maintain complex state across multiple interactions—conversation history, tool permissions, intermediate computations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world impact:&lt;/strong&gt; Cross-tenant data contamination when one user's agent context bleeds into another's session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production solution:&lt;/strong&gt; Each user session requires its own dedicated microVM with isolated compute, memory, and filesystem resources, terminated completely once the session ends.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What actually happens in production
&lt;/span&gt;&lt;span class="n"&gt;runtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;container_image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent:latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;protocol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGENT_CORE_RPC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;memory_size_mb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;vcpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Each session gets its own isolated microVM
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;runtime_session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-12345&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Isolated session
&lt;/span&gt;    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze Q4 financials&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Long-Running Workflows
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The issue:&lt;/strong&gt; Research agents analyzing competitive intelligence or processing regulatory documents can't complete in Lambda's 15-minute window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world example:&lt;/strong&gt; A financial services agent analyzing SEC filings needs to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetch documents (5-10 min)&lt;/li&gt;
&lt;li&gt;Parse and extract data (15-20 min)&lt;/li&gt;
&lt;li&gt;Cross-reference with historical data (10-15 min)&lt;/li&gt;
&lt;li&gt;Generate compliance report (5-10 min)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total time:&lt;/strong&gt; 35-55 minutes&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you need:&lt;/strong&gt; Agent sessions lasting up to 8 hours for multi-step agentic workflows.&lt;/p&gt;
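&lt;p&gt;The arithmetic from the list above, as a quick sanity check against Lambda's 15-minute timeout and an 8-hour agent session (the stage durations are the article's estimates for the SEC-filing example):&lt;/p&gt;

```python
# Illustrative arithmetic only: stage durations (minutes) from the SEC-filing
# example, checked against Lambda's 15-minute limit and an 8-hour session.
stages = {
    "fetch_documents": (5, 10),
    "parse_and_extract": (15, 20),
    "cross_reference": (10, 15),
    "generate_report": (5, 10),
}

total_min = sum(lo for lo, hi in stages.values())   # 35 minutes
total_max = sum(hi for lo, hi in stages.values())   # 55 minutes

LAMBDA_LIMIT_MIN = 15              # hard Lambda timeout
AGENT_SESSION_LIMIT_MIN = 8 * 60   # 8-hour agent session

fits_in_lambda = total_max <= LAMBDA_LIMIT_MIN                 # False
fits_in_agent_session = total_max <= AGENT_SESSION_LIMIT_MIN   # True
print(total_min, total_max, fits_in_lambda, fits_in_agent_session)
```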

&lt;h3&gt;
  
  
  3. Identity Complexity
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The issue:&lt;/strong&gt; A single agent invocation might require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OAuth authentication from the user&lt;/li&gt;
&lt;li&gt;IAM roles for AWS resources&lt;/li&gt;
&lt;li&gt;API keys for third-party services&lt;/li&gt;
&lt;li&gt;All while maintaining proper permission boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The gotcha I see constantly:&lt;/strong&gt; OAuth token expiration during long-running sessions manifests as tool invocation failures after 60-90 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production fix:&lt;/strong&gt; Implement token refresh logic in your middleware rather than relying on cached credentials.&lt;/p&gt;
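&lt;p&gt;A minimal sketch of that middleware fix, assuming a hypothetical &lt;code&gt;OAuthToken&lt;/code&gt; wrapper (this is not an AgentCore API, just the expiry-skew pattern: check the token before every tool call instead of trusting a cached credential):&lt;/p&gt;

```python
# Hedged sketch: refresh OAuth tokens ahead of expiry in middleware, rather
# than caching a credential that silently dies after 60-90 minutes.
# OAuthToken and refresh_cb are hypothetical names, not a real AgentCore API.
import time

class OAuthToken:
    def __init__(self, access_token, expires_at, refresh_cb):
        self.access_token = access_token
        self.expires_at = expires_at    # epoch seconds
        self._refresh_cb = refresh_cb   # provider-specific refresh call

    def get(self, skew_seconds=120):
        """Return a valid access token, refreshing ahead of expiry."""
        if time.time() >= self.expires_at - skew_seconds:
            self.access_token, self.expires_at = self._refresh_cb()
        return self.access_token

def fake_refresh():
    # Stand-in for a real OAuth token-refresh request.
    return "new-token", time.time() + 3600

token = OAuthToken("old-token", expires_at=time.time() + 10,
                   refresh_cb=fake_refresh)
print(token.get())  # within the skew window, so the middleware refreshes first
```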

&lt;h3&gt;
  
  
  4. Observability for Non-Deterministic Systems
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The challenge:&lt;/strong&gt; When an agent produces unexpected results, you need to trace not just &lt;em&gt;what&lt;/em&gt; happened, but &lt;em&gt;why&lt;/em&gt; the foundation model made specific reasoning decisions across potentially dozens of tool invocations.&lt;/p&gt;

&lt;p&gt;Traditional APM tools don't capture this level of detail.&lt;/p&gt;
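&lt;p&gt;A minimal sketch of what agent-level tracing has to add on top of APM: recording each tool invocation together with the model's stated rationale, so a bad answer can be walked back through the chain of decisions. All names here are illustrative, not a real tracing API:&lt;/p&gt;

```python
# Hedged sketch: a decorator that logs every tool call with the rationale the
# model gave for choosing it. Illustrative only; not a real APM integration.
import json
import time

TRACE = []

def traced_tool(name):
    def decorator(fn):
        def wrapper(*args, rationale="", **kwargs):
            start = time.time()
            result = fn(*args, **kwargs)
            TRACE.append({
                "tool": name,
                "rationale": rationale,  # why the model chose this tool
                "args": args,
                "elapsed_s": round(time.time() - start, 3),
            })
            return result
        return wrapper
    return decorator

@traced_tool("check_order_status")
def check_order_status(order_number):
    return {"order": order_number, "status": "shipped"}

check_order_status("4711", rationale="User asked where order 4711 is")
print(json.dumps(TRACE, indent=2))
```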

&lt;h2&gt;
  
  
  The SAP Integration Reality
&lt;/h2&gt;

&lt;p&gt;Here's a question from a recent architecture review: &lt;em&gt;"Can AgentCore connect to our SAP ECC 6.0 system?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The system: 18 years old, custom ABAP code, no REST APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is enterprise reality.&lt;/strong&gt; Most production systems weren't designed for modern API consumption.&lt;/p&gt;

&lt;p&gt;The pattern that works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# AgentCore Gateway + Lambda middleware pattern
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentcore&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Gateway&lt;/span&gt;

&lt;span class="n"&gt;sap_order_tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_order_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieve SAP order status using order number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lambda_function_arn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:lambda:eu-central-1:123456:function:sap-rfc-connector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;input_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Lambda function becomes your translation layer between the agent's expectations and SAP's proprietary RFC/BAPI protocols.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What actually fails in production:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network timeouts between Lambda and on-premises SAP&lt;/li&gt;
&lt;li&gt;OAuth token refresh during long sessions&lt;/li&gt;
&lt;li&gt;SAP-specific error codes that agents can't interpret&lt;/li&gt;
&lt;/ul&gt;
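&lt;p&gt;The third failure mode, SAP error codes agents can't interpret, is worth handling inside the Lambda translation layer itself. A hedged sketch with invented error codes and a stubbed RFC call (a real connector would go through RFC/BAPI, e.g. via PyRFC):&lt;/p&gt;

```python
# Hedged sketch of the Lambda "translation layer": map SAP-style error codes
# to hints an agent can act on. Codes and handler shape are illustrative.
SAP_ERROR_HINTS = {
    "VL602": "Order number not found. Ask the user to re-check the number.",
    "RFC_TIMEOUT": "SAP did not answer in time. Retry once, then escalate.",
}

class SapError(Exception):
    def __init__(self, code):
        self.code = code

def query_sap(order_number):
    # Stand-in for the real RFC/BAPI call to the on-premises system.
    if order_number == "000000":
        raise SapError("VL602")
    return "DELIVERED"

def lambda_handler(event, context):
    order_number = event["order_number"]
    try:
        status = query_sap(order_number)
    except SapError as exc:
        hint = SAP_ERROR_HINTS.get(exc.code, "Unexpected SAP error; escalate.")
        return {"ok": False, "error": exc.code, "agent_hint": hint}
    return {"ok": True, "order_number": order_number, "status": status}

print(lambda_handler({"order_number": "000000"}, None))
```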

&lt;h2&gt;
  
  
  Cost Reality Check
&lt;/h2&gt;

&lt;p&gt;When a customer asked about costs for 1,000 conversations daily (5 messages each, 3 tool calls per message), here's what it looked like in Frankfurt region:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runtime&lt;/strong&gt; (2 vCPU, 4GB, 8-min avg sessions): ~$4,200/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gateway&lt;/strong&gt; (15,000 tool calls daily): ~$225/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt; (5,000 events daily): ~$375/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; (CloudWatch): ~$100/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total: ~$4,900/month&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The comparison that matters:&lt;/strong&gt; Building equivalent infrastructure in-house requires a senior engineer (€90K annually = €7,500/month) for 3+ months of development, plus ongoing maintenance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Break-even point:&lt;/strong&gt; 3 months&lt;/p&gt;
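&lt;p&gt;The line items above as plain arithmetic (these are the article's Frankfurt-region estimates, not price quotes):&lt;/p&gt;

```python
# Reproducing the cost comparison as arithmetic. Figures are the article's
# estimates; euros and dollars are treated as roughly comparable, as the
# article does.
agentcore_monthly = {
    "runtime": 4200,
    "gateway": 225,
    "memory": 375,
    "observability": 100,
}
total = sum(agentcore_monthly.values())  # ~$4,900/month

engineer_monthly_eur = 90_000 / 12       # a senior engineer at €7,500/month

# The managed service runs below the ongoing cost of the engineer who would
# build and maintain the equivalent in-house, before even counting the
# 3-month initial build.
print(total, engineer_monthly_eur, total < engineer_monthly_eur)
```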

&lt;h2&gt;
  
  
  When AgentCore Makes Sense
&lt;/h2&gt;

&lt;p&gt;✅ &lt;strong&gt;Yes, use it when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-tenant applications where session isolation is critical&lt;/li&gt;
&lt;li&gt;Regulated industries with audit requirements (finance, healthcare)&lt;/li&gt;
&lt;li&gt;Complex integrations across SAP, Salesforce, ServiceNow&lt;/li&gt;
&lt;li&gt;OAuth identity requirements where agents act on behalf of users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;❌ &lt;strong&gt;No, don't use it when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-frequency, sub-100ms latency requirements&lt;/li&gt;
&lt;li&gt;Simple automation tasks (single database queries)&lt;/li&gt;
&lt;li&gt;Budget constraints below $3-5K monthly&lt;/li&gt;
&lt;li&gt;You need complete infrastructure control&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Architecture Insight That Changed Everything
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AgentCore isn't competing with LangChain, CrewAI, or LlamaIndex.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AgentCore is the infrastructure those frameworks run on. Think &lt;strong&gt;Kubernetes for AI agents&lt;/strong&gt;—you bring your framework and model, AgentCore provides production-grade runtime, security, and operational tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  GDPR Reality for European Markets
&lt;/h2&gt;

&lt;p&gt;The critical gotcha I've seen catch multiple organizations:&lt;/p&gt;

&lt;p&gt;AgentCore Memory supports both short-term event retention and long-term storage. &lt;strong&gt;By default, long-term memory persists indefinitely.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You must configure time-to-live policies to comply with GDPR's right to erasure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My recommendation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;90-day retention for short-term memory&lt;/li&gt;
&lt;li&gt;Explicit deletion workflows for long-term storage&lt;/li&gt;
&lt;li&gt;Deploy in Frankfurt region (eu-central-1) for data residency&lt;/li&gt;
&lt;/ul&gt;
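&lt;p&gt;A minimal sketch of the 90-day retention check behind those recommendations: given stored memory events with timestamps, find the ones past the window and due for deletion. The event shape is hypothetical, not the AgentCore Memory schema:&lt;/p&gt;

```python
# Hedged sketch: flag memory events older than a 90-day retention window as
# deletion candidates. Event dicts here are illustrative, not a real schema.
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)

def expired_events(events, now=None):
    """Return events older than the retention window (deletion candidates)."""
    now = now or datetime.now(timezone.utc)
    return [e for e in events if now - e["created_at"] > RETENTION]

now = datetime(2025, 10, 1, tzinfo=timezone.utc)
events = [
    {"id": "a", "created_at": datetime(2025, 9, 1, tzinfo=timezone.utc)},  # 30 days old
    {"id": "b", "created_at": datetime(2025, 6, 1, tzinfo=timezone.utc)},  # ~120 days old
]
stale = expired_events(events, now=now)
print([e["id"] for e in stale])  # only "b" is past the 90-day window
```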

&lt;h2&gt;
  
  
  The "Start Simple" Pattern That Works
&lt;/h2&gt;

&lt;p&gt;Based on successful deployments:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 1-3:&lt;/strong&gt; Prototype in free tier (until Sep 16, 2025)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build agent using your preferred framework&lt;/li&gt;
&lt;li&gt;Deploy to AgentCore Runtime&lt;/li&gt;
&lt;li&gt;Integrate 2-3 tools through Gateway&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Week 4-10:&lt;/strong&gt; Pilot with 100-500 users&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor costs and observability&lt;/li&gt;
&lt;li&gt;Refine tool integrations&lt;/li&gt;
&lt;li&gt;Gather user feedback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Week 11+:&lt;/strong&gt; Production rollout&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with one use case&lt;/li&gt;
&lt;li&gt;Expand based on ROI&lt;/li&gt;
&lt;li&gt;Implement memory strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The teams that struggled:&lt;/strong&gt; Tried to migrate entire application portfolios at once without understanding cost implications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Session isolation isn't optional&lt;/strong&gt; for multi-tenant agents. The Asana incident demonstrated what happens when isolation fails.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integration complexity compounds quickly.&lt;/strong&gt; Every additional backend system adds authentication layers, error handling, and monitoring. Gateway's automatic conversion eliminates months of work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Production agents require production infrastructure.&lt;/strong&gt; Memory management, observability, and identity controls aren't features you add later—they're foundational.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Discussion Questions
&lt;/h2&gt;

&lt;p&gt;I'd love to hear your perspective:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Have you deployed AI agents in production? What infrastructure challenges surprised you?&lt;/li&gt;
&lt;li&gt;For those running SAP or legacy systems—how are you handling integration?&lt;/li&gt;
&lt;li&gt;What's your biggest concern: security, cost, or complexity?&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Full technical deep-dive (with code examples, architecture diagrams, and cost breakdowns):&lt;/strong&gt;&lt;br&gt;
👉 &lt;a href="https://myitbasics.com/deploy-ai-agents-production-aws-agentcore/" rel="noopener noreferrer"&gt;Read the complete guide on MyITBasics&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete SAP integration architecture with authentication flows&lt;/li&gt;
&lt;li&gt;Regional deployment strategies for GDPR compliance&lt;/li&gt;
&lt;li&gt;Debugging common production issues&lt;/li&gt;
&lt;li&gt;Implementation quickstart with Dockerfile&lt;/li&gt;
&lt;li&gt;All 7 AgentCore services explained in detail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Abraham Arellano Tavara | Senior Strategic Solutions Architect, AWS Munich&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>machinelearning</category>
      <category>security</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>I Tested GPU Time-Slicing With Real LLMs So You Don't Have To 🚀</title>
      <dc:creator>Abraham Arellano Tavara</dc:creator>
      <pubDate>Mon, 29 Sep 2025 19:11:58 +0000</pubDate>
      <link>https://dev.to/abraham_arellanotavara_7/i-tested-gpu-time-slicing-with-real-llms-so-you-dont-have-to-2n9d</link>
      <guid>https://dev.to/abraham_arellanotavara_7/i-tested-gpu-time-slicing-with-real-llms-so-you-dont-have-to-2n9d</guid>
      <description>&lt;h2&gt;
  
  
  I Tested GPU Time-Slicing With Real LLMs So You Don't Have To 🚀
&lt;/h2&gt;

&lt;h2&gt;
  
  
  🎯 TL;DR - The Numbers Don't Lie
&lt;/h2&gt;

&lt;p&gt;I spent a week testing NVIDIA time-slicing on AWS EKS with &lt;strong&gt;real LLM workloads&lt;/strong&gt; (not toy examples). Here's what actually happens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Time-slicing overhead&lt;/strong&gt;: Only ~1% (NVIDIA crushed this)&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Concurrent workloads&lt;/strong&gt;: 50-100% performance degradation (physics can't be cheated)&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;Cost savings&lt;/strong&gt;: 50% reduction for sequential workloads&lt;/li&gt;
&lt;li&gt;🎯 &lt;strong&gt;Best use&lt;/strong&gt;: Dev/test environments, time-shifted workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottom line&lt;/strong&gt;: Time-slicing is brilliant for isolation, terrible for concurrent performance.&lt;/p&gt;

&lt;p&gt;📦 &lt;strong&gt;Full code, configs, and test scripts&lt;/strong&gt;: &lt;a href="https://github.com/AbrahamArellano/eks-shared-gpu-ai-performance" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🔑 Quick Reference - Key Terms
&lt;/h2&gt;

&lt;p&gt;Before we dive deep, here's your decoder ring:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;th&gt;Why You Care&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time-Slicing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPU virtualization creating multiple virtual GPUs from one physical GPU&lt;/td&gt;
&lt;td&gt;Lets multiple apps share a GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OOM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Out Of Memory - when GPU runs out of VRAM&lt;/td&gt;
&lt;td&gt;Your pods crash mysteriously&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TGI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Text Generation Inference - HuggingFace's LLM serving engine&lt;/td&gt;
&lt;td&gt;Industry standard for serving models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Concurrent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multiple workloads running simultaneously&lt;/td&gt;
&lt;td&gt;Where performance degradation happens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sequential&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Workloads running one after another&lt;/td&gt;
&lt;td&gt;Where time-slicing shines&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  💸 The $500 Question That Started This
&lt;/h2&gt;

&lt;p&gt;Picture this: You're running two LLM models in production. That's &lt;strong&gt;$2/hour&lt;/strong&gt; for two GPU instances. Over a month, that's &lt;strong&gt;$1,440&lt;/strong&gt;. Your CFO is asking why the GPU bill is so high.&lt;/p&gt;

&lt;p&gt;Then someone mentions NVIDIA time-slicing: "Just share one GPU between both models!"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The question everyone asks&lt;/strong&gt;: Does this actually work without destroying performance?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The answer everyone gives&lt;/strong&gt;: &lt;em&gt;"It depends..."&lt;/em&gt; (not helpful)&lt;/p&gt;

&lt;p&gt;So I decided to test it with &lt;strong&gt;real production workloads&lt;/strong&gt; and actual performance measurement. No toy examples. No theoretical benchmarks. Just two real LLMs hammering a shared GPU.&lt;/p&gt;

&lt;p&gt;Spoiler: The results surprised me.&lt;/p&gt;




&lt;h2&gt;
  
  
  🏗️ The Test Lab Setup
&lt;/h2&gt;

&lt;p&gt;Here's what I built for this experiment:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnw4myoha1tymw9m44r8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnw4myoha1tymw9m44r8.png" alt="Test Lab Setup" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  🎮 The Hardware
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU&lt;/strong&gt;: NVIDIA L40S (46GB VRAM) - The new hotness&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instance&lt;/strong&gt;: g6e.2xlarge (~$1.01/hour in us-west-2)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: Much cheaper than p3.8xlarge ($12.24/hour)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes&lt;/strong&gt;: EKS 1.32 with NVIDIA GPU Operator&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  🤖 The Contenders
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Model A: Microsoft Phi-3.5-mini-instruct&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Size: ~4GB memory footprint&lt;/li&gt;
&lt;li&gt;Speed: Fast inference (&amp;lt; 1 second)&lt;/li&gt;
&lt;li&gt;Use case: Quick responses, high throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Model B: DeepSeek-R1-Distill-Llama-8B&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Size: ~8GB memory footprint
&lt;/li&gt;
&lt;li&gt;Speed: Slower but more thoughtful (~1 second)&lt;/li&gt;
&lt;li&gt;Use case: Complex reasoning, detailed outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Both running&lt;/strong&gt;: HuggingFace Text Generation Inference (TGI) 3.3.4&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Why these models?&lt;/strong&gt; They represent real production workloads - different sizes, different performance profiles, and combined they use ~12GB (26% of available 46GB).&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  🔥 The 3 Mistakes I Made (So You Don't Have To)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Mistake #1: "GPUs Just Work™" (They Don't)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What I expected&lt;/strong&gt;: Spin up g6e.2xlarge, GPU drivers already installed (like p3 instances)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What actually happened&lt;/strong&gt;: No GPU detected. Pods stuck in &lt;code&gt;Pending&lt;/code&gt;. Panic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe pod
&lt;span class="c"&gt;# Events: 0/1 nodes available: insufficient nvidia.com/gpu&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The plot twist&lt;/strong&gt;: Unlike p3 instances, g6e.2xlarge doesn't come with pre-installed NVIDIA drivers in EKS managed node groups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix that saved the day&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# NVIDIA GPU Operator does ALL the heavy lifting&lt;/span&gt;
helm &lt;span class="nb"&gt;install &lt;/span&gt;gpu-operator nvidia/gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; nodeSelector.eks-node&lt;span class="o"&gt;=&lt;/span&gt;gpu &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This magical operator automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Installs NVIDIA drivers&lt;/li&gt;
&lt;li&gt;✅ Configures container toolkit
&lt;/li&gt;
&lt;li&gt;✅ Deploys device plugin&lt;/li&gt;
&lt;li&gt;✅ Sets up GPU feature discovery&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Pro tip&lt;/strong&gt;: Always use GPU Operator for modern EKS setups. Manual driver installation is a pain.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Mistake #2: "Just Deploy Both Models" (OOM Speedrun)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What I tried&lt;/strong&gt;: Deploy both models with default settings&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened&lt;/strong&gt;: Both pods started... then crashed with cryptic errors&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RuntimeError: CUDA out of memory. Tried to allocate 20.00 GiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The problem&lt;/strong&gt;: Each model tried to grab ~80% of GPU memory. Math doesn't work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model A: 80% × 46GB = 36.8GB&lt;/li&gt;
&lt;li&gt;Model B: 80% × 46GB = 36.8GB
&lt;/li&gt;
&lt;li&gt;Total needed: 73.6GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Available: 46GB&lt;/strong&gt; ❌&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Aggressive memory limits per model&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--cuda-memory-fraction"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.4"&lt;/span&gt;  &lt;span class="c1"&gt;# 🎯 Only use 40% GPU memory per model&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--max-batch-prefill-tokens"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4096"&lt;/span&gt;  &lt;span class="c1"&gt;# ⚠️ Reduced from default 8192&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--max-input-length"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;256"&lt;/span&gt;  &lt;span class="c1"&gt;# 🔒 Limit input size&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--max-total-tokens"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512"&lt;/span&gt;  &lt;span class="c1"&gt;# 🔒 Limit output size&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The math that works&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model A: 40% × 46GB = 18.4GB ✅&lt;/li&gt;
&lt;li&gt;Model B: 40% × 46GB = 18.4GB ✅&lt;/li&gt;
&lt;li&gt;Total: 36.8GB (80% utilization) ✅&lt;/li&gt;
&lt;li&gt;System overhead: 20% buffer ✅&lt;/li&gt;
&lt;/ul&gt;
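&lt;p&gt;That allocation arithmetic generalizes into a quick check: will N models with a given &lt;code&gt;cuda-memory-fraction&lt;/code&gt; fit on a GPU with headroom to spare? Values below match the L40S example (46GB, two models at 0.4):&lt;/p&gt;

```python
# The memory math above as a reusable check: total VRAM claimed by each
# model's cuda-memory-fraction versus what the GPU actually has.
def fits(gpu_gb, fractions, min_headroom=0.10):
    """Return (used_gb, headroom_ratio, ok) for a set of memory fractions."""
    used = sum(f * gpu_gb for f in fractions)
    headroom = (gpu_gb - used) / gpu_gb
    return used, headroom, headroom >= min_headroom

# Default settings: two models each grabbing ~80% cannot fit (negative headroom)
print(fits(46, [0.8, 0.8]))

# Fixed settings: two models at 40% each use 36.8GB and leave a 20% buffer
print(fits(46, [0.4, 0.4]))
```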

&lt;blockquote&gt;
&lt;p&gt;🚨 &lt;strong&gt;Critical setting&lt;/strong&gt;: Without &lt;code&gt;cuda-memory-fraction&lt;/code&gt;, models will OOM during warmup. This isn't optional!&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Mistake #3: "Time-Slicing Config Is Obvious" (It's Not)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What the docs say&lt;/strong&gt;: Create a ConfigMap&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What they don't say&lt;/strong&gt;: You need TWO ConfigMaps and an operator upgrade&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The complete configuration&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ConfigMap 1: Time-slicing configuration&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;time-slicing-config&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-operator&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
    &lt;span class="s"&gt;version: v1&lt;/span&gt;
    &lt;span class="s"&gt;sharing:&lt;/span&gt;
      &lt;span class="s"&gt;timeSlicing:&lt;/span&gt;
        &lt;span class="s"&gt;resources:&lt;/span&gt;
        &lt;span class="s"&gt;- name: nvidia.com/gpu&lt;/span&gt;
          &lt;span class="s"&gt;replicas: 10  # 🎯 10 virtual GPUs from 1 physical&lt;/span&gt;

&lt;span class="s"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# ConfigMap 2: Device plugin config&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;device-plugin-config&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-operator&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
    &lt;span class="s"&gt;version: v1&lt;/span&gt;
    &lt;span class="s"&gt;flags:&lt;/span&gt;
      &lt;span class="s"&gt;migStrategy: none&lt;/span&gt;
    &lt;span class="s"&gt;sharing:&lt;/span&gt;
      &lt;span class="s"&gt;timeSlicing:&lt;/span&gt;
        &lt;span class="s"&gt;renameByDefault: false&lt;/span&gt;
        &lt;span class="s"&gt;failRequestsGreaterThanOne: false&lt;/span&gt;
        &lt;span class="s"&gt;resources:&lt;/span&gt;
        &lt;span class="s"&gt;- name: nvidia.com/gpu&lt;/span&gt;
          &lt;span class="s"&gt;replicas: 10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Then upgrade the operator&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm upgrade gpu-operator nvidia/gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; devicePlugin.config.name&lt;span class="o"&gt;=&lt;/span&gt;device-plugin-config &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify it worked&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe node &amp;lt;gpu-node&amp;gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;nvidia.com/gpu

&lt;span class="c"&gt;# Before:  nvidia.com/gpu: 1  ❌&lt;/span&gt;
&lt;span class="c"&gt;# After:   nvidia.com/gpu: 10 ✅&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;🎉 &lt;strong&gt;Success&lt;/strong&gt;: Your cluster now advertises 10 virtual GPUs instead of 1!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;What this means&lt;/strong&gt;: You can now schedule 10 pods requesting &lt;code&gt;nvidia.com/gpu: 1&lt;/code&gt; on a single physical GPU.&lt;/p&gt;
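To confirm scheduling works, a pod asks for one of those virtual GPUs exactly as it would ask for a dedicated one; the slicing is invisible to the workload. A minimal sketch (pod name and image tag are placeholders of mine, not from the article):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test        # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-check
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]   # prints the GPU if scheduling succeeded
    resources:
      limits:
        nvidia.com/gpu: 1     # consumes 1 of the 10 time-sliced replicas
```

Ten copies of this pod can land on the single physical GPU; an eleventh stays Pending.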




&lt;h2&gt;
  
  
  📊 The Results (Prepare to Be Surprised)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test Scenario 1: Individual Performance (No Competition)
&lt;/h3&gt;

&lt;p&gt;First, I tested each model alone with time-slicing enabled. &lt;strong&gt;Would time-slicing itself add overhead?&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Phi-3.5-Mini Flying Solo
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Success Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time-sliced GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.609s&lt;/td&gt;
&lt;td&gt;98.44 req/min&lt;/td&gt;
&lt;td&gt;100% ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exclusive GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.603s&lt;/td&gt;
&lt;td&gt;99.46 req/min&lt;/td&gt;
&lt;td&gt;100% ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+0.006s&lt;/td&gt;
&lt;td&gt;-1.02 req/min&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Overhead: ~1%&lt;/strong&gt; 🎉&lt;/p&gt;

&lt;h4&gt;
  
  
  DeepSeek-R1 Flying Solo
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Success Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time-sliced GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.135s&lt;/td&gt;
&lt;td&gt;52.84 req/min&lt;/td&gt;
&lt;td&gt;100% ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exclusive GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.142s&lt;/td&gt;
&lt;td&gt;52.49 req/min&lt;/td&gt;
&lt;td&gt;100% ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-0.007s&lt;/td&gt;
&lt;td&gt;+0.35 req/min&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Overhead: effectively zero&lt;/strong&gt; (actually slightly faster!) 🤯&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Key Insight #1&lt;/strong&gt;: NVIDIA time-slicing overhead is &lt;strong&gt;negligible&lt;/strong&gt;. The virtualization layer is incredibly efficient. This is exceptional engineering.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Test Scenario 2: Concurrent Performance (The Real Test)
&lt;/h3&gt;

&lt;p&gt;Now both models hit the GPU &lt;strong&gt;simultaneously&lt;/strong&gt;, with every request fired to both models at the same time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is where reality hits.&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Phi-3.5-Mini Under Fire
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;Concurrent&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.609s&lt;/td&gt;
&lt;td&gt;1.227s&lt;/td&gt;
&lt;td&gt;🔴 &lt;strong&gt;+101.4%&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;98.44 req/min&lt;/td&gt;
&lt;td&gt;48.89 req/min&lt;/td&gt;
&lt;td&gt;🔴 &lt;strong&gt;-50.3%&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Success Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;✅ Still stable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  DeepSeek-R1 Under Fire
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;Concurrent&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.135s&lt;/td&gt;
&lt;td&gt;1.778s&lt;/td&gt;
&lt;td&gt;🔴 &lt;strong&gt;+56.6%&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;52.84 req/min&lt;/td&gt;
&lt;td&gt;33.74 req/min&lt;/td&gt;
&lt;td&gt;🔴 &lt;strong&gt;-36.1%&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Success Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;✅ Still stable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;🚨 &lt;strong&gt;Key Insight #2&lt;/strong&gt;: Resource competition is BRUTAL. When both models compete for the same GPU, latency climbs by 50-100% and throughput drops by a third to a half.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  📈 Visual Performance Comparison
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Individual Performance (Time-Slicing Overhead)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Exclusive GPU:    ████████████████████ 100%
Time-Sliced GPU:  ███████████████████░ 99%
                  ↑ Only 1% difference!

Concurrent Performance (Resource Competition)  
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Baseline:         ████████████████████ 100%
Concurrent:       ██████████░░░░░░░░░░ 50%
                  ↑ Ouch. Physics can't be cheated.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  🤔 Why This Happens (The Physics)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Time-slicing overhead (~1%)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Context switching is fast&lt;/li&gt;
&lt;li&gt;✅ Memory isolation is efficient
&lt;/li&gt;
&lt;li&gt;✅ Scheduling overhead is minimal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Resource competition (50-100% degradation)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Both models fight for GPU cores&lt;/li&gt;
&lt;li&gt;❌ Memory bandwidth saturation&lt;/li&gt;
&lt;li&gt;❌ L2 cache thrashing&lt;/li&gt;
&lt;li&gt;❌ Shared memory contention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The verdict&lt;/strong&gt;: Time-slicing technology is brilliant. GPU resource sharing is expensive.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎯 The Decision Framework (Should YOU Use Time-Slicing?)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ Perfect Use Cases - Deploy With Confidence
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Development &amp;amp; Testing Environments&lt;/strong&gt; 🧪&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: QA team needs to test 3 model versions
Cost without time-slicing: $3/hour (3 GPUs)
Cost with time-slicing: $1/hour (1 GPU)
Savings: $1,440/month
Performance impact: None (sequential testing)
Verdict: Slam dunk ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Time-Shifted Workloads&lt;/strong&gt; ⏰&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Model A (business hours), Model B (batch processing at night)
Overlap: &amp;lt; 10% of time
Performance: 99% (negligible overhead when not competing)
Savings: 50% GPU costs
Verdict: Perfect fit ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Demo &amp;amp; POC Deployments&lt;/strong&gt; 🎬&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Sales demo with multiple model comparisons
Requirements: Not production, occasional use
Budget: Limited
Performance needs: "Good enough"
Verdict: Ideal use case ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. CI/CD Model Testing&lt;/strong&gt; 🔄&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Automated model validation pipelines
Pattern: Sequential test runs
Peak load: One test at a time
Cost optimization: Critical
Verdict: Great match ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  ❌ Terrible Use Cases - Avoid These
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Production Inference Serving&lt;/strong&gt; 💼&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Customer-facing API with SLA requirements
Requirement: &amp;lt; 100ms response time
Concurrent load: Unpredictable spikes
Impact: 50-100% degradation = SLA violations
Verdict: Don't even think about it ❌
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. High-Throughput Concurrent Workloads&lt;/strong&gt; 🚀&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Multiple models serving real-time traffic
Load pattern: Constant concurrent requests
Performance impact: Immediate 50% throughput loss
Business impact: Lost revenue, poor UX
Verdict: Hard pass ❌
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Latency-Sensitive Applications&lt;/strong&gt; ⚡&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Real-time chat, autocomplete, voice assistants
SLA: Sub-second responses required
Concurrent degradation: Doubles latency
User impact: Frustrated users, high churn
Verdict: Nope ❌
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Auto-Scaling Production Workloads&lt;/strong&gt; 📈&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Traffic scales unpredictably
Problem: Can't predict when models compete
Risk: Performance collapse during peak times
Business impact: Revenue loss during high-traffic
Verdict: Too risky ❌
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  🤔 Decision Tree - Find Your Path
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Start Here
    │
    ├─ Is this production? ─── YES ──→ Will workloads overlap?
    │                                       │
    │                                       ├─ YES ──→ ❌ Don't use time-slicing
    │                                       │
    │                                       └─ NO ───→ ✅ Consider time-slicing
    │
    └─ NO (Dev/Test) ─────────────────────→ ✅ Use time-slicing
                                                 (perfect use case!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  💰 ROI Calculator - Your Break-Even Analysis
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Without Time-Slicing&lt;/th&gt;
&lt;th&gt;With Time-Slicing&lt;/th&gt;
&lt;th&gt;Monthly Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2 Models, Sequential&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1,440&lt;/td&gt;
&lt;td&gt;$720&lt;/td&gt;
&lt;td&gt;$720 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2 Models, 30% Overlap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1,440&lt;/td&gt;
&lt;td&gt;$720&lt;/td&gt;
&lt;td&gt;$720 (but some degradation) ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2 Models, 50% Overlap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1,440&lt;/td&gt;
&lt;td&gt;$720&lt;/td&gt;
&lt;td&gt;$720 (significant degradation) ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2 Models, Always Concurrent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1,440&lt;/td&gt;
&lt;td&gt;$720&lt;/td&gt;
&lt;td&gt;$720 (not worth it) ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Break-even point&lt;/strong&gt;: If your workloads overlap &amp;lt; 30% of the time, time-slicing typically provides net positive value.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Pro Tip&lt;/strong&gt;: Monitor actual workload overlap in production before deciding. Use CloudWatch metrics to track GPU utilization patterns.&lt;/p&gt;
&lt;/blockquote&gt;
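Before committing, you can put a number on that overlap. The helper below is my own sketch, not part of the article's tooling: it assumes you export one busy/idle sample per minute for each model (1 when the model was serving, 0 when idle) and reports the fraction of minutes where both were busy at once.

```shell
#!/bin/sh
# Estimate workload overlap from two per-minute utilization logs.
# Each log has one line per minute: 1 = model was busy, 0 = idle.
overlap_fraction() {
    # Pair the logs minute by minute, then count minutes where BOTH were busy.
    paste "$1" "$2" | awk '
        $1 == 1 && $2 == 1 { both++ }
        { total++ }
        END { printf "%.2f\n", (total ? both / total : 0) }'
}
```

`overlap_fraction model_a.log model_b.log` prints a fraction; if it stays under roughly 0.30, the break-even guidance above favors time-slicing.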




&lt;h2&gt;
  
  
  🧪 How I Tested This (Reproducible Science)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Testing Strategy
&lt;/h3&gt;

&lt;p&gt;I built an automated framework to eliminate human error and ensure reproducible results:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test Protocol&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;☝️ Test each model individually (establish baseline)&lt;/li&gt;
&lt;li&gt;✌️ Test both models concurrently (measure degradation)&lt;/li&gt;
&lt;li&gt;🔁 Repeat 3 times with 5 different prompts (45 requests total)&lt;/li&gt;
&lt;li&gt;📊 Calculate statistical averages and impact percentages&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Automation Script
&lt;/h3&gt;

&lt;p&gt;Here's the core testing logic (simplified):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# Complete performance testing framework&lt;/span&gt;

test_individual_model&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;
    &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;

    &lt;span class="c"&gt;# Test prompts covering different complexity levels&lt;/span&gt;
    &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;prompts&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;
        &lt;span class="s2"&gt;"Explain machine learning"&lt;/span&gt;
        &lt;span class="s2"&gt;"What is Python programming"&lt;/span&gt;
        &lt;span class="s2"&gt;"Describe cloud computing"&lt;/span&gt;
        &lt;span class="s2"&gt;"How does AI work"&lt;/span&gt;
        &lt;span class="s2"&gt;"What are automation benefits"&lt;/span&gt;
    &lt;span class="o"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;# Run 3 iterations for statistical accuracy&lt;/span&gt;
    &lt;span class="k"&gt;for &lt;/span&gt;iteration &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;1 3&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
        for &lt;/span&gt;prompt &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
            &lt;span class="c"&gt;# Measure with millisecond precision&lt;/span&gt;
            &lt;span class="nv"&gt;start_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s.%N&lt;span class="si"&gt;)&lt;/span&gt;

            &lt;span class="nv"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$endpoint&lt;/span&gt;&lt;span class="s2"&gt;/generate"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
                &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
                &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"{
                    &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;inputs&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$prompt&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,
                    &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;parameters&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: {
                        &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;max_new_tokens&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: 50,
                        &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;temperature&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: 0.7
                    }
                }"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

            &lt;span class="nv"&gt;end_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s.%N&lt;span class="si"&gt;)&lt;/span&gt;
            &lt;span class="nv"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$end_time&lt;/span&gt;&lt;span class="s2"&gt; - &lt;/span&gt;&lt;span class="nv"&gt;$start_time&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | bc&lt;span class="si"&gt;)&lt;/span&gt;

            &lt;span class="c"&gt;# Record results&lt;/span&gt;
            &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$duration&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;model_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_results.txt"&lt;/span&gt;
        &lt;span class="k"&gt;done
    done&lt;/span&gt;

    &lt;span class="c"&gt;# Calculate statistics&lt;/span&gt;
    calculate_stats &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;model_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_results.txt"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

test_concurrent_models&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;# Fire both requests simultaneously using background jobs&lt;/span&gt;
    &lt;span class="k"&gt;for &lt;/span&gt;prompt &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;prompts&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
        &lt;span class="c"&gt;# Model A request&lt;/span&gt;
        &lt;span class="o"&gt;{&lt;/span&gt;
            measure_latency &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PHI35_ENDPOINT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$prompt&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; phi_concurrent.txt
        &lt;span class="o"&gt;}&lt;/span&gt; &amp;amp;

        &lt;span class="c"&gt;# Model B request  &lt;/span&gt;
        &lt;span class="o"&gt;{&lt;/span&gt;
            measure_latency &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DEEPSEEK_ENDPOINT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$prompt&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; deepseek_concurrent.txt
        &lt;span class="o"&gt;}&lt;/span&gt; &amp;amp;

        &lt;span class="c"&gt;# Wait for both to complete&lt;/span&gt;
        &lt;span class="nb"&gt;wait
    &lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
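The script calls a <code>calculate_stats</code> helper whose body isn't shown. A minimal version consistent with the reported output might look like this (an assumption on my part; the repository's actual implementation may differ):

```shell
#!/bin/sh
# Summarize a results file containing one latency (in seconds) per line:
# request count, average latency, and throughput in requests per minute.
calculate_stats() {
    awk '
        { sum += $1; n++ }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            avg = sum / n
            printf "Total Requests: %d\n", n
            printf "Average Latency: %.3fs\n", avg
            printf "Throughput: %.2f req/min\n", 60 / avg
        }' "$1"
}
```

The throughput formula assumes requests are issued sequentially, matching how the individual baseline drives one request at a time.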



&lt;h3&gt;
  
  
  Kubernetes Scaling for Test Control
&lt;/h3&gt;

&lt;p&gt;The key trick: using Kubernetes deployment scaling to switch cleanly between test scenarios:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Test Phi-3.5 alone&lt;/span&gt;
kubectl scale deployment deepseek-r1-baseline &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing
&lt;span class="nb"&gt;sleep &lt;/span&gt;30  &lt;span class="c"&gt;# Wait for graceful shutdown&lt;/span&gt;
./load_test.sh

&lt;span class="c"&gt;# Test DeepSeek alone&lt;/span&gt;
kubectl scale deployment phi35-mini-baseline &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing
kubectl scale deployment deepseek-r1-baseline &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing
&lt;span class="nb"&gt;sleep &lt;/span&gt;30  &lt;span class="c"&gt;# Wait for startup&lt;/span&gt;
./load_test.sh

&lt;span class="c"&gt;# Test both concurrently&lt;/span&gt;
kubectl scale deployment phi35-mini-baseline &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing
&lt;span class="nb"&gt;sleep &lt;/span&gt;30  &lt;span class="c"&gt;# Wait for startup&lt;/span&gt;
./load_test.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Why this works&lt;/strong&gt;: Scaling deployments ensures clean test isolation without manual intervention or pod management.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What Made This Scientific
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Controlled environment&lt;/strong&gt;: No other GPU workloads running&lt;br&gt;
✅ &lt;strong&gt;Multiple iterations&lt;/strong&gt;: 3 runs × 5 prompts per scenario for stable averages&lt;br&gt;
✅ &lt;strong&gt;Standardized prompts&lt;/strong&gt;: Same inputs across all tests&lt;br&gt;
✅ &lt;strong&gt;Consistent parameters&lt;/strong&gt;: Same token limits, temperature&lt;br&gt;
✅ &lt;strong&gt;Automated execution&lt;/strong&gt;: Eliminates human timing errors&lt;br&gt;
✅ &lt;strong&gt;Millisecond precision&lt;/strong&gt;: Accurate latency measurement&lt;/p&gt;
&lt;h3&gt;
  
  
  Sample Output
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;===&lt;/span&gt; Phi-3.5-Mini &lt;span class="o"&gt;(&lt;/span&gt;Individual Baseline&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt;
Total Requests: 15
Successful: 15 &lt;span class="o"&gt;(&lt;/span&gt;100%&lt;span class="o"&gt;)&lt;/span&gt;
Average Latency: 0.609s
Throughput: 98.44 req/min

&lt;span class="o"&gt;===&lt;/span&gt; Phi-3.5-Mini &lt;span class="o"&gt;(&lt;/span&gt;Concurrent&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt;
Average Latency: 1.227s &lt;span class="o"&gt;(&lt;/span&gt;+101.4% 🔴&lt;span class="o"&gt;)&lt;/span&gt;
Throughput: 48.89 req/min &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nt"&gt;-50&lt;/span&gt;.3% 🔴&lt;span class="o"&gt;)&lt;/span&gt;

Report saved: test_results/GPU_SLICING_FULL_performance_report_20250725_095710.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;📦 &lt;strong&gt;Get the complete testing framework&lt;/strong&gt;: &lt;a href="https://github.com/AbrahamArellano/eks-shared-gpu-ai-performance" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  💰 The Money Talk - Real ROI Analysis
&lt;/h2&gt;

&lt;p&gt;Let's talk dollars and cents, because at the end of the day your CFO cares about the bottom line.&lt;/p&gt;
&lt;h3&gt;
  
  
  Scenario 1: Traditional Approach (Separate GPUs)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────┐
│  Model A: g6e.2xlarge           │
│  Cost: $1.01/hour               │
│  Performance: 100% ✅            │
└─────────────────────────────────┘

┌─────────────────────────────────┐
│  Model B: g6e.2xlarge           │
│  Cost: $1.01/hour               │
│  Performance: 100% ✅            │
└─────────────────────────────────┘

Total: $2.02/hour = $1,454/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 2: Time-Slicing (Sequential Workloads)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────┐
│  Single g6e.2xlarge             │
│                                 │
│  Model A (9am-5pm)  ──────┐    │
│  Model B (6pm-8am)  ──────┘    │
│                                 │
│  Cost: $1.01/hour               │
│  Performance: 99% ✅             │
└─────────────────────────────────┘

Total: $1.01/hour = $727/month
Savings: $727/month (50% reduction! 🎉)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;When this works&lt;/strong&gt;: Workloads naturally time-shifted (batch processing, different timezones, dev/staging)&lt;/p&gt;
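One way to enforce that hand-off automatically is a CronJob that flips the deployments on schedule. A sketch only: the deployment names (<code>model-a</code>, <code>model-b</code>) and the <code>scaler</code> ServiceAccount are placeholders, and the ServiceAccount needs RBAC permission to scale deployments:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: handover-to-model-b      # placeholder name
spec:
  schedule: "0 18 * * *"         # 6pm: Model A down, Model B up
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler   # placeholder; needs "scale" RBAC
          restartPolicy: Never
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command: ["/bin/sh", "-c"]
            args:
            - >
              kubectl scale deployment model-a --replicas=0 &&
              kubectl scale deployment model-b --replicas=1
```

A mirror CronJob at 8am reverses the swap, so the two models never compete for the GPU.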


&lt;h3&gt;
  
  
  Scenario 3: Time-Slicing (Concurrent Workloads)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────┐
│  Single g6e.2xlarge             │
│                                 │
│  Model A + Model B (competing)  │
│                                 │
│  Cost: $1.01/hour               │
│  Performance: 50% ⚠️             │
└─────────────────────────────────┘

Total: $1.01/hour = $727/month
Savings: $727/month
Trade-off: 50% performance loss 💀
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;When this fails&lt;/strong&gt;: Production inference, customer-facing APIs, latency-sensitive applications&lt;/p&gt;


&lt;h3&gt;
  
  
  The Financial Break-Even Matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload Overlap&lt;/th&gt;
&lt;th&gt;Cost Savings&lt;/th&gt;
&lt;th&gt;Performance&lt;/th&gt;
&lt;th&gt;Recommended?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;0-10%&lt;/strong&gt; (mostly sequential)&lt;/td&gt;
&lt;td&gt;50% ✅&lt;/td&gt;
&lt;td&gt;99% ✅&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Yes&lt;/strong&gt; 🎯&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;10-30%&lt;/strong&gt; (occasional overlap)&lt;/td&gt;
&lt;td&gt;50% ✅&lt;/td&gt;
&lt;td&gt;80-90% ⚠️&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Maybe&lt;/strong&gt; 🤔&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;30-50%&lt;/strong&gt; (frequent overlap)&lt;/td&gt;
&lt;td&gt;50% ✅&lt;/td&gt;
&lt;td&gt;60-80% ⚠️&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Risky&lt;/strong&gt; 😬&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;50%+&lt;/strong&gt; (mostly concurrent)&lt;/td&gt;
&lt;td&gt;50% ❌&lt;/td&gt;
&lt;td&gt;50% ❌&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;No&lt;/strong&gt; 🚫&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  Real-World Cost Example (My Consulting Client)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Their Setup&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dev environment: 2 models for A/B testing&lt;/li&gt;
&lt;li&gt;Usage pattern: Sequential (test Model A, then Model B)&lt;/li&gt;
&lt;li&gt;Previous cost: $1,440/month (2 GPUs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After Time-Slicing&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New cost: $720/month (1 GPU)&lt;/li&gt;
&lt;li&gt;Performance: 99% (negligible overhead)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Savings: $8,640/year&lt;/strong&gt; 💰&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;CFO's reaction&lt;/strong&gt;: "Why weren't we doing this before?"&lt;/p&gt;


&lt;h3&gt;
  
  
  The Hidden Costs of Getting It Wrong
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mistake&lt;/strong&gt;: Using time-slicing for production inference&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: E-commerce chatbot with strict SLA (&amp;lt; 500ms response)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before time-slicing:
Response time: 400ms ✅
Conversion rate: 12% ✅
Revenue impact: $0

After time-slicing (concurrent load):
Response time: 800ms ❌ (SLA breach)
Conversion rate: 8% ❌ (users bounce)
Revenue impact: -$50,000/month 💀
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: The $720/month GPU savings cost them $50,000/month in revenue. Not worth it.&lt;/p&gt;




&lt;h3&gt;
  
  
  Your ROI Decision Tree
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Question 1: Are your workloads production-facing?
    │
    ├─ NO ──→ Question 2: Do workloads overlap?
    │           │
    │           ├─ NO ──→ ✅ Use time-slicing (50% savings!)
    │           │
    │           └─ YES ──→ ⚠️ Prototype and measure first
    │
    └─ YES ──→ Question 3: Can you tolerate 50% performance loss?
                │
                ├─ NO ──→ ❌ Don't use time-slicing
                │
                └─ YES ──→ 🤔 Are you SURE? Measure twice, deploy once.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Pro Tip&lt;/strong&gt;: Always prototype with time-slicing in staging before production. Measure actual performance impact with YOUR workloads, not theoretical benchmarks.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🚀 Quick Start - Get Running in 30 Minutes
&lt;/h2&gt;

&lt;p&gt;Want to try this yourself? Here's the exact path I followed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites Check ✅
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify you have these tools installed&lt;/span&gt;
kubectl version &lt;span class="nt"&gt;--client&lt;/span&gt;
helm version
eksctl version
aws &lt;span class="nt"&gt;--version&lt;/span&gt;

&lt;span class="c"&gt;# If any are missing, install from:&lt;/span&gt;
&lt;span class="c"&gt;# kubectl: https://kubernetes.io/docs/tasks/tools/&lt;/span&gt;
&lt;span class="c"&gt;# helm: https://helm.sh/docs/intro/install/&lt;/span&gt;
&lt;span class="c"&gt;# eksctl: https://eksctl.io/installation/&lt;/span&gt;
&lt;span class="c"&gt;# aws: https://aws.amazon.com/cli/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 1: Create EKS Cluster (15 minutes)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create cluster configuration file&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;' &amp;gt; cluster-config.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: gpusharing-demo
  region: us-west-2
  version: "1.32"
nodeGroups:
  - name: main
    instanceType: t3.large
    desiredCapacity: 2
    minSize: 2
    maxSize: 4
  - name: gpu
    instanceType: g6e.2xlarge
    desiredCapacity: 1
    minSize: 1
    maxSize: 1
    labels:
      eks-node: gpu
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Create the cluster (takes ~15 minutes)&lt;/span&gt;
eksctl create cluster &lt;span class="nt"&gt;-f&lt;/span&gt; cluster-config.yaml

&lt;span class="c"&gt;# Verify nodes are ready&lt;/span&gt;
kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What you'll see&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                         STATUS   ROLE    AGE
ip-192-168-1-1...            Ready    &amp;lt;none&amp;gt;  5m    # t3.large
ip-192-168-1-2...            Ready    &amp;lt;none&amp;gt;  5m    # t3.large  
ip-192-168-1-3...            Ready    &amp;lt;none&amp;gt;  5m    # g6e.2xlarge (GPU!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 2: Install NVIDIA GPU Operator (5 minutes)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add NVIDIA Helm repository&lt;/span&gt;
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

&lt;span class="c"&gt;# Install GPU Operator (this does ALL the heavy lifting)&lt;/span&gt;
helm &lt;span class="nb"&gt;install &lt;/span&gt;gpu-operator nvidia/gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; nodeSelector.eks-node&lt;span class="o"&gt;=&lt;/span&gt;gpu &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt;

&lt;span class="c"&gt;# Verify installation (all pods should be Running)&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; gpu-operator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Wait for all pods to show &lt;code&gt;1/1 Running&lt;/code&gt;&lt;/strong&gt; (takes 2-3 minutes)&lt;/p&gt;
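&lt;p&gt;Rather than polling &lt;code&gt;kubectl get pods&lt;/code&gt; by hand, you can block until the operator pods report Ready. The namespace matches the install step above; the timeout value is an assumption:&lt;/p&gt;

```shell
# Block until every pod in the gpu-operator namespace is Ready
NS=gpu-operator
TIMEOUT=300s   # assumed; adjust to your environment
kubectl wait --for=condition=Ready pods --all -n "$NS" --timeout="$TIMEOUT" || true
```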




&lt;h3&gt;
  
  
  Step 3: Enable Time-Slicing (3 minutes)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download complete configuration&lt;/span&gt;
wget https://raw.githubusercontent.com/AbrahamArellano/eks-shared-gpu-ai-performance/main/infra/time-slicing-config.yaml

&lt;span class="c"&gt;# Apply time-slicing configuration&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; time-slicing-config.yaml

&lt;span class="c"&gt;# Upgrade GPU operator with time-slicing&lt;/span&gt;
helm upgrade gpu-operator nvidia/gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; devicePlugin.config.name&lt;span class="o"&gt;=&lt;/span&gt;device-plugin-config &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verify it worked&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe node &lt;span class="si"&gt;$(&lt;/span&gt;kubectl get nodes &lt;span class="nt"&gt;-l&lt;/span&gt; eks-node&lt;span class="o"&gt;=&lt;/span&gt;gpu &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.items[0].metadata.name}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"nvidia.com/gpu:"&lt;/span&gt;

&lt;span class="c"&gt;# Expected output:&lt;/span&gt;
&lt;span class="c"&gt;#  nvidia.com/gpu:     10  ✅ (not 1!)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 4: Deploy Your Models (5 minutes)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create namespace&lt;/span&gt;
kubectl create namespace llm-testing

&lt;span class="c"&gt;# Clone the complete repository&lt;/span&gt;
git clone https://github.com/AbrahamArellano/eks-shared-gpu-ai-performance.git
&lt;span class="nb"&gt;cd &lt;/span&gt;eks-shared-gpu-ai-performance

&lt;span class="c"&gt;# Deploy both models with memory-optimized configs&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; models/mistral-memory-optimized.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; models/deepseek-memory-optimized.yaml

&lt;span class="c"&gt;# Watch pods start (takes 2-3 minutes to download models)&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing &lt;span class="nt"&gt;-w&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Wait for both pods to show &lt;code&gt;1/1 Running&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 5: Run Performance Tests (2 minutes)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Port forward to access models locally&lt;/span&gt;
kubectl port-forward svc/mistral-7b-service 8081:8080 &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing &amp;amp;
kubectl port-forward svc/deepseek-r1-service 8082:8080 &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing &amp;amp;

&lt;span class="c"&gt;# Run the complete test suite&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;tests
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x load_test.sh
./load_test.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output you'll see&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== Complete GPU Time-Slicing Performance Analysis ===
Testing Phi-3.5-Mini (Individual Baseline)...
  ✓ Test 1: 0.610s
  ✓ Test 2: 0.602s
  ...

Testing DeepSeek-R1 (Individual Baseline)...
  ✓ Test 1: 1.142s
  ...

Testing Both Models Concurrently...
  ✓ Both completed
  ...

Report saved: test_results/performance_report_YYYYMMDD_HHMMSS.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 6: View Your Results
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# View the latest report&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;tests/test_results/performance_report_&lt;span class="k"&gt;*&lt;/span&gt;.txt | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll see something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== Phi-3.5-Mini Individual Baseline ===
Average Latency: 0.609s
Throughput: 98.44 req/min

=== Phi-3.5-Mini Concurrent Performance ===
Average Latency: 1.227s
Performance Impact: +101.4% latency 🔴
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  🎉 Success! You've Now:
&lt;/h3&gt;

&lt;p&gt;✅ Created an EKS cluster with GPU support&lt;br&gt;
✅ Enabled NVIDIA time-slicing (10 virtual GPUs)&lt;br&gt;
✅ Deployed two real LLM models&lt;br&gt;
✅ Measured actual performance impact&lt;br&gt;
✅ Generated comprehensive performance reports&lt;/p&gt;


&lt;h3&gt;
  
  
  Cleanup (Don't Forget!)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete the entire cluster to avoid charges&lt;/span&gt;
eksctl delete cluster gpusharing-demo &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2

&lt;span class="c"&gt;# Verify deletion&lt;/span&gt;
aws eks list-clusters &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Important&lt;/strong&gt;: Running this setup costs ~$1.20/hour. Don't forget to delete when done!&lt;/p&gt;
&lt;/blockquote&gt;
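&lt;p&gt;That hourly rate compounds quickly if you forget. A quick back-of-the-envelope, in integer cents:&lt;/p&gt;

```shell
# What ~1.20 USD/hour becomes if the cluster is left running
rate_cents=120                        # ~120 cents per hour
per_day=$(( rate_cents * 24 ))        # cents per day
per_month=$(( rate_cents * 24 * 30 )) # cents per 30-day month
echo "per day:   $(( per_day / 100 )) USD"    # prints: per day:   28 USD
echo "per month: $(( per_month / 100 )) USD"  # prints: per month: 864 USD
```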


&lt;h3&gt;
  
  
  Troubleshooting Common Issues
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Pods stuck in &lt;code&gt;Pending&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check if GPU is detected&lt;/span&gt;
kubectl describe node &amp;lt;gpu-node&amp;gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;nvidia.com/gpu

&lt;span class="c"&gt;# If shows 0, restart device plugin&lt;/span&gt;
kubectl rollout restart daemonset/nvidia-device-plugin-daemonset &lt;span class="nt"&gt;-n&lt;/span&gt; gpu-operator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Models crash with OOM&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check cuda-memory-fraction in deployment&lt;/span&gt;
kubectl describe deployment mistral-7b-baseline &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing

&lt;span class="c"&gt;# Should see: --cuda-memory-fraction 0.4&lt;/span&gt;
&lt;span class="c"&gt;# If not, update the YAML and reapply&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Can't access models via port-forward&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check if services exist&lt;/span&gt;
kubectl get svc &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing

&lt;span class="c"&gt;# Check if pods are ready&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing

&lt;span class="c"&gt;# Restart port-forward&lt;/span&gt;
pkill &lt;span class="nt"&gt;-f&lt;/span&gt; port-forward
kubectl port-forward svc/mistral-7b-service 8081:8080 &lt;span class="nt"&gt;-n&lt;/span&gt; llm-testing &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  📚 Next Steps
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Experiment&lt;/strong&gt;: Try different models from HuggingFace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize&lt;/strong&gt;: Tune memory fractions for your workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor&lt;/strong&gt;: Set up CloudWatch for GPU metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale&lt;/strong&gt;: Add more GPU nodes if needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Complete implementation guide&lt;/strong&gt;: &lt;a href="https://github.com/AbrahamArellano/eks-shared-gpu-ai-performance" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 5 Things I Wish I Knew Before Starting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. "Pre-installed Drivers" Doesn't Mean What You Think
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What I assumed&lt;/strong&gt;: g6e instances come with NVIDIA drivers like p3 instances&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reality check&lt;/strong&gt;: Spent 2 hours debugging why pods couldn't see the GPU&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson&lt;/strong&gt;: Always use GPU Operator for modern EKS setups. It's not optional—it's essential.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time saved for you&lt;/strong&gt;: 2 hours of confusion 😅&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Memory Limits Are Not Suggestions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What I did first&lt;/strong&gt;: Deployed models with default settings&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened&lt;/strong&gt;: Both models tried to grab 80% of GPU memory each&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The crash&lt;/strong&gt;: &lt;code&gt;CUDA out of memory&lt;/code&gt; errors everywhere&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: &lt;code&gt;cuda-memory-fraction: 0.4&lt;/code&gt; is your best friend&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: In GPU sharing, aggressive memory limits aren't pessimistic—they're realistic.&lt;/p&gt;
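&lt;p&gt;As a sketch, that flag lives in the model container's args in the deployment manifest. The exact names below are assumptions patterned on the repo's manifests, not copied from them; adjust to your own deployment:&lt;/p&gt;

```shell
# Sketch of the relevant container args in a model-server deployment manifest.
# The 0.4 fraction caps each model at 40% of GPU memory so two can coexist.
printf '%s\n' \
  '        args:' \
  '          - "--model-id=YOUR_MODEL"        # placeholder' \
  '          - "--cuda-memory-fraction"' \
  '          - "0.4"' \
| tee args-fragment.yaml
```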




&lt;h3&gt;
  
  
  3. Time-Slicing ≠ Magic Performance Multiplier
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Marketing says&lt;/strong&gt;: "Share one GPU across multiple workloads!"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reality says&lt;/strong&gt;: "Share one GPU across multiple workloads... but not at full speed concurrently"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The truth&lt;/strong&gt;: Time-slicing provides isolation, not performance multiplication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mental model&lt;/strong&gt;: Think of it like time-sharing a CPU, not adding more cores.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Test Sequential Before Assuming Concurrent
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;My mistake&lt;/strong&gt;: Assumed concurrent workloads would work "well enough"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The numbers&lt;/strong&gt;: 50-100% performance degradation&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The learning&lt;/strong&gt;: Always measure YOUR workloads with YOUR patterns&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pro tip&lt;/strong&gt;: Use Kubernetes scaling to isolate test scenarios cleanly&lt;/p&gt;
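&lt;p&gt;Concretely, that isolation can be as simple as scaling one deployment to zero while you baseline the other. The deployment name below is an assumption patterned on the deploy step; substitute your own:&lt;/p&gt;

```shell
# Scale one model to zero for a clean single-model baseline,
# then restore it before the concurrent run (names are assumptions).
NS=llm-testing
kubectl scale deployment deepseek-r1 --replicas=0 -n "$NS" || true
# ...run the single-model baseline tests here...
kubectl scale deployment deepseek-r1 --replicas=1 -n "$NS" || true
```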




&lt;h3&gt;
  
  
  5. Production ≠ Development (Obvious, But...)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Development&lt;/strong&gt;: Time-slicing is perfect&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost savings? Yes ✅&lt;/li&gt;
&lt;li&gt;Performance trade-offs? Acceptable ✅
&lt;/li&gt;
&lt;li&gt;Stability? Excellent ✅&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Production&lt;/strong&gt;: Time-slicing is risky&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SLA requirements? Violated ❌&lt;/li&gt;
&lt;li&gt;Unpredictable performance? Dangerous ❌&lt;/li&gt;
&lt;li&gt;Customer experience? Compromised ❌&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The rule&lt;/strong&gt;: If it touches paying customers, provision separate GPUs.&lt;/p&gt;
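&lt;p&gt;In practice that means pinning production inference to its own GPU node and requesting the whole card. A sketch of the relevant pod-spec fragment; the &lt;code&gt;gpu-prod&lt;/code&gt; label is an assumption (the cluster config above uses &lt;code&gt;eks-node: gpu&lt;/code&gt;):&lt;/p&gt;

```shell
# Pod-spec fragment: dedicate a node and a full GPU to production inference
printf '%s\n' \
  'nodeSelector:' \
  '  eks-node: gpu-prod      # assumed label for a dedicated node group' \
  'resources:' \
  '  limits:' \
  '    nvidia.com/gpu: 1     # a whole physical GPU, no time-slicing' \
| tee dedicated-gpu-fragment.yaml
```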




&lt;h2&gt;
  
  
  🎬 The Verdict - Should You Use Time-Slicing?
&lt;/h2&gt;

&lt;p&gt;After a week of testing, thousands of inference requests, and countless hours of analysis, here's my honest take:&lt;/p&gt;

&lt;h3&gt;
  
  
  ✅ Time-Slicing Is Brilliant For:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Development environments&lt;/strong&gt; where cost matters more than peak performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential workloads&lt;/strong&gt; with natural time-shifting patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A/B testing&lt;/strong&gt; where models don't compete simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;POC/Demo environments&lt;/strong&gt; with flexible requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning and experimentation&lt;/strong&gt; without breaking the bank&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ROI&lt;/strong&gt;: 50% cost savings at ~99% of baseline performance (for sequential workloads) ✅&lt;/p&gt;




&lt;h3&gt;
  
  
  ❌ Time-Slicing Is Terrible For:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production inference&lt;/strong&gt; serving customer traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrent workloads&lt;/strong&gt; with strict SLA requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency-sensitive applications&lt;/strong&gt; where milliseconds matter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Revenue-generating systems&lt;/strong&gt; where performance = money&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-scaling workloads&lt;/strong&gt; with unpredictable patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Risk&lt;/strong&gt;: 50-100% performance degradation = unhappy customers ❌&lt;/p&gt;




&lt;h3&gt;
  
  
  The Technology Itself? 🏆 A+ Engineering
&lt;/h3&gt;

&lt;p&gt;NVIDIA absolutely crushed the implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only ~1% overhead from the time-slicing mechanism itself&lt;/li&gt;
&lt;li&gt;Rock-solid stability (zero crashes in extensive testing)&lt;/li&gt;
&lt;li&gt;Clean Kubernetes integration&lt;/li&gt;
&lt;li&gt;Production-grade reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The performance degradation comes from physics, not technology.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can't cheat the fundamental limitations of shared resources. Time-slicing doesn't create more GPU compute—it manages access to existing compute.&lt;/p&gt;
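&lt;p&gt;A toy round-robin model makes the point: with N workloads sharing one GPU, each sees roughly N times its solo latency (ignoring the ~1% slicing overhead). Using the baseline number from the report above:&lt;/p&gt;

```shell
# Toy model: expected latency when two workloads round-robin one GPU
solo_ms=609                             # solo baseline from the report above
workloads=2
expected_ms=$(( solo_ms * workloads ))
echo "expected concurrent latency: ${expected_ms} ms (measured above: 1227 ms)"
```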




&lt;h2&gt;
  
  
  🚀 Your Next Steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  If You're Convinced (Dev/Test Use Case):
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;⭐ &lt;strong&gt;Star the repo&lt;/strong&gt;: &lt;a href="https://github.com/AbrahamArellano/eks-shared-gpu-ai-performance" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔧 &lt;strong&gt;Follow the Quick Start&lt;/strong&gt;: 30 minutes to working setup&lt;/li&gt;
&lt;li&gt;📊 &lt;strong&gt;Run your own tests&lt;/strong&gt;: Measure YOUR workloads&lt;/li&gt;
&lt;li&gt;💰 &lt;strong&gt;Calculate YOUR ROI&lt;/strong&gt;: Use the decision framework&lt;/li&gt;
&lt;li&gt;🎉 &lt;strong&gt;Deploy and save money&lt;/strong&gt;: Start with dev environments&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  If You're Skeptical (Production Use Case):
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;✅ &lt;strong&gt;Provision separate GPUs&lt;/strong&gt;: Safety first&lt;/li&gt;
&lt;li&gt;🧪 &lt;strong&gt;Test time-slicing in staging&lt;/strong&gt;: Validate with real traffic patterns&lt;/li&gt;
&lt;li&gt;📈 &lt;strong&gt;Monitor overlap patterns&lt;/strong&gt;: Measure actual concurrent load&lt;/li&gt;
&lt;li&gt;🤔 &lt;strong&gt;Reconsider for off-peak&lt;/strong&gt;: Maybe time-slice during low-traffic hours?&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  If You're Curious (Learning Mode):
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;📖 &lt;strong&gt;Read the full guide&lt;/strong&gt;: &lt;a href="https://myitbasics.com/gpu-sharing-amazon-eks" rel="noopener noreferrer"&gt;Complete blog post&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🎓 &lt;strong&gt;Understand the concepts&lt;/strong&gt;: Time-slicing vs MIG vs MPS&lt;/li&gt;
&lt;li&gt;🛠️ &lt;strong&gt;Experiment safely&lt;/strong&gt;: Use the provided test framework&lt;/li&gt;
&lt;li&gt;💬 &lt;strong&gt;Share your findings&lt;/strong&gt;: Comment below with your results&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  📚 Complete Resource Library
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Code &amp;amp; Configuration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;📦 &lt;strong&gt;GitHub Repository&lt;/strong&gt;: &lt;a href="https://github.com/AbrahamArellano/eks-shared-gpu-ai-performance" rel="noopener noreferrer"&gt;eks-shared-gpu-ai-performance&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Complete Kubernetes manifests&lt;/li&gt;
&lt;li&gt;Automated testing framework&lt;/li&gt;
&lt;li&gt;Performance analysis scripts&lt;/li&gt;
&lt;li&gt;Troubleshooting guides&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Deep Dive Content
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;📝 &lt;strong&gt;Full Technical Analysis&lt;/strong&gt;: &lt;a href="https://myitbasics.com/gpu-sharing-amazon-eks" rel="noopener noreferrer"&gt;MyITBasics.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🏗️ &lt;strong&gt;Architecture Patterns&lt;/strong&gt;: Complete infrastructure setup guide&lt;/li&gt;
&lt;li&gt;🔍 &lt;strong&gt;Performance Analysis&lt;/strong&gt;: Detailed metrics and methodology&lt;/li&gt;
&lt;li&gt;💡 &lt;strong&gt;Best Practices&lt;/strong&gt;: Production-ready recommendations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  💬 Let's Discuss - Your Turn!
&lt;/h2&gt;

&lt;p&gt;I've shared my findings. Now I want to hear yours:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💭 Questions for the community:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have you used GPU time-slicing in production? What was your experience?&lt;/li&gt;
&lt;li&gt;What workload patterns are you trying to optimize?&lt;/li&gt;
&lt;li&gt;Any other GPU sharing strategies you've found effective?&lt;/li&gt;
&lt;li&gt;Found bugs or improvements in my testing methodology?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🐛 Found an issue in the code?&lt;/strong&gt;&lt;br&gt;
Open an issue or PR on &lt;a href="https://github.com/AbrahamArellano/eks-shared-gpu-ai-performance" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💡 Want to discuss your specific use case?&lt;/strong&gt;&lt;br&gt;
Drop a comment below—I read and respond to all of them!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📧 Need consulting help?&lt;/strong&gt;&lt;br&gt;
Visit &lt;a href="https://myitbasics.com" rel="noopener noreferrer"&gt;MyITBasics.com&lt;/a&gt; for architecture guidance&lt;/p&gt;




&lt;h2&gt;
  
  
  🙏 Thanks for Reading!
&lt;/h2&gt;

&lt;p&gt;If you found this helpful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⭐ &lt;strong&gt;Star the GitHub repo&lt;/strong&gt; to bookmark for later&lt;/li&gt;
&lt;li&gt;💬 &lt;strong&gt;Comment below&lt;/strong&gt; with your experiences or questions&lt;/li&gt;
&lt;li&gt;🔄 &lt;strong&gt;Share this post&lt;/strong&gt; with your team&lt;/li&gt;
&lt;li&gt;👤 &lt;strong&gt;Follow me&lt;/strong&gt; for more deep-dives into GPU architecture, AI infrastructure, and cloud-native engineering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Coming up next&lt;/strong&gt;: Multi-GPU strategies, MIG vs time-slicing comparison, and cost optimization techniques for production AI workloads.&lt;/p&gt;

&lt;p&gt;Stay tuned! 🚀&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with curiosity, tested with rigor, shared with the community.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;— Abraham Arellano&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Cloud Architect &amp;amp; AI Infrastructure Engineer&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://myitbasics.com" rel="noopener noreferrer"&gt;MyITBasics.com&lt;/a&gt; | &lt;a href="https://github.com/AbrahamArellano" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>gpu</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Clinical AI Engineering: Building Production-Ready Healthcare NLP Infrastructure</title>
      <dc:creator>Abraham Arellano Tavara</dc:creator>
      <pubDate>Sun, 14 Sep 2025 20:11:24 +0000</pubDate>
      <link>https://dev.to/abraham_arellanotavara_7/clinical-ai-engineering-building-production-ready-healthcare-nlp-infrastructure-2g54</link>
      <guid>https://dev.to/abraham_arellanotavara_7/clinical-ai-engineering-building-production-ready-healthcare-nlp-infrastructure-2g54</guid>
      <description>&lt;p&gt;Ever wondered what happens when you try to reproduce a healthcare AI research paper? We discovered that you end up building significantly more infrastructure than initially expected! &lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge: Research vs. Reality
&lt;/h2&gt;

&lt;p&gt;My colleague &lt;a href="https://github.com/umeshkumar235" rel="noopener noreferrer"&gt;Umesh Kumar&lt;/a&gt; and I set out to reproduce &lt;a href="https://arxiv.org/abs/2302.08091" rel="noopener noreferrer"&gt;"Do We Still Need Clinical Language Models?"&lt;/a&gt; for our UIUC Master's course Deep Learning for Healthcare. What started as a simple validation project turned into a deep dive into &lt;strong&gt;production-ready healthcare NLP infrastructure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The core question seemed straightforward:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Do specialized clinical models (BioClinicalBERT) still outperform general models (RoBERTa, T5) on medical NLP tasks?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But implementing a system to reliably answer this across &lt;strong&gt;3 clinical tasks&lt;/strong&gt;, &lt;strong&gt;multiple model architectures&lt;/strong&gt;, and &lt;strong&gt;25,000+ text samples&lt;/strong&gt; revealed the massive gap between research papers and production systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Built 🏗️
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Clinical NLP Battleground
&lt;/h3&gt;

&lt;p&gt;We evaluated models across three real-world healthcare tasks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Challenge&lt;/th&gt;
&lt;th&gt;Real-World Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MedNLI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Medical reasoning&lt;/td&gt;
&lt;td&gt;Clinical decision support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RadQA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Information extraction&lt;/td&gt;
&lt;td&gt;Finding answers in medical records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CLIP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-label classification&lt;/td&gt;
&lt;td&gt;Routing patient communications&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fer1vigwuesf6hy9b454g.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fer1vigwuesf6hy9b454g.webp" alt="Clinical NLP Data Pipeline Architecture" width="800" height="698"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Infrastructure Reality Check
&lt;/h3&gt;

&lt;p&gt;Here's what the papers don't tell you about building clinical NLP systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PhysioNet credentialing&lt;/strong&gt; for each dataset (regulatory compliance is real!)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory management&lt;/strong&gt; across different model architectures &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic batch sizing&lt;/strong&gt; to prevent OOM crashes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixed precision training&lt;/strong&gt; on Tesla T4 GPUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration management&lt;/strong&gt; for systematic hyperparameter exploration&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Findings That Matter 📊
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Fine-Tuning Still Wins (By A Lot)
&lt;/h3&gt;

&lt;p&gt;BioClinicalBERT Performance:&lt;br&gt;
├── Fine-tuned: 0.793 accuracy (MedNLI)&lt;br&gt;
└── In-Context Learning: 0.374 accuracy&lt;/p&gt;

&lt;p&gt;The hype around prompt-based learning? &lt;strong&gt;Our findings suggest it needs more development for clinical tasks.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Task-Specific Model Selection
&lt;/h3&gt;

&lt;p&gt;Models that performed excellently on medical reasoning didn't automatically excel at information extraction. &lt;strong&gt;One size doesn't fit all&lt;/strong&gt; in healthcare AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Production Efficiency Insights
&lt;/h3&gt;

&lt;p&gt;Clinical models like BioClinicalBERT needed &lt;strong&gt;fewer training epochs&lt;/strong&gt; to reach optimal performance compared to adapted general models. This translates to real cost savings in production!&lt;/p&gt;

&lt;h2&gt;
  
  
  The Engineering Deep Dive 🔧
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Modular Architecture That Actually Works
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Clean separation of concerns
&lt;/span&gt;&lt;span class="n"&gt;clinical_tasks&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;mednli&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;          &lt;span class="c1"&gt;# Medical reasoning
&lt;/span&gt;&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;radqa&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;           &lt;span class="c1"&gt;# Question answering  
&lt;/span&gt;&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="n"&gt;clip&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;            &lt;span class="c1"&gt;# Multi-label classification
&lt;/span&gt;&lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;          &lt;span class="c1"&gt;# Common infrastructure
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuration-Driven Everything
&lt;/h3&gt;

&lt;p&gt;YAML configs that handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model-specific parameters&lt;/li&gt;
&lt;li&gt;Task-specific preprocessing&lt;/li&gt;
&lt;li&gt;Environment-aware resource management&lt;/li&gt;
&lt;li&gt;Automatic batch size adjustment&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Error Handling for the Real World
&lt;/h3&gt;

&lt;p&gt;Because healthcare AI can't just crash when it hits an edge case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Graceful OOM recovery&lt;/li&gt;
&lt;li&gt;Comprehensive logging&lt;/li&gt;
&lt;li&gt;Resource monitoring&lt;/li&gt;
&lt;li&gt;Validation safeguards&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Matters for Healthcare AI 🎯
&lt;/h2&gt;

&lt;p&gt;This isn't just another research reproduction. We're talking about:&lt;br&gt;
✅ Reproducible research infrastructure that others can build on&lt;br&gt;
✅ Production-ready patterns for healthcare AI teams&lt;br&gt;
✅ Open-source implementation advancing the community&lt;br&gt;
✅ Regulatory-compliant data handling approaches&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Specialized clinical models still matter. General models aren't ready to replace domain-specific healthcare AI, especially when accuracy can impact patient care.&lt;/p&gt;

&lt;p&gt;But more importantly: the gap between research and production in healthcare AI is huge. Building bridges requires thinking about infrastructure, compliance, efficiency, and maintainability from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Want the Full Technical Deep Dive?
&lt;/h2&gt;

&lt;p&gt;I've written a comprehensive breakdown covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detailed architecture decisions&lt;/li&gt;
&lt;li&gt;Performance benchmarking across all models&lt;/li&gt;
&lt;li&gt;Computational efficiency analysis
&lt;/li&gt;
&lt;li&gt;Production deployment guidance&lt;/li&gt;
&lt;li&gt;Complete open-source implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;👉 &lt;a href="https://myitbasics.com/clinical-ai-engineering-building-production-ready-healthcare-nlp-infrastructure/" rel="noopener noreferrer"&gt;Read the full article: Clinical AI Engineering - Building Production-Ready Healthcare NLP Infrastructure&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔗 &lt;a href="https://github.com/AbrahamArellano/UIUC-DL4H-Clinical-LLM-Evaluation" rel="noopener noreferrer"&gt;Check out the complete implementation on GitHub&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What's your experience with healthcare AI in production? Have you faced similar challenges bridging research and deployment? Drop your thoughts in the comments! 👇&lt;/p&gt;

&lt;h1&gt;
  
  
  #HealthcareAI #ClinicalNLP #MachineLearning #ProductionAI
&lt;/h1&gt;

</description>
      <category>machinelearning</category>
      <category>healthcare</category>
      <category>python</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>How to build a Multi-Agent Financial Intelligence with AWS and SAP</title>
      <dc:creator>Abraham Arellano Tavara</dc:creator>
      <pubDate>Fri, 24 Jan 2025 19:31:43 +0000</pubDate>
      <link>https://dev.to/abraham_arellanotavara_7/how-to-build-a-multi-agent-financial-intelligence-with-aws-and-sap-1ic9</link>
      <guid>https://dev.to/abraham_arellanotavara_7/how-to-build-a-multi-agent-financial-intelligence-with-aws-and-sap-1ic9</guid>
      <description>&lt;p&gt;Three days. That's what it took to build a sophisticated financial intelligence demo orchestrating three specialized MCP servers using AWS Strands and SAP Generative AI Hub. The result? &lt;a href="https://myitbasics.com/how-build-agent-orchestration-ai-systems-aws-sap/" rel="noopener noreferrer"&gt;A complete demo for SAP TechEd&lt;/a&gt; to showcase a &lt;strong&gt;30% potential reduction in financial analysis time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Not because building agentic systems is trivial, but because integrating AWS and SAP's generative AI stacks with the right architectural decisions makes complex demo scenarios tractable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge: Demonstrating Enterprise AI Integration
&lt;/h2&gt;

&lt;p&gt;Most AI agent tutorials showcase simple, single-tool agents. But demonstrating enterprise-grade AWS and SAP integration requires more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multiple data sources&lt;/strong&gt; requiring specialized processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-system coordination&lt;/strong&gt; without hardcoded workflows
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-grade patterns&lt;/strong&gt; and governance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observable, maintainable&lt;/strong&gt; architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When creating our Devtoberfest session on building multi-tool research agents, we wanted to demonstrate &lt;em&gt;real&lt;/em&gt; enterprise integration patterns—showcasing how SAP's Generative AI Hub connects with AWS Bedrock through the AWS Strands SDK.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Foundation: Research Agent with AWS Strands
&lt;/h2&gt;

&lt;p&gt;We started with a deep research agent demo using the &lt;a href="https://aws.amazon.com/blogs/opensource/introducing-strands-agents-an-open-source-ai-agents-sdk/" rel="noopener noreferrer"&gt;AWS Strands Agents SDK&lt;/a&gt; and Tavily API for web intelligence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time_range&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Search the web and return ranked results&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tavily_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;time_range&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;time_range&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;format_search_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;  
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;web_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract full page content from URLs&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tavily_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;web_crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Crawl websites and discover nested links&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tavily_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;instructions&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create the agent
&lt;/span&gt;&lt;span class="n"&gt;deep_researcher_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bedrock_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;RESEARCH_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;web_extract&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;web_crawl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format_research_response&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What makes AWS Strands different?&lt;/strong&gt; It's model-driven, not workflow-driven. You provide tools and a system prompt—the LLM handles planning, reasoning, and orchestration. This shifts complexity from code into the model's weights.&lt;/p&gt;

&lt;h3&gt;
  
  
  Built-in Production Observability
&lt;/h3&gt;

&lt;p&gt;AWS Strands automatically tracks critical metrics using OpenTelemetry:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric Category&lt;/th&gt;
&lt;th&gt;What It Tracks&lt;/th&gt;
&lt;th&gt;Demo Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Token Usage&lt;/td&gt;
&lt;td&gt;Input/output/total tokens&lt;/td&gt;
&lt;td&gt;Cost estimation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Latency and execution times&lt;/td&gt;
&lt;td&gt;Benchmark tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool Usage&lt;/td&gt;
&lt;td&gt;Call counts and success rates&lt;/td&gt;
&lt;td&gt;Reliability assessment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event Loops&lt;/td&gt;
&lt;td&gt;Reasoning cycles&lt;/td&gt;
&lt;td&gt;Efficiency analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This integrates seamlessly with AWS X-Ray and CloudWatch for enterprise observability patterns.&lt;/p&gt;
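
&lt;p&gt;The token counts Strands records map directly onto cost estimation. The sketch below is plain arithmetic over a usage dict, not Strands API code—the key names and per-1K-token prices are illustrative assumptions, so substitute your model's actual pricing.&lt;/p&gt;

```python
def estimate_cost(usage, price_per_1k_input, price_per_1k_output):
    """Estimate model spend in USD from tracked token counts.

    usage is a dict like {"inputTokens": ..., "outputTokens": ...};
    key names and prices here are illustrative assumptions.
    """
    return (usage["inputTokens"] / 1000 * price_per_1k_input
            + usage["outputTokens"] / 1000 * price_per_1k_output)

# e.g. a request that consumed 12K input / 3K output tokens:
cost = estimate_cost({"inputTokens": 12000, "outputTokens": 3000},
                     price_per_1k_input=0.003, price_per_1k_output=0.015)
```

&lt;p&gt;Feeding these per-request estimates into CloudWatch custom metrics gives you a running cost dashboard alongside latency and tool-usage data.&lt;/p&gt;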

&lt;h2&gt;
  
  
  The Innovation: Multi-Server Financial Intelligence Demo
&lt;/h2&gt;

&lt;p&gt;Our demo showcases financial analysis requiring coordination of multiple specialized systems. That's where &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; becomes critical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding MCP: The USB-C for AI
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.anthropic.com/news/model-context-protocol" rel="noopener noreferrer"&gt;Anthropic open-sourced MCP in November 2024&lt;/a&gt; to solve the "N×M problem"—every model needing connectors to every data source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP provides a universal standard&lt;/strong&gt;: One protocol, any model, any data source. Major providers including OpenAI and Google DeepMind adopted it within months.&lt;/p&gt;

&lt;p&gt;The protocol uses JSON-RPC 2.0 with three primitives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt;: Executable functions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt;: Structured data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts&lt;/strong&gt;: Instruction templates&lt;/li&gt;
&lt;/ul&gt;
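
&lt;p&gt;On the wire, this is ordinary JSON-RPC 2.0. A &lt;code&gt;tools/call&lt;/code&gt; exchange might look like the pair below—the tool name, arguments, and quote text are illustrative placeholders, not part of the spec:&lt;/p&gt;

```python
import json

# Illustrative tools/call request an MCP client sends (JSON-RPC 2.0).
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_stock_quote",        # a tool advertised by the server
        "arguments": {"symbol": "SAP"},
    },
}

# A matching success response from the server, correlated by id.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "SAP: 234.10 EUR"}]},
}

wire = json.dumps(request)  # what actually travels over HTTP or stdio
```

&lt;p&gt;Because every server speaks this same envelope, the client needs exactly one connector regardless of how many data sources sit behind it—which is the whole point of the "USB-C" analogy.&lt;/p&gt;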

&lt;h2&gt;
  
  
  Architecture Overview: How Everything Connects
&lt;/h2&gt;

&lt;p&gt;Here's the complete system architecture showing how AWS Strands orchestrates multiple MCP servers through SAP GenAI Hub:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fok7r0lejwwpnkuwsmuem.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fok7r0lejwwpnkuwsmuem.webp" alt="Financial Intelligence MCP Agent Architecture" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Walking Through the Architecture (4 Key Stages)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Enterprise User Request&lt;/strong&gt;&lt;br&gt;
Enterprise users interact with the AWS Strands Agent through SAP GenAI Hub, which provides the secure gateway to Anthropic's Claude models via Amazon Bedrock.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: AI Agent Orchestration&lt;/strong&gt;&lt;br&gt;
The AWS Strands SDK handles multi-tool coordination. The MCP Client within Strands manages all communications with downstream servers, reasoning about which tools to invoke and when.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3: MCP Protocol Communications&lt;/strong&gt;&lt;br&gt;
The MCP Session Manager maintains persistent connections to all three specialized servers, aggregating 10+ financial tools into a unified interface. This eliminates connection overhead and provides seamless cross-server coordination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 4: Orchestrated Results&lt;/strong&gt;&lt;br&gt;
The system synthesizes data from all servers to produce comprehensive outputs: investment analysis reports, risk assessment matrices, sentiment analysis, and cross-server coordination reports.&lt;/p&gt;
&lt;h3&gt;
  
  
  Three Specialized MCP Servers (Demo Architecture)
&lt;/h3&gt;

&lt;p&gt;We built three demo servers, each handling distinct financial intelligence capabilities:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Server&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Financial Data&lt;/td&gt;
&lt;td&gt;8001&lt;/td&gt;
&lt;td&gt;FastAPI (Manual)&lt;/td&gt;
&lt;td&gt;Real-time market data&lt;/td&gt;
&lt;td&gt;Stock quotes, fundamentals, health scoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document Analysis&lt;/td&gt;
&lt;td&gt;8002&lt;/td&gt;
&lt;td&gt;FastMCP Framework&lt;/td&gt;
&lt;td&gt;Sentiment analysis&lt;/td&gt;
&lt;td&gt;PDF parsing, report analysis, metric extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analytics&lt;/td&gt;
&lt;td&gt;8003&lt;/td&gt;
&lt;td&gt;FastMCP Framework&lt;/td&gt;
&lt;td&gt;Advanced analytics&lt;/td&gt;
&lt;td&gt;Comparison charts, risk assessment, trend analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;
  
  
  Why Two Approaches?
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;FastAPI (Manual Implementation)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full control over JSON-RPC protocol&lt;/li&gt;
&lt;li&gt;~150-200 lines for basic server&lt;/li&gt;
&lt;li&gt;Deep MCP understanding required&lt;/li&gt;
&lt;li&gt;Best for learning fundamentals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;FastMCP Framework&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic protocol handling&lt;/li&gt;
&lt;li&gt;~50-75 lines for basic server
&lt;/li&gt;
&lt;li&gt;3-4x faster development&lt;/li&gt;
&lt;li&gt;Production-ready features built-in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both approaches demonstrate viable patterns. Your choice depends on control vs. velocity requirements.&lt;/p&gt;
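
&lt;p&gt;The "manual" route boils down to writing the JSON-RPC dispatcher yourself. Here is a pure-Python sketch of that core (in the demo it would sit behind a FastAPI POST endpoint; the tool name and registry are placeholders)—exactly the plumbing FastMCP generates for you:&lt;/p&gt;

```python
# Hypothetical tool registry; a real server would register actual handlers.
TOOLS = {
    "get_stock_quote": lambda args: {"symbol": args["symbol"], "price": 234.10},
}

def handle_jsonrpc(msg):
    """Dispatch one JSON-RPC 2.0 message to a registered tool.

    Minimal sketch of what a manual MCP endpoint does by hand:
    method routing, tool lookup, and standard error codes.
    """
    if msg.get("method") != "tools/call":
        return {"jsonrpc": "2.0", "id": msg.get("id"),
                "error": {"code": -32601, "message": "method not found"}}
    params = msg.get("params", {})
    tool = TOOLS.get(params.get("name"))
    if tool is None:
        return {"jsonrpc": "2.0", "id": msg.get("id"),
                "error": {"code": -32602, "message": "unknown tool"}}
    return {"jsonrpc": "2.0", "id": msg.get("id"),
            "result": tool(params.get("arguments", {}))}
```

&lt;p&gt;Multiply this by schema generation, error handling, and transport management, and the ~150-200 line count for a manual server is easy to reach.&lt;/p&gt;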

&lt;p&gt;Here's a FastMCP server example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;

&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document-analysis-server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_financial_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Analyze financial text for sentiment and insights&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;positive_keywords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;growth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;profit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;strong&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;improved&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;negative_keywords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;decline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;loss&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weak&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reduced&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Sentiment analysis logic
&lt;/span&gt;    &lt;span class="n"&gt;sentiment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_sentiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;positive_keywords&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;negative_keywords&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key_findings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;extract_findings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;identified_risks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;identify_risks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8002&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Session Manager Pattern
&lt;/h2&gt;

&lt;p&gt;Managing connections to three MCP servers in our demo called for persistent sessions without juggling nested context managers on every call—as shown in Stage 3 of the architecture diagram.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution&lt;/strong&gt;: A custom &lt;code&gt;MCPSessionManager&lt;/code&gt; using Python's &lt;code&gt;ExitStack&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;util.mcp_session_manager&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MCPSessionManager&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize manager
&lt;/span&gt;&lt;span class="n"&gt;mcp_manager&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MCPSessionManager&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Establish persistent connections (Stage 3)
&lt;/span&gt;&lt;span class="n"&gt;mcp_manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_sessions&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;financial_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:8001/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:8002/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analytics_reporting&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:8003/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Aggregate tools from all servers
&lt;/span&gt;&lt;span class="n"&gt;all_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mcp_manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_all_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Create unified agent (Stage 2)
&lt;/span&gt;&lt;span class="n"&gt;financial_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sap_genai_hub_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;all_tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;financial_expert_prompt&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern eliminates boilerplate while demonstrating enterprise requirements: connection pooling, error recovery, and audit logging.&lt;/p&gt;
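
&lt;p&gt;The idea behind &lt;code&gt;MCPSessionManager&lt;/code&gt; can be sketched with &lt;code&gt;contextlib.ExitStack&lt;/code&gt;: enter each client's context manager once, keep the sessions alive for the agent's lifetime, and unwind them all together. The client objects below are stand-ins, not the Strands MCP client API:&lt;/p&gt;

```python
from contextlib import ExitStack

class SessionManager:
    """Keep several context-managed client sessions open at once.

    Sketch of the ExitStack pattern only; real code would enter MCP
    client instances here and aggregate their listed tools.
    """
    def __init__(self):
        self._stack = ExitStack()
        self.sessions = {}

    def start_sessions(self, clients):
        # Enter each client's context manager; ExitStack remembers every
        # exit callback so close() can unwind them in reverse order.
        for name, client in clients.items():
            self.sessions[name] = self._stack.enter_context(client)

    def close(self):
        self._stack.close()  # closes all sessions, last-opened first
```

&lt;p&gt;The payoff of &lt;code&gt;ExitStack&lt;/code&gt; is that cleanup stays correct even if one session fails mid-setup: every context entered so far is still exited.&lt;/p&gt;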

&lt;h2&gt;
  
  
  Demo Results: AWS + SAP Integration in Action
&lt;/h2&gt;

&lt;p&gt;Following the architecture flow from Stage 1 → Stage 4, when a user asks: &lt;em&gt;"Provide comprehensive investment analysis for SAP"&lt;/em&gt;, the agent automatically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fetches stock data&lt;/strong&gt; (Financial Server) → Current metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyzes sentiment&lt;/strong&gt; (Document Server) → Report assessment
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calculates risk&lt;/strong&gt; (Analytics Server) → Investment scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesizes report&lt;/strong&gt; (Stage 4 Outputs) → Executive-ready recommendation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;No explicit orchestration&lt;/strong&gt;. No hardcoded workflows. The agent reasons about tool usage and coordinates automatically across all three MCP servers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Demo Performance Metrics
&lt;/h3&gt;

&lt;p&gt;From our Devtoberfest proof-of-concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;30% potential reduction&lt;/strong&gt; in comprehensive financial analysis time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10-20% efficiency gains&lt;/strong&gt; demonstrated for individual stock analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic metrics tracking&lt;/strong&gt; via AWS Strands observability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-ready monitoring patterns&lt;/strong&gt; through CloudWatch integration&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Enterprise Security: SAP GenAI Hub Integration (Stage 1)
&lt;/h2&gt;

&lt;p&gt;The demo showcases how &lt;a href="https://aws.amazon.com/blogs/awsforsap/power-your-business-with-secure-and-scalable-generative-ai-services-from-aws-and-sap/" rel="noopener noreferrer"&gt;SAP Generative AI Hub&lt;/a&gt; provides critical governance when integrating with AWS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Content filtering&lt;/strong&gt; on inputs and outputs&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Data masking&lt;/strong&gt; for sensitive information
&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Centralized policies&lt;/strong&gt; across SAP ecosystem&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Compliance support&lt;/strong&gt; for regulatory requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Hub orchestrates access to Amazon Bedrock models (Claude 3.5, Titan) while maintaining security boundaries essential for enterprise deployments—all happening at Stage 1 of our architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use This Architecture Pattern
&lt;/h2&gt;

&lt;p&gt;This demo architecture excels when you need to:&lt;/p&gt;

&lt;p&gt;✅ Coordinate 3+ specialized systems or data sources&lt;br&gt;
✅ Prototype rapidly with a clear path to production&lt;br&gt;
✅ Favor model-driven flexibility over explicit workflows&lt;br&gt;
✅ Adopt standard protocols (MCP) for future extensibility&lt;br&gt;
✅ Get built-in observability for production monitoring&lt;br&gt;
✅ Enforce enterprise security and governance&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next: From Demo to Production
&lt;/h2&gt;

&lt;p&gt;The demo system showcases integration possibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SAP Integration&lt;/strong&gt;: Connect MCP servers to SAP business processes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Tenant Deployments&lt;/strong&gt;: Shared MCP infrastructure for multiple organizations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Architectures&lt;/strong&gt;: On-premises SAP + cloud-native AI services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain-Specific Agents&lt;/strong&gt;: Specialized agents for procurement, finance, HR&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Both notebooks are available in our &lt;a href="https://github.com/AbrahamArellano/sample-sap-genai-hub-bedrock" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;. The progression from research agent to multi-server orchestration provides a practical learning path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start Simple&lt;/strong&gt;: Build single-agent systems first&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learn MCP&lt;/strong&gt;: Understand the protocol fundamentals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale Thoughtfully&lt;/strong&gt;: Use frameworks and patterns for production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure by Design&lt;/strong&gt;: Implement proper auth, audit, monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observe Everything&lt;/strong&gt;: Leverage built-in observability&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Full Technical Deep-Dive + Video Tutorial
&lt;/h2&gt;

&lt;p&gt;Want the complete implementation with detailed architecture walkthroughs, video tutorial, and production deployment guidance?&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://myitbasics.com/how-build-agent-orchestration-ai-systems-aws-sap/" rel="noopener noreferrer"&gt;Watch the video tutorial and read the full guide on MyITBasics&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step-by-step video tutorial&lt;/strong&gt; walking through the entire demo&lt;/li&gt;
&lt;li&gt;Detailed MCP protocol implementation&lt;/li&gt;
&lt;li&gt;AWS and SAP integration patterns&lt;/li&gt;
&lt;li&gt;High-resolution architecture diagrams&lt;/li&gt;
&lt;li&gt;Cost analysis and ROI calculations
&lt;/li&gt;
&lt;li&gt;AgentCore platform integration&lt;/li&gt;
&lt;li&gt;Enterprise architecture considerations&lt;/li&gt;
&lt;li&gt;Complete code samples and notebooks&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Discussion Questions
&lt;/h2&gt;

&lt;p&gt;I'd love to hear your experiences:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;What challenges have you faced orchestrating multiple AI agents?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How do you approach AWS and SAP GenAI integration in your projects?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What's your strategy for securing enterprise AI integrations?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Abraham Arellano Tavara | Senior Solutions Architect, AWS Munich | &lt;a href="https://www.linkedin.com/in/abraham-arellano-tavara/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  #AWS #SAP #AIAgents #EnterpriseAI
&lt;/h1&gt;

</description>
      <category>aws</category>
      <category>sap</category>
      <category>ai</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
