<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: krishnakanth eswaran</title>
    <description>The latest articles on DEV Community by krishnakanth eswaran (@krishnakanth_eswaran_6000).</description>
    <link>https://dev.to/krishnakanth_eswaran_6000</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3906570%2F818daf3e-08f0-4952-8ae9-31fcdc31f355.png</url>
      <title>DEV Community: krishnakanth eswaran</title>
      <link>https://dev.to/krishnakanth_eswaran_6000</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/krishnakanth_eswaran_6000"/>
    <language>en</language>
    <item>
      <title>Zero-Downtime ECS to EKS Migration: Orchestrating a 6-Team Production Cutover at Scale</title>
      <dc:creator>krishnakanth eswaran</dc:creator>
      <pubDate>Thu, 30 Apr 2026 18:45:10 +0000</pubDate>
      <link>https://dev.to/krishnakanth_eswaran_6000/zero-downtime-ecs-eks-migration-orchestrating-a-6-team-production-cutover-at-scale-1pe6</link>
      <guid>https://dev.to/krishnakanth_eswaran_6000/zero-downtime-ecs-eks-migration-orchestrating-a-6-team-production-cutover-at-scale-1pe6</guid>
      <description>&lt;p&gt;Task at hand: Migrating Live Healthcare Services Without Dropping a Single Request&lt;/p&gt;

&lt;p&gt;When you're processing healthcare revenue cycle transactions worth millions of dollars daily, downtime isn't just inconvenient: it is financially catastrophic and can affect patient care. This is the story of how we migrated 15+ microservices from AWS ECS to EKS across 6 engineering teams with zero downtime, zero rollbacks, and zero production incidents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The stakes:&lt;/strong&gt; AR Finance and Posting Modernisation services handling real-time remittance processing for U.S. healthcare providers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The constraint:&lt;/strong&gt; Absolute zero tolerance for downtime or data loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The scope:&lt;/strong&gt; Domain-wide cutover coordinating Rules Core, Payment Processing, Reconciliation, Analytics, Data Pipeline, and Platform teams.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why We Migrated: ECS Limitations at Scale
&lt;/h2&gt;

&lt;p&gt;Our ECS-based architecture was showing cracks:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Autoscaling Lag During Traffic Spikes&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ECS service autoscaling based on CloudWatch metrics had a 3-5 minute delay. During month-end processing windows, we'd see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU spike to 85%+ before scale-out triggered&lt;/li&gt;
&lt;li&gt;30-45 second P99 latencies while waiting for new tasks&lt;/li&gt;
&lt;li&gt;Manual intervention required to pre-scale services&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Resource Bin-Packing Inefficiency&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ECS task placement was leaving 20-30% of cluster capacity unused due to fragmentation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EC2 Instance: 8 vCPU, 16GB RAM
Task A: 2 vCPU, 4GB  ✓
Task B: 2 vCPU, 4GB  ✓
Task C: 6 vCPU, 6GB  ✗ (only 4 vCPU free on this instance)
→ 4 vCPU, 8GB sitting idle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Secrets Management Complexity&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We were using SSM Parameter Store with custom init containers to inject secrets, leading to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Secret rotations requiring task restarts&lt;/li&gt;
&lt;li&gt;Verbose task definitions with 50+ environment variables&lt;/li&gt;
&lt;li&gt;No audit trail for secret access&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Limited Observability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ECS metrics were service-level only. Task-level insights required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom CloudWatch dashboards&lt;/li&gt;
&lt;li&gt;X-Ray instrumentation for every service&lt;/li&gt;
&lt;li&gt;Log aggregation gymnastics across task IDs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The decision:&lt;/strong&gt; Migrate to EKS for KEDA-based event-driven autoscaling, better resource utilization, native Kubernetes secrets operators, and richer observability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture: The Before and After
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Before: ECS Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│  Application Load Balancer                      │
└──────────────┬──────────────────────────────────┘
               │
    ┌──────────┴──────────┐
    │                     │
┌───▼────────┐     ┌─────▼──────┐
│ ECS Service│     │ ECS Service│
│  (Task A)  │     │  (Task B)  │
│            │     │            │
│ SSM Params │     │ SSM Params │
└─────┬──────┘     └──────┬─────┘
      │                   │
      └─────────┬─────────┘
                │
         ┌──────▼───────┐
         │  RDS/MSK/S3  │
         └──────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  After: EKS Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│  Application Load Balancer (AWS LB Controller)  │
└──────────────┬──────────────────────────────────┘
               │
    ┌──────────┴──────────┐
    │                     │
┌───▼────────────┐  ┌────▼───────────┐
│ K8s Deployment │  │ K8s Deployment │
│   + Service    │  │   + Service    │
│                │  │                │
│ KEDA Scaler    │  │ KEDA Scaler    │
│ (SQS/Kafka)    │  │ (Prometheus)   │
│                │  │                │
│ ExternalSecret │  │ ExternalSecret │
│ (Vault sync)   │  │ (Vault sync)   │
└─────┬──────────┘  └──────┬─────────┘
      │                    │
      └──────────┬─────────┘
                 │
          ┌──────▼────────┐
          │   RDS/MSK/S3  │
          │   (IRSA auth) │
          └───────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Migration Strategy: Blue-Green at the Load Balancer
&lt;/h2&gt;

&lt;p&gt;We chose &lt;strong&gt;target group-level blue-green deployment&lt;/strong&gt; to enable instantaneous rollback:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALB
 │
 ├─► Target Group A (ECS tasks)    [90% traffic] ← Active
 │
 └─► Target Group B (EKS pods)     [10% traffic] ← Canary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Traffic shift progression:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Week 1:&lt;/strong&gt; ECS 100% → EKS 0% (deployment validation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 2:&lt;/strong&gt; ECS 90% → EKS 10% (canary with real traffic)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 3:&lt;/strong&gt; ECS 50% → EKS 50% (split validation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 4:&lt;/strong&gt; ECS 10% → EKS 90% (confidence threshold)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Week 5:&lt;/strong&gt; ECS 0% → EKS 100% (full cutover)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Rollback mechanism:&lt;/strong&gt; Single ALB rule weight change (15-second propagation) vs. hours for task/pod redeployment.&lt;/p&gt;
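&lt;p&gt;As a rough illustration of the progression and rollback above, each stage can be written as the weighted forward action that an ALB rule accepts. This is a minimal sketch, not the pipeline code used in the migration: the target group ARNs and the helper names &lt;code&gt;forward_action&lt;/code&gt; and &lt;code&gt;rollback_action&lt;/code&gt; are hypothetical, and the API call that applies the action to the rule is left out.&lt;/p&gt;

```python
# Hypothetical placeholder ARNs; substitute the real target group ARNs.
ECS_TG = "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/ecs-tg/0123456789abcdef"
EKS_TG = "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/eks-tg/fedcba9876543210"

# EKS share of traffic at each step of the progression above.
EKS_PERCENT_BY_STAGE = [0, 10, 50, 90, 100]


def forward_action(eks_percent):
    """Build a weighted forward action splitting traffic between ECS and EKS."""
    if eks_percent not in range(0, 101):
        raise ValueError("eks_percent must be an integer from 0 to 100")
    return {
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": ECS_TG, "Weight": 100 - eks_percent},
                {"TargetGroupArn": EKS_TG, "Weight": eks_percent},
            ]
        },
    }


def rollback_action():
    """Rollback is the same payload with all weight back on ECS."""
    return forward_action(0)


if __name__ == "__main__":
    for pct in EKS_PERCENT_BY_STAGE:
        groups = forward_action(pct)["ForwardConfig"]["TargetGroups"]
        weights = [tg["Weight"] for tg in groups]
        print(f"EKS {pct}%: weights ECS/EKS = {weights}")
```

&lt;p&gt;Because rollback is just &lt;code&gt;forward_action(0)&lt;/code&gt;, it amounts to one rule update rather than a redeployment.&lt;/p&gt;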




&lt;h2&gt;
  
  
  Key Technical Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. IRSA (IAM Roles for Service Accounts) for AWS Authentication
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; On ECS, each task got its IAM permissions from a task role. On EKS, pods fall back to the node's instance role by default, so we needed an equivalent pod-level IAM mechanism.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; IRSA with OIDC provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ServiceAccount with IAM role annotation&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;remittance-processor-sa&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;finance&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789:role/RemittanceProcessorRole&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Terraform: IAM role with OIDC trust&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"remittance_processor"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"RemittanceProcessorRole"&lt;/span&gt;

  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;Version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;
    &lt;span class="nx"&gt;Statement&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="nx"&gt;Effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
      &lt;span class="nx"&gt;Principal&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Federated&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_openid_connect_provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="nx"&gt;Action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sts:AssumeRoleWithWebIdentity"&lt;/span&gt;
      &lt;span class="nx"&gt;Condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;StringEquals&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"${replace(aws_iam_openid_connect_provider.eks.url, "https://", "")}:sub"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"system:serviceaccount:finance:remittance-processor-sa"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role_policy_attachment"&lt;/span&gt; &lt;span class="s2"&gt;"s3_access"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;remittance_processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;policy_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Pods automatically assume IAM roles via projected service account tokens. No static credentials in containers.&lt;/p&gt;
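&lt;p&gt;The trust condition in the Terraform above is easy to get subtly wrong, so here is a small sketch of how its key and value are assembled. It assumes Python 3.9+ (for &lt;code&gt;str.removeprefix&lt;/code&gt;); the function name &lt;code&gt;irsa_sub_condition&lt;/code&gt; and the issuer URL are hypothetical and only illustrate the string format.&lt;/p&gt;

```python
def irsa_sub_condition(issuer_url, namespace, service_account):
    """Return the StringEquals condition for an IRSA trust policy."""
    # For a well-formed issuer URL this matches the Terraform
    # replace(url, "https://", "") call: the condition key is the issuer
    # host and path without the URL scheme, followed by ":sub".
    issuer = issuer_url.removeprefix("https://")
    subject = f"system:serviceaccount:{namespace}:{service_account}"
    return {f"{issuer}:sub": subject}


if __name__ == "__main__":
    # Hypothetical issuer URL; the real one comes from the cluster's OIDC config.
    issuer_url = "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE0123456789"
    print(irsa_sub_condition(issuer_url, "finance", "remittance-processor-sa"))
```

&lt;p&gt;If the namespace or ServiceAccount name in the pod does not match this string exactly, the role assumption is denied.&lt;/p&gt;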




&lt;h3&gt;
  
  
  2. KEDA for Event-Driven Autoscaling
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; ECS autoscaling on CPU/memory only reacted after load had already built up; it could not see the queue backlog driving that load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; KEDA scalers monitoring actual workload queues:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;keda.sh/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ScaledObject&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;remittance-processor-scaler&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;finance&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;remittance-processor&lt;/span&gt;
  &lt;span class="na"&gt;minReplicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
  &lt;span class="na"&gt;pollingInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;  &lt;span class="c1"&gt;# Check queue depth every 15s&lt;/span&gt;
  &lt;span class="na"&gt;cooldownPeriod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;   &lt;span class="c1"&gt;# Wait 60s before scaling down&lt;/span&gt;
  &lt;span class="na"&gt;triggers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-sqs-queue&lt;/span&gt;
      &lt;span class="na"&gt;authenticationRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;keda-aws-credentials&lt;/span&gt;
      &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;queueURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://sqs.us-east-1.amazonaws.com/123456789/remittance-queue&lt;/span&gt;
        &lt;span class="na"&gt;queueLength&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10"&lt;/span&gt;  &lt;span class="c1"&gt;# Target 10 messages per pod&lt;/span&gt;
        &lt;span class="na"&gt;awsRegion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
        &lt;span class="na"&gt;identityOwner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;operator&lt;/span&gt;  &lt;span class="c1"&gt;# Use the KEDA operator's own IAM role (IRSA)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before (ECS):&lt;/strong&gt; 3-5 minute scale-out lag → P99 latency spikes to 30-45s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After (KEDA):&lt;/strong&gt; 15-second scale-out trigger → P99 latency stays under 5s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During month-end processing (5,000 msg/min spike), KEDA scaled from 5→42 pods in &lt;strong&gt;under 2 minutes&lt;/strong&gt; vs. 8-10 minutes with ECS.&lt;/p&gt;
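&lt;p&gt;To make the queue-based target concrete, here is a simplified sketch of the replica arithmetic implied by the ScaledObject above: backlog divided by the per-pod target, clamped to the min and max replica counts. It deliberately ignores HPA tolerance and stabilization behavior, and the backlog sizes are illustrative values, not measurements from the migration.&lt;/p&gt;

```python
import math

MIN_REPLICAS = 5      # minReplicaCount from the ScaledObject above
MAX_REPLICAS = 50     # maxReplicaCount
TARGET_PER_POD = 10   # queueLength: target messages per pod


def desired_replicas(queue_depth):
    """Replica count for a given backlog, clamped to the configured bounds."""
    wanted = math.ceil(queue_depth / TARGET_PER_POD)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, wanted))


if __name__ == "__main__":
    for depth in (0, 95, 380, 2000):  # illustrative backlog sizes
        print(f"{depth} messages: {desired_replicas(depth)} pods")
```

&lt;p&gt;The point of the sketch is that the replica count follows the backlog directly, instead of waiting for CPU to rise on already overloaded pods.&lt;/p&gt;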




&lt;h3&gt;
  
  
  3. ExternalSecrets + HashiCorp Vault
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Secrets rotation in ECS required task restarts and deployment pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; ExternalSecrets Operator syncing Vault → Kubernetes Secrets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ExternalSecret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db-credentials&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;finance&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;refreshInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1h&lt;/span&gt;  &lt;span class="c1"&gt;# Sync every hour&lt;/span&gt;
  &lt;span class="na"&gt;secretStoreRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vault-backend&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SecretStore&lt;/span&gt;
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db-credentials-secret&lt;/span&gt;
    &lt;span class="na"&gt;creationPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Owner&lt;/span&gt;
  &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;username&lt;/span&gt;
      &lt;span class="na"&gt;remoteRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;database/prod/remittance&lt;/span&gt;
        &lt;span class="na"&gt;property&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;username&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;password&lt;/span&gt;
      &lt;span class="na"&gt;remoteRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;database/prod/remittance&lt;/span&gt;
        &lt;span class="na"&gt;property&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;password&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Application consumption:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Deployment using the synced secret&lt;/span&gt;
&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DB_USERNAME&lt;/span&gt;
    &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;secretKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db-credentials-secret&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;username&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD&lt;/span&gt;
    &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;secretKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db-credentials-secret&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;password&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Vault rotates DB passwords every 30 days → ExternalSecrets syncs → Pods pick up new secrets on next restart (rolling deployment) without manual intervention.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Harness CD for Coordinated Rollouts
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Challenge:&lt;/strong&gt; 6 teams, 15+ services, different deployment schedules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Harness pipelines with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Canary stages:&lt;/strong&gt; 10% → 50% → 100% traffic shifts with automated rollback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval gates:&lt;/strong&gt; Lead SRE sign-off before production shifts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel deployments:&lt;/strong&gt; Non-dependent services deploy concurrently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure strategies:&lt;/strong&gt; Auto-rollback on P99 latency &amp;gt; 10s or error rate &amp;gt; 0.5%
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Harness canary deployment snippet&lt;/span&gt;
&lt;span class="na"&gt;stages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Canary Deployment&lt;/span&gt;
      &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;execution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;K8sCanaryDeploy&lt;/span&gt;
                &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;instanceSelection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Count&lt;/span&gt;
                    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                      &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# 1 pod canary&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;K8sCanaryDelete&lt;/span&gt;
                &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;skipDryRun&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;K8sRollingDeploy&lt;/span&gt;
                &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;skipDryRun&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Cutover Week: Hour-by-Hour Execution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Monday: Final Validation (ECS 100%, EKS 0%)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;08:00 AM:&lt;/strong&gt; Deploy all EKS services to production (no traffic)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10:00 AM:&lt;/strong&gt; Validate pod health, IRSA permissions, ExternalSecrets sync&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;12:00 PM:&lt;/strong&gt; Run smoke tests against EKS endpoints (bypassing ALB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;02:00 PM:&lt;/strong&gt; Verify KEDA scalers respond to synthetic load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;04:00 PM:&lt;/strong&gt; Go/No-Go meeting → &lt;strong&gt;GO&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tuesday: 10% Canary (ECS 90%, EKS 10%)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;12:00 AM:&lt;/strong&gt; Shift 10% ALB traffic to EKS target group&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;12:00 AM - 11:59 PM:&lt;/strong&gt; Monitor dashboards:

&lt;ul&gt;
&lt;li&gt;P50/P95/P99 latencies (CloudWatch + Prometheus)&lt;/li&gt;
&lt;li&gt;Error rates (application logs + OpenSearch)&lt;/li&gt;
&lt;li&gt;KEDA scaling events&lt;/li&gt;
&lt;li&gt;Vault secret access audit logs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

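&lt;p&gt;&lt;strong&gt;Metrics (24-hour comparison):&lt;/strong&gt;&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;ECS Baseline&lt;/th&gt;
&lt;th&gt;EKS Canary&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P99 Latency&lt;/td&gt;
&lt;td&gt;1,240ms&lt;/td&gt;
&lt;td&gt;890ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-28%&lt;/strong&gt; ✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error Rate&lt;/td&gt;
&lt;td&gt;0.12%&lt;/td&gt;
&lt;td&gt;0.09%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-25%&lt;/strong&gt; ✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Autoscale Lag&lt;/td&gt;
&lt;td&gt;185s&lt;/td&gt;
&lt;td&gt;22s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-88%&lt;/strong&gt; ✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;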
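&lt;p&gt;For readers who want to check the Delta column, the figures follow from a plain relative-change calculation. A short sketch, using only the baseline and canary values reported above (the function name &lt;code&gt;percent_change&lt;/code&gt; is just illustrative):&lt;/p&gt;

```python
def percent_change(baseline, canary):
    """Relative change from baseline to canary, rounded to a whole percent."""
    return round((canary - baseline) / baseline * 100)


# (metric, ECS baseline, EKS canary) as reported in the comparison above
ROWS = [
    ("P99 latency (ms)", 1240, 890),
    ("Error rate (%)", 0.12, 0.09),
    ("Autoscale lag (s)", 185, 22),
]

if __name__ == "__main__":
    for name, ecs, eks in ROWS:
        print(f"{name}: {percent_change(ecs, eks)}%")
```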
&lt;h3&gt;
  
  
  Wednesday-Thursday: 50% Split (ECS 50%, EKS 50%)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Observation:&lt;/strong&gt; EKS pods stabilized at a 30% lower replica count for the same throughput (better bin-packing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Impact:&lt;/strong&gt; Estimated 18% reduction in EC2 costs at full migration&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Friday: 90% Confidence (ECS 10%, EKS 90%)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Peak Load Test:&lt;/strong&gt; Month-end processing simulation (5K msgs/min)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; KEDA scaled 5→38 pods in 90 seconds, P99 stayed under 4s&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Monday Week 2: Full Cutover (ECS 0%, EKS 100%)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;08:00 AM:&lt;/strong&gt; Shift final 10% traffic to EKS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;08:30 AM:&lt;/strong&gt; ECS tasks draining (no new connections)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;09:00 AM:&lt;/strong&gt; ECS cluster scaled to 0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10:00 AM:&lt;/strong&gt; &lt;strong&gt;Migration Complete ✓&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Final Scorecard:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Downtime:&lt;/strong&gt; 0 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rollbacks:&lt;/strong&gt; 0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production Incidents:&lt;/strong&gt; 0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Loss:&lt;/strong&gt; 0 records&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. &lt;strong&gt;IRSA Trust Policy Gotchas&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We hit this error initially:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: failed to assume role: AccessDenied
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; OIDC provider thumbprint mismatch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Regenerate thumbprint after EKS cluster upgrade:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws eks describe-cluster &lt;span class="nt"&gt;--name&lt;/span&gt; prod-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"cluster.identity.oidc.issuer"&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text

&lt;span class="c"&gt;# Extract the SHA-1 thumbprint using OpenSSL&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; | openssl s_client &lt;span class="nt"&gt;-servername&lt;/span&gt; oidc.eks.us-east-1.amazonaws.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-connect&lt;/span&gt; oidc.eks.us-east-1.amazonaws.com:443 2&amp;gt;/dev/null &lt;span class="se"&gt;\&lt;/span&gt;
  | openssl x509 &lt;span class="nt"&gt;-fingerprint&lt;/span&gt; &lt;span class="nt"&gt;-sha1&lt;/span&gt; &lt;span class="nt"&gt;-noout&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/://g'&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="nt"&gt;-F&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'{print tolower($2)}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
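&lt;p&gt;The &lt;code&gt;sed&lt;/code&gt; and &lt;code&gt;awk&lt;/code&gt; steps at the end of that pipeline only normalize the OpenSSL output into the lowercase hex string IAM expects. A minimal sketch of the same transformation, assuming the usual &lt;code&gt;Fingerprint=AA:BB:...&lt;/code&gt; line format; the function name and the sample fingerprint are made up for illustration:&lt;/p&gt;

```python
def thumbprint_from_fingerprint(line):
    """Convert an openssl fingerprint line to a lowercase hex thumbprint."""
    # Keep the part after "=", drop the colon separators, and lowercase it.
    # This mirrors the sed and awk steps in the shell pipeline above.
    _, _, hex_pairs = line.strip().partition("=")
    return hex_pairs.replace(":", "").lower()


if __name__ == "__main__":
    # Made-up fingerprint for illustration; not a real certificate.
    sample = "SHA1 Fingerprint=AB:CD:EF:01:23:45:67:89:AB:CD:EF:01:23:45:67:89:AB:CD:EF:01"
    print(thumbprint_from_fingerprint(sample))
```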



&lt;h3&gt;
  
  
  2. &lt;strong&gt;ExternalSecrets Refresh Interval Tuning&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Initial &lt;code&gt;refreshInterval: 5m&lt;/code&gt; caused:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;300+ Vault API calls/min across all ExternalSecret resources&lt;/li&gt;
&lt;li&gt;Vault rate limiting (429 errors)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Increased to &lt;code&gt;1h&lt;/code&gt; with manual sync trigger via annotation for urgent rotations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl annotate externalsecret db-credentials &lt;span class="nt"&gt;-n&lt;/span&gt; finance &lt;span class="se"&gt;\&lt;/span&gt;
  force-sync&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;--overwrite&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
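&lt;p&gt;A back-of-the-envelope sketch of why the longer interval relieves Vault. It assumes each synced resource is fetched once per &lt;code&gt;refreshInterval&lt;/code&gt;, so the call rate scales inversely with the interval; the 300 calls/min figure is the one quoted above, and the function name is hypothetical.&lt;/p&gt;

```python
def scaled_call_rate(calls_per_min, old_interval_s, new_interval_s):
    """Estimate the polling rate after changing refreshInterval."""
    # Assumption: one fetch per resource per interval, so the rate is
    # inversely proportional to the interval length.
    return calls_per_min * old_interval_s / new_interval_s


if __name__ == "__main__":
    before = 300  # calls/min observed at refreshInterval: 5m (figure from the text)
    after = scaled_call_rate(before, old_interval_s=5 * 60, new_interval_s=60 * 60)
    print(f"about {after:.0f} calls/min at 1h")
```

&lt;p&gt;Under that assumption, moving from 5 minutes to 1 hour cuts steady-state polling by a factor of 12, with forced syncs covering the urgent cases.&lt;/p&gt;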



&lt;h3&gt;
  
  
  3. &lt;strong&gt;KEDA Cooldown Period Matters&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Early deployments had &lt;code&gt;cooldownPeriod: 30s&lt;/code&gt;, causing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggressive scale-downs during brief traffic lulls&lt;/li&gt;
&lt;li&gt;Thrashing (scale up → scale down → scale up)&lt;/li&gt;
&lt;/ul&gt;

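&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Increased to &lt;code&gt;60s&lt;/code&gt; and added &lt;code&gt;stabilizationWindowSeconds&lt;/code&gt; to the HPA scale-down behavior. KEDA's &lt;code&gt;cooldownPeriod&lt;/code&gt; only governs the final scale to zero, so the stabilization window is what damps scale-downs between non-zero replica counts:&lt;br&gt;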
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
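&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;advanced&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Under the ScaledObject's spec&lt;/span&gt;
  &lt;span class="na"&gt;horizontalPodAutoscalerConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;scaleDown&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;stabilizationWindowSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;  &lt;span class="c1"&gt;# Wait 5 min before scale-down&lt;/span&gt;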
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
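
&lt;p&gt;For readers who have not used these fields, here is a sketch of where both settings live in a KEDA ScaledObject. The names, replica counts, and SQS trigger are illustrative assumptions, not our actual manifest. Note that &lt;code&gt;cooldownPeriod&lt;/code&gt; is a plain number of seconds and only governs the final scale-down to zero replicas, while the HPA &lt;code&gt;behavior&lt;/code&gt; block controls scale-down between higher replica counts:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: remittance-worker            # hypothetical name
spec:
  scaleTargetRef:
    name: remittance-worker          # hypothetical Deployment
  minReplicaCount: 0
  maxReplicaCount: 20                # illustrative ceiling
  cooldownPeriod: 60                 # seconds after the last active trigger before scaling to zero
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300   # smooth scale-down decisions over 5 minutes
  triggers:
    - type: aws-sqs-queue            # assumption: any KEDA scaler fits here; authentication omitted
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/example-queue
        queueLength: "5"
        awsRegion: us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;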



&lt;h3&gt;
  
  
  4. &lt;strong&gt;Harness Rollback Edge Case&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;During one canary, a pod crashlooped due to a config typo. Harness auto-rollback triggered, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EKS deployment was rolled back ✓&lt;/li&gt;
&lt;li&gt;ALB target group weights were &lt;strong&gt;not&lt;/strong&gt; reset ✗&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Added explicit ALB rule weight reset in Harness failure strategy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight groovy"&gt;&lt;code&gt;&lt;span class="nl"&gt;onFailure:&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nl"&gt;step:&lt;/span&gt; &lt;span class="n"&gt;ShellScript&lt;/span&gt;
      &lt;span class="nl"&gt;script:&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
        aws elbv2 modify-rule --rule-arn "$RULE_ARN" \
          --conditions Field=path-pattern,Values='/*' \
          --actions "Type=forward,ForwardConfig={TargetGroups=[{TargetGroupArn=$ECS_TG,Weight=100}]}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Quantified Impact
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Performance Improvements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;P99 Latency:&lt;/strong&gt; 1,240ms → 890ms (&lt;strong&gt;-28%&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autoscale Response:&lt;/strong&gt; 185s → 22s (&lt;strong&gt;-88%&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pod Density:&lt;/strong&gt; 2.3 pods/node → 3.8 pods/node (&lt;strong&gt;+65%&lt;/strong&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost Savings
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EC2 Compute:&lt;/strong&gt; ~18% reduction (better bin-packing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets Management:&lt;/strong&gt; Eliminated SSM Parameter Store costs ($1,200/month)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Native Prometheus/Grafana vs. paid CloudWatch dashboards ($800/month saved)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Operational Efficiency
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deployment Frequency:&lt;/strong&gt; 2-3 times/week → 8-12 times/week (faster iteration)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets Rotation:&lt;/strong&gt; Manual 4-hour process → Automated hourly sync&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident Response:&lt;/strong&gt; Mean time to recovery 45 min → 12 min (faster pod restarts)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Takeaways for Your Migration
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with Non-Critical Services:&lt;/strong&gt; Don't migrate your revenue-critical path first. We started with batch processing jobs to validate the EKS infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IRSA is Non-Negotiable:&lt;/strong&gt; Hardcoded AWS credentials and shared node instance profiles are security anti-patterns: the first leaks easily, and the second gives every pod on a node the same permissions. Invest time in IRSA setup upfront.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;KEDA Transforms Autoscaling:&lt;/strong&gt; If you have event-driven workloads (queues, Kafka, cron jobs), KEDA is a game-changer. It scales on actual work, not proxy metrics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Blue-Green at the ALB Level:&lt;/strong&gt; Don't underestimate the psychological safety of instant rollback. It enabled aggressive cutover timelines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observability Parity First:&lt;/strong&gt; Ensure EKS monitoring matches ECS before migration. We instrumented Prometheus metrics, Grafana dashboards, and OpenSearch logging in parallel with ECS for 2 weeks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Team Coordination &amp;gt; Tech:&lt;/strong&gt; The hardest part wasn't Kubernetes. It was aligning 6 teams on deployment schedules, rollback procedures, and communication protocols.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
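
&lt;p&gt;To make the IRSA takeaway concrete, this is roughly what the Kubernetes side looks like: a ServiceAccount annotated with an IAM role, which the workload then references through &lt;code&gt;serviceAccountName&lt;/code&gt;. The account ID, role name, service name, and namespace below are placeholders. The IAM role also needs a trust policy for the cluster's OIDC provider, which is not shown here:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: ServiceAccount
metadata:
  name: posting-service              # hypothetical service name
  namespace: ar-finance              # hypothetical namespace
  annotations:
    # Placeholder ARN; pods using this ServiceAccount receive credentials for this role only
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/posting-service-irsa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;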




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Now that we've migrated to EKS, we're exploring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Istio service mesh&lt;/strong&gt; for advanced traffic management and mTLS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Argo CD&lt;/strong&gt; for GitOps-driven deployments (replacing Harness)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vertical Pod Autoscaler (VPA)&lt;/strong&gt; for right-sizing pod resource requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Karpenter&lt;/strong&gt; (as an alternative to Cluster Autoscaler) for faster node provisioning&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Questions? Let's Discuss!
&lt;/h2&gt;

&lt;p&gt;If you're planning an ECS→EKS migration or have gone through one, I'd love to hear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What was your biggest surprise during the migration?&lt;/li&gt;
&lt;li&gt;How did you handle database connection draining during cutover?&lt;/li&gt;
&lt;li&gt;Any KEDA scaler gotchas we should watch for?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop your thoughts in the comments or connect with me on &lt;a href="https://linkedin.com/in/krishnakanth-e" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;





</description>
      <category>aws</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>microservices</category>
    </item>
  </channel>
</rss>
