Zero-Downtime ECS → EKS Migration: Orchestrating a 6-Team Production Cutover at Scale

Task at hand: Migrating Live Healthcare Services Without Dropping a Single Request

When you're processing healthcare revenue cycle transactions worth millions of dollars daily, downtime isn't just inconvenient—it's financially catastrophic and potentially impacts patient care. This is the story of how we migrated 15+ microservices from AWS ECS to EKS across 6 engineering teams with zero downtime, zero rollbacks, and zero production incidents.

The stakes: AR Finance and Posting Modernisation services handling real-time remittance processing for U.S. healthcare providers.

The constraint: Absolute zero tolerance for downtime or data loss.

The scope: Domain-wide cutover coordinating Rules Core, Payment Processing, Reconciliation, Analytics, Data Pipeline, and Platform teams.


Why We Migrated: ECS Limitations at Scale

Our ECS-based architecture was showing cracks:

1. Autoscaling Lag During Traffic Spikes

ECS service autoscaling based on CloudWatch metrics had a 3-5 minute delay. During month-end processing windows, we'd see:

  • CPU spike to 85%+ before scale-out triggered
  • 30-45 second P99 latencies while waiting for new tasks
  • Manual intervention required to pre-scale services

2. Resource Bin-Packing Inefficiency

ECS task placement was leaving 20-30% cluster capacity unused due to fragmentation:

EC2 Instance: 8 vCPU, 16GB RAM
Task A: 2 vCPU, 4GB  ✓
Task B: 2 vCPU, 4GB  ✓
Task C: 4 vCPU, 10GB ✗ (not enough memory left on this instance)
→ 4 vCPU, 8GB sitting idle

3. Secrets Management Complexity

We were using SSM Parameter Store with custom init containers to inject secrets, leading to:

  • Secrets rotations requiring task restarts
  • Verbose task definitions with 50+ environment variables
  • No audit trail for secret access

4. Limited Observability

ECS metrics were service-level only. Task- and container-level insights required:

  • Custom CloudWatch dashboards
  • X-Ray instrumentation for every service
  • Log aggregation gymnastics across task IDs

The decision: Migrate to EKS for KEDA-based event-driven autoscaling, better resource utilization, native Kubernetes secrets operators, and richer observability.


Architecture: The Before and After

Before: ECS Architecture

┌─────────────────────────────────────────────────┐
│  Application Load Balancer                      │
└──────────────┬──────────────────────────────────┘
               │
    ┌──────────┴──────────┐
    │                     │
┌───▼────────┐     ┌─────▼──────┐
│ ECS Service│     │ ECS Service│
│  (Task A)  │     │  (Task B)  │
│            │     │            │
│ SSM Params │     │ SSM Params │
└─────┬──────┘     └──────┬─────┘
      │                   │
      └─────────┬─────────┘
                │
         ┌──────▼───────┐
         │  RDS/MSK/S3  │
         └──────────────┘

After: EKS Architecture

┌─────────────────────────────────────────────────┐
│  Application Load Balancer (AWS LB Controller)  │
└──────────────┬──────────────────────────────────┘
               │
    ┌──────────┴──────────┐
    │                     │
┌───▼────────────┐  ┌────▼───────────┐
│ K8s Deployment │  │ K8s Deployment │
│   + Service    │  │   + Service    │
│                │  │                │
│ KEDA Scaler    │  │ KEDA Scaler    │
│ (SQS/Kafka)    │  │ (Prometheus)   │
│                │  │                │
│ ExternalSecret │  │ ExternalSecret │
│ (Vault sync)   │  │ (Vault sync)   │
└─────┬──────────┘  └──────┬─────────┘
      │                    │
      └──────────┬─────────┘
                 │
          ┌──────▼────────┐
          │   RDS/MSK/S3  │
          │   (IRSA auth) │
          └───────────────┘

The Migration Strategy: Blue-Green at the Load Balancer

We chose target group-level blue-green deployment to enable instantaneous rollback:

ALB
 │
 ├─► Target Group A (ECS tasks)    [90% traffic] ← Active
 │
 └─► Target Group B (EKS pods)     [10% traffic] ← Canary

Traffic shift progression:

  1. Week 1: ECS 100% / EKS 0% (deployment validation)
  2. Week 2: ECS 90% / EKS 10% (canary with real traffic)
  3. Week 3: ECS 50% / EKS 50% (split validation)
  4. Week 4: ECS 10% / EKS 90% (confidence threshold)
  5. Week 5: ECS 0% / EKS 100% (full cutover)

Rollback mechanism: Single ALB rule weight change (15-second propagation) vs. hours for task/pod redeployment.
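
On the EKS side, something has to register pods into that canary target group. One common way to do this with the AWS Load Balancer Controller (already referenced in the "After" diagram) is a TargetGroupBinding; a minimal sketch, with an illustrative target group ARN and service name:

# Attach the in-cluster Service to the existing ALB target group (Target Group B)
apiVersion: elbv2.k8s.aws/v1beta1
kind: TargetGroupBinding
metadata:
  name: remittance-processor-tgb
  namespace: finance
spec:
  serviceRef:
    name: remittance-processor   # Kubernetes Service fronting the pods
    port: 8080
  targetGroupARN: arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/eks-canary/abc123
  targetType: ip                 # Register pod IPs directly with the ALB

With both target groups attached to the same listener rule, each weekly traffic shift was just a weight change on that rule; neither the ECS services nor the EKS deployments had to be redeployed.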


Key Technical Decisions

1. IRSA (IAM Roles for Service Accounts) for AWS Authentication

Problem: Our ECS IAM permissions were effectively instance-wide (shared instance profiles rather than fine-grained task roles). In EKS, we needed pod-level IAM permissions.

Solution: IRSA with OIDC provider:

# ServiceAccount with IAM role annotation
apiVersion: v1
kind: ServiceAccount
metadata:
  name: remittance-processor-sa
  namespace: finance
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/RemittanceProcessorRole
# Terraform: IAM role with OIDC trust
resource "aws_iam_role" "remittance_processor" {
  name = "RemittanceProcessorRole"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = aws_iam_openid_connect_provider.eks.arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "${replace(aws_iam_openid_connect_provider.eks.url, "https://", "")}:sub": 
            "system:serviceaccount:finance:remittance-processor-sa"
        }
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "s3_access" {
  role       = aws_iam_role.remittance_processor.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
}

Result: Pods automatically assume IAM roles via projected service account tokens. No static credentials in containers.
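
On the workload side, the only change is pointing the pod spec at that ServiceAccount. A trimmed sketch (image, labels, and replica count are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: remittance-processor
  namespace: finance
spec:
  replicas: 5
  selector:
    matchLabels:
      app: remittance-processor
  template:
    metadata:
      labels:
        app: remittance-processor
    spec:
      serviceAccountName: remittance-processor-sa  # IRSA: pod receives a projected web identity token
      containers:
        - name: app
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/remittance-processor:latest

The EKS pod identity webhook injects AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE into the container, so any reasonably recent AWS SDK assumes the role without code changes.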


2. KEDA for Event-Driven Autoscaling

Problem: ECS autoscaling on CPU/memory reacted to proxy metrics, not to the actual work backlog.

Solution: KEDA scalers monitoring actual workload queues:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: remittance-processor-scaler
  namespace: finance
spec:
  scaleTargetRef:
    name: remittance-processor
  minReplicaCount: 5
  maxReplicaCount: 50
  pollingInterval: 15  # Check queue depth every 15s
  cooldownPeriod: 60   # Wait 60s before scaling down
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: keda-aws-credentials
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/remittance-queue
        queueLength: "10"  # Target 10 messages per pod
        awsRegion: us-east-1
        identityOwner: operator  # Use IRSA
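
The keda-aws-credentials object referenced above is a KEDA TriggerAuthentication. A minimal IRSA-based version might look like this (a sketch; note that with identityOwner: operator, the KEDA operator's own IAM role is what ultimately needs SQS read permissions):

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-aws-credentials
  namespace: finance
spec:
  podIdentity:
    provider: aws-eks   # Resolve AWS credentials via IRSA rather than static keys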

Impact:

  • Before (ECS): 3-5 minute scale-out lag → P99 latency spikes to 30-45s
  • After (KEDA): 15-second scale-out trigger → P99 latency stays under 5s

During month-end processing (5,000 msg/min spike), KEDA scaled from 5→42 pods in under 2 minutes vs. 8-10 minutes with ECS.


3. ExternalSecrets + HashiCorp Vault

Problem: Secrets rotation in ECS required task restarts and deployment pipelines.

Solution: ExternalSecrets Operator syncing Vault → Kubernetes Secrets:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: finance
spec:
  refreshInterval: 1h  # Sync every hour
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: db-credentials-secret
    creationPolicy: Owner
  data:
    - secretKey: username
      remoteRef:
        key: database/prod/remittance
        property: username
    - secretKey: password
      remoteRef:
        key: database/prod/remittance
        property: password

Application consumption:

# Deployment using the synced secret
env:
  - name: DB_USERNAME
    valueFrom:
      secretKeyRef:
        name: db-credentials-secret
        key: username
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: db-credentials-secret
        key: password

Result: Vault rotates DB passwords every 30 days → ExternalSecrets syncs → Pods pick up new secrets on next restart (rolling deployment) without manual intervention.
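
For completeness, the vault-backend store the ExternalSecret references is a SecretStore pointing at Vault. A sketch assuming Vault's Kubernetes auth method (the server address, role, and service account names are illustrative):

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-backend
  namespace: finance
spec:
  provider:
    vault:
      server: https://vault.internal.example.com   # Illustrative Vault address
      version: v2                                  # KV v2 secrets engine
      auth:
        kubernetes:
          mountPath: kubernetes
          role: external-secrets
          serviceAccountRef:
            name: external-secrets-sa

Because the operator authenticates with its own service account token, no long-lived Vault credentials live in the cluster.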


4. Harness CD for Coordinated Rollouts

Challenge: 6 teams, 15+ services, different deployment schedules.

Solution: Harness pipelines with:

  • Canary stages: 10% → 50% → 100% traffic shifts with automated rollback
  • Approval gates: Lead SRE sign-off before production shifts
  • Parallel deployments: Non-dependent services deploy concurrently
  • Failure strategies: Auto-rollback on P99 latency > 10s or error rate > 0.5% (see the failure-strategy sketch after the snippet below)

# Harness canary deployment snippet
stages:
  - stage:
      name: Canary Deployment
      spec:
        execution:
          steps:
            - step:
                type: K8sCanaryDeploy
                spec:
                  instanceSelection:
                    type: Count
                    spec:
                      count: 1  # 1 pod canary
            - step:
                type: K8sCanaryDelete
                spec:
                  skipDryRun: false
            - step:
                type: K8sRollingDeploy
                spec:
                  skipDryRun: false
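
The auto-rollback rule from the bullet list maps to a stage-level failure strategy paired with a verification step that evaluates the latency and error-rate thresholds. A rough sketch of the failure-strategy wiring in Harness NextGen YAML (treat the exact schema as an assumption to verify against your Harness version):

failureStrategies:
  - onFailure:
      errors:
        - Verification        # Raised when the Verify step flags the P99/error-rate breach
      action:
        type: StageRollback   # Roll the canary back automatically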

The Cutover Week: Hour-by-Hour Execution

Monday: Final Validation (ECS 100%, EKS 0%)

  • 08:00 AM: Deploy all EKS services to production (no traffic)
  • 10:00 AM: Validate pod health, IRSA permissions, ExternalSecrets sync
  • 12:00 PM: Run smoke tests against EKS endpoints (bypassing ALB)
  • 02:00 PM: Verify KEDA scalers respond to synthetic load
  • 04:00 PM: Go/No-Go meeting → GO

Tuesday: 10% Canary (ECS 90%, EKS 10%)

  • 12:00 AM: Shift 10% ALB traffic to EKS target group
  • 12:00 AM - 11:59 PM: Monitor dashboards:
    • P50/P95/P99 latencies (CloudWatch + Prometheus)
    • Error rates (application logs + OpenSearch)
    • KEDA scaling events
    • Vault secret access audit logs

Metrics (24-hour comparison):
| Metric | ECS Baseline | EKS Canary | Delta |
|--------|--------------|------------|-------|
| P99 Latency | 1,240ms | 890ms | -28% ✓ |
| Error Rate | 0.12% | 0.09% | -25% ✓ |
| Autoscale Lag | 185s | 22s | -88% ✓ |

Wednesday-Thursday: 50% Split (ECS 50%, EKS 50%)

  • Observation: EKS pods stabilized at 30% lower replica count for same throughput (better bin-packing)
  • Cost Impact: Estimated 18% reduction in EC2 costs at full migration

Friday: 90% Confidence (ECS 10%, EKS 90%)

  • Peak Load Test: Month-end processing simulation (5K msgs/min)
  • Result: KEDA scaled 5→38 pods in 90 seconds, P99 stayed under 4s

Monday Week 2: Full Cutover (ECS 0%, EKS 100%)

  • 08:00 AM: Shift final 10% traffic to EKS
  • 08:30 AM: ECS tasks draining (no new connections)
  • 09:00 AM: ECS cluster scaled to 0
  • 10:00 AM: Migration Complete ✓

Final Scorecard:

  • Downtime: 0 seconds
  • Rollbacks: 0
  • Production Incidents: 0
  • Data Loss: 0 records

Lessons Learned

1. IRSA Trust Policy Gotchas

We hit this error initially:

Error: failed to assume role: AccessDenied

Root cause: OIDC provider thumbprint mismatch.

Fix: Regenerate thumbprint after EKS cluster upgrade:

aws eks describe-cluster --name prod-cluster \
  --query "cluster.identity.oidc.issuer" --output text

# Extract thumbprint using OpenSSL
echo | openssl s_client -servername oidc.eks.us-east-1.amazonaws.com \
  -connect oidc.eks.us-east-1.amazonaws.com:443 2>/dev/null \
  | openssl x509 -fingerprint -noout \
  | sed 's/://g' | awk -F= '{print tolower($2)}'

2. ExternalSecrets Refresh Interval Tuning

Initial refreshInterval: 5m caused:

  • 300+ Vault API calls/min across all pods
  • Vault rate limiting (429 errors)

Solution: Increased to 1h with manual sync trigger via annotation for urgent rotations:

kubectl annotate externalsecret db-credentials \
  force-sync=$(date +%s) --overwrite

3. KEDA Cooldown Period Matters

Early deployments had cooldownPeriod: 30s, causing:

  • Aggressive scale-downs during brief traffic lulls
  • Thrashing (scale up → scale down → scale up)

Fix: Increased cooldownPeriod to 60s and added a scale-down stabilization window under the ScaledObject's advanced HPA configuration:

advanced:
  horizontalPodAutoscalerConfig:
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 300  # Wait 5 min before scaling down

4. Harness Rollback Edge Case

During one canary, a pod crashlooped due to a config typo. Harness auto-rollback triggered, but:

  • EKS deployment was rolled back ✓
  • ALB target group weights were not reset ✗

Fix: Added explicit ALB rule weight reset in Harness failure strategy:

onFailure:
  - step: ShellScript
    script: |
      # Send 100% of traffic back to the ECS target group
      aws elbv2 modify-rule --rule-arn "$RULE_ARN" \
        --actions Type=forward,TargetGroupArn="$ECS_TG"

Quantified Impact

Performance Improvements

  • P99 Latency: 1,240ms → 890ms (-28%)
  • Autoscale Response: 185s → 22s (-88%)
  • Pod Density: 2.3 pods/node → 3.8 pods/node (+65%)

Cost Savings

  • EC2 Compute: ~18% reduction (better bin-packing)
  • Secrets Management: Eliminated SSM Parameter Store costs ($1,200/month)
  • Observability: Native Prometheus/Grafana vs. paid CloudWatch dashboards ($800/month saved)

Operational Efficiency

  • Deployment Frequency: 2-3 times/week → 8-12 times/week (faster iteration)
  • Secrets Rotation: Manual 4-hour process → Automated hourly sync
  • Incident Response: Mean-time-to-recovery reduced from 45 min → 12 min (faster pod restarts)

Key Takeaways for Your Migration

  1. Start with Non-Critical Services: Don't migrate your revenue-critical path first. We started with batch processing jobs to validate the EKS infrastructure.

  2. IRSA is Non-Negotiable: Hardcoded AWS credentials or instance profiles are security anti-patterns. Invest time in IRSA setup upfront.

  3. KEDA Transforms Autoscaling: If you have event-driven workloads (queues, Kafka, cron jobs), KEDA is a game-changer. It scales on actual work, not proxy metrics.

  4. Blue-Green at the ALB Level: Don't underestimate the psychological safety of instant rollback. It enabled aggressive cutover timelines.

  5. Observability Parity First: Ensure EKS monitoring matches ECS before migration. We instrumented Prometheus metrics, Grafana dashboards, and OpenSearch logging in parallel with ECS for 2 weeks.

  6. Team Coordination > Tech: The hardest part wasn't Kubernetes—it was aligning 6 teams on deployment schedules, rollback procedures, and communication protocols.


What's Next?

Now that we've migrated to EKS, we're exploring:

  • Istio service mesh for advanced traffic management and mTLS
  • Argo CD for GitOps-driven deployments (replacing Harness)
  • Vertical Pod Autoscaler (VPA) for right-sizing pod resource requests
  • Karpenter (replacing the Cluster Autoscaler) for faster node provisioning

Questions? Let's Discuss!

If you're planning an ECS→EKS migration or have gone through one, I'd love to hear:

  • What was your biggest surprise during the migration?
  • How did you handle database connection draining during cutover?
  • Any KEDA scaler gotchas we should watch for?

Drop your thoughts in the comments or connect with me on LinkedIn.

