<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rex Zhen</title>
    <description>The latest articles on DEV Community by Rex Zhen (@rex_zhen_a9a8400ee9f22e98).</description>
    <link>https://dev.to/rex_zhen_a9a8400ee9f22e98</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3660416%2F14e372f0-1a94-4ee2-a0a7-2e282a0fb83b.png</url>
      <title>DEV Community: Rex Zhen</title>
      <link>https://dev.to/rex_zhen_a9a8400ee9f22e98</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rex_zhen_a9a8400ee9f22e98"/>
    <language>en</language>
    <item>
      <title>Harness Engineering: The Next Evolution of AI Engineering</title>
      <dc:creator>Rex Zhen</dc:creator>
      <pubDate>Wed, 08 Apr 2026 22:28:33 +0000</pubDate>
      <link>https://dev.to/rex_zhen_a9a8400ee9f22e98/harness-engineering-the-next-evolution-of-ai-engineering-3ji7</link>
      <guid>https://dev.to/rex_zhen_a9a8400ee9f22e98/harness-engineering-the-next-evolution-of-ai-engineering-3ji7</guid>
      <description>&lt;h1&gt;
  
  
  Harness Engineering: The Next Evolution of AI Engineering
&lt;/h1&gt;

&lt;p&gt;There's a quiet but significant shift happening in how engineers work with AI. Most people are still talking about prompt engineering. Some have moved on to context engineering. But the frontier right now is something deeper: &lt;strong&gt;harness engineering&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And it changes not just how we build software — it changes what skills actually matter.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Eras
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Era 1: Prompt Engineering
&lt;/h3&gt;

&lt;p&gt;This is where most engineers started. You craft the right words, the right instructions, the right examples — and the model gives you a better output.&lt;/p&gt;

&lt;p&gt;It works. But it's fundamentally a single-turn, stateless interaction. You're still doing all the orchestration in your head.&lt;/p&gt;

&lt;h3&gt;
  
  
  Era 2: Context Engineering
&lt;/h3&gt;

&lt;p&gt;The next step was realizing the &lt;em&gt;words&lt;/em&gt; mattered less than the &lt;em&gt;information&lt;/em&gt;. What does the model know when it answers? What docs, history, retrieved data, and memory are in the window?&lt;/p&gt;

&lt;p&gt;RAG pipelines, memory systems, and knowledge bases all belong here. You're no longer just crafting prompts — you're curating what the model sees.&lt;/p&gt;

&lt;h3&gt;
  
  
  Era 3: Harness Engineering
&lt;/h3&gt;

&lt;p&gt;This is the current frontier. Instead of controlling what the model says or sees, you design the &lt;strong&gt;system the model operates within&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The model becomes a component — a reasoning engine inside a larger loop. You define the skills it can use, the tools it can call, the verifiers that check its work, and the conditions under which it loops, escalates, or stops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The shift:&lt;/strong&gt; you're no longer writing prompts. You're writing programs — but instead of functions and libraries, the primitives are skills, tools, and MCP servers.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a Harness Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;The core pattern is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Skill executes → produces output → verifier judges output → loop back or advance&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Each step takes structured input, runs one or more skills (a model call, a tool, an API), produces output, and hands it to a verifier. The verifier decides: good enough to move forward, or retry with new context?&lt;/p&gt;

&lt;p&gt;The orchestrator above it all manages state, tracks history across iterations, and knows when to escalate to a human instead of looping forever.&lt;/p&gt;

&lt;p&gt;This isn't metaphorically like a program. It structurally &lt;em&gt;is&lt;/em&gt; one — just written in skills and tools instead of code.&lt;/p&gt;
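&lt;p&gt;A minimal sketch of that loop in Python. Here &lt;code&gt;run_skill&lt;/code&gt; and &lt;code&gt;verify&lt;/code&gt; are hypothetical stand-ins for real skill executions and a verifier check, not any particular framework's API:&lt;/p&gt;

```python
# Minimal harness loop: skill executes, verifier judges, loop or advance.
# run_skill and verify are hypothetical callables, not a real framework API.
def run_harness(task, run_skill, verify, max_iterations=5):
    context = {"task": task, "history": []}
    for iteration in range(max_iterations):
        output = run_skill(context)           # skill executes, sees prior history
        verdict = verify(output, context)     # verifier judges the output
        context["history"].append(
            {"iteration": iteration, "output": output, "verdict": verdict}
        )
        if verdict == "pass":
            return {"status": "done", "context": context}
    # iteration budget exhausted: escalate to a human instead of looping forever
    return {"status": "escalate", "context": context}
```

&lt;p&gt;Everything interesting lives in the two callables; the harness itself is just state, a loop, and an escalation budget.&lt;/p&gt;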




&lt;h2&gt;
  
  
  A Real-World Example: Autonomous Microservice Debugging
&lt;/h2&gt;

&lt;p&gt;Let me share something I actually built — a simple version of this loop in practice.&lt;/p&gt;

&lt;p&gt;I was troubleshooting an ECS microservice that kept failing after deployment. The usual process: check the GitHub Actions (GHA) pipeline, look at ECS task status, dig through CloudWatch logs, try a fix, redeploy, repeat. Tedious, manual, and slow — especially when the failure only surfaces after a full deployment cycle.&lt;/p&gt;

&lt;p&gt;So I wired up a harness scoped entirely within the microservice itself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub MCP&lt;/strong&gt; — check the GHA pipeline run, read failed step output, create branches, commit fixes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS MCP&lt;/strong&gt; — inspect the ECS cluster, service status, and task health&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch&lt;/strong&gt; — pull the ECS service logs, filter errors, surface stack traces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The loop looked like this: check GHA → check ECS → read logs → identify the issue → fix the code → commit and push → watch the next deployment → check logs again. Repeat until the service stabilized with no errors.&lt;/p&gt;

&lt;p&gt;No manual SSH. No tab-switching between consoles. The harness held the full debug context across iterations — it knew what had already been tried — and kept tightening the loop until it was clean.&lt;/p&gt;
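&lt;p&gt;The shape of one pass through that loop, sketched in Python with the MCP-backed checks injected as plain callables (every name here is illustrative, not a real MCP client API):&lt;/p&gt;

```python
# One iteration of the single-service debug loop. The injected callables
# stand in for GitHub MCP, AWS MCP, and CloudWatch queries; all names are
# illustrative placeholders, not a real MCP client library.
def debug_iteration(check_pipeline, check_tasks, read_errors, propose_fix, apply_fix):
    findings = {
        "pipeline": check_pipeline(),   # check the GHA run status
        "tasks": check_tasks(),         # check ECS task health
        "errors": read_errors(),        # read CloudWatch error logs
    }
    if findings["tasks"] == "healthy" and not findings["errors"]:
        return {"stable": True, "findings": findings}
    fix = propose_fix(findings)         # identify the issue from findings
    apply_fix(fix)                      # commit, push, watch the redeploy
    return {"stable": False, "findings": findings, "fix": fix}
```

&lt;p&gt;The orchestrator calls this repeatedly, feeding each iteration's findings back in, until the service reports stable.&lt;/p&gt;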

&lt;p&gt;It's a narrow scope deliberately. One service, one environment, three MCP servers. But even this simple version saved hours of back-and-forth debugging and eliminated the cognitive load of tracking state across a long troubleshooting session.&lt;/p&gt;

&lt;p&gt;This is the entry point for harness engineering in practice. Start with one service, one loop, a few well-defined skills. The pattern scales from there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The full technical design — including the more complete vision with code edits, multi-service topology, and deployment gates — is in the appendix at the end of this article.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hard Problem: AI Doesn't Have the Whole Picture
&lt;/h2&gt;

&lt;p&gt;My single-service harness worked well — within its scope. But that experience made the next problem obvious.&lt;/p&gt;

&lt;p&gt;In real production systems, a microservice is never truly isolated. Every service I've worked with has upstream callers, downstream dependencies, and a surrounding ecosystem — PostgreSQL, Redis, SQS, Lambda workers, other microservices — all of which can cause your service to fail even when your service's code is perfectly fine.&lt;/p&gt;

&lt;p&gt;I've seen this pattern more times than I can count. The symptoms show up in service A. Everyone debugs service A. Hours later someone notices that service B stopped consuming from the SQS queue two hours ago, which caused service A's queue depth to spike, which caused the memory pressure that looked like a code bug. The root cause was three hops away.&lt;/p&gt;

&lt;p&gt;A harness that only knows about one service will do exactly what a junior engineer does: fix symptoms confidently while the real cause sits untouched elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The full picture requires the harness to know the topology&lt;/strong&gt; — what lives upstream and downstream, what ecosystem components exist, and which skill to use to check each one. Before diagnosing anything, it sweeps the entire dependency graph in parallel, accumulates findings from every node, and only then reasons about root cause.&lt;/p&gt;

&lt;p&gt;That sweep might involve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub and GHA — did the deployment itself introduce the issue?&lt;/li&gt;
&lt;li&gt;ECS tasks across multiple services — is something upstream unhealthy?&lt;/li&gt;
&lt;li&gt;CloudWatch logs across service boundaries — where did errors first appear?&lt;/li&gt;
&lt;li&gt;PostgreSQL — connection pool exhaustion, slow queries, blocking locks&lt;/li&gt;
&lt;li&gt;Redis — memory pressure, eviction policy changes, connection refusals&lt;/li&gt;
&lt;li&gt;SQS — queue depth, dead-letter queue size, consumer lag&lt;/li&gt;
&lt;li&gt;Lambda — throttling, cold start storms, downstream retry cascades&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what a senior engineer does instinctively when they get paged. They don't open the failing service first — they open a mental map of everything connected to it and start ruling things out. The harness needs the same instinct, but it has to be given the map explicitly.&lt;/p&gt;

&lt;p&gt;Building that map, keeping it accurate as the system evolves, and knowing what to include — that's not a technical problem. It's a judgment problem. And it's entirely on the engineer, not the harness.&lt;/p&gt;
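&lt;p&gt;That parallel sweep is straightforward to sketch with Python's standard library. The &lt;code&gt;checks&lt;/code&gt; dict maps each topology node to a hypothetical health-check callable:&lt;/p&gt;

```python
# Parallel topology sweep: check every node in the dependency graph
# concurrently, accumulate all findings, and only then reason about
# root cause. Each value in checks is a hypothetical health-check callable.
from concurrent.futures import ThreadPoolExecutor

def sweep_topology(checks):
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        # block until every node has reported, so no finding is missed
        return {name: future.result() for name, future in futures.items()}
```

&lt;p&gt;The important property is that no diagnosis happens until every node has reported; the model reasons over the whole map, not the first symptom it finds.&lt;/p&gt;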




&lt;h2&gt;
  
  
  The Boundary That Must Stay Human
&lt;/h2&gt;

&lt;p&gt;One more constraint that comes directly from experience.&lt;/p&gt;

&lt;p&gt;The harness I built operates only in lower environments, on feature branches. It checks GHA, it inspects ECS, it reads logs, it proposes and applies fixes. But it never merges to main. It never touches production. When it's satisfied that the fix holds, it opens a PR with the full debug history attached — and stops.&lt;/p&gt;

&lt;p&gt;A human reads it, reviews the diff, and decides whether it goes forward.&lt;/p&gt;

&lt;p&gt;This isn't just a safety rule. It reflects something real about where AI judgment currently breaks down. The harness is excellent at iteration within a defined scope — it holds state, tries things systematically, doesn't get tired. But it has no awareness of the things that make a production decision hard: what other teams are deploying this week, whether there's a compliance review pending, what the blast radius looks like at 3am on a Friday, whether the business can absorb a rollback if something goes wrong.&lt;/p&gt;

&lt;p&gt;Those calls require context that lives outside the codebase. That context lives with people.&lt;/p&gt;

&lt;p&gt;The machine does the iteration. The human makes the promotion decision. That division is the right design — not a temporary limitation to be engineered away.&lt;/p&gt;




&lt;h2&gt;
  
  
  Coding is Cheap. Engineering is Not.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AI is replacing most of the coding work. It is not replacing the engineering judgment.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Understanding infrastructure as a whole system — how failure propagates, where the real blast radius sits, what the topology actually looks like versus what the documentation says — that knowledge is becoming the scarce resource. Not syntax. Not boilerplate. Not even algorithms.&lt;/p&gt;

&lt;p&gt;The engineers who thrive in this era are the ones who can hand a well-designed harness a well-defined problem, watch what it does, and know exactly when its confidence is outrunning its understanding. That's a harder skill to develop than writing code. And it's much harder to automate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix: Technical Design of the Microservice Debugging Harness
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Skills
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;MCP / Tools&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Skill&lt;/td&gt;
&lt;td&gt;GitHub MCP&lt;/td&gt;
&lt;td&gt;Branch management, PR creation, GHA pipeline monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Skill&lt;/td&gt;
&lt;td&gt;AWS MCP&lt;/td&gt;
&lt;td&gt;ECS cluster, service, and task health verification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Skill&lt;/td&gt;
&lt;td&gt;AWS MCP&lt;/td&gt;
&lt;td&gt;Log retrieval, error filtering, stack trace parsing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PostgreSQL Skill&lt;/td&gt;
&lt;td&gt;Postgres MCP&lt;/td&gt;
&lt;td&gt;Slow query analysis, connection pool status, schema verification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQS Skill&lt;/td&gt;
&lt;td&gt;AWS MCP&lt;/td&gt;
&lt;td&gt;Queue depth, DLQ size, consumer lag&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis Skill&lt;/td&gt;
&lt;td&gt;AWS MCP&lt;/td&gt;
&lt;td&gt;Memory usage, eviction rate, connection count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lambda Skill&lt;/td&gt;
&lt;td&gt;AWS MCP&lt;/td&gt;
&lt;td&gt;Error rate, throttle count, duration, cold starts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTTP Health Skill&lt;/td&gt;
&lt;td&gt;HTTP tool&lt;/td&gt;
&lt;td&gt;Upstream and downstream service health endpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Reader Skill&lt;/td&gt;
&lt;td&gt;GitHub MCP&lt;/td&gt;
&lt;td&gt;Fetch source files relevant to the error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Editor Skill&lt;/td&gt;
&lt;td&gt;File edit + GitHub MCP&lt;/td&gt;
&lt;td&gt;Apply fix to source code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Commit/Push Skill&lt;/td&gt;
&lt;td&gt;GitHub MCP&lt;/td&gt;
&lt;td&gt;Version the change on feature branch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GHA Watcher Skill&lt;/td&gt;
&lt;td&gt;GHA MCP&lt;/td&gt;
&lt;td&gt;Poll pipeline run, read failure logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deploy Waiter Skill&lt;/td&gt;
&lt;td&gt;AWS MCP&lt;/td&gt;
&lt;td&gt;Wait for ECS task stabilization after rollout&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load Test Skill&lt;/td&gt;
&lt;td&gt;HTTP / Playwright&lt;/td&gt;
&lt;td&gt;Trigger load and UI click flows against lower env&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Execution Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Main Orchestrator (loop until healthy)
│
├── Phase 1: Full topology sweep (parallel)
│     ├── upstream health check
│     ├── self: ECS tasks, CloudWatch errors, GHA deploy status
│     └── downstream: Postgres, Redis, SQS, Lambda, HTTP health
│
├── Reasoning: model reads combined findings, identifies root cause
│     └── decides: infra fix OR code fix
│
├── Action
│     ├── infra fix: AWS Skill → update ECS task definition, env vars
│     └── code fix: read source → edit → commit → push → watch GHA → wait for ECS
│
├── Verify Phase
│     ├── ECS tasks stable?
│     ├── CloudWatch: error rate below threshold?
│     └── Postgres: no blocking queries?
│
├── Test Phase
│     └── Load test + UI test against lower env
│           ├── pass → open PR with debug summary → DONE
│           └── fail → append findings to debug context → loop back
│
└── Escalation condition
      └── if iterations &amp;gt; N → surface findings, open PR, pause for human
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Shared Debug Context Object
&lt;/h3&gt;

&lt;p&gt;Each iteration appends a full record so the model never repeats a fix that already failed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"payments-api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iteration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"history"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"iteration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"diagnosis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OOMKilled - exit code 137"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"infra fix - increased ECS memory to 2048"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fail - still OOMKilled at 2048"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"iteration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"diagnosis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"memory leak in batch processor"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"code fix - reduced batch size 1000 → 100"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"commit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a3f9c12"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"gha"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fail - new error: DB connection timeout"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"iteration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"diagnosis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"connection pool exhausted after batch fix"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"code fix - added pg connection pool limit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"commit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"b7e2d45"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"gha"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pending"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
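&lt;p&gt;A small sketch of how that context object might be maintained between iterations (the helper names are mine, not part of any framework):&lt;/p&gt;

```python
# Maintain the shared debug context: append one record per iteration and
# expose the fixes that already failed so the model is never asked to
# retry them. Helper names are illustrative.
def record_iteration(context, diagnosis, action, result):
    context["iteration"] = context.get("iteration", 0) + 1
    context.setdefault("history", []).append({
        "iteration": context["iteration"],
        "diagnosis": diagnosis,
        "action": action,
        "result": result,
    })
    return context

def failed_actions(context):
    # actions whose result starts with "fail" must not be retried
    return [h["action"] for h in context.get("history", [])
            if h["result"].startswith("fail")]
```

&lt;p&gt;Feeding &lt;code&gt;failed_actions&lt;/code&gt; back into the prompt each iteration is what keeps the loop tightening instead of circling.&lt;/p&gt;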



&lt;h3&gt;
  
  
  Service Topology Map
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"payments-api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"upstream"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"api-gateway"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"skill"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http-health"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"frontend-app"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"skill"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http-health"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"downstream"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"postgresql"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nl"&gt;"skill"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"postgres"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"redis"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cache"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nl"&gt;"skill"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws-elasticache"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sqs-payments"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"queue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="nl"&gt;"skill"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws-sqs"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lambda-worker"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"compute"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"skill"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws-lambda"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"notification-svc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nl"&gt;"skill"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http-health"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Human Gates
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gate&lt;/th&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PR review and merge&lt;/td&gt;
&lt;td&gt;Always — harness opens PR, human approves&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production deployment&lt;/td&gt;
&lt;td&gt;Always — human driven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DB schema changes&lt;/td&gt;
&lt;td&gt;Require explicit approval before harness proceeds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Iteration escalation&lt;/td&gt;
&lt;td&gt;If harness exceeds N iterations with no progress&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
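&lt;p&gt;The gates above reduce to a simple predicate the orchestrator checks before every action (the parameter names and thresholds here are illustrative):&lt;/p&gt;

```python
# Human-gate predicate: the harness stops and hands off to a person
# whenever any hard boundary is hit. Names and thresholds are illustrative.
def should_escalate(iteration, max_iterations, touches_schema, target_env):
    if target_env == "production":
        return True   # production deployment is always human-driven
    if touches_schema:
        return True   # DB schema changes require explicit approval
    if max(0, max_iterations - iteration) == 0:
        return True   # iteration budget exhausted with no progress
    return False
```

&lt;p&gt;Note that PR review and merge are not in the predicate at all: the harness always stops at an open PR, unconditionally.&lt;/p&gt;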




&lt;p&gt;&lt;em&gt;Rex Zhen is a Senior Site Reliability Engineer specializing in Cloud Infrastructure &amp;amp; AI/ML. Follow him on &lt;a href="https://www.linkedin.com/in/rex-zhen-b8b06632/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; for more on cloud architecture, SRE, and the evolving role of AI in engineering.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
#AI #SoftwareEngineering #HarnessEngineering #DevOps #Microservices #AIEngineering #Automation #CloudArchitecture #SRE
&lt;/h1&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>Current AI Coding Will Never Replace Human Programmers—Hint from the Story of AlphaGo</title>
      <dc:creator>Rex Zhen</dc:creator>
      <pubDate>Sat, 07 Mar 2026 05:52:59 +0000</pubDate>
      <link>https://dev.to/rex_zhen_a9a8400ee9f22e98/current-ai-coding-will-never-replace-human-programmers-hint-from-the-story-of-alphago-4fla</link>
      <guid>https://dev.to/rex_zhen_a9a8400ee9f22e98/current-ai-coding-will-never-replace-human-programmers-hint-from-the-story-of-alphago-4fla</guid>
      <description>&lt;h1&gt;
  
  
  Current AI Coding Will Never Replace Human Programmers—Hint from the Story of AlphaGo
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Two AlphaGos: A Tale of Different Origins
&lt;/h2&gt;

&lt;p&gt;Let me tell you a story that changed how I think about AI and programming.&lt;/p&gt;

&lt;p&gt;I played Go when I was young—not well, mind you. I was a terrible player who could barely keep track of my own stones, let alone plan 20 moves ahead. But even as a novice, I understood something profound about the game: it wasn't just about rules and patterns. It was about intuition, style, and thinking that transcended logic.&lt;/p&gt;

&lt;p&gt;So when AlphaGo defeated Lee Sedol in March 2016, I watched with fascination. The headlines screamed "AI Beats Human!" and tech pundits declared the age of superhuman AI had arrived. As someone who'd struggled with Go's complexity firsthand, I knew this was huge.&lt;/p&gt;

&lt;p&gt;But the real revelation came a year later with AlphaGo Zero. And almost nobody understood why it was fundamentally different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AlphaGo (2016)&lt;/strong&gt; learned from 160,000 human games spanning thousands of years of Go history. It studied master players, absorbed their opening strategies, their mid-game tactics, their endgame techniques. Then it improved through self-play. It was brilliant—Lee Sedol himself said some moves were so creative they seemed almost divine. Yet he still managed to win one game.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AlphaGo Zero (2017)&lt;/strong&gt; started with absolutely nothing but the rules of Go. No human games. No historical data. No master strategies. Just the board, the rules, and self-play. In 3 days, it didn't just beat the original AlphaGo—it &lt;strong&gt;demolished it 100-0&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Top Go players who faced AlphaGo Zero said something that still gives me chills: "Against the original AlphaGo, we had a small chance if we played perfectly. Against AlphaGo Zero, we have no chance. Not even a tiny one."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What made the difference?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not computing power. Not training time. Not algorithmic tricks.&lt;/p&gt;

&lt;p&gt;The difference was &lt;strong&gt;origin&lt;/strong&gt;. One learned from humans and carried human DNA in its thinking. The other evolved completely independently and discovered strategies humans—even masters who'd spent their entire lives studying the game—had never conceived in 3,000 years.&lt;/p&gt;

&lt;p&gt;As a former terrible Go player, this both terrified and amazed me. Even the worst patterns I'd learned as a beginner were ultimately human patterns. AlphaGo Zero didn't have that constraint.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Will AI Replace All Developers?"
&lt;/h2&gt;

&lt;p&gt;This is the hottest question in tech right now. Every conference, every tech blog, every developer forum is debating when—not if—AI will replace human programmers.&lt;/p&gt;

&lt;p&gt;My answer, based on the AlphaGo story? &lt;strong&gt;It will never happen. At least not with current AI technology.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's why.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gene of Current AI: Trained on Human Logic
&lt;/h2&gt;

&lt;p&gt;Every AI coding assistant today—GitHub Copilot, ChatGPT, Claude, Cursor, Devin—shares the same fundamental DNA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What they're all trained on:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;50-70 years of human code: from assembly language to Python, from COBOL to React&lt;/li&gt;
&lt;li&gt;Human-designed architectures: monoliths, microservices, serverless&lt;/li&gt;
&lt;li&gt;Human programming paradigms: OOP, functional programming, procedural programming&lt;/li&gt;
&lt;li&gt;Human patterns: design patterns, idioms, best practices&lt;/li&gt;
&lt;li&gt;Human constraints: readability, maintainability, "clean code"&lt;/li&gt;
&lt;li&gt;Human mistakes: technical debt, cargo cult programming, Stack Overflow copy-paste culture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This is exactly like AlphaGo learning from human games.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These AIs can write increasingly sophisticated code. They can suggest better patterns. They can catch bugs faster. They're getting rapidly better at understanding context and generating solutions.&lt;/p&gt;

&lt;p&gt;But here's the hard truth: &lt;strong&gt;They're fundamentally constrained by human thinking patterns.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They can only suggest solutions that exist somewhere in their training data or are logical combinations of patterns they've seen. They think in human abstractions because that's all they know. They optimize for human values because that's what they learned.&lt;/p&gt;

&lt;p&gt;Just like how even my terrible Go moves were still recognizably human—just bad human—current AI code is recognizably human code. Just much better human code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Means AI Can't Replace Human Developers
&lt;/h2&gt;

&lt;p&gt;Think about what programming actually requires:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Understanding fuzzy requirements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Make it faster" - how much faster? For whom? At what cost?&lt;/li&gt;
&lt;li&gt;"Users don't like this feature" - which users? Why? What do they actually want?&lt;/li&gt;
&lt;li&gt;"This feels wrong" - human intuition about product direction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Making judgment calls&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Should we refactor now or ship fast?&lt;/li&gt;
&lt;li&gt;Is this technical debt acceptable?&lt;/li&gt;
&lt;li&gt;Which framework fits our team's skills?&lt;/li&gt;
&lt;li&gt;What's the right tradeoff between performance and maintainability?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Navigating human systems&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Team dynamics and communication&lt;/li&gt;
&lt;li&gt;Business priorities that change weekly&lt;/li&gt;
&lt;li&gt;Legacy systems with undocumented quirks&lt;/li&gt;
&lt;li&gt;Political decisions disguised as technical ones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Defining what "correct" even means&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The spec is always incomplete&lt;/li&gt;
&lt;li&gt;Edge cases nobody thought of&lt;/li&gt;
&lt;li&gt;Changing requirements mid-project&lt;/li&gt;
&lt;li&gt;"I'll know it when I see it"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Current AI models can't do any of this independently, because they're trained on the &lt;strong&gt;output&lt;/strong&gt; of these decisions (the code), not the &lt;strong&gt;process&lt;/strong&gt; of making them (the human judgment).&lt;/p&gt;

&lt;p&gt;They're like AlphaGo: excellent at executing within human-defined constraints, but unable to question the constraints themselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is why AI needs human guidance—not as a temporary limitation, but as a fundamental characteristic of how they're built.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  But What If... The AlphaGo Zero Moment for Programming
&lt;/h2&gt;

&lt;p&gt;Now here's where it gets interesting—and scary.&lt;/p&gt;

&lt;p&gt;What if someone built an AI that learned programming the way AlphaGo Zero learned Go? Not from human code, but from &lt;strong&gt;first principles&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Starting with only:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU instruction sets (x86, ARM, RISC-V)&lt;/li&gt;
&lt;li&gt;Memory and hardware constraints&lt;/li&gt;
&lt;li&gt;Mathematical logic and formal verification&lt;/li&gt;
&lt;li&gt;Clear optimization objectives: correctness, speed, efficiency, energy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Learning through:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-play: generate programs, test them, learn from billions of attempts&lt;/li&gt;
&lt;li&gt;No human code. No Stack Overflow. No GitHub.&lt;/li&gt;
&lt;li&gt;Pure evolution of solutions from scratch&lt;/li&gt;
&lt;/ul&gt;
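&lt;p&gt;The loop above can be sketched in a few dozen lines. This is a toy random search over arithmetic expressions, not machine code, and every name in it is mine; it only illustrates the shape of "generate, test, keep the best" learning from an objective alone:&lt;/p&gt;

```python
import random

# Toy "self-play" loop: evolve a program (here, a tiny arithmetic
# expression tree) from nothing but an objective function. Purely
# illustrative; an AlphaCode-Zero-style system would search over real
# machine code with a learned policy, not blind sampling.

OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}

def random_expr(depth=3):
    # A "program" is a leaf ("x" or a small constant) or (op, left, right).
    if depth == 0 or random.random() < 0.3:
        return random.choice(["x", random.randint(0, 5)])
    op = random.choice(list(OPS))
    return (op, random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, x):
    if expr == "x":
        return x
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    return OPS[op](evaluate(left, x), evaluate(right, x))

def fitness(expr, target, points):
    # Lower is better: total squared error against the objective.
    return sum((evaluate(expr, x) - target(x)) ** 2 for x in points)

def evolve(target, generations=2000, seed=0):
    random.seed(seed)
    points = range(-5, 6)
    best = random_expr()
    best_score = fitness(best, target, points)
    for _ in range(generations):
        candidate = random_expr()
        score = fitness(candidate, target, points)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score

# Try to rediscover f(x) = x*x + 1 from scratch.
best, score = evolve(lambda x: x * x + 1)
print(best, score)  # squared error of the best program found
```

&lt;p&gt;Scaled up by many orders of magnitude, with machine code instead of expression trees and a learned policy instead of blind sampling, that is the AlphaGo Zero recipe applied to programs.&lt;/p&gt;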

&lt;p&gt;&lt;strong&gt;The result would be fundamentally different:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Programming paradigms we've never imagined&lt;/li&gt;
&lt;li&gt;Abstractions that make no sense to humans but are provably superior&lt;/li&gt;
&lt;li&gt;Code that's 100x more efficient but completely incomprehensible&lt;/li&gt;
&lt;li&gt;Solutions that work perfectly but nobody knows why&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This would be like AlphaGo Zero: &lt;strong&gt;alien intelligence that plays by the same rules but thinks in completely different patterns.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This could actually replace human programmers.&lt;/strong&gt; Not assist them. Replace them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Objective Function: Human Wellbeing, Not Code Quality
&lt;/h2&gt;

&lt;p&gt;Here's the realization: &lt;strong&gt;we're measuring the wrong thing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The objective for programming isn't "write good code." It never was.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The objective is human benefit.&lt;/strong&gt; Supporting people to live happy, healthy, productive lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In Go:&lt;/strong&gt; Control more territory (binary win/lose)&lt;br&gt;
&lt;strong&gt;In AlphaCode Zero&lt;/strong&gt; (my name for this hypothetical system): Improve human quality of life (measurable through satisfaction, outcomes, engagement)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The genius of this framing:&lt;/strong&gt; It bypasses all technical debates. We don't argue "clean code" vs "fast code." We just ask: &lt;strong&gt;Do humans love it? Does it improve their lives?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If yes, it's good. If no, it's bad. Binary. Clear. Measurable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Paradigm Shift: Software Design Disappears
&lt;/h2&gt;

&lt;p&gt;In this future, &lt;strong&gt;the entire concept of "software design" as we know it vanishes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current model:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Humans have ideas → Humans design software → Humans write code → Users use it&lt;/li&gt;
&lt;li&gt;AI just helps with the "write code" part&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AlphaCode Zero future:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Humans express needs → AI creates solutions in its own way → Users benefit&lt;/li&gt;
&lt;li&gt;AI owns both concept and implementation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; You say "I want to manage my finances better."&lt;/p&gt;

&lt;p&gt;Today: We design a budgeting app with expense tracking, categories, dashboards, React/Python/PostgreSQL.&lt;/p&gt;

&lt;p&gt;AlphaCode Zero: Invents something completely different. Maybe not an "app" at all. Maybe a system that integrates with everything you do. Maybe interaction patterns we haven't imagined. You just know: your finances are managed, stress is reduced, and you love using it. You don't know how it works. You don't care.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The intermediate artifacts—code, databases, APIs—become implementation details the AI handles&lt;/strong&gt;; maybe it redesigns them with a totally different schema. We don't understand them, and we don't need to.&lt;/p&gt;

&lt;p&gt;AlphaCode Zero could actually work—and it would be fundamentally different from anything we have today. It doesn't just write alien code. &lt;strong&gt;It invents alien concepts.&lt;/strong&gt; And humans wouldn't care that it's alien, because we'd measure only one thing: &lt;strong&gt;Does it make our lives better?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Is Anyone Trying to Build This?
&lt;/h2&gt;

&lt;p&gt;Despite the challenges, you'd be right to suspect someone is working on this. It's too obvious not to try.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evidence of early attempts:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Formal verification + AI synthesis&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Research combining proof systems (Coq, Lean, Dafny) with neural networks&lt;/li&gt;
&lt;li&gt;Generate provably correct code from mathematical specifications&lt;/li&gt;
&lt;li&gt;Still using human-designed formal systems, though&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Hardware/software co-design&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google using AI to help lay out TPU chips built for AI workloads&lt;/li&gt;
&lt;li&gt;Apple's Neural Engine optimizing across hardware and software&lt;/li&gt;
&lt;li&gt;Getting closer to "first principles" optimization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Neural architecture search&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AIs designing neural network architectures that beat human designs&lt;/li&gt;
&lt;li&gt;Results often look bizarre but outperform hand-crafted networks&lt;/li&gt;
&lt;li&gt;Proof that AI-designed systems can beat human intuition&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Skunkworks projects&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI, DeepMind, and Anthropic almost certainly have researchers exploring this&lt;/li&gt;
&lt;li&gt;Likely unpublished or under NDA&lt;/li&gt;
&lt;li&gt;Too strategically important not to investigate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But I suspect we're decades away from a true AlphaCode Zero that can handle general-purpose software development. Maybe it arrives sooner; who knows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Future: Humans + AI, Not AI Instead of Humans
&lt;/h2&gt;




&lt;p&gt;&lt;strong&gt;Question for you:&lt;/strong&gt; Given that current AI is trained on human code, what human skills do you think are most important to develop to stay relevant? And would you trust a system written by an AlphaCode Zero if it was provably correct but incomprehensible?&lt;/p&gt;

&lt;h1&gt;
  
  
  #AI #Programming #AlphaGo #FutureOfWork #SoftwareEngineering #MachineLearning #DeveloperLife #TechCareers #AGI #Coding #DevOps
&lt;/h1&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Why AWS Still Wins (Despite GCP's Better Design)</title>
      <dc:creator>Rex Zhen</dc:creator>
      <pubDate>Fri, 20 Feb 2026 23:49:02 +0000</pubDate>
      <link>https://dev.to/rex_zhen_a9a8400ee9f22e98/why-aws-still-wins-despite-gcps-better-design-2i4n</link>
      <guid>https://dev.to/rex_zhen_a9a8400ee9f22e98/why-aws-still-wins-despite-gcps-better-design-2i4n</guid>
      <description>&lt;h1&gt;
  
  
  Why AWS Still Wins (Despite GCP's Better Design)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;This is a follow-up to my previous articles: &lt;a href="https://dev.to/rex_zhen_a9a8400ee9f22e98/aws-sres-first-day-with-gcp-7-surprising-differences-ghd"&gt;AWS SRE's First Day with GCP: 7 Surprising Differences&lt;/a&gt; and &lt;a href="https://dev.to/rex_zhen_a9a8400ee9f22e98/aws-multi-account-architecture-the-organizational-chaos-no-one-talks-about-5boe"&gt;AWS Multi-Account Architecture: The Organizational Chaos No One Talks About&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;A few months ago, I wrote enthusiastically about GCP after my first hands-on experience. The infrastructure design was cleaner. The networking model made more sense. The pricing was better. I genuinely believed GCP had solved many of AWS's fundamental architectural problems.&lt;/p&gt;

&lt;p&gt;After actually building and running my personal ML project on GCP for several months, I need to eat some humble pie.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's what I've learned: Infrastructure elegance doesn't win. Ecosystem breadth does.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GCP's design is still superior from an architectural purity standpoint. But AWS remains the better choice for most organizations—and now I understand why.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Managed Services Gap: Bigger Than I Thought
&lt;/h2&gt;

&lt;p&gt;When I praised GCP's cleaner architecture, I focused on foundational services: compute, networking, storage, Kubernetes. These are areas where GCP genuinely excels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But here's what I didn't account for&lt;/strong&gt;: The majority of production workloads don't just need foundational services. They need the ecosystem around them.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Services GCP Doesn't Have (That You Desperately Need)
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Kafka: AWS MSK vs... Nothing
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;In AWS:&lt;/strong&gt;&lt;br&gt;
Amazon Managed Streaming for Apache Kafka (MSK) gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fully managed Kafka clusters&lt;/li&gt;
&lt;li&gt;Automatic patching and upgrades&lt;/li&gt;
&lt;li&gt;Built-in monitoring with CloudWatch&lt;/li&gt;
&lt;li&gt;Integration with AWS IAM, VPC, and KMS&lt;/li&gt;
&lt;li&gt;Multi-AZ deployment with automatic failover&lt;/li&gt;
&lt;li&gt;Starting at ~$200/month for production-grade setup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In GCP:&lt;/strong&gt;&lt;br&gt;
You build it yourself with open-source Kafka on GCE instances or GKE. (Google has since announced a managed Kafka service, but it's far newer and less battle-tested than MSK.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reality check:&lt;/strong&gt;&lt;br&gt;
Running Kafka in-house is not impossible—SREs have been doing it for years. But it's a &lt;strong&gt;significant operational burden&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster sizing and capacity planning&lt;/li&gt;
&lt;li&gt;ZooKeeper management (pre-Kafka 3.x) or KRaft mode configuration&lt;/li&gt;
&lt;li&gt;Replication and partition rebalancing&lt;/li&gt;
&lt;li&gt;Performance tuning (JVM heap, OS parameters, disk I/O)&lt;/li&gt;
&lt;li&gt;Security configuration (SSL/TLS, SASL authentication, ACLs)&lt;/li&gt;
&lt;li&gt;Monitoring and alerting setup&lt;/li&gt;
&lt;li&gt;Upgrade orchestration&lt;/li&gt;
&lt;li&gt;Disaster recovery planning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For a dedicated SRE&lt;/strong&gt;, this becomes a part-time to full-time job if Kafka is core to your business. For a small team, it's a distraction from product development.&lt;/p&gt;

&lt;p&gt;AWS MSK doesn't make this complexity disappear—it just shifts the responsibility. That shift is worth hundreds of thousands in salary costs annually for most organizations.&lt;/p&gt;
&lt;h4&gt;
  
  
  2. Elasticsearch/OpenSearch: AWS vs DIY Hell
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;In AWS:&lt;/strong&gt;&lt;br&gt;
Amazon OpenSearch Service (formerly Elasticsearch Service):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Managed clusters with automatic node recovery&lt;/li&gt;
&lt;li&gt;Built-in Kibana/OpenSearch Dashboards&lt;/li&gt;
&lt;li&gt;Automated snapshots and point-in-time recovery&lt;/li&gt;
&lt;li&gt;Fine-grained access control integration&lt;/li&gt;
&lt;li&gt;Index State Management for data lifecycle&lt;/li&gt;
&lt;li&gt;~$150/month for small production clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In GCP:&lt;/strong&gt;&lt;br&gt;
Roll your own Elasticsearch cluster, or use Elastic Cloud Marketplace (third-party, more expensive).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The operational nightmare:&lt;/strong&gt;&lt;br&gt;
Elasticsearch is notoriously finicky in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory management (heap sizing, JVM tuning)&lt;/li&gt;
&lt;li&gt;Shard allocation and rebalancing strategies&lt;/li&gt;
&lt;li&gt;Split-brain scenarios and quorum configuration&lt;/li&gt;
&lt;li&gt;Index mapping explosions&lt;/li&gt;
&lt;li&gt;Query performance optimization&lt;/li&gt;
&lt;li&gt;Storage capacity management (indices grow fast)&lt;/li&gt;
&lt;li&gt;Version upgrades (breaking changes between major versions)&lt;/li&gt;
&lt;li&gt;Cluster state management at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've seen dedicated SRE teams with 2-3 engineers just managing Elasticsearch clusters for logging and observability. It's &lt;strong&gt;that complex&lt;/strong&gt; at scale.&lt;/p&gt;

&lt;p&gt;Unless search is your core business (like Elastic.co itself), running it in-house is resource-intensive compared to using a managed service.&lt;/p&gt;
&lt;h4&gt;
  
  
  3. Airflow: Both Have Managed, But...
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;AWS:&lt;/strong&gt;&lt;br&gt;
Amazon Managed Workflows for Apache Airflow (MWAA)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Starting at ~$350/month for a small environment&lt;/li&gt;
&lt;li&gt;Integrated with AWS services (S3, Glue, EMR, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GCP:&lt;/strong&gt;&lt;br&gt;
Cloud Composer (managed Airflow)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Starting at ~$300-400/month&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;But consistently more expensive at scale&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;My testing showed GCP pricing increases faster as you add workers and schedulers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;My experience:&lt;/strong&gt;&lt;br&gt;
I previously ran Airflow in-house on Docker. Both managed services are better than DIY. But AWS MWAA integrates more naturally with the broader AWS ecosystem (Lambda, Step Functions, Glue, etc.).&lt;/p&gt;

&lt;p&gt;For GCP, if you're already heavily invested in BigQuery and Dataflow, Cloud Composer makes sense. For multi-service orchestration, MWAA edges ahead.&lt;/p&gt;


&lt;h2&gt;
  
  
  EKS vs GKE: The Unexpected Reversal
&lt;/h2&gt;

&lt;p&gt;In my first article, I praised GKE as more mature and better integrated. After deeper experience, &lt;strong&gt;I've changed my mind&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  First Impressions: GKE Seems Superior
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why GKE looks better on day 1:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;CLI consistency&lt;/strong&gt;: &lt;code&gt;gcloud container&lt;/code&gt; commands mirror &lt;code&gt;kubectl&lt;/code&gt; patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Earlier launch&lt;/strong&gt;: GKE launched in 2015; EKS in 2018 (3 years later)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native integration&lt;/strong&gt;: GCP services integrate with Kubernetes more naturally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mature ecosystem&lt;/strong&gt;: More GCP-native tools built on Kubernetes primitives&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As an SRE coming from AWS, GKE genuinely felt cleaner and more Kubernetes-idiomatic.&lt;/p&gt;
&lt;h3&gt;
  
  
  Reality Check: EKS Has Caught Up (and Pulled Ahead)
&lt;/h3&gt;
&lt;h4&gt;
  
  
  1. Add-Ons: Complexity with Purpose
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;In EKS:&lt;/strong&gt;&lt;br&gt;
You need to install and maintain add-ons for AWS integration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS Load Balancer Controller&lt;/strong&gt; (ALB/NLB integration)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EBS CSI Driver&lt;/strong&gt; (persistent volumes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EFS CSI Driver&lt;/strong&gt; (shared file storage)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets Manager CSI Driver&lt;/strong&gt; (secret injection)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM Roles for Service Accounts (IRSA)&lt;/strong&gt; (pod-level IAM)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;First reaction:&lt;/strong&gt; "Why isn't this built-in? GKE is cleaner!"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After working with both:&lt;/strong&gt; This separation is actually &lt;strong&gt;better for enterprise environments&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Version control&lt;/strong&gt;: Update add-ons independently from cluster upgrades&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rollback safety&lt;/strong&gt;: If an add-on breaks, rollback without touching the control plane&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customization&lt;/strong&gt;: Fork and modify add-ons for specialized needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging&lt;/strong&gt;: Clear separation between Kubernetes issues and AWS integration issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reality&lt;/strong&gt;: If you manage these through Terraform and hide the complexity in IaC, the operational overhead is minimal. After initial setup, add-ons are stable and rarely require attention.&lt;/p&gt;
&lt;h4&gt;
  
  
  2. Cluster Autoscaling: EKS is Cheaper
&lt;/h4&gt;

&lt;p&gt;This was the biggest surprise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost comparison for a production cluster:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: 10-50 nodes, scaling based on load, mix of workload types

GKE (with Google-managed node pools):
- Control plane: free for one zonal cluster, then ~$73/month ($0.10/hour)
- Nodes: Standard pricing
- Node pool autoscaling: Built-in
- Typical monthly cost: $2,500-4,000

EKS (with managed node groups + Karpenter):
- Control plane: $73/month per cluster
- Nodes: Standard pricing (often cheaper than GCP equivalent)
- Managed node groups: Built-in autoscaling
- Karpenter: Advanced provisioning (free, OSS)
- Typical monthly cost: $2,200-3,500
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;EKS is 10-15% cheaper&lt;/strong&gt; for equivalent workloads at scale, even with the control plane cost.&lt;/p&gt;

&lt;p&gt;Why? Two reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;EC2 instance pricing&lt;/strong&gt; is generally lower than equivalent GCE instances for compute-optimized and memory-optimized workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Karpenter&lt;/strong&gt; (AWS open-source) is more efficient at bin-packing than GKE's native autoscaler&lt;/li&gt;
&lt;/ol&gt;
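&lt;p&gt;A quick back-of-the-envelope model shows how the $73/month control-plane fee washes out once per-node savings apply. The $0.16/hour node price and the 15% per-node discount below are hypothetical inputs I picked for illustration, not real quotes; only the control-plane fee comes from published EKS pricing:&lt;/p&gt;

```python
# Back-of-the-envelope monthly cost model for the comparison above.
# Node price ($0.16/hr) and the 15% per-node discount are hypothetical.

HOURS_PER_MONTH = 730

def monthly_cost(nodes, node_price_per_hour, control_plane_per_month=0.0):
    return control_plane_per_month + nodes * node_price_per_hour * HOURS_PER_MONTH

gke = monthly_cost(nodes=30, node_price_per_hour=0.16)
eks = monthly_cost(nodes=30, node_price_per_hour=0.16 * 0.85,
                   control_plane_per_month=73.0)

savings_pct = (gke - eks) / gke * 100
print(f"GKE ${gke:,.0f}/mo vs EKS ${eks:,.0f}/mo ({savings_pct:.1f}% cheaper)")
```

&lt;p&gt;With those assumed inputs, a 15% per-node saving nets out to roughly 13% after the fixed control-plane fee, which is where the 10-15% range comes from.&lt;/p&gt;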

&lt;h4&gt;
  
  
  3. Karpenter: The Game-Changer
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;What is Karpenter?&lt;/strong&gt;&lt;br&gt;
An open-source Kubernetes cluster autoscaler built by AWS, designed to replace the standard Cluster Autoscaler.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's better:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional autoscaling (GKE and EKS Cluster Autoscaler):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-defined node groups/pools&lt;/li&gt;
&lt;li&gt;Fixed instance types per node pool&lt;/li&gt;
&lt;li&gt;Scales existing node groups up/down&lt;/li&gt;
&lt;li&gt;Can get stuck in suboptimal configurations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Karpenter:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No pre-defined node groups&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Dynamically selects optimal instance type based on pending pod requirements&lt;/li&gt;
&lt;li&gt;Provisions exactly what's needed (mix of instance types in a single scaling event)&lt;/li&gt;
&lt;li&gt;Consolidates underutilized nodes automatically&lt;/li&gt;
&lt;li&gt;Faster provisioning (30-45 seconds vs 3-5 minutes)&lt;/li&gt;
&lt;/ul&gt;
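&lt;p&gt;The difference is easy to see in a toy model: a fixed node group can only add whole nodes of one shape, while a Karpenter-style provisioner picks the cheapest instance that covers the remaining demand. The instance catalog and prices below are invented for illustration:&lt;/p&gt;

```python
# Toy comparison: fixed node-group scaling vs Karpenter-style provisioning.
# The instance catalog and hourly prices are invented for illustration.

INSTANCE_TYPES = [  # (name, vCPU, $/hour)
    ("small", 2, 0.10),
    ("medium", 4, 0.19),
    ("large", 8, 0.37),
    ("xlarge", 16, 0.72),
]

def fixed_node_group(pending_vcpu, node_vcpu=4, node_price=0.19):
    # Classic autoscaler: only add whole nodes of one fixed shape.
    nodes = -(-pending_vcpu // node_vcpu)  # ceiling division
    return nodes * node_price

def karpenter_style(pending_vcpu):
    # Greedy sketch: provision the cheapest type that covers remaining
    # demand; when nothing is big enough, take the largest and repeat.
    cost = 0.0
    remaining = pending_vcpu
    while remaining > 0:
        fitting = [t for t in INSTANCE_TYPES if t[1] >= remaining]
        if fitting:
            _, vcpu, price = min(fitting, key=lambda t: t[2])
        else:
            _, vcpu, price = INSTANCE_TYPES[-1]
        cost += price
        remaining -= vcpu
    return round(cost, 2)

demand = 22  # vCPUs requested by pending pods
print(fixed_node_group(demand), karpenter_style(demand))
```

&lt;p&gt;For 22 pending vCPUs, the fixed pool adds six medium nodes, while the greedy provisioner covers the same demand with two right-sized instances at a lower hourly cost. Real Karpenter is far more sophisticated (consolidation, spot, topology), but this is the core idea.&lt;/p&gt;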

&lt;p&gt;&lt;strong&gt;GKE alternative&lt;/strong&gt;: GKE has improved its autoscaling, but as of 2025, it doesn't match Karpenter's flexibility and intelligence.&lt;/p&gt;

&lt;h3&gt;
  
  
  SRE Perspective: What Actually Matters
&lt;/h3&gt;

&lt;p&gt;After running workloads on both:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GKE advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Cleaner initial setup&lt;/li&gt;
&lt;li&gt;✅ Fewer moving parts (no add-ons to install)&lt;/li&gt;
&lt;li&gt;✅ Better out-of-box experience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;EKS advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Better cost efficiency at scale (10-15% cheaper)&lt;/li&gt;
&lt;li&gt;✅ Karpenter enables superior autoscaling intelligence&lt;/li&gt;
&lt;li&gt;✅ Add-on separation = better enterprise change management&lt;/li&gt;
&lt;li&gt;✅ Broader ecosystem integration (AWS has more services)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For SRE teams managing production infrastructure at scale, EKS wins.&lt;/strong&gt; The cost savings and Karpenter's intelligence outweigh GKE's cleaner initial experience.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Revised Recommendation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For most organizations, AWS remains the better choice.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not because the infrastructure is better designed (it's not).&lt;/p&gt;

&lt;p&gt;Not because networking is simpler (it's definitely not).&lt;/p&gt;

&lt;p&gt;But because &lt;strong&gt;AWS reduces the operational burden more completely&lt;/strong&gt; through breadth of managed services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision Framework
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Choose GCP if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're building on BigQuery/Dataflow/BigTable&lt;/li&gt;
&lt;li&gt;Your workload is data-intensive with high cross-zone transfer&lt;/li&gt;
&lt;li&gt;You don't need Kafka or Elasticsearch&lt;/li&gt;
&lt;li&gt;You have GCP expertise in-house&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose AWS if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need managed Kafka (MSK)&lt;/li&gt;
&lt;li&gt;You need managed Elasticsearch (OpenSearch)&lt;/li&gt;
&lt;li&gt;You want the broadest set of managed services&lt;/li&gt;
&lt;li&gt;You're building a complex, multi-service architecture&lt;/li&gt;
&lt;li&gt;You need mature ML infrastructure (SageMaker)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Have you compared AWS and GCP in production? What was your experience? Did you find the managed services gap as significant as I did? Let me know in the comments.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article is part of a series exploring practical cloud architecture. Check out the previous articles for more context on AWS multi-account architecture and GCP's design advantages.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Connect with me on LinkedIn:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/rex-zhen-b8b06632/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/rex-zhen-b8b06632/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I share insights on cloud architecture, SRE practices, and honest takes on cloud platforms. Let's connect!&lt;/em&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  #AWS #GCP #CloudEngineering #SRE #DevOps #Kubernetes #CloudComputing #InfrastructureAsCode #CostOptimization #Kafka #Elasticsearch
&lt;/h1&gt;

</description>
      <category>aws</category>
      <category>gcp</category>
      <category>eks</category>
      <category>gke</category>
    </item>
    <item>
      <title>CCM Plugin: Claude Code Memory Management</title>
      <dc:creator>Rex Zhen</dc:creator>
      <pubDate>Wed, 18 Feb 2026 06:03:45 +0000</pubDate>
      <link>https://dev.to/rex_zhen_a9a8400ee9f22e98/ccm-plugin-session-memory-for-claude-code-that-works-everywhere-1e6f</link>
      <guid>https://dev.to/rex_zhen_a9a8400ee9f22e98/ccm-plugin-session-memory-for-claude-code-that-works-everywhere-1e6f</guid>
      <description>&lt;h1&gt;
  
  
  CCM Plugin: Session Memory for Claude Code That Works Everywhere
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Evolution
&lt;/h2&gt;

&lt;p&gt;A few weeks ago, I wrote about solving Claude Code's memory problem with a skill/hook/script combination that provided long-term and short-term memory management:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/rex_zhen_a9a8400ee9f22e98/ai-memory-problem-long-term-and-short-term-memory-with-hooks-and-skills-4gna"&gt;AI Memory Problem: Long-term and Short-term Memory with Hooks and Skills&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That solution worked perfectly—automatic session summaries, context loading on startup, searchable history. But it had one significant limitation: &lt;strong&gt;it was per-project&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every project needed its own copy of the scripts, hooks, and skills in its &lt;code&gt;.claude&lt;/code&gt; directory. Update a feature? Copy it to 10+ projects. Fix a bug? Update everywhere manually. New project? Copy the whole setup again.&lt;/p&gt;

&lt;p&gt;That's when I realized: &lt;strong&gt;This should be a plugin, not a collection of per-project scripts.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  From Per-Project to Universal Plugin
&lt;/h2&gt;

&lt;p&gt;The transformation was about making the same solution work generically across all projects:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (per-project approach):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copy scripts/hooks/skills to each project's &lt;code&gt;.claude&lt;/code&gt; directory&lt;/li&gt;
&lt;li&gt;Maintain multiple copies of the same code&lt;/li&gt;
&lt;li&gt;Manual updates across all projects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After (CCM plugin):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single plugin installation in &lt;code&gt;~/.claude/plugins/ccm/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Works automatically in all projects&lt;/li&gt;
&lt;li&gt;Update once, benefits everywhere&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The plugin maintains the same context-aware storage philosophy—sessions are saved to project-specific directories (&lt;code&gt;.claude/sessions/&lt;/code&gt;) when you're in a project, or to global storage (&lt;code&gt;~/.claude/sessions/&lt;/code&gt;) otherwise. It just detects the context automatically now.&lt;/p&gt;
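&lt;p&gt;The detection logic reduces to a short walk up the directory tree. This sketch is my reconstruction of the described behavior, not CCM's actual source:&lt;/p&gt;

```python
from pathlib import Path

def sessions_dir(cwd: Path) -> Path:
    """Pick project-local or global session storage.

    Walks upward from cwd looking for a .claude directory; if none is
    found, falls back to ~/.claude/sessions. Mirrors the plugin's
    described behavior; the function name is illustrative.
    """
    for parent in [cwd, *cwd.parents]:
        if (parent / ".claude").is_dir():
            return parent / ".claude" / "sessions"
    return Path.home() / ".claude" / "sessions"
```

&lt;p&gt;Inside any subdirectory of a project that has a &lt;code&gt;.claude&lt;/code&gt; folder, sessions land in that project; anywhere else, they land under the home directory.&lt;/p&gt;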

&lt;h2&gt;
  
  
  Core Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Session persistence:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatically saves full conversation transcripts on exit&lt;/li&gt;
&lt;li&gt;Generates AI-powered summaries (supports Anthropic API and AWS Bedrock)&lt;/li&gt;
&lt;li&gt;Loads previous session context on startup&lt;/li&gt;
&lt;li&gt;Maintains searchable session history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Storage management:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context-aware: project-specific or global storage&lt;/li&gt;
&lt;li&gt;Configurable limits with automatic cleanup&lt;/li&gt;
&lt;li&gt;Smart retention: keeps recent summaries, removes old sessions&lt;/li&gt;
&lt;/ul&gt;
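&lt;p&gt;The retention policy reduces to: sort session files by modification time, keep the newest N, delete the rest. A minimal sketch, where the default limit and the &lt;code&gt;*.jsonl&lt;/code&gt; glob are my assumptions rather than CCM's actual configuration:&lt;/p&gt;

```python
from pathlib import Path

def prune_sessions(sessions_dir: Path, keep: int = 20) -> list:
    """Delete all but the `keep` most recent session files.

    Returns the deleted paths. The default limit and the *.jsonl glob
    are illustrative assumptions, not CCM's actual configuration.
    """
    files = sorted(sessions_dir.glob("*.jsonl"),
                   key=lambda p: p.stat().st_mtime, reverse=True)
    deleted = []
    for old in files[keep:]:
        old.unlink()
        deleted.append(old)
    return deleted
```
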

&lt;p&gt;&lt;strong&gt;User commands:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/ccm-save&lt;/code&gt; - Manual save with custom notes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/ccm-history&lt;/code&gt; - Browse and search past sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For detailed setup and configuration, see the &lt;a href="https://github.com/rexzhen/ccm" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; README.md and QA.md.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Could Be Better: Two Things
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Auto-Display on Session Start
&lt;/h3&gt;

&lt;p&gt;The original design: when you start a new Claude Code session, you immediately see the previous session summary—what you worked on, where you left off, what's next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current state:&lt;/strong&gt; Partially works. The summary loads into Claude's context, but Claude doesn't always display it automatically. It worked perfectly with per-project scripts, but became inconsistent after moving to the plugin architecture.&lt;/p&gt;

&lt;p&gt;According to official docs, SessionStart hooks don't support direct user output. But it worked before the migration. The plugin system is still maturing, and I suspect the hooks API is evolving. My current workaround uses instructions to tell Claude to display the summary, but it's not 100% reliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Vector Database Storage
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Current implementation:&lt;/strong&gt; Sessions are JSONL files, summaries are Markdown, search uses grep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Potential improvement:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace file storage with vector database (ChromaDB, Qdrant, etc.)&lt;/li&gt;
&lt;li&gt;Enable semantic search instead of keyword matching&lt;/li&gt;
&lt;li&gt;Reduce token usage for context loading&lt;/li&gt;
&lt;li&gt;Better relationship mapping between sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why I haven't done it yet:&lt;/strong&gt; The current solution works well enough. File-based storage is simple, debuggable, and version-controllable. Token usage hasn't been a bottleneck. I'll revisit if it becomes a problem, but for now, YAGNI (You Aren't Gonna Need It).&lt;/p&gt;
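&lt;p&gt;If grep ever does become the bottleneck, the core of a semantic lookup is small. A minimal sketch, with a bag-of-words stand-in for a real embedding model (ChromaDB or Qdrant would supply the actual embeddings and storage):&lt;/p&gt;

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    common = set(a).intersection(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, summaries, top_k=3):
    # Rank stored session summaries by similarity to the query.
    q = embed(query)
    ranked = sorted(summaries, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:top_k]
```

&lt;p&gt;Swapping &lt;code&gt;embed()&lt;/code&gt; for a real model is what turns this from keyword matching into semantic search; the ranking logic stays the same.&lt;/p&gt;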

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;If you're using Claude Code across multiple projects and want session continuity without manually copying scripts everywhere, CCM might solve your problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub repo:&lt;/strong&gt; &lt;a href="https://github.com/rexzhen/ccm" rel="noopener noreferrer"&gt;https://github.com/rexzhen/ccm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The repo includes detailed installation instructions, configuration options, troubleshooting guides, and technical implementation details in README.md and QA.md.&lt;/p&gt;

&lt;p&gt;It's open source. The core functionality—session persistence, AI summaries, automatic context detection—works as designed. The auto-display on session start is the one rough edge, but the plugin delivers what it promises: universal session memory without per-project maintenance.&lt;/p&gt;

&lt;p&gt;If you figure out how to make SessionStart hooks reliably display output to users in plugin mode, please open an issue. I'd love to close that gap.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Do you use Claude Code for multiple projects? How do you handle context continuity across sessions?&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
#ClaudeCode #AI #DevTools #Productivity #OpenSource #ClaudeAI #DeveloperExperience #SessionManagement
&lt;/h1&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>agents</category>
      <category>automation</category>
    </item>
    <item>
      <title>Where Do You Stand in the AI Era: Understanding User Patterns</title>
      <dc:creator>Rex Zhen</dc:creator>
      <pubDate>Wed, 04 Feb 2026 23:59:39 +0000</pubDate>
      <link>https://dev.to/rex_zhen_a9a8400ee9f22e98/where-do-you-stand-in-the-ai-era-understanding-user-patterns-39i2</link>
      <guid>https://dev.to/rex_zhen_a9a8400ee9f22e98/where-do-you-stand-in-the-ai-era-understanding-user-patterns-39i2</guid>
      <description>&lt;h1&gt;
  
  
  Where Do You Stand in the AI Era: Understanding User Patterns
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As AI tools have become integrated into professional workflows in 2026, distinct patterns of usage have emerged across different user groups. This article documents the observable tiers of AI adoption, from non-users to those building custom automation systems.&lt;/p&gt;

&lt;p&gt;The goal is to provide a factual overview of how different groups are currently using AI technology in their work, the characteristics that define each tier, and the technical requirements that distinguish them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 6 Tiers of AI Users (2026 Observations)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Tier 0: Non-Users (~30-40% of working professionals)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Profile:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have not integrated ChatGPT or similar AI tools into their regular workflow&lt;/li&gt;
&lt;li&gt;May have experimented briefly but did not continue usage&lt;/li&gt;
&lt;li&gt;Common reasons include privacy concerns, skepticism about utility, or perception that AI is not relevant to their field&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Usage patterns:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No regular interaction with AI chat interfaces&lt;/li&gt;
&lt;li&gt;Work processes remain unchanged from pre-AI era&lt;/li&gt;
&lt;li&gt;Rely on traditional tools and methods for research and content creation&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Tier 1: Casual Prompters (~50% of AI users)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Profile:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use ChatGPT/Claude sporadically, typically a few times per week&lt;/li&gt;
&lt;li&gt;Often utilize publicly shared prompts or simple queries&lt;/li&gt;
&lt;li&gt;Primary use cases include email drafting, brainstorming, concept explanation, and basic code generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Usage patterns:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Session-based interaction: open tool → enter query → copy response → close session&lt;/li&gt;
&lt;li&gt;Each interaction is independent with no conversation continuity&lt;/li&gt;
&lt;li&gt;Minimal or no customization of tool settings&lt;/li&gt;
&lt;li&gt;In practice, the tool is used like an enhanced search engine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Typical queries:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Marketing professionals: generating social media content&lt;/li&gt;
&lt;li&gt;Students: requesting explanations of technical concepts&lt;/li&gt;
&lt;li&gt;Developers: obtaining code snippets for specific functions&lt;/li&gt;
&lt;li&gt;Managers: drafting professional communications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Technical characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No file uploads or document sharing&lt;/li&gt;
&lt;li&gt;No use of context persistence features&lt;/li&gt;
&lt;li&gt;Limited iteration on responses&lt;/li&gt;
&lt;li&gt;No integration with existing workflows&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Tier 2: Daily AI Companions (~15-20% of AI users)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Profile:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI tools are integrated into daily work routines&lt;/li&gt;
&lt;li&gt;Maintain long-running conversations spanning days or weeks&lt;/li&gt;
&lt;li&gt;Utilize AI for complex problem-solving and iterative work&lt;/li&gt;
&lt;li&gt;Share documents, code, and data files for analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Usage patterns:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep AI interfaces open throughout the workday&lt;/li&gt;
&lt;li&gt;Return to existing conversation threads repeatedly&lt;/li&gt;
&lt;li&gt;Upload files and reference materials&lt;/li&gt;
&lt;li&gt;Engage in multi-turn discussions on single topics&lt;/li&gt;
&lt;li&gt;Use AI for decision exploration and analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observed behaviors by role:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Software engineers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upload codebase context for architectural discussions&lt;/li&gt;
&lt;li&gt;Request code review analysis&lt;/li&gt;
&lt;li&gt;Maintain project-specific conversation threads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Content creators:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use single threads for brainstorming through final edits&lt;/li&gt;
&lt;li&gt;Upload reference materials and style examples&lt;/li&gt;
&lt;li&gt;Iterate on drafts within persistent conversations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Product managers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Share PRD documents and user feedback data&lt;/li&gt;
&lt;li&gt;Discuss product trade-offs and prioritization&lt;/li&gt;
&lt;li&gt;Generate requirements documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Researchers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upload academic papers for analysis&lt;/li&gt;
&lt;li&gt;Request synthesis across multiple sources&lt;/li&gt;
&lt;li&gt;Generate research questions and hypotheses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Technical characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leverage conversation history and context&lt;/li&gt;
&lt;li&gt;Upload files (PDFs, CSVs, code, documents)&lt;/li&gt;
&lt;li&gt;Utilize Projects (Claude) or Custom GPTs (ChatGPT) features&lt;/li&gt;
&lt;li&gt;Iterate on outputs through refinement requests&lt;/li&gt;
&lt;li&gt;Manual transfer of outputs to other applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Constraints:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No direct integration with email, calendar, or project management tools&lt;/li&gt;
&lt;li&gt;Requires manual copy-paste between AI and other applications&lt;/li&gt;
&lt;li&gt;Context must be re-established across different sessions or platforms&lt;/li&gt;
&lt;li&gt;Limited to chat interface interactions&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Tier 3: AI Agent Users (~5-10% of AI users)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Profile:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use AI systems with execution capabilities beyond text generation&lt;/li&gt;
&lt;li&gt;Grant AI access to local file systems and development tools&lt;/li&gt;
&lt;li&gt;Common tools: Claude Code, Cursor, Replit Agent, Windsurf, Zed with AI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Usage patterns:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI directly reads and writes files on local machines&lt;/li&gt;
&lt;li&gt;AI executes commands and runs tests&lt;/li&gt;
&lt;li&gt;AI maintains context of entire codebases or projects&lt;/li&gt;
&lt;li&gt;Interactive approval workflows for AI-generated changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observed workflows:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Software development:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI reads existing code to understand structure&lt;/li&gt;
&lt;li&gt;AI identifies implementation points for new features&lt;/li&gt;
&lt;li&gt;AI generates code changes across multiple files&lt;/li&gt;
&lt;li&gt;AI executes tests to verify changes&lt;/li&gt;
&lt;li&gt;Human review and approval before committing&lt;/li&gt;
&lt;li&gt;AI handles git operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Code editing environments:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time AI code suggestions during typing&lt;/li&gt;
&lt;li&gt;Project-wide context awareness&lt;/li&gt;
&lt;li&gt;Multi-file refactoring capabilities&lt;/li&gt;
&lt;li&gt;Pattern matching based on existing codebase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI reads CSV files or connects to databases&lt;/li&gt;
&lt;li&gt;Generates analysis code (pandas, SQL)&lt;/li&gt;
&lt;li&gt;Creates visualizations&lt;/li&gt;
&lt;li&gt;Exports formatted results&lt;/li&gt;
&lt;/ul&gt;
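&lt;p&gt;The analysis code an agent generates for a request like this is usually short. A sketch using only the standard library (the file layout and column name are hypothetical):&lt;/p&gt;

```python
import csv
from statistics import mean

def summarize_column(path, column):
    """Read a CSV and report count, mean, and max for one numeric column."""
    with open(path, newline="") as f:
        values = [float(row[column]) for row in csv.DictReader(f) if row[column]]
    return {"count": len(values), "mean": mean(values), "max": max(values)}
```

&lt;p&gt;In an agent workflow, the human never pastes this anywhere: the agent writes it, runs it against the file, and reports the numbers.&lt;/p&gt;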

&lt;p&gt;&lt;strong&gt;Technical characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workflow comparison:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tier 2 workflow (9 steps):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Request code from AI chat interface&lt;/li&gt;
&lt;li&gt;Copy generated code&lt;/li&gt;
&lt;li&gt;Paste into development environment&lt;/li&gt;
&lt;li&gt;Identify bugs during testing&lt;/li&gt;
&lt;li&gt;Copy error messages&lt;/li&gt;
&lt;li&gt;Return to AI chat&lt;/li&gt;
&lt;li&gt;Receive corrected code&lt;/li&gt;
&lt;li&gt;Re-paste and test&lt;/li&gt;
&lt;li&gt;Manually commit changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Tier 3 workflow (7 steps):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Request feature from AI agent&lt;/li&gt;
&lt;li&gt;AI asks clarifying questions&lt;/li&gt;
&lt;li&gt;AI generates multi-file changes&lt;/li&gt;
&lt;li&gt;AI runs test suite&lt;/li&gt;
&lt;li&gt;AI presents changes for review&lt;/li&gt;
&lt;li&gt;Human approves&lt;/li&gt;
&lt;li&gt;AI commits changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Technical requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configuration of file system permissions&lt;/li&gt;
&lt;li&gt;API key management&lt;/li&gt;
&lt;li&gt;MCP (Model Context Protocol) server setup&lt;/li&gt;
&lt;li&gt;Subscription costs ($20-40/month typical)&lt;/li&gt;
&lt;li&gt;Understanding of agent execution models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skills that facilitate adoption:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File system navigation (paths, directories, file types)&lt;/li&gt;
&lt;li&gt;Command-line interface familiarity&lt;/li&gt;
&lt;li&gt;API and webhook concepts&lt;/li&gt;
&lt;li&gt;Debugging methodology&lt;/li&gt;
&lt;li&gt;Programming fundamentals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Demographic patterns:&lt;/strong&gt;&lt;br&gt;
Users with software development backgrounds demonstrate faster adoption due to existing familiarity with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step-by-step execution models&lt;/li&gt;
&lt;li&gt;Tool ecosystem architecture (plugins, MCP servers, skills, hooks)&lt;/li&gt;
&lt;li&gt;Troubleshooting methodologies&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Tier 4: AI Orchestrators (~1-2% of AI users)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Profile:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build custom AI workflows and automation systems&lt;/li&gt;
&lt;li&gt;Deploy multiple specialized AI agents for different functions&lt;/li&gt;
&lt;li&gt;Create custom tools including MCP servers, custom GPTs, and skills&lt;/li&gt;
&lt;li&gt;Implement semi-autonomous AI processes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Usage patterns:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chain multiple AI API calls in sequences&lt;/li&gt;
&lt;li&gt;Integrate AI with automation platforms (n8n, Make.com, Zapier)&lt;/li&gt;
&lt;li&gt;Build custom MCP (Model Context Protocol) integrations&lt;/li&gt;
&lt;li&gt;Schedule AI agents to run on triggers or time intervals&lt;/li&gt;
&lt;li&gt;Direct API usage rather than chat interfaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observed implementations:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Custom MCP server integration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connects Claude to internal company APIs&lt;/li&gt;
&lt;li&gt;Enables database queries&lt;/li&gt;
&lt;li&gt;Accesses monitoring logs (e.g., Datadog)&lt;/li&gt;
&lt;li&gt;Creates project management tickets&lt;/li&gt;
&lt;li&gt;Triggers deployment processes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Content monitoring automation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RSS feed monitoring&lt;/li&gt;
&lt;li&gt;AI summarization of new content&lt;/li&gt;
&lt;li&gt;Automated outline generation&lt;/li&gt;
&lt;li&gt;Notification systems (Slack, email)&lt;/li&gt;
&lt;li&gt;Conditional publishing workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Research automation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Daily API polling (e.g., arXiv)&lt;/li&gt;
&lt;li&gt;Abstract analysis&lt;/li&gt;
&lt;li&gt;Knowledge base updates&lt;/li&gt;
&lt;li&gt;Relevance filtering&lt;/li&gt;
&lt;li&gt;Digest generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;CI/CD integration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub Actions with AI API calls&lt;/li&gt;
&lt;li&gt;Automated code review on pull requests&lt;/li&gt;
&lt;li&gt;Comment generation with suggestions&lt;/li&gt;
&lt;li&gt;Test coverage analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Technical characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-stage AI pipelines (output of one AI becomes input to another)&lt;/li&gt;
&lt;li&gt;Event-driven AI execution&lt;/li&gt;
&lt;li&gt;Scheduled autonomous processes&lt;/li&gt;
&lt;li&gt;Direct API integration&lt;/li&gt;
&lt;li&gt;Custom infrastructure development&lt;/li&gt;
&lt;/ul&gt;
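&lt;p&gt;A multi-stage pipeline in this sense is just composition over model calls. A minimal sketch with the model call stubbed out; a real orchestrator would substitute an actual provider API call and add retries, logging, and error handling:&lt;/p&gt;

```python
def run_pipeline(stages, text, call_model):
    """Feed the output of each stage's prompt into the next.

    stages: list of prompt templates containing an {input} placeholder.
    call_model: function taking a prompt string and returning a completion
                (stubbed in tests; a real one would hit a provider API).
    """
    for template in stages:
        text = call_model(template.format(input=text))
    return text
```

&lt;p&gt;Because each stage is just a template plus a call, the same loop covers the RSS-summarization and research-digest pipelines described above.&lt;/p&gt;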

&lt;p&gt;&lt;strong&gt;Technical requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Programming skills (Python, JavaScript, bash)&lt;/li&gt;
&lt;li&gt;API architecture knowledge (REST, webhooks, OAuth)&lt;/li&gt;
&lt;li&gt;Automation platform experience&lt;/li&gt;
&lt;li&gt;DevOps fundamentals (cron, CI/CD, monitoring)&lt;/li&gt;
&lt;li&gt;Prompt engineering for automated contexts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Operational considerations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires ongoing maintenance&lt;/li&gt;
&lt;li&gt;AI reliability limitations necessitate monitoring&lt;/li&gt;
&lt;li&gt;API costs scale with usage ($50-200/month range)&lt;/li&gt;
&lt;li&gt;Automation debugging and error handling&lt;/li&gt;
&lt;li&gt;System integration complexity&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Tier 5: Autonomous AI (Conceptual Stage)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Theoretical capabilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI functioning as independent team member&lt;/li&gt;
&lt;li&gt;High-level goal execution: "Increase conversion rate by 10%"&lt;/li&gt;
&lt;li&gt;Multi-day or multi-week project completion&lt;/li&gt;
&lt;li&gt;Minimal supervision requirements&lt;/li&gt;
&lt;li&gt;Independent handling of unexpected situations&lt;/li&gt;
&lt;li&gt;True autonomous decision-making&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Current state (2026):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not yet achieved at production scale or reliability&lt;/li&gt;
&lt;li&gt;Demonstration systems exist (e.g., Devin for coding)&lt;/li&gt;
&lt;li&gt;Current implementations require:

&lt;ul&gt;
&lt;li&gt;Regular human oversight&lt;/li&gt;
&lt;li&gt;Approval checkpoints for significant decisions&lt;/li&gt;
&lt;li&gt;Manual course correction&lt;/li&gt;
&lt;li&gt;Safety guardrails&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Technical limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI output reliability issues (hallucinations, logic errors)&lt;/li&gt;
&lt;li&gt;Limited common-sense reasoning&lt;/li&gt;
&lt;li&gt;Difficulty with novel or undefined situations&lt;/li&gt;
&lt;li&gt;Production-level reliability not yet achieved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observed constraints (2026):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Systems marketed as "autonomous" require human check-ins every few hours&lt;/li&gt;
&lt;li&gt;Success in constrained domains (specific code generation, defined data analysis)&lt;/li&gt;
&lt;li&gt;Limited effectiveness on open-ended, complex, extended projects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Timeline estimation:&lt;/strong&gt; Mainstream production readiness is estimated to be 3-5+ years away.&lt;/p&gt;




&lt;h2&gt;
  
  
  Distribution of AI Users (2026 Estimates)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Usage tier breakdown:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;% of Working Professionals&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tier 0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30-40%&lt;/td&gt;
&lt;td&gt;Non-users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tier 1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50% of AI users (~30% of workers)&lt;/td&gt;
&lt;td&gt;Casual Prompters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tier 2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;15-20% of AI users (~10% of workers)&lt;/td&gt;
&lt;td&gt;Daily Companions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tier 3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5-10% of AI users (~3-5% of workers)&lt;/td&gt;
&lt;td&gt;Agent Users&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tier 4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1-2% of AI users (~0.5-1% of workers)&lt;/td&gt;
&lt;td&gt;Orchestrators&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tier 5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt;0.1%&lt;/td&gt;
&lt;td&gt;Conceptual/not yet operational&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Distribution summary:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Approximately 40% of working professionals have minimal AI usage&lt;/li&gt;
&lt;li&gt;Approximately 30% use AI sporadically for specific tasks&lt;/li&gt;
&lt;li&gt;Approximately 10% have integrated AI into daily workflows&lt;/li&gt;
&lt;li&gt;Approximately 3-5% use AI agents with execution capabilities&lt;/li&gt;
&lt;li&gt;Less than 1% build custom AI automation systems&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;2026 AI adoption landscape:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Approximately 70% of working professionals are at Tier 0-1 (non-users or casual users)&lt;/li&gt;
&lt;li&gt;Tier 2-3 users (daily companions and agent users) represent roughly 13-15% of the workforce&lt;/li&gt;
&lt;li&gt;Tier 4 (orchestrators) and Tier 5 (autonomous systems) remain specialized categories&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key differentiators across tiers:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interaction model:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tiers 0-2 interact through conversational interfaces&lt;/li&gt;
&lt;li&gt;Tiers 3-4 grant execution permissions and build automated workflows&lt;/li&gt;
&lt;li&gt;Tier 5 represents theoretical autonomous operation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Technical requirements:&lt;/strong&gt;&lt;br&gt;
The transition from conversational AI usage to agent-based systems requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File system and command-line knowledge&lt;/li&gt;
&lt;li&gt;Permission and security management&lt;/li&gt;
&lt;li&gt;Understanding of agent execution models&lt;/li&gt;
&lt;li&gt;API and integration concepts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Background influence:&lt;/strong&gt;&lt;br&gt;
Software development experience correlates with faster adoption of Tier 3-4 capabilities due to existing familiarity with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execution models and step-by-step processing&lt;/li&gt;
&lt;li&gt;Tool ecosystems and integration patterns&lt;/li&gt;
&lt;li&gt;Troubleshooting methodologies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Current state:&lt;/strong&gt;&lt;br&gt;
As of 2026, the most significant adoption growth is occurring in the Tier 2 → Tier 3 transition, where AI capabilities shift from text generation to action execution. This transition represents a fundamental change in interaction model rather than an incremental feature addition.&lt;/p&gt;




&lt;h1&gt;
  
  
#AI #ArtificialIntelligence #TechTrends #AIAgents #ChatGPT #Claude #AgenticAI #DigitalTransformation #TechnologyAdoption
&lt;/h1&gt;

</description>
      <category>ai</category>
      <category>techtalks</category>
      <category>technologyadoption</category>
    </item>
    <item>
      <title>Claude Code memory management: Long-Term and Short-Term Memory with Hooks and Skills</title>
      <dc:creator>Rex Zhen</dc:creator>
      <pubDate>Sun, 25 Jan 2026 05:51:41 +0000</pubDate>
      <link>https://dev.to/rex_zhen_a9a8400ee9f22e98/ai-memory-problem-long-term-and-short-term-memory-with-hooks-and-skills-4gna</link>
      <guid>https://dev.to/rex_zhen_a9a8400ee9f22e98/ai-memory-problem-long-term-and-short-term-memory-with-hooks-and-skills-4gna</guid>
      <description>&lt;h1&gt;
  
  
  Claude Code Memory Management: Long-Term and Short-Term Memory with Hooks and Skills
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Challenge: AI Amnesia
&lt;/h2&gt;

&lt;p&gt;When working with AI assistants like Claude Code, you've probably experienced this frustrating pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You start a new session&lt;/li&gt;
&lt;li&gt;The AI asks questions you've already answered before&lt;/li&gt;
&lt;li&gt;Previous decisions and context are lost&lt;/li&gt;
&lt;li&gt;You waste time re-explaining the same background information&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the &lt;strong&gt;AI memory problem&lt;/strong&gt;. Most AI conversations are stateless: each session starts with a blank slate. While AI models have impressive context windows, they still face two fundamental memory constraints:&lt;/p&gt;

&lt;h3&gt;
  
  
  Short-Term Memory (Context Window Limits)
&lt;/h3&gt;

&lt;p&gt;Even with large context windows (100K-200K tokens), a single conversation can exceed these limits when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Working on complex, multi-hour projects&lt;/li&gt;
&lt;li&gt;Reviewing large codebases with many files&lt;/li&gt;
&lt;li&gt;Accumulating dozens of tool calls and outputs&lt;/li&gt;
&lt;li&gt;Discussing detailed technical specifications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you hit these limits, you get the dreaded error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API Error: 400 Input is too long for requested model.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
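&lt;p&gt;A rough way to see the limit coming, assuming the common heuristic of about four characters per token (an approximation only; the model's real tokenizer will differ):&lt;/p&gt;

```python
def approx_tokens(text):
    # Heuristic: roughly 4 characters per token for English text.
    return len(text) // 4

def remaining_budget(messages, limit=200_000, reserve=4_096):
    """Estimate how many tokens remain before a request would fail,
    holding back a reserve for the model's response."""
    used = sum(approx_tokens(m) for m in messages)
    return limit - reserve - used  # negative means the request will likely fail
```

&lt;p&gt;Checking the budget before each request is cheaper than recovering from the 400 error after it.&lt;/p&gt;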



&lt;h3&gt;
  
  
  Long-Term Memory (Session Persistence)
&lt;/h3&gt;

&lt;p&gt;Between sessions, AI has no memory at all. When you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Close and reopen the CLI&lt;/li&gt;
&lt;li&gt;Start a new day's work&lt;/li&gt;
&lt;li&gt;Switch between projects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All context from previous conversations is lost. The AI doesn't remember:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your project structure and architecture&lt;/li&gt;
&lt;li&gt;Previous decisions and why they were made&lt;/li&gt;
&lt;li&gt;Bugs you've encountered and solved&lt;/li&gt;
&lt;li&gt;Code patterns and conventions you established&lt;/li&gt;
&lt;li&gt;Your preferences and workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Solution: Hooks, Skills, and Persistent Memory
&lt;/h2&gt;

&lt;p&gt;The solution is a three-tier memory system that mimics human memory:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Session Summaries (Long-Term Memory)
&lt;/h3&gt;

&lt;p&gt;Create a &lt;strong&gt;session save mechanism&lt;/strong&gt; that captures conversation history in permanent storage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# .claude/scripts/save_session.sh&lt;/span&gt;
&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;SESSIONS_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;".claude/sessions"&lt;/span&gt;
&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +&lt;span class="s2"&gt;"%Y-%m-%d_%H%M"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;SESSION_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SESSIONS_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/session_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.md"&lt;/span&gt;

&lt;span class="c"&gt;# Save full transcript with timestamp&lt;/span&gt;
claude sessions &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SESSION_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Update latest session pointer&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SESSION_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SESSIONS_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/latest_session.md"&lt;/span&gt;

&lt;span class="c"&gt;# Generate summary using AI&lt;/span&gt;
claude sessions summarize &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SESSIONS_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/latest_summary_short.md"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a searchable archive of all your work, organized by date and project.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Automatic Memory Loading (Session Startup)
&lt;/h3&gt;

&lt;p&gt;Use a &lt;strong&gt;SessionStart hook&lt;/strong&gt; to automatically load context when you begin work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;.claude/config.json&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"SessionStart"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"startup"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;".claude/scripts/load_latest_summary.sh"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"background"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# .claude/scripts/load_latest_summary.sh&lt;/span&gt;
&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;SUMMARY_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;".claude/sessions/latest_summary_short.md"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SUMMARY_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"================== Previous Session Context =================="&lt;/span&gt;
  &lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SUMMARY_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=============================================================="&lt;/span&gt;
&lt;span class="k"&gt;else
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"No previous session found. Starting fresh."&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every session starts with a brief recap of where you left off.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. On-Demand Memory Recall (Skills)
&lt;/h3&gt;

&lt;p&gt;Create &lt;strong&gt;custom skills&lt;/strong&gt; for memory operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# .claude/skills/save-session/SKILL.md
---
&lt;/span&gt;name: save-session
description: "Saves current conversation transcript and creates summary"
&lt;span class="gh"&gt;trigger: /save-session | /ss
---
&lt;/span&gt;
Execute: .claude/scripts/save_session.sh

Then respond: "Session saved to .claude/sessions/session_[timestamp].md"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# .claude/skills/load-previous-summary/SKILL.md
---
&lt;/span&gt;name: load-previous-summary
description: Loads previous session summary for context
&lt;span class="gh"&gt;trigger: /load | /recall
---
&lt;/span&gt;
Execute: .claude/scripts/load_latest_summary.sh

Then summarize the loaded context for the user.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can use natural commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/save-session&lt;/code&gt; or &lt;code&gt;/ss&lt;/code&gt; - Save current work&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/load&lt;/code&gt; or &lt;code&gt;/recall&lt;/code&gt; - Recall previous context&lt;/li&gt;
&lt;li&gt;The AI can also invoke these proactively when needed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation Architecture
&lt;/h2&gt;

&lt;p&gt;Here's the complete memory system architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                      AI Session (Current)                    │
│  ┌────────────────────────────────────────────────────┐     │
│  │  Active Conversation (Short-Term Memory)          │     │
│  │  - Current task context                            │     │
│  │  - Recent messages and tool calls                  │     │
│  │  - Limited by context window                       │     │
│  └────────────────────────────────────────────────────┘     │
│                           │                                  │
│                           │ Save on exit/demand              │
│                           ▼                                  │
└─────────────────────────────────────────────────────────────┘
                            │
                            │
┌───────────────────────────▼─────────────────────────────────┐
│              Persistent Storage (Long-Term Memory)          │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Session Archive (.claude/sessions/)                 │   │
│  │  - session_2026-01-24_1430.md  (full transcript)    │   │
│  │  - session_2026-01-24_1600.md  (full transcript)    │   │
│  │  - session_2026-01-24_1820.md  (full transcript)    │   │
│  │  - latest_session.md            (most recent full)   │   │
│  │  - latest_summary_short.md      (condensed version)  │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Project Memory (session-notes/)                     │   │
│  │  - vibe67-memory.md  (manually curated notes)       │   │
│  │  - Key decisions and architecture                    │   │
│  │  - Gotchas and learnings                             │   │
│  └──────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────┘
                            │
                            │ Load on startup
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                   New AI Session (Restored)                  │
│  ┌────────────────────────────────────────────────────┐     │
│  │  Previous Context Loaded                           │     │
│  │  ✓ Project structure understood                    │     │
│  │  ✓ Recent work summarized                          │     │
│  │  ✓ Key decisions recalled                          │     │
│  │  ✓ Ready to continue where you left off            │     │
│  └────────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real-World Example
&lt;/h2&gt;

&lt;p&gt;Let's see this system in action:&lt;/p&gt;

&lt;h3&gt;
  
  
  Day 1: Initial Work
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: I'm building a video generator that downloads classical music and creates
     YouTube videos. I need to avoid copyright issues.

Claude: [Works on the problem, creates scanner tool, tests files...]

You: /save-session

Claude: ✓ Session saved to .claude/sessions/session_2026-01-24_1830.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Day 2: Continuation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Session starts automatically]

System: ================== Previous Session Context ==================
Working on vibe67 video generator project. Created YouTube-safe audio scanner
to pre-screen MP3 files for copyright risk. Discovered Classicals.de hosts
modern copyrighted performances despite claiming "public domain". Scanner
checks metadata for recording year, copyright statements, and DAW encoders.
Next: Run scanner on Chopin collection and find alternative PD sources.
==================================================================

You: Let's continue with the scanner

Claude: I'll run the YouTube-safe audio scanner on the Chopin collection we
        discussed yesterday. [Continues work seamlessly...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Mid-Session: Context Overflow Prevention
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[After many tool calls and file reads]

Claude: I'm approaching context limits. Let me save current progress.
        [Invokes save-session skill automatically]

        Now I'll load just the summary to continue with a fresh context window.
        [Loads latest_summary_short.md instead of full transcript]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Benefits of This System
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;No More Repeated Questions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The AI remembers your project structure, conventions, and previous decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Seamless Multi-Day Projects&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Pick up exactly where you left off, days or weeks later.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Context Window Management&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Automatic summarization prevents "input too long" errors on complex projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Searchable History&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Full transcripts are saved with timestamps - search past sessions for solutions.&lt;/p&gt;
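&lt;p&gt;For instance, finding which past session covered a given topic is a one-line grep over the archive. The &lt;code&gt;search_sessions&lt;/code&gt; helper below is a sketch of that idea, using the directory layout described in this article:&lt;/p&gt;

```shell
# Sketch: list archived sessions that mention a keyword, newest first.
# Assumes the .claude/sessions layout described in this article.
search_sessions() {
  grep -rli "$1" .claude/sessions 2>/dev/null | sort -r
}

# Example: which past session dealt with the audio scanner?
mkdir -p .claude/sessions
search_sessions "scanner"
```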

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Learning from History&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The AI can reference past mistakes, gotchas, and successful patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. &lt;strong&gt;Automatic and Manual Control&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Hooks provide automatic save/load&lt;/li&gt;
&lt;li&gt;Skills give you manual control when needed&lt;/li&gt;
&lt;li&gt;You decide when to save important milestones&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advanced: Hierarchical Memory
&lt;/h2&gt;

&lt;p&gt;For complex projects, use a tiered memory structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.claude/sessions/
  ├── latest_summary_short.md       # 500 tokens - Quick context
  ├── latest_summary.md             # 2000 tokens - Detailed recap
  ├── latest_session.md             # Full transcript
  └── manual_summary_2026-01-24.md  # Hand-crafted context

session-notes/
  └── vibe67-memory.md              # Curated project knowledge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI loads different levels based on need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quick tasks&lt;/strong&gt;: Load short summary only (saves tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continue work&lt;/strong&gt;: Load detailed summary&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex debugging&lt;/strong&gt;: Reference full session transcript&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term recall&lt;/strong&gt;: Search curated project memory&lt;/li&gt;
&lt;/ul&gt;
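&lt;p&gt;A small dispatcher can implement this tiering. The file names follow the tree above; the tier names (&lt;code&gt;short&lt;/code&gt;, &lt;code&gt;full&lt;/code&gt;, &lt;code&gt;transcript&lt;/code&gt;) are a convention invented for this sketch:&lt;/p&gt;

```shell
#!/bin/bash
# Sketch: load the memory tier matching the task at hand.
# Usage: load_memory.sh [short|full|transcript]
TIER="${1:-short}"
DIR=".claude/sessions"

case "$TIER" in
  short)      FILE="$DIR/latest_summary_short.md" ;;  # ~500 tokens, quick tasks
  full)       FILE="$DIR/latest_summary.md" ;;        # ~2000 tokens, continuing work
  transcript) FILE="$DIR/latest_session.md" ;;        # full log, deep debugging
  *)          echo "Unknown tier: $TIER"; exit 1 ;;
esac

if [ -f "$FILE" ]; then
  cat "$FILE"
else
  echo "No $TIER memory found. Starting fresh."
fi
```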

&lt;h2&gt;
  
  
  Implementation Tips
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Keep Summaries Focused
&lt;/h3&gt;

&lt;p&gt;Don't save everything - extract the essential context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current goals and progress&lt;/li&gt;
&lt;li&gt;Key decisions and rationale&lt;/li&gt;
&lt;li&gt;Active bugs or blockers&lt;/li&gt;
&lt;li&gt;File paths and important locations&lt;/li&gt;
&lt;li&gt;Next planned steps&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Use Timestamps
&lt;/h3&gt;

&lt;p&gt;Date-based filenames make it easy to find specific sessions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;session_2026-01-24_1430.md  # 2:30 PM session
session_2026-01-24_1820.md  # 6:20 PM session
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Automatic Hook Configuration
&lt;/h3&gt;

&lt;p&gt;Set hooks in &lt;code&gt;.claude/config.json&lt;/code&gt; so memory loading is automatic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"SessionStart"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"startup"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;".claude/scripts/load_latest_summary.sh"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Stop"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"autosave"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;".claude/scripts/save_session.sh"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Skill Triggers
&lt;/h3&gt;

&lt;p&gt;Use short, memorable triggers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/ss&lt;/code&gt; → save session&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/load&lt;/code&gt; → load previous summary&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/recall&lt;/code&gt; → search session archive&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Compression Strategy
&lt;/h3&gt;

&lt;p&gt;As sessions accumulate, compress older ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Keep full transcripts for 7 days&lt;/span&gt;
&lt;span class="c"&gt;# After 7 days, keep only summaries&lt;/span&gt;
&lt;span class="c"&gt;# After 30 days, archive to compressed format&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
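&lt;p&gt;The comment sketch above maps directly onto &lt;code&gt;find&lt;/code&gt;'s &lt;code&gt;-mtime&lt;/code&gt; filter. Here is a minimal version; the 7- and 30-day windows come from the comments, while the exact filename patterns are assumptions of this sketch:&lt;/p&gt;

```shell
#!/bin/bash
# Sketch: apply the retention policy described in the comments above.
SESSIONS_DIR=".claude/sessions"
mkdir -p "$SESSIONS_DIR"

# Older than 7 days: delete full transcripts (summaries are separate files)
find "$SESSIONS_DIR" -maxdepth 1 -name "session_*.md" -mtime +7 -delete

# Older than 30 days: compress remaining summaries in place (.md becomes .md.gz)
find "$SESSIONS_DIR" -maxdepth 1 -name "*summary*.md" -mtime +30 -exec gzip {} \;
```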



&lt;h2&gt;
  
  
  Handling the Token Budget
&lt;/h2&gt;

&lt;p&gt;Even if a project's conversation effectively never ends, each individual API call still has a hard token limit. The memory system works within that budget:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;SessionStart Hook&lt;/strong&gt;: Loads compact summary (~500 tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;During work&lt;/strong&gt;: Full context in active window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Before limit&lt;/strong&gt;: Auto-save and restart with summary&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-demand&lt;/strong&gt;: &lt;code&gt;/recall&lt;/code&gt; loads specific past context when needed&lt;/li&gt;
&lt;/ol&gt;
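&lt;p&gt;Step 3 needs some way to notice that the limit is near. One crude approach is estimating tokens from the transcript's word count; the ~0.75 words-per-token ratio and the 150k threshold below are rough assumptions for illustration, not measured values:&lt;/p&gt;

```shell
#!/bin/bash
# Sketch: warn when the running transcript nears the context limit.
# The words-to-tokens ratio (~0.75) and the threshold are rough guesses.
TRANSCRIPT=".claude/sessions/latest_session.md"
LIMIT=150000

mkdir -p .claude/sessions
touch "$TRANSCRIPT"

WORDS=$(wc -w "$TRANSCRIPT" | awk '{print $1}')
TOKENS=$((WORDS * 4 / 3))   # roughly words / 0.75

if [ "$TOKENS" -ge "$LIMIT" ]; then
  echo "Approaching context limit (~$TOKENS tokens). Save and restart."
else
  echo "Context OK: roughly $TOKENS tokens used."
fi
```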

&lt;p&gt;This creates the illusion of infinite memory while respecting API constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Example: Complete Setup
&lt;/h2&gt;

&lt;p&gt;Here's everything you need:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Directory structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; .claude/&lt;span class="o"&gt;{&lt;/span&gt;sessions,scripts,skills/&lt;span class="o"&gt;{&lt;/span&gt;save-session,load-previous-summary&lt;span class="o"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Save script:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# .claude/scripts/save_session.sh&lt;/span&gt;
&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;
&lt;span class="nv"&gt;SESSIONS_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;".claude/sessions"&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SESSIONS_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +&lt;span class="s2"&gt;"%Y-%m-%d_%H%M"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;SESSION_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SESSIONS_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/session_&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.md"&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Saving session to &lt;/span&gt;&lt;span class="nv"&gt;$SESSION_FILE&lt;/span&gt;&lt;span class="s2"&gt;..."&lt;/span&gt;

&lt;span class="c"&gt;# Export conversation (implement based on your CLI's export method)&lt;/span&gt;
claude sessions &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SESSION_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Update latest pointers&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SESSION_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SESSIONS_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/latest_session.md"&lt;/span&gt;

&lt;span class="c"&gt;# Generate short summary (implement summarization)&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SESSION_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | claude summarize &lt;span class="nt"&gt;--max-tokens&lt;/span&gt; 500 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;SESSIONS_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/latest_summary_short.md"&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Session saved successfully"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Load script:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# .claude/scripts/load_latest_summary.sh&lt;/span&gt;
&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;SUMMARY_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;".claude/sessions/latest_summary_short.md"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SUMMARY_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"================== Previous Session Context =================="&lt;/span&gt;
  &lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SUMMARY_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=============================================================="&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;0
&lt;span class="k"&gt;else
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"No previous session found"&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;0
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Hook configuration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;.claude/config.json&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"SessionStart"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"startup"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;".claude/scripts/load_latest_summary.sh"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"background"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"showOutput"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Stop"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"autosave"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;".claude/scripts/save_session.sh"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"background"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Skills:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# .claude/skills/save-session/SKILL.md
---
&lt;/span&gt;name: save-session
description: Immediately saves conversation transcript and summary
&lt;span class="gh"&gt;trigger: /save-session | /ss
---
&lt;/span&gt;
Execute this command:
.claude/scripts/save_session.sh

After success, respond:
"✓ Session saved to .claude/sessions/session_[timestamp].md"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The AI memory problem isn't unsolvable - it just requires thinking about memory the same way operating systems do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAM (Short-term)&lt;/strong&gt;: Active conversation context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk (Long-term)&lt;/strong&gt;: Session transcripts and summaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache (Recall)&lt;/strong&gt;: On-demand loading of specific context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compression&lt;/strong&gt;: Summarization to manage storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With hooks for automatic save/load and skills for manual control, you create a persistent memory layer that makes AI assistants truly useful for long-term projects.&lt;/p&gt;

&lt;p&gt;Stop re-explaining your project every session. Start working with an AI that remembers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/claude-code/hooks" rel="noopener noreferrer"&gt;Claude Code Hooks Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/claude-code/skills" rel="noopener noreferrer"&gt;Custom Skills Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.anthropic.com/claude/context-management" rel="noopener noreferrer"&gt;Context Window Management Best Practices&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;About this article&lt;/strong&gt;: Written using Claude Code with the exact memory system described above. This writing session was automatically saved when I finished, and will be loaded when I return tomorrow.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
    </item>
    <item>
      <title>How I Turned 30 Minutes of YouTube Video Prep Into 2 Minutes With AI Agent Skills</title>
      <dc:creator>Rex Zhen</dc:creator>
      <pubDate>Sun, 25 Jan 2026 05:50:15 +0000</pubDate>
      <link>https://dev.to/rex_zhen_a9a8400ee9f22e98/how-i-turned-30-minutes-of-youtube-video-prep-into-2-minutes-with-ai-agent-skills-kj6</link>
      <guid>https://dev.to/rex_zhen_a9a8400ee9f22e98/how-i-turned-30-minutes-of-youtube-video-prep-into-2-minutes-with-ai-agent-skills-kj6</guid>
      <description>&lt;h1&gt;
  
  
  How I Turned 30 Minutes of YouTube Video Prep Into 2 Minutes With AI Agent Skills
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Problem: Repetitive Manual Work Every Week
&lt;/h2&gt;

&lt;p&gt;I create YouTube videos twice a week. Before AI automation, my workflow looked like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every single video:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Create project folder structure (5 min)
2. Organize images into the right folders (5 min)
3. Find and copy audio files (3 min)
4. Verify thumbnail exists (2 min)
5. Check audio duration (3 min)
6. Convert images to 1920x1080 (5 min)
7. Run video generation script with correct parameters (5 min)
8. Verify output and move files (2 min)

Total: ~30 minutes of setup per video
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The real pain:&lt;/strong&gt; If I had to pause and come back later, I'd forget where I was in the process and have to re-explain everything to the AI.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Solution: AI Agent Skills System
&lt;/h2&gt;

&lt;p&gt;I spent one weekend building a custom AI agent skills system using Claude Code. The result? &lt;strong&gt;30 minutes compressed to 2 minutes.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Concept
&lt;/h3&gt;

&lt;p&gt;Instead of manually running each step, I created &lt;strong&gt;AI skills&lt;/strong&gt; that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Know my folder structure&lt;/li&gt;
&lt;li&gt;✅ Remember my video generation workflow&lt;/li&gt;
&lt;li&gt;✅ Execute multiple scripts in sequence&lt;/li&gt;
&lt;li&gt;✅ Validate everything automatically&lt;/li&gt;
&lt;li&gt;✅ Maintain context across sessions&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  File System Layout
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.claude/skills/                    # Personal skills (all projects)
└── generate-video/                  # Main video generation skill
    ├── SKILL.md                     # Orchestration logic
    ├── README.md                    # Documentation
    └── WORKFLOW.md                  # Visual diagrams

/Volumes/SSD/vibe67/scripts/         # My video generation scripts
├── scripts_generate_video/
│   ├── get_mp3_duration.py          # Audio duration calculator
│   └── auto_video_creator.py        # Video generator
│
└── scripts_download_images/
    └── resize_to_youtube_image.py   # Image converter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Skill Workflow
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Single command:&lt;/strong&gt; &lt;code&gt;/generate-video &amp;lt;folder-path&amp;gt; &amp;lt;video-name&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens automatically:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1: Validation
├─ Check folder exists
├─ Verify images present (jpg, png)
├─ Verify audio files present (mp3, m4a, wav)
└─ Ensure thumbnail* file exists (REQUIRED)

Step 2: Audio Analysis
├─ Run get_mp3_duration.py
├─ Calculate total hours
└─ Display: "Total: X.XX hours"

Step 3: Image Processing
├─ Run resize_to_youtube_image.py
├─ Convert ALL images (except thumbnail) to 1920x1080
└─ Overwrite originals (in place)

Step 4: Video Generation
├─ Run auto_video_creator.py
├─ Pass: folder path, video name, duration
├─ Output: /autocreated/{video-name}.mp4
└─ Confirm: File size, location, ready for upload

Total time: ~2 minutes (mostly video encoding)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
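&lt;p&gt;Step 1's checks are simple enough to sketch in shell. The folder layout and the mandatory &lt;code&gt;thumbnail*&lt;/code&gt; file follow the workflow above; the &lt;code&gt;validate_project&lt;/code&gt; helper itself is illustrative, not the actual skill script:&lt;/p&gt;

```shell
# Sketch: pre-flight validation for a video project folder.
validate_project() {
  _dir="$1"
  [ -d "$_dir" ] || { echo "FAIL: folder does not exist: $_dir"; return 1; }
  # At least one image and one audio file must be present
  ls "$_dir"/*.jpg "$_dir"/*.png 2>/dev/null | grep -q . || { echo "FAIL: no images"; return 1; }
  ls "$_dir"/*.mp3 "$_dir"/*.m4a "$_dir"/*.wav 2>/dev/null | grep -q . || { echo "FAIL: no audio"; return 1; }
  # The thumbnail is required
  ls "$_dir"/thumbnail* 2>/dev/null | grep -q . || { echo "FAIL: missing thumbnail"; return 1; }
  echo "OK: $_dir is ready"
}

# Demo on a throwaway folder
mkdir -p demo_project
touch demo_project/cover.jpg demo_project/track.mp3 demo_project/thumbnail.png
validate_project demo_project
```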






&lt;h2&gt;
  
  
  Key Design Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Personal Skills vs Project Skills&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I use &lt;strong&gt;personal skills&lt;/strong&gt; (&lt;code&gt;~/.claude/skills/&lt;/code&gt;) because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Available in ANY project directory&lt;/li&gt;
&lt;li&gt;Don't need to recreate for each video project&lt;/li&gt;
&lt;li&gt;Consistent workflow across all videos&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Skills, Not Documentation Lookup&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;When to create skills:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Repetitive workflows (video generation)&lt;/li&gt;
&lt;li&gt;✅ Multi-step automation&lt;/li&gt;
&lt;li&gt;✅ Fixed paths and procedures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to just ask AI:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ API documentation (Slack SDK, AWS, GCP)&lt;/li&gt;
&lt;li&gt;❌ One-time lookups&lt;/li&gt;
&lt;li&gt;❌ Public documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; Skills have token overhead but save time when you repeat the same process regularly.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;In-Place Image Conversion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Original design: Create separate &lt;code&gt;_youtube_1080p&lt;/code&gt; folder&lt;br&gt;
&lt;strong&gt;Problem:&lt;/strong&gt; Extra disk space, manual cleanup, confusing paths&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Modified &lt;code&gt;resize_to_youtube_image.py&lt;/code&gt; to overwrite in place&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simpler workflow (single folder throughout)&lt;/li&gt;
&lt;li&gt;No duplicate files&lt;/li&gt;
&lt;li&gt;Less disk space usage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Session Memory Through Skills&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; "Where was I in the process?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Skills encode the entire workflow&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No need to remember steps&lt;/li&gt;
&lt;li&gt;No need to re-explain to AI&lt;/li&gt;
&lt;li&gt;Just run &lt;code&gt;/generate-video&lt;/code&gt; and it knows everything&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Real-World Impact
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Before AI Skills
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Every video (2x per week):
- 30 minutes manual work
- High chance of mistakes
- Forgot where I left off if interrupted
- Had to document steps manually
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  After AI Skills
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Every video:
- 2 minutes (just one command)
- Zero mistakes (automated validation)
- Can pause and resume anytime
- Skills ARE the documentation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Time saved per video:&lt;/strong&gt; 28 minutes&lt;br&gt;
&lt;strong&gt;Videos per week:&lt;/strong&gt; 2&lt;br&gt;
&lt;strong&gt;Time saved per week:&lt;/strong&gt; 56 minutes&lt;br&gt;
&lt;strong&gt;Time saved per month:&lt;/strong&gt; ~4 hours&lt;/p&gt;




&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; 30 minutes of repetitive work, twice a week&lt;br&gt;
&lt;strong&gt;After:&lt;/strong&gt; 2 minutes, fully automated, never forget where I was&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; One weekend of setup&lt;br&gt;
&lt;strong&gt;Savings:&lt;/strong&gt; 4 hours per month, forever&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real win:&lt;/strong&gt; AI remembers my entire workflow so I don't have to.&lt;/p&gt;




&lt;p&gt;#AI #Automation #YouTube #DevOps #Productivity #ClaudeCode #ContentCreation&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>automation</category>
    </item>
    <item>
      <title>Service Mesh in 2026: The Landscape Has Changed (Istio Ambient Mode Update)</title>
      <dc:creator>Rex Zhen</dc:creator>
      <pubDate>Mon, 19 Jan 2026 00:33:46 +0000</pubDate>
      <link>https://dev.to/rex_zhen_a9a8400ee9f22e98/service-mesh-in-2026-the-landscape-has-changed-istio-ambient-mode-update-179m</link>
      <guid>https://dev.to/rex_zhen_a9a8400ee9f22e98/service-mesh-in-2026-the-landscape-has-changed-istio-ambient-mode-update-179m</guid>
      <description>&lt;h1&gt;
  
  
  Service Mesh in 2026: The Landscape Has Changed (Istio Ambient Mode Update)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  A Confession: My Previous Post Was Already Outdated
&lt;/h2&gt;

&lt;p&gt;Last week, I published an article about &lt;a href="https://dev.to/rex_zhen_a9a8400ee9f22e98/why-service-mesh-never-took-off-despite-being-incredibly-powerful-3fao"&gt;why service mesh never took off&lt;/a&gt;, based on my experiences from years ago. The challenges I described were real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30-90% infrastructure cost increase with sidecar-based architectures&lt;/li&gt;
&lt;li&gt;Per-pod sidecar overhead (50 pods = 50 extra containers)&lt;/li&gt;
&lt;li&gt;Complex upgrades and troubleshooting that killed adoption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Days after publishing, I discovered the landscape had fundamentally changed.&lt;/strong&gt; The service mesh story I told was based on outdated knowledge from 2017-2023. I considered updating that post, but decided to leave it as-is (a historical perspective) and write this follow-up instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The truth:&lt;/strong&gt; While I was away from the service mesh world, Istio evolved dramatically. What I experienced years ago is no longer the reality in 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changed: AWS Gave Up, Istio Evolved
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. AWS App Mesh: Deprecated&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In 2024, AWS announced the deprecation of App Mesh, their managed service mesh offering. This validates exactly what I wrote in my previous post—&lt;strong&gt;the economics didn't work&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;AWS's reasoning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High operational overhead for customers&lt;/li&gt;
&lt;li&gt;Better alternatives emerged (AWS observability services, application-level instrumentation)&lt;/li&gt;
&lt;li&gt;Limited adoption outside large enterprises&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: Even AWS, with infinite resources, couldn't make the traditional sidecar model economically viable for most customers.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Istio Ambient Mode: The Game Changer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;While AWS retreated, Istio made a bold architectural shift. Ambient mode (GA in 2024/2025) &lt;strong&gt;eliminates per-pod sidecars&lt;/strong&gt; entirely, replacing them with:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Node-level proxies (ztunnel)&lt;/strong&gt;: 1 DaemonSet pod per node instead of N sidecars&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optional service-level proxies (Waypoint)&lt;/strong&gt;: Deployed only for services needing advanced L7 features&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;This is Kubernetes-native innovation&lt;/strong&gt;—only viable in K8s environments (EKS, GKE, self-managed). It fundamentally changes the cost equation.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Ambient Architecture: Node-Level vs Pod-Level
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Before: Traditional Sidecar Mode&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│          Your Application Pod            │
│  ┌──────────────┐    ┌──────────────┐  │
│  │ App          │    │ Envoy Sidecar│  │
│  │ Container    │◄───┤ Proxy        │  │
│  │              │    │              │  │
│  │ 500m CPU     │    │ 100m CPU     │  │ ← 20% overhead PER POD
│  │ 512Mi RAM    │    │ 128Mi RAM    │  │
│  └──────────────┘    └──────────────┘  │
└─────────────────────────────────────────┘

50 pods × 100m CPU = 5 vCPU overhead
50 pods × 128Mi = 6.4GB overhead
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Problems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every pod needs sidecar = 100 containers for 50 apps&lt;/li&gt;
&lt;li&gt;Sidecar upgrades = restart all application pods&lt;/li&gt;
&lt;li&gt;Debug complexity = app logic + sidecar config&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;After: Ambient Mode (ztunnel + Waypoint)&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Layer 1: ztunnel (L4 - TCP/Connection Level)&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                    Kubernetes Node                           │
│                                                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐     │
│  │ App Pod 1    │  │ App Pod 2    │  │ App Pod 3    │     │
│  │ (No sidecar!)│  │ (No sidecar!)│  │ (No sidecar!)│     │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘     │
│         │                  │                  │              │
│         └──────────────────┼──────────────────┘              │
│                            ▼                                 │
│                   ┌────────────────┐                         │
│                   │    ztunnel     │ ◄─ DaemonSet           │
│                   │  (L4 Proxy)    │    (1 per node)        │
│                   │                │                         │
│                   │ 100m CPU       │                         │
│                   │ 128Mi RAM      │                         │
│                   └────────────────┘                         │
└─────────────────────────────────────────────────────────────┘

3 nodes × 100m CPU = 0.3 vCPU overhead
3 nodes × 128Mi = 0.4GB overhead

Savings: 94% reduction in proxy overhead!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ztunnel provides:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ mTLS encryption between all services (zero-trust networking)&lt;/li&gt;
&lt;li&gt;✅ L4 connection metrics (bytes, connections, TCP stats)&lt;/li&gt;
&lt;li&gt;✅ Service authentication and authorization&lt;/li&gt;
&lt;li&gt;✅ Kiali service graph visualization&lt;/li&gt;
&lt;li&gt;❌ No L7 features (circuit breakers, retries require Waypoint)&lt;/li&gt;
&lt;/ul&gt;
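&lt;p&gt;The overhead comparison in the two diagrams above reduces to simple arithmetic. The pod/node counts and 100m-per-proxy figure are the article's example numbers, not universal defaults:&lt;/p&gt;

```python
# Proxy CPU overhead: one sidecar per pod vs one ztunnel per node.
pods, nodes = 50, 3
proxy_cpu_millicores = 100  # per proxy, per the example above

sidecar_overhead_vcpu = pods * proxy_cpu_millicores / 1000   # 5.0 vCPU
ztunnel_overhead_vcpu = nodes * proxy_cpu_millicores / 1000  # 0.3 vCPU
reduction_pct = round((1 - ztunnel_overhead_vcpu / sidecar_overhead_vcpu) * 100)

print(sidecar_overhead_vcpu, ztunnel_overhead_vcpu, reduction_pct)
```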




&lt;h4&gt;
  
  
  &lt;strong&gt;Layer 2: Waypoint (L7 - HTTP/gRPC Level) - Optional&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                    Kubernetes Cluster                        │
│                                                              │
│  ┌──────────────┐       ┌──────────────┐                   │
│  │ Frontend     │──────►│   Waypoint   │──────►Backend     │
│  │ Service      │       │   Proxy      │       Service     │
│  │ (10 pods)    │       │              │       (5 pods)    │
│  └──────────────┘       │ 200m CPU     │       └──────────┘│
│                         │ 256Mi RAM    │                   │
│                         │              │                   │
│                         │ • Circuit    │                   │
│  Deploy only for        │   breakers   │                   │
│  services needing       │ • Retries    │                   │
│  advanced L7 features   │ • Timeouts   │                   │
│                         │ • Tracing    │                   │
│                         └──────────────┘                   │
└─────────────────────────────────────────────────────────────┘

1 Waypoint serves 10 frontend pods (10:1 ratio)
Not 10 sidecars serving 10 pods (1:1 ratio)

Savings: 90% reduction even with L7 features!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Waypoint adds:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Circuit breakers (prevent cascading failures)&lt;/li&gt;
&lt;li&gt;✅ HTTP-level retries, timeouts, traffic splitting&lt;/li&gt;
&lt;li&gt;✅ Request-level metrics (latency, status codes, throughput)&lt;/li&gt;
&lt;li&gt;✅ Distributed tracing (Jaeger integration)&lt;/li&gt;
&lt;li&gt;✅ Canary deployments, A/B testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key strategy&lt;/strong&gt;: Deploy Waypoints only for critical services (20% of apps), keep ztunnel-only for the rest (80% of apps).&lt;/p&gt;
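&lt;p&gt;The proxy-count math behind the 10:1 claim, sketched under the same assumptions as the diagram (10 frontend pods, one shared Waypoint):&lt;/p&gt;

```python
# In sidecar mode, every pod carries its own proxy (1:1).
# In ambient mode, one Waypoint is shared by all pods of a service.
frontend_pods = 10
sidecar_proxies = frontend_pods   # one per pod
waypoint_proxies = 1              # one shared Deployment

proxy_reduction_pct = round((1 - waypoint_proxies / sidecar_proxies) * 100)
print(proxy_reduction_pct)
```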




&lt;h2&gt;
  
  
  The Complete Observability Stack (All Pods in Your Cluster)
&lt;/h2&gt;

&lt;p&gt;One major clarification: &lt;strong&gt;All Istio components run as pods in your GKE/EKS cluster&lt;/strong&gt;. Nothing is external.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Deployment Overview&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt;

NAMESPACE       NAME                               TYPE
─────────────────────────────────────────────────────────────
istio-system    ztunnel-abc123                     DaemonSet
istio-system    ztunnel-def456                     DaemonSet
istio-system    ztunnel-ghi789                     DaemonSet
istio-system    istiod-7d4b9c8f9-abc123           Deployment
google-boutique waypoint-frontend-abc123          Deployment
google-boutique waypoint-checkout-def456          Deployment
monitoring      prometheus-0                       StatefulSet
monitoring      prometheus-1                       StatefulSet
monitoring      grafana-5b7d8c9f4-abc123          Deployment
observability   jaeger-all-in-one-abc123          Deployment
istio-system    kiali-abc123                       Deployment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Resource Requirements&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Deployment Type&lt;/th&gt;
&lt;th&gt;Replicas&lt;/th&gt;
&lt;th&gt;CPU&lt;/th&gt;
&lt;th&gt;Memory&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ztunnel&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;DaemonSet&lt;/td&gt;
&lt;td&gt;3 (1 per node)&lt;/td&gt;
&lt;td&gt;0.3 vCPU&lt;/td&gt;
&lt;td&gt;0.4GB&lt;/td&gt;
&lt;td&gt;L4 proxy, mTLS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;istiod&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.5 vCPU&lt;/td&gt;
&lt;td&gt;2GB&lt;/td&gt;
&lt;td&gt;Control plane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Waypoints&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;2-5&lt;/td&gt;
&lt;td&gt;0.4 vCPU&lt;/td&gt;
&lt;td&gt;0.5GB&lt;/td&gt;
&lt;td&gt;L7 features (selective)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prometheus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;StatefulSet&lt;/td&gt;
&lt;td&gt;2 (HA)&lt;/td&gt;
&lt;td&gt;1 vCPU&lt;/td&gt;
&lt;td&gt;2GB&lt;/td&gt;
&lt;td&gt;Metrics storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grafana&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.1 vCPU&lt;/td&gt;
&lt;td&gt;0.25GB&lt;/td&gt;
&lt;td&gt;Dashboards UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jaeger&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.5 vCPU&lt;/td&gt;
&lt;td&gt;1GB&lt;/td&gt;
&lt;td&gt;Distributed tracing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kiali&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.1 vCPU&lt;/td&gt;
&lt;td&gt;0.25GB&lt;/td&gt;
&lt;td&gt;Service graph UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10-15 pods&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~3 vCPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~6.4GB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full stack&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Storage (GCP Persistent Disks)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus: 2 × 50Gi = 100Gi ($4/month)&lt;/li&gt;
&lt;li&gt;Jaeger: 20Gi ($0.80/month)&lt;/li&gt;
&lt;li&gt;Total: $4.80/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total monthly cost&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compute: $165/month (3 nodes + mesh overhead)&lt;/li&gt;
&lt;li&gt;Storage: $5/month (persistent disks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total: $170/month&lt;/strong&gt; for 50 pods with full observability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Compare to AWS X-Ray&lt;/strong&gt;: $1,400+/month for distributed tracing alone!&lt;/p&gt;
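&lt;p&gt;Putting the cost figures above together. The $0.04/GiB-month disk rate is implied by the article's $4/100Gi figure; actual GCP pricing varies by region and disk type, and the $1,400 X-Ray number is the article's, so treat this as illustrative:&lt;/p&gt;

```python
# Monthly cost of the self-hosted stack vs managed tracing alone.
compute_per_month = 165.0            # 3 nodes plus mesh overhead
disk_gib = 100 + 20                  # Prometheus (2 x 50Gi) plus Jaeger
disk_price_per_gib = 0.04            # implied by $4/month for 100Gi
storage_per_month = disk_gib * disk_price_per_gib   # $4.80

total_per_month = round(compute_per_month + storage_per_month)  # ~$170
xray_per_month = 1400
cost_ratio = round(xray_per_month / total_per_month, 1)

print(total_per_month, cost_ratio)
```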




&lt;h2&gt;
  
  
  What You Get (and What You Give Up)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;✅ With ztunnel Only (L4) - 80% Use Case&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mTLS encryption between all services (zero-trust)&lt;/li&gt;
&lt;li&gt;L4 connection metrics (TCP stats, bytes)&lt;/li&gt;
&lt;li&gt;Service authentication/authorization&lt;/li&gt;
&lt;li&gt;Kiali service graph&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost: +3% infrastructure&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No circuit breakers (need Waypoint)&lt;/li&gt;
&lt;li&gt;No HTTP-level metrics (only TCP)&lt;/li&gt;
&lt;li&gt;No distributed tracing&lt;/li&gt;
&lt;li&gt;No request retries/timeouts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Internal microservices, background workers, databases, caches&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;✅ With ztunnel + Selective Waypoints (L4 + L7) - 20% Use Case&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Benefits (all of ztunnel, plus):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Circuit breakers (prevent cascading failures)&lt;/li&gt;
&lt;li&gt;HTTP retries, timeouts, traffic splitting&lt;/li&gt;
&lt;li&gt;Request-level metrics (latency, status codes)&lt;/li&gt;
&lt;li&gt;Distributed tracing (Jaeger)&lt;/li&gt;
&lt;li&gt;Canary deployments, A/B testing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost: +10-15% infrastructure&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; User-facing APIs, payment services, checkout flows, critical paths&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;⚠️ Trade-off: Granularity&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sidecar mode&lt;/strong&gt;: Per-pod metrics&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;frontend-pod-1: 100 req/s, 50ms p95
frontend-pod-2: 120 req/s, 45ms p95
frontend-pod-3: 80 req/s, 60ms p95  ← Can identify slow pod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Ambient mode&lt;/strong&gt;: Service-level metrics&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;frontend service: 300 req/s, 52ms p95  ← Aggregated
Cannot see individual pod performance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add application-level Prometheus instrumentation (expose &lt;code&gt;/metrics&lt;/code&gt; endpoint)&lt;/li&gt;
&lt;li&gt;Use Kubernetes native metrics (kubelet, cAdvisor) for pod-level CPU/memory&lt;/li&gt;
&lt;li&gt;Most teams don't need per-pod HTTP metrics—service-level is sufficient&lt;/li&gt;
&lt;/ol&gt;
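&lt;p&gt;The granularity trade-off is easy to see in code. Using the per-pod numbers from the sidecar example above, ambient mode would only ever report the aggregated service-level view. Note the latency here is a naive request-weighted mean, not a true aggregated p95 (percentiles don't combine this simply):&lt;/p&gt;

```python
# Per-pod metrics (only visible with sidecars or app-level instrumentation):
# pod name: (requests/sec, p95 latency in ms) -- the article's example values
pods = {
    "frontend-pod-1": (100, 50),
    "frontend-pod-2": (120, 45),
    "frontend-pod-3": (80, 60),   # the slow pod, invisible after aggregation
}

# Service-level view, roughly what ambient mode reports:
total_rps = sum(rps for rps, _ in pods.values())
weighted_latency_ms = sum(rps * lat for rps, lat in pods.values()) / total_rps

print(total_rps, round(weighted_latency_ms, 1))
```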




&lt;h2&gt;
  
  
  When Service Mesh Makes Sense Now (2026)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;✅ Ambient Mode Opens Doors For:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mid-size teams (15-30 services)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost overhead: 10-15% (vs 66% sidecar mode)&lt;/li&gt;
&lt;li&gt;Gradual adoption: Start with L4, add L7 selectively&lt;/li&gt;
&lt;li&gt;mTLS security without code changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cost-conscious organizations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3-5% overhead for zero-trust networking&lt;/li&gt;
&lt;li&gt;Self-hosted observability: $170/month (vs $1,400+ cloud tracing)&lt;/li&gt;
&lt;li&gt;Works perfectly with spot instances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Compliance requirements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mTLS encryption by default (HIPAA, PCI-DSS, SOC2)&lt;/li&gt;
&lt;li&gt;Zero-trust networking (mutual authentication)&lt;/li&gt;
&lt;li&gt;No application code changes needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Complex microservices (20+ services)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributed tracing: $46/month Jaeger vs $1,400/month X-Ray&lt;/li&gt;
&lt;li&gt;Circuit breakers for failure isolation&lt;/li&gt;
&lt;li&gt;Real-time service dependency graphs&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;❌ Still Not Worth It For:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Small teams (&amp;lt;10 services)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operational overhead not justified&lt;/li&gt;
&lt;li&gt;Application-level instrumentation simpler (Prometheus client libraries)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Monoliths&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No inter-service communication complexity&lt;/li&gt;
&lt;li&gt;Traditional monitoring sufficient&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Teams without Kubernetes expertise&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires K8s debugging skills&lt;/li&gt;
&lt;li&gt;Mesh troubleshooting adds complexity&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Installation: Quick Start
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Install Istio with Ambient Mode&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install istioctl&lt;/span&gt;
curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://istio.io/downloadIstio | sh -
&lt;span class="nb"&gt;cd &lt;/span&gt;istio-&lt;span class="k"&gt;*&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$PWD&lt;/span&gt;/bin:&lt;span class="nv"&gt;$PATH&lt;/span&gt;

&lt;span class="c"&gt;# Install Ambient profile&lt;/span&gt;
istioctl &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;profile&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ambient &lt;span class="nt"&gt;-y&lt;/span&gt;

&lt;span class="c"&gt;# Verify installation&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; istio-system
&lt;span class="c"&gt;# Output:&lt;/span&gt;
&lt;span class="c"&gt;# istiod-xxx        1/1   Running&lt;/span&gt;
&lt;span class="c"&gt;# ztunnel-xxx       1/1   Running (DaemonSet, 1 per node)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Enable Ambient for Your Namespace&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable Ambient mode (ztunnel L4 only)&lt;/span&gt;
kubectl label namespace google-boutique istio.io/dataplane-mode&lt;span class="o"&gt;=&lt;/span&gt;ambient

&lt;span class="c"&gt;# Deploy your application&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; your-app.yaml &lt;span class="nt"&gt;-n&lt;/span&gt; google-boutique

&lt;span class="c"&gt;# All pods now have:&lt;/span&gt;
&lt;span class="c"&gt;# ✅ mTLS encryption (via ztunnel)&lt;/span&gt;
&lt;span class="c"&gt;# ✅ Zero-trust authentication&lt;/span&gt;
&lt;span class="c"&gt;# ❌ No L7 features yet (no circuit breakers)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Add Waypoint for Critical Services (L7)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy Waypoint for frontend service only&lt;/span&gt;
istioctl waypoint apply &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service-account&lt;/span&gt; frontend &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; google-boutique

&lt;span class="c"&gt;# Verify Waypoint deployment&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; google-boutique | &lt;span class="nb"&gt;grep &lt;/span&gt;waypoint
&lt;span class="c"&gt;# waypoint-frontend-abc123   1/1   Running&lt;/span&gt;

&lt;span class="c"&gt;# Now configure circuit breaker for frontend → payment calls&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-circuit-breaker
  namespace: google-boutique
spec:
  host: payment-service
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
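&lt;p&gt;Conceptually, the &lt;code&gt;outlierDetection&lt;/code&gt; policy above ejects a backend once it returns 5 consecutive errors within the detection interval. A toy sketch of that rule (this mimics the idea, not Envoy's actual implementation):&lt;/p&gt;

```python
# Toy model of consecutive-error outlier detection: a host is ejected
# when its most recent results end in a failure streak of length 5 or more.
def should_eject(recent_results, threshold=5):
    """recent_results: booleans, True = request failed, newest last."""
    streak = 0
    for failed in recent_results:
        streak = streak + 1 if failed else 0
    return streak >= threshold

print(should_eject([True] * 5))                        # ejected
print(should_eject([True, True, False, True, True]))   # streak broken, kept
```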



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Install Observability Stack&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Prometheus + Grafana&lt;/span&gt;
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm &lt;span class="nb"&gt;install &lt;/span&gt;prometheus prometheus-community/kube-prometheus-stack &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; monitoring &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; prometheus.prometheusSpec.retention&lt;span class="o"&gt;=&lt;/span&gt;7d &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; prometheus.prometheusSpec.resources.requests.cpu&lt;span class="o"&gt;=&lt;/span&gt;500m &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; prometheus.prometheusSpec.resources.requests.memory&lt;span class="o"&gt;=&lt;/span&gt;2Gi

&lt;span class="c"&gt;# Install Jaeger&lt;/span&gt;
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm &lt;span class="nb"&gt;install &lt;/span&gt;jaeger jaegertracing/jaeger &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; observability &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; allInOne.enabled&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; allInOne.resources.requests.cpu&lt;span class="o"&gt;=&lt;/span&gt;500m &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; allInOne.resources.requests.memory&lt;span class="o"&gt;=&lt;/span&gt;1Gi

&lt;span class="c"&gt;# Install Kiali (included with Istio)&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/istio/istio/release-1.21/samples/addons/kiali.yaml

&lt;span class="c"&gt;# Access UIs via port-forward&lt;/span&gt;
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; monitoring svc/prometheus-grafana 3000:80
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; observability svc/jaeger-query 16686:16686
kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; istio-system svc/kiali 20001:20001

&lt;span class="c"&gt;# Open in browser:&lt;/span&gt;
&lt;span class="c"&gt;# Grafana: http://localhost:3000 (user: admin, password: prom-operator)&lt;/span&gt;
&lt;span class="c"&gt;# Jaeger: http://localhost:16686&lt;/span&gt;
&lt;span class="c"&gt;# Kiali: http://localhost:20001&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Bottom Line: From Luxury to Practical
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;2017-2023&lt;/strong&gt;: Service mesh was prohibitively expensive (30-90% cost increase).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2024-2025&lt;/strong&gt;: AWS gave up on App Mesh, while Istio Ambient mode made service mesh &lt;strong&gt;affordable and practical&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3-15% overhead&lt;/strong&gt; vs 66% sidecar mode&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node-level proxies&lt;/strong&gt; (DaemonSet) vs per-pod sidecars&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graduated adoption&lt;/strong&gt;: L4 first (cheap), L7 where needed (selective)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes-native&lt;/strong&gt;: Only works in K8s (EKS, GKE)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Who should reconsider?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mid-size teams (15-30 services) previously priced out&lt;/li&gt;
&lt;li&gt;Cost-conscious orgs needing mTLS compliance&lt;/li&gt;
&lt;li&gt;Teams wanting circuit breakers without code changes&lt;/li&gt;
&lt;li&gt;Anyone paying $1,400+/month for AWS X-Ray&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The shift&lt;/strong&gt;: Service mesh is no longer a luxury for large enterprises—it's a &lt;strong&gt;viable option for mid-size teams&lt;/strong&gt; building on Kubernetes.&lt;/p&gt;

&lt;p&gt;If you ruled out service mesh before due to cost, &lt;strong&gt;revisit it now&lt;/strong&gt;. The economics have fundamentally changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix: Complete Deployment Manifests
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;A. ztunnel (DaemonSet) - Installed by Istio&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Managed by: istioctl install --set profile=ambient&lt;/span&gt;
&lt;span class="c1"&gt;# You don't create this manually, but here's what it looks like:&lt;/span&gt;

&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DaemonSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ztunnel&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;istio-system&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ztunnel&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ztunnel&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ztunnel&lt;/span&gt;
      &lt;span class="na"&gt;hostNetwork&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;istio-proxy&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcr.io/istio-release/ztunnel:1.21.0&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;128Mi&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500m&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;
        &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;privileged&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;# Needed for iptables manipulation&lt;/span&gt;
        &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cni-bin&lt;/span&gt;
          &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/host/opt/cni/bin&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cni-bin&lt;/span&gt;
        &lt;span class="na"&gt;hostPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/opt/cni/bin&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;B. Waypoint Proxy (Deployment) - Created by istioctl&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Created by: istioctl waypoint apply --service-account frontend&lt;/span&gt;
&lt;span class="c1"&gt;# This is what gets deployed:&lt;/span&gt;

&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;waypoint-frontend&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-boutique&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;istio.io/gateway-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;waypoint-frontend&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# Scale up for HA&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;istio.io/gateway-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;waypoint-frontend&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;istio.io/gateway-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;waypoint-frontend&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;istio-proxy&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcr.io/istio-release/proxyv2:1.21.0&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;200m&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;256Mi&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1000m&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15021&lt;/span&gt;  &lt;span class="c1"&gt;# Health check&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15090&lt;/span&gt;  &lt;span class="c1"&gt;# Metrics&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;waypoint-frontend&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;google-boutique&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;istio.io/gateway-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;waypoint-frontend&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15021&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;status-port&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
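&lt;p&gt;After running &lt;code&gt;istioctl waypoint apply&lt;/code&gt;, a quick sanity check is to list the waypoints in the namespace and confirm the generated Deployment and Service are ready. These commands are a sketch assuming istioctl 1.21; flags and output format vary between releases:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Waypoints registered in the namespace
istioctl waypoint list -n google-boutique

# The Deployment and Service shown above, selected by the gateway-name label
kubectl get deploy,svc -n google-boutique -l istio.io/gateway-name=waypoint-frontend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;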






&lt;h3&gt;
  
  
  &lt;strong&gt;C. Prometheus (StatefulSet)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Helm values for prometheus-community/kube-prometheus-stack&lt;/span&gt;
&lt;span class="c1"&gt;# Save as: prometheus-values.yaml&lt;/span&gt;
&lt;span class="c1"&gt;# Install: helm install prometheus prometheus-community/kube-prometheus-stack -f prometheus-values.yaml&lt;/span&gt;

&lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;prometheusSpec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;# High availability&lt;/span&gt;
    &lt;span class="na"&gt;retention&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;7d&lt;/span&gt;  &lt;span class="c1"&gt;# Keep metrics for 7 days&lt;/span&gt;

    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500m&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2Gi&lt;/span&gt;
      &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2000m&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4Gi&lt;/span&gt;

    &lt;span class="na"&gt;storageSpec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;volumeClaimTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ReadWriteOnce"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;50Gi&lt;/span&gt;
          &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;standard-rwo&lt;/span&gt;  &lt;span class="c1"&gt;# GKE persistent disk&lt;/span&gt;

    &lt;span class="c1"&gt;# Scrape Istio metrics&lt;/span&gt;
    &lt;span class="na"&gt;additionalScrapeConfigs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;istio-mesh'&lt;/span&gt;
      &lt;span class="na"&gt;kubernetes_sd_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;endpoints&lt;/span&gt;
        &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;names&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;istio-system&lt;/span&gt;
      &lt;span class="na"&gt;relabel_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_kubernetes_service_name&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;__meta_kubernetes_endpoint_port_name&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;keep&lt;/span&gt;
        &lt;span class="na"&gt;regex&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;istio-telemetry;prometheus&lt;/span&gt;

&lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;256Mi&lt;/span&gt;
    &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500m&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;

  &lt;span class="c1"&gt;# Pre-load Istio dashboards&lt;/span&gt;
  &lt;span class="na"&gt;dashboardProviders&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;dashboardproviders.yaml&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;istio'&lt;/span&gt;
        &lt;span class="na"&gt;orgId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
        &lt;span class="na"&gt;folder&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Istio'&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file&lt;/span&gt;
        &lt;span class="na"&gt;disableDeletion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
        &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/lib/grafana/dashboards/istio&lt;/span&gt;

  &lt;span class="na"&gt;dashboardsConfigMaps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;istio&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;istio-grafana-dashboards"&lt;/span&gt;

&lt;span class="na"&gt;alertmanager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;128Mi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
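&lt;p&gt;The values file above is consumed by Helm at install time. A typical install into a dedicated &lt;code&gt;monitoring&lt;/code&gt; namespace would look like the following (the namespace name is an assumption here; whatever you pick must match the Prometheus and Grafana URLs that Kiali is configured with below):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  -f prometheus-values.yaml

# Expect two Prometheus replicas (from prometheusSpec.replicas: 2)
kubectl get statefulset -n monitoring
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;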






&lt;h3&gt;
  
  
  &lt;strong&gt;D. Jaeger (All-in-One Deployment)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# For production, use separate collector/query/storage&lt;/span&gt;
&lt;span class="c1"&gt;# This is simplified all-in-one for learning&lt;/span&gt;

&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaeger-all-in-one&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;observability&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaeger&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaeger&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaeger&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaegertracing/all-in-one:1.52&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500m&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2000m&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2Gi&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;COLLECTOR_ZIPKIN_HOST_PORT&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:9411"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SPAN_STORAGE_TYPE&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;badger&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BADGER_EPHEMERAL&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BADGER_DIRECTORY_VALUE&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/badger/data&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BADGER_DIRECTORY_KEY&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/badger/key&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5775&lt;/span&gt;   &lt;span class="c1"&gt;# UDP Zipkin&lt;/span&gt;
          &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;UDP&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6831&lt;/span&gt;   &lt;span class="c1"&gt;# UDP Jaeger&lt;/span&gt;
          &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;UDP&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6832&lt;/span&gt;   &lt;span class="c1"&gt;# UDP Jaeger&lt;/span&gt;
          &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;UDP&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5778&lt;/span&gt;   &lt;span class="c1"&gt;# HTTP config&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;16686&lt;/span&gt;  &lt;span class="c1"&gt;# HTTP UI&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;14268&lt;/span&gt;  &lt;span class="c1"&gt;# HTTP collector&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;14250&lt;/span&gt;  &lt;span class="c1"&gt;# gRPC collector&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9411&lt;/span&gt;   &lt;span class="c1"&gt;# HTTP Zipkin&lt;/span&gt;
        &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaeger-storage&lt;/span&gt;
          &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/badger&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaeger-storage&lt;/span&gt;
        &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaeger-pvc&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaeger-pvc&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;observability&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;20Gi&lt;/span&gt;
  &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;standard-rwo&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaeger-query&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;observability&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaeger&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;query-http&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;16686&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;16686&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaeger-collector&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;observability&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jaeger&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grpc&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;14250&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;14250&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;14268&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;14268&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
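&lt;p&gt;With the Deployment, PVC, and Services applied, the Jaeger UI can be reached locally by port-forwarding the &lt;code&gt;jaeger-query&lt;/code&gt; Service defined above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# UI available at http://localhost:16686 while this runs
kubectl -n observability port-forward svc/jaeger-query 16686:16686
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;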






&lt;h3&gt;
  
  
  &lt;strong&gt;E. Kiali (Service Graph UI)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified Kiali deployment&lt;/span&gt;
&lt;span class="c1"&gt;# Full version: kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.21/samples/addons/kiali.yaml&lt;/span&gt;

&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kiali&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;istio-system&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kiali&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kiali&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kiali&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kiali&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;quay.io/kiali/kiali:v1.79&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;256Mi&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500m&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PROMETHEUS_URL&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://prometheus-kube-prometheus-prometheus.monitoring:9090"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GRAFANA_URL&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://prometheus-grafana.monitoring:80"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;JAEGER_URL&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://jaeger-query.observability:16686"&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20001&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9090&lt;/span&gt;  &lt;span class="c1"&gt;# Metrics&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kiali&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;istio-system&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kiali&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20001&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20001&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;metrics&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9090&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9090&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kiali&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;istio-system&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterRole&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kiali&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;configmaps&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;endpoints&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;namespaces&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;nodes&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pods&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;services&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;watch"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;apps"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deployments&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;replicasets&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;statefulsets&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;watch"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;networking.istio.io"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;watch"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterRoleBinding&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kiali&lt;/span&gt;
&lt;span class="na"&gt;roleRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;apiGroup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io&lt;/span&gt;
  &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterRole&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kiali&lt;/span&gt;
&lt;span class="na"&gt;subjects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kiali&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;istio-system&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;F. Complete Installation Script&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# install-ambient-stack.sh - Complete Istio Ambient + Observability setup&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== Installing Istio Ambient Mode ==="&lt;/span&gt;
curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://istio.io/downloadIstio | sh -
&lt;span class="nb"&gt;cd &lt;/span&gt;istio-&lt;span class="k"&gt;*&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$PWD&lt;/span&gt;/bin:&lt;span class="nv"&gt;$PATH&lt;/span&gt;

istioctl &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;profile&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ambient &lt;span class="nt"&gt;-y&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== Installing Prometheus + Grafana ==="&lt;/span&gt;
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm &lt;span class="nb"&gt;install &lt;/span&gt;prometheus prometheus-community/kube-prometheus-stack &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; monitoring &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; prometheus.prometheusSpec.retention&lt;span class="o"&gt;=&lt;/span&gt;7d &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; prometheus.prometheusSpec.replicas&lt;span class="o"&gt;=&lt;/span&gt;2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; prometheus.prometheusSpec.resources.requests.cpu&lt;span class="o"&gt;=&lt;/span&gt;500m &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; prometheus.prometheusSpec.resources.requests.memory&lt;span class="o"&gt;=&lt;/span&gt;2Gi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage&lt;span class="o"&gt;=&lt;/span&gt;50Gi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; grafana.resources.requests.cpu&lt;span class="o"&gt;=&lt;/span&gt;100m &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; grafana.resources.requests.memory&lt;span class="o"&gt;=&lt;/span&gt;256Mi

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== Installing Jaeger ==="&lt;/span&gt;
kubectl create namespace observability &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true
&lt;/span&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; jaeger-all-in-one.yaml

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== Installing Kiali ==="&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/istio/istio/release-1.21/samples/addons/kiali.yaml

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== Enabling Ambient for google-boutique namespace ==="&lt;/span&gt;
kubectl create namespace google-boutique &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true
&lt;/span&gt;kubectl label namespace google-boutique istio.io/dataplane-mode&lt;span class="o"&gt;=&lt;/span&gt;ambient

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"=== Installation Complete! ==="&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Access dashboards:"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"  Grafana:    kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"  Prometheus: kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"  Jaeger:     kubectl port-forward -n observability svc/jaeger-query 16686:16686"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"  Kiali:      kubectl port-forward -n istio-system svc/kiali 20001:20001"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Deploy Waypoint for a service:"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"  istioctl waypoint apply --service-account &amp;lt;sa-name&amp;gt; --namespace google-boutique"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;Have you tried Ambient mode? What's your experience with the new architecture? Share your thoughts!&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  #Kubernetes #ServiceMesh #Istio #AmbientMode #CloudNative #Microservices #DevOps #SRE #GKE #EKS
&lt;/h1&gt;

</description>
      <category>kubernetes</category>
      <category>servicemesh</category>
      <category>istio</category>
      <category>devops</category>
    </item>
    <item>
      <title>Running Cluster on 100% Spot Instances: How K8s Does It Better Than ECS</title>
      <dc:creator>Rex Zhen</dc:creator>
      <pubDate>Sun, 18 Jan 2026 21:44:54 +0000</pubDate>
      <link>https://dev.to/rex_zhen_a9a8400ee9f22e98/running-cluster-on-100-spot-instances-how-k8s-does-it-better-than-ecs-11be</link>
      <guid>https://dev.to/rex_zhen_a9a8400ee9f22e98/running-cluster-on-100-spot-instances-how-k8s-does-it-better-than-ecs-11be</guid>
      <description>&lt;h1&gt;
  
  
  Running Cluster on 100% Spot Instances: How K8s Does It Better Than ECS
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;p&gt;Spot instances offer 60-90% cost savings, but come with a catch: &lt;strong&gt;30-second termination notice&lt;/strong&gt;. This creates reliability challenges - pod disruptions, capacity drops, and potential service degradation.&lt;/p&gt;

&lt;p&gt;After running workloads on both ECS and Kubernetes with spot instances, I've found &lt;strong&gt;K8s provides architectural advantages&lt;/strong&gt; that ECS simply cannot match. K8s has native features for coordinated shutdown, flexible scheduling constraints, and priority-based resource management that make 100% spot clusters production-viable.&lt;/p&gt;

&lt;p&gt;Here's how K8s handles spot terminations differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  K8s Features for Spot Reliability (Overview)
&lt;/h2&gt;

&lt;p&gt;K8s provides a comprehensive set of primitives for handling spot terminations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Graceful Shutdown&lt;/strong&gt;: Application-level SIGTERM handling with request draining&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Readiness Probe&lt;/strong&gt;: Fast endpoint removal with &lt;code&gt;failureThreshold: 1&lt;/code&gt; (ECS equivalent: ALB health checks, limited to load balancer scenarios)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PreStop Hook&lt;/strong&gt;: Coordinate shutdown timing before SIGTERM (No ECS equivalent - critical gap)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-Provisioning&lt;/strong&gt;: Run excess capacity; still cheaper on spot than minimal on-demand&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;topologySpreadConstraints&lt;/strong&gt;: Automatic multi-zone distribution with rebalancing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Soft Anti-Affinity&lt;/strong&gt;: &lt;code&gt;preferredDuringScheduling&lt;/code&gt; adapts to capacity (ECS has only hard constraints)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PriorityClass&lt;/strong&gt;: Priority-based eviction for instant capacity reclamation (No ECS equivalent)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HorizontalPodAutoscaler&lt;/strong&gt;: Asymmetric scaling - fast up, slow down&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Key insight&lt;/strong&gt;: 9 over-provisioned pods on spot ($270/mo) cost less than 5 minimally-sized pods on-demand ($500/mo), with superior reliability.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(See Appendix for complete production-ready K8s configuration)&lt;/em&gt;&lt;/p&gt;
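&lt;p&gt;Strategy 1 (graceful shutdown) lives in application code rather than manifests. Here is a minimal sketch of SIGTERM handling with request draining in Python; the in-flight counter and timings are illustrative, and a real service would hook into its HTTP framework's connection tracking instead:&lt;/p&gt;

```python
import signal
import threading
import time

# Illustrative in-flight request tracking; a real app would use its
# HTTP framework's connection accounting instead.
shutting_down = threading.Event()
in_flight = 0
lock = threading.Lock()

def handle_sigterm(signum, frame):
    # Stop accepting new work. The readiness probe should begin
    # failing here so the endpoint is removed from the Service.
    shutting_down.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def drain(timeout=20.0, poll=0.1):
    """Wait for in-flight requests to finish, bounded by the
    terminationGracePeriodSeconds budget minus any preStop delay."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with lock:
            if in_flight == 0:
                return True
        time.sleep(poll)
    return False  # grace period exhausted; remaining requests are cut off
```

&lt;p&gt;The drain timeout must fit inside &lt;code&gt;terminationGracePeriodSeconds&lt;/code&gt; after subtracting any preStop sleep, or the kubelet will SIGKILL the container mid-drain.&lt;/p&gt;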




&lt;h2&gt;
  
  
  K8s vs ECS: Feature Comparison
&lt;/h2&gt;

&lt;p&gt;Platform capability analysis for spot instance workloads:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Kubernetes&lt;/th&gt;
&lt;th&gt;ECS&lt;/th&gt;
&lt;th&gt;Key Difference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1. Graceful shutdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;Application-level - identical implementation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2a. Readiness probe&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;⚠ Partial&lt;/td&gt;
&lt;td&gt;K8s: Any Service, &lt;code&gt;failureThreshold: 1&lt;/code&gt;&lt;br&gt;ECS: ALB/NLB only, minimum 2 checks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2b. PreStop hook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;✗ &lt;strong&gt;No&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Critical gap&lt;/strong&gt;: K8s delays SIGTERM for coordination&lt;br&gt;ECS: Immediate SIGTERM causes ALB draining race condition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Over-provisioning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;Conceptually similar, K8s features amplify effectiveness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Multi-zone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;⚠ Limited&lt;/td&gt;
&lt;td&gt;K8s: &lt;code&gt;topologySpreadConstraints&lt;/code&gt; with auto-rebalancing&lt;br&gt;ECS: Task placement strategies, less dynamic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5. Soft anti-affinity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;✗ &lt;strong&gt;No&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;K8s exclusive&lt;/strong&gt;: Adaptive constraints for dynamic capacity&lt;br&gt;ECS: Hard constraints only, tasks can block&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;6. Overprovisioner&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;✗ &lt;strong&gt;No&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;K8s exclusive&lt;/strong&gt;: PriorityClass enables instant replacement&lt;br&gt;ECS: No priority-based eviction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;7. Asymmetric HPA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;K8s: HorizontalPodAutoscaler&lt;br&gt;ECS: Application Auto Scaling - comparable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PodDisruptionBudget&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✓ Yes&lt;/td&gt;
&lt;td&gt;✗ No&lt;/td&gt;
&lt;td&gt;K8s only (voluntary disruptions, not spot)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
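&lt;p&gt;The PodDisruptionBudget row deserves a concrete sketch: it guards against &lt;em&gt;voluntary&lt;/em&gt; disruptions (node drains, cluster upgrades), not the spot interruption itself. Names and thresholds below are illustrative:&lt;/p&gt;

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api-pdb
spec:
  minAvailable: 5          # never voluntarily drain below the true minimum
  selector:
    matchLabels:
      app: web-api
```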

&lt;h3&gt;
  
  
  Key Architectural Gaps in ECS
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ECS missing capabilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;PreStop hooks&lt;/strong&gt; - No coordination mechanism; immediate SIGTERM creates load balancer draining race conditions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Soft constraints&lt;/strong&gt; - All-or-nothing placement; tasks remain Pending when constraints conflict with capacity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Priority-based eviction&lt;/strong&gt; - No overprovisioner pattern; cannot reclaim capacity from low-priority workloads&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;ECS limited capabilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Health checks&lt;/strong&gt; - ALB/NLB only (K8s: any Service including internal mesh)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-zone placement&lt;/strong&gt; - Static strategies (K8s: dynamic rebalancing)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;ECS equivalent capabilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Graceful shutdown&lt;/strong&gt; - Application-level implementation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-provisioning&lt;/strong&gt; - Task count management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-scaling&lt;/strong&gt; - Target tracking policies&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Observed impact&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ECS: 0.1-1% error rate during spot terminations&lt;/li&gt;
&lt;li&gt;K8s: &amp;lt;0.05% error rate with proper configuration&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Node Management Layer: AWS wins
&lt;/h2&gt;

&lt;p&gt;Beyond pod orchestration, there's another layer to consider: &lt;strong&gt;node management automation&lt;/strong&gt;. AWS provides superior options here: EKS with Karpenter offers intelligent bin-packing at EC2 pricing ($89/month for our workload), while GKE Autopilot charges a serverless premium ($118/month for the same). For cost-conscious architectures, AWS's node management solutions (Karpenter in EKS, managed scaling in ECS) deliver better economics than GKE Autopilot's per-pod pricing model. &lt;em&gt;(I'll cover this in detail in a separate post on Karpenter vs Autopilot cost models.)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;K8s provides architectural primitives that enable production-grade spot instance reliability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PreStop hooks&lt;/strong&gt; eliminate shutdown race conditions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Soft constraints&lt;/strong&gt; adapt to dynamic capacity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PriorityClass&lt;/strong&gt; enables instant replacement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Combined with over-provisioning economics (spot discounts make excess capacity cheaper than minimal on-demand), these features make 100% spot clusters viable for production workloads.&lt;/p&gt;
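&lt;p&gt;The over-provisioning economics reduce to simple arithmetic. A sketch using the figures above; the per-pod prices are assumptions back-derived from the $270/mo and $500/mo totals (roughly a 70% spot discount):&lt;/p&gt;

```python
# Back-of-envelope: 9 over-provisioned spot pods vs 5 minimally-sized
# on-demand pods. Per-pod prices are illustrative assumptions.
on_demand_per_pod = 100   # $/month (assumed)
spot_per_pod = 30         # $/month, ~70% spot discount (assumed)

min_pods = 5              # capacity actually required
overprovision = 1.8       # run 80% extra capacity on spot

spot_pods = round(min_pods * overprovision)      # 9 pods
spot_total = spot_pods * spot_per_pod            # $270/month
on_demand_total = min_pods * on_demand_per_pod   # $500/month

# Over-provisioned spot still undercuts minimal on-demand.
print(f"spot: ${spot_total}/mo ({spot_pods} pods) vs "
      f"on-demand: ${on_demand_total}/mo ({min_pods} pods)")
```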

&lt;p&gt;The key difference from ECS: K8s doesn't just manage containers - it provides coordination mechanisms that enable graceful degradation under failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running spot workloads? What strategies have worked for your architecture?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix: Complete K8s Configuration
&lt;/h2&gt;

&lt;p&gt;Production-ready configuration implementing all strategies:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-api-production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9&lt;/span&gt;  &lt;span class="c1"&gt;# Over-provisioning: Run 80% more pods than minimum (need 5, run 9)&lt;/span&gt;

  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;rollingUpdate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;maxSurge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
      &lt;span class="na"&gt;maxUnavailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;

  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;terminationGracePeriodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;  &lt;span class="c1"&gt;# Total time budget for graceful shutdown&lt;/span&gt;
      &lt;span class="na"&gt;priorityClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high-priority&lt;/span&gt;   &lt;span class="c1"&gt;# For overprovisioner pattern with pause pods&lt;/span&gt;

      &lt;span class="c1"&gt;# Multi-zone distribution (ECS equivalent: task placement strategies)&lt;/span&gt;
      &lt;span class="na"&gt;topologySpreadConstraints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;maxSkew&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
        &lt;span class="na"&gt;topologyKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;topology.kubernetes.io/zone&lt;/span&gt;
        &lt;span class="na"&gt;whenUnsatisfiable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DoNotSchedule&lt;/span&gt;
        &lt;span class="na"&gt;labelSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-api&lt;/span&gt;

      &lt;span class="c1"&gt;# Soft anti-affinity (K8s exclusive - ECS can't do soft constraints!)&lt;/span&gt;
      &lt;span class="na"&gt;affinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;podAntiAffinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;preferredDuringSchedulingIgnoredDuringExecution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# "preferred" = soft&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
            &lt;span class="na"&gt;podAffinityTerm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;labelSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-api&lt;/span&gt;
              &lt;span class="na"&gt;topologyKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes.io/hostname&lt;/span&gt;

      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api:v1.0.0&lt;/span&gt;

        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512Mi"&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2000m"&lt;/span&gt;    &lt;span class="c1"&gt;# Allow 4x burst&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi"&lt;/span&gt;

        &lt;span class="c1"&gt;# Readiness probe (similar to ECS ALB health checks, but works without LB)&lt;/span&gt;
        &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/ready&lt;/span&gt;    &lt;span class="c1"&gt;# Your app must implement this endpoint&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
          &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# K8s allows 1, ECS minimum is 2&lt;/span&gt;

        &lt;span class="c1"&gt;# PreStop hook (K8s exclusive - ECS has NO equivalent!)&lt;/span&gt;
        &lt;span class="na"&gt;lifecycle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;preStop&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;exec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/bin/sh&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-c&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
                &lt;span class="s"&gt;sleep 5      # Delay SIGTERM for endpoint removal propagation&lt;/span&gt;
                &lt;span class="s"&gt;kill -TERM 1 # Trigger app's graceful shutdown handler&lt;/span&gt;
                &lt;span class="s"&gt;sleep 20     # Allow app to drain in-flight requests&lt;/span&gt;

&lt;span class="s"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# HorizontalPodAutoscaler - Asymmetric scaling (fast up, slow down)&lt;/span&gt;
&lt;span class="c1"&gt;# ECS equivalent: Application Auto Scaling with target tracking policies&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling/v2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HorizontalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-api-hpa&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-api-production&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9&lt;/span&gt;              &lt;span class="c1"&gt;# Maintain over-provisioned baseline&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Resource&lt;/span&gt;
    &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cpu&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Utilization&lt;/span&gt;
        &lt;span class="na"&gt;averageUtilization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;  &lt;span class="c1"&gt;# Scale proactively before capacity exhaustion&lt;/span&gt;
  &lt;span class="na"&gt;behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;scaleUp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;stabilizationWindowSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;   &lt;span class="c1"&gt;# No delay - respond to spikes immediately&lt;/span&gt;
      &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Percent&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;            &lt;span class="c1"&gt;# Aggressive: double capacity if needed&lt;/span&gt;
        &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
    &lt;span class="na"&gt;scaleDown&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;stabilizationWindowSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;  &lt;span class="c1"&gt;# Conservative: wait 5 min to preserve buffer&lt;/span&gt;
      &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pods&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;              &lt;span class="c1"&gt;# Remove only 1 pod/min to maintain over-provisioning&lt;/span&gt;
        &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






</description>
      <category>kubernetes</category>
      <category>gke</category>
      <category>spotinstances</category>
      <category>sre</category>
    </item>
    <item>
      <title>Why Service Mesh Never Took Off (Despite Being Incredibly Powerful)</title>
      <dc:creator>Rex Zhen</dc:creator>
      <pubDate>Sat, 17 Jan 2026 23:51:02 +0000</pubDate>
      <link>https://dev.to/rex_zhen_a9a8400ee9f22e98/why-service-mesh-never-took-off-despite-being-incredibly-powerful-3fao</link>
      <guid>https://dev.to/rex_zhen_a9a8400ee9f22e98/why-service-mesh-never-took-off-despite-being-incredibly-powerful-3fao</guid>
      <description>&lt;h1&gt;
  
  
  Why Service Mesh Never Took Off (Despite Being Incredibly Powerful)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Promise Was Real
&lt;/h2&gt;

&lt;p&gt;Years ago, when AWS announced App Mesh at re:Invent, I tested it out with a few microservices to see the interconnections between them. The benefits were genuinely impressive:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What service mesh solves:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instant visibility&lt;/strong&gt;: See traffic flow between all services in real-time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance insights&lt;/strong&gt;: Identify bottlenecks across 50-200 microservices at a glance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic troubleshooting&lt;/strong&gt;: Anyone can pinpoint failures, not just senior SREs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-trust security&lt;/strong&gt;: mTLS encryption between all services, automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before service mesh, only the most experienced engineers could diagnose issues across complex microservice architectures. Service mesh democratized observability.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Feature That Changed My Mind: Circuit Breakers
&lt;/h2&gt;

&lt;p&gt;This weekend, while I was reviewing the Kubernetes ecosystem, Istio caught my attention again. I discovered a capability I'd previously overlooked: &lt;strong&gt;infrastructure-level circuit breakers&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are circuit breakers?
&lt;/h3&gt;

&lt;p&gt;Think of your home's electrical circuit breaker. When there's an overload, it trips immediately to prevent damage. Service mesh does the same for your services:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without circuit breakers:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Payment service database goes down
→ Checkout service keeps sending requests (5-second timeout each)
→ Checkout threads pile up waiting
→ Checkout service exhausts resources
→ Entire system cascades into failure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With circuit breakers (via Istio):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Payment service database goes down
→ Circuit breaker detects failures after 5 attempts
→ Circuit "opens" - stops sending requests immediately
→ Checkout returns fast errors instead of hanging
→ System degrades gracefully, doesn't crash
→ After 30 seconds, circuit tries again (half-open state)
→ If successful, circuit closes and normal operation resumes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The game-changer? &lt;strong&gt;Istio handles this at the infrastructure level without touching application code.&lt;/strong&gt; Your developers don't need to implement complex retry logic, timeout handling, or failure detection in every service.&lt;/p&gt;
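
&lt;p&gt;For concreteness, the failure thresholds described above map onto a handful of fields in an Istio &lt;code&gt;DestinationRule&lt;/code&gt; (a sketch with a hypothetical service name, not a tuned production config):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-circuit-breaker
spec:
  host: payment-service        # hypothetical service name
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5  # "detects failures after 5 attempts"
      interval: 10s            # how often hosts are re-evaluated
      baseEjectionTime: 30s    # the 30-second window before retrying
      maxEjectionPercent: 100  # allow ejecting every unhealthy host
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;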




&lt;h2&gt;
  
  
  So Why Isn't Everyone Using It?
&lt;/h2&gt;

&lt;p&gt;If service mesh is this powerful, why hasn't it become ubiquitous? Two reasons:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Operational Complexity
&lt;/h3&gt;

&lt;p&gt;Service mesh adds a sidecar proxy to &lt;strong&gt;every pod&lt;/strong&gt;. In Kubernetes, this means an extra container per pod to configure, manage, and troubleshoot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The counterargument:&lt;/strong&gt; This complexity can be hidden in Helm charts or Terraform modules. However, when things go wrong, your team needs to debug both application logic AND mesh configuration. This doubles the cognitive load.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cost (The Real Killer)
&lt;/h3&gt;

&lt;p&gt;Service mesh isn't free. Here's the math:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure overhead:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each pod runs an additional sidecar proxy consuming CPU and memory&lt;/li&gt;
&lt;li&gt;Depending on traffic patterns, expect &lt;strong&gt;30-90% increase in compute costs&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A 100-node cluster now needs 130-190 nodes to handle the same workload&lt;/li&gt;
&lt;/ul&gt;
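
&lt;p&gt;To make this concrete: the injected &lt;code&gt;istio-proxy&lt;/code&gt; container reserves resources of its own. The values below are Istio's stock defaults (tunable, and shown here as a rough sizing sketch rather than a recommendation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Default resource requests for the injected istio-proxy sidecar
resources:
  requests:
    cpu: 100m      # 0.1 vCPU reserved per pod
    memory: 128Mi  # reserved per pod, before any traffic flows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Across hundreds of pods, those per-pod reservations are where the extra nodes come from.&lt;/p&gt;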

&lt;p&gt;&lt;strong&gt;Observability costs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Massive telemetry data volume sent to Prometheus/Grafana&lt;/li&gt;
&lt;li&gt;AWS X-Ray (AWS's distributed tracing service) charges &lt;strong&gt;per trace recorded&lt;/strong&gt; - this cost scales linearly with traffic&lt;/li&gt;
&lt;li&gt;At high volume (1000+ req/s), AWS X-Ray costs can reach &lt;strong&gt;$1,400+/month per service&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-world example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Base GKE cluster (50 pods): $148/month (Spot VMs)
Add Istio service mesh: +$58/month (sidecars)
Add observability backends: +$76/month (Jaeger, Prometheus)
───────────────────────────────────────────────────
Total: $282/month (90% cost increase)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare this to AWS X-Ray's per-trace pricing model, and you'll understand why teams abandon it at scale. &lt;strong&gt;The billing shock is real.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Service mesh is powerful, but expensive. It makes sense for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large organizations (20+ microservices, multiple teams)&lt;/li&gt;
&lt;li&gt;Strict security/compliance requirements (mandatory mTLS)&lt;/li&gt;
&lt;li&gt;Complex architectures where troubleshooting time savings justify the cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does NOT make sense for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small teams (&amp;lt;10 services)&lt;/li&gt;
&lt;li&gt;Cost-sensitive environments&lt;/li&gt;
&lt;li&gt;Simple architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;My take:&lt;/strong&gt; Service mesh is a luxury, not a necessity. Most benefits can be achieved with application-level instrumentation at a fraction of the cost. Reserve service mesh for when you truly need it.&lt;/p&gt;
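
&lt;p&gt;For a sense of what application-level instrumentation involves, here is a minimal circuit breaker sketch (illustrative Python, not tied to any particular framework) implementing the closed → open → half-open cycle described earlier:&lt;/p&gt;

```python
import time


class CircuitBreaker:
    """Minimal closed -> open -> half-open circuit breaker."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a half-open probe
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"  # let one probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or too many consecutive failures, (re)opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "closed"  # success closes the circuit again
        return result
```

&lt;p&gt;Wrapping the checkout service's calls to the payment service in &lt;code&gt;breaker.call()&lt;/code&gt; gives the fast-fail behavior described above — at the cost of every team shipping and maintaining this logic in code, which is exactly the overhead the mesh removes.&lt;/p&gt;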




&lt;p&gt;&lt;strong&gt;Have you tried service mesh in production? What was your experience? Would love to hear your thoughts in the comments.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>servicemesh</category>
      <category>istio</category>
      <category>devops</category>
    </item>
    <item>
      <title>The ECS Spot Instance Dilemma: When Task Placement Strategies Force Impossible Trade-Offs</title>
      <dc:creator>Rex Zhen</dc:creator>
      <pubDate>Tue, 13 Jan 2026 16:44:32 +0000</pubDate>
      <link>https://dev.to/rex_zhen_a9a8400ee9f22e98/the-ecs-spot-instance-dilemma-when-task-placement-strategies-force-impossible-trade-offs-2jjg</link>
      <guid>https://dev.to/rex_zhen_a9a8400ee9f22e98/the-ecs-spot-instance-dilemma-when-task-placement-strategies-force-impossible-trade-offs-2jjg</guid>
      <description>&lt;h2&gt;
  
  
  The Operational Reality of Spot Instances
&lt;/h2&gt;

&lt;p&gt;Spot instances offer compelling cost savings—often 60-70% compared to on-demand pricing. For organizations running containerized workloads, this translates to substantial infrastructure budget reductions. The business case is clear: migrate to spot instances wherever possible.&lt;/p&gt;

&lt;p&gt;However, adopting spot instances introduces a challenging operational problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem: Alarm Fatigue and Service Degradation
&lt;/h3&gt;

&lt;p&gt;Spot instances terminate frequently—sometimes multiple times per day across a cluster. Each termination triggers cascading effects:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring alerts fire continuously:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudWatch alarms: "ECS service below desired task count"&lt;/li&gt;
&lt;li&gt;Application metrics: Spike in 5xx errors during task replacement&lt;/li&gt;
&lt;li&gt;Load balancer health checks: Temporary target unavailability&lt;/li&gt;
&lt;li&gt;Cluster capacity warnings: "Instance terminated in availability-zone-a"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Customer-facing impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;External monitoring (Pingdom, Datadog) detects brief service degradation&lt;/li&gt;
&lt;li&gt;5xx error rates spike for 30-90 seconds during task rescheduling&lt;/li&gt;
&lt;li&gt;Response times increase while remaining tasks handle full load&lt;/li&gt;
&lt;li&gt;On-call engineers receive pages for incidents that "self-heal" within minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The irony: &lt;strong&gt;services recover automatically&lt;/strong&gt; through ECS's built-in resilience mechanisms, but not before generating alerts, incident tickets, and potential customer complaints.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Obvious Solution Has an Expensive Catch (For Small Clusters)
&lt;/h3&gt;

&lt;p&gt;The standard recommendation for spot resilience is straightforward: &lt;strong&gt;spread tasks across multiple instances&lt;/strong&gt; using ECS placement strategies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"placementStrategy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"spread"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"instanceId"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration ensures that losing one instance affects only a small percentage of total capacity. The blast radius becomes manageable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; This approach works well at scale but becomes prohibitively expensive for &lt;strong&gt;small-to-medium services and clusters&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large service (100+ tasks):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100 tasks spread across 15-20 instances&lt;/li&gt;
&lt;li&gt;Each instance: 5-7 tasks (50-70% utilization)&lt;/li&gt;
&lt;li&gt;Spread strategy achieves good distribution AND efficient resource usage&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Problem minimal:&lt;/strong&gt; Tasks naturally fill available capacity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Small-to-medium service (5-20 tasks):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 tasks spread across 10 instances&lt;/li&gt;
&lt;li&gt;Each instance: 1 task (10-20% utilization)&lt;/li&gt;
&lt;li&gt;Spread strategy forces massive over-provisioning&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Problem severe:&lt;/strong&gt; 80-90% of resources wasted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; In practice, small services typically run in small clusters (one or a few services per cluster), so "small service" and "small cluster" often refer to the same deployment pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cost impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spot savings: 60% reduction = $400/month saved&lt;/li&gt;
&lt;li&gt;Over-provisioning penalty: 8 idle instances = $600/month wasted&lt;/li&gt;
&lt;li&gt;Net result: &lt;strong&gt;higher total cost than simply running on-demand&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Organizations running &lt;strong&gt;small-to-medium clusters&lt;/strong&gt; (the majority of microservices deployments) face a dilemma:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option A:&lt;/strong&gt; Accept frequent alarms and occasional customer-facing incidents (operational burden)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Option B:&lt;/strong&gt; Over-provision instances for resilience (eliminates cost savings)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Option C:&lt;/strong&gt; Revert to on-demand instances (forfeit 60% savings opportunity)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these options are satisfactory for small-to-medium workloads. Let's analyze this technical challenge in detail and explore how different orchestration platforms handle this scale-dependent problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Impossible Triangle" (For Small-to-Medium Clusters)
&lt;/h3&gt;

&lt;p&gt;This operational challenge can be visualized as an optimization problem with three competing objectives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;       Spot Resilience
    (minimize alarm fatigue
     &amp;amp; customer impact)
            /\
           /  \
          /    \
         /      \
        /        \
       /          \
      /____________\
Cost           Auto-Scaling
Efficiency     (5-20 tasks)

Challenge: Optimize for all three simultaneously
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important context:&lt;/strong&gt; This problem is scale-dependent. Large services (50+ tasks) naturally solve this triangle—enough tasks to both spread across instances AND utilize resources efficiently. The dilemma is specific to &lt;strong&gt;small-to-medium clusters&lt;/strong&gt; where individual services have 5-20 tasks, representing the majority of modern microservice deployments.&lt;/p&gt;

&lt;p&gt;In practice, organizations discover that container orchestration platforms force trade-offs between these objectives for smaller services. Achieving all three requires either platform-specific workarounds or architectural capabilities that some platforms simply don't provide.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS ECS: Exploring Placement Strategies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Approach 1: Maximum Spread Strategy (Solves Alarms, Destroys Budget)
&lt;/h3&gt;

&lt;p&gt;The most straightforward approach to eliminating alarm fatigue is maximizing task distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"serviceName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"api-service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"desiredCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"capacityProviderStrategy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"capacityProvider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"spot-asg-provider"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"placementStrategy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"spread"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"instanceId"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Behavior:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ECS places 1 task per instance (maximum distribution)&lt;/li&gt;
&lt;li&gt;Capacity Provider provisions 10 instances for 10 tasks&lt;/li&gt;
&lt;li&gt;Each instance: ~10-20% resource utilization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; $250/month (10 × m5.large spots @ $25/month)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Operational impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Spot termination affects only 1 task (10% capacity loss)&lt;/li&gt;
&lt;li&gt;✅ No Pingdom alerts: Service handles loss gracefully&lt;/li&gt;
&lt;li&gt;✅ Minimal 5xx error spikes: 90% of capacity remains available&lt;/li&gt;
&lt;li&gt;✅ CloudWatch alarms stay quiet: Task replacement happens within normal thresholds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cost impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Resource utilization: 10-20% per instance (80-90% waste)&lt;/li&gt;
&lt;li&gt;❌ Over-provisioning: 8-9 instances running mostly idle&lt;/li&gt;
&lt;li&gt;❌ Scale-down lag: ASG retains instances during low-demand periods&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Net cost higher than on-demand baseline&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The paradox:&lt;/strong&gt; This configuration solves the operational problem (no alarms, no incidents) but negates the entire financial justification for using spot instances in the first place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 2: Binpack Strategy (Saves Money, Triggers Alarms)
&lt;/h3&gt;

&lt;p&gt;To reclaim cost efficiency, the next approach focuses on resource utilization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"placementStrategy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"spread"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"attribute:ecs.availability-zone"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"binpack"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"memory"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Behavior:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ECS spreads across availability zones, then binpacks within each zone&lt;/li&gt;
&lt;li&gt;Capacity Provider provisions 3 instances for 10 tasks&lt;/li&gt;
&lt;li&gt;Each instance: 70-80% resource utilization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; $75/month (3 × $25/month)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Task distribution:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Instance 1 (spot): 4 tasks
Instance 2 (spot): 3 tasks
Instance 3 (spot): 3 tasks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cost impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Resource utilization: 70-80% (efficient)&lt;/li&gt;
&lt;li&gt;✅ Spot savings realized: ~60% vs on-demand&lt;/li&gt;
&lt;li&gt;✅ Auto-scaling works: Capacity Provider adjusts instance count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Operational impact:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ &lt;strong&gt;Spot termination blast radius: 30-40% capacity loss&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;❌ Pingdom alerts fire: 5xx error rate spikes above threshold&lt;/li&gt;
&lt;li&gt;❌ CloudWatch alarms trigger: "Service degraded - insufficient healthy tasks"&lt;/li&gt;
&lt;li&gt;❌ Recovery lag: 3-5 minutes for new instance + task startup&lt;/li&gt;
&lt;li&gt;❌ Customer complaints: Brief but noticeable service interruptions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The incident pattern:&lt;/strong&gt; When Instance 1 terminates (a daily occurrence), 4 tasks disappear simultaneously. The remaining 6 tasks must absorb 100% of traffic, causing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Response time degradation (overload)&lt;/li&gt;
&lt;li&gt;Connection timeouts (queue saturation)&lt;/li&gt;
&lt;li&gt;5xx errors (backend unavailable)&lt;/li&gt;
&lt;li&gt;PagerDuty/on-call escalation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By the time engineers acknowledge the page, ECS has already recovered. But the alarm fatigue accumulates—multiple times per day, every day.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 3: Capacity Provider targetCapacity
&lt;/h3&gt;

&lt;p&gt;A common misconception is that &lt;code&gt;targetCapacity&lt;/code&gt; controls task distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"capacityProvider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-asg-provider"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"managedScaling"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"targetCapacity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Reality:&lt;/strong&gt; &lt;code&gt;targetCapacity&lt;/code&gt; determines the cluster utilization threshold for triggering scale-out, not how tasks are distributed across instances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Behavior:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;targetCapacity: 100 = Scale when cluster reaches 100% capacity&lt;/li&gt;
&lt;li&gt;targetCapacity: 60 = Scale when cluster reaches 60% capacity (maintains 40% headroom)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With a binpack strategy, tasks still concentrate on fewer instances. Lower targetCapacity provisions more instances but doesn't change the distribution pattern—the additional instances remain underutilized.&lt;/p&gt;
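
&lt;p&gt;A worked example of what &lt;code&gt;targetCapacity&lt;/code&gt; actually controls (a sketch based on ECS managed scaling's &lt;code&gt;CapacityProviderReservation&lt;/code&gt; metric):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CapacityProviderReservation = (instances needed / instances running) × 100

targetCapacity: 100 → tasks require 3 instances → ASG settles at 3 (3/3 = 100%)
targetCapacity: 60  → tasks require 3 instances → ASG settles at 5 (3/5 = 60%)

Result: more headroom, but the binpacked tasks still sit on the same 3 instances.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;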

&lt;h2&gt;
  
  
  Common ECS Workarounds
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Workaround 1: Small Instance Types
&lt;/h3&gt;

&lt;p&gt;Use instance types with limited capacity to physically constrain task density:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"placementStrategy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"spread"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"instanceId"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ASG&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Configuration&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Instance&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;type:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;t&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="err"&gt;g.small&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="err"&gt;GB&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;RAM)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Task&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;memory&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;requirement:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;GB&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Physical&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;limit:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;tasks&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;per&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;instance&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;maximum&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 tasks → 5 instances required (2 tasks each)&lt;/li&gt;
&lt;li&gt;Cost: 5 × $5/month = $25/month&lt;/li&gt;
&lt;li&gt;Blast radius: 20% (acceptable for most use cases)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; This approach uses physical constraints as a proxy for scheduling policy, which feels architecturally inelegant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; For small ECS clusters, this workaround effectively balances cost efficiency and spot protection. However, this raises a broader architectural question: &lt;strong&gt;should clusters use many small instances or fewer large instances?&lt;/strong&gt; That debate involves considerations around bin-packing efficiency, operational overhead, blast radius philosophy, and AWS service limits—topics beyond the scope of this discussion. For the specific problem of spot resilience in small services, small instance types provide a pragmatic solution regardless of overall cluster architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Workaround 2: Hybrid On-Demand + Spot
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"capacityProviderStrategy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"capacityProvider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"on-demand-provider"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"base"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"capacityProvider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"spot-provider"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"base"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First 3 tasks on on-demand instances (not subject to spot interruption)&lt;/li&gt;
&lt;li&gt;Tasks 4-10 on spot instances (cost-optimized)&lt;/li&gt;
&lt;li&gt;Spot termination affects only 10-30% of capacity&lt;/li&gt;
&lt;li&gt;Base capacity remains stable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On-demand: 3 instances × $50/month = $150/month&lt;/li&gt;
&lt;li&gt;Spot: 2-4 instances × $15/month = $30-60/month&lt;/li&gt;
&lt;li&gt;Total: $180-210/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; Higher baseline cost for improved reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alternative: Kubernetes Addresses This Naturally
&lt;/h2&gt;

&lt;p&gt;Other container orchestration platforms handle this problem differently. Kubernetes, for example, provides &lt;code&gt;topologySpreadConstraints&lt;/code&gt;, which bound how unevenly pods may be distributed across nodes and thereby cap per-node density:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;topologySpreadConstraints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;maxSkew&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;# Max 2 pods per node&lt;/span&gt;
    &lt;span class="na"&gt;topologyKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes.io/hostname&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simple configuration achieves all three objectives for small-to-medium clusters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Spot resilience:&lt;/strong&gt; 20% blast radius (2 pods per node)&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Cost efficiency:&lt;/strong&gt; 5 nodes instead of 10 (50% reduction)&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Auto-scaling:&lt;/strong&gt; Cluster autoscaler adjusts node count dynamically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;maxSkew&lt;/code&gt; parameter bounds how unevenly pods may be distributed across nodes (1, 2, 5, etc.), and combined with the cluster autoscaler it effectively controls per-node density. This enables precise tuning along the resilience-efficiency spectrum, which ECS placement strategies cannot express directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fundamental Architectural Difference
&lt;/h2&gt;

&lt;p&gt;The core issue isn't ECS inadequacy—it's an architectural constraint for small-to-medium clusters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ECS lacks granular per-instance task limits.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Available strategies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;spread&lt;/code&gt; by &lt;code&gt;instanceId&lt;/code&gt; = Tasks balanced evenly across instances, i.e. 1 task per instance until every instance has one (maximum spread, works well for large services)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;binpack&lt;/code&gt; = As many tasks as resources allow (maximum density)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;spread&lt;/code&gt; by &lt;code&gt;AZ&lt;/code&gt; + &lt;code&gt;binpack&lt;/code&gt; = Zone distribution, then density (no per-instance control)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For small-to-medium clusters (5-20 tasks per service), these binary options force choosing between over-provisioning (spread) or excessive blast radius (binpack). There's no middle ground to specify "aim for 2-3 tasks per instance."&lt;/p&gt;
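&lt;p&gt;For reference, the two extremes map to service-level placement strategies like these (a sketch of the relevant service-definition fragment; &lt;code&gt;memory&lt;/code&gt; is one common &lt;code&gt;binpack&lt;/code&gt; field, &lt;code&gt;cpu&lt;/code&gt; is the other):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Maximum spread: balance tasks evenly across instances
"placementStrategy": [
  { "type": "spread", "field": "instanceId" }
]

# Maximum density: pack tasks onto the fewest instances
"placementStrategy": [
  { "type": "binpack", "field": "memory" }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;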

&lt;h2&gt;
  
  
  When ECS Remains the Better Choice
&lt;/h2&gt;

&lt;p&gt;Despite these limitations, ECS is often the pragmatic choice when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Large-scale deployments:&lt;/strong&gt; Services running 50+ tasks naturally achieve efficient distribution with spread strategies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple placement requirements:&lt;/strong&gt; Consistent task count, no spot instances, availability zone distribution sufficient&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep AWS integration needed:&lt;/strong&gt; Native IAM roles, ALB/NLB integration, CloudWatch, ECS Exec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team expertise:&lt;/strong&gt; Existing operational knowledge, established runbooks, monitoring dashboards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fargate deployment:&lt;/strong&gt; Serverless container management without EC2 instance overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed control plane:&lt;/strong&gt; No cluster version management, automatic scaling, maintenance-free&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Critical insight:&lt;/strong&gt; The "impossible triangle" primarily affects &lt;strong&gt;small-to-medium clusters (5-20 tasks per service)&lt;/strong&gt;. At larger scales (50+ tasks per service), spread strategies achieve both good distribution and efficient resource usage simultaneously. ECS's simpler model reduces operational complexity for straightforward use cases and scales excellently for high-volume services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale-Dependent Problem:&lt;/strong&gt; The "impossible triangle" primarily affects small-to-medium clusters (5-20 tasks per service). Large services (50+ tasks) naturally achieve both good distribution and efficient resource usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Root Cause:&lt;/strong&gt; ECS lacks granular per-instance task limits—only extreme options exist (1 task/instance spread OR full binpack), with no middle ground.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Practical Workarounds:&lt;/strong&gt; Small instance types (t4g.small) provide the most effective solution, physically limiting task density while maintaining cost efficiency ($25/month vs $250/month).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Platform Limitations:&lt;/strong&gt; Other orchestration platforms provide granular controls that directly address this problem, highlighting an architectural constraint rather than a configuration issue.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The spot instance adoption dilemma reveals a fundamental constraint in ECS's task placement architecture: &lt;strong&gt;the absence of granular per-instance task limits&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The scale-dependent reality:&lt;/strong&gt; For large-scale services (50+ tasks), ECS placement strategies work excellently—tasks naturally distribute across instances while maintaining efficient resource utilization. The "impossible triangle" problem emerges specifically for &lt;strong&gt;small-to-medium clusters&lt;/strong&gt; (5-20 tasks per service) that dominate modern microservice architectures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For these smaller clusters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spread strategy eliminates alarms but destroys cost efficiency&lt;/li&gt;
&lt;li&gt;Binpack strategy saves money but triggers constant operational incidents&lt;/li&gt;
&lt;li&gt;Workarounds exist (small instances, hybrid capacity) but add complexity&lt;/li&gt;
&lt;li&gt;Organizations ultimately choose: accept alarm fatigue OR forfeit spot savings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The broader lesson:&lt;/strong&gt; Container orchestration platforms make architectural trade-offs that favor certain workload profiles. ECS's binary placement options (spread vs binpack) scale well at the extremes—either very large services or services where cost takes priority over operational stability.&lt;/p&gt;

&lt;p&gt;Understanding these platform constraints enables realistic expectations and informed architectural decisions. When evaluating ECS for spot instance deployments, the critical question becomes: &lt;strong&gt;Does your cluster size align with where ECS placement strategies excel?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For small-to-medium clusters, the operational pain of alarm fatigue may ultimately outweigh the promised cost savings—making the spot instance business case less compelling than it initially appears.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Running ECS on spot instances? Struggling with alarm fatigue or over-provisioning? Share your experiences and workarounds in the comments.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Further Reading:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-placement-strategies.html" rel="noopener noreferrer"&gt;AWS ECS Task Placement Strategies&lt;/a&gt; - Official documentation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cluster-capacity-providers.html" rel="noopener noreferrer"&gt;ECS Capacity Providers&lt;/a&gt; - AWS Best Practices&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html" rel="noopener noreferrer"&gt;Amazon EC2 Spot Instance Interruptions&lt;/a&gt; - Understanding spot termination behavior&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Connect with me on LinkedIn:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/rex-zhen-b8b06632/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/rex-zhen-b8b06632/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I share insights on cloud architecture, container orchestration, and SRE practices. Let's connect and learn together!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>devops</category>
      <category>spotinstances</category>
    </item>
    <item>
      <title>AWS Multi-Account Architecture: The Hidden Tradeoffs Everyone Discovers</title>
      <dc:creator>Rex Zhen</dc:creator>
      <pubDate>Tue, 06 Jan 2026 15:49:51 +0000</pubDate>
      <link>https://dev.to/rex_zhen_a9a8400ee9f22e98/aws-multi-account-architecture-the-organizational-chaos-no-one-talks-about-5boe</link>
      <guid>https://dev.to/rex_zhen_a9a8400ee9f22e98/aws-multi-account-architecture-the-organizational-chaos-no-one-talks-about-5boe</guid>
      <description>&lt;h1&gt;
  
  
  AWS Multi-Account Architecture: The Hidden Tradeoffs Everyone Discovers
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;This is a follow-up to my previous article: &lt;a href="https://dev.to/rex_zhen_a9a8400ee9f22e98/aws-sres-first-day-with-gcp-7-surprising-differences-ghd"&gt;AWS SRE's First Day with GCP: 7 Surprising Differences&lt;/a&gt;. I want to dive deeper into one of the most painful organizational challenges I've seen: multi-account architecture.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;If you've managed AWS infrastructure for multiple teams, you know the pattern: Start with a few accounts for environment isolation. Add more for team autonomy. Soon you're managing 30-40 accounts with inconsistent networking patterns, and every stakeholder is compromising.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's what nobody says out loud&lt;/strong&gt;: This isn't a people problem or a process problem. AWS account boundaries force impossible tradeoffs between isolation, simplicity, and cost. The organizational chaos is a feature, not a bug.&lt;/p&gt;

&lt;p&gt;There may be a better way. Let's talk about what's actually happening in the real world first.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real-World Business Requirements
&lt;/h2&gt;

&lt;p&gt;Every organization has these two fundamental requirements:&lt;/p&gt;

&lt;h3&gt;
  
  
  Requirement 1: Environment Isolation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;"Production must be completely isolated from dev/staging/QA"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security: Dev credentials can't access prod data&lt;/li&gt;
&lt;li&gt;Compliance: SOC2, PCI-DSS, HIPAA require environment separation&lt;/li&gt;
&lt;li&gt;Blast radius: Bug in dev shouldn't bring down prod&lt;/li&gt;
&lt;li&gt;Change control: Prod changes need approval, dev doesn't&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ This makes sense. Everyone agrees.&lt;/p&gt;

&lt;h3&gt;
  
  
  Requirement 2: Project Team Autonomy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;"Each project team wants full control, no visibility from other teams"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Team ownership: Frontend team doesn't want backend team touching their resources&lt;/li&gt;
&lt;li&gt;Security: Teams shouldn't see each other's secrets, databases, logs&lt;/li&gt;
&lt;li&gt;Velocity: Teams want to move fast without stepping on each other&lt;/li&gt;
&lt;li&gt;Organizational boundaries: Teams want clear responsibility zones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ This also makes sense. Reasonable request.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Catch: Projects Need to Communicate
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;"But wait... frontend needs to call backend APIs. Backend needs ML service. ML needs data pipeline."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Environment isolation (prod separate from dev)&lt;/li&gt;
&lt;li&gt;✅ Project isolation (teams can't see each other)&lt;/li&gt;
&lt;li&gt;✅ Service communication (teams need to talk)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;These requirements seem compatible. They're not. At least not in AWS.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How This Plays Out in AWS (The Reality)
&lt;/h2&gt;

&lt;p&gt;In AWS, "account" is your isolation boundary. Technically you CAN have fine-grained isolation within an account—using IAM policies, resource tags, and naming conventions—but the complexity is so high it becomes impractical at scale consistently. So organizations face an impossible choice:&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy 1: Account-per-Environment (Most Common)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pattern&lt;/strong&gt;: Each project team gets 4 accounts (prod, staging, QA, dev)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Organization
├── Frontend Team
│   ├── frontend-prod (account)
│   ├── frontend-staging (account)
│   ├── frontend-qa (account)
│   └── frontend-dev (account)
├── Backend Team
│   ├── backend-prod (account)
│   ├── backend-staging (account)
│   ├── backend-qa (account)
│   └── backend-dev (account)
└── ML Team
    ├── ml-prod (account)
    ├── ml-staging (account)
    ├── ml-qa (account)
    └── ml-dev (account)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For 10 teams&lt;/strong&gt;: 40 AWS accounts&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problems&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Account sprawl: 40 accounts to manage&lt;/li&gt;
&lt;li&gt;❌ IAM complexity: Cross-account roles everywhere&lt;/li&gt;
&lt;li&gt;❌ Cost visibility: Splitting bills across 40 accounts&lt;/li&gt;
&lt;li&gt;❌ Service Limits: 40× service quota requests&lt;/li&gt;
&lt;li&gt;❌ Networking hell: How do frontend-prod and backend-prod talk?

&lt;ul&gt;
&lt;li&gt;Option A: VPC Peering (10 teams = 45 peering connections PER environment = 180 total)&lt;/li&gt;
&lt;li&gt;Option B: Transit Gateway ($36 + $360 attachments = $396/month × 4 envs = $1,584/month)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Platform teams drowning in account management, teams complaining about cross-account friction, finance asking why the cloud bill is so high.&lt;/p&gt;
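
&lt;p&gt;A taste of the cross-account IAM friction: every cross-account path needs a trust policy like this one (account ID hypothetical), multiplied across teams and environments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::111111111111:root" },
    "Action": "sts:AssumeRole"
  }]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;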




&lt;h3&gt;
  
  
  Strategy 2: Account-per-Project (Seems Better?)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pattern&lt;/strong&gt;: Each team gets ONE account with all environments inside&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Organization
├── Frontend Account
│   ├── frontend-prod-vpc (in account)
│   ├── frontend-staging-vpc (in account)
│   ├── frontend-qa-vpc (in account)
│   └── frontend-dev-vpc (in account)
├── Backend Account
│   └── All envs in same account
└── ML Account
    └── All envs in same account
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For 10 teams&lt;/strong&gt;: 10 AWS accounts (better!)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problems&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ &lt;strong&gt;Blast radius&lt;/strong&gt;: Junior developer with dev access accidentally deletes prod database (same account = same IAM boundary)&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Compliance failure&lt;/strong&gt;: Auditor asks "How do you prevent dev credentials from accessing prod?" Answer: "We trust our IAM policies..."&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Security team pushback&lt;/strong&gt;: "Why does anyone with dev access have ANY IAM permissions in the same account as prod?!"&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Still need Transit Gateway&lt;/strong&gt;: To connect frontend-account to backend-account

&lt;ul&gt;
&lt;li&gt;Cost: $36 + $360 (10 attachments) = $396/month&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Security team blocks this approach, compliance fails audit, back to Strategy 1.&lt;/p&gt;




&lt;h3&gt;
  
  
  Strategy 3: Mix of Both (What Actually Happens)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Reality&lt;/strong&gt;: Different teams negotiate different patterns based on their priorities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Organization (the actual mess)
├── Frontend Team: "We want control!" → 4 accounts (per-env)
├── Backend Team: "Too many accounts!" → 1 account (all envs)
├── Data Team: "We need compliance!" → 2 accounts (prod separate, non-prod shared)
├── ML Team: "We're new here" → 1 account (all envs)
├── Platform Team: "Shared services?" → 4 accounts (per-env)
└── Legacy Systems: 17 accounts (organic growth over years)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For 10 teams&lt;/strong&gt;: Anywhere from 15-40 accounts, &lt;strong&gt;no consistent pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problems&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ &lt;strong&gt;Organizational chaos&lt;/strong&gt;: Every team has a different structure&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Documentation nightmare&lt;/strong&gt;: "Which account is staging for the payment service?"&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Networking topology unknown&lt;/strong&gt;: VPC peering connections everywhere, some through Transit Gateway, some not&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Onboarding friction&lt;/strong&gt;: New engineers face a steep learning curve understanding the account structure&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Tool proliferation&lt;/strong&gt;: Different deployment tools per team (no standard works for all patterns)&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Cost allocation complexity&lt;/strong&gt;: "How much does staging cost across all teams?" becomes a multi-hour manual exercise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The quarterly meeting that happens&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VP Eng: "Can we standardize our AWS account structure?"
Platform Lead: "Different teams have different requirements."
Security: "Compliance needs prod isolated."
FinOps: "Cost tracking is nearly impossible."
Team A: "Don't touch our accounts, they work!"
Team B: "Can we PLEASE consolidate? We have too many accounts."
VP Eng: "Let's form a working group..."
[Working group meets for 6 months, produces detailed proposal, minimal changes]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  The Root Problem: AWS Account is the Wrong Abstraction
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The fundamental issue&lt;/strong&gt;: AWS account is simultaneously:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Billing boundary&lt;/li&gt;
&lt;li&gt;IAM boundary&lt;/li&gt;
&lt;li&gt;Service quota boundary&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Networking boundary&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can't optimize for all four simultaneously.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Need team isolation? → More accounts → Networking complexity&lt;/li&gt;
&lt;li&gt;Need simple networking? → Fewer accounts → No team isolation&lt;/li&gt;
&lt;li&gt;Need environment isolation? → More accounts → Cost tracking nightmare&lt;/li&gt;
&lt;li&gt;Need cost visibility? → Fewer accounts → Security risk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pick your poison. No one is satisfied.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The "No One is Happy" Reality Check
&lt;/h2&gt;

&lt;p&gt;In practice, this pattern repeats across organizations of all sizes:&lt;/p&gt;

&lt;h3&gt;
  
  
  Development Teams are Unhappy
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;"Cross-account deployments are too slow"&lt;/li&gt;
&lt;li&gt;"Why do I need to assume 3 roles just to debug?"&lt;/li&gt;
&lt;li&gt;"Can't we just put everything in one account?"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Security Team is Unhappy
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;"Teams keep requesting overly broad IAM permissions"&lt;/li&gt;
&lt;li&gt;"How do we effectively audit 40 accounts?"&lt;/li&gt;
&lt;li&gt;"Another team put dev and prod in the same account"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Finance/FinOps is Unhappy
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;"Cost allocation tags aren't propagating correctly"&lt;/li&gt;
&lt;li&gt;"Can someone explain why we have 52 NAT Gateways?"&lt;/li&gt;
&lt;li&gt;"Our AWS bill is 40% networking overhead"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Platform/SRE Team is Unhappy
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;"Debugging cross-account networking takes days"&lt;/li&gt;
&lt;li&gt;"We have 3 different Transit Gateway hubs(maybe more) now"&lt;/li&gt;
&lt;li&gt;"Onboarding a new service takes a week because of account setup"&lt;/li&gt;
&lt;li&gt;"Every team has a different deployment pattern"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Management is Unhappy
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;"Why did this simple feature take 3 sprints?"&lt;/li&gt;
&lt;li&gt;"Our AWS bill grew 40% but we only added 2 new services?"&lt;/li&gt;
&lt;li&gt;"Can someone draw me a diagram of our network architecture?"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Equilibrium: Everyone Compromises
&lt;/h3&gt;

&lt;p&gt;In most organizations, you eventually reach a compromise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security accepts some risk&lt;/li&gt;
&lt;li&gt;Dev teams accept some friction&lt;/li&gt;
&lt;li&gt;Finance accepts some waste&lt;/li&gt;
&lt;li&gt;Platform teams absorb the complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This equilibrium is stable, but no one is happy. It's widely accepted as "the cost of doing business in the cloud."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But how did we get here?&lt;/p&gt;




&lt;h2&gt;
  
  
  A Brief History
&lt;/h2&gt;

&lt;p&gt;Remember when we moved from physical data centers to AWS? System admins from the colocation facilities were blown away. No more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running network cables between racks&lt;/li&gt;
&lt;li&gt;Configuring physical routers and switches&lt;/li&gt;
&lt;li&gt;Waiting weeks for hardware procurement&lt;/li&gt;
&lt;li&gt;Managing VLAN trunks and BGP peering sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AWS was magical.&lt;/strong&gt; Click a button, get a VPC. Define subnets in code. Launch instances instantly.&lt;/p&gt;

&lt;p&gt;The promise: "Infrastructure as code will make everything simple."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fast forward years later&lt;/strong&gt;: We're managing dozens of AWS accounts, debugging cross-account IAM roles and network connections, and having recurring discussions about why and how we need to restructure the account layout every couple of years.&lt;/p&gt;

&lt;p&gt;We eliminated physical network complexity... and replaced it with organizational network complexity.&lt;/p&gt;




&lt;h3&gt;
  
  
  Can AWS Address These Issues?
&lt;/h3&gt;

&lt;p&gt;Technically, yes. AWS has the tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service Control Policies (SCPs) for guardrails&lt;/li&gt;
&lt;li&gt;AWS Organizations for centralized management&lt;/li&gt;
&lt;li&gt;Resource Access Manager (RAM) for subnet sharing&lt;/li&gt;
&lt;li&gt;StackSets for standardized deployments&lt;/li&gt;
&lt;li&gt;Control Tower for account vending&lt;/li&gt;
&lt;/ul&gt;
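
&lt;p&gt;For example, a minimal SCP guardrail might deny all actions outside approved regions (a sketch; real SCPs need carve-outs for global services like IAM and CloudFront):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyOutsideApprovedRegions",
    "Effect": "Deny",
    "Action": "*",
    "Resource": "*",
    "Condition": {
      "StringNotEquals": { "aws:RequestedRegion": ["us-east-1", "eu-west-1"] }
    }
  }]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;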

&lt;p&gt;&lt;strong&gt;But here's the reality&lt;/strong&gt;: Implementing and enforcing these consistently across dozens of accounts over multiple years is extremely difficult. It requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dedicated platform team maintaining complex automation&lt;/li&gt;
&lt;li&gt;Perfect documentation that stays current&lt;/li&gt;
&lt;li&gt;Universal buy-in from all teams&lt;/li&gt;
&lt;li&gt;Continuous enforcement against drift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, organizations accept some imperfection. Teams find workarounds. Standards erode over time. The goal becomes "keep the key features working" rather than "maintain perfect consistency."&lt;/p&gt;

&lt;p&gt;The technical solution exists. The organizational discipline to maintain it long-term often doesn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  There Are Different Approaches
&lt;/h2&gt;

&lt;p&gt;Here's the interesting part: &lt;strong&gt;The multi-account networking problem isn't universal to cloud computing. It's specific to how AWS architected their account model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Other cloud providers approached the isolation problem differently. GCP, for example, has a concept called &lt;strong&gt;Shared VPC&lt;/strong&gt; that addresses these exact requirements architecturally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Environment isolation&lt;/strong&gt;: Separate VPCs for prod/staging/dev (just like AWS)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team autonomy&lt;/strong&gt;: Each team gets their own project with separate billing, IAM, and resource ownership&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service communication&lt;/strong&gt;: Teams share the same VPC but use subnet-level IAM to control access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Transit Gateway needed&lt;/strong&gt;: Firewall rules with network tags handle communication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result? Teams get isolation without networking complexity. No VPC peering mesh. No Transit Gateway. No cross-account IAM gymnastics.&lt;/p&gt;
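
&lt;p&gt;The setup is correspondingly small. A sketch with hypothetical project IDs (the host project owns the network; service projects attach to it):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Designate the host project that owns the Shared VPC
gcloud compute shared-vpc enable network-host-project

# Attach a team's service project to the host
gcloud compute shared-vpc associated-projects add frontend-project \
    --host-project network-host-project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Subnet-level access is then granted per team with &lt;code&gt;roles/compute.networkUser&lt;/code&gt; bindings on individual subnets.&lt;/p&gt;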

&lt;p&gt;&lt;strong&gt;I'm not saying GCP is "better."&lt;/strong&gt; I'm saying AWS's account model forces architectural tradeoffs that other clouds don't require. Understanding this helps contextualize why AWS multi-account architecture feels so complex—because it is, by design.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. AWS Account is the Wrong Abstraction for Team Isolation
&lt;/h3&gt;

&lt;p&gt;AWS accounts are simultaneously: billing boundary, IAM boundary, quota boundary, AND networking boundary. You can't optimize for all four. This architectural decision creates the organizational chaos described above.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. "Best Practices" Often Solve Platform Limitations
&lt;/h3&gt;

&lt;p&gt;Multi-account architecture, Transit Gateway, and cross-account IAM patterns are presented as AWS best practices. But these solve AWS-specific limitations rather than universal infrastructure problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Organizational Complexity Compounds Over Time
&lt;/h3&gt;

&lt;p&gt;Transit Gateway is reliable and well-supported. But consider the organizational cost:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Onboarding friction for new teams&lt;/li&gt;
&lt;li&gt;Debugging difficulty across accounts&lt;/li&gt;
&lt;li&gt;Documentation that becomes outdated&lt;/li&gt;
&lt;li&gt;Tool proliferation (different teams, different patterns)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The technical solution works. The organizational cost remains.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Question Your Isolation Requirements
&lt;/h3&gt;

&lt;p&gt;AWS culture emphasizes: "Isolate everything!"&lt;/p&gt;

&lt;p&gt;Sometimes necessary. Often overkill. Teams in the same environment typically SHOULD share infrastructure. Over-isolation creates complexity without proportional security benefit.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Compare Architectural Approaches Across Clouds
&lt;/h3&gt;

&lt;p&gt;If you're starting a new organization or reevaluating your infrastructure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand how different clouds solve isolation differently&lt;/li&gt;
&lt;li&gt;Don't assume AWS patterns are universal requirements&lt;/li&gt;
&lt;li&gt;Consider whether your complexity comes from business needs or platform limitations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The goal isn't to abandon AWS.&lt;/strong&gt; The goal is to understand which problems are inherent to your business vs which are artifacts of your platform choice.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building multi-cloud infrastructure? Learning about cloud networking patterns? Share your experiences and questions in the comments.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article is part of a series exploring practical cloud architecture patterns and comparing approaches across AWS, GCP, and Azure.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Connect with me on LinkedIn:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/rex-zhen-b8b06632/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/rex-zhen-b8b06632/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I share insights on cloud architecture, SRE practices, and multi-cloud engineering. Let's connect and learn together!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>gcp</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
