The first time things broke in production, I didn’t even know where to look.
Not “figuratively lost.” I mean literally.
Pods were restarting. Logs were half there. Metrics were… somewhere.
And I was SSH’ing into nodes like it was 2016, trying to understand why a simple Node.js service had decided to disappear mid-traffic.
That was the moment I quietly said to myself:
“Maybe choosing EKS wasn’t the flex I thought it was.”
A few months earlier, the decision felt obvious.
We were starting fresh. New service. Decent traffic expectations. Nothing massive, but not trivial either.
Someone asked:
“ECS or EKS?”
And I jumped in a bit too quickly:
“Let’s go with Kubernetes. Future-proof. Industry standard.”
That word, future-proof, has cost me time more than once.
What I thought I was choosing
At the time, I framed it like this:
- ECS → simple, but limiting
- EKS → powerful, flexible, scalable
So naturally… I chose “powerful.”
What I didn’t realize then is:
Power comes with a tax. And it’s not always visible upfront.
The first few weeks felt fine
Honestly, the initial setup wasn’t even that bad.
- Cluster came up
- Services deployed
- ALB hooked in
- Things were… running
I remember feeling slightly proud. Like I had unlocked some next level infrastructure badge 😅
But that feeling didn’t last long.
The slow creep of complexity
It didn’t hit all at once.
It showed up in small, annoying ways.
Logs weren’t where I expected
With ECS, logs just kind of… show up in CloudWatch.
With EKS, I had to think about:
- Fluent Bit / Fluentd
- Sidecars vs DaemonSets
- Log routing
Nothing impossible. Just… extra decisions.
And every decision added surface area for mistakes.
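To make that concrete, here’s the shape of the usual answer: a Fluent Bit DaemonSet that runs one pod per node, tails container logs, and ships them to CloudWatch. This is a minimal sketch, not our actual config; the namespace, service account, and image tag are all illustrative:

```yaml
# Minimal Fluent Bit DaemonSet sketch (illustrative names, not our real setup).
# One pod per node tails /var/log and forwards to CloudWatch.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit  # assumed to have IAM access to CloudWatch Logs
      containers:
        - name: fluent-bit
          image: public.ecr.aws/aws-observability/aws-for-fluent-bit:stable
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
```

And that’s before the actual Fluent Bit config: parsers, filters, the output block. Each one is another decision.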
Deployments became “events”
A simple deploy wasn’t simple anymore.
One deploy failed because:
- the readiness probe was too aggressive
- pods got restarted before they finished warming up
- the rollout got stuck
The fix was easy after I understood what happened.
But getting there? That took time.
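For what it’s worth, the eventual fix had roughly this shape: give the pod room to warm up before traffic hits it. A minimal sketch of the container’s probe, with an illustrative path, port, and timings (the real numbers depend entirely on your app’s startup behavior):

```yaml
# Sketch of the probe fix; values are illustrative, not what we actually shipped.
readinessProbe:
  httpGet:
    path: /healthz           # assumes the service exposes a health endpoint here
    port: 3000
  initialDelaySeconds: 15    # let the Node.js app finish warming up first
  periodSeconds: 5
  failureThreshold: 3        # tolerate a couple of slow responses before marking unready
```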
What I didn’t realize at the time:
Kubernetes doesn’t fail loudly. It fails… descriptively. And you have to know where to look.
Node-level issues became my problem
This one annoyed me the most.
I had to think about:
- node scaling
- resource fragmentation
- pod scheduling
At one point, we had enough CPU overall… but pods still couldn’t schedule.
Because no single node had enough free capacity to fit one more pod.
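Here’s a made-up version of that math. Say you have three nodes with 4 vCPU each, and each node already runs workloads that leave 1.5 vCPU free. That’s 4.5 vCPU spare across the cluster. But a pod with this (illustrative) request can’t land anywhere, because scheduling happens per node:

```yaml
# Illustrative resource request; with only 1.5 vCPU free on any single node,
# this pod stays Pending even though the cluster has 4.5 vCPU free in total.
resources:
  requests:
    cpu: "2"
    memory: 1Gi
```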
That’s the kind of problem I never had with ECS Fargate.
And honestly… I didn’t want to have it.
The incident that changed my mind
We had a traffic spike. Nothing crazy. Maybe 3x normal load.
Autoscaling kicked in… kind of.
New nodes were coming up. Pods were pending. Some were stuck.
Meanwhile:
- existing pods were overloaded
- latency increased
- a few endpoints started timing out
And I was watching this cascade happen in slow motion.
The worst part?
Everything looked “configured correctly.”
In hindsight, the problem was:
- cluster autoscaler lag
- node provisioning delay
- pod scheduling constraints
Individually, each makes sense.
Together… they create friction exactly when you don’t want it.
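There are known mitigations. The classic one (which we weren’t running, so this is a sketch of the pattern rather than our setup) is overprovisioning: a low-priority deployment of placeholder pods that reserve headroom, and that the scheduler preempts instantly when real pods need the space:

```yaml
# Overprovisioning sketch; names and sizes are illustrative.
# Real pods (priority 0 by default) preempt these placeholders immediately,
# instead of waiting minutes for a new node to come up.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: headroom
spec:
  replicas: 2
  selector:
    matchLabels:
      app: headroom
  template:
    metadata:
      labels:
        app: headroom
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
```

It works. But notice the shape of the solution: the fix for a Kubernetes scaling problem was more Kubernetes.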
That day, I kept thinking:
ECS would’ve handled this more predictably.
Maybe not perfectly. But predictably.
What I misunderstood about ECS
I had dismissed ECS too quickly.
I assumed:
- it wouldn’t scale as well
- it was less flexible
- it was somehow “less serious”
That was ego talking, not experience.
Because in reality:
ECS (especially Fargate) removes entire categories of problems:
- no node management
- no scheduler tuning
- no cluster-level debugging
You give up control, yes.
But you also give up responsibility.
And sometimes that’s exactly what you want.
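To show what I mean, here’s a minimal CloudFormation sketch of a Fargate task definition. Everything in it is illustrative: the account, image, log group, and execution role are assumptions, not our real stack. The point is what’s absent: no node groups, no DaemonSets, no log pipeline to build:

```yaml
# Fargate task definition sketch; all names and values are illustrative.
Parameters:
  TaskExecutionRoleArn:
    Type: String              # assumes an execution role already exists
Resources:
  AppTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      RequiresCompatibilities: [FARGATE]
      NetworkMode: awsvpc
      Cpu: "512"
      Memory: "1024"
      ExecutionRoleArn: !Ref TaskExecutionRoleArn
      ContainerDefinitions:
        - Name: app
          Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/app:latest
          PortMappings:
            - ContainerPort: 3000
          LogConfiguration:
            LogDriver: awslogs          # logs "just show up" in CloudWatch
            Options:
              awslogs-group: /ecs/app
              awslogs-region: us-east-1
              awslogs-stream-prefix: app
```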
The trade-offs are real (and uncomfortable)
I’m not saying EKS is bad. It’s not.
In fact, there are cases where I’d absolutely choose it again:
- multi-cloud strategy
- heavy Kubernetes ecosystem usage
- custom controllers / operators
- deep networking requirements
But here’s the uncomfortable truth:
Most projects don’t need that level of control.
Mine didn’t.
And I paid for that mismatch with time, complexity, and a few stressful evenings.
What I’d do differently now
If I were making that decision again, I’d ask a very different question.
Not:
“What’s more powerful?”
But:
“What problems do I actually want to own?”
Because that’s what this decision really is.
With EKS, you own:
- cluster behavior
- scheduling quirks
- scaling edge cases
With ECS, you give that up.
And honestly… I’d start with ECS now.
Especially if:
- the team is small
- infra isn’t the product
- speed matters more than flexibility
I’d move to EKS only when ECS starts getting in the way.
Not before.
The part I didn’t expect
The hardest part wasn’t learning Kubernetes.
It was unlearning the idea that “more control = better engineering.”
It doesn’t.
Sometimes better engineering is:
- fewer moving parts
- fewer decisions
- fewer things that can break at 2 AM
Final thought
I don’t regret learning EKS.
But I do regret choosing it too early.
There’s a difference.
And if you’re at that crossroads right now, trying to decide…
Just remember:
You’re not choosing a tool.
You’re choosing a set of problems.
Pick carefully 🙂