Olivia Parker

Node.js Microservices in Production: Mistakes We Made So You Don't Have To

We were really sure about this one. It's funny to think about now.

The big system we had was becoming a problem. Changes took too long, the code was a mess, and when we tried to fix one thing something else would break. Our team was getting bigger, and people were getting in each other's way.

Breaking it down into services seemed like the best solution. Using Node.js made sense for what we were trying to build. We already knew JavaScript, so that was a plus. The team agreed with the plan. We had everything figured out.

What we didn't have was a realistic picture of what running distributed services in production actually feels like. Not in a tutorial. Not in a staging environment with fake traffic. In production, with real users, real load, and real failures happening at the worst possible times.

Some of these mistakes are embarrassing. All of them were expensive. Here they are.

We Split Everything. Way Too Much.

The first few weeks felt productive. We were drawing service boundaries, naming things, feeling architectural. User service. Auth service. Notification service. Email service — separate from notifications because emails felt different enough to justify it at the time. Looking back I'm not sure what we were thinking with that one.

By the time we had fifteen services, something had shifted. Deployments were more frequent but also more nerve-wracking, not less. A simple feature that touched user data and triggered a notification now required coordinating changes across three repositories, three deployment pipelines, and careful sequencing to avoid breaking things mid-deploy. The team was spending more time managing the services than building product.

What we'd built wasn't really microservices. It was a distributed monolith — all the operational overhead of separate services, none of the actual independence. Service boundaries that look clean on a whiteboard but in practice mean every meaningful operation requires four network hops and the failure of any one service takes down the user-facing feature entirely.

The lesson, which took longer than it should have to internalize: split services when you have a concrete operational reason. Different scaling requirements. Genuinely separate team ownership. Different deployment cadences driven by actual need. Not because the concept feels separable in a design meeting.

Everything Was Synchronous and We Didn't Notice Until It Hurt

REST calls between services felt natural. Service A needs something from Service B, it asks, it waits, it gets an answer. This is how we think about function calls. It felt like the same thing with a network in between.

It isn't the same thing.

When Service B got slow — not broken, just slow, maybe 800ms response times under load instead of 50ms — everything waiting on it got slow. Service A got slow. The services waiting on Service A got slow. Eventually something at the top of the chain started timing out and users started seeing errors. A performance problem in one service had become an outage that looked, from the outside, like the whole system was down.

We added message queues eventually. For operations where the caller genuinely doesn't need to wait for a result, events made the coupling disappear. Service A publishes something happened. Service B processes it when it can. Service A has moved on. A slow or temporarily unavailable downstream service stops being a crisis and starts being a queue that catches up when things recover.
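The shape of that pattern, as a minimal sketch. The post doesn't name a broker, so assume RabbitMQ via amqplib here; the queue name, event payload, and connection URL are illustrative.

```js
// publisher — Service A: emit the event and move on.
const amqp = require('amqplib');

async function publishUserSignedUp(user) {
  // In a real service you'd reuse the connection and channel, not open one per event.
  const conn = await amqp.connect('amqp://localhost');
  const ch = await conn.createChannel();
  await ch.assertQueue('user.signed_up', { durable: true });
  // Persistent message: survives a broker restart while it waits to be consumed.
  ch.sendToQueue('user.signed_up', Buffer.from(JSON.stringify(user)), { persistent: true });
  await ch.close();
  await conn.close();
}

// consumer — Service B: processes events at its own pace. If it's down or slow,
// messages sit in the queue instead of turning into timeouts upstream.
async function consumeUserSignedUp(handle) {
  const conn = await amqp.connect('amqp://localhost');
  const ch = await conn.createChannel();
  await ch.assertQueue('user.signed_up', { durable: true });
  ch.consume('user.signed_up', async (msg) => {
    if (msg === null) return;
    await handle(JSON.parse(msg.content.toString()));
    ch.ack(msg); // acknowledge only after successful processing
  });
}
```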

Should have done this earlier. Much earlier. The synchronous-by-default approach felt simpler and in the short term it was. In production it was the source of most of our worst incidents.

Four Days Debugging a Production Issue With No Distributed Tracing

I want to be specific about this one because the memory is still unpleasant.

We had logging. Good logging, or so we thought. Each service logged what it was doing. But when a weird failure started happening — intermittent, only under load, only affecting a specific subset of users — tracking it down meant manually pulling logs from multiple services and trying to correlate them by timestamp. Without a shared request ID propagated across service boundaries, you're essentially guessing which log entries belong to the same request.

It took four days. The actual bug, once found, was fixed in about forty minutes.

We set up proper distributed tracing the following week. OpenTelemetry, correlation IDs on every request, the full picture. The next incident of similar complexity took three hours to resolve instead of four days. That's the ROI. It's not subtle.
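The setup is less work than it sounds. A rough sketch of the per-service side, assuming the OpenTelemetry Node SDK and an OTLP-compatible collector at the default local endpoint; the service name and collector URL are placeholders.

```js
// tracing.js — load before the rest of the app, e.g. `node -r ./tracing.js server.js`
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'user-service', // placeholder; set per service
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // assumed collector endpoint
  }),
  // Auto-instruments http, express, and other common libraries, and propagates
  // trace context on outgoing requests so spans line up across services.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

Once every service loads something like this, the cross-service correlation happens through propagated trace context instead of the timestamp guesswork we'd been doing by hand.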

Build this before you need it. The instrumentation work is not exciting. It also saves you from the specific kind of suffering that comes from debugging a distributed system blind.

Services Sharing Database Tables

This one we inherited partially from the monolith migration and partially from shortcuts taken under deadline pressure. Some services just read directly from tables that were logically owned by other services because it was faster.

It's always faster until it isn't.

Two services depending on the same database schema means changing that schema requires coordinating both services. What should be a straightforward migration becomes a multi-step deployment with careful sequencing. Add a third service reading from the same tables and you've created a coordination problem that grows worse over time, not better.

And the insidious part is it doesn't feel bad immediately. The shortcuts work. The feature ships. The problem only becomes visible months later when someone needs to change the schema and discovers four services are reading from it.

A service owns its data. Other services get that data through the owning service's API. This is slower to build and cleaner to operate. The tradeoff is obvious, and the right answer is still the API boundary every time.
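What that looks like in practice, as a minimal sketch; the route, table, and service URL are made up for illustration.

```js
// user-service: the only process that touches the users table.
const express = require('express');
const { Pool } = require('pg');

const pool = new Pool(); // reads connection settings from PG* env vars
const app = express();

app.get('/users/:id', async (req, res) => {
  const { rows } = await pool.query(
    'SELECT id, email, name FROM users WHERE id = $1',
    [req.params.id]
  );
  if (rows.length === 0) return res.status(404).end();
  res.json(rows[0]);
});

app.listen(3000);

// Any other service: no SQL against the users table, just a call to the owner.
// (fetch is global in Node 18+.)
async function getUser(id) {
  const response = await fetch(`http://user-service/users/${id}`);
  if (!response.ok) throw new Error(`user-service returned ${response.status}`);
  return response.json();
}
```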

We Added Circuit Breakers After the Incident, Not Before

There was a specific afternoon I remember where a downstream service started responding in four to five seconds instead of the usual 50 milliseconds. We weren't calling it that frequently. Shouldn't have been a big deal.

Except our connection pool filled up with requests waiting on those slow responses. Our service started queuing. Response times climbed. The service upstream from us started seeing timeouts. Within twenty minutes we had three services in a degraded state and users experiencing failures across features that had nothing to do with the slow downstream service.

Circuit breakers would have contained this. When a downstream service is slow or failing, a circuit breaker starts failing fast — returning an error immediately instead of waiting. Your service stays healthy. You handle the degraded dependency intentionally. The failure doesn't propagate.

opossum is the library we eventually used. Implementation took a day. The incident it would have prevented took a week to fully recover from, in terms of user trust and internal postmortems. That math isn't complicated.
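For reference, the shape of it; the timeouts, thresholds, and downstream URL here are illustrative, not the values we ran with.

```js
const CircuitBreaker = require('opossum');

// The call we want to protect: a request to the slow downstream service.
async function fetchProfile(userId) {
  const res = await fetch(`http://profile-service/users/${userId}`); // placeholder URL
  if (!res.ok) throw new Error(`profile-service returned ${res.status}`);
  return res.json();
}

const breaker = new CircuitBreaker(fetchProfile, {
  timeout: 1000,                // treat anything slower than 1s as a failure
  errorThresholdPercentage: 50, // open the circuit when half of recent calls fail
  resetTimeout: 10000,          // after 10s, let a trial request through
});

// While the circuit is open, callers get this immediately instead of queuing
// behind a dependency that isn't coming back any time soon.
breaker.fallback((userId) => ({ userId, profile: null, degraded: true }));

breaker.on('open', () => console.warn('profile-service circuit opened'));

// Callers go through the breaker instead of calling fetchProfile directly:
// const profile = await breaker.fire(userId);
```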

Hiring People Who Knew Node.js But Hadn't Run Services in Production

This is the one that doesn't show up in architecture retrospectives and probably should.

There's a real difference between a Node.js engineer who has built services and a Node.js engineer who has operated services. The building part is learnable quickly. The operating part — understanding what happens when services fail in combination, how to debug a distributed system under pressure, how to think about resilience and failure modes during design rather than after incidents — that comes from having been through it.

When you hire Node.js developers for a microservices environment specifically, the interview questions that matter are operational ones. What's the hardest production incident you've debugged? How do you think about failure modes when you're designing a service? What does a good health check actually check? Have you dealt with a cascading failure and what did you learn from it?

Engineers who have lived these situations bring instincts that are genuinely hard to develop without the experience. The team composition question matters as much as the technical architecture.

The Thing Nobody Tells You

Microservices move the complexity out of the code and into the infrastructure and the coordination. The code gets simpler — each service does less, is easier to reason about in isolation. But the system as a whole becomes harder to understand, harder to debug, and harder to change safely.

That's not an argument against microservices. It's an argument for going in with eyes open.

The teams that run microservices well aren't smarter than the ones that struggle. They've just built the operational discipline — tracing, circuit breakers, async communication, clear ownership, sensible service boundaries — into the system from the start rather than adding it reactively after things break.

Most of what's in this post we learned the reactive way. You don't have to.

If you're scaling a Node.js microservices architecture and need engineers who've already been through some version of this, working with Hyperlink InfoSystem to hire Node.js developers with real production exposure is a different conversation than sourcing engineers who've only built services in isolation. The operational experience shows up immediately on a team that needs it.
