
Vivian Voss

Posted on • Originally published at vivianvoss.net

Service Mesh: The Sidecar Tax

Cover image: a split balance sheet. On the left, in green, THE PITCH lists six service-mesh promises with ticks: mTLS between every service, observability, traffic splitting, zero-code retries and timeouts, language independence, the Kubernetes-native way. On the right, in red, THE INVOICE lists twelve actual costs with crosses, from a dozen-plus CRDs and 166 percent mTLS latency overhead to 60 MB of RAM per sidecar, a dedicated platform team, and Istio's own ambient-mode admission. Below, a pair of scales tilts firmly to the red side.

The Invoice — Episode 19

"mTLS, observability, traffic management, zero-code retries. You need a service mesh."

Splendid. Let us examine what one is actually paying for.

A service mesh moves cross-cutting concerns (mTLS, retries, timeouts, traffic shifting, observability) out of application code and into a proxy that sits beside each pod. Istio, the archetype, launched in 2017 as a joint project of Google, IBM, and Lyft. It graduated within the CNCF in July 2023. In the 2024 CNCF Annual Survey, service-mesh adoption across respondents fell to 42 percent, down from 50 percent the year before. That is not a catastrophe. It is, however, the first full-year decline the category has ever posted.

The Complexity Invoice

Istio ships over a dozen primary custom resource definitions across three categories (traffic management, security, telemetry) and dozens more through its operator, telemetry plugins, wasm extensions, and gateway APIs. A minimally useful installation comprises:

  • A control plane (istiod) responsible for configuration distribution, certificate issuance, and serving the xDS API to every sidecar
  • A per-pod sidecar (Envoy) injected into every workload, running a second container alongside the application
  • An ingress gateway at the cluster edge, usually another Envoy in a standalone pod
  • mTLS certificates rotated by istiod and distributed via SDS to each sidecar
  • Policy resources (PeerAuthentication, RequestAuthentication, AuthorizationPolicy)
  • Telemetry bindings (Telemetry CRDs) to send traces and metrics to external collectors
  • A platform team that knows what each of those does, how they interact, and how to debug any given failure mode

The CNCF's own reports describe Istio as mature, powerful, and "operationally demanding". The last of those is the adjective to watch. Installing Istio in a fresh cluster takes a senior SRE about two days. Operating it for six months takes roughly 0.5 to 1.0 FTE, scaling upwards with cluster size. Debugging it at 3 a.m. is a skill one acquires by losing two nights of sleep and one customer.

The Latency Invoice

Every inter-service HTTP or gRPC call now traverses two Envoy proxies: the caller's sidecar, then the callee's. Adding two proxies to every request path means adding latency, and how much is now well measured.

A 2025 peer-reviewed performance comparison from the DeepNess Lab ("Performance Comparison of Service Mesh Frameworks: the mTLS Test Case") measured the overhead with mTLS enforced on otherwise identical workloads. The results:

Mesh                    mTLS overhead vs. baseline
Istio (sidecar mode)    +166%
Cilium                  +99%
Linkerd                 +33%
Istio (ambient mode)    +8%

The headline number (+166 percent for Istio sidecar with mTLS) is surprising only to people who have never read the benchmark. On a 10 ms baseline call, +166 percent overhead means roughly 27 ms. Envoy is fast; two Envoys in the path plus TLS handshakes and certificate validation are not free. Linkerd's Rust-based linkerd2-proxy is measurably lighter because it was built for the job, not adapted to it. Ambient mode, introduced in Istio 1.23 (August 2024), replaces per-pod sidecars with a shared node-level ztunnel and produces dramatically less overhead. Ambient is, in effect, Istio's own public admission that the sidecar model had a problem it could not solve by optimisation alone.

A sidecar also costs memory. The Istio 1.24 documentation reports approximately 60 MB of RAM and 0.20 vCPU per Envoy sidecar at 1,000 HTTP RPS with 1 KB payloads. A cluster with 1,000 pods is therefore paying roughly 60 GB of RAM and 200 vCPU for the mesh before a single byte of application code has executed. Ambient ztunnels are smaller (approximately 12 MB RAM, 0.06 vCPU each) but you now also pay for waypoint proxies where L7 features are enabled. Either way, the total is non-zero.

The Debugging Invoice

When the mesh works, it is invisible. When it does not, the request path has doubled, and so has the surface area for bugs. A 500 that arrives at the client might originate in:

  • The application code itself
  • The caller's Envoy (wrong upstream cluster, circuit breaker tripped)
  • The destination's Envoy (connection limits, bad cert rotation)
  • A misconfigured VirtualService or DestinationRule
  • The mTLS trust chain (expired intermediate, wrong trust domain)
  • istiod failing to push updated configuration within the retry window
  • A wasm plugin throwing an exception
  • A Kubernetes NetworkPolicy quietly dropping the packet

The distributed tracing one installed to understand the mesh is now required to understand the mesh. Troubleshooting skills become mesh-specific skills, which means they do not transfer and do not scale with engineer headcount in the obvious way.

The Honest Case For

Service meshes solve a real problem for a real set of operators. If you:

  • Run more than roughly 100 microservices with cross-team ownership
  • Have strict compliance that mandates mTLS between every internal service
  • Operate across multiple clusters or multiple clouds with incompatible primitives
  • Need uniform observability across polyglot services that cannot ship an OTel library

then the tax starts to pay for itself. Everyone else: you are paying for Google's architecture to solve problems you do not, in fact, have.

The Alternative

Direct HTTP or gRPC calls between services, over a network one already trusts. This is how the internet worked for three decades before sidecars existed.

mTLS terminated at a single ingress gateway (HAProxy, NGINX, Envoy itself, or one's load balancer of choice), because the VPC was a trust boundary before sidecars were a marketing category. Plaintext internal traffic inside the VPC is fine for the vast majority of workloads; mTLS between services is a compliance requirement for a minority of them, not an architectural necessity for all of them.
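
A minimal sketch of that edge in Go, assuming a hypothetical internal upstream at orders.internal:8080 and certificate files edge.crt and edge.key: TLS terminates once at the gateway, and everything behind it speaks plain HTTP inside the trust boundary.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Plaintext upstream inside the VPC; the hostname is illustrative.
	upstream, err := url.Parse("http://orders.internal:8080")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// TLS terminates here, once, at the edge. For mutual TLS, serve from
	// an http.Server whose tls.Config sets ClientAuth to
	// tls.RequireAndVerifyClientCert instead of this package-level helper.
	log.Fatal(http.ListenAndServeTLS(":443", "edge.crt", "edge.key", proxy))
}
```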

Tracing and metrics via an OpenTelemetry library linked into each service. OTel is language-agnostic, vendor-neutral, and five lines of initialisation in most runtimes. It sends traces and metrics via OTLP to any collector. No proxy required.
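
In Go, with the OpenTelemetry SDK and OTLP gRPC exporter modules, those lines look roughly like this; the collector endpoint comes from the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable, so nothing below is mesh- or vendor-specific.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// OTLP over gRPC; reads OTEL_EXPORTER_OTLP_ENDPOINT automatically.
	exporter, err := otlptracegrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Batch spans and register the provider globally, where instrumented
	// libraries will find it.
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	// ... application code runs and emits spans here. No proxy required.
}
```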

Retries and timeouts in the client library (Go's http.Client, Rust's reqwest, Java's RestTemplate or OkHttp, Python's httpx, Node's undici). All of these ship configurable timeouts and connection pools; retries and circuit breaking are either built in or a dozen-line wrapper away. The retry logic that a service mesh claims to provide "without code changes" is three lines of configuration in a mature client, as the sketch below shows.
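
A hedged sketch of what that looks like in Go: the stdlib client provides the timeout and the connection pool, and getWithRetry (a hypothetical helper, not a stdlib function) is the retry loop written out in full.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// One shared client: per-request timeout and connection pooling come free.
var client = &http.Client{Timeout: 2 * time.Second}

// getWithRetry retries on transport errors and 5xx responses with
// exponential backoff. This is the mesh's headline feature, in full.
func getWithRetry(url string, attempts int) (*http.Response, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil // success, or a 4xx that retrying will not fix
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = fmt.Errorf("upstream returned %s", resp.Status)
			io.Copy(io.Discard, resp.Body) // drain so the connection is reused
			resp.Body.Close()
		}
		time.Sleep(time.Duration(100<<i) * time.Millisecond) // 100ms, 200ms, 400ms...
	}
	return nil, fmt.Errorf("%d attempts failed, last error: %w", attempts, lastErr)
}

func main() {
	resp, err := getWithRetry("http://orders.internal:8080/health", 3)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```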

Authorisation at the application layer, because only the application knows what "this user may read this document" means. Delegating authorisation to a proxy is delegating it to a component that does not understand the data.
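
To make that concrete, a sketch of document-level authorisation as Go middleware (Go 1.22+ for the routing patterns); currentUser and ownerOf are hypothetical stand-ins for one's session and data layers. The check itself is the point: no proxy can make it, because no proxy knows who owns which document.

```go
package main

import "net/http"

// Hypothetical stand-ins for the session and data layers.
func currentUser(r *http.Request) string { return r.Header.Get("X-User") }
func ownerOf(docID string) string        { return "alice" } // lookup stub

// authorise enforces "this user may read this document" in the one
// component that understands the data: the application.
func authorise(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if ownerOf(r.PathValue("id")) != currentUser(r) {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("GET /documents/{id}", authorise(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			w.Write([]byte("the document\n"))
		})))
	http.ListenAndServe(":8080", mux)
}
```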

The Pattern

Service mesh is sold as "zero code changes". You get that by paying:

  1. Two proxies of latency on every internal call, measurably more under mTLS
  2. A platform team of overhead to run istiod, gateways, policies, and upgrades
  3. A debugger's worth of new moving parts: VirtualService, DestinationRule, PeerAuthentication, Envoy configuration, trust chains, wasm plugins

All to avoid writing retry logic that any mature HTTP client already provides in three lines of configuration.

The mesh was always, architecturally, a technical solution to a political problem. It existed because microservice teams did not trust each other's code, and a proxy in the middle was a way of enforcing cross-cutting concerns without convincing any one team to adopt them. The proxy became the architecture. The architecture became the operational cost centre. The cost centre produced ambient mode, which is the industry's second try at making sidecars not cost what sidecars cost.

Meanwhile, the original alternative (a library in each service, a trusted network below, and a single ingress gateway at the edge) remained exactly what it has been since approximately 1995.

The direct call was always there. One simply decided it wasn't enterprise enough.

Read the full article on vivianvoss.net →


By Vivian Voss — System Architect & Software Developer. Follow me on LinkedIn for daily technical writing.
