Rajkiran

Posted on Jun 12

System Design - 17. Service Discovery & Service Mesh: How Thousands of Services Find Each Other

#architecture #systemdesign #software #design

Covers: Client-Side vs Server-Side Discovery, Service Registries, Service Mesh (Istio/Envoy), Kubernetes DNS

The Problem That Didn't Exist in the Monolith

In a monolith, calling another module is simple: OrderService.create(data). It's a function call. The compiler resolves the address. It always works (assuming the code compiles).

In microservices, "calling another service" means: where is it, right now, on the network?

This sounds trivial until you consider what's actually happening in a production environment:

Services run on dynamically allocated IPs (containers get new IPs every restart)
Services scale up and down constantly (auto-scaling adds/removes instances every few minutes)
Services deploy multiple times per day (new versions get new instances)
A single logical service might have 50 running instances across multiple servers

Hardcoding IP addresses is unworkable: even a config file of IPs would be stale within minutes. This is the problem service discovery solves.

The Two Models of Service Discovery

Client-Side Discovery

The calling service queries a service registry directly, gets a list of healthy instances, and load-balances between them itself.

Order Service wants to call Payment Service:

1. Order Service → Service Registry: "Where is Payment Service?"
2. Service Registry → returns: [10.0.1.5:8080, 10.0.1.6:8080, 10.0.1.7:8080]
3. Order Service → picks one (round-robin/random) → 10.0.1.6:8080
4. Order Service → calls Payment Service directly at 10.0.1.6:8080

┌──────────────┐     1. "Where's Payment Service?"    ┌──────────────┐
│ Order Service │ ───────────────────────────────────► │   Registry   │
│              │ ◄─────────────────────────────────── │   (Eureka)   │
│              │     2. [list of healthy instances]    └──────────────┘
│              │
│              │     3. Direct call (load-balanced     ┌──────────────┐
│              │ ───── client-side) ──────────────────►│Payment Service│
└──────────────┘                                        │  (instance 2) │
                                                          └──────────────┘
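
The four steps above can be sketched in a few lines. This is a minimal illustration, not Eureka's actual API: InMemoryRegistry and RoundRobinClient are hypothetical names, and the registry is just a dictionary in the caller's process.

```python
import itertools


class InMemoryRegistry:
    """Hypothetical stand-in for a registry such as Eureka."""

    def __init__(self):
        self._instances = {}  # service name -> list of "host:port" strings

    def register(self, service, address):
        self._instances.setdefault(service, []).append(address)

    def lookup(self, service):
        # Return a copy so callers cannot mutate registry state.
        return list(self._instances.get(service, []))


class RoundRobinClient:
    """Caller-side load balancer: looks up instances, then rotates through them."""

    def __init__(self, registry, service):
        self._registry = registry
        self._service = service
        self._counter = itertools.count()

    def next_instance(self):
        instances = self._registry.lookup(self._service)  # steps 1-2
        if not instances:
            raise LookupError(f"no instances registered for {self._service}")
        return instances[next(self._counter) % len(instances)]  # step 3


registry = InMemoryRegistry()
for addr in ("10.0.1.5:8080", "10.0.1.6:8080", "10.0.1.7:8080"):
    registry.register("payment-service", addr)

client = RoundRobinClient(registry, "payment-service")
picks = [client.next_instance() for _ in range(4)]
print(picks)  # the caller would now connect directly to each pick (step 4)
```

Note that the selection logic lives inside the caller. That is exactly what makes this model fast (no intermediary) and also what forces every language in the organization to carry an equivalent client.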

Real example: Netflix Eureka

Every service registers itself with Eureka on startup:

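import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.netflix.eureka.EnableEurekaClient;

@EnableEurekaClient
@SpringBootApplication
public class PaymentServiceApplication {
    // On startup, registers with Eureka: "I'm payment-service, at 10.0.1.6:8080, healthy"
    public static void main(String[] args) {
        SpringApplication.run(PaymentServiceApplication.class, args);
    }
}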

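Other services query Eureka and use Ribbon (Netflix's client-side load balancer) to pick an instance and call it directly. Two caveats for current projects: Ribbon has since been placed in maintenance mode, with Spring Cloud LoadBalancer taking its place in Spring Cloud, and in recent Spring Cloud releases the @EnableEurekaClient annotation is optional because having the Eureka client starter on the classpath is enough to trigger registration.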

Advantages:

No extra network hop (client calls the service directly)
Client has full control over load-balancing strategy

Disadvantages:

Every service needs discovery client logic — coupling every service to the registry's API and SDK
Multi-language environments need a discovery library for each language

Server-Side Discovery

The calling service makes a request to a load balancer, which queries the registry and routes the request. The caller never sees individual instance addresses.

Order Service wants to call Payment Service:

1. Order Service → calls "payment-service.internal" (a fixed name)
2. Load Balancer → queries registry for healthy Payment instances
3. Load Balancer → routes to one instance
4. Response flows back through the Load Balancer to Order Service

┌──────────────┐                    ┌──────────────┐                   ┌──────────────┐
│ Order Service │ ──── "payment-    │ Load Balancer │ ── queries ──────►│   Registry   │
│              │     service" ─────►│   (AWS ALB)   │ ◄── instance list─│   (AWS ECS)  │
└──────────────┘                    └───────┬──────┘                   └──────────────┘
                                              │
                                              ▼
                                     ┌──────────────┐
                                     │Payment Service│
                                     │  (instance 2)│
                                     └──────────────┘

Real example: AWS ALB + ECS

ECS (container orchestration) automatically registers and deregisters tasks with the ALB's target group as they start and stop. The Order Service simply calls a fixed DNS name, payment-service.internal, which resolves to the load balancer, and AWS handles the rest.
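
A minimal sketch of the same idea, assuming an in-process stand-in for the load balancer: TargetGroup, load_balancers, and call are hypothetical names for illustration and do not correspond to any AWS API. The point is that registration and deregistration happen on the platform side while the caller's code never changes.

```python
import random


class TargetGroup:
    """Hypothetical stand-in for a load balancer's set of registered targets."""

    def __init__(self):
        self._targets = set()

    def register(self, address):      # called by the platform when an instance starts
        self._targets.add(address)

    def deregister(self, address):    # called by the platform when an instance stops
        self._targets.discard(address)

    def route(self, rng=random):
        if not self._targets:
            raise RuntimeError("no healthy targets")
        return rng.choice(sorted(self._targets))


# Fixed name -> target group. The caller only ever knows the name.
load_balancers = {"payment-service.internal": TargetGroup()}


def call(service_name):
    """What the caller does: use a fixed name and let the balancer pick."""
    return load_balancers[service_name].route()


group = load_balancers["payment-service.internal"]
group.register("10.0.1.5:8080")
group.register("10.0.1.6:8080")
group.deregister("10.0.1.5:8080")   # instance stopped; caller code is unchanged

chosen = call("payment-service.internal")
print(chosen)
```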

Advantages:

Calling services need zero discovery logic — just call a fixed name
Language-agnostic — works the same for Java, Python, Go, anything
Centralized load-balancing logic, easier to update

Disadvantages:

Extra network hop (through the load balancer)
The load balancer itself must be highly available

Service Registry: The Source of Truth

Whichever model you use, there's a registry maintaining the live list of service instances. Popular implementations:

Consul (HashiCorp):

Service registration via agent on each host
Built-in health checking
DNS and HTTP interfaces for querying
Multi-datacenter support

etcd:

Distributed key-value store (also used as Kubernetes' backing store)
Services write their address to a key; watchers detect changes
Strongly consistent (uses Raft consensus)

ZooKeeper:

One of the oldest solutions (long used by Hadoop, and by Kafka before its move to KRaft, for coordination)
Strong consistency guarantees
More operationally complex than Consul/etcd

The registration lifecycle:

1. Service instance starts up
2. Registers itself: "I'm payment-service-7, at 10.0.1.6:8080, healthy"
3. Periodically sends heartbeats: "still alive"
4. Registry monitors heartbeats
5. If heartbeats stop (instance crashed) → registry marks instance unhealthy
6. After grace period → instance removed from registry entirely

Deregistration on graceful shutdown:
1. Instance receives SIGTERM (shutdown signal)
2. Instance explicitly deregisters from registry FIRST
3. Instance finishes in-flight requests (connection draining)
4. Instance exits
   → Other services stop routing new requests to it immediately,
     rather than waiting for heartbeat timeout (which could take 30+ seconds)

This distinction matters a lot in interviews: a graceful shutdown deregisters the instance instantly, while a crash is only detected once heartbeats time out. The gap between the two determines how quickly your system "heals" after instance churn.
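
The two paths can be compared side by side in a small sketch. HeartbeatRegistry is a hypothetical class, the 30-second timeout mirrors the figure mentioned above, and time is passed in explicitly so the example is deterministic.

```python
class HeartbeatRegistry:
    """Hypothetical registry that expires instances whose heartbeats stop."""

    def __init__(self, timeout_seconds=30.0):
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}  # instance id -> timestamp of last heartbeat

    def heartbeat(self, instance_id, now):
        # Registration and heartbeat are the same operation in this sketch.
        self._last_seen[instance_id] = now

    def deregister(self, instance_id):
        # Graceful shutdown: remove immediately, no timeout involved.
        self._last_seen.pop(instance_id, None)

    def healthy(self, now):
        return sorted(
            instance_id
            for instance_id, seen in self._last_seen.items()
            if now - seen <= self.timeout_seconds
        )


registry = HeartbeatRegistry(timeout_seconds=30.0)
registry.heartbeat("payment-service-7", now=0.0)
registry.heartbeat("payment-service-8", now=0.0)

registry.deregister("payment-service-7")       # graceful shutdown at t=0
after_shutdown = registry.healthy(now=1.0)      # 7 is gone at once

# payment-service-8 crashes right after t=0 and sends no more heartbeats.
still_listed = registry.healthy(now=29.0)       # callers are still routed to it
after_timeout = registry.healthy(now=31.0)      # it drops out only after the timeout
```

During the window between the crash and the timeout, callers keep receiving a dead address, which is why client-side retries and timeouts are still needed even with a well-run registry.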

Kubernetes: Service Discovery Built In

If you're running Kubernetes, you largely don't think about service discovery — it's built into the platform via DNS.

# Define a Service — a stable name for a set of pods
apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  selector:
    app: payment       # Matches pods with label app=payment
  ports:
    - port: 8080

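Any pod in the same namespace can now call:
  http://payment-service:8080
Pods in other namespaces append the namespace, for example:
  http://payment-service.payments:8080   (assuming the Service lives in a namespace named "payments")

Kubernetes DNS (CoreDNS) resolves "payment-service"
→ to the Service's virtual IP (ClusterIP)
→ kube-proxy's rules forward the connection to one of the matching pod IPs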

How it works under the hood:

Kubernetes maintains a list of "Endpoints" (EndpointSlices in current versions) — the actual IPs of ready pods matching the Service's selector
As pods are created/destroyed (scaling, deployments, crashes), the Endpoints list updates automatically
kube-proxy on each node maintains iptables/IPVS rules that load-balance traffic to current Endpoints
DNS resolution + load balancing happens transparently — application code just calls http://payment-service:8080

This is server-side discovery, fully managed by the platform. It's a major reason Kubernetes became the dominant orchestration platform — service discovery, one of the hardest microservices problems, is solved by default.
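
The selector-to-Endpoints step is simple enough to sketch. This is an illustration of the matching rule only, not Kubernetes code: the pod dictionaries, IPs, and the "ready" flag are made-up data, and the readiness filter reflects the default behavior of routing only to pods that pass their readiness checks.

```python
def matches(selector, labels):
    """A pod matches when every selector key/value pair appears in its labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def endpoints_for(selector, pods):
    """Return IPs of ready pods whose labels satisfy the selector."""
    return [
        pod["ip"]
        for pod in pods
        if pod["ready"] and matches(selector, pod["labels"])
    ]


pods = [
    {"ip": "10.244.1.4", "labels": {"app": "payment", "version": "v1"}, "ready": True},
    {"ip": "10.244.2.9", "labels": {"app": "payment", "version": "v2"}, "ready": False},
    {"ip": "10.244.3.2", "labels": {"app": "order"}, "ready": True},
]

selector = {"app": "payment"}    # same selector as the Service definition above
before = endpoints_for(selector, pods)

pods[1]["ready"] = True          # the second payment pod passes its readiness check
after = endpoints_for(selector, pods)
```

Extra labels on a pod (such as version) do not prevent a match; the selector only requires that its own pairs are present.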

Service Mesh: Discovery Is Just the Beginning

Once you have many services, you face a recurring set of cross-cutting problems for every service-to-service call:

How do I discover the target service? (discovery)
Is this connection encrypted? (mTLS)
What if the call fails — retry? How many times?
What if the target is overloaded — circuit break?
How do I trace this request across services?
How do I roll out a new version to 5% of traffic first (canary)?

Implementing all of this inside every service's application code means every team reimplements (or imports a library for) the same logic, in every language they use.

A service mesh moves all of this into infrastructure — typically a sidecar proxy deployed alongside every service instance.

┌─────────────────────────┐     ┌─────────────────────────┐
│   Order Service Pod      │     │  Payment Service Pod      │
│  ┌───────────┐ ┌───────┐│     │┌───────┐  ┌───────────┐  │
│  │   Order    │ │ Envoy ││     ││ Envoy │  │   Payment   │  │
│  │ Container  │◄┤Sidecar├┼─────┼┤Sidecar│◄─┤  Container  │  │
│  └───────────┘ └───────┘│     │└───────┘  └───────────┘  │
└─────────────────────────┘     └─────────────────────────┘
       Application code never talks to network directly —
       Envoy sidecar intercepts ALL traffic in and out

Every request from Order Service to Payment Service actually goes:

Order Container → Order's Envoy sidecar → Payment's Envoy sidecar → Payment Container

The application code is unaware: it just makes a normal HTTP call to a service name such as http://payment-service:8080. Network rules inside the pod redirect that traffic through the sidecar, which handles everything else.

What Istio/Envoy Handles Transparently

mTLS (mutual TLS):
Every connection between services is automatically encrypted and authenticated — without any application code changes. Each service gets a cryptographic identity.

Retries with backoff:

# Istio VirtualService config — no app code changes needed
retries:
  attempts: 3
  perTryTimeout: 2s
  retryOn: 5xx,connect-failure
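
To make the policy concrete, here is roughly what the sidecar does on the application's behalf. This is a simplified sketch, not Envoy's implementation: UpstreamError and call_with_retries are hypothetical names, attempts is read as the number of retries after the first try, the 25 ms doubling delay is an assumed backoff schedule, and the per-try timeout is left out.

```python
class UpstreamError(Exception):
    """Hypothetical error carrying an HTTP-style status code."""

    def __init__(self, status):
        super().__init__(f"upstream returned {status}")
        self.status = status


def call_with_retries(send, attempts=3, base_delay=0.025, sleep=None):
    """Call send(); on a 5xx, retry up to `attempts` more times with doubling delays."""
    delays = []
    for retry in range(attempts + 1):          # 1 original try + `attempts` retries
        try:
            return send(), delays
        except UpstreamError as err:
            retryable = 500 <= err.status < 600
            if not retryable or retry == attempts:
                raise                           # give up: not retryable, or out of retries
            delay = base_delay * (2 ** retry)
            delays.append(delay)
            if sleep is not None:
                sleep(delay)                    # pass time.sleep here to really wait


# Simulated upstream: fails twice with 503, then succeeds.
responses = iter([UpstreamError(503), UpstreamError(503), "200 OK"])


def flaky_send():
    outcome = next(responses)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result, waited = call_with_retries(flaky_send, attempts=3)
```

Retrying is only safe for idempotent operations. A retried payment request needs an idempotency key or similar guard, whichever layer performs the retry.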

Circuit breaking:

trafficPolicy:
  connectionPool:
    http:
      maxRequestsPerConnection: 10
  outlierDetection:
    consecutive5xxErrors: 5
    interval: 30s
    baseEjectionTime: 30s
    # After 5 consecutive 5xx errors, eject this instance for 30s
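
The ejection rule in that comment can be sketched as a small state machine. OutlierDetector is a hypothetical class that mirrors only the two settings shown (five consecutive 5xx errors, 30-second ejection); the real proxy evaluates hosts on a periodic sweep and has more tuning knobs than this.

```python
class OutlierDetector:
    """Ejects a host after N consecutive 5xx responses, for a fixed time window."""

    def __init__(self, consecutive_5xx=5, base_ejection_time=30.0):
        self.consecutive_5xx = consecutive_5xx
        self.base_ejection_time = base_ejection_time
        self._streak = {}         # host -> current run of 5xx responses
        self._ejected_until = {}  # host -> time at which the host may return

    def record(self, host, status, now):
        if 500 <= status < 600:
            self._streak[host] = self._streak.get(host, 0) + 1
            if self._streak[host] >= self.consecutive_5xx:
                self._ejected_until[host] = now + self.base_ejection_time
                self._streak[host] = 0
        else:
            self._streak[host] = 0   # any non-5xx response resets the run

    def available(self, hosts, now):
        return [h for h in hosts if self._ejected_until.get(h, 0.0) <= now]


hosts = ["10.0.1.5:8080", "10.0.1.6:8080"]
detector = OutlierDetector()
for second in range(5):                        # five 503s at t=0..4
    detector.record("10.0.1.6:8080", 503, now=float(second))

during = detector.available(hosts, now=10.0)   # inside the ejection window
later = detector.available(hosts, now=34.0)    # window (t=4 + 30s) has elapsed
```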

Traffic splitting (canary deployments):

http:
  - route:
    - destination:
        host: payment-service
        subset: v1
      weight: 90
    - destination:
        host: payment-service
        subset: v2     # new version
      weight: 10        # 10% of traffic to test the new version
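
Weighted routing boils down to a weighted random choice per request. The sketch below assumes that model (pick_subset is a hypothetical helper, and the proxy's real selection algorithm may differ); it shows that a 90/10 split is a long-run proportion, not a guarantee for any particular ten requests.

```python
import random


def pick_subset(routes, rng):
    """Choose a subset name with probability proportional to its weight."""
    names = [name for name, _ in routes]
    weights = [weight for _, weight in routes]
    return rng.choices(names, weights=weights, k=1)[0]


routes = [("v1", 90), ("v2", 10)]    # same weights as the config above
rng = random.Random(42)              # fixed seed so the run is repeatable
sample = [pick_subset(routes, rng) for _ in range(10_000)]
v2_share = sample.count("v2") / len(sample)
print(f"v2 received {v2_share:.1%} of simulated requests")
```

Because the choice is made per request, one user can hit v1 and v2 on consecutive calls unless routing is also pinned on a header or cookie.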

Distributed tracing:
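Sidecars generate spans for every request and report them to Jaeger/Zipkin without a tracing SDK in the application. One caveat: the application still has to copy the incoming trace headers onto its outgoing requests, otherwise the spans cannot be stitched into a single end-to-end trace.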
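
Header forwarding is the one piece the application generally still owns for tracing to work end to end, since the sidecar cannot know which outbound call belongs to which inbound request. A minimal sketch, assuming an illustrative subset of the commonly used B3 and W3C header names; outgoing_headers is a hypothetical helper, and the exact header list depends on the tracer in use.

```python
# Illustrative subset; check your mesh's documentation for the full list.
TRACE_HEADERS = (
    "x-request-id",
    "traceparent",
    "tracestate",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
)


def outgoing_headers(incoming, extra=None):
    """Copy trace headers from the inbound request onto an outbound one."""
    lowered = {name.lower(): value for name, value in incoming.items()}
    headers = {name: lowered[name] for name in TRACE_HEADERS if name in lowered}
    headers.update(extra or {})
    return headers


incoming = {
    "X-Request-Id": "abc-123",
    "X-B3-TraceId": "463ac35c9f6413ad",
    "Accept": "application/json",      # not a trace header, so not forwarded
}
forwarded = outgoing_headers(incoming, extra={"content-type": "application/json"})
```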

Service Mesh vs API Gateway: The Confusion Cleared Up

These get confused constantly. Here's the clean distinction:

            External traffic
                  │
                  ▼
          ┌──────────────┐
          │  API Gateway  │  ← North-South traffic
          │  (Kong, ALB)  │     (outside world → your cluster)
          └──────┬───────┘
                  │
    ┌─────────────┼─────────────┐
    ▼             ▼             ▼
[Service A]──►[Service B]──►[Service C]
    ↑─────────────↑─────────────↑
         Service Mesh (Istio)     ← East-West traffic
         (service ←→ service,      (inside your cluster)
          all sidecar-mediated)

API Gateway: Handles North-South traffic — requests entering your system from outside (browsers, mobile apps, partner integrations). Concerns: public authentication, rate limiting per API key, request transformation for external contracts.

Service Mesh: Handles East-West traffic — requests between your internal services. Concerns: mTLS, internal retries/circuit breaking, service-to-service authorization, internal observability.

They're complementary, not competing. A request might pass through the API Gateway once (entering the system) and then through the service mesh multiple times (as it's processed by several internal services).

The Cost of a Service Mesh

Service meshes solve real problems, but they're not free:

Resource overhead: Every pod now runs an extra sidecar container — additional CPU/memory per service instance. At thousands of pods, this is a meaningful infrastructure cost.

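Latency overhead: Every call now passes through two sidecars (sender's and receiver's) instead of going directly. Commonly cited figures are around 1-3ms per hop, though the real number depends on payload size, enabled features, and load. That is usually negligible for a single call, but it compounds across deep call chains.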
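
A quick back-of-the-envelope calculation shows the compounding. It assumes "per hop" means per service-to-service call and that the calls happen one after another; added_latency_ms is a hypothetical helper and the chain depth of 5 is an arbitrary example.

```python
def added_latency_ms(sequential_calls, per_call_ms):
    """Extra latency when each sequential call pays the sidecar cost once."""
    return sequential_calls * per_call_ms


# A request that fans through 5 services in sequence.
low = added_latency_ms(sequential_calls=5, per_call_ms=1)
high = added_latency_ms(sequential_calls=5, per_call_ms=3)
print(f"mesh overhead for a 5-call chain: {low}-{high} ms")
```

Calls made in parallel do not add up this way; only the longest sequential path through the call graph matters.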

Operational complexity: Istio itself is a complex distributed system. Debugging "why is this request slow" now involves understanding sidecar configuration, not just application code.

The honest guidance: Service meshes make sense when you have dozens to hundreds of services and the cross-cutting concerns (mTLS, retries, observability) are genuinely painful to implement per-service. For 5-10 services, the operational cost of running Istio often exceeds the benefit — application-level libraries (like Resilience4j for circuit breaking, covered in Topic 18) may be simpler.

Interview Scenario: "Client-Side vs Server-Side Discovery — Which Would You Choose?"

"It depends on the team's technology diversity and operational maturity. If the organization is running Kubernetes, server-side discovery via Kubernetes Services is essentially free — DNS-based, language-agnostic, and requires zero application code. I'd default to that.

Client-side discovery (like Eureka) made more sense in the pre-Kubernetes era, or in environments without a unified orchestration platform, because it avoids the extra network hop through a load balancer. But it requires every service — in every language — to integrate a discovery client, which becomes a maintenance burden in polyglot environments.

For the broader cross-cutting concerns — retries, circuit breaking, mTLS — I'd evaluate whether a service mesh is justified by the number of services. Below ~10-15 services, I'd handle these concerns with application-level libraries. Beyond that, the consistency and language-agnostic benefits of a service mesh like Istio typically outweigh its operational and latency overhead."

Key Takeaways

Service discovery solves the problem of finding service instances in a dynamic environment where IPs change constantly.
Client-side discovery (Eureka): caller queries registry, load-balances itself. No extra hop, but requires discovery libraries per language.
Server-side discovery (AWS ALB, Kubernetes Services): caller hits a fixed name, infrastructure routes. Language-agnostic, adds one hop.
Kubernetes provides server-side discovery via DNS automatically — a major reason for its dominance.
Service mesh (Istio/Envoy) moves cross-cutting concerns — mTLS, retries, circuit breaking, tracing, canary routing — into sidecar proxies, out of application code.
API Gateway handles North-South traffic (external → internal). Service Mesh handles East-West traffic (internal → internal). They're complementary.
Service meshes add real overhead (resources, latency, complexity) — justify their use by service count and operational pain, not by trend-following.

What's Next

Topic 18 closes Day 6 with Fault Tolerance Patterns — Circuit Breakers, Retries with Exponential Backoff and Jitter, Bulkheads, and Timeouts. The patterns that determine whether a single failing service takes down your entire platform, or fails gracefully and recovers on its own.

Tags: system-design microservices service-mesh kubernetes backend distributed-systems interview-prep
