DEV Community

Gabriel Anhaia
Service Discovery in 2026: Consul, etcd, and Kubernetes — Which Wins When


A pattern you see often: a platform team spends six weeks moving from a homegrown Consul setup to "just use Kubernetes services." Two weeks after the cutover, on-call gets paged because a batch job in a separate VPC can no longer find the billing API. The cluster DNS only resolves inside the cluster, and the legacy worker fleet on EC2 had been quietly relying on Consul to render its config.

That story is the entire shape of service discovery in 2026. Three tools dominate. None is universally right. Each has a specific shape of pain.

This post is the cheat sheet for that exact shape of migration: Consul, etcd, and Kubernetes-native DNS, scored on the questions that actually decide which one you ship.

What service discovery has to do

Strip the marketing and four jobs are left.

  1. Register. When an instance comes up, it appears in some catalog.
  2. Health check. When an instance dies, it stops appearing in the catalog quickly enough that callers do not pile up against a corpse.
  3. Resolve. A caller asks "where is payments?" and gets back a list of healthy endpoints.
  4. Survive failure of the discovery layer itself. Clients keep working with stale data when the registry is unreachable, and the registry recovers without operator surgery.

Every comparison below maps back to those four. If a tool is great at three and your blast radius lives in the fourth, you have a bad fit.

Consul: the multi-datacenter answer

HashiCorp Consul is the only tool in this list designed from day one to be a service discovery system. Agents run on every node, talk gossip among themselves, and forward registrations to a Raft-replicated server cluster. Health checks are first-class (HTTP, TCP, gRPC, script, and TTL) and unhealthy instances drop out of the catalog without a control loop having to notice (Consul docs).

The reason teams pay for Consul's operational weight is two features the others do not have:

  • Multi-datacenter federation. Each DC runs its own server cluster; WAN gossip ties them together. A service in dc1 can resolve payments.service.dc2.consul natively. No federated control plane to glue together yourself.
  • ACLs and SPIFFE identity. Every service gets an identity, every call can be checked against a policy, and Connect (the built-in mesh) issues mTLS certs from those identities.
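That cross-DC lookup goes through the local agent's DNS interface (port 8600 by default). A minimal Go sketch of resolving a service in another datacenter, assuming a federated agent is running locally; the service name is the hypothetical `payments` from this post:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// consulResolver returns a net.Resolver that sends DNS queries to the
// local Consul agent's DNS endpoint instead of the system resolver.
func consulResolver(agentDNS string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, agentDNS)
		},
	}
}

func main() {
	r := consulResolver("127.0.0.1:8600") // Consul's default DNS port

	// Resolve a service registered in dc2 via WAN federation.
	addrs, err := r.LookupHost(context.Background(), "payments.service.dc2.consul")
	if err != nil {
		fmt.Println("lookup failed (is a federated agent running?):", err)
		return
	}
	fmt.Println("healthy payments instances in dc2:", addrs)
}
```

In production you would usually forward the `.consul` zone from CoreDNS or dnsmasq to the agent instead of building a custom resolver per client, but the wire-level flow is the same.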

Where Consul costs you. The agent-on-every-node model means you operate a daemon you did not write across your fleet. ACL bootstrap is a documented sharp edge, and token rotation is the recurring outage story in Consul shops. And Consul Connect is a real service mesh; if you adopt it, you adopt a sidecar proxy and the additional per-hop latency that comes with it.

The minimum-viable Consul registration looks like this from a Go health server.

package main

import (
    "fmt"
    "log"
    "net/http"

    capi "github.com/hashicorp/consul/api"
)

func main() {
    // Talk to the local Consul agent (default: 127.0.0.1:8500).
    cli, err := capi.NewClient(capi.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // Register this instance with a stable ID and an HTTP health check.
    reg := &capi.AgentServiceRegistration{
        ID:      "payments-1",
        Name:    "payments",
        Port:    8080,
        Address: "10.0.1.42",
        Check: &capi.AgentServiceCheck{
            HTTP:     "http://10.0.1.42:8080/healthz",
            Interval: "5s",
            Timeout:  "2s",
        },
    }
    if err := cli.Agent().ServiceRegister(reg); err != nil {
        log.Fatal(err)
    }

    // Serve the endpoint the agent polls.
    http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintln(w, "ok")
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Three things are happening. The service is registered with a stable ID. The agent is told how to check it (HTTP every 5 seconds, with a 2-second timeout per check). And the resolver inside the cluster, DNS at payments.service.consul or HTTP at /v1/health/service/payments?passing=true, will only return this instance while the check is green.
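On the caller side, that same health endpoint can be consumed with nothing but the standard library. A sketch, assuming an agent on the default 127.0.0.1:8500; the struct decodes only the subset of the response this caller needs:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// healthEntry decodes only the fields we need from each element of
// the /v1/health/service/<name>?passing=true response.
type healthEntry struct {
	Service struct {
		Address string
		Port    int
	}
}

// parseHealthy turns the health API's JSON body into "addr:port" strings.
func parseHealthy(body []byte) ([]string, error) {
	var entries []healthEntry
	if err := json.Unmarshal(body, &entries); err != nil {
		return nil, err
	}
	addrs := make([]string, 0, len(entries))
	for _, e := range entries {
		addrs = append(addrs, fmt.Sprintf("%s:%d", e.Service.Address, e.Service.Port))
	}
	return addrs, nil
}

func main() {
	resp, err := http.Get("http://127.0.0.1:8500/v1/health/service/payments?passing=true")
	if err != nil {
		fmt.Println("query failed (is an agent running?):", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	addrs, err := parseHealthy(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("healthy payments instances:", addrs)
}
```

Real callers should use the official client with blocking queries rather than polling like this, but the response shape is the same either way.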

Pick Consul when you have mixed workloads (Kubernetes plus legacy VMs plus bare metal), multiple datacenters that must talk, or a regulatory story that needs SPIFFE-grade identity on every call. Skip it when you are 100% on Kubernetes inside a single region. You are buying a second control plane to replace one you already have.

etcd: the wrong question for most teams

etcd is the registry that backs Kubernetes, but it is not a service discovery tool you should be reaching for directly in 2026. Treat it as a building block: a Raft-consistent key-value store with a watch primitive. If you want service discovery on top of etcd, you write it yourself.

What etcd actually gives you. Linearizable reads. A watch API that streams key changes to subscribers, which is the real reason it underpins Kubernetes' controllers. Lease-based TTLs that let a node take a lease, attach its registration key to the lease, and have the key disappear when the lease times out. That last pattern is the closest etcd gets to a built-in registration model, but you are still wiring up health checks, change notifications to clients, and load distribution yourself.

The honest read: if you are choosing between Consul and etcd for service discovery, the answer is Consul. If you are choosing between etcd and "use Kubernetes services," the answer is Kubernetes services. The only good direct-use case for etcd is when you are building infrastructure that needs a strongly-consistent KV store and you happen to also want to register things in it. Kubernetes itself is the canonical example.

The trap I keep seeing: a team reads an old blog post about a hyperscaler using etcd for service discovery, copies the pattern, and rebuilds a worse Consul over six months. Consul exists. The work is done.

Kubernetes-native: DNS plus endpoints, almost free

If your services live entirely inside one or more Kubernetes clusters, you already have service discovery and you should not bolt anything on top of it without a reason.

The flow is boring on purpose. You run pods behind a Service. The kubelet's readiness probe gates each pod into the EndpointSlice for that Service. CoreDNS resolves payments.default.svc.cluster.local into a stable ClusterIP, and kube-proxy (or the eBPF data plane your CNI ships, typically Cilium) load-balances to the EndpointSlice members. When a pod fails its readiness probe, it drops out of the EndpointSlice within seconds.

A working registration in Kubernetes is a Deployment with a probe and a Service in front of it.

apiVersion: v1
kind: Service
metadata:
  name: payments
spec:
  selector:
    app: payments
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
spec:
  replicas: 3
  selector:
    matchLabels: { app: payments }
  template:
    metadata:
      labels: { app: payments }
    spec:
      containers:
      - name: payments
        image: payments:1.4.2
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 5
          failureThreshold: 2

That is the entire registration story. No SDK, no agent on the node, no tokens to rotate. Callers do http://payments/ and the cluster DNS does the rest. Headless services give you per-pod DNS records when you need direct addressing for stateful systems.

Where Kubernetes-native service discovery falls down:

  • Cross-cluster. Out of the box, payments.default.svc.cluster.local only resolves inside the cluster. Multi-cluster service discovery means MCS, Submariner, Cilium ClusterMesh, or a Consul-style overlay. The standard is settling but not yet boring.
  • Off-cluster consumers. A worker fleet on EC2 cannot resolve cluster DNS without help. ExternalDNS plus a load balancer is the usual answer; it works, but you are now paying for two layers of indirection.
  • Stale endpoints during rolling failures. Endpoint propagation is fast but not instant (EndpointSlice termination conditions, KEP-1672). With aggressive client-side connection pooling, a caller can hold a TCP connection to a pod that already fell out of the EndpointSlice. The mitigation (connection-aging, client-side load balancing, an L7 mesh) is real engineering work.

Pick Kubernetes-native when you are inside one or a small number of homogeneous clusters and the boundary of "what speaks the protocol" is the cluster itself. The cost is essentially zero. The fit is excellent.

How to actually choose

Five questions, in order, and the answer falls out.

  1. Are you fully inside Kubernetes, in one or two clusters? Use Service + DNS + readiness probes. Done.
  2. Do you have VMs, bare metal, or another orchestrator alongside K8s? Consul. The agent-on-every-node model is exactly designed for this.
  3. Do you span multiple datacenters that need to discover each other? Consul with WAN federation, or a service mesh with multi-cluster support layered on top.
  4. Do you need identity-based authorization on every service call (regulatory or zero-trust)? Consul Connect or a Kubernetes-native mesh (Istio, Linkerd). The choice depends on which side of the K8s line your workloads live.
  5. Are you tempted to "just use etcd"? Reread questions 1–4. The answer is one of those tools.

Most teams in the shape of the opener eventually land on Kubernetes-native discovery inside the cluster, plus a small Consul cluster federating the EC2 worker fleet, with Consul on the EC2 side and ExternalDNS bridging the K8s side. Two tools, each doing what it is good at, both honest about their boundaries. That is the 2026 answer for most teams that have been around long enough to have an EC2 worker fleet at all.

If this was useful

If you keep landing on the same handful of system-design questions during interviews, design reviews, or migration planning, the System Design Pocket Guide: Fundamentals covers the building blocks behind decisions like this one (service discovery, load balancing, partitioning, replication) at the level where you can actually reason about tradeoffs instead of repeating vendor copy.

