DEV Community

Petr Petrenko

We replaced etcd with Google Cloud Spanner. Here's what happened.

spanner-etcd is an open source (Apache 2.0), drop-in etcd v3 replacement backed by Google Cloud Spanner. Same API, no client changes — just point --etcd-servers at it.

We built it because etcd has a fundamental scaling constraint: every write serializes through a single global revision counter. One row, one lock, every transaction waits in line. At 32 concurrent writers, that counter becomes the bottleneck.

The GKE team solved this years ago internally to scale Kubernetes to 65,000 nodes. Their implementation is closed. So we built an open one.

This is the story of how it works, what we got wrong, and the honest benchmark numbers.


The core idea: timestamps as revisions

etcd's revision is a monotonically increasing integer. Every write increments it. That increment is the serialization point.

Spanner has PENDING_COMMIT_TIMESTAMP() — a TrueTime-based timestamp assigned at commit time, globally unique, strictly monotonic across all transactions. No counter. No lock. Each transaction commits independently.

So instead of:

UPDATE kv_rev SET rev = rev + 1 WHERE id = 1;  -- everyone waits here
INSERT INTO kv (rev, key, value) VALUES (42, '/foo', 'bar');

We do:

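-- Spanner only accepts PENDING_COMMIT_TIMESTAMP() in a TIMESTAMP column
-- declared with allow_commit_timestamp = true.
INSERT INTO kv (rev, key, value)
VALUES (PENDING_COMMIT_TIMESTAMP(), '/foo', 'bar');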

The revision is the commit timestamp, cast to int64 UnixNano. Valid etcd ModRevision. Zero contention.
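To make the timestamp-to-revision mapping concrete, here is a minimal Python sketch. The function names are hypothetical and not taken from spanner-etcd, and it assumes the commit timestamp arrives as a timezone-aware datetime with microsecond resolution. It only illustrates the conversion and the ordering property, not the project's actual code.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def revision_from_commit_timestamp(ts: datetime) -> int:
    """Map a timezone-aware commit timestamp to an int64 UnixNano revision."""
    if ts.tzinfo is None:
        raise ValueError("commit timestamp must be timezone-aware")
    # Integer division of timedeltas avoids the float rounding of ts.timestamp().
    micros = (ts - EPOCH) // timedelta(microseconds=1)
    return micros * 1_000

def commit_timestamp_from_revision(rev: int) -> datetime:
    """Inverse mapping (hypothetical helper): revision back to a UTC timestamp."""
    return EPOCH + timedelta(microseconds=rev // 1_000)

# Two commits one microsecond apart yield strictly increasing revisions.
first = datetime(2024, 1, 1, 0, 0, 0, 1, tzinfo=timezone.utc)
second = first + timedelta(microseconds=1)
assert revision_from_commit_timestamp(first) < revision_from_commit_timestamp(second)
```

Because the mapping is order-preserving, comparing two revisions is equivalent to comparing the two commit timestamps they came from.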

Result: at ×32 concurrency, write throughput reached 673 ops/sec, 15× the integer counter baseline (which implies a baseline of roughly 45 ops/sec).

PCT vs integer counter: serialized writes waiting in line vs parallel commits with PENDING_COMMIT_TIMESTAMP


Watch events via Change Streams

etcd Watch is a streaming API — clients subscribe to a prefix and receive events as writes happen. In a vanilla etcd replacement you'd poll. We tried that first: it worked, but ~1s latency felt wrong.

Spanner has Change Streams — a push-based CDC mechanism that delivers row changes within tens of milliseconds. We built a partition reader that:

  1. Starts streaming all partitions of kv_changes
  2. Persists partition cursors to Spanner every 5s (so replicas resume correctly after restart)
  3. Falls back to 1-second polling on the emulator (Change Streams aren't supported there)
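The cursor-persistence step in the list above can be sketched as follows. This is a simplified in-memory model, not the spanner-etcd implementation: the class name, the persist callback and the injectable clock are all assumptions made for illustration, and a real reader would write the cursors to a Spanner table instead of a Python list.

```python
import time
from typing import Callable, Dict

class CursorCheckpointer:
    """Tracks the last processed revision per partition and flushes periodically."""

    def __init__(self, persist: Callable[[Dict[str, int]], None],
                 interval_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._persist = persist          # hypothetical sink, e.g. a Spanner write
        self._interval_s = interval_s
        self._clock = clock
        self._cursors: Dict[str, int] = {}
        self._last_flush = clock()
        self._dirty = False

    def advance(self, partition: str, revision: int) -> None:
        """Record progress for a partition; a cursor never moves backwards."""
        if revision > self._cursors.get(partition, -1):
            self._cursors[partition] = revision
            self._dirty = True
        self.maybe_flush()

    def maybe_flush(self) -> bool:
        """Persist cursors if the interval has elapsed and something changed."""
        now = self._clock()
        if not self._dirty or now - self._last_flush < self._interval_s:
            return False
        self._persist(dict(self._cursors))
        self._last_flush = now
        self._dirty = False
        return True

# Usage with a fake clock so the 5-second interval can be stepped manually.
saved = []
now = [0.0]
cp = CursorCheckpointer(saved.append, interval_s=5.0, clock=lambda: now[0])
cp.advance("p0", 100)   # within the interval: buffered only
now[0] = 6.0
cp.advance("p0", 200)   # interval elapsed: cursors are persisted
```

One consequence of checkpointing on an interval: a reader that resumes from the persisted cursor would re-read up to one interval of changes, so the consumer has to tolerate or filter duplicates.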

The result: ~30ms Watch latency end-to-end on production Spanner in the same region. Not etcd's 1ms — Spanner is not a local in-memory store. But for Kubernetes workloads it's completely fine.


The back-join problem we almost missed

Early benchmarks showed Get at 71 ops/sec. Seemed reasonable. Then we looked at the query plan.

Our schema used PRIMARY KEY (id) with a bit_reversed_positive sequence — standard Spanner advice to avoid write hotspots. The secondary index kv_key_rev ON kv(key, rev DESC) existed for reads. But Spanner was doing this for every Get:

  1. Index scan kv_key_rev → find the row's id
  2. Table lookup on kv by id → fetch the actual data

Two lookups inside one query: an index scan plus a back-join to the base table. The fix was a single DDL change, recreating the index with a STORING clause:

CREATE INDEX kv_key_rev ON kv (key, rev DESC)
  STORING (value, old_value, lease_id, deleted, created,
           create_revision, prev_revision);

STORING copies all needed columns into the index. Spanner can now serve reads entirely from the index — no back-join.
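A toy model can make the difference visible. The Python sketch below is not Spanner code: the Table class and both lookup functions are hypothetical, and the "index" is just a list of tuples. It only counts how many base-table reads each strategy needs to serve the latest revision of a key.

```python
from typing import Dict, List, Tuple

class Table:
    """Toy base table keyed by id that counts base-table reads (back-joins)."""

    def __init__(self, rows: Dict[int, dict]) -> None:
        self.rows = rows
        self.lookups = 0

    def get(self, row_id: int) -> dict:
        self.lookups += 1
        return self.rows[row_id]

def latest_with_back_join(index: List[Tuple[str, int, int]], table: Table, key: str) -> dict:
    """Index entries are (key, rev, id): find the newest id, then read the table."""
    entries = [e for e in index if e[0] == key]
    _, _, row_id = max(entries, key=lambda e: e[1])
    return table.get(row_id)

def latest_from_covering_index(index: List[Tuple[str, int, dict]], key: str) -> dict:
    """Index entries are (key, rev, stored columns): no table read is needed."""
    entries = [e for e in index if e[0] == key]
    _, _, stored = max(entries, key=lambda e: e[1])
    return stored

table = Table({
    7: {"key": "/foo", "rev": 1, "value": b"a"},
    3: {"key": "/foo", "rev": 2, "value": b"b"},
})
plain_index = [(r["key"], r["rev"], rid) for rid, r in table.rows.items()]
covering_index = [(r["key"], r["rev"], {"value": r["value"]}) for r in table.rows.values()]

row = latest_with_back_join(plain_index, table, "/foo")       # one base-table read
stored = latest_from_covering_index(covering_index, "/foo")   # zero base-table reads
```

Both calls return the same value; only the first one touches the base table.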

Get improved +40%. Mixed workload improved +167%. Measured before and after on the same hardware.

We also added kv_rev_desc ON kv(rev DESC) so CurrentRevision() does an O(1) LIMIT 1 seek instead of a full MAX(rev) scan.

One caveat: STORING value where value is BYTES(MAX) doubles write amplification for large values. For Kubernetes workloads (mostly small JSON/protobuf objects) this is fine. For blob storage it would be a problem.


Stateless replicas

This is the part that feels almost too simple. Because all state lives in Spanner, every replica is completely stateless. No consensus. No leader election between replicas. No split-brain.

Stateless architecture: LoadBalancer routes to multiple spanner-etcd replicas, all reading and writing to a single Google Cloud Spanner instance

We tested this explicitly: Watch on replica 2, writes through replica 1, then killed replica 1. Replica 2 received all events — before and after the kill — with zero gaps. 45 Watch streams migrated in ~10s. Kubernetes didn't notice.

The only state a replica tracks is its Change Stream cursor, which is persisted to Spanner itself and recovered on restart.


Real numbers

Everything below is production Spanner (regional-us-central1, 1000 PU), same-region e2-standard-4 VM, not the emulator.

Throughput:

Operation             ops/sec   Latency
Create ×1                  90   11.1ms
Create ×4 parallel        270   3.7ms
Get ×1                    108   9.3ms
Get ×4 parallel           481   2.1ms
Mixed ×4 (70% read)       403   2.5ms
Watch latency               -   ~30ms
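A note on reading the Latency column: its values match the inverse of throughput (1000 / ops/sec), which is the mean time between completed operations. For the ×4 rows that is not the same as the time a single request takes. The sketch below reproduces the column and adds a Little's law estimate of per-request latency. The second figure is an assumption on my part, valid only if each of the 4 workers keeps exactly one request in flight.

```python
# Throughput figures copied from the table above: (ops/sec, concurrent workers).
throughput = {
    "Create x1": (90, 1),
    "Create x4 parallel": (270, 4),
    "Get x1": (108, 1),
    "Get x4 parallel": (481, 4),
    "Mixed x4 (70% read)": (403, 4),
}

def ms_per_op(ops_per_sec: float) -> float:
    """Mean time between completed operations: the inverse of throughput."""
    return round(1000.0 / ops_per_sec, 1)

def ms_per_request(ops_per_sec: float, concurrency: int) -> float:
    """Little's law estimate of per-request latency: W = L / lambda."""
    return round(1000.0 * concurrency / ops_per_sec, 1)

for name, (ops, workers) in throughput.items():
    print(f"{name:22s} {ms_per_op(ops):5.1f} ms/op  "
          f"{ms_per_request(ops, workers):5.1f} ms/request")
```

Under that assumption, a single Create at ×4 concurrency takes closer to 15ms than 3.7ms, which is in the same range as the ×1 figure.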

How many Spanner PUs do you actually need?

We benchmarked at 100, 1000, and 2000 PU on us-central1:

Operation (ops/sec)   100 PU   1000 PU   2000 PU
Create ×4 parallel        87       270       255
Get ×4 parallel          472       481       469
Mixed ×4                 294       403       404
Watch latency           29ms     ~30ms      30ms

Interesting findings:

  • Single-key ops are nearly identical across 1000 and 2000 PU — you're paying for network round-trip, not Spanner compute (CPU was ~1% during benchmarks)
  • Parallel writes fall off sharply at 100 PU — Create ×4 drops from 270 to 87 ops/sec
  • Watch latency is consistent at ~30ms across all tiers
  • 100 PU is enough for small clusters (< 100 nodes with moderate write rates)

Multi-region: what does global durability actually cost?

We ran one more experiment. We switched to nam6 (Iowa, South Carolina, Oregon, Los Angeles) and benchmarked from both Iowa and South Carolina.

The Spanner leader lives in Iowa. So writes from Iowa replicate synchronously to South Carolina before committing. Writes from South Carolina travel to Iowa, get committed, then come back.

Multi-region write cost: writes from Iowa leader replicate synchronously to South Carolina (~40ms penalty), writes from South Carolina travel round-trip to Iowa leader (~80ms)

Operation (ops/sec)   Regional Iowa   nam6 Iowa   nam6 S.Carolina   nam6 S.Carolina + DR
Create ×1                        90          53                11                     11
Create ×4 parallel              270         203                45                     45
Get ×1                          108         116                14                     16
Get ×4 parallel                 481         577                60                     64
Mixed ×4                        403         327                49                     53
Watch latency                 ~30ms        42ms             131ms                  196ms

DR = --spanner-read-location=us-east1 — directed reads to the local South Carolina replica.

What this tells you:

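Writes from Iowa get 25-40% slower (Create ×1 drops from 90 to 53 ops/sec, Create ×4 from 270 to 203): that's the cost of synchronous replication to South Carolina. From South Carolina, writes are 6-8× slower than regional, because each write crosses the link between South Carolina and Iowa twice: once to reach the leader and once to come back.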

Directed reads improve read throughput from South Carolina by 7-14% (Get ×1 goes from 14 to 16 ops/sec, Get ×4 from 60 to 64), because reads go to the local replica instead of Iowa. The improvement in the mixed workload is modest because writes still dominate it, and Watch latency actually gets worse, from 131ms to 196ms (Change Stream cursors still follow the leader path).
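The percentages in this section can be recomputed directly from the multi-region table. The short Python sketch below does that. The dictionary keys and helper names are my own labels for the table columns, not identifiers from the project.

```python
# Throughput in ops/sec, copied from the multi-region table.
results = {
    "Create x1": {"regional": 90, "nam6_iowa": 53, "nam6_sc": 11, "nam6_sc_dr": 11},
    "Create x4": {"regional": 270, "nam6_iowa": 203, "nam6_sc": 45, "nam6_sc_dr": 45},
    "Get x1": {"regional": 108, "nam6_iowa": 116, "nam6_sc": 14, "nam6_sc_dr": 16},
    "Get x4": {"regional": 481, "nam6_iowa": 577, "nam6_sc": 60, "nam6_sc_dr": 64},
    "Mixed x4": {"regional": 403, "nam6_iowa": 327, "nam6_sc": 49, "nam6_sc_dr": 53},
}

def pct_change(before: float, after: float) -> float:
    """Relative change in throughput, in percent."""
    return round(100.0 * (after - before) / before, 1)

def slowdown(before: float, after: float) -> float:
    """How many times lower the throughput became."""
    return round(before / after, 1)

for op, r in results.items():
    print(f"{op:10s} "
          f"Iowa nam6 vs regional: {pct_change(r['regional'], r['nam6_iowa']):+6.1f}%  "
          f"S.Carolina slowdown: {slowdown(r['regional'], r['nam6_sc']):4.1f}x  "
          f"directed reads: {pct_change(r['nam6_sc'], r['nam6_sc_dr']):+5.1f}%")
```

Directed reads leave the Create rows unchanged, which fits the explanation that writes always travel to the leader.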

The practical conclusion: put your spanner-etcd replicas in the same region as the Spanner leader. If you need RPO=0 and must run replicas far from the leader, use --spanner-read-location to at least get reads locally. But writes will always pay the cross-region round-trip.


Kubernetes validation

We ran Kubernetes v1.33.12 (kubeadm, external etcd = spanner-etcd) for 24 hours straight:

  • Rolling deployments scaled 1–10 replicas every 2 minutes
  • ConfigMap churn every 3 minutes
  • cert-manager running concurrently
  • 57 active Watch streams throughout

Results: zero crashes, zero data loss, zero unimplemented errors. The Kubernetes node stayed Ready the entire time.

We also tested with 22 production Java/Kotlin microservices (Vert.x + jetcd) on GKE. Auth token expiry, pod kill, Watch stream migration — all clean.


What we didn't build (and why)

Auth RBAC (UserAdd, RoleAdd, GrantPermission) — Kubernetes doesn't use it. The API server manages its own RBAC. We implement Authenticate (username/password → token) because kubeadm requires it, but the full RBAC surface isn't needed.

Defrag / Snapshot — Spanner manages storage automatically. These operations don't have a meaningful equivalent.

Sub-10ms Watch latency — if you need this, spanner-etcd is the wrong tool. Change Streams have inherent latency. For most Kubernetes operations this doesn't matter — the API server isn't latency-sensitive to etcd Watch at the millisecond level.


What surprised us

The covering index made a bigger difference than we expected. We assumed the write bottleneck removal would be the headline, and it was. But the read path optimization raised Get throughput by 40% and nearly tripled mixed workload numbers (+167%). Sometimes the boring infrastructure work matters more than the clever architectural idea.

100 PU is genuinely enough for small clusters. We expected a linear relationship between PU and performance. Instead we found that network latency dominates and Spanner CPU is barely touched. The PU floor matters for parallel writes (Create ×4 drops from 270 to 87 ops/sec at 100 PU), but a small cluster doesn't need 1000 PU.

Non-leader regions are expensive. Iowa→South Carolina adds ~80ms round-trip. In a multi-region setup, where you place your replicas relative to the Spanner leader matters a lot more than how many PUs you provision.


Try it

docker run --rm \
  -e SPANNER_DATABASE=projects/P/instances/I/databases/D \
  -p 2379:2379 \
  ghcr.io/n0rm4l-me/spanner-etcd:v0.1.0

Or with kubeadm:

# Fragment of a kubeadm ClusterConfiguration; merge it into your existing config.
etcd:
  external:
    endpoints:
      - http://spanner-etcd:2379

GitHub: github.com/n0rm4l-me/spanner-etcd

Top comments (1)

Iain Sinclair

Nice use case. My understanding is that using spanner to replace etcd is what allows GKE to scale to 65000 nodes (since GKE 1.31).