Solo

Posted on
I used Gossip Glomers to learn distributed systems from zero (and got humbled fast)

I started Fly.io’s Gossip Glomers because I wanted a practical way into distributed systems.

Books were useful, but I wasn’t feeling the problems. Gossip Glomers fixed that.

It gave me tiny problems that looked simple, then failed in very non-obvious ways.

I’m still early in this journey, but here are the lessons that finally clicked for me.


What I built

I solved the challenges in Go:

  • Echo
  • Broadcast
  • G-Counter
  • Unique IDs
  • Kafka-style log
  • Transactional key-value (eventually consistent sync)

My stack was intentionally boring: Go + Maelstrom’s Go node library + JSON handlers.


Aha #1: “Works locally” means nothing without retries + idempotency

Broadcast was my first real slap in the face.

My first thought was:

“Receive message → forward to neighbors → done.”

Then I realized:

  • messages can be duplicated
  • RPCs can fail
  • peers can miss updates
  • clients can race with propagation

This pattern was the turning point:

```go
mu.Lock()
// Linear-scan dedupe: have we seen this message before?
alreadySeen := false
for _, v := range message_list {
    if v == req.Message {
        alreadySeen = true
        break
    }
}
if !alreadySeen {
    message_list = append(message_list, req.Message)
}
mu.Unlock()

// Duplicates still get an ok reply, but are never rebroadcast.
if alreadySeen {
    return n.Reply(msg, BroadcastResponse{Type: "broadcast_ok"})
}
```

Why this mattered:

  • The alreadySeen check made rebroadcast safe.
  • I could retry RPCs without fear of corrupting state.
  • “At-least-once delivery” became manageable because handlers were idempotent.

That was my first real distributed systems instinct:

Retries are useless unless duplicate handling is correct.


Aha #2: CAS loops are the backbone of safe shared updates

In the G-Counter and Kafka-style log challenges, I used compare-and-swap (CAS) loops.

```go
for {
    // Read the current value; a missing key counts as zero.
    curr, err := kv.ReadInt(context.Background(), key)
    if keyMissing(err) {
        curr = 0
    } else if err != nil {
        return err
    }

    next := curr + req.Delta
    // Write succeeds only if the value is still what we read.
    err = kv.CompareAndSwap(context.Background(), key, curr, next, true)
    if err == nil {
        break
    }
    // CAS failed: another node updated the key; loop and retry.
}
```

Why it works:

  • Read current value
  • Compute new value
  • Write only if nobody changed it since your read
  • Retry if there was contention

This taught me something I’d only heard before:

Concurrency bugs are not fixed by optimism; they’re fixed by atomicity + retry.


Aha #3: topology is not an implementation detail

I used neighbor forwarding in broadcast and skipped sending back to the source.

Even that one small decision noticeably reduces message noise.

The tradeoff became obvious:

  • more fanout → faster propagation, more network traffic
  • less fanout → cheaper traffic, more staleness risk

Before this challenge, topology felt theoretical.

Now it feels like a direct lever on latency and cost.


Aha #4: consistency model changes everything you’re allowed to do

In my txn challenge, I used local writes + periodic state sync:

```go
for _, txn := range req.Txn {
    // Each op arrives as ["r"/"w", key, value]; JSON numbers decode as float64.
    op, key := txn[0].(string), int(txn[1].(float64))
    switch op {
    case "r":
        txn[2] = readLocal(store, key) // fill in the read result in place
    case "w":
        store[key] = int(txn[2].(float64))
    }
}
```

And sync:

```go
for k, v := range req.State {
    // Max-wins merge: keep whichever value is larger.
    if currVal, exists := store[k]; !exists || v > currVal {
        store[k] = v
    }
}
```

This is great for availability and eventual convergence, but it’s not strict serializable behavior.

And that’s the lesson: your merge strategy defines your guarantees.

I used to treat consistency labels as abstract terms.

Now I see them as implementation consequences.


Things I messed up (so you don’t have to)

  • I underestimated how often duplicate messages show up.
  • I initially treated network failures like exceptional cases, not normal flow.
  • I used a slice for dedupe in broadcast (fine early, but not ideal at scale).
  • I learned the hard way that “read then write” without CAS is a race factory.
  • I replied too early in some flows before thinking through visibility/staleness.

What I’d improve next

  • Replace linear dedupe scan with a map[int]struct{} in broadcast.
  • Add bounded retry/backoff instead of hot retry loops.
  • Make txn merge semantics explicit (version vectors / timestamps / CRDT-style merge depending on workload).
  • Capture and compare Maelstrom result artifacts more systematically between iterations.

Why this challenge was perfect for a beginner like me

Gossip Glomers gave me small, runnable problems where each “tiny” bug taught a core distributed systems rule.

Not by theory first.

By breaking first.

That worked really well for me.

If you’ve done Gossip Glomers too:

which challenge changed how you think the most — broadcast, counters, or txn?
