Solo

Posted on
I used Gossip Glomers to learn distributed systems from zero (and got humbled fast)

I started Fly.io’s Gossip Glomers because I wanted a practical way into distributed systems.

Books were useful, but I wasn’t feeling the problems. Gossip Glomers fixed that.

It gave me tiny problems that looked simple, then failed in very non-obvious ways.

I’m still early in this journey, but here are the lessons that finally clicked for me.


What I built

I solved the challenges in Go:

  • Echo
  • Broadcast
  • G-Counter
  • Unique IDs
  • Kafka-style log
  • Transactional key-value (eventually consistent sync)

My stack was intentionally boring: Go + Maelstrom’s Go node library + JSON handlers.


Aha #1: “Works locally” means nothing without retries + idempotency

Broadcast was my first real slap in the face.

My first thought was:

“Receive message → forward to neighbors → done.”

Then I realized:

  • messages can be duplicated
  • RPCs can fail
  • peers can miss updates
  • clients can race with propagation

This pattern was the turning point:

```go
mu.Lock()
// Linear-scan dedupe: have we seen this message before?
alreadySeen := false
for _, v := range message_list {
    if v == req.Message {
        alreadySeen = true
        break
    }
}
if !alreadySeen {
    message_list = append(message_list, req.Message)
}
mu.Unlock()

// Duplicates still get an ok reply, but are never rebroadcast.
if alreadySeen {
    return n.Reply(msg, BroadcastResponse{Type: "broadcast_ok"})
}
```

Why this mattered:

  • The alreadySeen check made rebroadcast safe.
  • I could retry RPCs without fear of corrupting state.
  • “At-least-once delivery” became manageable because handlers were idempotent.

That was my first real distributed systems instinct:

Retries are useless unless duplicate handling is correct.


Aha #2: CAS loops are the backbone of safe shared updates

In the G-Counter and Kafka-style log challenges, I used compare-and-swap (CAS) loops.

```go
for {
    // Read the current value; a missing key counts as zero.
    curr, err := kv.ReadInt(context.Background(), key)
    if keyMissing(err) {
        curr = 0
    } else if err != nil {
        return err
    }

    next := curr + req.Delta
    // Write succeeds only if the value is still what we read.
    err = kv.CompareAndSwap(context.Background(), key, curr, next, true)
    if err == nil {
        break
    }
    // CAS failed: another node updated the key; loop and retry.
}
```

Why it works:

  • Read current value
  • Compute new value
  • Write only if nobody changed it since your read
  • Retry if there was contention

This taught me something I’d only heard before:

Concurrency bugs are not fixed by optimism; they’re fixed by atomicity + retry.


Aha #3: topology is not an implementation detail

I used neighbor forwarding in broadcast and skipped sending back to the source.

Even that one small decision noticeably reduces message noise.

The tradeoff became obvious:

  • more fanout → faster propagation, more network traffic
  • less fanout → cheaper traffic, more staleness risk

Before this challenge, topology felt theoretical.

Now it feels like a direct lever on latency and cost.


Aha #4: consistency model changes everything you’re allowed to do

In my txn challenge, I used local writes + periodic state sync:

```go
for _, txn := range req.Txn {
    // Each op arrives as ["r"/"w", key, value]; JSON numbers decode as float64.
    op, key := txn[0].(string), int(txn[1].(float64))
    switch op {
    case "r":
        txn[2] = readLocal(store, key) // fill in the read result in place
    case "w":
        store[key] = int(txn[2].(float64))
    }
}
```

And sync:

```go
for k, v := range req.State {
    // Max-wins merge: keep whichever value is larger.
    if currVal, exists := store[k]; !exists || v > currVal {
        store[k] = v
    }
}
```

This is great for availability and eventual convergence, but it’s not strict serializable behavior.

And that’s the lesson: your merge strategy defines your guarantees.

I used to treat consistency labels as abstract terms.

Now I see them as implementation consequences.


Things I messed up (so you don’t have to)

  • I underestimated how often duplicate messages show up.
  • I initially treated network failures like exceptional cases, not normal flow.
  • I used a slice for dedupe in broadcast (fine early, but not ideal at scale).
  • I learned the hard way that “read then write” without CAS is a race factory.
  • I replied too early in some flows before thinking through visibility/staleness.

What I’d improve next

  • Replace linear dedupe scan with a map[int]struct{} in broadcast.
  • Add bounded retry/backoff instead of hot retry loops.
  • Make txn merge semantics explicit (version vectors / timestamps / CRDT-style merge depending on workload).
  • Capture and compare Maelstrom result artifacts more systematically between iterations.

Why this challenge was perfect for a beginner like me

Gossip Glomers gave me small, runnable problems where each “tiny” bug taught a core distributed systems rule.

Not by theory first.

By breaking first.

That worked really well for me.

If you’ve done Gossip Glomers too:

which challenge changed how you think the most — broadcast, counters, or txn?
