I started Fly.io’s Gossip Glomers because I wanted a practical way into distributed systems.
Books were useful, but I wasn’t feeling the problems. Gossip Glomers fixed that.
It gave me tiny problems that looked simple, then failed in very non-obvious ways.
I’m still early in this journey, but here are the lessons that finally clicked for me.
What I built
I solved the challenges in Go:
- Echo
- Broadcast
- G-Counter
- Unique IDs
- Kafka-style log
- Transactional key-value (eventually consistent sync)
My stack was intentionally boring: Go + Maelstrom’s Go node library + JSON handlers.
Code
Aha #1: “Works locally” means nothing without retries + idempotency
Broadcast was my first real slap in the face.
My first thought was:
“Receive message → forward to neighbors → done.”
Then I realized:
- messages can be duplicated
- RPCs can fail
- peers can miss updates
- clients can race with propagation
This pattern was the turning point:
```go
mu.Lock()
alreadySeen := false
for _, v := range message_list {
	if v == req.Message {
		alreadySeen = true
		break
	}
}
if !alreadySeen {
	message_list = append(message_list, req.Message)
}
mu.Unlock()

if alreadySeen {
	return n.Reply(msg, BroadcastResponse{Type: "broadcast_ok"})
}
```
Why this mattered:
- The `alreadySeen` check made rebroadcast safe.
- I could retry RPCs without fear of corrupting state.
- “At-least-once delivery” became manageable because handlers were idempotent.
That was my first real distributed systems instinct:
Retries are useless unless duplicate handling is correct.
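A minimal, self-contained sketch of that instinct, using a set for dedupe instead of the slice scan above (the `store`/`deliver` names are illustrative, not from my actual solution):

```go
package main

import (
	"fmt"
	"sync"
)

// store dedupes with a set, so delivering the same message twice is a no-op.
type store struct {
	mu       sync.Mutex
	seen     map[int]struct{}
	messages []int
}

// deliver returns true only the first time a message is seen, so callers
// can retry delivery as often as they like without corrupting state.
func (s *store) deliver(msg int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[msg]; ok {
		return false // duplicate: safe to ack again, nothing to do
	}
	s.seen[msg] = struct{}{}
	s.messages = append(s.messages, msg)
	return true
}

func main() {
	s := &store{seen: map[int]struct{}{}}
	// Simulate at-least-once delivery: the sender retries msg 7.
	for _, m := range []int{7, 7, 9, 7} {
		s.deliver(m)
	}
	fmt.Println(s.messages) // [7 9]: duplicates collapsed
}
```

Because `deliver` is idempotent, the sender's retry policy and the handler's correctness stay decoupled.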
Aha #2: CAS loops are the backbone of safe shared updates
In G-Counter and Kafka-style log, I used compare-and-swap loops.
```go
for {
	curr, err := kv.ReadInt(context.Background(), key)
	if keyMissing(err) {
		curr = 0
	} else if err != nil {
		return err
	}

	next := curr + req.Delta
	err = kv.CompareAndSwap(context.Background(), key, curr, next, true)
	if err == nil {
		break
	}
}
```
Why it works:
- Read current value
- Compute new value
- Write only if nobody changed it since your read
- Retry if there was contention
This taught me something I’d only heard before:
Concurrency bugs are not fixed by optimism; they’re fixed by atomicity + retry.
Aha #3: topology is not an implementation detail
I used neighbor forwarding in broadcast and skipped sending back to the source.
Even that one small decision noticeably reduces message noise.
Tradeoff became obvious:
- more fanout → faster propagation, more network traffic
- less fanout → cheaper traffic, more staleness risk
Before this challenge, topology felt theoretical.
Now it feels like a direct lever on latency and cost.
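A back-of-the-envelope sketch of that lever, assuming an idealized tree-shaped topology where each informed node forwards to `fanout` previously uninformed peers per round (illustrative only; real meshes overlap):

```go
package main

import "fmt"

// rounds returns how many forwarding rounds it takes for a single
// originating node to reach at least n nodes when every informed node
// forwards to `fanout` fresh peers each round.
func rounds(n, fanout int) int {
	informed, r := 1, 0
	for informed < n {
		informed += informed * fanout
		r++
	}
	return r
}

func main() {
	// Higher fanout → fewer rounds (lower propagation latency).
	// In this idealized tree there are no duplicate deliveries; in a real
	// mesh, neighbors overlap, so higher fanout also means more duplicate
	// traffic — that is the cost side of the tradeoff.
	fmt.Println(rounds(25, 1)) // 5 rounds: 1→2→4→8→16→32
	fmt.Println(rounds(25, 4)) // 2 rounds: 1→5→25
}
```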
Aha #4: consistency model changes everything you’re allowed to do
In my txn challenge, I used local writes + periodic state sync:
```go
for _, txn := range req.Txn {
	op, key := txn[0].(string), int(txn[1].(float64))
	switch op {
	case "r":
		txn[2] = readLocal(store, key)
	case "w":
		store[key] = int(txn[2].(float64))
	}
}
```
And sync:
```go
for k, v := range req.State {
	if currVal, exists := store[k]; !exists || v > currVal {
		store[k] = v
	}
}
```
This is great for availability and eventual convergence, but it is nowhere near strict serializability.
And that’s the lesson: your merge strategy defines your guarantees.
I used to treat consistency labels as abstract terms.
Now I see them as implementation consequences.
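For instance, my max-merge above only converges because the values happen to be comparable integers; a last-writer-wins merge (one common alternative, sketched here with illustrative types) converges for arbitrary values at the cost of silently dropping concurrent writes:

```go
package main

import "fmt"

// entry pairs a value with a logical timestamp (e.g. a Lamport clock).
type entry struct {
	val int
	ts  int64
}

// merge applies last-writer-wins: the entry with the higher timestamp
// survives, so replicas converge regardless of the order syncs arrive in.
func merge(store, incoming map[int]entry) {
	for k, in := range incoming {
		if curr, ok := store[k]; !ok || in.ts > curr.ts {
			store[k] = in
		}
	}
}

func main() {
	a := map[int]entry{1: {val: 10, ts: 3}}
	b := map[int]entry{1: {val: 7, ts: 5}, 2: {val: 1, ts: 1}}
	merge(a, b)
	fmt.Println(a[1].val, a[2].val) // 7 1: ts 5 beats ts 3; key 2 is new
}
```

Same sync loop, different merge rule, different guarantee: that is what "merge strategy defines your guarantees" meant in practice.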
Things I messed up (so you don’t have to)
- I underestimated how often duplicate messages show up.
- I initially treated network failures like exceptional cases, not normal flow.
- I used a slice for dedupe in broadcast (fine early, but not ideal at scale).
- I learned the hard way that “read then write” without CAS is a race factory.
- I replied too early in some flows before thinking through visibility/staleness.
What I’d improve next
- Replace the linear dedupe scan with a `map[int]struct{}` in broadcast.
- Add bounded retry/backoff instead of hot retry loops.
- Make txn merge semantics explicit (version vectors / timestamps / CRDT-style merge depending on workload).
- Capture and compare Maelstrom result artifacts more systematically between iterations.
Why this challenge was perfect for a beginner like me
Gossip Glomers gave me small, runnable problems where each “tiny” bug taught a core distributed systems rule.
Not by theory first.
By breaking first.
That worked really well for me.
If you’ve done Gossip Glomers too:
which challenge changed how you think the most — broadcast, counters, or txn?