Nithin Bharadwaj

How to Build Real-Time Collaborative Features in Go with CRDTs and Operational Transformation

As a best-selling author, I invite you to explore my books on Amazon. Don't forget to follow me on Medium and show your support. Thank you! Your support means the world!

Building real-time collaborative features, like multiple people editing the same document from different locations, is a fascinating challenge. The core problem is simple to state but tricky to solve: how do you keep everything in sync when there’s no single source of truth and network messages can arrive in any order? I want to show you how we can solve this in Go using two powerful ideas that provide solid mathematical guarantees.

Let me start with the basic problem. Imagine two people, Alice and Bob, editing a shared counter. Their screens both show the number 5. Alice clicks a button to add 1. At nearly the same moment, Bob clicks his button to add 2. What should the final value be? It should be 8 (5+1+2). But if their applications just send messages like "set to 6" and "set to 7," one update will overwrite the other, and someone's work will be lost. We need a system where operations combine instead of clash.

This is where Conflict-free Replicated Data Types, or CRDTs, come in. They are special data structures designed from the ground up for this environment. The "conflict-free" part is the goal. Their internal logic ensures that no matter when or in what order updates are processed, all copies of the data will eventually look the same. Let's build a few.

First, a Grow-Only Counter, or G-Counter. It can only increase. Each user, or "replica," in the system has its own tally. The total count is the sum of all individual tallies. Because addition is commutative (1+2 is the same as 2+1), the order of updates doesn't matter. All replicas will sum to the same total.

type GCounter struct {
    mu     sync.RWMutex
    counts map[string]uint64 // Tally per replica
    total  uint64
}

func (gc *GCounter) Increment(replicaID string) uint64 {
    gc.mu.Lock()
    defer gc.mu.Unlock()
    gc.counts[replicaID]++
    gc.total++
    return gc.counts[replicaID]
}

func (gc *GCounter) Value() uint64 {
    gc.mu.RLock()
    defer gc.mu.RUnlock()
    return gc.total
}

To sync two of these counters, you merge them. The merge function takes the maximum count seen from each replica. This is a monotonic operation—it only moves forward, never backward, which is key to convergence.

func (gc *GCounter) Merge(other *GCounter) {
    gc.mu.Lock()
    defer gc.mu.Unlock()
    for replicaID, count := range other.counts {
        if current, exists := gc.counts[replicaID]; !exists || count > current {
            gc.total += count - current
            gc.counts[replicaID] = count
        }
    }
}

A G-Counter is "state-based." We periodically send its entire counts map to other replicas. They merge it with their own state. This is robust and simple. But sending the whole state can be inefficient for large or frequently changing data.
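To make the state-based flow concrete, here is a minimal usage sketch. The NewGCounter constructor and the example itself are additions for illustration only; the snippets above assume the counts map is initialized somewhere.

// NewGCounter is a convenience constructor added for this example;
// the earlier snippets assume the counts map is initialized elsewhere.
func NewGCounter() *GCounter {
    return &GCounter{counts: make(map[string]uint64)}
}

func main() {
    alice := NewGCounter()
    bob := NewGCounter()

    alice.Increment("alice") // Alice adds 1
    bob.Increment("bob")     // Bob adds 1
    bob.Increment("bob")     // ...and 1 more

    // State-based sync: each side merges the other's full state.
    alice.Merge(bob)
    bob.Merge(alice)

    fmt.Println(alice.Value(), bob.Value()) // Both print 3
}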

Now, what if we need a counter that can also decrease? We can build a Positive-Negative Counter (PN-Counter) from two G-Counters: one for increments, one for decrements. The value is the difference.

type PNCounter struct {
    inc *GCounter
    dec *GCounter
}

func (pnc *PNCounter) Value() int64 {
    return int64(pnc.inc.Value()) - int64(pnc.dec.Value())
}

func (pnc *PNCounter) Merge(other *PNCounter) {
    pnc.inc.Merge(other.inc)
    pnc.dec.Merge(other.dec)
}

The merge logic works the same way: we merge the increment counters and the decrement counters separately. Each half is still a monotonic, commutative G-Counter, and the value is simply their difference, so convergence is still guaranteed.
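The struct above only shows Value and Merge. Local updates are just increments on the appropriate inner counter; a sketch:

func (pnc *PNCounter) Increment(replicaID string) {
    pnc.inc.Increment(replicaID)
}

func (pnc *PNCounter) Decrement(replicaID string) {
    // A decrement is just an increment on the "negative" G-Counter.
    pnc.dec.Increment(replicaID)
}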

For collaborative lists or text, things get more interesting. A simple approach is a Last-Write-Wins Register (LWW-Register). It holds a single value (like a document's title). Every change is stamped with a timestamp and the replica ID. The "winner" is the change with the highest timestamp, using the replica ID to break ties.

type LWWRegister struct {
    mu        sync.RWMutex
    value     interface{}
    timestamp int64
    replicaID string
}

func (reg *LWWRegister) Merge(other *LWWRegister) {
    reg.mu.Lock()
    defer reg.mu.Unlock()
    if other.timestamp > reg.timestamp ||
        (other.timestamp == reg.timestamp && other.replicaID > reg.replicaID) {
        reg.value = other.value
        reg.timestamp = other.timestamp
        reg.replicaID = other.replicaID
    }
}
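For completeness, a local write stamps the new value with the current wall-clock time and the writer's replica ID. Here is a sketch of such a Set method; relying on wall-clock time assumes reasonably synchronized clocks, which is a known weakness of LWW registers.

func (reg *LWWRegister) Set(value interface{}, replicaID string) {
    reg.mu.Lock()
    defer reg.mu.Unlock()
    now := time.Now().UnixNano()
    // Only move forward: a stale local clock must not clobber a newer write.
    if now > reg.timestamp {
        reg.value = value
        reg.timestamp = now
        reg.replicaID = replicaID
    }
}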

This is straightforward but can feel arbitrary to users. For a sequence of items—like the characters in a text document—we need to preserve user intent. If Alice types "A" at the beginning and Bob types "B" at the beginning, we want the result to be "AB" or "BA," not for one character to disappear. This requires tracking causality, or the "happened-before" relationship between operations.

We use a tool called a Vector Clock for this. It keeps one logical counter per replica: each time a replica makes a change, it increments its own counter. A vector clock is essentially a map of replica IDs to their latest known sequence numbers.

type VectorClock struct {
    mu      sync.RWMutex
    entries map[string]uint64
}

func (vc *VectorClock) Increment(replicaID string) {
    vc.mu.Lock()
    defer vc.mu.Unlock()
    vc.entries[replicaID]++
}

When we compare two vector clocks, we can determine the relationship between events. If every counter in Clock A is less than or equal to its counterpart in Clock B, and at least one is strictly less, then A happened before B. If some are greater and some are less, the events are concurrent: they didn't know about each other. This is crucial for resolving conflicts.
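Here is one way to sketch that comparison in Go; the Ordering type and its values are just for illustration.

type Ordering int

const (
    Before Ordering = iota
    After
    Concurrent
    Equal
)

func (vc *VectorClock) Compare(other *VectorClock) Ordering {
    vc.mu.RLock()
    defer vc.mu.RUnlock()
    other.mu.RLock()
    defer other.mu.RUnlock()

    // Walk the union of replica IDs; a missing entry counts as zero.
    ids := make(map[string]struct{})
    for id := range vc.entries {
        ids[id] = struct{}{}
    }
    for id := range other.entries {
        ids[id] = struct{}{}
    }

    less, greater := false, false
    for id := range ids {
        a, b := vc.entries[id], other.entries[id]
        if a < b {
            less = true
        } else if a > b {
            greater = true
        }
    }
    switch {
    case less && greater:
        return Concurrent
    case less:
        return Before
    case greater:
        return After
    default:
        return Equal
    }
}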

Now, for collaborative text editing, we can use an operation-based approach combined with a sequence CRDT. A common one is the Replicated Growable Array (RGA). Think of the text as a linked list of characters, each with a unique ID. When you insert a character, you place it after a specific predecessor ID. Deletions don't remove the node; they just mark it as a "tombstone."

type RGANode struct {
    id        string
    value     rune
    timestamp int64
    replicaID string
    next      []string // List of successor IDs
    deleted   bool
}

Inserting 'X' after node 'A' means adding the ID of 'X' to the next list of 'A'. If two users concurrently insert different characters after 'A', both IDs end up in that list. To get a consistent order, we sort these concurrent inserts by timestamp and replica ID, so all users eventually see the characters in the same order, even if the operations arrive in different orders over the network.

func (rga *RGA) Insert(afterID string, value rune, replicaID string) string {
    rga.mu.Lock()
    defer rga.mu.Unlock()
    now := time.Now().UnixNano()
    nodeID := fmt.Sprintf("%s-%d", replicaID, now)
    node := &RGANode{
        id:        nodeID,
        value:     value,
        timestamp: now,
        replicaID: replicaID,
    }
    // Register the node before sorting so the comparator can look it up
    rga.nodes[nodeID] = node
    prev := rga.nodes[afterID]
    // Add the new node's ID to the predecessor's list of successors
    prev.next = append(prev.next, nodeID)
    // Sort concurrent successors by (timestamp, replicaID) for a consistent order
    sort.Slice(prev.next, func(i, j int) bool {
        ni := rga.nodes[prev.next[i]]
        nj := rga.nodes[prev.next[j]]
        if ni.timestamp != nj.timestamp {
            return ni.timestamp < nj.timestamp
        }
        return ni.replicaID < nj.replicaID
    })
    return nodeID
}

This is a simplified view. Production sequence CRDTs, such as full RGA implementations, Logoot, or Treedoc, are more complex but follow the same principles: every insert gets a globally unique, totally ordered position in the sequence.

Operation-based CRDTs broadcast each operation (like "insert 'X' after node ID 'abc'"). This is great for real-time feedback. But a raw insert from Alice might need adjustment if, by the time it reaches Bob, Bob has already deleted the node 'abc'. We need to transform Alice's operation to account for Bob's intervening delete.
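The snippets that follow pass around an Operation value. I haven't pinned down its exact shape, so here is one plausible layout that the later sketches assume; the real format depends on your protocol.

// Operation is one plausible wire format for a single edit. Its fields are
// assumptions made for these sketches, not a fixed specification.
type Operation struct {
    Type      string            // "insert" or "delete"
    Position  int               // Index in the visible text (used by the OT sketch)
    AfterID   string            // Predecessor node ID (used by the RGA)
    Value     rune              // Character being inserted
    ReplicaID string            // Who generated the operation
    Clock     map[string]uint64 // Snapshot of the sender's vector clock
}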

This is Operational Transformation (OT). It's a family of algorithms that adjust incoming operations based on the history of what's already happened locally. The goal is to preserve the user's original intention. If Alice typed "X" after a word that Bob deleted, a good OT algorithm might place "X" at the beginning of the next paragraph instead of just failing.

Implementing OT correctly is complex. It often requires a central server to sequence operations or a sophisticated peer-to-peer protocol. However, when combined with the causal tracking of CRDTs, it can make collaborative editing feel very smooth and natural.

Let's put these pieces together into a synchronization manager. This struct orchestrates the whole process: applying local changes, transforming concurrent operations, updating CRDTs, and communicating with peers.

type SyncManager struct {
    crdtStore      *CRDTStore
    transformer    *OperationTransformer
    peers          *PeerRegistry
    localReplicaID string
}

func (sm *SyncManager) ApplyLocalChange(op Operation) error {
    // 1. Transform the op against any concurrent ops from other peers
    transformedOp, err := sm.transformer.Transform(op, sm.GetConcurrentOps())
    if err != nil {
        return err
    }
    // 2. Apply the transformed op to the local CRDT store
    sm.crdtStore.Apply(transformedOp)
    // 3. Broadcast the transformed op to all connected peers
    sm.BroadcastToPeers(transformedOp)
    return nil
}

func (sm *SyncManager) ReceiveRemoteOp(op Operation) error {
    // 1. This op may have been transformed already by the sender.
    // We may need to transform it against our own local history.
    transformedOp, err := sm.transformer.TransformIncoming(op, sm.GetLocalHistory())
    if err != nil {
        return err
    }
    // 2. Apply it to our local CRDT store.
    sm.crdtStore.Apply(transformedOp)
    return nil
}

The OperationTransformer is the brain for OT. Its Transform function takes two operations that happened concurrently and computes a new version of the first that makes sense in a timeline where the second has already been applied. The SyncManager above applies it repeatedly, once against each concurrent operation.

type OperationTransformer struct {
    history *OperationHistory
}

// Transform adjusts op1 given that op2 has already been applied.
func (ot *OperationTransformer) Transform(op1, op2 Operation) (Operation, error) {
    // Example simple rule for text insertions:
    // If op2 deleted the character before where op1 wants to insert,
    // adjust op1's insertion position.
    if op1.Type == "insert" && op2.Type == "delete" {
        if op1.Position > op2.Position {
            op1.Position--
        }
    }
    return op1, nil
}

Real-world OT algorithms for text, like those used in Google Docs, are far more detailed. They handle multiple cursors, formatting changes, and complex edits. Implementing them fully is a major undertaking. The code above just illustrates the concept.

For peer-to-peer networking, WebSockets are a good fit for maintaining persistent, two-way connections. When a connection is established, peers first exchange their full vector clocks or recent operation histories to get in sync. Then, they stream operations as they happen.

type PeerConnection struct {
    ws       *websocket.Conn
    outbound chan Operation
    peerID   string
    sync     *SyncManager // Reference back to the local sync manager
}

func (pc *PeerConnection) StartSync() {
    // Send our current vector clock so the peer can compute what we're missing
    pc.ws.WriteJSON(pc.sync.GetVectorClock())
    // Then start sending and receiving ops
    go pc.readPump()
    go pc.writePump()
}

Handling network partitions is critical. A replica might go offline for an hour, make changes, and then reconnect. When it reconnects, it needs to send its vector clock. Other replicas can then send all operations that happened after that point in logical time. This is why we keep an operation history, at least for a while.
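Here is a sketch of that catch-up step. The LoggedOp and OperationHistory shapes are assumptions (the real history format could differ); the idea is to return every operation whose clock is not already covered by the reconnecting peer's clock.

// LoggedOp pairs an operation with the vector clock snapshot it was sent with.
type LoggedOp struct {
    Op    Operation
    Clock map[string]uint64
}

// OperationHistory is an assumed shape for the local operation log.
type OperationHistory struct {
    mu  sync.RWMutex
    log []LoggedOp
}

// OpsSince returns every logged operation the reconnecting peer has not seen,
// i.e. every entry whose clock is not dominated by the peer's clock.
func (h *OperationHistory) OpsSince(peerClock map[string]uint64) []Operation {
    h.mu.RLock()
    defer h.mu.RUnlock()
    var missing []Operation
    for _, entry := range h.log {
        if !dominates(peerClock, entry.Clock) {
            missing = append(missing, entry.Op)
        }
    }
    return missing
}

// dominates reports whether clock a has seen everything recorded in clock b.
func dominates(a, b map[string]uint64) bool {
    for id, n := range b {
        if a[id] < n {
            return false
        }
    }
    return true
}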

There are practical considerations for production systems. You need to set bounds on operation history to avoid unlimited memory growth. You might implement a compaction mechanism: once all known replicas have acknowledged all operations up to a certain vector clock, you can discard that part of the history. The current CRDT state is the source of truth.
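Building on the same assumed history format (and the dominates helper from the previous sketch), a compaction pass could look like this:

// Compact drops every logged operation that all known peers have acknowledged.
// ackedClocks maps each peer ID to the latest vector clock that peer confirmed.
func (h *OperationHistory) Compact(ackedClocks map[string]map[string]uint64) {
    h.mu.Lock()
    defer h.mu.Unlock()
    var kept []LoggedOp
    for _, entry := range h.log {
        ackedByAll := true
        for _, peerClock := range ackedClocks {
            if !dominates(peerClock, entry.Clock) {
                ackedByAll = false
                break
            }
        }
        if !ackedByAll {
            kept = append(kept, entry)
        }
    }
    h.log = kept
}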

You also need to think about security and access control. In a collaborative document, not every user may have permission to delete sections. These permissions must be enforced at the operation level, before an incoming operation is applied to the CRDT.
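One way to sketch that enforcement is a small check in front of the CRDT apply path. The Permissions map here is purely illustrative; a real system would more likely use roles or ACLs.

// Permissions is an illustrative per-user map of allowed operation types.
type Permissions map[string]map[string]bool // userID -> operation type -> allowed

func (sm *SyncManager) applyIfAuthorized(userID string, op Operation, perms Permissions) error {
    if !perms[userID][op.Type] {
        return fmt.Errorf("user %s may not perform %q operations", userID, op.Type)
    }
    sm.crdtStore.Apply(op)
    return nil
}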

Monitoring is also important. You should track metrics like synchronization latency (how long for an edit to appear on another user's screen), operation conflict rates, and the size of the operation history. This data helps you tune performance and understand user experience.
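A minimal, hand-rolled collector for those numbers might look like the sketch below; in production you would more likely export them to something like Prometheus or OpenTelemetry.

type SyncMetrics struct {
    mu           sync.Mutex
    totalLatency time.Duration
    syncedOps    uint64
    conflicts    uint64
    historySize  int
}

// RecordSync is called once per applied remote operation.
func (m *SyncMetrics) RecordSync(latency time.Duration, conflicted bool, historySize int) {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.totalLatency += latency
    m.syncedOps++
    if conflicted {
        m.conflicts++
    }
    m.historySize = historySize
}

// AverageLatency reports how long, on average, an edit took to appear locally.
func (m *SyncMetrics) AverageLatency() time.Duration {
    m.mu.Lock()
    defer m.mu.Unlock()
    if m.syncedOps == 0 {
        return 0
    }
    return m.totalLatency / time.Duration(m.syncedOps)
}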

In my experience, starting with state-based CRDTs like counters, sets, and registers is the easiest path. They are conceptually simpler and very robust. For collaborative text, you might begin with a library implementing a proven sequence CRDT like Automerge or Yjs, as building a production-ready one from scratch is a significant project.

The beauty of this approach is its decentralization. You remove the single point of failure and scalability bottleneck of a central coordination server. Each peer is master of its own state. The mathematical properties of the data structures ensure that as long as peers can eventually communicate, they will agree on the final state. This pattern powers real-time collaboration in applications from code editors to design tools, providing the seamless experience users now expect.

📘 Check out my latest ebook for free on my channel!

Be sure to like, share, comment, and subscribe to the channel!


101 Books

101 Books is an AI-driven publishing company co-founded by author Aarav Joshi. By leveraging advanced AI technology, we keep our publishing costs incredibly low—some books are priced as low as $4—making quality knowledge accessible to everyone.

Check out our book Golang Clean Code available on Amazon.

Stay tuned for updates and exciting news. When shopping for books, search for Aarav Joshi to find more of our titles. Use the provided link to enjoy special discounts!

Our Creations

Be sure to check out our creations:

Investor Central | Investor Central Spanish | Investor Central German | Smart Living | Epochs & Echoes | Puzzling Mysteries | Hindutva | Elite Dev | Java Elite Dev | Golang Elite Dev | Python Elite Dev | JS Elite Dev | JS Schools


We are on Medium

Tech Koala Insights | Epochs & Echoes World | Investor Central Medium | Puzzling Mysteries Medium | Science & Epochs Medium | Modern Hindutva
