DEV Community

ude-p

Profiling gorilla/websocket fan-out bottlenecks in Go

I have been working on a realtime multiplayer server in Go.

The transport layer is WebSocket using gorilla/websocket, and most outbound payloads are protobuf messages generated with protobuf-go-lite.

The core server loop is simple enough:

  • accept client input
  • advance simulation
  • build world snapshot
  • broadcast snapshot to connected clients

Each simulation instance runs at 60 ticks per second, so every tick has roughly 16.7ms available for input processing, simulation, synchronization, and outbound networking.

At small connection counts, everything felt fine. Once I started testing hundreds of clients connected to the same simulation instance, the problems became obvious. Joining became slower, movement updates started lagging behind, snapshots backed up, and tick duration became unstable.

At that point the useful question stopped being:

why is the server slow?

and became:

what part of the realtime path gets worse as connection count grows?

The answer was mostly fan-out.

The server shape

Each WebSocket connection owns a session object.

The session manages:

  • the websocket connection
  • outbound queues
  • heartbeat state
  • write synchronization
  • inbound rate limiting

All websocket writes are intentionally serialized.

A gorilla/websocket connection supports at most:

  • one concurrent reader
  • one concurrent writer

In practice, it is still easy to accidentally write from multiple goroutines:

  • the normal write loop
  • heartbeat responses
  • disconnect handlers
  • internal events

So every socket write goes through one locked path:

func (s *UserSession) write(data []byte) error {
    s.writeMu.Lock()
    defer s.writeMu.Unlock()

    if err := s.Conn.SetWriteDeadline(time.Now().Add(wsWriteTimeout)); err != nil {
        return err
    }

    return s.Conn.WriteMessage(websocket.BinaryMessage, data)
}

That lock was not the main bottleneck, but it matters for correctness. I do not want multiple goroutines calling WriteMessage on the same connection.

The simulation state is owned and mutated by a single goroutine.

Every simulation instance has one command channel handling:

  • joins
  • leaves
  • player input
  • disconnects
  • internal events

Each tick roughly looks like this:

  • drain commands
  • advance simulation
  • build synchronization state
  • broadcast updates

The last step became the expensive one.
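
The tick steps above can be sketched as a single goroutine that owns the room state and drains a command channel without blocking. This is a minimal sketch, not the server's actual code: the `command` and `room` types, their fields, and the channel size are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// command is a hypothetical stand-in for joins, leaves, inputs, and so on.
type command struct {
	kind     string
	playerID int
}

// room owns all simulation state; only the goroutine calling tick touches it.
type room struct {
	commands chan command
	players  map[int]bool
	ticks    int
}

// drainCommands applies every queued command without blocking, so a quiet
// tick does not stall waiting for input. It returns how many were applied.
func (r *room) drainCommands() int {
	applied := 0
	for {
		select {
		case cmd := <-r.commands:
			switch cmd.kind {
			case "join":
				r.players[cmd.playerID] = true
			case "leave":
				delete(r.players, cmd.playerID)
			}
			applied++
		default:
			return applied
		}
	}
}

// tick runs one simulation step and reports how long it took.
func (r *room) tick() time.Duration {
	startedAt := time.Now()
	r.drainCommands()
	r.ticks++ // advancing the simulation, snapshotting, and broadcasting go here
	return time.Since(startedAt)
}

func main() {
	const tickBudget = time.Second / 60 // about 16.7ms per tick

	r := &room{commands: make(chan command, 64), players: map[int]bool{}}
	r.commands <- command{kind: "join", playerID: 1}
	r.commands <- command{kind: "join", playerID: 2}
	r.commands <- command{kind: "leave", playerID: 1}

	ticker := time.NewTicker(tickBudget)
	defer ticker.Stop()
	for i := 0; i < 3; i++ {
		<-ticker.C
		if elapsed := r.tick(); elapsed > tickBudget {
			fmt.Println("tick overran budget:", elapsed)
		}
	}
	fmt.Println("players:", len(r.players), "ticks:", r.ticks)
}
```

Because only one goroutine mutates `players`, the sketch needs no locks around simulation state; other goroutines communicate only by sending commands.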

The shape that broke first

The simulation work itself was not the first thing to fail.

The first bad shape was the broadcast path.

Each simulation tick builds a snapshot containing replicated world state:

updatedObjects := scratch.updatedObjects[:0]

for objectID, object := range simulation.objects {
    if !object.shouldReplicate {
        continue
    }

    updatedObjects = append(
        updatedObjects,
        syncSnapshotObject(objectID, object),
    )
}

Only objects marked for network replication are included in the snapshot.

That snapshot then gets sent to every connected client subscribed to the simulation instance.

The important detail here is that the snapshot payload is usually identical for every client in that room.

The original broadcast path effectively looked like this:

for _, recipient := range recipients {
    snapshot := buildSnapshot()
    data, _ := snapshot.MarshalVT()

    recipient.Send(data)
}

At small scales, this feels harmless.

At larger scales, the shape becomes expensive very quickly.

With 500 clients connected to a room that ticks 60 times per second:

500 recipients x 60 snapshots/sec
= 30,000 websocket snapshot writes/sec

But the bigger problem was not the socket writes themselves.

The expensive part was repeatedly serializing the same snapshot payload for every connected client.

That means every tick was doing:

  • protobuf marshaling per recipient
  • allocations per recipient
  • buffer growth per recipient

even though the payload itself was identical.

The first profiling lesson was simple:

if the payload is identical for every recipient, serializing it per client is wasted work

Profiling setup

The server uses Pyroscope for continuous profiling:

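// These rates turn on runtime sampling for mutex and block profiles.
runtime.SetMutexProfileFraction(5)
runtime.SetBlockProfileRate(5)

_, err := pyroscope.Start(pyroscope.Config{
    ApplicationName: "server",
    ProfileTypes: []pyroscope.ProfileType{
        pyroscope.ProfileCPU,
        pyroscope.ProfileAllocSpace,
        pyroscope.ProfileInuseSpace,
        pyroscope.ProfileGoroutines,
        // Assumed addition: without the mutex and block profile types,
        // the rates set above sample data that is never uploaded.
        pyroscope.ProfileMutexCount,
        pyroscope.ProfileMutexDuration,
        pyroscope.ProfileBlockCount,
        pyroscope.ProfileBlockDuration,
    },
})
if err != nil {
    log.Fatalf("start pyroscope: %v", err)
}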

Tick duration is also exported through OpenTelemetry:

tickDuration.Record(
    context.Background(),
    time.Since(tickStartedAt).Seconds(),
)

The metrics I cared about for this issue were:

  • active websocket sessions
  • room tick p50/p95/p99
  • protobuf marshal cost
  • allocation pressure
  • websocket backlog growth
  • goroutine buildup
  • socket write duration

The profiling run made the problem obvious.
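
For readers who want to sanity-check tick percentiles offline, here is a minimal sketch that computes nearest-rank p50/p95/p99 from recorded tick durations. The `percentile` and `sampleTicks` helpers and the sample values are made up for illustration; the server itself exports the histogram through OpenTelemetry.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the nearest-rank percentile p (1 to 100) of samples.
// It sorts a copy so the caller's slice is left untouched.
func percentile(samples []time.Duration, p int) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	rank := (p*len(sorted) + 99) / 100 // ceil(p/100 * n) in integer math
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// sampleTicks returns made-up data: 98 healthy ticks and two slow ones.
func sampleTicks() []time.Duration {
	samples := make([]time.Duration, 0, 100)
	for i := 0; i < 98; i++ {
		samples = append(samples, 4*time.Millisecond)
	}
	return append(samples, 120*time.Millisecond, 240*time.Millisecond)
}

func main() {
	samples := sampleTicks()
	for _, p := range []int{50, 95, 99} {
		fmt.Printf("p%d = %v\n", p, percentile(samples, p))
	}
}
```

In this made-up data the median and p95 stay at the healthy value while p99 lands on a slow tick, which is why the tail percentiles are the ones worth watching for a fixed tick budget.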

Reliable vs unreliable messages

One important change was separating outbound traffic by semantics.

Not every websocket message deserves the same delivery behavior.

Snapshots are latest-state messages. If the client misses one snapshot, the correct behavior is usually to apply a newer one, not replay old movement history.

Some messages are different:

  • object created
  • object deleted
  • session ended
  • important gameplay events

Those are ordered state transitions and should not be dropped.

The session ended up with two outbound paths:

Reliable(frame OutboundFrame)
Unreliable(frame OutboundFrame)

Both paths operate on an OutboundFrame.

An OutboundFrame is the object the server uses to represent a websocket payload that is ready to be written. In this case, it carries the marshaled protobuf bytes and the release logic needed when those bytes come from a pool.

Reliable messages use a bounded queue:

Outbound chan OutboundFrame

Snapshots use a latest-only slot:

LatestUnreliable OutboundFrame
UnreliableNotify chan struct{}

The unreliable path intentionally avoids replacing a snapshot that is already waiting to be written:

func (s *UserSession) Unreliable(frame OutboundFrame) {
    s.outboundMu.Lock()

    if s.LatestUnreliable.Data != nil {
        s.outboundMu.Unlock()
        frame.Release()
        return
    }

    s.LatestUnreliable = frame
    s.outboundMu.Unlock()

    select {
    case s.UnreliableNotify <- struct{}{}:
    default:
    }
}

This part is easy to get wrong.

The simulation loop may call Unreliable 60 times per second. If every new snapshot immediately replaced the pending one, a busy session could keep overwriting the pending frame before the writer loop gets a chance to consume it.

So the rule is simple:

  • if no unreliable snapshot is pending, publish one
  • if one is already pending, drop the newer one and let the writer catch up

That keeps each session bounded to at most one pending unreliable snapshot.

Reliable messages queue because they represent state transitions. Unreliable snapshots do not queue because they represent latest world state.

A slow client may miss movement snapshots, but it should not force the server to retain stale movement history.
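
The consumer side of the two paths is not shown above, so here is a minimal sketch of how a writer might drain both. It assumes a simplified `frame` type, a non-blocking `Reliable` that reports a full queue, and a `pump` helper that records writes in memory instead of touching a socket; all of these names and policies are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// frame is a simplified stand-in for the article's OutboundFrame.
type frame struct {
	Data    []byte
	release func()
}

func (f frame) Release() {
	if f.release != nil {
		f.release()
	}
}

// session is a hypothetical, trimmed-down UserSession.
type session struct {
	outbound         chan frame    // reliable: bounded FIFO queue
	unreliableNotify chan struct{} // capacity 1, wakes the writer
	mu               sync.Mutex
	latestUnreliable frame    // unreliable: single pending slot
	written          [][]byte // stands in for real socket writes
}

func newSession(queueSize int) *session {
	return &session{
		outbound:         make(chan frame, queueSize),
		unreliableNotify: make(chan struct{}, 1),
	}
}

// Reliable enqueues without blocking and reports false when the queue is
// full. How to react to false is an assumed policy left to the caller.
func (s *session) Reliable(f frame) bool {
	select {
	case s.outbound <- f:
		return true
	default:
		f.Release()
		return false
	}
}

// Unreliable keeps an already pending snapshot and drops the newer one.
func (s *session) Unreliable(f frame) {
	s.mu.Lock()
	if s.latestUnreliable.Data != nil {
		s.mu.Unlock()
		f.Release()
		return
	}
	s.latestUnreliable = f
	s.mu.Unlock()
	select {
	case s.unreliableNotify <- struct{}{}:
	default:
	}
}

// takeUnreliable empties the slot so the next snapshot can be published.
func (s *session) takeUnreliable() (frame, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	f := s.latestUnreliable
	s.latestUnreliable = frame{}
	return f, f.Data != nil
}

// pump writes everything currently pending, then returns. A long-lived
// writer goroutine would block in the select instead of using default.
func (s *session) pump() {
	for {
		select {
		case f := <-s.outbound:
			s.written = append(s.written, f.Data)
			f.Release()
		case <-s.unreliableNotify:
			if f, ok := s.takeUnreliable(); ok {
				s.written = append(s.written, f.Data)
				f.Release()
			}
		default:
			return
		}
	}
}

func main() {
	s := newSession(2)
	s.Unreliable(frame{Data: []byte("snapshot-1")})
	s.Unreliable(frame{Data: []byte("snapshot-2")}) // dropped: slot occupied
	s.Reliable(frame{Data: []byte("object created")})
	s.pump()
	fmt.Println("frames written:", len(s.written))
}
```

Reliable frames keep their FIFO order inside the channel, while the relative order between the two paths is not fixed because `select` picks among ready cases at random. Every frame is released exactly once, whether it was written or dropped.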

Marshaling once per broadcast

The fix was to move protobuf marshaling out of the recipient loop.

The current broadcast path marshals once:

snapshotFrame := MarshalOutboundFrame(
    snapshotEnvelope,
    snapshotRecipients,
)

Then shares the same frame across sessions:

for _, recipient := range recipients {
    recipient.Unreliable(snapshotFrame)
}

That changes the scaling shape from:

marshal protobuf once per recipient

to:

marshal protobuf once per broadcast

The network work is still multiplied by recipient count.

The serialization work is not.
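
To make the difference concrete, here is a minimal sketch that counts encode calls in both shapes. The `snapshot` type with its counting `Marshal` method is a hypothetical stand-in for a generated `MarshalVT`, and `recipient.Send` only records bytes in memory.

```go
package main

import "fmt"

// snapshot is a hypothetical payload whose Marshal counts its own calls,
// standing in for a generated protobuf marshal method.
type snapshot struct {
	objects      []uint32
	marshalCalls *int
}

// Marshal encodes each object ID as four little-endian bytes.
func (s snapshot) Marshal() []byte {
	*s.marshalCalls++
	out := make([]byte, 0, 4*len(s.objects))
	for _, id := range s.objects {
		out = append(out, byte(id), byte(id>>8), byte(id>>16), byte(id>>24))
	}
	return out
}

// recipient records what it was sent instead of writing to a socket.
type recipient struct{ received [][]byte }

func (r *recipient) Send(data []byte) { r.received = append(r.received, data) }

// broadcastPerRecipient marshals inside the loop: one encode per client.
func broadcastPerRecipient(s snapshot, recipients []*recipient) {
	for _, r := range recipients {
		r.Send(s.Marshal())
	}
}

// broadcastOnce marshals before the loop and shares the bytes read-only.
func broadcastOnce(s snapshot, recipients []*recipient) {
	data := s.Marshal()
	for _, r := range recipients {
		r.Send(data)
	}
}

// countMarshals returns the marshal calls for one tick with n recipients.
func countMarshals(n int, broadcast func(snapshot, []*recipient)) int {
	calls := 0
	s := snapshot{objects: []uint32{1, 2, 3}, marshalCalls: &calls}
	recipients := make([]*recipient, n)
	for i := range recipients {
		recipients[i] = &recipient{}
	}
	broadcast(s, recipients)
	return calls
}

func main() {
	fmt.Println("per recipient:", countMarshals(500, broadcastPerRecipient))
	fmt.Println("once:", countMarshals(500, broadcastOnce))
}
```

With 500 recipients the first shape encodes 500 times per tick and the second encodes once. Sharing is only safe as long as no recipient mutates the shared bytes.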

Sharing broadcast frames safely

Once protobuf marshaling moved out of the recipient loop, the server needed a safe way to share the same websocket payload across many sessions.

The broadcast path now produces one shared OutboundFrame:

type OutboundFrame struct {
    Data    []byte
    refs    *atomic.Int32
    release func()
}

The frame contains:

  • the marshaled websocket payload
  • a reference counter
  • a cleanup callback for returning pooled buffers

The important detail is that the byte slice comes from a pool.

The server marshals protobuf data into a reusable buffer:

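// size is assumed to be the encoded length of the envelope, computed
// before this point (for example from the generated SizeVT method).
buf := outboundFrameBufferPool.Get().(*writeBuffer)

// Grow the pooled buffer only when it is too small for this payload.
if cap(buf.data) < size {
    buf.data = make([]byte, size)
}

data := buf.data[:size]

// Encode directly into the pooled slice instead of allocating a new one.
_, err := envelope.MarshalToVT(data)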

That same byte slice then gets shared across every recipient:

for _, recipient := range recipients {
    recipient.Unreliable(snapshotFrame)
}

At that point ownership matters.

The pooled buffer cannot go back into the pool until every session has either written or dropped the frame.

So the frame carries a reference count:

func (f OutboundFrame) Release() {
    if f.refs == nil {
        return
    }

    if f.refs.Add(-1) == 0 && f.release != nil {
        f.release()
    }
}

Each session releases the frame after writing or dropping it:

err := s.write(frame.Data)
frame.Release()

Without ownership tracking, pooled buffers become dangerous very quickly. One session can still be writing bytes while another goroutine has already returned the buffer to the pool for reuse.
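
The ownership rule can be exercised in isolation. The sketch below reuses the `OutboundFrame` shape and `Release` method shown above; the `newSharedFrame` constructor, the `bufferPool` variable, and the `onRelease` hook are hypothetical and exist only so the example can observe when the buffer goes back to the pool.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// writeBuffer wraps a reusable byte slice so the pool stores a pointer.
type writeBuffer struct{ data []byte }

var bufferPool = sync.Pool{New: func() any { return &writeBuffer{} }}

// OutboundFrame mirrors the struct from the article.
type OutboundFrame struct {
	Data    []byte
	refs    *atomic.Int32
	release func()
}

// Release drops one reference; the last holder returns the buffer.
func (f OutboundFrame) Release() {
	if f.refs == nil {
		return
	}
	if f.refs.Add(-1) == 0 && f.release != nil {
		f.release()
	}
}

// newSharedFrame copies payload into a pooled buffer and sets the
// reference count to the number of sessions that will receive the frame.
// onRelease is an observation hook for this sketch only.
func newSharedFrame(payload []byte, holders int32, onRelease func()) OutboundFrame {
	buf := bufferPool.Get().(*writeBuffer)
	if cap(buf.data) < len(payload) {
		buf.data = make([]byte, len(payload))
	}
	data := buf.data[:len(payload)]
	copy(data, payload)

	refs := new(atomic.Int32)
	refs.Store(holders)
	return OutboundFrame{
		Data: data,
		refs: refs,
		release: func() {
			bufferPool.Put(buf)
			if onRelease != nil {
				onRelease()
			}
		},
	}
}

func main() {
	const sessions = 500
	var returned atomic.Int32
	frame := newSharedFrame([]byte("snapshot"), sessions, func() { returned.Add(1) })

	var wg sync.WaitGroup
	for i := 0; i < sessions; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = len(frame.Data) // a real session would write the bytes here
			frame.Release()
		}()
	}
	wg.Wait()
	fmt.Println("buffer returned to pool", returned.Load(), "time(s)")
}
```

One design caveat in this sketch: the count must match the number of sessions that actually receive the frame. If it is set higher, or set to zero recipients, the buffer is never returned; if it is set lower, the buffer can be recycled while a session is still writing it.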

The optimized broadcast path eventually became:

build snapshot once per tick
marshal protobuf once per broadcast
write into pooled buffer
share frame across recipients
release pooled buffer after all recipients finish
drop unreliable snapshots while one is already pending

None of these changes remove the cost of socket writes. Every connected client still needs its own websocket write.

What they remove is the avoidable work around the write:

  • repeated protobuf marshaling
  • repeated buffer allocation
  • stale snapshot queue buildup
  • unnecessary garbage collector pressure

Profiling result

This was the test shape:

simulation tick rate: 60Hz
connected clients: ~500
transport: gorilla/websocket
payload format: protobuf
snapshot delivery: unreliable latest-only

The profiling result looked like this:

| Metric | Unoptimized | Optimized |
| --- | --- | --- |
| Snapshot builds per tick | ~500 | 1 |
| Snapshot marshals per tick | ~500 | 1 |
| Snapshot marshals/sec | ~30,000 | ~60 |
| Pooled frame buffers | no | yes |
| Snapshot delivery | queued per recipient | latest-only |
| Room tick p99 | ~243ms | ~18–22ms |
| Allocation pressure | very high | much lower |
| WebSocket backlog growth | severe during bursts | reduced |
| CPU time in protobuf marshal path | dominant hotspot | mostly removed from broadcast multiplier |

The important part is not just the p99 number.

The important part is the shape change.

The optimized version still performs one websocket write per connected client, but it no longer rebuilds and serializes identical snapshot payloads per recipient. That change moved protobuf encoding and buffer allocation out of the recipient loop.

Takeaways

The issue was not that gorilla/websocket is slow.

The issue was that fan-out multiplies small costs very aggressively.

At one client:

  • protobuf marshaling is noise
  • allocations are noise
  • queue pressure is noise

At hundreds of clients and 60 ticks per second, those costs become visible very quickly in:

  • CPU usage
  • allocations
  • websocket backlog growth
  • tick latency

The main changes were:

  • separate reliable events from snapshots
  • make snapshots latest-only
  • marshal broadcast payloads once
  • share pooled frames across sessions
  • use ref-counted ownership for shared buffers

There is still a larger scaling problem left.

Right now every replicated object is still sent to every subscribed client. That means the network shape is still:

objects x recipients

The next step is proper interest management:

  • visibility filtering
  • area-of-interest replication
  • per-client relevance filtering
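
As a rough illustration of the simplest of these, here is a sketch of a radius-based visibility filter. The `vec2` and `object` types, the `visibleTo` function, and the idea of a fixed view radius are all assumptions for this example, not something the server implements today.

```go
package main

import "fmt"

// vec2 and object are hypothetical, minimal world types.
type vec2 struct{ X, Y float64 }

type object struct {
	ID  uint32
	Pos vec2
}

// visibleTo returns the objects within radius of viewer. It appends into
// dst[:0] so a caller can reuse one scratch slice across ticks.
func visibleTo(dst []object, objects []object, viewer vec2, radius float64) []object {
	dst = dst[:0]
	r2 := radius * radius // compare squared distances to avoid a sqrt
	for _, o := range objects {
		dx, dy := o.Pos.X-viewer.X, o.Pos.Y-viewer.Y
		if dx*dx+dy*dy <= r2 {
			dst = append(dst, o)
		}
	}
	return dst
}

func main() {
	world := []object{
		{ID: 1, Pos: vec2{0, 0}},
		{ID: 2, Pos: vec2{3, 4}},
		{ID: 3, Pos: vec2{10, 0}},
	}
	visible := visibleTo(nil, world, vec2{0, 0}, 5)
	fmt.Println("visible objects:", len(visible))
}
```

Payloads filtered this way can differ per client, so they cannot all share one marshaled frame the way identical snapshots do.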

But I would not start there.

First:

  • profile the current system
  • remove multiplied work
  • fix ownership problems
  • stabilize the hot path

Then decide whether the architecture actually needs a larger cut.
