My company's production system runs on an old version of ejabberd (18), so I needed to write a WebSocket client from scratch.
I wrote the code, tested it, and it was deployed to production.
Then, seemingly at random, the whole ejabberd VM started crashing with heap allocation errors.
This is the story of how I found a silent memory leak, how etop became my best friend, and why unbounded buffers are the real villains in network programming.
🧩 The Architecture (Why It Wasn’t Just a Loop)
Our gaming backend needs hundreds to thousands of persistent WebSocket connections, so the structure was a classic Erlang supervision tree:
- owebsocket_sup → root supervisor
- connection_manager → maintains the required number of connections and handles periodic reconnects, because infra layers (e.g., AWS load balancers) may terminate WebSocket sessions after a fixed lifetime
- owebsocket_client_sup → dynamic supervisor
- owebsocket_client → one process per connection
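Roughly, the root supervisor looks like this. This is a minimal sketch: the child specs and restart strategy here are simplified assumptions, not the production configuration.
-module(owebsocket_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Children = [
        %% keeps the desired number of connections alive and reconnects periodically
        #{id => connection_manager,
          start => {connection_manager, start_link, []},
          type => worker},
        %% dynamic supervisor that starts one owebsocket_client per connection
        #{id => owebsocket_client_sup,
          start => {owebsocket_client_sup, start_link, []},
          type => supervisor}
    ],
    {ok, {SupFlags, Children}}.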
Each worker is isolated. One bad connection shouldn't take down the system.
In theory.
🕳️ The Bug That Hid in Plain Sight
Here’s the heart of the problem — my frame parsing logic:
retrieve_frame(State, Data) ->
    Buffer = State#owebsocket_state.buffer,
    UpdatedBuffer = <<Buffer/bits, Data/bits>>,
    State#owebsocket_state{buffer = UpdatedBuffer}.
Looks harmless, right?
Append new data → check for complete frame → wait for more.
But here’s the catch: TCP is a stream, not a message boundary.
If the server sent partial frames, or a huge frame, or junk my parser didn't recognize, my code never hit the "complete frame" condition.
It just kept appending.
And appending.
And appending.
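(For the curious: "checking for a complete frame" means reading the payload length out of the WebSocket frame header, per RFC 6455. Here is a simplified sketch of that check, assuming unmasked server-to-client frames; it is not my actual parser.)
%% Does Buffer start with at least one complete WebSocket frame?
%% Header layout: FIN(1) RSV(3) OPCODE(4) MASK(1) LEN(7) [extended length].
has_complete_frame(<<_Fin:1, _Rsv:3, _Opcode:4, 0:1, Len:7, Rest/binary>>) when Len =< 125 ->
    byte_size(Rest) >= Len;
has_complete_frame(<<_Fin:1, _Rsv:3, _Opcode:4, 0:1, 126:7, Len:16, Rest/binary>>) ->
    byte_size(Rest) >= Len;
has_complete_frame(<<_Fin:1, _Rsv:3, _Opcode:4, 0:1, 127:7, Len:64, Rest/binary>>) ->
    byte_size(Rest) >= Len;
has_complete_frame(_) ->
    false.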
Result:
- 1 KB → fine
- 50 KB → fine
- 10 MB → suspicious
- 300 MB → oh no
- 16 GB → OOM killer enters the chat
(Yes, it really hit ~16 GB. I checked etop twice because I thought the numbers were lying.)
Large binaries in Erlang (anything over 64 bytes) live off the process heap as reference-counted binaries.
As long as a process holds a reference to a giant binary, the VM cannot free it; in my case the reference sat in the worker's state, so it would only have been released when the process died.
So one unlucky WebSocket client was quietly hoarding RAM like a dragon.
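You can see this from a shell on the node. These are standard erlang BIF calls, shown here as an illustration rather than a transcript of my actual session; the pid is hypothetical.
%% Illustrative shell checks for off-heap binary usage.
SuspectPid = pid(0, 312, 0),              %% hypothetical pid, e.g. copied from etop output
erlang:memory(binary).                    %% total refc-binary bytes across the whole node
erlang:process_info(SuspectPid, binary).  %% {binary, [{Id, Size, RefCount}, ...]} held by that process
erlang:process_info(SuspectPid, memory).  %% heap/stack only; off-heap binary data is NOT counted here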
🔍 The Worst Part: It Was Random
When debugging memory issues, consistency is your best friend.
This bug had none.
Some days it ran perfectly for 6 hours.
Some days it crashed in 20 minutes.
Some days it didn’t crash at all.
I started suspecting TCP fragmentation patterns, upstream throttling, maybe even ghost data. It was the kind of randomness that makes you question your life choices.
So I opened etop to watch per-process memory usage, and added logs to record the size of the updated buffer before the crash.
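The etop side of that looks roughly like this (a sketch using standard etop options from the observer application, sorted by memory; not necessarily the exact invocation I used):
%% Show the top 10 processes by memory, refreshing every 5 seconds.
etop:start([{sort, memory}, {lines, 10}, {interval, 5}]).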
At first: stable.
Later: one client process growing linearly.
Eventually: one worker at 700 MB alone.
And in the log file, the buffer size just before the crash was ~10 GB.
That's when the light bulb went off.
The buffer never reset.
🛠️ The Fix: Never Trust the Network
I introduced hard limits:
-define(MAX_BUFFER_SIZE, 5 * 1024 * 1024). %% 5 MB
-define(ERR_BUFFER_SIZE_LIMIT_EXCEEDED, {error, buffer_limit_exceeded}).
And updated the logic:
retrieve_frame(State, Data) ->
    Buffer = State#owebsocket_state.buffer,
    UpdatedBuffer = <<Buffer/bits, Data/bits>>,
    if
        size(UpdatedBuffer) > ?MAX_BUFFER_SIZE ->
            ?ERR_BUFFER_SIZE_LIMIT_EXCEEDED;
        true ->
            State#owebsocket_state{buffer = UpdatedBuffer}
    end.
Then handled it safely:
handle_info({Transport, Socket, Bs}, State) ->
    case handle_response(Bs, State) of
        ?ERR_BUFFER_SIZE_LIMIT_EXCEEDED ->
            error_logger:error_msg("Buffer limit exceeded. Dropping connection."),
            %% Stop receiving data
            Transport:close(Socket),
            %% Reconnect with fresh state
            NewState = try_reconnect(buffer_limit,
                                     State#owebsocket_state{socket = undefined,
                                                            buffer = <<>>}),
            {noreply, NewState};
        NewState ->
            {noreply, NewState}
    end.
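try_reconnect/2 itself isn't shown here; a minimal sketch of the idea (the delay and the message name are assumptions, not the production values):
%% Hypothetical sketch of try_reconnect/2, not the production version:
%% log why the connection was dropped, then schedule a delayed reconnect
%% so the worker comes back with a clean socket and an empty buffer.
try_reconnect(Reason, State) ->
    error_logger:warning_msg("Reconnecting after ~p~n", [Reason]),
    erlang:send_after(5000, self(), connect),
    State.
The important part is already visible in handle_info above: the worker survives, but its buffer is reset to <<>> before any retry.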
If a client misbehaves → drop it.
If a frame is too large → drop it.
If random partial data confuses the parser → drop it.
Better one connection dies than the whole VM.
📘 What I Learned
Building protocols from scratch teaches you things libraries hide from you:
✔ Always enforce buffer size limits
If you don’t, RAM will do it for you.
✔ Never assume input is reasonable
Even if the spec says so.
✔ etop and logging are your friends
etop shows you exactly which process is misbehaving, and the logs tell you what it was doing when things went wrong.
✔ Restart one process, not the VM
That’s the whole point of Erlang.
✔ The best code is sometimes the one that says “nope.”
Dropping a bad connection saved the entire system.
After adding a 5 MB limit and a reconnection strategy, the system has been stable — no more OOM kills, no more ghost crashes, and no more 2 AM staring contests with erl_crash.dump.
Sometimes reliability is not about writing more code.
It’s about knowing when to stop accepting input.