My company's production system runs on an old version of ejabberd (18), so I needed to write a WebSocket client from scratch.
I wrote the code, tested it, and it was deployed to production.
Then, seemingly at random, the whole ejabberd VM started crashing with heap allocation errors.
This is the story of how I found a silent memory leak, how etop became my best friend, and why unbounded buffers are the real villains in network programming.
🧩 The Architecture (Why It Wasn’t Just a Loop)
Our gaming backend needs hundreds to thousands of persistent WebSocket connections, so the structure was a classic Erlang supervision tree:
- owebsocket_sup → root supervisor
- connection_manager → maintains the required number of connections and handles periodic reconnects, because infra layers (e.g., AWS load balancers) may terminate WebSocket sessions after a fixed lifetime
- owebsocket_client_sup → dynamic supervisor
- owebsocket_client → one process per connection
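Roughly, the root supervisor looks like this. This is a minimal sketch: the child specs and restart strategy here are simplified assumptions, not the production configuration.
-module(owebsocket_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Children = [
        %% keeps the desired number of connections alive and reconnects periodically
        #{id => connection_manager,
          start => {connection_manager, start_link, []},
          type => worker},
        %% dynamic supervisor that starts one owebsocket_client per connection
        #{id => owebsocket_client_sup,
          start => {owebsocket_client_sup, start_link, []},
          type => supervisor}
    ],
    {ok, {SupFlags, Children}}.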
Each worker is isolated. One bad connection shouldn't take down the system.
In theory.
🕳️ The Bug That Hid in Plain Sight
Here’s the heart of the problem — my frame parsing logic:
retrieve_frame(State, Data) ->
    Buffer = State#owebsocket_state.buffer,
    UpdatedBuffer = <<Buffer/bits, Data/bits>>,
    State#owebsocket_state{buffer = UpdatedBuffer}.
Looks harmless, right?
Append new data → check for complete frame → wait for more.
But here’s the catch: TCP is a stream, not a message boundary.
If the server sent partial frames, or a huge frame, or junk my parser didn't recognize, my code never hit the "complete frame" condition.
It just kept appending.
And appending.
And appending.
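(For the curious: "checking for a complete frame" means reading the payload length out of the WebSocket frame header, per RFC 6455. Here is a simplified sketch of that check, assuming unmasked server-to-client frames; it is not my actual parser.)
%% Does Buffer start with at least one complete WebSocket frame?
%% Header layout: FIN(1) RSV(3) OPCODE(4) MASK(1) LEN(7) [extended length].
has_complete_frame(<<_Fin:1, _Rsv:3, _Opcode:4, 0:1, Len:7, Rest/binary>>) when Len =< 125 ->
    byte_size(Rest) >= Len;
has_complete_frame(<<_Fin:1, _Rsv:3, _Opcode:4, 0:1, 126:7, Len:16, Rest/binary>>) ->
    byte_size(Rest) >= Len;
has_complete_frame(<<_Fin:1, _Rsv:3, _Opcode:4, 0:1, 127:7, Len:64, Rest/binary>>) ->
    byte_size(Rest) >= Len;
has_complete_frame(_) ->
    false.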
Result:
- 1 KB → fine
- 50 KB → fine
- 10 MB → suspicious
- 300 MB → oh no
- 16 GB → OOM killer enters the chat
(Yes, it really hit ~16 GB. I checked etop twice because I thought the numbers were lying.)
Large binaries in Erlang (anything over 64 bytes) live off the process heap as reference-counted binaries.
As long as a process holds a reference to a giant binary, the VM cannot free it; in my case the reference sat in the worker's state, so it would only have been released when the process died.
So one unlucky WebSocket client was quietly hoarding RAM like a dragon.
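You can see this from a shell on the node. These are standard erlang BIF calls, shown here as an illustration rather than a transcript of my actual session; the pid is hypothetical.
%% Illustrative shell checks for off-heap binary usage.
SuspectPid = pid(0, 312, 0),              %% hypothetical pid, e.g. copied from etop output
erlang:memory(binary).                    %% total refc-binary bytes across the whole node
erlang:process_info(SuspectPid, binary).  %% {binary, [{Id, Size, RefCount}, ...]} held by that process
erlang:process_info(SuspectPid, memory).  %% heap/stack only; off-heap binary data is NOT counted here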
🔍 The Worst Part: It Was Random
When debugging memory issues, consistency is your best friend.
This bug had none.
Some days it ran perfectly for 6 hours.
Some days it crashed in 20 minutes.
Some days it didn’t crash at all.
I started suspecting TCP fragmentation patterns, upstream throttling, maybe even ghost data. It was the kind of randomness that makes you question your life choices.
So I opened etop to watch per-process memory usage, and added logs to record the size of the updated buffer before the crash.
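The etop side of that looks roughly like this (a sketch using standard etop options from the observer application, sorted by memory; not necessarily the exact invocation I used):
%% Show the top 10 processes by memory, refreshing every 5 seconds.
etop:start([{sort, memory}, {lines, 10}, {interval, 5}]).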
At first: stable.
Later: one client process growing linearly.
Eventually: one worker at 700 MB alone.
And in the log file, the buffer size just before the crash was ~10 GB.
That's when the light bulb went off.
The buffer never reset.
🛠️ The Fix: Never Trust the Network
I introduced hard limits:
-define(MAX_BUFFER_SIZE, 5 * 1024 * 1024). %% 5 MB
-define(ERR_BUFFER_SIZE_LIMIT_EXCEEDED, {error, buffer_limit_exceeded}).
And updated the logic:
retrieve_frame(State, Data) ->
    Buffer = State#owebsocket_state.buffer,
    UpdatedBuffer = <<Buffer/bits, Data/bits>>,
    if
        size(UpdatedBuffer) > ?MAX_BUFFER_SIZE ->
            ?ERR_BUFFER_SIZE_LIMIT_EXCEEDED;
        true ->
            State#owebsocket_state{buffer = UpdatedBuffer}
    end.
Then handled it safely:
handle_info({Transport, Socket, Bs}, State) ->
    case handle_response(Bs, State) of
        ?ERR_BUFFER_SIZE_LIMIT_EXCEEDED ->
            error_logger:error_msg("Buffer limit exceeded. Dropping connection."),
            %% Stop receiving data
            Transport:close(Socket),
            %% Reconnect with fresh state
            NewState = try_reconnect(buffer_limit,
                                     State#owebsocket_state{socket = undefined,
                                                            buffer = <<>>}),
            {noreply, NewState};
        NewState ->
            {noreply, NewState}
    end.
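try_reconnect/2 itself isn't shown here; a minimal sketch of the idea (the delay and the message name are assumptions, not the production values):
%% Hypothetical sketch of try_reconnect/2, not the production version:
%% log why the connection was dropped, then schedule a delayed reconnect
%% so the worker comes back with a clean socket and an empty buffer.
try_reconnect(Reason, State) ->
    error_logger:warning_msg("Reconnecting after ~p~n", [Reason]),
    erlang:send_after(5000, self(), connect),
    State.
The important part is already visible in handle_info above: the worker survives, but its buffer is reset to <<>> before any retry.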
If a client misbehaves → drop it.
If a frame is too large → drop it.
If random partial data confuses the parser → drop it.
Better one connection dies than the whole VM.
📘 What I Learned
Building protocols from scratch teaches you things libraries hide from you:
✔ Always enforce buffer size limits
If you don’t, RAM will do it for you.
✔ Never assume input is reasonable
Even if the spec says so.
✔ etop and logging are your friends
etop shows you exactly which process is misbehaving, and the logs tell you what it was doing when things went wrong.
✔ Restart one process, not the VM
That’s the whole point of Erlang.
✔ The best code is sometimes the one that says “nope.”
Dropping a bad connection saved the entire system.
After adding a 5 MB limit and a reconnection strategy, the system has been stable — no more OOM kills, no more ghost crashes, and no more 2 AM staring contests with erl_crash.dump.
Sometimes reliability is not about writing more code.
It’s about knowing when to stop accepting input.