snowlyg

Posted on Jun 4

WebSocket Half-Open Connections and gRPC Gateway Discovery in Complex LAN WebRTC Systems

#webrtc #go #grpc #android

The target scenario is not consumer calling. It is closer to B2B institutional device fleets: many shared devices, long-running uptime, centralized operations, noisy rooms, and a need to preserve both pickup distance and speech quality.

In constrained LAN environments, the hardest WebRTC failure is not always a clean disconnect. The more dangerous state is a control connection that still appears alive while application messages no longer move.

The media path can still be WebRTC. The question is how to design the control plane around it.

The Failure Mode

A WebSocket-centered signaling path can work well in a simple network. Each client connects to a central service, receives call events, and forwards offer, answer, and ICE candidate messages.

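In constrained networks, long-lived connections can become half-open: one side still believes the connection is up while the path, or the peer behind it, is gone. The client process can keep running while its upstream connection no longer carries messages. The server may still hold old state, and the UI may still show the peer as online.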

For a WebRTC call system, that can mean:

call events do not reach the callee
a timeout fires but does not close every affected UI screen
stale sessions block new calls
group-call media state is not cleaned up correctly

The Gateway Boundary

The design I prefer is to keep media on WebRTC and move control recovery into a gateway layer.

A Go/gRPC gateway can own:

device registration
gateway discovery
heartbeat and reachability state
call session routing
offer, answer, and ICE candidate forwarding
timeout and hangup convergence
cleanup after abnormal state

Audio observability also belongs in the reliability model. In shared rooms, the system has to handle background noise while preserving enough pickup distance. That means call state, recovery state, and audio state should be debugged together instead of treated as unrelated problems.

The key is bidirectional reachability: each side must be able to deliver control events to the other. Recovery should not depend only on a stale client-initiated connection noticing that it is broken.

Session State Matters

A primary node or discovery service needs a state table:

gateway id
current reachability
latest heartbeat time
current call state
active session ids
gateway version
latest error reason

This table answers operational questions: which gateway should receive a call event, whether a peer is actually reachable now, and whether an old session must be closed before a new call starts.

Call Lifecycle Convergence

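Timeout and hangup should converge on server-side state: whichever event arrives first, the server decides the session's final state and every client follows it.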

For one-to-one calls, either side can request hangup. The primary node should notify both gateways, remove media mappings, and mark the session as ended.

For group calls, one participant leaving should remove only that participant's media. A full hangup should close the entire session. Without explicit participant and media mapping, group cleanup becomes either too weak or too destructive.
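
The difference between a participant leaving and a full hangup can be sketched with an explicit participant-to-media map. This is an illustrative model only; `Session`, `Join`, `Leave`, `Hangup`, and the media id strings are hypothetical.

```go
package main

import "fmt"

// Session tracks which media mappings belong to which participant.
// All names are hypothetical and the type is not safe for concurrent use.
type Session struct {
	ID    string
	Ended bool
	media map[string][]string // participant id -> media mapping ids
}

func NewSession(id string) *Session {
	return &Session{ID: id, media: make(map[string][]string)}
}

// Join records media mappings for a participant; it is ignored once the session has ended.
func (s *Session) Join(participant string, mediaIDs ...string) {
	if s.Ended {
		return
	}
	s.media[participant] = append(s.media[participant], mediaIDs...)
}

// Leave removes only one participant's media and returns what was released.
func (s *Session) Leave(participant string) []string {
	released := s.media[participant]
	delete(s.media, participant)
	return released
}

// Hangup ends the whole session and releases every remaining mapping.
// Calling it again is a no-op, so duplicate hangups are harmless.
func (s *Session) Hangup() []string {
	if s.Ended {
		return nil
	}
	s.Ended = true
	var released []string
	for participant, ids := range s.media {
		released = append(released, ids...)
		delete(s.media, participant)
	}
	return released
}

func main() {
	s := NewSession("s-1")
	s.Join("alice", "alice-audio")
	s.Join("bob", "bob-audio", "bob-video")
	s.Join("carol", "carol-audio")

	fmt.Println(s.Leave("bob"))    // only bob's media is released
	fmt.Println(len(s.Hangup()))   // the two remaining mappings are released
	fmt.Println(s.Ended, s.Hangup()) // second hangup releases nothing
}
```

Returning the released ids from both operations gives the caller an explicit list to tear down, which is what keeps cleanup from being either too weak or too destructive.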

What to Validate

The useful tests are forced failures:

Stop one gateway while the call is ringing.
Drop the network path between two gateways.
Keep the UI open while the control stream becomes unusable.
Make one participant leave a group call.
Trigger timeout and manual hangup at nearly the same time.

The goal is not that nothing ever disconnects. The goal is that every abnormal path reaches a clear state and leaves no stale UI, stale session, or stale media mapping.
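
The last forced failure in that list, timeout and manual hangup arriving at nearly the same time, can be exercised with a small concurrency check. The sketch below assumes a hypothetical `CallSession` whose `End` method lets only the first caller perform the transition; it illustrates the property to test rather than any specific gateway code.

```go
package main

import (
	"fmt"
	"sync"
)

// CallSession is a hypothetical server-side session record.
type CallSession struct {
	mu     sync.Mutex
	ended  bool
	reason string
}

// End marks the session as ended and returns true only for the first caller.
// Later callers see the session already ended and must not repeat cleanup.
func (s *CallSession) End(reason string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.ended {
		return false
	}
	s.ended = true
	s.reason = reason
	return true
}

// Reason returns the reason recorded by the winning End call.
func (s *CallSession) Reason() string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.reason
}

// raceEnd fires a timeout and a manual hangup concurrently and returns
// how many of them actually performed the state transition.
func raceEnd(s *CallSession) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	wins := 0
	for _, reason := range []string{"timeout", "hangup"} {
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			if s.End(r) {
				mu.Lock()
				wins++
				mu.Unlock()
			}
		}(reason)
	}
	wg.Wait()
	return wins
}

func main() {
	s := &CallSession{}
	fmt.Println("transitions performed:", raceEnd(s)) // always 1
	fmt.Println("final reason:", s.Reason())          // "timeout" or "hangup", whichever won
}
```

Which event wins is not deterministic, and it does not need to be. What the test should assert is that exactly one transition happens and that the session ends with a single recorded reason.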

Takeaway

The point is not simply to replace WebSocket with gRPC. It is to design a stronger recovery model:

gateways can discover each other
the primary node owns session state
both sides can be reached by control events
timeout and hangup are explicit
group-call media mappings are tracked

For WebRTC systems in constrained LANs, recovery has to be part of the architecture, not an afterthought.

DEV Community