Rizwan Saleem

Posted on Jun 4

Building a Resilient Real-Time Chat System with WebRTC, Faye, and WebSockets: A Practical End-to-End

#react #typescript #webdev #frontend

Building a Resilient Real-Time Chat System with WebRTC, Faye, and WebSockets: A Practical End-to-End Tutorial

In this tutorial you’ll build a small, resilient real-time chat system that works across browsers, mobile devices, and constrained networks. You’ll learn how to architect a client/server model with signaling, use WebRTC data channels for peer-to-peer messaging when available, and fall back to a WebSocket-based relay when direct peer connections fail. The focus is on practical patterns, testable code, and observability, so that a live chat keeps running under load or on flaky networks.

Key takeaways

Understand when to use WebRTC data channels vs. WebSocket relays for real-time chat
Implement a signaling server to establish WebRTC connections safely and efficiently
Build a resilient message delivery pipeline with idempotent processing and retry semantics
Instrument the system for latency, jitter, and message loss to avoid silent failures
Test end-to-end flows with simulated network conditions and automated resilience tests

Overview of architecture

Clients (web and mobile) connect to a signaling server to negotiate WebRTC peers and/or WebSocket sessions.
If a direct WebRTC peer connection is possible, use a data channel for low-latency chat messages.
If a direct WebRTC connection cannot be established (restrictive firewalls, or NAT traversal failing without a TURN server), route messages through a relay server using WebSockets. The relay can also bridge peers that can’t connect directly.
A lightweight message store (in-memory for the demo, optionally a Redis-backed queue in production) tracks undelivered messages so the pipeline can offer at-least-once delivery with retries.
Observability: metrics for connection attempts, handshake times, message delivery latency, retry counts, and error rates. Tracing spans help diagnose cross-service flows.

Tech stack (example)

Frontend: TypeScript, WebRTC DataChannel, WebSocket
Backend: Node.js with Express for signaling API, ws or socket.io for WebSocket relays
Signaling protocol: simple JSON over WebSocket for negotiation (offers, answers, ICE candidates)
Data storage: in-memory store with optional Redis for production
Testing: Jest for unit tests, Playwright for end-to-end browser tests, network condition simulations

Step 1: Project scaffolding

Create a monorepo with two packages: client and server
Client handles UI, WebRTC, and signaling communication
Server handles signaling, relay, and simple message persistence

Code skeleton (high-level)

package.json scripts
- dev: concurrently run server and client
- test: run unit/integration tests
server/
- index.js: Express server with signaling endpoint
- signaling.js: handles WebRTC offer/answer exchange and ICE candidates
- relay.js: WebSocket relay that forwards messages between peers
client/
- src/main.ts: bootstraps UI, WebRTC setup, signaling client
- src/webrtc.ts: manages RTCPeerConnection, data channel events
- src/signaling.ts: WebSocket client to signaling server
- src/ui.tsx: minimal React UI for sending/receiving messages
shared-types.ts: common TypeScript types for signaling payloads

Step 2: Signaling protocol design

Use a simple message envelope over WebSocket:
- { type: 'offer', from: 'clientId', to: 'peerId', sdp: { ... } }
- { type: 'answer', from: 'clientId', to: 'peerId', sdp: { ... } }
- { type: 'ice', from: 'clientId', to: 'peerId', candidate: { ... } }
- { type: 'join', roomId, clientId }
- { type: 'message', payload: { id, text, timestamp } } (for chat if not using direct data channel)
Each client registers a unique clientId on connect.
Rooms help group peers for one-to-many chat or private chats.
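
The envelope above can be captured as shared types plus a small validating parser. This is a minimal sketch, not the article's reference implementation: the names SignalMessage and parseSignal are hypothetical, and carrying both `from` and `to` on every negotiation message is an assumption made here so the server can route without extra lookups.

```typescript
// Hypothetical shared-types sketch for the signaling envelope.
// Assumption: offer/answer/ice always carry both `from` and `to`.
type SessionDescription = { type: "offer" | "answer"; sdp: string };
type ChatPayload = { id: string; text: string; timestamp: number };

type SignalMessage =
  | { type: "join"; roomId: string; clientId: string }
  | { type: "offer" | "answer"; from: string; to: string; sdp: SessionDescription }
  | { type: "ice"; from: string; to: string; candidate: Record<string, unknown> }
  | { type: "message"; payload: ChatPayload };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" && value.length > 0;
}

// Returns a typed message, or null when the input is malformed.
function parseSignal(raw: string): SignalMessage | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (!isRecord(parsed)) return null;
  const { type, from, to, roomId, clientId, sdp, candidate, payload } = parsed;

  if (type === "join") {
    if (!isNonEmptyString(roomId) || !isNonEmptyString(clientId)) return null;
    return { type, roomId, clientId };
  }
  if (type === "offer" || type === "answer") {
    if (!isNonEmptyString(from) || !isNonEmptyString(to) || !isRecord(sdp)) return null;
    const sdpText = sdp.sdp;
    // The inner description type must match the envelope type.
    if (sdp.type !== type || typeof sdpText !== "string") return null;
    return { type, from, to, sdp: { type, sdp: sdpText } };
  }
  if (type === "ice") {
    if (!isNonEmptyString(from) || !isNonEmptyString(to) || !isRecord(candidate)) return null;
    return { type, from, to, candidate };
  }
  if (type === "message") {
    if (!isRecord(payload)) return null;
    const { id, text, timestamp } = payload;
    if (!isNonEmptyString(id) || typeof text !== "string") return null;
    if (typeof timestamp !== "number" || !Number.isFinite(timestamp)) return null;
    return { type, payload: { id, text, timestamp } };
  }
  return null; // unknown message type
}
```

Parsing into a discriminated union at the socket boundary means the rest of the client and server can switch on `type` without re-checking field shapes.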

Step 3: WebRTC data channel setup

Create RTCPeerConnection with reasonable ICE servers (STUN/TURN as needed)
For each peer, create a data channel for chat when initiating
Exchange offers/answers and ICE candidates via signaling channel
Fallback: if direct data channel cannot be established within a timeout, switch to relay mode using WebSocket messages
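
The timeout-based fallback can be sketched as a race between "data channel opened" and a timer. The helper names whenOpen and chooseTransport are hypothetical, the 4000 ms default is an assumption, and OpenableChannel is a minimal structural stand-in for the browser's RTCDataChannel so the sketch can also run outside a browser.

```typescript
type Transport = "direct" | "relay";

// Minimal structural view of a data channel (assumption: only these members are needed).
interface OpenableChannel {
  readyState: string;
  addEventListener(type: string, listener: () => void): void;
}

// Resolves once the channel is open; rejects if it errors or closes first.
function whenOpen(channel: OpenableChannel): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (channel.readyState === "open") {
      resolve();
      return;
    }
    channel.addEventListener("open", () => resolve());
    channel.addEventListener("error", () => reject(new Error("data channel error")));
    channel.addEventListener("close", () => reject(new Error("data channel closed")));
  });
}

// Picks "direct" if the channel opens before the deadline, otherwise "relay".
function chooseTransport(directReady: Promise<void>, timeoutMs = 4000): Promise<Transport> {
  return new Promise<Transport>((resolve) => {
    const timer = setTimeout(() => resolve("relay"), timeoutMs);
    directReady.then(
      () => {
        clearTimeout(timer);
        resolve("direct");
      },
      () => {
        clearTimeout(timer);
        resolve("relay"); // a failed handshake falls back immediately
      },
    );
  });
}
```

In a browser you would likely call chooseTransport(whenOpen(pc.createDataChannel('chat'))) right after starting negotiation, and buffer outgoing messages until the promise settles.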

Key code snippets (illustrative, not a drop-in complete solution)

Signaling client (client/src/signaling.ts)
- Connect to signaling server
- Send and receive messages
- Event emitters for onOffer, onAnswer, onIceCandidate, onPeerMessage
WebRTC data channel workflow (client/src/webrtc.ts)
- function createPeerConnection(peerId: string): RTCPeerConnection
- pc.createDataChannel('chat') for initiator
- pc.ondatachannel = (event) => { const dc = event.channel; dc.onmessage = (e) => handleIncomingMessage(e.data); }
- On icecandidate: send via signaling
- On negotiationneeded: createOffer, setLocalDescription, send offer via signaling
Relay fallback (server/relay.js)
- Maintain a mapping of clientId to WebSocket
- When a message destined for peer arrives and no direct path exists, forward via relay
- Handle idle timeouts and cleanup
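
The relay bullets above (clientId-to-socket map, forwarding, idle cleanup) might look like the following. This is a sketch under assumptions: RelayRouter and RelaySocket are hypothetical names, the socket is reduced to send/close so any WebSocket library could back it, and the 60-second idle limit is illustrative.

```typescript
// The subset of a WebSocket that the relay needs (assumption).
interface RelaySocket {
  send(data: string): void;
  close(): void;
}

interface RelayPeer {
  socket: RelaySocket;
  lastSeen: number; // epoch ms of the last activity from this client
}

class RelayRouter {
  private peers = new Map<string, RelayPeer>();
  private idleMs: number;

  constructor(idleMs = 60000) {
    this.idleMs = idleMs;
  }

  register(clientId: string, socket: RelaySocket, now = Date.now()): void {
    this.peers.set(clientId, { socket, lastSeen: now });
  }

  unregister(clientId: string): void {
    this.peers.delete(clientId);
  }

  // Forwards a payload to `to`; returns false when sender or recipient is unknown.
  forward(from: string, to: string, payload: unknown, now = Date.now()): boolean {
    const sender = this.peers.get(from);
    const recipient = this.peers.get(to);
    if (!sender || !recipient) return false;
    sender.lastSeen = now;
    recipient.socket.send(JSON.stringify({ type: "message", from, payload }));
    return true;
  }

  // Closes and removes clients idle for longer than idleMs; returns their ids.
  sweep(now = Date.now()): string[] {
    const removed: string[] = [];
    this.peers.forEach((peer, clientId) => {
      if (now - peer.lastSeen > this.idleMs) {
        peer.socket.close();
        this.peers.delete(clientId);
        removed.push(clientId);
      }
    });
    return removed;
  }
}
```

Calling sweep() from a periodic timer keeps the routing table from accumulating dead sockets; passing `now` explicitly makes the idle logic easy to unit test.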

Step 4: Message delivery semantics

Each chat message has:
- id: unique per message
- from, to, timestamp
- text (or structured payload)
- delivered: boolean
- acknowledged: boolean
Delivery flow:
- Sender writes to data channel or relay and immediately records a local "sent" status
- Receiver marks message as delivered upon receipt
- Optional read receipts can be implemented with separate messages
Idempotency:
- Use idempotent storage: upsert by message id
- If a message is re-delivered due to retry, the receiver should ignore duplicates by checking id
Retries:
- On delivery failure, retry with exponential backoff up to a max
- If the connection is restored, resume sending undelivered messages

Step 5: Observability and testing

Metrics to collect:
- handshake duration, connection state changes, ICE connectivity checks
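- message latency: sent timestamp vs received timestamp (clocks on two devices can drift, so prefer the sender-side round trip from send to delivery confirmation)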
- delivery success rate, retry counts, failure reasons
- WebSocket relay queue length and error rates
Tests to cover:
- Unit tests for message idempotency logic
- End-to-end test simulating network drop and reconnection
- Playwright tests verifying chat flows across two browser instances
- Integration tests for signaling and ICE candidate exchange
Network condition testing:
- Use browser dev tools to simulate latency and packet loss
- In test environment, artificially drop ICE candidates to force fallback
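
A small in-process collector is enough to start tracking the metrics listed above. The sketch below is one possible shape, not a specific metrics library: ChatMetrics and its method names are hypothetical, latency is measured as a round trip on the sender's clock, and percentiles use the nearest-rank method.

```typescript
class ChatMetrics {
  private sentAt = new Map<string, number>();
  private latencies: number[] = [];
  private counters = new Map<string, number>();

  // Generic counter, e.g. "retry", "delivery_failed", "relay_fallback".
  increment(name: string, by = 1): void {
    this.counters.set(name, (this.counters.get(name) ?? 0) + by);
  }

  count(name: string): number {
    return this.counters.get(name) ?? 0;
  }

  markSent(messageId: string, now = Date.now()): void {
    this.sentAt.set(messageId, now);
  }

  // Records the round trip for a message; returns undefined for unknown or repeated acks.
  markAcked(messageId: string, now = Date.now()): number | undefined {
    const started = this.sentAt.get(messageId);
    if (started === undefined) return undefined;
    this.sentAt.delete(messageId);
    const roundTripMs = now - started;
    this.latencies.push(roundTripMs);
    return roundTripMs;
  }

  // Nearest-rank percentile over recorded round trips (p between 0 and 100).
  percentile(p: number): number | undefined {
    if (this.latencies.length === 0) return undefined;
    const sorted = this.latencies.slice().sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length);
    return sorted[Math.max(0, Math.min(sorted.length - 1, rank - 1))];
  }
}
```

Messages that are marked sent but never acked stay in the map, so its size doubles as a rough "in flight or lost" gauge.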

Step 6: Security considerations

Authenticate users with a lightweight token (JWT or opaque token) when connecting to signaling and relay
Validate all incoming signaling messages: ensure they are targeted to legitimate peers
Do not forward messages to peers not in the same room unless intended
Turn on TLS for WebSocket (wss://) and secure signaling endpoints
Consider rate limiting and abuse protection on signaling endpoints to prevent signaling storms
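
Two of the checks above, room-scoped routing and rate limiting, are easy to sketch without committing to a framework. The names mayRoute and TokenBucket are hypothetical, and a token bucket is only one of several reasonable rate-limiting strategies; token verification is left out because it depends on the auth library you choose.

```typescript
// True only when both peers are members of the given room.
function mayRoute(rooms: Map<string, Set<string>>, roomId: string, from: string, to: string): boolean {
  const members = rooms.get(roomId);
  return members !== undefined && members.has(from) && members.has(to);
}

// Per-client token bucket: allows bursts up to `capacity`, refilled at `refillPerSec`.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private capacity: number;
  private refillPerSec: number;

  constructor(capacity: number, refillPerSec: number, now = Date.now()) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity;
    this.lastRefill = now;
  }

  // Consumes one token if available; returns false when the caller should be throttled.
  tryTake(now = Date.now()): boolean {
    const elapsedSec = Math.max(0, now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

A server would typically keep one bucket per clientId and drop (or close the socket on) any signaling message for which tryTake() returns false or mayRoute() fails.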

Step 7: Deployment tips

Start with a single signaling server and a small relay pool
Use persistent storage for message history if you need durability beyond memory
Add a simple health check endpoint for monitoring
Use containerization (Docker) for reproducible deployments and easy scaling
Consider horizontal scaling for signaling via sticky sessions or a shared pub/sub mechanism if needed

Example: minimal signaling server (Node.js with Express and ws)

The signaling server accepts WebSocket connections, assigns a clientId, and routes offers/answers/ICE candidates to the intended recipient
It does not handle media content - only signaling data and chat messages when relayed
Signaling payloads
- Offer/Answer/ICE: used to establish WebRTC connection
- Relay: when WebRTC is not possible, messages are forwarded between clients

Step-by-step implementation outline
1) Initialize repository and install dependencies

npm init -y
npm install express ws uuid
Optional, only if you choose socket.io instead of ws for the relay: npm install socket.io
For the client: install TypeScript, React, and a small UI library if desired

2) Implement signaling server

Start with an Express app
Attach a WebSocket server
Maintain a map of clientId to ws connection
Handlers:
- on connect: assign clientId, acknowledge
- on message: route to target or broadcast to room participants
- on disconnect: clean up
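
The three handlers above can be sketched independently of the WebSocket library. In this sketch SignalingHub, Sendable, and the 'welcome' acknowledgment message are hypothetical, and client ids come from a simple counter rather than the uuid package to keep the example self-contained. The hub stamps `from` itself so a client cannot claim another client's identity.

```typescript
// The only socket capability the hub needs (assumption).
interface Sendable {
  send(data: string): void;
}

type Routable = { type: string; to?: string; roomId?: string; [key: string]: unknown };

class SignalingHub {
  private sockets = new Map<string, Sendable>();
  private rooms = new Map<string, Set<string>>();
  private nextId = 1;

  // on connect: assign a clientId and acknowledge it to the client.
  connect(socket: Sendable): string {
    const clientId = `client-${this.nextId}`;
    this.nextId += 1;
    this.sockets.set(clientId, socket);
    socket.send(JSON.stringify({ type: "welcome", clientId }));
    return clientId;
  }

  // on message: join a room, route to `to`, or broadcast to the room.
  // Returns the number of recipients the message was sent to.
  handle(fromId: string, msg: Routable): number {
    const { to, roomId } = msg;
    if (msg.type === "join") {
      if (typeof roomId !== "string") return 0;
      const members = this.rooms.get(roomId) ?? new Set<string>();
      members.add(fromId);
      this.rooms.set(roomId, members);
      return 0;
    }
    const stamped = JSON.stringify({ ...msg, from: fromId });
    if (typeof to === "string") {
      const target = this.sockets.get(to);
      if (!target) return 0;
      target.send(stamped);
      return 1;
    }
    let delivered = 0;
    if (typeof roomId === "string") {
      const members = this.rooms.get(roomId);
      if (members) {
        members.forEach((memberId) => {
          const socket = this.sockets.get(memberId);
          if (memberId !== fromId && socket) {
            socket.send(stamped);
            delivered += 1;
          }
        });
      }
    }
    return delivered;
  }

  // on disconnect: forget the socket and drop empty rooms.
  disconnect(clientId: string): void {
    this.sockets.delete(clientId);
    this.rooms.forEach((members, roomId) => {
      members.delete(clientId);
      if (members.size === 0) this.rooms.delete(roomId);
    });
  }
}
```

With the ws package, each method would be called from the matching event: connect() in the server's connection handler, handle() with the parsed JSON from the socket's message event, and disconnect() from its close event.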

3) Implement relay server logic (if separate)

Accept WebSocket connections from clients
Maintain routing tables
Forward chat messages when peer is not directly connected
Ensure delivery receipts propagate back to sender

4) Implement client WebRTC/signaling integration

Connect to signaling server
Handle 'offer', 'answer', 'ice' events
Create RTCPeerConnection, data channels
Send chat messages over data channel when connected
If the data channel is not open within the timeout, switch to relay mode and send messages via signaling/relay

5) Implement message store and idempotent processing

Use a Map keyed by messageId for in-memory demo
On receive, check if id exists; if not, append to chat history and emit delivery confirmation
On failure, enqueue for retry with exponential backoff
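
On the receiving side, the Map-keyed store described above might look like this. Inbox and the 'delivered' confirmation shape are hypothetical names. One deliberate choice in this sketch goes slightly beyond the outline: duplicates are not appended to history, but they are still confirmed, on the assumption that a retry usually means the earlier confirmation was lost.

```typescript
interface IncomingMessage {
  id: string;
  from: string;
  text: string;
  timestamp: number;
}

interface DeliveryAck {
  type: "delivered";
  id: string;
}

class Inbox {
  private seen = new Map<string, IncomingMessage>();
  private ordered: IncomingMessage[] = [];

  // Stores the message only the first time its id is seen.
  // Always returns an ack so the sender can stop retrying.
  receive(message: IncomingMessage): { isNew: boolean; ack: DeliveryAck } {
    const isNew = !this.seen.has(message.id);
    if (isNew) {
      this.seen.set(message.id, message);
      this.ordered.push(message);
    }
    return { isNew, ack: { type: "delivered", id: message.id } };
  }

  // Chat history in arrival order, without duplicates.
  messages(): readonly IncomingMessage[] {
    return this.ordered;
  }
}
```

For a long-running session the `seen` map would need a bound (for example, evicting ids older than the sender's maximum retry window); the demo leaves it unbounded.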

6) Instrumentation and tests

Add simple metrics endpoints or in-process metrics collector
Write unit tests for message idempotency and retry logic
Create Playwright tests simulating two browsers sending messages
Add a test that drops WebRTC candidates to force relay flow

Illustrative example: message object

{ id: "msg-abc123", from: "alice", to: "bob", text: "Hello, Bob!", timestamp: 1697048400000, delivered: false, acknowledged: false }

Illustration: decision tree for connection path

If WebRTC data channel established and open -> use data channel for low-latency messaging
If the data channel is not open within 3-5 seconds, or a firewall blocks the connection -> fall back to relay via WebSocket
If relay path exists and remains healthy -> keep using relay until WebRTC can reconnect
If both paths fail -> surface an error to the user and retry after a backoff
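
The decision tree reads naturally as a pure function, which also makes it trivial to unit test. The names decidePath and PathState are hypothetical, the 4000 ms default sits inside the 3-5 second window above, and the extra 'wait' outcome (still negotiating, so buffer messages) is an assumption added for the period before the timeout expires.

```typescript
type PathState = {
  dataChannelOpen: boolean; // the WebRTC data channel is currently open
  relayActive: boolean; // messages are already flowing through the relay
  relayHealthy: boolean; // the relay WebSocket is connected and responsive
  msSinceConnectStart: number; // time since the current WebRTC attempt began
};

type PathDecision = "direct" | "relay" | "wait" | "error";

function decidePath(state: PathState, directTimeoutMs = 4000): PathDecision {
  // 1. An open data channel always wins.
  if (state.dataChannelOpen) return "direct";
  // 2. Keep using a healthy relay while WebRTC tries to reconnect.
  if (state.relayActive && state.relayHealthy) return "relay";
  // 3. Give the direct path a chance before falling back.
  if (state.msSinceConnectStart < directTimeoutMs) return "wait";
  // 4. Timed out: fall back to the relay, or report failure if it is down too.
  return state.relayHealthy ? "relay" : "error";
}
```

On "error" the caller would surface the failure to the user and schedule the next connection attempt after a backoff delay.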

Follow-up options

Would you like a runnable code sample repository with a minimal, production-like structure (Dockerized, with Redis optional) to try this end-to-end?
Do you prefer a pure WebRTC-based approach with TURN servers, or a relay-first approach prioritizing simplicity and reliability in constrained networks?

Rizwan Saleem | https://rizwansaleem.co

Top comments (2)

OpenIMSDK • Jun 16

I like that this frames real-time chat as a system problem rather than just a socket demo. In production, delivery state, reconnect behavior, message ordering, and client SDK consistency are usually where the complexity shows up.
