Rizwan Saleem
Building a Resilient Real-Time Chat System with WebRTC, Faye, and WebSockets: A Practical End-to-End Tutorial

In this tutorial you’ll build a small, resilient real-time chat system that works across browsers, mobile devices, and constrained networks. You’ll learn how to architect a client/server model with signaling, use WebRTC data channels for peer-to-peer messaging when available, and fall back to a WebSocket-based relay when direct peer connections fail. The focus is on practical patterns, testable code, and observability to keep a live chat running under load or on flaky networks.

Key takeaways

  • Understand when to use WebRTC data channels vs. WebSocket relays for real-time chat
  • Implement a signaling server to establish WebRTC connections safely and efficiently
  • Build a resilient message delivery pipeline with idempotent processing and retry semantics
  • Instrument the system for latency, jitter, and message loss to avoid silent failures
  • Test end-to-end flows with simulated network conditions and automated resilience tests

Overview of architecture

  • Clients (web and mobile) connect to a signaling server to negotiate WebRTC peers and/or WebSocket sessions.
  • If a direct WebRTC peer connection is possible, use a data channel for low-latency chat messages.
  • If WebRTC is blocked (for example by a restrictive firewall or a NAT that prevents a direct connection), route messages through a relay server using WebSockets. The relay can also bridge peers that can’t connect directly.
  • A lightweight message store (in-memory for demo, with an optional Redis-backed queue in production) provides at-least-once delivery guarantees and retry capabilities.
  • Observability: metrics for connection attempts, handshake times, message delivery latency, retry counts, and error rates. Tracing spans help diagnose cross-service flows.

Tech stack (example)

  • Frontend: TypeScript, WebRTC DataChannel, WebSocket
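  • Backend: Node.js with Express for the signaling API, plus ws or socket.io for the WebSocket relay (socket.io uses its own protocol on top of WebSocket, so it requires the socket.io client)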
  • Signaling protocol: simple JSON over WebSocket for negotiation (offers, answers, ICE candidates)
  • Data storage: in-memory store with optional Redis for production
  • Testing: Jest for unit tests, Playwright for end-to-end browser tests, network condition simulations

Step 1: Project scaffolding

  • Create a monorepo with two packages: client and server
  • Client handles UI, WebRTC, and signaling communication
  • Server handles signaling, relay, and simple message persistence

Code skeleton (high-level)

  • package.json scripts
    • dev: concurrently run server and client
    • test: run unit/integration tests
  • server/
    • index.js: Express server with signaling endpoint
    • signaling.js: handles WebRTC offer/answer exchange and ICE candidates
    • relay.js: WebSocket relay that forwards messages between peers
  • client/
    • src/main.ts: bootstraps UI, WebRTC setup, signaling client
    • src/webrtc.ts: manages RTCPeerConnection, data channel events
    • src/signaling.ts: WebSocket client to signaling server
    • src/ui.tsx: minimal React UI for sending/receiving messages
  • shared-types.ts: common TypeScript types for signaling payloads

Step 2: Signaling protocol design

  • Use a simple message envelope over WebSocket:
    • { type: 'offer', from: 'senderId', to: 'targetId', sdp: { ... } }
    • { type: 'answer', from: 'senderId', to: 'targetId', sdp: { ... } }
    • { type: 'ice', from: 'senderId', to: 'targetId', candidate: { ... } }
    • { type: 'join', roomId, clientId }
    • { type: 'message', payload: { id, text, timestamp } } (for chat if not using direct data channel)
  • Each client registers a unique clientId on connect.
  • Rooms help group peers for one-to-many chat or private chats.
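
The envelope above maps naturally onto a TypeScript discriminated union. The sketch below is one possible shape for shared-types.ts, not a finished protocol: it assumes every routed message carries both `from` and `to`, the helper names `parseSignal`, `isString`, and `isObject` are made up for illustration, and the SDP and candidate payloads are treated as opaque objects.

```typescript
// Sketch of shared-types.ts. Carrying both `from` and `to` on routed messages is an assumption.
type SignalMessage =
  | { type: "join"; roomId: string; clientId: string }
  | { type: "offer"; from: string; to: string; sdp: object }
  | { type: "answer"; from: string; to: string; sdp: object }
  | { type: "ice"; from: string; to: string; candidate: object }
  | { type: "message"; payload: { id: string; text: string; timestamp: number } };

const isString = (v: unknown): v is string => typeof v === "string" && v.length > 0;
const isObject = (v: unknown): v is Record<string, unknown> =>
  typeof v === "object" && v !== null;

// Parse and validate one raw WebSocket frame. Returns null for anything malformed,
// so callers never act on a half-valid envelope.
function parseSignal(raw: string): SignalMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (!isObject(data)) return null;
  const { type: kind, from, to, sdp, candidate, roomId, clientId, payload } = data;
  switch (kind) {
    case "join":
      return isString(roomId) && isString(clientId) ? { type: "join", roomId, clientId } : null;
    case "offer":
      return isString(from) && isString(to) && isObject(sdp)
        ? { type: "offer", from, to, sdp }
        : null;
    case "answer":
      return isString(from) && isString(to) && isObject(sdp)
        ? { type: "answer", from, to, sdp }
        : null;
    case "ice":
      return isString(from) && isString(to) && isObject(candidate)
        ? { type: "ice", from, to, candidate }
        : null;
    case "message": {
      if (!isObject(payload)) return null;
      const { id, text, timestamp } = payload;
      return isString(id) && typeof text === "string" && typeof timestamp === "number"
        ? { type: "message", payload: { id, text, timestamp } }
        : null;
    }
    default:
      return null;
  }
}
```

Because the union is keyed on `type`, a `switch` on a parsed message gives you the right fields in each branch without casts.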

Step 3: WebRTC data channel setup

  • Create RTCPeerConnection with reasonable ICE servers (STUN/TURN as needed)
  • For each peer, create a data channel for chat when initiating
  • Exchange offers/answers and ICE candidates via signaling channel
  • Fallback: if direct data channel cannot be established within a timeout, switch to relay mode using WebSocket messages
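
The timeout-based fallback in the last bullet can be sketched as a race between "the data channel opened" and a timer. This is a minimal illustration under stated assumptions: `chooseTransport` and the `Transport` type are hypothetical names, and `channelOpen` stands for whatever promise your code resolves when the data channel's `open` event fires.

```typescript
type Transport = "datachannel" | "relay";

// Resolves to "datachannel" if `channelOpen` fulfils before the deadline,
// and to "relay" if it rejects or the deadline passes first.
function chooseTransport(channelOpen: Promise<void>, timeoutMs: number): Promise<Transport> {
  return new Promise<Transport>((resolve) => {
    const timer = setTimeout(() => resolve("relay"), timeoutMs);
    channelOpen.then(
      () => {
        clearTimeout(timer);
        resolve("datachannel");
      },
      () => {
        clearTimeout(timer);
        resolve("relay");
      },
    );
  });
}
```

A data channel that opens after the deadline is ignored by this promise, so if you want to upgrade from relay to peer-to-peer later, keep listening for the `open` event separately.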

Key code snippets (illustrative, not a drop-in complete solution)

  • Signaling client (client/src/signaling.ts)

    • Connect to signaling server
    • Send and receive messages
    • Event emitters for onOffer, onAnswer, onIceCandidate, onPeerMessage
  • WebRTC data channel workflow (client/src/webrtc.ts)

    • function createPeerConnection(peerId: string): RTCPeerConnection
    • pc.createDataChannel('chat') for initiator
    • pc.ondatachannel = (event) => { const dc = event.channel; dc.onmessage = (e) => handleIncomingMessage(e.data); }
    • On icecandidate: send via signaling
    • On negotiationneeded: createOffer, setLocalDescription, send offer via signaling
  • Relay fallback (server/relay.js)

    • Maintain a mapping of clientId to WebSocket
    • When a message destined for peer arrives and no direct path exists, forward via relay
    • Handle idle timeouts and cleanup
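
To make the WebRTC workflow above concrete, here is a rough sketch of how client/src/webrtc.ts might wire the events together. Several things are assumptions rather than part of the article: the `SignalingSender` interface, the extra parameters on `createPeerConnection`, the injectable `makePc` factory (there so the wiring can be exercised without a browser), and the placeholder STUN URL. It also leaves out handling of incoming offers, answers, and remote candidates.

```typescript
// Hypothetical minimal signaling interface; the real signaling.ts client is not shown here.
interface SignalingSender {
  send(msg: Record<string, unknown>): void;
}

// Placeholder STUN URL: substitute your own STUN/TURN servers.
const defaultRtcConfig: RTCConfiguration = {
  iceServers: [{ urls: "stun:stun.example.org:3478" }],
};

function createPeerConnection(
  peerId: string,
  signaling: SignalingSender,
  handleIncomingMessage: (data: string) => void,
  initiator: boolean,
  makePc: () => RTCPeerConnection = () => new RTCPeerConnection(defaultRtcConfig),
): RTCPeerConnection {
  const pc = makePc();

  // Trickle ICE: forward each local candidate as soon as it is gathered.
  // A null candidate marks the end of gathering and is not sent.
  pc.onicecandidate = (ev) => {
    if (ev.candidate) signaling.send({ type: "ice", to: peerId, candidate: ev.candidate });
  };

  // Fires after the initiator creates its data channel: create and send an offer.
  pc.onnegotiationneeded = async () => {
    const offer = await pc.createOffer();
    await pc.setLocalDescription(offer);
    signaling.send({ type: "offer", to: peerId, sdp: offer });
  };

  // Answering side: the remote peer's channel arrives through the datachannel event.
  pc.ondatachannel = (ev) => {
    ev.channel.onmessage = (e) => handleIncomingMessage(String(e.data));
  };

  // Handlers are attached before the channel is created so no event is missed.
  if (initiator) {
    const dc = pc.createDataChannel("chat");
    dc.onmessage = (e) => handleIncomingMessage(String(e.data));
  }
  return pc;
}
```

In a browser you would call `createPeerConnection("bob", signalingClient, showMessage, true)` on the initiating side and pass `false` on the answering side.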

Step 4: Message delivery semantics

  • Each chat message has:
    • id: unique per message
    • from, to, timestamp
    • text (or structured payload)
    • delivered: boolean
    • acknowledged: boolean
  • Delivery flow:
    • Sender writes to data channel or relay and immediately records a local "sent" status
    • Receiver marks message as delivered upon receipt
    • Optional read receipts can be implemented with separate messages
  • Idempotency:
    • Use idempotent storage: upsert by message id
    • If a message is re-delivered due to retry, the receiver should ignore duplicates by checking id
  • Retries:
    • On delivery failure, retry with exponential backoff up to a max
    • If the connection is restored, resume sending undelivered messages

Step 5: Observability and testing

  • Metrics to collect:
    • handshake duration, connection state changes, ICE connectivity checks
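    • message latency: round-trip time from send to acknowledgment, measured on the sender's clock (comparing sent and received timestamps across devices is distorted by clock skew)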
    • delivery success rate, retry counts, failure reasons
    • WebSocket relay queue length and error rates
  • Tests to cover:
    • Unit tests for message idempotency logic
    • End-to-end test simulating network drop and reconnection
    • Playwright tests verifying chat flows across two browser instances
    • Integration tests for signaling and ICE candidate exchange
  • Network condition testing:
    • Use browser dev tools to simulate latency and packet loss
    • In test environment, artificially drop ICE candidates to force fallback

Step 6: Security considerations

  • Authenticate users with a lightweight token (JWT or opaque token) when connecting to signaling and relay
  • Validate all incoming signaling messages: ensure they are targeted to legitimate peers
  • Do not forward messages to peers not in the same room unless intended
  • Enable TLS for WebSocket connections (wss://) and secure the signaling endpoints
  • Consider rate limiting and abuse protection on signaling endpoints to prevent signaling storms
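
Two of the checks above, same-room forwarding and rate limiting, fit in a few lines. The sketch below assumes the server keeps a `roomOf` map from clientId to the room recorded at join time. `mayForward` and `TokenBucket` are hypothetical names, and a token bucket is just one common way to rate limit, not something this tutorial prescribes.

```typescript
// Allow forwarding only when both clients are known and share a room.
function mayForward(roomOf: Map<string, string>, from: string, to: string): boolean {
  const room = roomOf.get(from);
  return room !== undefined && room === roomOf.get(to);
}

// Token bucket: up to `capacity` messages in a burst, refilled at `refillPerSec`.
class TokenBucket {
  private tokens: number;
  private last: number;
  private capacity: number;
  private refillPerSec: number;
  private now: () => number;

  constructor(capacity: number, refillPerSec: number, now: () => number = () => Date.now()) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.now = now;
    this.tokens = capacity;
    this.last = now();
  }

  // Returns true and consumes a token if the caller is within its budget.
  tryTake(): boolean {
    const t = this.now();
    const refill = ((t - this.last) / 1000) * this.refillPerSec;
    this.tokens = Math.min(this.capacity, this.tokens + refill);
    this.last = t;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

Keeping one bucket per connection and dropping (or closing) on `tryTake() === false` stops a single misbehaving client from flooding the signaling channel.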

Step 7: Deployment tips

  • Start with a single signaling server and a small relay pool
  • Use persistent storage for message history if you need durability beyond memory
  • Add a simple health check endpoint for monitoring
  • Use containerization (Docker) for reproducible deployments and easy scaling
  • Consider horizontal scaling for signaling via sticky sessions or a shared pub/sub mechanism if needed

Example: minimal signaling server (Node.js with Express and ws)

  • The signaling server accepts WebSocket connections, assigns a clientId, and routes offers/answers/ICE candidates to the intended recipient
  • It does not handle media content - only signaling data and chat messages when relayed

  • Signaling payloads

    • Offer/Answer/ICE: used to establish WebRTC connection
    • Relay: when WebRTC is not possible, messages are forwarded between clients
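
The routing core of such a signaling server can be written independently of Express and ws, which also makes it easy to unit test. This is a sketch with assumptions: `SocketLike` is a structural stand-in for a ws connection (anything with `send`), the `welcome` acknowledgment type is invented for illustration, and a simple counter stands in for uuid-generated client ids.

```typescript
// Structural stand-in for a WebSocket connection.
interface SocketLike {
  send(data: string): void;
}

interface Routable {
  type: string;
  to?: string;
  [key: string]: unknown;
}

class SignalingRouter {
  private clients = new Map<string, SocketLike>();
  private rooms = new Map<string, Set<string>>();
  private nextId = 1;

  // On connect: assign a clientId and acknowledge it to the client.
  connect(socket: SocketLike): string {
    const clientId = "client-" + this.nextId++;
    this.clients.set(clientId, socket);
    socket.send(JSON.stringify({ type: "welcome", clientId }));
    return clientId;
  }

  join(clientId: string, roomId: string): void {
    let members = this.rooms.get(roomId);
    if (!members) {
      members = new Set<string>();
      this.rooms.set(roomId, members);
    }
    members.add(clientId);
  }

  // Forward an offer/answer/ICE message to msg.to. `from` is stamped by the server,
  // overriding whatever the client sent. Returns false if the target is unknown.
  route(from: string, msg: Routable): boolean {
    if (msg.to === undefined) return false;
    const target = this.clients.get(msg.to);
    if (!target) return false;
    target.send(JSON.stringify({ ...msg, from }));
    return true;
  }

  // Send to every other member of a room the sender belongs to; returns the recipient count.
  broadcast(from: string, roomId: string, msg: Routable): number {
    const members = this.rooms.get(roomId);
    if (!members || !members.has(from)) return 0;
    let sent = 0;
    members.forEach((id) => {
      const s = this.clients.get(id);
      if (id !== from && s) {
        s.send(JSON.stringify({ ...msg, from }));
        sent += 1;
      }
    });
    return sent;
  }

  // On disconnect: forget the socket and remove the client from all rooms.
  disconnect(clientId: string): void {
    this.clients.delete(clientId);
    this.rooms.forEach((members) => members.delete(clientId));
  }
}
```

Wiring this into a real server then comes down to calling `connect` when a socket opens, `route` or `broadcast` for each parsed message, and `disconnect` when it closes.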

Step-by-step implementation outline

1) Initialize repository and install dependencies

  • npm init -y
  • npm install express ws uuid (add socket.io only if you choose it over ws for the relay)
  • For client: install TypeScript, React, and a small UI library if desired

2) Implement signaling server

  • Start with an Express app
  • Attach a WebSocket server
  • Maintain a map of clientId to ws connection
  • Handlers:
    • on connect: assign clientId, acknowledge
    • on message: route to target or broadcast to room participants
    • on disconnect: clean up

3) Implement relay server logic (if separate)

  • Accept WebSocket connections from clients
  • Maintain routing tables
  • Forward chat messages when peer is not directly connected
  • Ensure delivery receipts propagate back to sender
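
One way to make delivery receipts flow back through the relay is to remember who sent each forwarded message. The sketch below is illustrative only: `ChatRelay`, `RelaySocket`, `ChatEnvelope`, and the `delivered` receipt type are assumed names, and the tracking table lives in memory, so it is lost on restart.

```typescript
interface RelaySocket {
  send(data: string): void;
}

interface ChatEnvelope {
  id: string;
  from: string;
  to: string;
  text: string;
  timestamp: number;
}

class ChatRelay {
  private sockets = new Map<string, RelaySocket>();
  // message id -> who sent it and who is expected to confirm it
  private awaiting = new Map<string, { from: string; to: string }>();

  attach(clientId: string, socket: RelaySocket): void {
    this.sockets.set(clientId, socket);
  }

  detach(clientId: string): void {
    this.sockets.delete(clientId);
  }

  // Forward a chat message. Returns false if the recipient is not connected,
  // in which case the sender should keep the message queued for retry.
  forward(msg: ChatEnvelope): boolean {
    const target = this.sockets.get(msg.to);
    if (!target) return false;
    this.awaiting.set(msg.id, { from: msg.from, to: msg.to });
    target.send(JSON.stringify({ type: "message", payload: msg }));
    return true;
  }

  // The recipient confirmed a message: pass the receipt back to the original sender.
  // Receipts from anyone other than the intended recipient are ignored.
  receipt(messageId: string, confirmedBy: string): boolean {
    const entry = this.awaiting.get(messageId);
    if (!entry || entry.to !== confirmedBy) return false;
    this.awaiting.delete(messageId);
    const sender = this.sockets.get(entry.from);
    if (!sender) return false;
    sender.send(JSON.stringify({ type: "delivered", id: messageId, by: confirmedBy }));
    return true;
  }
}
```

If `forward` returns false, nothing is tracked, so a later retry of the same message id starts cleanly.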

4) Implement client WebRTC/signaling integration

  • Connect to signaling server
  • Handle 'offer', 'answer', 'ice' events
  • Create RTCPeerConnection, data channels
  • Send chat messages over data channel when connected
  • If data channel not open within a timeout, switch to relay mode and send messages via signaling/relay

5) Implement message store and idempotent processing

  • Use a Map keyed by messageId for in-memory demo
  • On receive, check if id exists; if not, append to chat history and emit delivery confirmation
  • On failure, enqueue for retry with exponential backoff
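
A Map-backed store with duplicate detection might look like the sketch below. The `ChatMessage` fields follow the message shape used in this tutorial, while `MessageStore`, `receive`, and `chatHistory` are hypothetical names. One deliberate difference from the bullet above: this sketch sends the confirmation for duplicates too, on the assumption that a retried message usually means the earlier confirmation was lost.

```typescript
interface ChatMessage {
  id: string;
  from: string;
  to: string;
  text: string;
  timestamp: number;
  delivered: boolean;
  acknowledged: boolean;
}

class MessageStore {
  // A Map iterates in insertion order, so it doubles as the chat history.
  private byId = new Map<string, ChatMessage>();

  // Store the message if its id is new and confirm receipt either way.
  // Returns true only for a new message, so the UI appends each message once.
  receive(msg: ChatMessage, confirm: (id: string) => void): boolean {
    const isNew = !this.byId.has(msg.id);
    if (isNew) this.byId.set(msg.id, { ...msg, delivered: true });
    confirm(msg.id);
    return isNew;
  }

  chatHistory(): ChatMessage[] {
    return Array.from(this.byId.values());
  }
}
```

The same `receive` call works whether the message arrived over the data channel or through the relay, which keeps deduplication independent of the transport.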

6) Instrumentation and tests

  • Add simple metrics endpoints or in-process metrics collector
  • Write unit tests for message idempotency and retry logic
  • Create Playwright tests simulating two browsers sending messages
  • Add a test that drops ICE candidates to force the relay flow

Illustrative example: message object

  • { id: "msg-abc123", from: "alice", to: "bob", text: "Hello, Bob!", timestamp: 1697048400000, delivered: false, acknowledged: false }

Illustration: decision tree for connection path

  • If WebRTC data channel established and open -> use data channel for low-latency messaging
  • If the WebRTC data channel is not open within 3-5 seconds (for example because a firewall blocks it) -> fall back to relay via WebSocket
  • If relay path exists and remains healthy -> keep using relay until WebRTC can reconnect
  • If both paths fail -> surface an error to the user and retry after a backoff
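
The decision tree can be expressed as a pure function, which makes it easy to test every branch. This is a sketch under assumptions: `pickPath`, `PathState`, and the extra `wait` outcome (the connection attempt is still inside its timeout) are not part of the article, and the 4000 ms default is simply a value inside the 3-5 second range mentioned above.

```typescript
type ChatPath = "datachannel" | "relay" | "wait" | "error";

interface PathState {
  dataChannelOpen: boolean;
  relayHealthy: boolean;
  fellBack: boolean; // true once the client has already switched to the relay
  msSinceConnectStart: number;
}

function pickPath(s: PathState, webrtcTimeoutMs = 4000): ChatPath {
  // An open data channel always wins, including after a WebRTC reconnect.
  if (s.dataChannelOpen) return "datachannel";
  const timedOut = s.msSinceConnectStart >= webrtcTimeoutMs;
  // Use the relay after the timeout, and keep using it while it stays healthy.
  if (s.relayHealthy && (s.fellBack || timedOut)) return "relay";
  // Still inside the WebRTC timeout on a first attempt: hold messages briefly.
  if (!s.fellBack && !timedOut) return "wait";
  // Neither path works: surface an error and retry after a backoff.
  return "error";
}
```

Calling `pickPath` whenever the data channel state or relay health changes keeps the choice of transport in one place instead of scattered across event handlers.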

Follow-up options

  • Would you like a runnable code sample repository with a minimal, production-like structure (Dockerized, with Redis optional) to try this end-to-end?
  • Do you prefer a pure WebRTC-based approach with TURN servers, or a relay-first approach prioritizing simplicity and reliability in constrained networks?

Rizwan Saleem | https://rizwansaleem.co