WebSockets and "stateless" feel like a contradiction. A WebSocket is a persistent, stateful connection — the server holds a socket open per client, and your broadcasting logic depends on knowing exactly which process owns which connection.
That's the assumption this article dismantles.
Using AWS API Gateway WebSocket API, DynamoDB, and Lambda, you can build real-time systems — chat, live dashboards, multiplayer games — where no server holds any connection state. Every handler is a stateless function. Any function can message any client. You can scale to zero and back without losing routing.
Let's build it.
The Problem With Traditional WebSocket Servers
A classic Node.js WebSocket server looks roughly like this:
// Traditional: stateful, process-bound
const clients = new Map(); // connectionId → socket
wss.on('connection', (socket) => {
  const id = generateId();
  clients.set(id, socket);
  socket.on('message', (data) => {
    // Broadcast to everyone: works ONLY if all sockets
    // are in THIS process
    clients.forEach((client) => {
      if (client.readyState === WebSocket.OPEN) {
        client.send(data);
      }
    });
  });
  socket.on('close', () => clients.delete(id));
});
This clients map is the problem. It lives in one process's memory. The moment you run two instances, a client connected to pod A can't be reached by code running on pod B. The solutions people reach for — sticky sessions, a Redis pub/sub layer, a shared socket server — all add operational complexity and reintroduce the statefulness you're trying to escape.
API Gateway's WebSocket support solves this at the infrastructure level.
How API Gateway WebSocket APIs Work
When a client connects to an API Gateway WebSocket endpoint, API Gateway:
- Assigns a unique connectionId, a string like abc123== that identifies this socket globally
- Invokes your Lambda on three route keys: $connect, $disconnect, and $default (or custom routes you define)
- Holds the socket open itself: your Lambda runs and exits; the connection persists in API Gateway's infrastructure
- Exposes a management endpoint you can POST to at any time to push a message to any connectionId
POST https://{api-id}.execute-api.{region}.amazonaws.com/{stage}/@connections/{connectionId}
Your Lambda doesn't hold any socket. It receives an event, does work, posts to the management endpoint, and exits. The connection state lives in API Gateway — external to your code.
This means your connectionId is the unit of state you need to manage. Store it somewhere queryable, and you can route messages from anywhere, at any time, without knowing which server (if any) is "holding" that connection.
That store is DynamoDB.
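To make the endpoint format above concrete, here is a minimal sketch that builds the management URL for one connection. The function names and sample values are invented for this example. The connectionId is percent-encoded because IDs can end in = characters. Requests to this endpoint must be signed with AWS credentials, which is why the handlers in this article go through boto3 rather than a plain HTTP client.

```python
from urllib.parse import quote


def management_base_url(api_id: str, region: str, stage: str) -> str:
    """Base URL of the management endpoint for one API stage."""
    return f"https://{api_id}.execute-api.{region}.amazonaws.com/{stage}"


def connection_url(base_url: str, connection_id: str) -> str:
    """URL to POST to in order to push a message to one connection."""
    # safe='' makes quote() encode every reserved character, including '=' and '/'
    return f"{base_url}/@connections/{quote(connection_id, safe='')}"


if __name__ == "__main__":
    base = management_base_url("abc123", "us-east-1", "prod")
    print(connection_url(base, "abc123=="))
```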
The Architecture
Client A ──────┐
Client B ──────┤──► API Gateway WebSocket ──► Lambda ($connect)    ──► DynamoDB (write connectionId)
Client C ──────┘                          ──► Lambda ($disconnect) ──► DynamoDB (delete connectionId)
                                          ──► Lambda ($default)    ──► read DynamoDB → POST to management API
Three Lambda functions. One DynamoDB table. No persistent servers.
Step 1 — The DynamoDB Table
Design the table around your query patterns. For a chat app with rooms:
Table: ws-connections
  PK (partition key): connectionId (string), e.g. "abc123=="
  SK (sort key): roomId (string), e.g. "room#general"
GSI: roomId-index
  PK: roomId
  Lets you query "all connections in room X"
Additional attributes:
  userId       string
  connectedAt  number (Unix timestamp)
  ttl          number (Unix timestamp; auto-expires stale connections)
import time

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('ws-connections')

def write_connection(connection_id: str, room_id: str, user_id: str, ttl: int):
    # One item per (connection, room) pair
    table.put_item(Item={
        'connectionId': connection_id,
        'roomId': f'room#{room_id}',
        'userId': user_id,
        'connectedAt': int(time.time()),
        'ttl': ttl  # DynamoDB TTL: auto-delete after the session expires
    })

def delete_connection(connection_id: str, room_id: str):
    # The full primary key (partition key + sort key) is required to delete
    table.delete_item(Key={
        'connectionId': connection_id,
        'roomId': f'room#{room_id}'
    })

def get_room_connections(room_id: str) -> list:
    # Uses the GSI so the lookup is by room instead of by connection
    response = table.query(
        IndexName='roomId-index',
        KeyConditionExpression='roomId = :rid',
        ExpressionAttributeValues={':rid': f'room#{room_id}'}
    )
    return [item['connectionId'] for item in response['Items']]
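The TTL field is important. Clients do not always trigger $disconnect when they drop (browser tab crash, network loss), so some records are never deleted by the handler. TTL lets DynamoDB expire those stale records instead of letting them accumulate and cause failed deliveries. Two caveats: TTL deletion is a background process that can lag the expiry time, typically by up to a few days, so broadcast code must still tolerate dead connections; and API Gateway itself closes a WebSocket connection after at most 2 hours (or 10 minutes idle), so a 24-hour TTL is a generous upper bound.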
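Because expired items can linger until DynamoDB's background TTL process removes them, one option is to filter them out on read. The sketch below is an illustration under assumptions: the helper names and the 24-hour session length are choices made for this example, not something the table design requires. Items without a ttl attribute are treated as live, matching DynamoDB's behavior of never expiring them.

```python
import time
from typing import Optional

SESSION_SECONDS = 24 * 60 * 60  # assumed session length for this sketch


def expiry_timestamp(now: Optional[int] = None, lifetime: int = SESSION_SECONDS) -> int:
    """Unix timestamp to store in the ttl attribute."""
    current = int(time.time()) if now is None else now
    return current + lifetime


def live_items(items: list, now: Optional[int] = None) -> list:
    """Keep only records whose ttl has not passed yet (or that have no ttl)."""
    current = int(time.time()) if now is None else now
    kept = []
    for item in items:
        ttl = item.get("ttl")
        # boto3 returns numbers as Decimal; int() handles both Decimal and int
        if ttl is None or int(ttl) > current:
            kept.append(item)
    return kept
```

Calling live_items on the result of a room query before broadcasting skips records that are already past their expiry but have not been physically deleted yet.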
Step 2 — The $connect Handler
Invoked when a client opens a WebSocket connection. Writes the connectionId to DynamoDB.
# handlers/connect.py
import json, os, time, boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ['CONNECTIONS_TABLE'])

def handler(event, context):
    connection_id = event['requestContext']['connectionId']
    query_params = event.get('queryStringParameters') or {}
    room_id = query_params.get('room', 'general')
    user_id = query_params.get('userId', 'anonymous')
    # TTL: 24 hours from now
    ttl = int(time.time()) + 86400
    table.put_item(Item={
        'connectionId': connection_id,
        'roomId': f'room#{room_id}',
        'userId': user_id,
        'connectedAt': int(time.time()),
        'ttl': ttl
    })
    return {'statusCode': 200}
The client connects like this:
const ws = new WebSocket(
  'wss://abc123.execute-api.us-east-1.amazonaws.com/prod' +
  '?room=general&userId=user_42'
);
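Query string parameters arrive in the event's queryStringParameters field (event['queryStringParameters'] in the Python handler), a clean way to pass initial context without a separate HTTP call. Treat them as unauthenticated input: any client can claim any userId unless you attach an authorizer to the $connect route.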
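To show the shape of the data the $connect handler reads, here is a small sketch that parses a trimmed-down event into a typed value. The ConnectRequest class, the parse_connect_event function, and the sample event are illustrative; real events carry many more fields than shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectRequest:
    connection_id: str
    room_id: str
    user_id: str


def parse_connect_event(event: dict) -> ConnectRequest:
    """Extract the connection id and initial context from a $connect event."""
    # queryStringParameters can be missing or None when the URL has no query string
    params = event.get("queryStringParameters") or {}
    return ConnectRequest(
        connection_id=event["requestContext"]["connectionId"],
        room_id=params.get("room", "general"),
        user_id=params.get("userId", "anonymous"),
    )


# Trimmed, hypothetical event for illustration
SAMPLE_EVENT = {
    "requestContext": {"routeKey": "$connect", "connectionId": "abc123==", "stage": "prod"},
    "queryStringParameters": {"room": "general", "userId": "user_42"},
}

if __name__ == "__main__":
    print(parse_connect_event(SAMPLE_EVENT))
```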
Step 3 — The $disconnect Handler
Invoked when the connection closes (cleanly). Removes the record from DynamoDB.
# handlers/disconnect.py
import os, boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ['CONNECTIONS_TABLE'])

def handler(event, context):
    connection_id = event['requestContext']['connectionId']
    # Query all rooms this connectionId appears in (a user might be in multiple)
    response = table.query(
        KeyConditionExpression=Key('connectionId').eq(connection_id)
    )
    with table.batch_writer() as batch:
        for item in response['Items']:
            batch.delete_item(Key={
                'connectionId': item['connectionId'],
                'roomId': item['roomId']
            })
    return {'statusCode': 200}
Step 4 — The Message Handler (Broadcasting)
This is where the stateless magic becomes concrete. To broadcast a message to a room, this Lambda:
- Reads all connectionIds for the room from DynamoDB
- POSTs the message to each one via the API Gateway Management API
- Handles stale connections gracefully (GoneException)
# handlers/message.py
import json, os, boto3
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ['CONNECTIONS_TABLE'])

def get_management_client(event):
    domain = event['requestContext']['domainName']
    stage = event['requestContext']['stage']
    return boto3.client(
        'apigatewaymanagementapi',
        endpoint_url=f'https://{domain}/{stage}'
    )

def handler(event, context):
    connection_id = event['requestContext']['connectionId']
    body = json.loads(event.get('body') or '{}')
    room_id = body.get('room', 'general')
    message = body.get('message', '')
    user_id = body.get('userId', 'anonymous')
    # Fetch all connections in the room
    response = table.query(
        IndexName='roomId-index',
        KeyConditionExpression=Key('roomId').eq(f'room#{room_id}')
    )
    connections = response['Items']
    apigw = get_management_client(event)
    stale = []
    payload = json.dumps({
        'type': 'message',
        'room': room_id,
        'userId': user_id,
        'message': message
    })
    for conn in connections:
        cid = conn['connectionId']
        try:
            apigw.post_to_connection(
                ConnectionId=cid,
                Data=payload.encode('utf-8')
            )
        except ClientError as e:
            if e.response['Error']['Code'] == 'GoneException':
                # Connection no longer exists: queue for cleanup
                stale.append(cid)
            else:
                raise
    # Clean up stale connections that missed $disconnect
    if stale:
        with table.batch_writer() as batch:
            for cid in stale:
                batch.delete_item(Key={
                    'connectionId': cid,
                    'roomId': f'room#{room_id}'
                })
    return {'statusCode': 200}
GoneException is the key error to handle. It means API Gateway has no record of that connectionId — the client disconnected without your $disconnect handler firing. When you catch it, delete the stale record so it doesn't keep appearing in future broadcasts.
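The send-and-collect-stale logic can also be pulled out of the handler so it is testable without AWS. The sketch below is one way to do that, not the only one: broadcast, is_gone, and FakeGone are names invented here. The send argument stands for whatever wraps post_to_connection, and is_gone only relies on the error carrying a response dictionary shaped like the one botocore's ClientError exposes.

```python
from typing import Callable, Iterable, List


def is_gone(exc: Exception) -> bool:
    """True if exc carries a botocore-style error response with code GoneException."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") == "GoneException"


def broadcast(connection_ids: Iterable[str], send: Callable[[str], None]) -> List[str]:
    """Send to every connection; return the ids that turned out to be gone."""
    stale = []
    for cid in connection_ids:
        try:
            send(cid)
        except Exception as exc:
            if not is_gone(exc):
                raise  # anything other than a gone connection is a real failure
            stale.append(cid)
    return stale


class FakeGone(Exception):
    """Test double that mimics the response shape of a GoneException error."""
    response = {"Error": {"Code": "GoneException"}}


def demo() -> tuple:
    delivered = []

    def send(cid: str) -> None:
        if cid == "dead==":
            raise FakeGone()
        delivered.append(cid)

    stale = broadcast(["a==", "dead==", "b=="], send)
    return delivered, stale
```

The handler would then delete the returned ids from the table, exactly as the cleanup block above does.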
Step 5 — Infrastructure with SAM / CloudFormation
# template.yaml
# Excerpt: the ConnectIntegration, DisconnectIntegration, and MessageIntegration
# resources referenced by the routes, plus the deployment, stage, and Lambda
# invoke permissions, are not shown here.
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Globals:
  Function:
    Runtime: python3.12
    Timeout: 10
    Environment:
      Variables:
        CONNECTIONS_TABLE: !Ref ConnectionsTable
Resources:
  WebSocketApi:
    Type: AWS::ApiGatewayV2::Api
    Properties:
      Name: StatelessWsApi
      ProtocolType: WEBSOCKET
      RouteSelectionExpression: "$request.body.action"
  ConnectRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref WebSocketApi
      RouteKey: $connect
      Target: !Sub integrations/${ConnectIntegration}
  DisconnectRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref WebSocketApi
      RouteKey: $disconnect
      Target: !Sub integrations/${DisconnectIntegration}
  DefaultRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref WebSocketApi
      RouteKey: $default
      Target: !Sub integrations/${MessageIntegration}
  ConnectFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: handlers/connect.handler
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref ConnectionsTable
  DisconnectFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: handlers/disconnect.handler
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref ConnectionsTable
  MessageFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: handlers/message.handler
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref ConnectionsTable
        # Allows the function to POST to the @connections management endpoint
        - Statement:
            Effect: Allow
            Action: execute-api:ManageConnections
            Resource: !Sub arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${WebSocketApi}/*
  ConnectionsTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: ws-connections
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: connectionId
          AttributeType: S
        - AttributeName: roomId
          AttributeType: S
      KeySchema:
        - AttributeName: connectionId
          KeyType: HASH
        - AttributeName: roomId
          KeyType: RANGE
      GlobalSecondaryIndexes:
        - IndexName: roomId-index
          KeySchema:
            - AttributeName: roomId
              KeyType: HASH
          Projection:
            ProjectionType: ALL
      TimeToLiveSpecification:
        AttributeName: ttl
        Enabled: true
Advanced: Broadcasting From Outside the WebSocket Context
One of the most powerful patterns this architecture unlocks: any service can push to any connected client, not just Lambda functions triggered by WebSocket messages.
A background job finishing a report, a payment service confirming a charge, a CI pipeline completing a build — any of them can push a real-time update to the right client by querying DynamoDB for the connectionId and posting to the management endpoint.
# billing_service/notify_client.py
# This service has no WebSocket context; it just knows the userId
import boto3, json, os
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('ws-connections')

def notify_user(user_id: str, payload: dict):
    """
    Find all active connections for a userId and push a message.
    Works from any service; no WebSocket server required.
    """
    response = table.query(
        IndexName='userId-index',  # add this GSI to support user-based lookups
        KeyConditionExpression=Key('userId').eq(user_id)
    )
    apigw = boto3.client(
        'apigatewaymanagementapi',
        endpoint_url=os.environ['APIGW_MANAGEMENT_URL']
    )
    message = json.dumps(payload).encode('utf-8')
    for item in response['Items']:
        try:
            apigw.post_to_connection(
                ConnectionId=item['connectionId'],
                Data=message
            )
        except apigw.exceptions.GoneException:
            # Stale record: remove it so later lookups skip it
            table.delete_item(Key={
                'connectionId': item['connectionId'],
                'roomId': item['roomId']
            })
This is the pattern that makes stateless WebSockets genuinely powerful for real-time apps. Your billing service, order service, or notification service can push updates to the browser the moment something happens — no polling, no dedicated WebSocket server, no shared state.
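The lookup by userId above depends on a userId-index GSI that the table definition earlier in the article does not include. As a rough sketch, the function below builds the arguments for DynamoDB's UpdateTable operation that would add such an index to an on-demand table. The function name is invented for this example. If the table is managed by the SAM template, the equivalent index would normally be added under GlobalSecondaryIndexes there instead of through an API call.

```python
def user_index_update(table_name: str = "ws-connections") -> dict:
    """Build keyword arguments for UpdateTable that add a userId-index GSI."""
    return {
        "TableName": table_name,
        # Any attribute used as an index key must be declared here
        "AttributeDefinitions": [
            {"AttributeName": "userId", "AttributeType": "S"},
        ],
        "GlobalSecondaryIndexUpdates": [
            {
                "Create": {
                    "IndexName": "userId-index",
                    "KeySchema": [{"AttributeName": "userId", "KeyType": "HASH"}],
                    # ALL mirrors the roomId-index projection used in this article
                    "Projection": {"ProjectionType": "ALL"},
                }
            }
        ],
    }


# Applying it requires AWS credentials, so the call is left as a comment:
# boto3.client("dynamodb").update_table(**user_index_update())
```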
Client-Side: Handling Reconnection
Because the state is in DynamoDB and not in a server process, reconnection is clean. A new connectionId gets a new record; old records expire via TTL.
class ReconnectingWebSocket {
  constructor(url, options = {}) {
    this.url = url;
    this.options = options;
    this.attempts = 0;
    this.connect();
  }

  connect() {
    this.ws = new WebSocket(this.url);
    this.ws.onopen = () => {
      this.attempts = 0;
      console.log('Connected: new connectionId assigned by API Gateway');
      this.options.onOpen?.();
    };
    this.ws.onmessage = (event) => {
      this.options.onMessage?.(JSON.parse(event.data));
    };
    this.ws.onclose = () => {
      // Exponential backoff: 1s, 2s, 4s, ... capped at 30s
      const delay = Math.min(1000 * 2 ** this.attempts, 30000);
      this.attempts++;
      console.log(`Reconnecting in ${delay}ms (attempt ${this.attempts})`);
      setTimeout(() => this.connect(), delay);
    };
    this.ws.onerror = (err) => {
      console.error('WebSocket error', err);
      this.ws.close();
    };
  }

  send(data) {
    if (this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(JSON.stringify(data));
    }
  }
}

// Usage
const ws = new ReconnectingWebSocket(
  `wss://abc123.execute-api.us-east-1.amazonaws.com/prod?room=general&userId=${userId}`,
  {
    onOpen: () => console.log('Ready'),
    onMessage: (msg) => renderMessage(msg)
  }
);
Exponential backoff with a 30s cap is the standard pattern. Each reconnect gets a fresh connectionId, and the $connect handler writes the new record to DynamoDB automatically.
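To see what that backoff schedule looks like in numbers, here is the same formula as a short Python sketch (Python is used for consistency with the backend code). The backoff_delay function mirrors the client's min(1000 * 2 ** attempts, 30000). The jittered_delay variant is an addition that is not in the client above: it randomizes the wait so that many clients dropped at the same moment do not all reconnect in lockstep.

```python
import random

BASE_MS = 1000   # first retry delay, as in the client
CAP_MS = 30000   # upper bound on any delay, as in the client


def backoff_delay(attempt: int, base_ms: int = BASE_MS, cap_ms: int = CAP_MS) -> int:
    """Delay in milliseconds before retry number `attempt` (0-based)."""
    return min(base_ms * 2 ** attempt, cap_ms)


def jittered_delay(attempt: int, rng: random.Random) -> float:
    """Optional variant: a random delay between 0 and backoff_delay(attempt)."""
    return rng.uniform(0, backoff_delay(attempt))


if __name__ == "__main__":
    print([backoff_delay(n) for n in range(7)])
```

For attempts 0 through 6 the plain schedule is 1000, 2000, 4000, 8000, 16000, 30000, 30000 milliseconds: the sixth value would be 32000 without the cap.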
DynamoDB Access Patterns Summary
| Operation | Access pattern | Index used |
|---|---|---|
| Write a new connection | put_item | Primary table |
| Delete on disconnect | delete_item | Primary table |
| Get all connections in a room | query by roomId | roomId-index GSI |
| Get all connections for a user | query by userId | userId-index GSI |
| Expire stale connections | TTL attribute | Automatic |
What This Architecture Cannot Do
Be honest about the tradeoffs:
Latency budget per message. Each broadcast involves at least one DynamoDB read plus N API Gateway POST calls. For a room with 50 connections, that's 1 DynamoDB query and 50 management API calls, or 51 network round trips per message, and the handler shown earlier makes the POSTs one after another. With DynamoDB's single-digit millisecond latency this is fine for chat and dashboards, but for sub-10ms gaming tick rates you'll want a different approach (Redis Pub/Sub, or purpose-built services like Ably or Pusher).
Lambda cold starts. The first invocation after a period of inactivity has a cold start penalty. For sporadic use cases this is fine. For a live dashboard that must respond instantly at all hours, use provisioned concurrency on the message handler Lambda.
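Connection limits. API Gateway WebSocket APIs accept up to 500 new connections per second per account per region by default (a soft limit that can be raised). Because this is a rate limit, bursts are what matter: 10,000 users joining over 60 seconds averages only about 167 connections per second, but if most of them arrive in the first few seconds of a live event they will exceed 500 per second. For scenarios like that, plan for the limit and request an increase in advance.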
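On the latency point above, one common mitigation is to issue the N POSTs concurrently instead of sequentially. The sketch below shows the idea with a thread pool. It rests on assumptions: fan_out is a name invented here, send stands for a wrapper around post_to_connection that returns True on delivery and False when the connection is gone, and the worker count of 16 is arbitrary. The boto3 documentation describes low-level clients as generally thread-safe, but verify that for your setup before sharing one client across threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def fan_out(
    connection_ids: Sequence[str],
    send: Callable[[str], bool],
    max_workers: int = 16,
) -> List[str]:
    """Call send for every connection concurrently; return ids where send returned False."""
    if not connection_ids:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with connection_ids;
        # an exception raised inside send propagates when the results are consumed
        results = list(pool.map(send, connection_ids))
    return [cid for cid, ok in zip(connection_ids, results) if not ok]
```

With this shape, total broadcast time is bounded roughly by the slowest batch of calls rather than by the sum of all of them, at the cost of more concurrent requests against the management API.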
When To Use This Pattern
This architecture is a strong fit when:
- You need real-time push but don't want to operate WebSocket servers
- Your backend is already serverless or event-driven
- Connection counts are in the tens of thousands, not millions
- You want any downstream service to be able to push updates without coupling to a WebSocket layer
- You're building chat, live notifications, collaborative editing, or live dashboards
It's a poor fit for:
- High-frequency gaming (tick rates under 50ms)
- Millions of concurrent connections (consider Ably, Pusher, or self-hosted at that scale)
- Binary protocols (API Gateway WebSocket only supports text/JSON frames)
Recap
The stateless WebSocket pattern flips the traditional model:
| Traditional | API Gateway + DynamoDB |
|---|---|
| Server holds socket in memory | API Gateway holds the socket |
| Broadcast requires shared process | Broadcast via management API from anywhere |
| Scale-out needs sticky sessions | Scale-out is trivial — no shared state |
| Connection loss = state loss | Connection loss = TTL cleanup |
| Ops burden: WebSocket server fleet | Ops burden: near zero |
The connectionId is your external handle on what is otherwise an invisible infrastructure socket. DynamoDB is your routing table. Lambda is your logic. None of them need to know about each other — which is exactly what stateless means.