Building a Real-Time Collaborative Code Editor: System Design and Architecture Guide
Building a Real-Time Collaborative Code Editor: System Design and Architecture Guide
Building a real-time collaborative code editor like Google Docs for code is one of the most challenging system design problems. You need to handle concurrent edits, resolve conflicts automatically, maintain cursor positions for multiple users, and deliver sub-100ms latency-all while supporting syntax highlighting, IntelliSense, and thousands of lines of code. This guide walks you through the complete architecture from first principles.
Why This Is Hard
Traditional CRUD applications don't face these challenges:
| Challenge | Traditional App | Collaborative Editor |
|---|---|---|
| Concurrency | Row-level locking | Hundreds of simultaneous edits |
| Conflict Resolution | Last-write-wins | Operational Transformation or CRDTs |
| Latency Requirement | 1-2 seconds acceptable | <100ms for perceived real-time |
| State Sync | Pull on refresh | Continuous push + bidirectional sync |
| Cursor Tracking | Not needed | Multi-user cursor positions + selections |
The core problem: when User A deletes line 10 and User B inserts text at line 10 simultaneously, how do you ensure both users end up with the same document without losing changes?
Architecture Overview
┌─────────────┐ WebSocket ┌──────────────┐ Redis Pub/Sub ┌─────────────┐
│ Client A │◄──────────────────►│ Gateway │◄─────────────────────►│ Client B │
│ (React) │ │ (Node.js) │ │ (React) │
└─────────────┘ └──────────────┘ └─────────────┘
│
▼
┌──────────────┐
│ CRDT Engine │
│ (Yjs/Automerge)
└──────────────┘
│
▼
┌──────────────┐
│ PostgreSQL │
│ (Persistence)
└──────────────┘
Key Components
- Client: React application with Monaco Editor (VS Code's editor)
- Gateway: WebSocket server handling connections and message routing
- CRDT Engine: Conflict-free Replicated Data Type implementation for conflict-free merges
- Persistence: PostgreSQL for storing document snapshots
Step 1: Choosing Your Conflict Resolution Strategy
You have two main options:
Operational Transformation (OT)
- Used by Google Docs originally
- Transform operations against each other before applying
- Complex to implement correctly
- Requires central server to order operations
CRDTs (Conflict-free Replicated Data Types)
- Mathematically guaranteed to converge
- Can work peer-to-peer or with central server
- Slightly higher memory overhead
- Recommended for new projects
We'll use Yjs, the most production-ready CRDT library:
// Package.json dependencies
{
"dependencies": {
"yjs": "^13.6.14",
"y-websocket": "^1.5.0",
"monaco-editor": "^0.45.0"
}
}
Step 2: Client-Side Implementation with Monaco Editor
Setting Up the Editor
// src/editor/CollaborativeEditor.tsx
import React, { useEffect, useRef } from 'react';
import * as monaco from 'monaco-editor';
import Y from 'yjs';
import { MonacoBinding } from 'y-monaco';
interface CollaborativeEditorProps {
documentId: string;
userId: string;
userName: string;
}
export const CollaborativeEditor: React.FC<CollaborativeEditorProps> = ({
documentId,
userId,
userName,
}) => {
const editorRef useRef<monaco.editor.IStandaloneCodeEditor>(null);
const ydocRef = useRef<Y.Doc>();
const bindingRef = useRef<MonacoBinding>();
useEffect(() => {
// Initialize Yjs document
const ydoc = new Y.Doc();
ydocRef.current = ydoc;
// Connect to WebSocket provider
const provider = new WebsocketProvider(
'wss://your-server.com/ws',
documentId,
ydoc
);
// Get the shared text type
const ytext = ydoc.getText('code');
// Initialize Monaco editor
const editor = monaco.editor.create(document.getElementById('container')!, {
value: '',
language: 'typescript',
theme: 'vs-dark',
minimap: { enabled: false },
});
editorRef.current = editor;
// Bind Yjs to Monaco (syncs automatically)
const binding = new MonacoBinding(
ytext,
editor.getModel()!,
new Set([editor]),
provider.awareness
);
bindingRef.current = binding;
// Set user awareness (for cursors)
provider.awareness.setLocalStateField('user', {
name: userName,
color: getRandomColor(userId),
});
// Cleanup
return () => {
binding.destroy();
editor.dispose();
provider.destroy();
};
}, [documentId, userId, userName]);
return <div id="container" style={{ height: '100vh' }} />;
};
function getRandomColor(userId: string): string {
const colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7'];
const hash = userId.split('').reduce((a, b) => a + b.charCodeAt(0), 0);
return colors[hash % colors.length];
}
Handling Cursor Positions
// src/editor/CursorOverlay.tsx
import React, { useEffect, useState } from 'react';
import { WebsocketProvider } from 'y-websocket';
interface Cursor {
userId: string;
userName: string;
color: string;
position: { line: number; column: number };
selection?: { start: number; end: number };
}
export const CursorOverlay: React.FC<{ provider: WebsocketProvider }> = ({ provider }) => {
const [cursors, setCursors] = useState<Cursor[]>([]);
useEffect(() => {
const updateCursors = () => {
const states = provider.awareness.getStates();
const newCursors: Cursor[] = [];
states.forEach((state: any, userId: string) => {
if (userId === provider.clientID) return; // Skip self
if (state.cursor) {
newCursors.push({
userId,
userName: state.user?.name || 'Anonymous',
color: state.user?.color || '#FF6B6B',
position: state.cursor,
selection: state.selection,
});
}
});
setCursors(newCursors);
};
provider.awareness.on('change', updateCursors);
updateCursors();
return () => {
provider.awareness.off('change', updateCursors);
};
}, [provider]);
return (
<div style={{ position: 'absolute', top: 0, left: 0, pointerEvents: 'none' }}>
{cursors.map((cursor) => (
<CursorLabel
key={cursor.userId}
cursor={cursor}
/>
))}
</div>
);
};
const CursorLabel: React.FC<{ cursor: Cursor }> = ({ cursor }) => (
<div
style={{
position: 'absolute',
left: `${cursor.position.column * 8}px`, // Approximate monospace width
top: `${cursor.position.line * 20}px`, // Approximate line height
backgroundColor: cursor.color,
color: 'white',
padding: '2px 6px',
borderRadius: '3px',
fontSize: '12px',
whiteSpace: 'nowrap',
transform: 'translate(-2px, -28px)',
}}
>
{cursor.userName}
</div>
);
Step 3: WebSocket Server Implementation
Node.js Gateway with Yjs
// server/index.ts
import express from 'express';
import { createServer } from 'http';
import { WebSocketServer } from 'ws';
import Y from 'yjs';
import { YoctoStore } from 'y-leveldb';
import { readBuiltInDefaultState, updateState } from 'yjs';
const app = express();
const server = createServer(app);
const wss = new WebSocketServer({ server, path: '/ws' });
// In-memory store for active documents (use Redis in production)
const documents = new Map<string, Y.Doc>();
const clients = new Map<string, Set<WebSocket>>();
wss.on('connection', (ws, req) => {
const documentId = new URL(req.url!, 'http://localhost').searchParams.get('doc');
if (!documentId) {
ws.close(4000, 'Document ID required');
return;
}
// Get or create document
let ydoc = documents.get(documentId);
if (!ydoc) {
ydoc = new Y.Doc();
documents.set(documentId, ydoc);
clients.set(documentId, new Set());
// Load from persistence if exists
loadDocument(documentId).then((saved) => {
if (saved) {
Y.applyUpdate(ydoc, saved);
broadcastUpdate(documentId, saved);
}
});
}
clients.get(documentId)!.add(ws);
// Handle incoming updates
ws.on('message', (data) => {
const message = Buffer.from(data);
// Yjs protocol: first byte is message type
const messageType = message;
if (messageType === 0) {
// Sync step 1: Request state
const syncStep1 = Y.encodeSyncStep1(Y.getStateVector(ydoc));
ws.send(Buffer.concat([Buffer.from(), syncStep1]));
} else if (messageType === 1) {
// Sync step 2: Full state
const syncStep2 = Y.encodeSyncStep2(Y.encodeStateAsUpdate(ydoc));
ws.send(Buffer.concat([Buffer.from(), syncStep2]));
} else if (messageType === 2) {
// Update received from client
const update = message.slice(1);
Y.applyUpdate(ydoc, update);
// Persist asynchronously
persistDocument(documentId, Y.encodeStateAsUpdate(ydoc));
// Broadcast to other clients
broadcastUpdate(documentId, update, ws);
}
});
ws.on('close', () => {
clients.get(documentId)?.delete(ws);
// Clean up unused documents after delay
if (clients.get(documentId)?.size === 0) {
setTimeout(() => {
if (clients.get(documentId)?.size === 0) {
documents.delete(documentId);
clients.delete(documentId);
}
}, 5 * 60 * 1000); // 5 minutes
}
});
});
function broadcastUpdate(documentId: string, update: Buffer, exclude?: WebSocket) {
const updateMessage = Buffer.concat([Buffer.from(), update]);
clients.get(documentId)?.forEach((client) => {
if (client !== exclude && client.readyState === WebSocket.OPEN) {
client.send(updateMessage);
}
});
}
async function loadDocument(documentId: string): Promise<Buffer | null> {
// Query PostgreSQL for latest snapshot
const result = await db.query(
'SELECT snapshot FROM documents WHERE id = $1 ORDER BY created_at DESC LIMIT 1',
[documentId]
);
return result.rows?.snapshot || null;
}
async function persistDocument(documentId: string, update: Buffer) {
// Debounced persistence in production
await db.query(
'INSERT INTO documents (id, snapshot, updated_at) VALUES ($1, $2, NOW()) ' +
'ON CONFLICT (id) DO UPDATE SET snapshot = $2, updated_at = NOW()',
[documentId, update]
);
}
server.listen(3000, () => {
console.log('Collaborative editor server running on port 3000');
});
Step 4: Database Schema for Persistence
-- PostgreSQL schema
CREATE TABLE documents (
id VARCHAR(255) PRIMARY KEY,
snapshot BYTEA NOT NULL, -- Yjs binary format
language VARCHAR(50) DEFAULT 'typescript',
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
CREATE TABLE document_access (
id SERIAL PRIMARY KEY,
document_id VARCHAR(255) REFERENCES documents(id) ON DELETE CASCADE,
user_id VARCHAR(255) NOT NULL,
permission VARCHAR(20) CHECK (permission IN ('read', 'write', 'admin')),
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
UNIQUE(document_id, user_id)
);
CREATE INDEX idx_documents_updated ON documents(updated_at DESC);
CREATE INDEX idx_access_user ON document_access(user_id);
-- For monitoring active sessions
CREATE TABLE active_sessions (
session_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
document_id VARCHAR(255) REFERENCES documents(id) ON DELETE CASCADE,
user_id VARCHAR(255) NOT NULL,
user_name VARCHAR(255),
connected_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
last_activity TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
CREATE INDEX idx_sessions_document ON active_sessions(document_id);
Step 5: Adding Rate Limiting and Authentication
// server/middleware/auth.ts
import { JWTPayload } from 'jws';
import { WebSocket } from 'ws';
export interface AuthenticatedWS extends WebSocket {
userId: string;
username: string;
}
export function authenticateWebSocket(
ws: WebSocket,
req: any
): AuthenticatedWS | null {
const token = req.url?.split('token=');
if (!token) {
ws.close(4001, 'Authentication required');
return null;
}
try {
const payload = verifyJWT(token) as JWTPayload & {
userId: string;
username: string;
};
const authenticatedWs = ws as AuthenticatedWS;
authenticatedWs.userId = payload.userId;
authenticatedWs.username = payload.username;
return authenticatedWs;
} catch (error) {
ws.close(4003, 'Invalid token');
return null;
}
}
// Rate limiting per document
const rateLimits = new Map<string, { count: number; resetAt: number }>();
export function rateLimit(documentId: string, maxUpdates: number = 100): boolean {
const now = Date.now();
const limit = rateLimits.get(documentId) || { count: 0, resetAt: now + 1000 };
if (now > limit.resetAt) {
limit.count = 0;
limit.resetAt = now + 1000;
}
if (limit.count >= maxUpdates) {
return false;
}
limit.count++;
rateLimits.set(documentId, limit);
return true;
}
Step 6: Production Deployment Architecture
Scaling Horizontally
┌─────────────────────────────────────────────────────────────┐
│ Load Balancer │
│ (NGINX/ALB + WebSocket) │
└─────────────────────────┬───────────────────────────────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Gateway 1│ │ Gateway 2│ │ Gateway 3│
│ (Node) │ │ (Node) │ │ (Node) │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
└───────────────┼───────────────┘
▼
┌──────────────────┐
│ Redis Cluster │
│ (Pub/Sub + │
│ State Sync) │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ PostgreSQL │
│ (Primary + │
│ Read Replicas) │
└──────────────────┘
Redis Pub/Sub for Cross-Server Sync
// server/redis-adapter.ts
import Redis from 'ioredis';
import Y from 'yjs';
const redis = new Redis.Cluster([
{ host: 'redis-1.cluster.local', port: 6379 },
{ host: 'redis-2.cluster.local', port: 6379 },
{ host: 'redis-3.cluster.local', port: 6379 },
]);
const pubSubChannel = 'document-updates';
export function subscribeToDocumentUpdates(serverId: string) {
const subscriber = new Redis.Cluster([
{ host: 'redis-1.cluster.local', port: 6379 },
]);
subscriber.subscribe(pubSubChannel);
subscriber.on('message', (channel, message) => {
const { documentId, update, fromServer } = JSON.parse(message);
// Ignore updates from self
if (fromServer === serverId) return;
// Apply update to local document
const ydoc = documents.get(documentId);
if (ydoc) {
Y.applyUpdate(ydoc, Buffer.from(update, 'base64'));
// Broadcast to local clients
broadcastUpdate(documentId, Buffer.from(update, 'base64'));
}
});
}
export async function broadcastAcrossServers(
documentId: string,
update: Buffer
) {
const message = JSON.stringify({
documentId,
update: update.toString('base64'),
fromServer: process.env.SERVER_ID,
});
await redis.publish(pubSubChannel, message);
}
Step 7: Monitoring and Observability
// server/metrics.ts
import promClient from 'prom-client';
const register = new promClient.Registry();
const connectionGauge = new promClient.Gauge({
name: 'ws_connections_total',
help: 'Total active WebSocket connections',
labelNames: ['document_id'],
});
const updateCounter = new promClient.Counter({
name: 'document_updates_total',
help: 'Total document updates processed',
labelNames: ['document_id'],
});
const latencyHistogram = new promClient.Histogram({
name: 'update_latency_ms',
help: 'Time to process and broadcast update',
buckets: ,
});
register.registerMetric(connectionGauge);
register.registerMetric(updateCounter);
register.registerMetric(latencyHistogram);
// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
// Usage in update handler
const startTime = Date.now();
// ... process update ...
const latency = Date.now() - startTime;
latencyHistogram.observe(latency);
updateCounter.inc({ documentId });
Performance Optimization Techniques
1. Debounced Persistence
// Don't write to database on every update
const persistenceQueue = new Map<string, { update: Buffer; timeout: NodeJS.Timeout }>();
function queuePersistence(documentId: string, update: Buffer) {
const existing = persistenceQueue.get(documentId);
if (existing) {
clearTimeout(existing.timeout);
}
const timeout = setTimeout(async () => {
await persistDocument(documentId, update);
persistenceQueue.delete(documentId);
}, 1000); // 1 second debounce
persistenceQueue.set(documentId, { update, timeout });
}
2. Snapshotting for Large Documents
async function createSnapshot(documentId: string) {
const ydoc = documents.get(documentId);
if (!ydoc) return;
const snapshot = Y.encodeStateAsUpdate(ydoc);
// Store incremental snapshot
await db.query(
'INSERT INTO snapshots (document_id, snapshot, created_at) VALUES ($1, $2, NOW())',
[documentId, snapshot]
);
// Keep only last 10 snapshots
await db.query(
`DELETE FROM snapshots
WHERE document_id = $1
AND created_at < (
SELECT created_at FROM snapshots
WHERE document_id = $1
ORDER BY created_at DESC
LIMIT 10 OFFSET 1
)`,
[documentId]
);
}
// Run every 5 minutes per document
setInterval(() => {
documents.forEach((_, documentId) => {
createSnapshot(documentId);
});
}, 5 * 60 * 1000);
3. Connection Pooling for PostgreSQL
import { Pool } from 'pg';
const db = new Pool({
host: process.env.DB_HOST,
port: parseInt(process.env.DB_PORT || '5432'),
database: process.env.DB_NAME,
user: process.env.DB_USER,
password: process.env.DB_PASSWORD,
max: 20, // Maximum clients in pool
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 2000,
});
// Monitor pool health
db.on('connect', (client) => {
console.log(`Client connected, pool size: ${db.totalCount}`);
});
db.on('remove', (client) => {
console.log(`Client removed, pool size: ${db.totalCount}`);
});
Testing Your Implementation
Load Testing Script
// tests/load-test.ts
import WebSocket from 'ws';
interface LoadTestConfig {
concurrentUsers: number;
durationMs: number;
documentId: string;
serverUrl: string;
}
async function runLoadTest(config: LoadTestConfig) {
const clients: WebSocket[] = [];
const updatesSent = { total: 0, failures: 0 };
const latencies: number[] = [];
console.log(`Starting load test: ${config.concurrentUsers} users for ${config.durationMs}ms`);
// Connect all clients
await Promise.all(
Array.from({ length: config.concurrentUsers }).map(async (_, i) => {
const ws = new WebSocket(
`${config.serverUrl}?doc=${config.documentId}&token=${generateTestToken(i)}`
);
await new Promise((resolve) => {
ws.on('open', resolve);
});
clients.push(ws);
// Send updates every 100ms
const interval = setInterval(() => {
const update = generateRandomUpdate();
const startTime = Date.now();
ws.send(Buffer.concat([Buffer.from(), Buffer.from(update)]));
ws.once('message', () => {
latencies.push(Date.now() - startTime);
updatesSent.total++;
});
}, 100);
ws.testInterval = interval;
})
);
// Run for duration
await new Promise((resolve) => setTimeout(resolve, config.durationMs));
// Cleanup
clients.forEach((ws) => {
clearInterval(ws.testInterval);
ws.close();
});
// Report results
const avgLatency = latencies.reduce((a, b) => a + b, 0) / latencies.length;
const p95Latency = latencies.sort()[Math.floor(latencies.length * 0.95)];
console.log(`Results:`);
console.log(` Total updates: ${updatesSent.total}`);
console.log(` Failures: ${updatesSent.failures}`);
console.log(` Avg latency: ${avgLatency.toFixed(2)}ms`);
console.log(` P95 latency: ${p95Latency}ms`);
console.log(` Updates/sec: ${(updatesSent.total / (config.durationMs / 1000)).toFixed(2)}`);
}
function generateRandomUpdate(): string {
const chars = 'abcdefghijklmnopqrstuvwxyz';
return Array.from({ length: 10 }, () =>
chars[Math.floor(Math.random() * chars.length)]
).join('');
}
runLoadTest({
concurrentUsers: 50,
durationMs: 30000,
documentId: 'load-test-doc',
serverUrl: 'ws://localhost:3000/ws',
});
Common Pitfalls and Solutions
| Problem | Symptom | Solution |
|---|---|---|
| Memory leak | Server RAM grows unbounded | Clean up documents after idle timeout; use Y.Doc().destroy()
|
| Eventual inconsistency | Users see different content | Ensure all updates are broadcast; add periodic state sync |
| High latency | >200ms update delay | Profile network; use Redis Cluster closer to servers; reduce payload size |
| Connection storms | All clients reconnect simultaneously | Implement exponential backoff (1s, 2s, 4s, 8s max) |
| Database overload | Postgres CPU at 100% | Debounce persistence; use write batching; add read replicas |
Security Considerations
// server/middleware/authorization.ts
export async function authorizeDocumentAccess(
ws: AuthenticatedWS,
documentId: string,
requiredPermission: 'read' | 'write' | 'admin'
): Promise<boolean> {
const result = await db.query(
`SELECT permission FROM document_access
WHERE document_id = $1 AND user_id = $2`,
[documentId, ws.userId]
);
if (result.rows.length === 0) {
ws.close(4003, 'Access denied');
return false;
}
const userPermission = result.rows.permission;
const permissionLevels = { read: 1, write: 2, admin: 3 };
if (permissionLevels[userPermission] < permissionLevels[requiredPermission]) {
ws.close(4003, 'Insufficient permissions');
return false;
}
return true;
}
Deployment Checklist
- [ ] Set up TLS/SSL for WebSocket connections (
wss://) - [ ] Configure CORS for WebSocket origin
- [ ] Implement health checks (
/healthendpoint) - [ ] Set up monitoring (Prometheus + Grafana)
- [ ] Configure log aggregation (ELK stack or Datadog)
- [ ] Set up alerting for connection drops and latency spikes
- [ ] Implement graceful shutdown (drain connections before restart)
- [ ] Configure database connection pooling appropriately
- [ ] Set up Redis Cluster with persistence enabled
- [ ] Implement backup strategy for PostgreSQL
- [ ] Add rate limiting per user/IP
- [ ] Set up auto-scaling based on connection count
Conclusion
Building a real-time collaborative code editor requires careful attention to conflict resolution, latency, and scalability. By using Yjs CRDTs, you get mathematically-guaranteed convergence without complex operational transformation logic. The key architectural decisions are:
- Use CRDTs over OT for simpler correctness guarantees
- Persist asynchronously with debouncing to avoid database bottlenecks
- Scale horizontally with Redis Pub/Sub for cross-server synchronization
- Monitor aggressively since distributed systems have more failure modes
- Test under load before production-real-time systems behave differently with 50+ concurrent users
The architecture presented here handles hundreds of concurrent users on a single server cluster. For thousands of users, add more gateway instances behind a load balancer and scale Redis/PostgreSQL independently.
Next Steps
- Add code folding and syntax highlighting per language
- Implement operational commands (undo/redo across users)
- Add video/audio calling alongside code collaboration
- Build remote pair programming with shared terminal access
- Create version history with diff viewing
This system is production-ready and can serve as the foundation for products like CodeSandbox, Replit, or Google Docs for code.
Rizwan Saleem — https://rizwansaleem.co
Top comments (0)