DEV Community: Victor Ayoola

How My Team from Risevest Academy and I Built an End-to-End Encrypted Messaging App in 3 Weeks

Victor Ayoola — Thu, 28 May 2026 23:02:13 +0000

We were each asked to come up with an idea for a project we would like to build. I have always wondered what the tech behind messaging platforms is like, so for me it was easy: build a secure messaging platform. After a bit of research, I thought I understood what that meant. After the past three weeks, having built something I'm really proud of, I can tell you that I had no idea what I was getting into. Not in a bad way, but in a way that meant I was bound to learn.

This is the story of how a team of four backend developers built a real-time, end-to-end encrypted messaging platform with WebSocket messaging, media sharing, push notifications, and a functioning encryption layer.

Where We Started

The idea was initially simple: build something like WhatsApp, but with privacy as the foundation. Messages had to be encrypted on the sender's device and decrypted only on the recipient's, so that not even our servers could read them.

That last requirement changed how I pictured the whole system. Building a simple CRUD app is relatively easy: REST APIs, maybe a real-time feature here and there. E2EE is a different ball game entirely. It forces you to think about boundaries. What does the server know? What should it know? And what is off limits?

Before we started actually coding, we spent time on three questions: what were we actually building, what did each piece of the stack need to do, and what were the dependencies between them?

The Tech Stack and Why We Chose It

We landed on Node.js, TypeScript, Express, PostgreSQL, Redis, and Socket.io for the backend.

TypeScript with strict mode was non-negotiable. With four people writing backend code simultaneously while also working on the frontend, type safety is what keeps everyone honest. Every API response shape, every socket event payload, and every database model is fully typed. When a response shape changes, the frontend catches it at compile time.

Express.js was a deliberate choice. We briefly considered alternatives, but Express was the obvious pick, for one because every developer on the team already knew it. We built a clean middleware stack on top of it: validation, JWT auth, error handling, and request logging.

PostgreSQL and Knex because relational data fits our domain. Users, conversations, members, messages — these have clear relationships and constraints. Knex gave us type-safe migrations and query building without the magic of a full ORM. We wrote our migrations in dependency order and never had a schema conflict.

Socket.io with Redis adapter for real-time messaging. Socket.io's room abstraction maps perfectly to chat conversations — a socket joins the conversation room, messages are emitted to the room, everyone in it receives them. The Redis adapter is what allows this to scale: when you have multiple server instances, a message arriving at server A needs to reach a user connected to server B. Redis pub/sub is the channel between them.
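To make the cross-instance fan-out concrete, here is a toy model of the idea. It is a sketch, not our Socket.io setup: a plain EventEmitter stands in for Redis pub/sub, and createInstance, join, and emitToRoom are hypothetical names rather than Socket.io APIs.

```javascript
const { EventEmitter } = require('events');

// Stand-in for Redis pub/sub: every server instance subscribes to the same bus.
const bus = new EventEmitter();

// Hypothetical, simplified server instance: it only knows its own local sockets.
function createInstance(name) {
  const rooms = new Map(); // roomId -> Set of local "sockets" (plain callbacks here)

  // Anything published on the bus is delivered to local members of that room
  bus.on('message', ({ roomId, payload }) => {
    for (const deliver of rooms.get(roomId) ?? []) deliver(payload);
  });

  return {
    name,
    join(roomId, deliver) {
      if (!rooms.has(roomId)) rooms.set(roomId, new Set());
      rooms.get(roomId).add(deliver);
    },
    // Publishing goes through the bus, so members on other instances receive it too
    emitToRoom(roomId, payload) {
      bus.emit('message', { roomId, payload });
    },
  };
}

const serverA = createInstance('A');
const serverB = createInstance('B');

const received = [];
serverB.join('conversation-1', (msg) => received.push(msg)); // recipient is connected to B
serverA.emitToRoom('conversation-1', { ciphertext: 'opaque-blob' }); // message arrives at A

console.log(received);
```

Without the shared bus, serverA would only loop over its own room members and the user on serverB would never see the message.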

BullMQ for offline message delivery. When a recipient is offline, the message queues. A background worker processes the queue and triggers push notifications. Jobs retry automatically with exponential backoff. This is the kind of infrastructure that feels like overkill until the day it saves you.
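For a feel of what exponential backoff means in numbers, here is a small sketch of a doubling schedule. The helper name, the 1-second base, and the 60-second cap are assumptions for illustration; this is not BullMQ's internal formula or our actual queue configuration.

```javascript
// Hypothetical helper: delay before retry number `attempt` (1-based),
// doubling each time and capped at `maxDelayMs`.
function backoffDelay(attempt, baseDelayMs = 1000, maxDelayMs = 60000) {
  const delay = baseDelayMs * 2 ** (attempt - 1);
  return Math.min(delay, maxDelayMs);
}

// Delays for the first five retries with a 1-second base
const schedule = [1, 2, 3, 4, 5].map((n) => backoffDelay(n));
console.log(schedule); // [ 1000, 2000, 4000, 8000, 16000 ]
```

The point of the growing gap is that a recipient who is briefly offline gets retried quickly, while one who stays offline does not hammer the push provider.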

Firebase for two things in the end: authentication through Firebase phone verification, and Cloud Messaging for push notifications. We originally planned a third, Firebase Storage for media, but switched to Cloudinary midway through because of its built-in transformation pipeline.

The Architecture Decision That Defined Everything

Early in week one we made a decision that shaped everything that followed: the server would store ciphertext and nothing else.

This sounds obvious for an E2EE app but the implications run deep. It means:

The backend has no business logic that depends on message content
Encryption and decryption happen entirely on the client
If our database is breached, the attacker gets encrypted blobs they cannot read
The server cannot comply with a request to hand over message contents because it genuinely does not have them

We used the Web Crypto API on the frontend: ECDH for key agreement and AES-256-GCM for symmetric encryption. Private keys are stored in IndexedDB as non-extractable CryptoKey objects, so even our own JavaScript cannot read the raw key material back out. The keys can only be used to perform cryptographic operations.
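A quick way to see what "non-extractable" buys you is to generate a key pair with extractable set to false and then try to export the private key. This is a sketch that runs under Node's webcrypto export (in the browser the same calls live on crypto.subtle); the P-256 curve and the function name are assumptions, not necessarily what our frontend uses.

```javascript
const { webcrypto } = require('crypto');
const { subtle } = webcrypto;

async function demoNonExtractableKey() {
  // extractable = false: the private key can be used, but never exported
  const keyPair = await subtle.generateKey(
    { name: 'ECDH', namedCurve: 'P-256' }, // curve is an assumption
    false,
    ['deriveKey'],
  );

  let exportError = null;
  try {
    await subtle.exportKey('pkcs8', keyPair.privateKey);
  } catch (err) {
    exportError = err; // export is refused for non-extractable keys
  }

  return {
    privateExtractable: keyPair.privateKey.extractable,
    exportRejected: exportError !== null,
  };
}

demoNonExtractableKey().then((result) => console.log(result));
```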

Here is what that looks like in practice. When a user sends a message:

The frontend fetches the recipient's public key from the backend
An ephemeral key pair is generated for this specific message
ECDH key agreement derives a shared secret using the ephemeral private key and the recipient's public key
The message is encrypted with AES-256-GCM using that shared secret
The encrypted blob is sent to the backend
The backend stores it without reading it
The recipient's frontend receives the blob, derives the same shared secret from the other side, and decrypts locally
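The steps above can be sketched roughly as follows. This is a minimal illustration, not our production code: it runs under Node's webcrypto export, the P-256 curve, the envelope shape, and the function names are assumptions, and a real client would also serialize the ephemeral public key before sending it.

```javascript
const { webcrypto } = require('crypto');
const { subtle } = webcrypto;

const ECDH_PARAMS = { name: 'ECDH', namedCurve: 'P-256' }; // curve is an assumption

// ECDH key agreement: our private key + their public key -> AES-256-GCM key
function deriveAesKey(privateKey, publicKey, usages) {
  return subtle.deriveKey(
    { name: 'ECDH', public: publicKey },
    privateKey,
    { name: 'AES-GCM', length: 256 },
    false, // the derived key is non-extractable too
    usages,
  );
}

async function encryptFor(recipientPublicKey, plaintext) {
  // Fresh key pair for this one message
  const ephemeral = await subtle.generateKey(ECDH_PARAMS, false, ['deriveKey']);
  const aesKey = await deriveAesKey(ephemeral.privateKey, recipientPublicKey, ['encrypt']);
  const iv = webcrypto.getRandomValues(new Uint8Array(12)); // 96-bit nonce for GCM
  const ciphertext = await subtle.encrypt(
    { name: 'AES-GCM', iv },
    aesKey,
    new TextEncoder().encode(plaintext),
  );
  // The recipient needs the ephemeral public key and the IV to decrypt
  return { ephemeralPublicKey: ephemeral.publicKey, iv, ciphertext };
}

async function decryptWith(recipientPrivateKey, { ephemeralPublicKey, iv, ciphertext }) {
  // Same shared secret, derived from the other side
  const aesKey = await deriveAesKey(recipientPrivateKey, ephemeralPublicKey, ['decrypt']);
  const plaintext = await subtle.decrypt({ name: 'AES-GCM', iv }, aesKey, ciphertext);
  return new TextDecoder().decode(plaintext);
}

// Round trip: the "server" in between would only ever see the envelope
async function roundTrip(message) {
  const recipient = await subtle.generateKey(ECDH_PARAMS, false, ['deriveKey']);
  const envelope = await encryptFor(recipient.publicKey, message);
  return decryptWith(recipient.privateKey, envelope);
}

roundTrip('hello from the other device').then((text) => console.log(text));
```

A hardened design would typically run the ECDH output through a key derivation function such as HKDF before using it as an AES key; the sketch skips that to stay close to the steps listed above.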

We originally planned to use the Signal Protocol library on both frontend and backend. We ran into a hard wall: libsignal is a Node.js library with native bindings, and it does not run in the browser. The frontend had to use a different approach. We chose the Web Crypto API because it is built into every modern browser, has no dependencies, and lets us keep private keys in IndexedDB as non-extractable keys, so the raw key material never reaches page scripts.

The 3-Week Sprint

We split the work across four backend developers with clear ownership each week.

Week one was foundation — Express scaffold, TypeScript config, database migrations, Firebase Auth integration, JWT session management, and the encryption key management infrastructure. The primary thing in week one was to get the auth middleware right. Everything else depended on it. We used Firebase Auth for phone number verification and OTP, then exchanged the Firebase ID token for our own internal JWT pair. Fifteen-minute access tokens, thirty-day refresh tokens and silent renewal on 401 responses.
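The "silent renewal on 401" part can be sketched as a thin wrapper around whatever makes the HTTP call. Everything here is hypothetical (createAuthedClient, the request and refresh callbacks, the fake backend); it only illustrates the retry-once pattern, not our actual Axios setup.

```javascript
// Hypothetical client-side wrapper: `request` performs a call with the given
// access token, `refresh` exchanges the refresh token for a new access token.
function createAuthedClient({ request, refresh, initialToken }) {
  let accessToken = initialToken;

  return async function call(path) {
    let response = await request(path, accessToken);
    if (response.status !== 401) return response;

    // Access token expired: renew once, then retry the original request
    accessToken = await refresh();
    response = await request(path, accessToken);
    return response;
  };
}

// Fake backend for illustration: only the token "fresh" is accepted
let refreshCalls = 0;
const call = createAuthedClient({
  initialToken: 'expired',
  request: async (path, token) => ({ status: token === 'fresh' ? 200 : 401, path }),
  refresh: async () => {
    refreshCalls += 1;
    return 'fresh';
  },
});

call('/conversations').then((res) => console.log(res.status, refreshCalls));
```

Retrying only once matters: if the refreshed token is also rejected, the user needs to log in again, and looping would just hide that.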

Week two was the core product — Socket.io server with Redis adapter, message routing, message persistence, delivery status tracking, group messaging, media pipeline, and push notifications. This was the hardest week. Getting real-time messaging right requires thinking about state in multiple places simultaneously: the socket connection registry in Redis, the message status in PostgreSQL, the optimistic UI on the frontend, the BullMQ queue for offline delivery. We had moments where all four of us were debugging the same flow from different angles.

Week three was integration. The frontend and backend had been built largely side by side and week three was where we wired them. This is where the choice to use TypeScript paid off most visibly — the type definitions meant most things just worked, and where they didn't the errors were specific, easy to trace and fixable.

What Actually Went Wrong

I want to be honest about the things that did not go smoothly.

The authentication flow. This wasn't a major issue, but we definitely struggled to understand and finalize the authentication flow on registration, and to integrate Firebase Auth into it properly.

The libsignal mismatch cost us a lot of time. We had built the backend key validation assuming the frontend would use Signal Protocol. When we discovered libsignal doesn't run in browsers we had to strip the validation out of the keys routes and rethink the frontend encryption approach. The lesson: validate your stack choices before you build anything on top of them.

Route ordering in Express. This one hurt a bit, and genuinely had me thinking I was losing it. In reality, it is an easy mistake to make, and we made it. A parameterised route like /:id will swallow any more specific route registered after it. GET /users/search must come before GET /users/:id or search is unreachable: Express matches top to bottom and has no way to know that "search" is not a UUID. We caught this during integration, and it was quite mind-boggling. It is one of those mistakes I most definitely won't be making again.
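Here is a stripped-down model of first-match routing that reproduces the problem. It is a sketch of the matching idea only, not Express's real matcher, and createRouter is a hypothetical helper.

```javascript
// Simplified model of first-match routing (not Express's real matcher)
function createRouter() {
  const routes = [];
  return {
    get(pattern, handler) {
      routes.push({ parts: pattern.split('/'), handler });
    },
    handle(path) {
      const segments = path.split('/');
      // Routes are tried in registration order; the first match wins
      for (const { parts, handler } of routes) {
        if (parts.length !== segments.length) continue;
        const params = {};
        const matches = parts.every((part, i) => {
          if (part.startsWith(':')) {
            params[part.slice(1)] = segments[i]; // a parameter matches anything
            return true;
          }
          return part === segments[i];
        });
        if (matches) return handler(params);
      }
      return 'no route';
    },
  };
}

const wrongOrder = createRouter();
wrongOrder.get('/users/:id', ({ id }) => `user lookup for id "${id}"`);
wrongOrder.get('/users/search', () => 'search results');

const rightOrder = createRouter();
rightOrder.get('/users/search', () => 'search results');
rightOrder.get('/users/:id', ({ id }) => `user lookup for id "${id}"`);

console.log(wrongOrder.handle('/users/search')); // user lookup for id "search"
console.log(rightOrder.handle('/users/search')); // search results
```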

What I Would Do Differently

Start with a shared type contract. It would have helped us avoid some of the issues we faced while building and integrating the frontend and backend, because it keeps both sides on the same page about how data is sent and received.

Do a dependency audit before starting. Know which tasks are truly parallel and which are blocked.

Proper research on stack choice. This is definitely a hard check for me from now on, even when it feels like you're using the industry standard. It's always best to ensure what you go with aligns perfectly with your goals and works properly with all other tools being used in the project.

What I'm Most Proud Of

Building this in three weeks, with four developers, on a project of this complexity. It most definitely wasn't smooth sailing, but we pushed to see it through and we pulled it off.

The Stack in One Place

For anyone who wants the full picture of our full tech stack, this is what it looked like:

Backend: Node.js, TypeScript, Express.js, PostgreSQL, Knex, Redis, Socket.io, BullMQ, Firebase Admin SDK, Cloudinary, Zod, jose, Multer

Frontend: React, TypeScript, Web Crypto API, IndexedDB, Socket.io client, Firebase Auth SDK, Axios

Infrastructure: Redis Cloud, Cloudinary, Firebase (Auth + FCM), PostgreSQL

What's Next

The backend is built to support Android — the API doesn't care whether the client is a browser or a mobile app. React Native is next.

Voice and video calls are Phase 2. We plan to use WebRTC for peer-to-peer media, and our signalling server already handles the connection infrastructure.

Full Signal Protocol on the frontend is something I want to revisit. The Web Crypto API approach works and is genuinely secure, but Signal's Double Ratchet algorithm provides forward secrecy that our current implementation lacks: with forward secrecy, past messages remain safe even if a key is later compromised. That's the next cryptographic milestone.

If you're thinking about building something similar, my advice is this: make the hard architectural decisions first. For us, the decision that the server would never read message content was non-negotiable. Everything else followed from it. Start with your constraints, not your components.

The code is messy in places, some corners were cut, and the UI could definitely do with some improvement. But it works, and messages encrypted on one device are being decrypted on another without our server ever knowing what was said.

That's the thing we set out to build. That's the thing we built.

I am a fullstack developer who built the backend architecture (alongside my team) and the frontend for Culver as part of a 3-week mentorship project. These are the repo links, Backend, Frontend.

Staying Open for Business on the Busiest Internet Shopping Day

Victor Ayoola — Fri, 03 Apr 2026 10:45:52 +0000

Your e-commerce store has a sale scheduled for midnight. You've spent weeks preparing: discounted inventory loaded, email campaign fired, social media countdown ticking. At 11:59 PM, traffic is normal. At 12:00:01 AM, thirty thousand users simultaneously click "Shop Now."

Your single-process Node.js server gets hit with a wall of concurrent requests—product lookups, cart operations, inventory checks, checkout flows. The event loop, which was humming along handling a few dozen requests per second, is now buried under thousands. Response times balloon from 80ms to 8 seconds. Then your server crashes. Thirty thousand customers see a blank screen. Your sale is over before it started.

This is not a hypothetical. It happens every Black Friday, to real businesses, running exactly the kind of code most tutorials teach you to write.

The fix is not simply "get a bigger server." It is a fundamental architectural upgrade: Node.js clustering with graceful shutdown. By the end of this article, you will understand how to use every CPU core your machine has, how to deploy new code without a single dropped request, and how to survive a traffic spike that would kill a single-process app.

Why one process is not enough

Node.js runs on a single thread. That is usually its superpower: instead of spawning a new OS thread per request (expensive, slow), Node handles thousands of concurrent connections through non-blocking I/O and its event loop.

But here is the catch: on a machine with 8 CPU cores, a single Node.js process runs your JavaScript on exactly one. The other seven sit idle, doing nothing, while your one overwhelmed process burns through its queue.

The core problem: CPU-bound operations—checkout calculation, inventory validation, price aggregation—block the event loop. When thirty thousand users hit these simultaneously, everyone waits behind everyone else. On a single process, you have one lane of traffic. Clustering gives you eight.
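You can watch the event loop get blocked with a few lines. This sketch schedules a 0 ms timer and then spins the CPU; the timer cannot fire until the synchronous work is done. The function name and the 100 ms figure are arbitrary choices for the demo.

```javascript
// Measure how late a 0 ms timer fires when synchronous work hogs the thread
function measureTimerDelay(blockMs) {
  return new Promise((resolve) => {
    const scheduledAt = Date.now();
    setTimeout(() => resolve(Date.now() - scheduledAt), 0);

    // Simulated CPU-bound work: spin for `blockMs` milliseconds
    const end = Date.now() + blockMs;
    while (Date.now() < end) {
      // busy loop, nothing else can run on this thread meanwhile
    }
  });
}

measureTimerDelay(100).then((ms) => {
  console.log(`0 ms timer actually fired after ${ms} ms`);
});
```

Every request queued behind that loop waits the same way the timer does, which is why spreading work across several processes helps.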

The cluster module, built into Node.js since v0.8, solves this. It lets you fork multiple worker processes from a single master, each running your app code independently, each handling its own slice of incoming traffic. By default, on every platform except Windows, the master accepts incoming connections and hands them to the workers round-robin. You go from one cashier to eight, instantly.

But clustering alone is not enough. The harder problem is: what happens when you need to deploy a bug fix during the sale? Restarting all eight workers simultaneously drops every in-flight request. Customers lose their cart. Orders fail mid-checkout. You need zero-downtime deployment—the ability to swap out running workers one at a time, without ever leaving the server unable to accept requests.

That is the graceful shutdown problem, and it is the heart of what we are building.

Architecture first

Before writing code, understand the shape of what we are building:

Process	Role	Count
`Master`	Forks workers, watches for exits, handles signals, orchestrates rolling restarts	1
`Worker`	Runs your actual Express/HTTP app, handles requests, drains on shutdown	N (one per CPU core)

The master never runs your request handlers. It is a process manager: pure lifecycle control, plus handing incoming connections to workers. Workers never know how many siblings they have. They just accept connections and serve requests.

Step 1: The master process

// cluster-master.js
const cluster = require('cluster');
const fs = require('fs');
const os = require('os');

const WORKER_COUNT = os.cpus().length; // One worker per logical CPU core
const workers = new Map();             // Track worker pid -> worker object

function spawnWorker() {
  const worker = cluster.fork();
  const pid = worker.process.pid;
  workers.set(pid, worker);

  worker.on('exit', (code, signal) => {
    // delete() returns false if the worker was already removed from the map,
    // i.e. it was retired on purpose
    const wasTracked = workers.delete(pid);

    // Respawn only on unexpected exits. A clean exit (code 0) or a SIGTERM
    // is treated as intentional.
    const intentional = code === 0 || signal === 'SIGTERM';
    if (wasTracked && !intentional) {
      console.log(`Worker ${pid} crashed. Respawning...`);
      spawnWorker();
    }
  });

  console.log(`Worker ${pid} started`);
  return worker;
}

if (cluster.isPrimary) {
  // Record the master's PID so a deploy script can signal it
  fs.writeFileSync('app.pid', String(process.pid));

  console.log(`Master ${process.pid} starting ${WORKER_COUNT} workers`);
  for (let i = 0; i < WORKER_COUNT; i++) spawnWorker();

  // SIGTERM kicks off a rolling restart
  process.on('SIGTERM', () => rollingRestart());
}

Notice the workers Map. We track every live worker by PID. When a worker exits unexpectedly (rather than through an intentional shutdown), we automatically respawn it. This is your self-healing loop: a crash under Black Friday traffic rarely takes down all eight workers simultaneously.

Step 2: The rolling restart

This is the mechanism that makes zero-downtime deployment possible. Instead of killing all workers at once, we cycle through them one at a time: for each worker we bring up a replacement, tell the old worker to stop accepting new requests, wait for it to finish its current work and exit (force-killing it if it hangs), then move on to the next one.

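// Still in cluster-master.js
let restartInProgress = false; // Ignore repeated SIGTERMs while a restart is running

async function rollingRestart() {
  if (restartInProgress) return;
  restartInProgress = true;
  console.log('Starting zero-downtime rolling restart...');
  const workerList = [...workers.values()]; // Snapshot, so replacements are not restarted

  for (const worker of workerList) {
    if (worker.isDead()) continue; // Crashed since the snapshot was taken
    const pid = worker.process.pid;

    await new Promise((resolve) => {
      let killTimer;

      // 1. Untrack the old worker so its exit is not treated as a crash
      workers.delete(pid);

      // 2. Spawn a replacement first so capacity is maintained
      const replacement = spawnWorker();

      // 3. Once the replacement is accepting connections, tell the old worker
      //    to stop accepting new ones and drain its existing requests
      replacement.once('listening', () => {
        if (worker.isDead()) return;
        if (worker.isConnected()) worker.send({ type: 'SHUTDOWN' });

        // 4. Safety valve: force kill after 30s if still hanging
        killTimer = setTimeout(() => {
          if (!worker.isDead()) {
            console.warn(`Worker ${pid} timed out. Force killing.`);
            worker.kill('SIGKILL');
          }
        }, 30000);
      });

      // 5. Wait for the old worker to exit, then move on
      worker.once('exit', () => {
        clearTimeout(killTimer);
        console.log(`Worker ${pid} drained and exited`);
        resolve();
      });
    });
  }

  restartInProgress = false;
  console.log('Rolling restart complete. All workers refreshed.');
}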

Why spawn the replacement before waiting for the old worker to exit? If you kill a worker first, then spawn the replacement, there is a gap where you have one fewer worker handling traffic. During a Black Friday spike, that gap matters. By spawning first and waiting second, capacity is maintained throughout the entire restart cycle.

Step 3: The worker — graceful shutdown logic

The worker is where your actual HTTP application runs. It receives the SHUTDOWN message from the master and must: stop the server from accepting new connections, wait for all in-flight requests to finish, then exit cleanly.

// worker.js
const http = require('http');
const express = require('express');

const app = express();
let isShuttingDown = false;
let activeRequests = 0;

// Middleware to reject new requests during shutdown
app.use((req, res, next) => {
  if (isShuttingDown) {
    // Tell load balancers/clients not to reuse this connection
    res.setHeader('Connection', 'close');
    return res.status(503).json({
      error: 'Server is restarting. Please retry.',
    });
  }
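  activeRequests++;
  // Count each request exactly once: 'finish' fires on a completed response,
  // 'close' also fires if the client disconnects before the response ends
  let counted = true;
  const release = () => {
    if (counted) {
      counted = false;
      activeRequests--;
    }
  };
  res.once('finish', release);
  res.once('close', release);
  next();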
});

// Your actual routes go here
app.get('/products/:id', (req, res) => {
  res.json({ id: req.params.id, name: 'Running Shoes', price: 79.99 });
});

app.post('/checkout', async (req, res) => {
  // In production: validate inventory, process payment, write order
  await new Promise(r => setTimeout(r, 50)); // Simulated async work
  res.json({ orderId: `ORD-${Date.now()}`, status: 'confirmed' });
});

const server = http.createServer(app);
server.listen(3000, () => {
  console.log(`Worker ${process.pid} listening on port 3000`);
});

// Listen for shutdown signal from master
process.on('message', (msg) => {
  if (msg.type !== 'SHUTDOWN') return;

  console.log(`Worker ${process.pid} beginning graceful shutdown...`);
  isShuttingDown = true;

  // Stop accepting new TCP connections
  server.close(() => {
    console.log(`Worker ${process.pid} HTTP server closed`);
    process.exit(0);
  });

  // Poll for active requests to drain
  const drainCheck = setInterval(() => {
    if (activeRequests === 0) {
      clearInterval(drainCheck);
      console.log(`Worker ${process.pid} all requests drained. Exiting.`);
      process.exit(0);
    }
    console.log(`Worker ${process.pid} waiting: ${activeRequests} active requests`);
  }, 500);
});

Three things to notice here:

isShuttingDown is checked in middleware—before your routes run. Any request that arrives after shutdown is triggered gets a 503 immediately, not a hanging connection that never resolves.

activeRequests is tracked per-worker using simple increment/decrement on res.finish. This is lightweight and accurate for HTTP/1.1 request lifecycles.

There are two paths to process.exit(0): the server.close() callback (fires when the server's keep-alive connections all close), and the active-request drain poll. Whichever fires first wins. The 30-second SIGKILL in the master is the nuclear option if neither fires.

Step 4: Wire it all together

// index.js — single entry point
const cluster = require('cluster');

if (cluster.isPrimary) {
  require('./cluster-master');
} else {
  require('./worker');
}

Run it with node index.js. On an 8-core machine, you get 8 workers. To trigger a zero-downtime restart during a deploy, send SIGTERM to the master (this assumes the master's PID was written to app.pid at startup):

kill -SIGTERM $(cat app.pid)

The Black Friday simulation: what this actually achieves

At 12:00 AM, connections arrive. The cluster distributes them across 8 workers. Each worker handles its slice of traffic independently, with no shared memory and no lock contention. Worker 1 processing a checkout has no impact on Worker 4 processing a product lookup.

At 12:07 AM, your team spots a bug in the discount calculation. They push a fix. Deployment triggers SIGTERM on the master.

The rolling restart begins. A fresh Worker 9 starts, running the patched code, and begins accepting traffic. Worker 1 receives SHUTDOWN and stops accepting new requests. The 3 requests currently in flight finish (they take about 60ms combined), and Worker 1 exits. Then Worker 2. Then Worker 3. By 12:09 AM, all 8 workers are running the patched code. Not a single in-flight request was dropped. The worst a customer could see is a 503 on a reused keep-alive connection to a draining worker, which the client can simply retry.

Scenario	Single Process	Cluster + Graceful Shutdown
CPU utilization (8-core machine)	❌ 12.5% (1 of 8 cores)	✅ ~100% (all cores active)
Requests dropped on deploy	❌ All in-flight requests	✅ Zero
Worker crash recovery	❌ Full outage	✅ Auto-respawn, 7 workers continue
Throughput under spike	❌ Single queue, degrades fast	✅ Spread across workers, scales with core count
Deployment strategy	❌ Stop → deploy → restart	✅ Rolling restart, no downtime window

What about load balancers and PM2?

In a real production system with multiple machines, you would typically have a load balancer (NGINX, AWS ALB) in front, and a process manager like PM2 handling clustering and restarts. PM2's pm2 reload command does almost exactly what we built above.

But here is why understanding the raw cluster module matters: PM2 and cloud load balancers are abstractions over exactly this machinery. When a rolling restart goes wrong in production—and eventually it will—you need to know what signal the master is sending, why a worker is not draining, and what the 30-second SIGKILL safety valve is protecting against. You cannot debug an abstraction you have never looked underneath.

Production note: For single-machine deployments, use the raw cluster module or PM2 in cluster mode (pm2 start index.js -i max). For multi-machine deployments, let your orchestration layer (Kubernetes, ECS) handle node-level scaling, and run the same drain logic when the container receives SIGTERM for zero-downtime pod replacement. The worker shutdown logic in Step 3 applies to both cases; only the trigger differs (an IPC message from the master versus a SIGTERM from the orchestrator).
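For the orchestrator case, a SIGTERM-triggered drain might look roughly like this. It is a sketch under assumptions: createShutdown and startServer are hypothetical helpers, the 30-second timeout mirrors the safety valve used earlier, and it uses plain http rather than Express to stay self-contained.

```javascript
const http = require('http');

// Hypothetical helper: returns a function that drains `server`, then calls `onDone`.
function createShutdown(server, { timeoutMs = 30000, onDone = () => process.exit(0) } = {}) {
  let shuttingDown = false;
  let finished = false;

  const finish = () => {
    if (finished) return; // Call onDone at most once
    finished = true;
    onDone();
  };

  return function shutdown() {
    if (shuttingDown) return; // Ignore repeated signals
    shuttingDown = true;

    // Safety valve: give up waiting after timeoutMs
    const timer = setTimeout(finish, timeoutMs);

    // Stop accepting new connections; the callback fires once open ones have ended
    server.close(() => {
      clearTimeout(timer);
      finish();
    });
  };
}

// Example wiring for a container entry point (not invoked here;
// the port and handler are placeholders).
function startServer(port = 3000) {
  const server = http.createServer((req, res) => res.end('ok'));
  server.listen(port);
  process.on('SIGTERM', createShutdown(server));
  return server;
}
```

The request-counting middleware from Step 3 can sit alongside this unchanged; only the thing that calls shutdown is different.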

Key takeaways

01 — Clustering is basic resource utilization. Node.js is single-threaded. On a multi-core machine, a single process wastes most of your hardware. Clustering is not premature optimization.

02 — Graceful shutdown is not just server.close(). It means tracking in-flight requests, rejecting new ones with 503, draining the queue, then exiting.

03 — Spawn before you kill. In a rolling restart, create the replacement worker before waiting for the old one to exit. One fewer worker during a traffic spike is a real cost.

04 — Always set a SIGKILL timeout. A worker that holds open a database connection or hangs in a third-party API call will never drain. The safety valve is not optional.

E-commerce failures during peak traffic are almost always architectural, not hardware problems. The stores that stay up on Black Friday are not running bigger servers—they are running smarter code.

The cluster module and graceful shutdown logic are not advanced topics. They are table stakes for any Node.js application that real people depend on.

Your store is ready. Let's get to selling.