I was working on iTicket.AZ — a backend service for real-time event ticketing, built with Node.js and TypeScript — when I came across a job posting at a major bank. Their requirement: "build scalable, resilient, and fault-tolerant applications."
I looked at my own backend and asked honestly: is this fault-tolerant? The answer was no. The server had no health awareness, no service discovery, no restart policy, and no automated build verification. This post is about exactly what I fixed — with real code from the project.
Problem 1 — The backend had no health awareness
When the database went down, the backend kept accepting HTTP requests and silently failing all of them. No signal to any external system.
```typescript
app.get("/api/v1/health", async (_req, res) => {
  const dbOk = await AppDataSource.query("SELECT 1")
    .then(() => true)
    .catch(() => false);

  res.status(dbOk ? 200 : 503).json({
    status: dbOk ? "healthy" : "degraded",
    checks: {
      database: dbOk ? "up" : "down",
      uptime: Math.floor(process.uptime()),
    },
    service: "iticket-api",
    timestamp: new Date().toISOString(),
  });
});
```
A 200 means the service and its database are both reachable. A 503 means the process is alive but impaired. Any infrastructure tool — Consul, a load balancer, Kubernetes — can now make routing decisions based on this response.
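One subtle failure mode the plain `SELECT 1` probe does not cover: a database that hangs instead of erroring will hang the health endpoint with it. A minimal sketch of a bounded probe — `withTimeout` is a hypothetical helper, not part of the original project:

```typescript
// Bound a promise: if it does not settle within `ms`, resolve to `fallback`.
// A hung DB connection then reads as "down" instead of stalling /health.
async function withTimeout<T>(p: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  return Promise.race([p, timeout]).finally(() => clearTimeout(timer!));
}

// Usage inside the handler (sketch):
// const dbOk = await withTimeout(
//   AppDataSource.query("SELECT 1").then(() => true).catch(() => false),
//   2000,
//   false, // a slow probe is treated the same as a failed one
// );
```

The fallback of `false` is deliberate: for a health check, "too slow to answer" and "down" should produce the same 503.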
Problem 2 — No service discovery
Even with a health endpoint, nothing was calling it. HashiCorp Consul polls /api/v1/health every 10 seconds. If it receives a 503, it marks the instance critical and deregisters it after 30 seconds — automatically.
```typescript
async function registerWithConsul(): Promise<void> {
  try {
    await consulClient.agent.service.register({
      name: "iticket-api",
      address: "api",
      port: Number(appConfig.PORT),
      check: {
        name: "iticket-api health",
        http: `http://api:${appConfig.PORT}/api/v1/health`,
        interval: "10s",
        timeout: "3s",
        deregistercriticalserviceafter: "30s",
      },
    } as any);
  } catch (err) {
    // Non-fatal — app starts normally in local dev without Consul
    console.warn("Consul registration skipped:", err);
  }
}
The try/catch is intentional. Making the registration failure non-fatal is itself a fault-tolerance decision: the monitoring layer going down should not take the application with it.
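One refinement worth considering: if the Consul agent is merely slow to start rather than absent, giving up on the first failure loses registration permanently. A sketch of a generic retry-with-backoff wrapper — the `retry` helper is hypothetical, not from the original project, and it assumes a variant of the registration function that rethrows instead of swallowing the error:

```typescript
// Retry an async operation with exponential backoff before giving up.
async function retry<T>(
  fn: () => Promise<T>,
  attempts: number,
  baseDelayMs: number,
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Backoff: baseDelayMs, then 2x, 4x, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastErr;
}

// Usage (sketch) — still non-fatal overall, but tolerant of a slow agent:
// retry(() => registerWithConsulOrThrow(), 5, 1000).catch((err) =>
//   console.warn("Consul registration gave up after retries:", err),
// );
```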
Problem 3 — A single crash meant a permanent outage
```yaml
services:
  api:
    build: .
    restart: unless-stopped
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      DB_HOST: postgres

  postgres:
    image: postgres:16-alpine
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USERNAME} -d ${DB_NAME}"]
      interval: 10s
      retries: 5
    restart: unless-stopped

  consul:
    image: hashicorp/consul:1.18
    ports:
      - "8500:8500"
    restart: unless-stopped
```
PostgreSQL must pass its own health check before the API starts. If the API crashes, Docker restarts it. Consul monitors it continuously.
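One gap worth noting: only `postgres` has a Docker-level health check here, so Docker knows the API process is running but not whether it is serving. A hedged sketch of a matching check for the API service — it assumes the API listens on port 3000 and that `curl` is available inside the image (neither is stated in the original):

```yaml
  api:
    healthcheck:
      # -f makes curl exit non-zero on a 503, so "degraded" counts as unhealthy
      test: ["CMD", "curl", "-f", "http://localhost:3000/api/v1/health"]
      interval: 10s
      timeout: 3s
      retries: 3
```

With this in place, other services could use `condition: service_healthy` against the API the same way the API already waits on PostgreSQL.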
Problem 4 — Broken builds reached the repository undetected
```yaml
name: iTicket.AZ CI
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci
      - run: npx tsc --noEmit

  docker:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t iticket-api:${{ github.sha }} .
```

Two steps from an earlier draft were redundant and are gone: installing TypeScript after `npm ci` (it belongs in `devDependencies`, which `npm ci` already installs), and running `tsc` a second time without `--noEmit` (the type check is identical; CI does not need the emitted output).
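A possible extension, sketched here as an assumption rather than part of the original pipeline: a smoke test that the built image actually starts and answers HTTP. It assumes the API listens on port 3000; a 503 is acceptable because no database runs in CI, so the check only asserts the process is up and responding.

```yaml
      # Hypothetical extra step after the build: run the image briefly and
      # confirm the health endpoint answers at all (any HTTP status is fine).
      - run: |
          docker run -d --name smoke -p 3000:3000 iticket-api:${{ github.sha }}
          sleep 5
          curl -s -o /dev/null http://localhost:3000/api/v1/health
          docker rm -f smoke
```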
How the pieces connect
```
git push
  → GitHub Actions: type check + Docker build
      ↓ passes
  → docker-compose up
      PostgreSQL → health check passes
      ↓ healthy
      iticket-api → registers with Consul
      ↓
      Consul polls /api/v1/health every 10s
          503   → marks critical → deregisters after 30s
          crash → Docker restarts automatically
```
A health endpoint is useless without something polling it. Consul is useless without something to register with it. A restart policy is useless if the app starts before the database is ready. The pieces only become fault-tolerant as a system when they are connected.