While building iTicket.AZ — a real-time event ticketing platform — I came across a job posting from a major bank that listed "building scalable, resilient, and fault-tolerant applications" as a core requirement. That made me think: is my backend actually fault-tolerant? Spoiler: it wasn't. Here's what I changed.
What does "fault-tolerant" actually mean?
A fault-tolerant system keeps running — even in degraded form — when parts of it fail. That means your app doesn't crash just because the database hiccuped, a third-party API timed out, or a job queue backed up. There are four patterns I focused on.
Pattern 1 — Retry + Circuit Breaker
When a DB write fails, should we silently drop it? No — but we also shouldn't hammer a broken service forever. The retry pattern tries again a few times; the circuit breaker stops calls entirely after too many failures.
```typescript
import CircuitBreaker from 'opossum';

const dbOptions = {
  timeout: 3000,                // consider a call failed if it takes longer than 3s
  errorThresholdPercentage: 50, // open the circuit once 50% of calls fail
  resetTimeout: 30000,          // after 30s, let a trial request through
};

const breaker = new CircuitBreaker(saveTicketToDB, dbOptions);

// Served instead of an error while the circuit is open
breaker.fallback(() => ({
  success: false,
  message: 'Service temporarily unavailable. Try again shortly.',
}));

export const createTicket = async (data) => {
  return await breaker.fire(data);
};
```
Now, instead of requests hanging on a dead dependency, users get a clean fallback response immediately. Library: opossum.
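One caveat: opossum is only the circuit-breaker half of the pattern; it doesn't retry for you. Here's a minimal retry-with-backoff sketch. `withRetry` and its parameters are my own names, not part of any library:

```typescript
// Small promise-based sleep helper
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry an async operation a few times with exponential backoff.
// Hypothetical helper: withRetry, attempts, baseDelayMs are illustrative names.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off between attempts: 100ms, 200ms, 400ms, ...
      if (i < attempts - 1) await sleep(baseDelayMs * 2 ** i);
    }
  }
  throw lastError;
}
```

Wrapping the breaker call, e.g. `withRetry(() => breaker.fire(data))`, combines the two behaviors: transient blips get a couple of quick retries, while an open circuit fails fast (or returns the fallback) instead of hammering the dependency.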
Pattern 2 — Graceful Degradation
If the chat service (Socket.IO) goes down, should ticket purchasing stop too? Absolutely not. Each feature should fail independently.
```typescript
export const getEventDetails = async (eventId: string) => {
  const [event, chatStatus] = await Promise.allSettled([
    EventService.findById(eventId),
    ChatService.getStatus(eventId),
  ]);

  return {
    event: event.status === 'fulfilled' ? event.value : null,
    chatAvailable: chatStatus.status === 'fulfilled',
  };
};
```
Promise.allSettled is the key here — unlike Promise.all, it doesn't throw if one promise rejects.
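To make the difference concrete, here's a tiny self-contained sketch; the two promises stand in for the real service calls:

```typescript
// Promise.all would reject the whole batch here; allSettled never rejects.
// Each entry reports its own outcome as { status, value } or { status, reason }.
async function demoAllSettled(): Promise<string[]> {
  const results = await Promise.allSettled([
    Promise.resolve('event data'),          // stand-in for EventService
    Promise.reject(new Error('chat down')), // stand-in for ChatService
  ]);
  return results.map((r) => r.status); // ['fulfilled', 'rejected']
}
```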
Pattern 3 — Health Checks + Structured Logging
```typescript
app.get('/health', async (req, res) => {
  const checks = {
    database: await checkDB(), // boolean
    uptime: process.uptime(),
    timestamp: new Date().toISOString(),
  };
  // Only the dependency check decides health; uptime and timestamp are metadata.
  const allOk = checks.database;
  res.status(allOk ? 200 : 503).json(checks);
});
```
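`checkDB` isn't shown above; one plausible shape is a ping raced against a timeout, so a hung database can't hang the health endpoint itself. The `ping` parameter here is a placeholder for your driver's real ping (a `SELECT 1`, Mongo `ping` command, etc.):

```typescript
// Hypothetical health probe: returns true iff ping() resolves within timeoutMs.
async function checkDB(
  ping: () => Promise<unknown>,
  timeoutMs = 1000,
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('db health check timed out')), timeoutMs);
  });
  try {
    await Promise.race([ping(), timeout]);
    return true;
  } catch {
    return false; // either the ping failed or it timed out
  } finally {
    clearTimeout(timer); // don't leave a stray timer behind
  }
}
```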
```typescript
const log = (level: string, message: string, meta = {}) => {
  console.log(JSON.stringify({
    level, message, ...meta,
    service: 'iticket-api',
    ts: new Date().toISOString(),
  }));
};
```
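In use, the helper emits one JSON object per line, which log aggregators can index directly. The `ticketId` and `attempt` fields below are made up for illustration (the helper is repeated so the snippet runs standalone):

```typescript
// Same structured-log helper as above, repeated for a self-contained example
const log = (level: string, message: string, meta: Record<string, unknown> = {}) => {
  console.log(JSON.stringify({
    level, message, ...meta,
    service: 'iticket-api',
    ts: new Date().toISOString(),
  }));
};

// One JSON line per event; extra context rides along in meta
log('error', 'db write failed', { ticketId: 'abc123', attempt: 2 });
// Emits something like:
// {"level":"error","message":"db write failed","ticketId":"abc123","attempt":2,"service":"iticket-api","ts":"..."}
```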
Pattern 4 — Queue + Async Processing
```typescript
import Queue from 'bull';

const emailQueue = new Queue('ticket-emails', process.env.REDIS_URL);

export const purchaseTicket = async (req, res) => {
  const ticket = await TicketService.create(req.body);

  // Respond right away; the confirmation email is sent out of band
  await emailQueue.add({ ticketId: ticket.id, userEmail: req.body.email });

  res.status(201).json({ success: true, ticket });
};

// Worker: consumes queued jobs, typically in a separate process
emailQueue.process(async (job) => {
  await EmailService.sendConfirmation(job.data);
});
```
Because jobs are persisted in Redis, they survive a crash of your main server: when a worker comes back up, Bull picks up the unfinished jobs and runs them.
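Bull can also own the retry policy per job. A sketch of the options I'd pass to `add` (the option names come from Bull's job-options API; tune the numbers to your workload):

```typescript
// Per-job retry policy for the email queue
const emailJobOptions = {
  attempts: 5,                                   // retry a failed job up to 5 times
  backoff: { type: 'exponential', delay: 2000 }, // wait 2s, 4s, 8s, ... between tries
  removeOnComplete: true,                        // drop finished jobs to keep Redis tidy
};

// Usage (assuming the queue from the snippet above):
// await emailQueue.add({ ticketId: ticket.id, userEmail: req.body.email }, emailJobOptions);
```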
The result
These four patterns transformed iTicket.AZ from a "works on my machine" project into something I'd actually put in front of an interviewer. The concepts map directly to what enterprise teams look for when they say "scalable, resilient, and fault-tolerant."