My hands were shaking as I watched the notification counter climb: 5,000 sent... 8,000 sent... 12,000 sent. I had just launched a push notification to every user in the database with a YouTube Live stream URL. The problem? The URL worked perfectly on desktop web, but mobile users were seeing error screens instead of the livestream.
The Planning Dept. had sent me the wrong URL format—one that wasn't mobile-compatible. Within minutes, my team lead called: "Stop the notification immediately. Mobile users can't access the stream."
I had about 90 seconds to make a decision. My only option at the time? Force-stop the Docker container and pray.
That terrifying moment taught me something crucial: when you're sending mass notifications, you need an elegant cancellation mechanism, not emergency infrastructure shutdowns. In this post, I'll show you how I built a graceful cancellation system using Redis coordination flags that lets me stop notification campaigns mid-flight without destroying my entire worker infrastructure.
The YouTube Live incident
Here's what happened that day. The marketing team planned a major YouTube Live event and needed to notify all app users. They requested a push notification with the livestream link. I received the URL from Planning Dept., tested it quickly on my desktop browser—worked fine—and queued the notification job for all 50,000+ active users.
The notification sending started smoothly. Then, about three minutes in, my team lead called urgently. The marketing team had just discovered that the YouTube URL format didn't work on mobile devices. Desktop users could watch the stream perfectly, but mobile app users kept getting redirected to the YouTube channel page instead of the live stream page.
The long-form URL didn't deep-link properly into the YouTube mobile app. By the time my team lead gave the order to stop, 12,000 notifications had already been sent. I had no choice—I needed to stop the remaining 38,000 notifications from going out.
My only option: emergency Docker shutdown.
I SSH'd into the EC2 instance and ran:
docker stop messaging-worker
The aftermath was messy:
- Some users received the broken notification, others didn't
- I had no clean record of who got what
- The marketing team had to manually coordinate a follow-up campaign with the correct URL
- Other legitimate background jobs were interrupted
- I spent the next hour restarting services and explaining what happened
I knew this couldn't happen again. There had to be a better way.
The real problem with Docker restarts
When I force-stopped that container, I created cascading failures throughout my infrastructure:
Active jobs get abandoned mid-execution. The notification worker was processing batch 120 of 500 when I killed it. That progress vanished completely. BullMQ marked the job as "stalled" and wouldn't retry for an hour by default. Meanwhile, I had no audit trail showing which 12,000 users received the broken notification—critical information for the follow-up campaign.
Other legitimate jobs die too. My notification service handles more than just push notifications. It processes:
- Scheduled email campaigns (abandoned mid-send)
- SMS verification codes (users waiting for login codes that never arrived)
- In-app message distribution (promotional content stuck in limbo)
- Weekly digest emails (interrupted halfway through user segments)
Killing the container stopped everything, not just the problematic YouTube notification campaign.
State preservation becomes impossible. Without graceful shutdown, I couldn't:
- Save checkpoints showing partial completion
- Update progress counters for monitoring
- Log which specific recipients received notifications
- Mark the job as "cancelled by admin" vs "failed due to error"
Debugging became archaeology—I spent three hours reconstructing what happened by correlating timestamps across application logs, FCM delivery receipts, and database activity logs.
The stress was unbearable. For 20 minutes after stopping the container, I didn't know if I'd made things worse. Had I corrupted the Redis queue? Would the job retry and send duplicate notifications? Were other services impacted? My team lead was waiting for updates, and I had no clear answers.
The solution I built: API-driven cancellation
After that incident, I made it my priority to build a proper cancellation system. The requirements were clear:
- Stop problematic campaigns immediately - No waiting for jobs to finish naturally
- Zero collateral damage - Other background jobs must continue unaffected
- Clean audit trails - Know exactly who received notifications before cancellation
- Team lead can trigger it - Simple API call, no SSH access needed
The architecture I designed uses Redis as a coordination layer between my API servers and worker processes.
The new workflow:
- Team lead (or any admin) calls DELETE /message/cancel/{jobId} from any API client
- The API sets a cancellation flag in Redis with a TTL
- Workers check this flag at strategic checkpoints during execution
- When detected, workers exit gracefully and log exactly what was completed
- Team lead receives immediate confirmation with partial completion stats
This decouples cancellation (an API operation) from job execution (a worker operation), allowing clean coordination across process boundaries. No SSH required, no container restarts, no collateral damage.
Architecture: Redis as the coordination layer
My stack consists of NestJS API servers that enqueue jobs, BullMQ workers that process them, and Redis as both the queue backend and coordination layer. The cancellation system uses three simple Redis operations:
// utils/cancellation-flag.ts
import { RedisService } from 'src/services/redis.service';
export async function setCancellationFlag(
redisService: RedisService,
jobId: string,
ttlSeconds: number = 3600
): Promise<void> {
const key = `push:cancel:${jobId}`;
await redisService.set(key, '1', ttlSeconds);
console.log(`[setCancellationFlag] Job ${jobId} cancellation flag set (TTL: ${ttlSeconds}s)`);
}
export async function isCancellationFlagSet(
redisService: RedisService,
jobId: string
): Promise<boolean> {
const key = `push:cancel:${jobId}`;
const value = await redisService.get(key);
return value === '1';
}
export async function clearCancellationFlag(
redisService: RedisService,
jobId: string
): Promise<void> {
const key = `push:cancel:${jobId}`;
await redisService.del(key);
console.log(`[clearCancellationFlag] Job ${jobId} cancellation flag cleared`);
}
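The RedisService wrapper those helpers import isn't shown in this post. For context, here's a minimal sketch of what it could look like on top of ioredis; the method signatures match the calls above, but my actual implementation has more around connection handling and error logging:
// services/redis.service.ts (minimal sketch; the real implementation has more error handling)
import { Injectable, OnModuleDestroy } from '@nestjs/common';
import Redis from 'ioredis';

@Injectable()
export class RedisService implements OnModuleDestroy {
  // Connection string comes from the environment; defaults to a local Redis
  private readonly client = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

  // SET with an expiry (EX = TTL in seconds) so flags clean themselves up
  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    await this.client.set(key, value, 'EX', ttlSeconds);
  }

  // Returns null when the key does not exist or has already expired
  async get(key: string): Promise<string | null> {
    return this.client.get(key);
  }

  async del(key: string): Promise<void> {
    await this.client.del(key);
  }

  onModuleDestroy(): void {
    this.client.disconnect();
  }
}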
The one-hour TTL is deliberate. Notification jobs typically complete within 30 minutes, so an hour provides a safety margin without cluttering Redis with permanent flags. If a flag is set but the job finishes before hitting a checkpoint, the flag expires harmlessly after an hour.
Why Redis instead of database flags? Three reasons:
- Latency - Redis GET operations take <1ms vs 10-50ms for database queries
- Frequency - Workers check cancellation flags every few seconds during long jobs
- TTL support - Automatic cleanup without cron jobs or manual deletion
Two cancellation scenarios that matter
Not all jobs are equal when it comes to cancellation. My implementation handles two distinct scenarios:
Scenario 1: Queued jobs (immediate removal)
Jobs waiting in the delayed or waiting states haven't started processing yet. These can be removed from the queue entirely with no side effects:
// controller/firebase.controller.ts
@Delete('cancel/:jobId')
async cancelNotification(@Param('jobId') jobId: string) {
const job = await this.pushQueue.getJob(jobId);
if (!job) {
return {
status: 'not_found',
message: 'Job not found in queue',
};
}
const state = await job.getState();
console.log(`[cancelNotification] Job ${jobId} state: ${state}`);
// Immediate removal for queued jobs
if (state === 'waiting' || state === 'delayed') {
await job.remove();
console.log(`[cancelNotification] Job ${jobId} removed from queue (state: ${state})`);
return {
status: 'removed',
message: 'Job removed from queue before processing started',
};
}
// Graceful cancellation for active jobs
if (state === 'active') {
await setCancellationFlag(this.redisService, jobId, 3600);
console.log(`[cancelNotification] Job ${jobId} cancellation flag set`);
return {
status: 'cancelling',
message: 'Cancellation requested. Job will stop at next checkpoint.',
};
}
// Already completed jobs
if (state === 'completed') {
return {
status: 'already_completed',
message: 'Job already finished, cannot cancel',
};
}
// Failed jobs
if (state === 'failed') {
return {
status: 'already_failed',
message: 'Job already failed, no cancellation needed',
};
}
throw new BadRequestException(`Cannot cancel job in state: ${state}`);
}
This approach prevents unnecessary work. Why let a doomed job start executing when I can pull it from the queue entirely?
Real-world impact: In the YouTube Live incident, if the system had been in place, my team lead could have cancelled the job after 12,000 sends while it was queued for the next batch. No container restart, no collateral damage, clean audit trail.
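One wiring detail the controller excerpt glosses over: pushQueue and redisService are injected through NestJS. A rough sketch of the constructor, assuming @nestjs/bullmq and a queue name of 'push-notifications' (both are assumptions, not the exact values from my codebase):
// controller/firebase.controller.ts (constructor sketch; queue name and module wiring are assumptions)
import { Controller, Delete, Param, BadRequestException } from '@nestjs/common';
import { InjectQueue } from '@nestjs/bullmq';
import { Queue } from 'bullmq';
import { RedisService } from 'src/services/redis.service';

@Controller('message')
export class FirebaseController {
  constructor(
    // Gives access to getJob()/remove() for the queued-job path
    @InjectQueue('push-notifications') private readonly pushQueue: Queue,
    // Used to set the cancellation flag for the active-job path
    private readonly redisService: RedisService,
  ) {}

  // ...cancelNotification() from above lives here
}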
Scenario 2: Active jobs (graceful completion)
Active jobs require nuance. A worker might be halfway through sending 10,000 notifications when the cancellation request arrives. Here's the critical constraint: I cannot recall notifications already sent to Firebase Cloud Messaging. Once FCM receives the request, it's in flight and cannot be cancelled.
My solution: let the current chunk complete, then exit gracefully. This prevents:
- Partial batches that complicate audit logs
- Duplicate notifications if the job retries
- Inconsistent state between database logs and actual FCM deliveries
The worker checks for cancellation at two strategic points:
// service/firebase.service.ts
async sendConditionalNotifications(
jobData: ConditionalNotificationParams
): Promise<boolean> {
const { jobId, title, content, chunkSize = 500, chunkDelay = 2000 } = jobData;
console.log(`[sendConditionalNotifications] Job ${jobId} starting...`);
let cursor = null;
let totalSent = 0;
let wasCancelled = false;
const allTokens: string[] = [];
// Database pagination loop
while (true) {
// CHECKPOINT 1: Before database fetch
const cancelledDuringFetch = await isCancellationFlagSet(
this.redisService,
jobId
);
if (cancelledDuringFetch) {
console.warn(`[Job ${jobId}] Cancelled during database pagination`);
console.log(`[Job ${jobId}] Tokens fetched so far: ${allTokens.length}`);
wasCancelled = true;
await clearCancellationFlag(this.redisService, jobId);
break;
}
// Fetch next page of recipients
const { tokens, nextCursor } = await this.tokenService.getValidTokens(
cursor,
10000
);
if (tokens.length === 0) break;
allTokens.push(...tokens);
cursor = nextCursor;
if (!cursor) break; // No more pages
}
if (wasCancelled) {
console.log(`[Job ${jobId}] Cancelled before sending. No notifications sent.`);
return false;
}
// Split into chunks for rate-limited sending
const chunks = chunkArray(allTokens, chunkSize);
console.log(`[Job ${jobId}] Sending ${allTokens.length} notifications in ${chunks.length} chunks`);
for (let chunkIndex = 0; chunkIndex < chunks.length; chunkIndex++) {
// CHECKPOINT 2: Before each chunk send
const cancelledBeforeChunk = await isCancellationFlagSet(
this.redisService,
jobId
);
if (cancelledBeforeChunk) {
console.warn(`[Job ${jobId}] Cancelled before chunk ${chunkIndex + 1}/${chunks.length}`);
console.log(`[Job ${jobId}] Already sent: ${totalSent}/${allTokens.length} notifications`);
wasCancelled = true;
await clearCancellationFlag(this.redisService, jobId);
break;
}
const chunk = chunks[chunkIndex];
try {
// Apply rate limiting delay (except for first chunk)
if (chunkIndex > 0) {
console.log(`[Job ${jobId}] Waiting ${chunkDelay}ms before chunk ${chunkIndex + 1}`);
await delay(chunkDelay);
}
// Send notification batch to Firebase
const response = await this.fcmService.sendBatch(chunk, title, content);
totalSent += response.successCount;
console.log(`[Job ${jobId}] Chunk ${chunkIndex + 1}/${chunks.length} completed: ${response.successCount}/${chunk.length} sent`);
} catch (error) {
console.error(`[Job ${jobId}] Chunk ${chunkIndex + 1} error:`, error);
}
}
// Final summary logging
console.log(`\n========== [${jobId}] Final Results ==========`);
console.log(`Status: ${wasCancelled ? 'CANCELLED' : 'COMPLETED'}`);
console.log(`Total sent: ${totalSent}/${allTokens.length} (${((totalSent / allTokens.length) * 100).toFixed(2)}%)`);
console.log(`Not sent: ${allTokens.length - totalSent} (due to ${wasCancelled ? 'cancellation' : 'errors'})`);
console.log(`=============================================\n`);
return !wasCancelled;
}
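The chunkArray and delay helpers used above are small utilities rather than library functions; here's one way they could look:
// utils/chunk.ts (sketch of the small helpers referenced above)
export function chunkArray<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Promise-based sleep used for the rate-limiting pause between chunks
export function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}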
Why two checkpoints?
Checkpoint 1 (before database fetch): Prevents wasting database resources when cancellation is requested early. If my team lead cancels while the job is still fetching user records, this checkpoint stops the query loop immediately.
Checkpoint 2 (before each chunk): Catches cancellations during the sending phase. Firebase Cloud Messaging enforces rate limits (600,000 requests/minute), so large campaigns take time. This checkpoint lets me stop between chunks without abandoning in-flight FCM requests.
What would happen now if the YouTube Live incident occurred again
If the same incident happened today with this system in place, here's how it would play out:
12:30:00 - Marketing team discovers mobile URL issue, reports to team lead
12:30:15 - Team lead calls me: "Stop the notification immediately"
12:30:20 - I identify the problematic job ID from logs: conditional-abc-123
12:30:25 - I call the cancellation API via Postman:
DELETE https://api.example.com/message/cancel/conditional-abc-123
12:30:26 - API responds immediately:
{
"status": "cancelling",
"message": "Cancellation requested. Job will stop at next checkpoint."
}
12:30:28 - Worker finishes sending current chunk (batch 121 of 500)
12:30:29 - Worker checks cancellation flag before chunk 122
12:30:30 - Worker exits gracefully and logs final stats:
Status: CANCELLED
Total sent: 12,100/50,000 (24.2%)
Not sent: 37,900 (due to cancellation)
Total elapsed time: 30 seconds from team lead's order to clean stop.
Outcome:
- No container restart required
- Email campaigns continue unaffected
- SMS alerts continue unaffected
- Clean audit log showing exactly which 12,100 users received the broken notification
- Marketing team can target the remaining 37,900 users with corrected URL
- No other team members affected
- Team lead receives immediate confirmation with stats
I can confidently report back to my team lead: "Stopped at 12,100 notifications sent. The remaining 37,900 were not sent. All other services running normally."
Edge cases I learned the hard way
Already-sent notifications can't be recalled
This is fundamental to push notification systems. Once I call admin.messaging().sendEach(), Firebase takes control. Even if my server crashes, even if I cancel the job, those notifications are in FCM's delivery queue and will be delivered.
My cancellation system acknowledges this reality—it stops future sends, but makes no promises about notifications already dispatched to FCM. This is why checkpoint placement matters. I check flags before calling FCM, not after.
When my team lead asks "Can you recall the notifications already sent?", I can now give a clear answer: "No, but I've stopped all future sends. Here's exactly how many were sent before cancellation."
FCM rate limits create natural cancellation windows
Firebase Cloud Messaging limits requests to 600,000 per minute per project. For large campaigns sending 50,000 notifications, I chunk into batches of 500 with 2-second delays between chunks. This means:
- Total chunks: 100
- Total time: ~3.5 minutes
- Cancellation windows: 100 opportunities (before each chunk)
The deliberate pacing for rate limiting creates frequent checkpoints where cancellation can occur cleanly.
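The numbers above fall straight out of the chunking parameters. A tiny helper makes the relationship explicit (the per-chunk FCM latency here is a rough assumption, not a measured value):
// Back-of-the-envelope estimate of campaign duration and cancellation windows (sketch)
function estimateCampaign(totalTokens: number, chunkSize: number, chunkDelayMs: number) {
  const chunks = Math.ceil(totalTokens / chunkSize);        // 50,000 / 500 = 100
  const assumedFcmLatencyMs = 100;                          // assumption: rough per-sendBatch latency
  const totalMs = chunks * assumedFcmLatencyMs + (chunks - 1) * chunkDelayMs;
  return {
    chunks,                                                 // also the number of cancellation windows
    estimatedMinutes: (totalMs / 60_000).toFixed(1),
  };
}

console.log(estimateCampaign(50_000, 500, 2_000)); // { chunks: 100, estimatedMinutes: '3.5' }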
Stale cancellation flags waste Redis memory
Without TTL, cancelled job flags accumulate forever. A busy notification service might create dozens of cancellation flags weekly. After 6 months, that's thousands of orphaned keys consuming memory.
The one-hour TTL strikes a balance:
- Long enough for any job to complete (my longest jobs take ~30 minutes)
- Short enough to prevent unbounded growth
- Automatic cleanup without manual intervention
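If I ever suspect the TTL isn't doing its job, a quick count of live flags settles it. A sketch using ioredis scanStream (SCAN rather than KEYS, so it's safe to run against a production instance):
// Count live cancellation flags; useful for spotting TTL problems (sketch)
import Redis from 'ioredis';

export function countCancellationFlags(redis: Redis): Promise<number> {
  return new Promise((resolve, reject) => {
    let count = 0;
    const stream = redis.scanStream({ match: 'push:cancel:*', count: 100 });
    stream.on('data', (keys: string[]) => {
      count += keys.length;
    });
    stream.on('end', () => resolve(count));
    stream.on('error', reject);
  });
}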
Workers crash mid-execution
BullMQ marks jobs as "stalled" after 30 seconds of inactivity. If a worker crashes during notification sending, the job eventually returns to the queue for retry. My cancellation flag persists in Redis independently, so:
- Worker A starts job, processes 5,000 notifications
- Team lead cancels the job via the API, which sets the Redis flag
- Worker A crashes before seeing the flag
- BullMQ marks job as stalled after 30 seconds
- Worker B picks up stalled job
- Worker B immediately checks Redis flag, sees cancellation
- Worker B exits gracefully without sending more
The Redis flag outlives individual worker processes, ensuring cancellation intent persists across retries and crashes.
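Here's a sketch of how the worker side could make that first check as soon as it picks up a job that may have been retried; the queue name, connection details, and service wiring are assumptions, but the flag check mirrors the service code above:
// worker/push.worker.ts (sketch; queue name, connection, and service wiring are assumptions)
import { Worker } from 'bullmq';
import { RedisService } from 'src/services/redis.service';
import { FirebaseService } from 'src/service/firebase.service';
import { isCancellationFlagSet, clearCancellationFlag } from 'src/utils/cancellation-flag';

export function createPushWorker(redisService: RedisService, firebaseService: FirebaseService) {
  return new Worker(
    'push-notifications',
    async (job) => {
      const { jobId } = job.data;

      // First thing a (re)started worker does: honor any cancellation flag set while a
      // previous worker held this job. This is what makes crash recovery safe.
      if (await isCancellationFlagSet(redisService, jobId)) {
        await clearCancellationFlag(redisService, jobId);
        console.warn(`[Worker] Job ${jobId} was cancelled while stalled, skipping`);
        return false; // mirror the service's "cancelled" return value
      }

      // Otherwise process normally; the service has its own in-flight checkpoints
      return firebaseService.sendConditionalNotifications(job.data);
    },
    { connection: { host: 'localhost', port: 6379 } },
  );
}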
Testing cancellation logic
I test cancellation with three scenarios that mirror real production failures:
Test 1: Cancel before job starts
describe('Cancellation API', () => {
it('should remove waiting job from queue', async () => {
// Queue a job but don't process yet
const job = await queue.add('send-conditional-notification', {
jobId: 'test-123',
title: 'Test Notification',
content: 'This should be cancelled',
});
// Verify job is in waiting state
expect(await job.getState()).toBe('waiting');
// Cancel immediately via API
const response = await request(app.getHttpServer())
.delete(`/message/cancel/${job.id}`)
.expect(200);
expect(response.body.status).toBe('removed');
// Job should no longer exist in queue
const jobAfterCancel = await queue.getJob(job.id);
expect(jobAfterCancel).toBeNull();
});
});
Test 2: Cancel during execution
it('should gracefully stop active job mid-execution', async () => {
// Mock token service to return 10,000 tokens
jest.spyOn(tokenService, 'getValidTokens').mockResolvedValue({
tokens: Array(10000).fill('fake-token-xxx'),
nextCursor: null,
});
// Start job
const job = await queue.add('send-conditional-notification', {
jobId: 'test-456',
title: 'Long Campaign',
content: 'This will be cancelled mid-send',
chunkSize: 100,
chunkDelay: 100,
});
// Wait for job to become active
await waitForJobState(job, 'active');
// Let it send a few chunks
await delay(500);
// Cancel mid-execution via API
const response = await request(app.getHttpServer())
.delete(`/message/cancel/${job.id}`)
.expect(200);
expect(response.body.status).toBe('cancelling');
// Wait for job to complete
const result = await job.waitUntilFinished(queueEvents);
// Should have sent some but not all
expect(result.sent).toBeGreaterThan(0);
expect(result.sent).toBeLessThan(10000);
expect(result.status).toBe('cancelled');
});
Test 3: Race condition—cancel during chunk send
it('should complete in-flight chunk before cancelling', async () => {
// Mock FCM to take 3 seconds per chunk
jest.spyOn(fcmService, 'sendBatch')
.mockImplementation(() =>
delay(3000).then(() => ({
successCount: 100,
failureCount: 0,
}))
);
const job = await queue.add('send-conditional-notification', {
jobId: 'test-789',
title: 'Race Condition Test',
content: 'Testing chunk completion',
chunkSize: 100,
});
await waitForJobState(job, 'active');
// Cancel 1.5 seconds into chunk send (mid-flight)
await delay(1500);
await request(app.getHttpServer())
.delete(`/message/cancel/${job.id}`)
.expect(200);
const result = await job.waitUntilFinished(queueEvents);
// Should have completed at least the in-flight chunk
expect(result.sent).toBeGreaterThanOrEqual(100);
// But should not have started the next chunk
expect(result.sent).toBeLessThan(200);
});
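The waitForJobState helper in these tests isn't part of BullMQ; it's a small polling utility. A possible implementation:
// test/helpers.ts (sketch of the polling helper used in the tests above)
import { Job } from 'bullmq';

export async function waitForJobState(
  job: Job,
  targetState: string,
  timeoutMs = 10_000,
  pollMs = 100,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    // getState() resolves to strings like 'waiting', 'active', 'completed', 'failed'
    if ((await job.getState()) === targetState) return;
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  throw new Error(`Job ${job.id} did not reach state "${targetState}" within ${timeoutMs}ms`);
}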
These tests validate the core guarantees:
- Queued jobs are removed instantly
- Active jobs stop at the next checkpoint
- In-flight chunks complete before stopping
Production monitoring essentials
I track four key metrics in Datadog so I can report status to my team lead immediately:
1. Cancellation response time
- Definition: Time between API request and worker acknowledgment
- Target: <5 seconds
- Alert: >30 seconds (indicates checkpoint placement issues)
- Query: max:bullmq.job.cancellation_latency{env:prod}
2. Partial completion rate
- Definition: Percentage of notifications sent before cancellation
- Target: <50% (cancelling early enough to matter)
- Alert: Consistently >90% (cancelling too late)
- Query: avg:bullmq.job.cancelled_percentage{env:prod}
3. Stale flag cleanup
- Definition: Count of active cancellation flags in Redis
- Target: <100 keys at any time
- Alert: >1,000 keys (TTL not working)
- Query: redis.keys.count{key:push:cancel:*,env:prod}
4. False cancellations
- Definition: Jobs marked cancelled but no Redis flag found
- Target: 0 occurrences
- Alert: >1 per day (indicates race conditions)
- Query: sum:bullmq.job.cancellation_mismatch{env:prod}.as_count()
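The bullmq.job.* series are custom metrics, so the worker has to emit them when a cancellation is handled. A sketch using the hot-shots DogStatsD client; the client setup and the idea of recording the request timestamp are assumptions, not part of the code above:
// monitoring/cancellation-metrics.ts (sketch, assuming the hot-shots DogStatsD client)
import StatsD from 'hot-shots';

const statsd = new StatsD({ prefix: 'bullmq.job.', globalTags: ['env:prod'] });

// Called by the worker when it exits at a cancellation checkpoint. requestedAtMs is an
// assumption: the cancellation request time would need to be recorded somewhere (e.g. a second Redis key).
export function recordCancellation(jobId: string, sent: number, total: number, requestedAtMs: number): void {
  // Metric 1: time from API request to worker acknowledgment
  statsd.timing('cancellation_latency', Date.now() - requestedAtMs);

  // Metric 2: how much of the campaign had already gone out before the stop
  statsd.gauge('cancelled_percentage', total > 0 ? (sent / total) * 100 : 0);

  console.log(`[Metrics] Job ${jobId}: cancelled at ${sent}/${total} sends`);
}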
When Docker restart is still the answer
API-based cancellation isn't always appropriate. I still use container restarts for:
Code deployment - Updating application logic requires process restart. Blue-green deployment handles this gracefully (covered in my other blog post).
Memory leaks - When worker memory consumption grows unbounded (e.g., 4GB+ on an 8GB instance), restart is the only option. I monitor with container.memory.usage alerts.
Redis connection failures - When the coordination layer itself fails, workers can't check cancellation flags anyway. Container restart re-establishes connections.
Database migration rollbacks - Breaking schema changes sometimes require immediate worker shutdown to prevent data corruption.
But for operational mistakes—wrong targeting, wrong content, wrong URLs, wrong anything—graceful cancellation prevents collateral damage while maintaining system stability. My team lead can now make the call to cancel without worrying about breaking other services.
Key takeaways
Building this cancellation system taught me that distributed systems need coordination primitives, not just command-and-control. Redis flags let API servers and workers collaborate across process boundaries with sub-millisecond overhead.
The two-tier cancellation strategy—immediate removal for queued jobs, graceful completion for active ones—acknowledges that some work can be prevented while other work can only be stopped cleanly at checkpoint boundaries.
Most importantly, cancellation in distributed systems isn't about rollback—it's about damage control. I can't unsend notifications already dispatched to Firebase, but I can:
- Stop wasting resources on doomed work
- Provide clean audit trails showing exactly what completed
- Maintain stability for unrelated background jobs
- Give my team lead immediate, accurate status updates
- Sleep better knowing I have an emergency brake that actually works
If I could go back to that stressful YouTube Live incident, this system would have:
- Let my team lead stop the campaign with a single API call
- Saved me from SSH'ing into production and killing containers
- Prevented collateral damage to other services
- Given the marketing team clean data showing exactly who received the broken notification
- Eliminated hours of debugging and service recovery