My hands were shaking as I watched the notification counter climb: 5,000 sent... 8,000 sent... 12,000 sent. I had just launched a push notification to every user in the database with a YouTube Live stream URL. The problem? The URL worked perfectly on desktop web, but mobile users were seeing error screens instead of the livestream.
The Planning Dept. had sent me the wrong URL format—one that wasn't mobile-compatible. Within minutes, my team lead called: "Stop the notification immediately. Mobile users can't access the stream."
I had about 90 seconds to make a decision. My only option at the time? Force-stop the Docker container and pray.
That terrifying moment taught me something crucial: when you're sending mass notifications, you need an elegant cancellation mechanism, not emergency infrastructure shutdowns. In this post, I'll show you how I built a graceful cancellation system using Redis coordination flags that lets me stop notification campaigns mid-flight without destroying my entire worker infrastructure.
The YouTube Live incident
Here's what happened that day. The marketing team planned a major YouTube Live event and needed to notify all app users. They requested a push notification with the livestream link. I received the URL from Planning Dept., tested it quickly on my desktop browser—worked fine—and queued the notification job for all 50,000+ active users.
The notification sending started smoothly. Then, about three minutes in, my team lead called urgently. The marketing team had just discovered that the YouTube URL format didn't work on mobile devices. Desktop users could watch the stream perfectly, but mobile app users kept getting redirected to the YouTube channel page instead of the live stream page.
The long-form URL didn't deep-link properly into the YouTube mobile app. By the time my team lead gave the order to stop, 12,000 notifications had already been sent. I had no choice—I needed to stop the remaining 38,000 notifications from going out.
My only option: emergency Docker shutdown.
I SSH'd into the EC2 instance and ran:
docker stop messaging-worker
The aftermath was messy:
- Some users received the broken notification, others didn't
- I had no clean record of who got what
- The marketing team had to manually coordinate a follow-up campaign with the correct URL
- Other legitimate background jobs were interrupted
- I spent the next hour restarting services and explaining what happened
I knew this couldn't happen again. There had to be a better way.
The real problem with Docker restarts
When I force-stopped that container, I created cascading failures throughout my infrastructure:
Active jobs get abandoned mid-execution. The notification worker was processing batch 120 of 500 when I killed it. That progress vanished completely. BullMQ marked the job as "stalled" and wouldn't retry for an hour by default. Meanwhile, I had no audit trail showing which 12,000 users received the broken notification—critical information for the follow-up campaign.
Other legitimate jobs die too. My notification service handles more than just push notifications. It processes:
- Scheduled email campaigns (abandoned mid-send)
- SMS verification codes (users waiting for login codes that never arrived)
- In-app message distribution (promotional content stuck in limbo)
- Weekly digest emails (interrupted halfway through user segments)
Killing the container stopped everything, not just the problematic YouTube notification campaign.
State preservation becomes impossible. Without graceful shutdown, I couldn't:
- Save checkpoints showing partial completion
- Update progress counters for monitoring
- Log which specific recipients received notifications
- Mark the job as "cancelled by admin" vs "failed due to error"
Debugging became archaeology—I spent three hours reconstructing what happened by correlating timestamps across application logs, FCM delivery receipts, and database activity logs.
The stress was unbearable. For 20 minutes after stopping the container, I didn't know if I'd made things worse. Had I corrupted the Redis queue? Would the job retry and send duplicate notifications? Were other services impacted? My team lead was waiting for updates, and I had no clear answers.
The solution I built: API-driven cancellation
After that incident, I made it my priority to build a proper cancellation system. The requirements were clear:
- Stop problematic campaigns immediately - No waiting for jobs to finish naturally
- Zero collateral damage - Other background jobs must continue unaffected
- Clean audit trails - Know exactly who received notifications before cancellation
- Team lead can trigger it - Simple API call, no SSH access needed
The architecture I designed uses Redis as a coordination layer between my API servers and worker processes.
The new workflow:
- Team lead (or any admin) calls DELETE /message/cancel/{jobId} from any API client
- The API sets a cancellation flag in Redis with a TTL
- Workers check this flag at strategic checkpoints during execution
- When detected, workers exit gracefully and log exactly what was completed
- Team lead receives immediate confirmation with partial completion stats
This decouples cancellation (an API operation) from job execution (a worker operation), allowing clean coordination across process boundaries. No SSH required, no container restarts, no collateral damage.
Architecture: Redis as the coordination layer
My stack consists of NestJS API servers that enqueue jobs, BullMQ workers that process them, and Redis as both the queue backend and coordination layer. The cancellation system uses three simple Redis operations:
// utils/cancellation-flag.ts
import { RedisService } from 'src/services/redis.service';
export async function setCancellationFlag(
redisService: RedisService,
jobId: string,
ttlSeconds: number = 3600
): Promise<void> {
const key = `push:cancel:${jobId}`;
await redisService.set(key, '1', ttlSeconds);
console.log(`[setCancellationFlag] Job ${jobId} cancellation flag set (TTL: ${ttlSeconds}s)`);
}
export async function isCancellationFlagSet(
redisService: RedisService,
jobId: string
): Promise<boolean> {
const key = `push:cancel:${jobId}`;
const value = await redisService.get(key);
return value === '1';
}
export async function clearCancellationFlag(
redisService: RedisService,
jobId: string
): Promise<void> {
const key = `push:cancel:${jobId}`;
await redisService.del(key);
console.log(`[clearCancellationFlag] Job ${jobId} cancellation flag cleared`);
}
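The RedisService wrapper those helpers import isn't shown in this post. For context, here's a minimal sketch of what it could look like on top of ioredis; the method signatures match the calls above, but my actual implementation has more around connection handling and error logging:
// services/redis.service.ts (minimal sketch; the real implementation has more error handling)
import { Injectable, OnModuleDestroy } from '@nestjs/common';
import Redis from 'ioredis';

@Injectable()
export class RedisService implements OnModuleDestroy {
  // Connection string comes from the environment; defaults to a local Redis
  private readonly client = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

  // SET with an expiry (EX = TTL in seconds) so flags clean themselves up
  async set(key: string, value: string, ttlSeconds: number): Promise<void> {
    await this.client.set(key, value, 'EX', ttlSeconds);
  }

  // Returns null when the key does not exist or has already expired
  async get(key: string): Promise<string | null> {
    return this.client.get(key);
  }

  async del(key: string): Promise<void> {
    await this.client.del(key);
  }

  onModuleDestroy(): void {
    this.client.disconnect();
  }
}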
The one-hour TTL is deliberate. Notification jobs typically complete within 30 minutes, so an hour provides a safety margin without cluttering Redis with permanent flags. If a flag is set but the job finishes before hitting a checkpoint, the flag expires harmlessly after an hour.
Why Redis instead of database flags? Three reasons:
- Latency - Redis GET operations take <1ms vs 10-50ms for database queries
- Frequency - Workers check cancellation flags every few seconds during long jobs
- TTL support - Automatic cleanup without cron jobs or manual deletion
Two cancellation scenarios that matter
Not all jobs are equal when it comes to cancellation. My implementation handles two distinct scenarios:
Scenario 1: Queued jobs (immediate removal)
Jobs waiting in the delayed or waiting states haven't started processing yet. These can be removed from the queue entirely with no side effects:
// controller/firebase.controller.ts
@Delete('cancel/:jobId')
async cancelNotification(@Param('jobId') jobId: string) {
const job = await this.pushQueue.getJob(jobId);
if (!job) {
return {
status: 'not_found',
message: 'Job not found in queue',
};
}
const state = await job.getState();
console.log(`[cancelNotification] Job ${jobId} state: ${state}`);
// Immediate removal for queued jobs
if (state === 'waiting' || state === 'delayed') {
await job.remove();
console.log(`[cancelNotification] Job ${jobId} removed from queue (state: ${state})`);
return {
status: 'removed',
message: 'Job removed from queue before processing started',
};
}
// Graceful cancellation for active jobs
if (state === 'active') {
await setCancellationFlag(this.redisService, jobId, 3600);
console.log(`[cancelNotification] Job ${jobId} cancellation flag set`);
return {
status: 'cancelling',
message: 'Cancellation requested. Job will stop at next checkpoint.',
};
}
// Already completed jobs
if (state === 'completed') {
return {
status: 'already_completed',
message: 'Job already finished, cannot cancel',
};
}
// Failed jobs
if (state === 'failed') {
return {
status: 'already_failed',
message: 'Job already failed, no cancellation needed',
};
}
throw new BadRequestException(`Cannot cancel job in state: ${state}`);
}
This approach prevents unnecessary work. Why let a doomed job start executing when I can pull it from the queue entirely?
Real-world impact: In the YouTube Live incident, if the system had been in place, my team lead could have cancelled the job after 12,000 sends while it was queued for the next batch. No container restart, no collateral damage, clean audit trail.
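One wiring detail the controller excerpt glosses over: pushQueue and redisService are injected through NestJS. A rough sketch of the constructor, assuming @nestjs/bullmq and a queue name of 'push-notifications' (both are assumptions, not the exact values from my codebase):
// controller/firebase.controller.ts (constructor sketch; queue name and module wiring are assumptions)
import { Controller, Delete, Param, BadRequestException } from '@nestjs/common';
import { InjectQueue } from '@nestjs/bullmq';
import { Queue } from 'bullmq';
import { RedisService } from 'src/services/redis.service';

@Controller('message')
export class FirebaseController {
  constructor(
    // Gives access to getJob()/remove() for the queued-job path
    @InjectQueue('push-notifications') private readonly pushQueue: Queue,
    // Used to set the cancellation flag for the active-job path
    private readonly redisService: RedisService,
  ) {}

  // ...cancelNotification() from above lives here
}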
Scenario 2: Active jobs (graceful completion)
Active jobs require nuance. A worker might be halfway through sending 10,000 notifications when the cancellation request arrives. Here's the critical constraint: I cannot recall notifications already sent to Firebase Cloud Messaging. Once FCM receives the request, it's in flight and cannot be cancelled.
My solution: let the current chunk complete, then exit gracefully. This prevents:
- Partial batches that complicate audit logs
- Duplicate notifications if the job retries
- Inconsistent state between database logs and actual FCM deliveries
The worker checks for cancellation at two strategic points:
// service/firebase.service.ts
async sendConditionalNotifications(
jobData: ConditionalNotificationParams
): Promise<boolean> {
const { jobId, title, content, chunkSize = 500, chunkDelay = 2000 } = jobData;
console.log(`[sendConditionalNotifications] Job ${jobId} starting...`);
let cursor = null;
let totalSent = 0;
let wasCancelled = false;
const allTokens: string[] = [];
// Database pagination loop
while (true) {
// CHECKPOINT 1: Before database fetch
const cancelledDuringFetch = await isCancellationFlagSet(
this.redisService,
jobId
);
if (cancelledDuringFetch) {
console.warn(`[Job ${jobId}] Cancelled during database pagination`);
console.log(`[Job ${jobId}] Tokens fetched so far: ${allTokens.length}`);
wasCancelled = true;
await clearCancellationFlag(this.redisService, jobId);
break;
}
// Fetch next page of recipients
const { tokens, nextCursor } = await this.tokenService.getValidTokens(
cursor,
10000
);
if (tokens.length === 0) break;
allTokens.push(...tokens);
cursor = nextCursor;
if (!cursor) break; // No more pages
}
if (wasCancelled) {
console.log(`[Job ${jobId}] Cancelled before sending. No notifications sent.`);
return false;
}
// Split into chunks for rate-limited sending
const chunks = chunkArray(allTokens, chunkSize);
console.log(`[Job ${jobId}] Sending ${allTokens.length} notifications in ${chunks.length} chunks`);
for (let chunkIndex = 0; chunkIndex < chunks.length; chunkIndex++) {
// CHECKPOINT 2: Before each chunk send
const cancelledBeforeChunk = await isCancellationFlagSet(
this.redisService,
jobId
);
if (cancelledBeforeChunk) {
console.warn(`[Job ${jobId}] Cancelled before chunk ${chunkIndex + 1}/${chunks.length}`);
console.log(`[Job ${jobId}] Already sent: ${totalSent}/${allTokens.length} notifications`);
wasCancelled = true;
await clearCancellationFlag(this.redisService, jobId);
break;
}
const chunk = chunks[chunkIndex];
try {
// Apply rate limiting delay (except for first chunk)
if (chunkIndex > 0) {
console.log(`[Job ${jobId}] Waiting ${chunkDelay}ms before chunk ${chunkIndex + 1}`);
await delay(chunkDelay);
}
// Send notification batch to Firebase
const response = await this.fcmService.sendBatch(chunk, title, content);
totalSent += response.successCount;
console.log(`[Job ${jobId}] Chunk ${chunkIndex + 1}/${chunks.length} completed: ${response.successCount}/${chunk.length} sent`);
} catch (error) {
console.error(`[Job ${jobId}] Chunk ${chunkIndex + 1} error:`, error);
}
}
// Final summary logging
console.log(`\n========== [${jobId}] Final Results ==========`);
console.log(`Status: ${wasCancelled ? 'CANCELLED' : 'COMPLETED'}`);
console.log(`Total sent: ${totalSent}/${allTokens.length} (${((totalSent / allTokens.length) * 100).toFixed(2)}%)`);
console.log(`Not sent: ${allTokens.length - totalSent} (due to ${wasCancelled ? 'cancellation' : 'errors'})`);
console.log(`=============================================\n`);
return !wasCancelled;
}
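The chunkArray and delay helpers used above are small utilities rather than library functions; here's one way they could look:
// utils/chunk.ts (sketch of the small helpers referenced above)
export function chunkArray<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Promise-based sleep used for the rate-limiting pause between chunks
export function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}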
Why two checkpoints?
Checkpoint 1 (before database fetch): Prevents wasting database resources when cancellation is requested early. If my team lead cancels while the job is still fetching user records, this checkpoint stops the query loop immediately.
Checkpoint 2 (before each chunk): Catches cancellations during the sending phase. Firebase Cloud Messaging enforces rate limits (600,000 requests/minute), so large campaigns take time. This checkpoint lets me stop between chunks without abandoning in-flight FCM requests.
What would happen now if the YouTube Live incident occurred again
If the same incident happened today with this system in place, here's how it would play out:
12:30:00 - Marketing team discovers mobile URL issue, reports to team lead
12:30:15 - Team lead calls me: "Stop the notification immediately"
12:30:20 - I identify the problematic job ID from logs: conditional-abc-123
12:30:25 - I call the cancellation API via Postman:
DELETE https://api.example.com/message/cancel/conditional-abc-123
12:30:26 - API responds immediately:
{
"status": "cancelling",
"message": "Cancellation requested. Job will stop at next checkpoint."
}
12:30:28 - Worker finishes sending current chunk (batch 121 of 500)
12:30:29 - Worker checks cancellation flag before chunk 122
12:30:30 - Worker exits gracefully and logs final stats:
Status: CANCELLED
Total sent: 12,100/50,000 (24.2%)
Not sent: 37,900 (due to cancellation)
Total elapsed time: 30 seconds from team lead's order to clean stop.
Outcome:
- No container restart required
- Email campaigns continue unaffected
- SMS alerts continue unaffected
- Clean audit log showing exactly which 12,100 users received the broken notification
- Marketing team can target the remaining 37,900 users with corrected URL
- No other team members affected
- Team lead receives immediate confirmation with stats
I can confidently report back to my team lead: "Stopped at 12,100 notifications sent. The remaining 37,900 were not sent. All other services running normally."
Edge cases I learned the hard way
Already-sent notifications can't be recalled
This is fundamental to push notification systems. Once I call admin.messaging().sendEach(), Firebase takes control. Even if my server crashes, even if I cancel the job, those notifications are in FCM's delivery queue and will be delivered.
My cancellation system acknowledges this reality—it stops future sends, but makes no promises about notifications already dispatched to FCM. This is why checkpoint placement matters. I check flags before calling FCM, not after.
When my team lead asks "Can you recall the notifications already sent?", I can now give a clear answer: "No, but I've stopped all future sends. Here's exactly how many were sent before cancellation."
FCM rate limits create natural cancellation windows
Firebase Cloud Messaging limits requests to 600,000 per minute per project. For large campaigns sending 50,000 notifications, I chunk into batches of 500 with 2-second delays between chunks. This means:
- Total chunks: 100
- Total time: ~3.5 minutes
- Cancellation windows: 100 opportunities (before each chunk)
The deliberate pacing for rate limiting creates frequent checkpoints where cancellation can occur cleanly.
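The numbers above fall straight out of the chunking parameters. A tiny helper makes the relationship explicit (the per-chunk FCM latency here is a rough assumption, not a measured value):
// Back-of-the-envelope estimate of campaign duration and cancellation windows (sketch)
function estimateCampaign(totalTokens: number, chunkSize: number, chunkDelayMs: number) {
  const chunks = Math.ceil(totalTokens / chunkSize);        // 50,000 / 500 = 100
  const assumedFcmLatencyMs = 100;                          // assumption: rough per-sendBatch latency
  const totalMs = chunks * assumedFcmLatencyMs + (chunks - 1) * chunkDelayMs;
  return {
    chunks,                                                 // also the number of cancellation windows
    estimatedMinutes: (totalMs / 60_000).toFixed(1),
  };
}

console.log(estimateCampaign(50_000, 500, 2_000)); // { chunks: 100, estimatedMinutes: '3.5' }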
Stale cancellation flags waste Redis memory
Without TTL, cancelled job flags accumulate forever. A busy notification service might create dozens of cancellation flags weekly. After 6 months, that's thousands of orphaned keys consuming memory.
The one-hour TTL strikes a balance:
- Long enough for any job to complete (my longest jobs take ~30 minutes)
- Short enough to prevent unbounded growth
- Automatic cleanup without manual intervention
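If I ever suspect the TTL isn't doing its job, a quick count of live flags settles it. A sketch using ioredis scanStream (SCAN rather than KEYS, so it's safe to run against a production instance):
// Count live cancellation flags; useful for spotting TTL problems (sketch)
import Redis from 'ioredis';

export function countCancellationFlags(redis: Redis): Promise<number> {
  return new Promise((resolve, reject) => {
    let count = 0;
    const stream = redis.scanStream({ match: 'push:cancel:*', count: 100 });
    stream.on('data', (keys: string[]) => {
      count += keys.length;
    });
    stream.on('end', () => resolve(count));
    stream.on('error', reject);
  });
}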
Workers crash mid-execution
BullMQ marks jobs as "stalled" after 30 seconds of inactivity. If a worker crashes during notification sending, the job eventually returns to the queue for retry. My cancellation flag persists in Redis independently, so:
- Worker A starts job, processes 5,000 notifications
- Team lead cancels the job via the API, which sets the Redis flag
- Worker A crashes before seeing the flag
- BullMQ marks job as stalled after 30 seconds
- Worker B picks up stalled job
- Worker B immediately checks Redis flag, sees cancellation
- Worker B exits gracefully without sending more
The Redis flag outlives individual worker processes, ensuring cancellation intent persists across retries and crashes.
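Here's a sketch of how the worker side could make that first check as soon as it picks up a job that may have been retried; the queue name, connection details, and service wiring are assumptions, but the flag check mirrors the service code above:
// worker/push.worker.ts (sketch; queue name, connection, and service wiring are assumptions)
import { Worker } from 'bullmq';
import { RedisService } from 'src/services/redis.service';
import { FirebaseService } from 'src/service/firebase.service';
import { isCancellationFlagSet, clearCancellationFlag } from 'src/utils/cancellation-flag';

export function createPushWorker(redisService: RedisService, firebaseService: FirebaseService) {
  return new Worker(
    'push-notifications',
    async (job) => {
      const { jobId } = job.data;

      // First thing a (re)started worker does: honor any cancellation flag set while a
      // previous worker held this job. This is what makes crash recovery safe.
      if (await isCancellationFlagSet(redisService, jobId)) {
        await clearCancellationFlag(redisService, jobId);
        console.warn(`[Worker] Job ${jobId} was cancelled while stalled, skipping`);
        return false; // mirror the service's "cancelled" return value
      }

      // Otherwise process normally; the service has its own in-flight checkpoints
      return firebaseService.sendConditionalNotifications(job.data);
    },
    { connection: { host: 'localhost', port: 6379 } },
  );
}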
Testing cancellation logic
I test cancellation with three scenarios that mirror real production failures:
Test 1: Cancel before job starts
describe('Cancellation API', () => {
it('should remove waiting job from queue', async () => {
// Queue a job but don't process yet
const job = await queue.add('send-conditional-notification', {
jobId: 'test-123',
title: 'Test Notification',
content: 'This should be cancelled',
});
// Verify job is in waiting state
expect(await job.getState()).toBe('waiting');
// Cancel immediately via API
const response = await request(app.getHttpServer())
.delete(`/message/cancel/${job.id}`)
.expect(200);
expect(response.body.status).toBe('removed');
// Job should no longer exist in queue
const jobAfterCancel = await queue.getJob(job.id);
expect(jobAfterCancel).toBeNull();
});
});
Test 2: Cancel during execution
it('should gracefully stop active job mid-execution', async () => {
// Mock token service to return 10,000 tokens
jest.spyOn(tokenService, 'getValidTokens').mockResolvedValue({
tokens: Array(10000).fill('fake-token-xxx'),
nextCursor: null,
});
// Start job
const job = await queue.add('send-conditional-notification', {
jobId: 'test-456',
title: 'Long Campaign',
content: 'This will be cancelled mid-send',
chunkSize: 100,
chunkDelay: 100,
});
// Wait for job to become active
await waitForJobState(job, 'active');
// Let it send a few chunks
await delay(500);
// Cancel mid-execution via API
const response = await request(app.getHttpServer())
.delete(`/message/cancel/${job.id}`)
.expect(200);
expect(response.body.status).toBe('cancelling');
// Wait for job to complete
const result = await job.waitUntilFinished(queueEvents);
// Should have sent some but not all
expect(result.sent).toBeGreaterThan(0);
expect(result.sent).toBeLessThan(10000);
expect(result.status).toBe('cancelled');
});
Test 3: Race condition—cancel during chunk send
it('should complete in-flight chunk before cancelling', async () => {
// Mock FCM to take 3 seconds per chunk
jest.spyOn(fcmService, 'sendBatch')
.mockImplementation(() =>
delay(3000).then(() => ({
successCount: 100,
failureCount: 0,
}))
);
const job = await queue.add('send-conditional-notification', {
jobId: 'test-789',
title: 'Race Condition Test',
content: 'Testing chunk completion',
chunkSize: 100,
});
await waitForJobState(job, 'active');
// Cancel 1.5 seconds into chunk send (mid-flight)
await delay(1500);
await request(app.getHttpServer())
.delete(`/message/cancel/${job.id}`)
.expect(200);
const result = await job.waitUntilFinished(queueEvents);
// Should have completed at least the in-flight chunk
expect(result.sent).toBeGreaterThanOrEqual(100);
// But should not have started the next chunk
expect(result.sent).toBeLessThan(200);
});
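The waitForJobState helper in these tests isn't part of BullMQ; it's a small polling utility. A possible implementation:
// test/helpers.ts (sketch of the polling helper used in the tests above)
import { Job } from 'bullmq';

export async function waitForJobState(
  job: Job,
  targetState: string,
  timeoutMs = 10_000,
  pollMs = 100,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    // getState() resolves to strings like 'waiting', 'active', 'completed', 'failed'
    if ((await job.getState()) === targetState) return;
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  throw new Error(`Job ${job.id} did not reach state "${targetState}" within ${timeoutMs}ms`);
}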
These tests validate the core guarantees:
- Queued jobs are removed instantly
- Active jobs stop at the next checkpoint
- In-flight chunks complete before stopping
Production monitoring essentials
I track four key metrics in Datadog so I can report status to my team lead immediately:
1. Cancellation response time
- Definition: Time between API request and worker acknowledgment
- Target: <5 seconds
- Alert: >30 seconds (indicates checkpoint placement issues)
- Query: max:bullmq.job.cancellation_latency{env:prod}
2. Partial completion rate
- Definition: Percentage of notifications sent before cancellation
- Target: <50% (cancelling early enough to matter)
- Alert: Consistently >90% (cancelling too late)
- Query: avg:bullmq.job.cancelled_percentage{env:prod}
3. Stale flag cleanup
- Definition: Count of active cancellation flags in Redis
- Target: <100 keys at any time
- Alert: >1,000 keys (TTL not working)
- Query: redis.keys.count{key:push:cancel:*,env:prod}
4. False cancellations
- Definition: Jobs marked cancelled but no Redis flag found
- Target: 0 occurrences
- Alert: >1 per day (indicates race conditions)
- Query: sum:bullmq.job.cancellation_mismatch{env:prod}.as_count()
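The bullmq.job.* series are custom metrics, so the worker has to emit them when a cancellation is handled. A sketch using the hot-shots DogStatsD client; the client setup and the idea of recording the request timestamp are assumptions, not part of the code above:
// monitoring/cancellation-metrics.ts (sketch, assuming the hot-shots DogStatsD client)
import StatsD from 'hot-shots';

const statsd = new StatsD({ prefix: 'bullmq.job.', globalTags: ['env:prod'] });

// Called by the worker when it exits at a cancellation checkpoint. requestedAtMs is an
// assumption: the cancellation request time would need to be recorded somewhere (e.g. a second Redis key).
export function recordCancellation(jobId: string, sent: number, total: number, requestedAtMs: number): void {
  // Metric 1: time from API request to worker acknowledgment
  statsd.timing('cancellation_latency', Date.now() - requestedAtMs);

  // Metric 2: how much of the campaign had already gone out before the stop
  statsd.gauge('cancelled_percentage', total > 0 ? (sent / total) * 100 : 0);

  console.log(`[Metrics] Job ${jobId}: cancelled at ${sent}/${total} sends`);
}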
When Docker restart is still the answer
API-based cancellation isn't always appropriate. I still use container restarts for:
Code deployment - Updating application logic requires process restart. Blue-green deployment handles this gracefully (covered in my other blog post).
Memory leaks - When worker memory consumption grows unbounded (e.g., 4GB+ on an 8GB instance), restart is the only option. I monitor with container.memory.usage alerts.
Redis connection failures - When the coordination layer itself fails, workers can't check cancellation flags anyway. Container restart re-establishes connections.
Database migration rollbacks - Breaking schema changes sometimes require immediate worker shutdown to prevent data corruption.
But for operational mistakes—wrong targeting, wrong content, wrong URLs, wrong anything—graceful cancellation prevents collateral damage while maintaining system stability. My team lead can now make the call to cancel without worrying about breaking other services.
Key takeaways
Building this cancellation system taught me that distributed systems need coordination primitives, not just command-and-control. Redis flags let API servers and workers collaborate across process boundaries with sub-millisecond overhead.
The two-tier cancellation strategy—immediate removal for queued jobs, graceful completion for active ones—acknowledges that some work can be prevented while other work can only be stopped cleanly at checkpoint boundaries.
Most importantly, cancellation in distributed systems isn't about rollback—it's about damage control. I can't unsend notifications already dispatched to Firebase, but I can:
- Stop wasting resources on doomed work
- Provide clean audit trails showing exactly what completed
- Maintain stability for unrelated background jobs
- Give my team lead immediate, accurate status updates
- Sleep better knowing I have an emergency brake that actually works
If I could go back to that stressful YouTube Live incident, this system would have:
- Let my team lead stop the campaign with a single API call
- Saved me from SSH'ing into production and killing containers
- Prevented collateral damage to other services
- Given the marketing team clean data showing exactly who received the broken notification
- Eliminated hours of debugging and service recovery