"Can we see the progress bar while notifications are being sent?"
My product manager asked this during a sprint review. We were sending push notifications to 500,000+ users, and the process took over 80 minutes. During that time, the admin dashboard just showed "Sending..." with no visibility into:
- How many notifications were actually sent
- Current success/failure rates
- Estimated time remaining
- Whether the job was stuck or progressing normally
The challenge: My workers run in separate Docker containers on different EC2 instances. The frontend can only talk to API servers—it has no direct connection to workers. How do I track progress across this distributed system in real-time?
In this post, I'll show you how I built a polling-based progress tracking system using Redis as a real-time cache, with database fallbacks for reliability.
The problem: distributed workers, centralized monitoring
My push notification architecture:
- NestJS API servers (3 instances behind load balancer)
- BullMQ workers (2 instances for blue-green deployment)
- Redis for job queue and coordination
- MSSQL for notification logs (one row per recipient)
- MySQL for schedule metadata
A typical campaign flow:
1. Admin creates a campaign targeting 500,000 users
2. API server queues a BullMQ job with a unique jobId
3. Worker picks up the job and starts sending in chunks of 500
4. Worker adds 2-second delays between chunks (FCM rate limit: 600K/minute)
5. Total time: ~80 minutes for 500,000 recipients
The math:
- 500,000 users ÷ 500 per chunk = 1,000 chunks
- 1,000 chunks × 2 seconds delay = 2,000 seconds (~33 minutes of pure waiting)
- + Actual FCM API calls (~47 minutes)
- = ~80 minutes total
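The arithmetic above can be sanity-checked with a tiny helper (a sketch for illustration; `campaignTimings` is not part of the actual codebase):

```typescript
// Sketch of the campaign-duration math: chunk count and the pure
// delay overhead from rate limiting.
function campaignTimings(
  totalUsers: number,
  chunkSize: number,
  delaySecPerChunk: number,
): { chunks: number; delayMinutes: number } {
  const chunks = Math.ceil(totalUsers / chunkSize);
  const delayMinutes = (chunks * delaySecPerChunk) / 60;
  return { chunks, delayMinutes };
}

const t = campaignTimings(500_000, 500, 2);
// 1,000 chunks, ~33.3 minutes of pure waiting
```

The remaining ~47 minutes come from the FCM API calls themselves, which is why the total lands around 80 minutes.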
The problem? The frontend has no idea what's happening between step 3 and step 5.
Solution: Redis as a real-time progress cache
I needed a centralized progress store that:
- ✅ Writes fast - Workers update every 2 seconds (per chunk)
- ✅ Reads fast - Frontend polls every 2-5 seconds
- ✅ Auto-expires - Completed jobs clean up automatically
- ✅ Survives crashes - Progress persists if workers restart
- ✅ Low overhead - Minimal impact on worker performance
Why not just query the database?
I could count rows in the notification logs table, but this has serious problems:
- Write amplification - 500,000 INSERT operations during sending
- Query load - Frontend polling creates constant SELECT COUNT(*) queries
- Lock contention - COUNT(*) locks table during active writes
- No estimated time - Can't calculate completion time from static logs
Redis solves all of these:
// redis.service.ts
private readonly PROGRESS_KEY_PREFIX = 'job:progress:';
private readonly PROGRESS_TTL_SECONDS = 3600; // 1 hour
async setProgressByJobId(params: JobProgressUpdateParams): Promise<void> {
const key = `${this.PROGRESS_KEY_PREFIX}${params.jobId}`;
const progressPercentage = params.totalTargetCount > 0
? Math.round((params.currentSentCount / params.totalTargetCount) * 100)
: 0;
const progress: ScheduleProgress = {
jobId: params.jobId,
status: params.status, // 'preparing', 'in_progress', 'completed', 'failed'
currentSentCount: params.currentSentCount,
totalTargetCount: params.totalTargetCount,
progressPercentage,
estimatedEndAt: this.calculateEstimatedEndTime(params),
// ... more fields
};
// TTL: 1 hour for completed, 24 hours for in-progress
const ttl = params.status === 'completed' ? 3600 : 86400;
await this.client.setex(key, ttl, JSON.stringify(progress));
}
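The `ScheduleProgress` shape is elided above ("... more fields"); a plausible minimal interface, inferred from the fields this post actually uses, might look like the following (hypothetical; the real interface may carry additional metadata):

```typescript
// Hypothetical shape inferred from the fields used throughout this post.
interface ScheduleProgress {
  jobId: string;
  status: 'preparing' | 'in_progress' | 'completed' | 'failed';
  currentSentCount: number;
  totalTargetCount: number;
  progressPercentage: number;
  totalChunks?: number;
  currentChunk?: number;
  startedAt?: string; // ISO timestamp
  completedAt?: string; // ISO timestamp
  estimatedEndAt?: string | null;
  errorMessage?: string;
}
```

Keeping timestamps as ISO strings (rather than Date objects) avoids serialization surprises, since the whole object round-trips through JSON.stringify/JSON.parse in Redis.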
Key design decisions:
1. TTL for automatic cleanup
- Completed jobs: 1-hour TTL (users rarely check after completion)
- In-progress jobs: 24-hour TTL (safety net for very slow campaigns)
- No cron jobs needed—Redis handles cleanup
2. Separate keys per job
- Pattern: job:progress:{jobId}
- Individual queries: GET job:progress:conditional-abc-123
- Batch queries: KEYS job:progress:* (for dashboard)
The worker: updating progress during execution
Workers update Redis after completing each chunk:
// firebase.service.ts - Simplified core loop
async sendConditionalNotifications(jobData): Promise<boolean> {
const { jobId, tokens, title, content } = jobData;
let totalSent = 0;
let totalFailed = 0;
const chunks = chunkArray(tokens, 500); // 1,000 chunks for 500K users
// Initialize progress
await this.redisService.setProgressByJobId({
jobId,
status: 'in_progress',
totalTargetCount: tokens.length,
currentSentCount: 0,
totalChunks: chunks.length,
currentChunk: 0,
startedAt: new Date(),
});
// Process each chunk
for (let i = 0; i < chunks.length; i++) {
const response = await this.fcmService.sendBatch(chunks[i], title, content);
totalSent += response.successCount;
totalFailed += response.failureCount;
// Update Redis every chunk (every 2 seconds)
await this.redisService.setProgressByJobId({
jobId,
status: 'in_progress',
currentSentCount: totalSent + totalFailed,
totalChunks: chunks.length,
currentChunk: i + 1,
// ... other fields
});
await delay(2000); // FCM rate limiting
}
// Mark completed
await this.redisService.setProgressByJobId({
jobId,
status: 'completed',
completedAt: new Date(),
});
return true;
}
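The `chunkArray` and `delay` helpers used above aren't shown in the post; minimal implementations would be:

```typescript
// Split an array into fixed-size chunks (the last chunk may be smaller).
function chunkArray<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Promise-based sleep, used between chunks for FCM rate limiting.
const delay = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));
```

For 500,000 tokens and a chunk size of 500 this yields exactly the 1,000 chunks the post describes.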
Why update every chunk?
- Users expect near-instant feedback
- Redis writes are cheap (<1ms)
- For 500K users (1,000 chunks), total overhead is ~1 second over 80 minutes
Three polling APIs for different use cases
1. Single job progress (campaign detail page)
// firebase.controller.ts
@Get('conditional-sends/:jobId/progress')
async getConditionalSendProgress(@Param('jobId') jobId: string) {
const progress = await this.firebaseService.getConditionalSendProgress(jobId);
if (!progress) {
return {
statusCode: 404,
message: 'Progress not found. Job may have completed over 1 hour ago.',
data: null,
};
}
return {
statusCode: 200,
message: 'Progress retrieved successfully',
data: progress,
};
}
Frontend polling:
// React example
useEffect(() => {
const interval = setInterval(async () => {
const response = await fetch(`/api/conditional-sends/${jobId}/progress`);
const { data } = await response.json();
if (data) {
setProgress(data);
if (data.status === 'completed') clearInterval(interval);
}
}, 3000); // Poll every 3 seconds
return () => clearInterval(interval);
}, [jobId]);
2. All active jobs (dashboard overview)
@Get('conditional-sends/active-progress')
async getAllActiveConditionalSendProgress() {
const progressList = await this.firebaseService.getAllActiveConditionalSendProgress();
return {
statusCode: 200,
message: progressList.length > 0
? `${progressList.length} active campaigns found`
: 'No active campaigns',
data: progressList,
};
}
// redis.service.ts
async getAllActiveJobProgress(): Promise<ScheduleProgress[]> {
// NOTE: KEYS blocks Redis while it scans the whole keyspace. Fine for a
// handful of progress keys; prefer SCAN if the keyspace grows.
const keys = await this.client.keys('job:progress:*');
const results: ScheduleProgress[] = [];
for (const key of keys) {
const data = await this.client.get(key);
if (data) {
const progress = JSON.parse(data);
if (progress.status === 'in_progress') {
results.push(progress);
}
}
}
return results;
}
3. Database fallback (scheduled sends)
For scheduled campaigns, I added a fallback for when Redis is down:
// scheduler.service.ts
async getAllActiveScheduleProgress(): Promise<ScheduleProgress[]> {
// Try Redis first
const progressList = await this.redisService.getAllActiveScheduleProgress();
if (progressList.length > 0) return progressList;
// Fallback to database
const sendingSchedules = await this.scheduleRepository
.createQueryBuilder('schedule')
.where('schedule.sent_yn = 1') // Sending started
.andWhere('schedule.actual_send_start_date IS NOT NULL')
.andWhere('schedule.actual_send_end_date IS NULL') // Not finished
.getMany();
// Convert DB records to progress format (without real-time counts)
return sendingSchedules.map(schedule => ({
scheduleSeq: schedule.seq,
status: 'in_progress',
totalTargetCount: schedule.total_send_count || 0,
currentSentCount: 0, // Real-time unavailable
startedAt: schedule.actual_send_start_date?.toISOString(),
// ...
}));
}
Estimating completion time accurately
The most requested feature: "When will this finish?"
For a 500K-user campaign taking 80 minutes, accurate estimates are critical:
// redis.service.ts
private calculateEstimatedEndTime(params: JobProgressUpdateParams): string | null {
if (!params.startedAt || params.currentSentCount >= params.totalTargetCount) {
return null;
}
const startTime = new Date(params.startedAt).getTime();
const elapsedMs = Date.now() - startTime;
if (elapsedMs <= 0 || params.currentSentCount === 0) return null;
// Observed throughput. Elapsed time already includes the 2-second
// chunk delays, so this rate accounts for them; adding a separate
// delay term would double-count the waiting.
const currentRate = params.currentSentCount / elapsedMs; // messages per ms
// Remaining work at the observed rate
const remainingCount = params.totalTargetCount - params.currentSentCount;
const remainingMs = remainingCount / currentRate;
// Network overhead (5% buffer)
const totalRemainingMs = remainingMs * 1.05;
const estimatedEndAt = new Date(Date.now() + totalRemainingMs);
return estimatedEndAt.toISOString();
}
Why this works for 80-minute jobs:
Example at 40% completion (chunk 400/1,000):
- Elapsed: 32 minutes
- Observed rate: 200,000 sent / 32 min = 6,250 msg/min (chunk delays are already baked in, since they're part of elapsed time)
- Remaining: 300,000 messages
- Estimated remaining: 300,000 / 6,250 = 48 minutes, plus the 5% buffer ≈ 50 minutes
- ETA: 32 + 50 ≈ 82 minutes total, within a couple of minutes of the actual 80
The estimate adjusts in real-time—if the sending rate slows down due to network issues, the ETA shifts accordingly.
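Extracted as a standalone function (a sketch for illustration, not the service code verbatim), the rate-based core of the estimate is easy to unit-test:

```typescript
// Rate-based remaining-time estimate. The observed throughput already
// includes chunk delays, since they are part of elapsed time.
function estimateRemainingMs(
  sent: number,
  total: number,
  elapsedMs: number,
  buffer = 1.05, // network-variance cushion
): number | null {
  if (sent <= 0 || sent >= total || elapsedMs <= 0) return null;
  const ratePerMs = sent / elapsedMs;
  return ((total - sent) / ratePerMs) * buffer;
}

// 200,000 of 500,000 sent after 32 minutes → ~50 minutes remaining
```

Returning null for the edge cases (nothing sent yet, already finished, clock skew) lets the caller simply omit the ETA field instead of showing a bogus number.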
Handling edge cases
Worker crashes mid-send
BullMQ's event handlers update Redis on failure:
// firebase.processor.ts
this.worker.on('failed', async (job, error) => {
const progress = await this.redisService.getProgressByJobId(job.data.jobId);
if (!progress) return; // Entry may have expired (TTL) or never been created
await this.redisService.setProgressByJobId({
...progress,
status: 'failed',
completedAt: new Date().toISOString(),
errorMessage: error.message,
});
});
Frontend sees "Failed" status with partial progress, not phantom "in progress" forever.
Redis restart during send
Workers re-initialize progress on restart:
- Worker continues from last BullMQ checkpoint
- Re-creates Redis progress entry
- Frontend sees brief gap (5-10 seconds), then progress resumes
Polling storm (100+ admins watching)
100 clients polling every 3 seconds = 33 req/sec to Redis.
Redis handles this easily (can do 100K+ ops/sec), but I can add API-level caching:
// Future optimization: 1-second in-memory cache
const cache = new Map<string, { data: ScheduleProgress | null; expiresAt: number }>();
async getProgress(jobId: string) {
const cached = cache.get(jobId);
if (cached && Date.now() < cached.expiresAt) return cached.data;
const progress = await this.redisService.getProgressByJobId(jobId);
cache.set(jobId, { data: progress, expiresAt: Date.now() + 1000 });
return progress;
}
Reduces Redis load by 66% (3-sec poll / 1-sec cache = 3x reduction).
Production metrics and observations
After running this system in production for 3 months:
Performance:
- Redis latency: <2ms (p99)
- API endpoint latency: 15ms (p50), 45ms (p99)
- Memory per job: ~500KB (~50MB for 100 concurrent jobs)
Reliability:
- Uptime: 99.97% (1 Redis restart in 3 months)
- DB fallback usage: 0.03% of requests
- Stale data incidents: 0 (TTL works perfectly)
Real campaign example (500K users):
- Start: 2025-01-10 09:00:00
- Chunks: 1,000 (500 users each)
- Progress updates: 1,000 Redis writes over 80 minutes
- Frontend polls: ~1,600 requests (80 min ÷ 3 sec)
- Total Redis ops: ~2,600 (writes + reads)
- Redis CPU usage: <2%
Cost:
- ElastiCache t3.micro ($15/month) handles 50 concurrent large campaigns easily
- No database load increase—polling doesn't touch MSSQL
Why not Server-Sent Events (SSE)?
I implemented SSE (it's in the codebase) but didn't deploy it because:
1. Proxy/firewall issues
- Corporate networks block long-lived HTTP connections
- Users complained: "Progress doesn't update on office WiFi"
2. Load balancer complexity
- AWS ALB's default idle timeout (60 seconds) drops quiet connections
- Requires sticky sessions or custom routing
3. Debugging is harder
- Polling: Check Network tab, see every request/response
- SSE: One opaque stream, harder to inspect
4. Reconnection logic
- SSE connections drop on network blips
- Frontend needs complex retry handling
- Polling auto-recovers (next request succeeds)
The trade-off:
- Polling overhead: ~3 requests/minute/user (negligible for Redis)
- SSE benefit: ~1 connection/user (not worth the complexity)
For admin tools with <100 concurrent users watching 80-minute jobs, polling is perfect. If I had millions of users watching real-time sports scores, SSE would make sense.
Key takeaways
1. Redis TTL is a superpower
- Set it and forget it for ephemeral data
- No cron jobs, no manual cleanup
- Perfect for progress tracking
2. Polling isn't always bad
- For admin dashboards: simpler than WebSockets/SSE
- Easy to debug (just curl the endpoint)
- Browsers handle it fine (no memory leaks)
3. Database fallbacks provide peace of mind
- Redis downtime is rare but real
- Users prefer "slightly stale data" over "no data"
4. Estimate completion times carefully
- Account for chunk delays (2 sec × 1,000 = 33 minutes)
- Add network overhead buffers
- Re-calculate on every update for accuracy
5. Scale considerations for 500K users
- Total Redis memory: ~500KB per job (50MB for 100 jobs)
- 1,000 updates over 80 minutes = low write pressure
- Polling every 3 seconds = ~1,600 reads per job (easily handled)
If I had to redesign this for 1M+ users, the only change would be reducing update frequency (every 5-10 chunks instead of every chunk) to cut Redis writes by 80-90%.
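That reduced frequency would be a one-line guard in the chunk loop; a sketch (the `UPDATE_EVERY_CHUNKS` constant is hypothetical):

```typescript
// Update progress every N chunks, but always on the final chunk so the
// progress bar reaches 100%. Cuts Redis writes by (N-1)/N.
const UPDATE_EVERY_CHUNKS = 10;

function shouldUpdateProgress(chunkIndex: number, totalChunks: number): boolean {
  const isLast = chunkIndex === totalChunks - 1;
  return isLast || (chunkIndex + 1) % UPDATE_EVERY_CHUNKS === 0;
}
```

For 1,000 chunks this produces 100 Redis writes instead of 1,000, a 90% reduction, while the user-visible granularity only drops from 0.1% to 1% per update.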
Have you built similar real-time monitoring for long-running jobs? What approach did you take? I'd love to hear your experiences in the comments.