Sangwoo Lee

Building Real-time Progress Tracking for Mass Push Notifications with Redis and Polling

"Can we see the progress bar while notifications are being sent?"

My product manager asked this during a sprint review. We were sending push notifications to 500,000+ users, and the process took over 80 minutes. During that time, the admin dashboard just showed "Sending..." with no visibility into:

  • How many notifications were actually sent
  • Current success/failure rates
  • Estimated time remaining
  • Whether the job was stuck or progressing normally

The challenge: My workers run in separate Docker containers on different EC2 instances. The frontend can only talk to API servers—it has no direct connection to workers. How do I track progress across this distributed system in real-time?

In this post, I'll show you how I built a polling-based progress tracking system using Redis as a real-time cache, with database fallbacks for reliability.

The problem: distributed workers, centralized monitoring

My push notification architecture:

  • NestJS API servers (3 instances behind load balancer)
  • BullMQ workers (2 instances for blue-green deployment)
  • Redis for job queue and coordination
  • MSSQL for notification logs (one row per recipient)
  • MySQL for schedule metadata

A typical campaign flow:

  1. Admin creates a campaign targeting 500,000 users
  2. API server queues a BullMQ job with unique jobId
  3. Worker picks up the job and starts sending in chunks of 500
  4. Worker adds 2-second delays between chunks (FCM rate limits: 600K/minute)
  5. Total time: ~80 minutes for 500,000 recipients

The math:

  • 500,000 users ÷ 500 per chunk = 1,000 chunks
  • 1,000 chunks × 2 seconds delay = 2,000 seconds (~33 minutes of pure waiting)
  • + Actual FCM API calls (~47 minutes)
  • = ~80 minutes total
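
If you want to sanity-check those numbers in code, a throwaway snippet (constants and names are illustrative, not from the real service) reproduces them:

// Back-of-the-envelope duration check
const USERS = 500_000;
const CHUNK_SIZE = 500;
const DELAY_MS = 2_000;

const chunks = Math.ceil(USERS / CHUNK_SIZE);           // 1,000 chunks
const delayMinutes = (chunks * DELAY_MS) / 1_000 / 60;  // ~33 min of pure waiting
const fcmCallMinutes = 47;                              // measured FCM call time, per above
console.log(`${chunks} chunks, ~${Math.round(delayMinutes + fcmCallMinutes)} min total`); // ~80 min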

The problem? The frontend has no idea what's happening between step 3 and step 5.

Solution: Redis as a real-time progress cache

I needed a centralized progress store that:

  • Writes fast - Workers update every 2 seconds (per chunk)
  • Reads fast - Frontend polls every 2-5 seconds
  • Auto-expires - Completed jobs clean up automatically
  • Survives crashes - Progress persists if workers restart
  • Low overhead - Minimal impact on worker performance

Why not just query the database?

I could count rows in the notification logs table, but this has serious problems:

  1. Write amplification - 500,000 INSERT operations during sending
  2. Query load - Frontend polling creates constant SELECT COUNT(*) queries
  3. Lock contention - COUNT(*) locks table during active writes
  4. No estimated time - Can't calculate completion time from static logs

Redis solves all of these:

// redis.service.ts
private readonly PROGRESS_KEY_PREFIX = 'job:progress:';
private readonly PROGRESS_TTL_SECONDS = 3600; // 1 hour

async setProgressByJobId(params: JobProgressUpdateParams): Promise<void> {
  const key = `${this.PROGRESS_KEY_PREFIX}${params.jobId}`;

  const progressPercentage = params.totalTargetCount > 0
    ? Math.round((params.currentSentCount / params.totalTargetCount) * 100)
    : 0;

  const progress: ScheduleProgress = {
    jobId: params.jobId,
    status: params.status, // 'preparing', 'in_progress', 'completed', 'failed'
    currentSentCount: params.currentSentCount,
    totalTargetCount: params.totalTargetCount,
    progressPercentage,
    estimatedEndAt: this.calculateJobEstimatedEndTime(params),
    // ... more fields
  };

  // TTL: 1 hour for completed, 24 hours for in-progress
  const ttl = params.status === 'completed' ? this.PROGRESS_TTL_SECONDS : 86400;
  await this.client.setex(key, ttl, JSON.stringify(progress));
}
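The code above references JobProgressUpdateParams and ScheduleProgress, which the post doesn't show. A minimal sketch of what they might look like, inferred purely from the fields used here (the production types likely carry more):

// Sketch only: inferred from usage above, not the actual production types
type JobStatus = 'preparing' | 'in_progress' | 'completed' | 'failed';

interface JobProgressUpdateParams {
  jobId: string;
  status: JobStatus;
  currentSentCount: number;
  totalTargetCount: number;
  totalChunks?: number;
  currentChunk?: number;
  startedAt?: Date | string;
  completedAt?: Date | string;
  errorMessage?: string;
}

interface ScheduleProgress extends JobProgressUpdateParams {
  progressPercentage: number;
  estimatedEndAt: string | null;
}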

Key design decisions:

1. TTL for automatic cleanup

  • Completed jobs: 1-hour TTL (users rarely check after completion)
  • In-progress jobs: 24-hour TTL (safety net for very slow campaigns)
  • No cron jobs needed—Redis handles cleanup

2. Separate keys per job

  • Pattern: job:progress:{jobId}
  • Individual queries: GET job:progress:conditional-abc-123
  • Batch queries: KEYS job:progress:* (for dashboard)

The worker: updating progress during execution

Workers update Redis after completing each chunk:

// firebase.service.ts - Simplified core loop
async sendConditionalNotifications(jobData): Promise<boolean> {
  const { jobId, tokens, title, content } = jobData;
  const chunks = chunkArray(tokens, 500); // 1,000 chunks for 500K users

  let totalSent = 0;
  let totalFailed = 0;

  // Initialize progress
  await this.redisService.setProgressByJobId({
    jobId,
    status: 'in_progress',
    totalTargetCount: tokens.length,
    currentSentCount: 0,
    totalChunks: chunks.length,
    currentChunk: 0,
    startedAt: new Date(),
  });

  // Process each chunk
  for (let i = 0; i < chunks.length; i++) {
    const response = await this.fcmService.sendBatch(chunks[i], title, content);

    totalSent += response.successCount;
    totalFailed += response.failureCount;

    // Update Redis every chunk (every ~2 seconds)
    await this.redisService.setProgressByJobId({
      jobId,
      status: 'in_progress',
      currentSentCount: totalSent + totalFailed,
      totalTargetCount: tokens.length, // needed for the percentage calculation
      totalChunks: chunks.length,
      currentChunk: i + 1,
      // ... other fields
    });

    await delay(2000); // FCM rate limiting
  }

  // Mark completed (include final counts so the last snapshot is complete)
  await this.redisService.setProgressByJobId({
    jobId,
    status: 'completed',
    currentSentCount: totalSent + totalFailed,
    totalTargetCount: tokens.length,
    completedAt: new Date(),
  });

  return true;
}
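chunkArray and delay are plain utilities; the post doesn't show them, but minimal versions look like this:

// Assumed utility implementations (not shown in the original service)
function chunkArray<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}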

Why update every chunk?

  • Users expect near-instant feedback
  • Redis writes are cheap (<1ms)
  • For 500K users (1,000 chunks), total overhead is ~1 second over 80 minutes

Three polling APIs for different use cases

1. Single job progress (campaign detail page)

// firebase.controller.ts
@Get('conditional-sends/:jobId/progress')
async getConditionalSendProgress(@Param('jobId') jobId: string) {
  const progress = await this.firebaseService.getConditionalSendProgress(jobId);

  if (!progress) {
    return {
      statusCode: 404,
      message: 'Progress not found. Job may have completed over 1 hour ago.',
      data: null,
    };
  }

  return {
    statusCode: 200,
    message: 'Progress retrieved successfully',
    data: progress,
  };
}
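The service method behind this endpoint isn't shown; given the job:progress:{jobId} key pattern, the read side is probably a one-liner like this (a sketch, not the exact production code):

// redis.service.ts: sketch of the read side, assuming the key pattern above
async getProgressByJobId(jobId: string): Promise<ScheduleProgress | null> {
  const data = await this.client.get(`${this.PROGRESS_KEY_PREFIX}${jobId}`);
  return data ? (JSON.parse(data) as ScheduleProgress) : null; // null once the TTL expires
}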

Frontend polling:

// React example
useEffect(() => {
  const interval = setInterval(async () => {
    const response = await fetch(`/api/conditional-sends/${jobId}/progress`);
    const { data } = await response.json();

    if (data) {
      setProgress(data);
      if (data.status === 'completed' || data.status === 'failed') clearInterval(interval);
    }
  }, 3000); // Poll every 3 seconds

  return () => clearInterval(interval);
}, [jobId]);

2. All active jobs (dashboard overview)

@Get('conditional-sends/active-progress')
async getAllActiveConditionalSendProgress() {
  const progressList = await this.firebaseService.getAllActiveConditionalSendProgress();

  return {
    statusCode: 200,
    message: progressList.length > 0 
      ? `${progressList.length} active campaigns found`
      : 'No active campaigns',
    data: progressList,
  };
}
// redis.service.ts
async getAllActiveJobProgress(): Promise<ScheduleProgress[]> {
  // NOTE: KEYS is O(N) and blocks Redis while it scans the keyspace; fine at
  // this scale, but prefer SCAN in production (a sketch follows this block)
  const keys = await this.client.keys('job:progress:*');
  const results: ScheduleProgress[] = [];

  for (const key of keys) {
    const data = await this.client.get(key);
    if (data) {
      const progress = JSON.parse(data);
      if (progress.status === 'in_progress') {
        results.push(progress);
      }
    }
  }

  return results;
}
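As the comment above notes, KEYS blocks Redis while it walks the entire keyspace. A non-blocking variant (a sketch assuming the ioredis client, which exposes scanStream and mget) would look like:

// Sketch: non-blocking key iteration with SCAN, plus MGET to avoid a GET per key
async getAllActiveJobProgressScan(): Promise<ScheduleProgress[]> {
  const keys: string[] = [];
  const stream = this.client.scanStream({ match: 'job:progress:*', count: 100 });

  for await (const batch of stream) {
    keys.push(...(batch as string[]));
  }
  if (keys.length === 0) return [];

  // One round trip for all values instead of N sequential GETs
  const values = await this.client.mget(...keys);
  return values
    .filter((value): value is string => value !== null)
    .map((value) => JSON.parse(value) as ScheduleProgress)
    .filter((progress) => progress.status === 'in_progress');
}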

3. Database fallback (scheduled sends)

For scheduled campaigns, I added a fallback for when Redis is down:

// scheduler.service.ts
async getAllActiveScheduleProgress(): Promise<ScheduleProgress[]> {
  // Try Redis first; a connection error counts as "no data" so we fall through
  let progressList: ScheduleProgress[] = [];
  try {
    progressList = await this.redisService.getAllActiveScheduleProgress();
  } catch {
    // Redis unavailable -- use the database fallback below
  }
  if (progressList.length > 0) return progressList;

  // Fallback to database
  const sendingSchedules = await this.scheduleRepository
    .createQueryBuilder('schedule')
    .where('schedule.sent_yn = 1') // Sending started
    .andWhere('schedule.actual_send_start_date IS NOT NULL')
    .andWhere('schedule.actual_send_end_date IS NULL') // Not finished
    .getMany();

  // Convert DB records to progress format (without real-time counts)
  return sendingSchedules.map(schedule => ({
    scheduleSeq: schedule.seq,
    status: 'in_progress',
    totalTargetCount: schedule.total_send_count || 0,
    currentSentCount: 0, // Real-time unavailable
    startedAt: schedule.actual_send_start_date?.toISOString(),
    // ...
  }));
}

Estimating completion time accurately

The most requested feature: "When will this finish?"

For a 500K-user campaign taking 80 minutes, accurate estimates are critical:

// redis.service.ts
private calculateJobEstimatedEndTime(params): string | null {
  if (!params.startedAt || params.currentSentCount === 0
      || params.currentSentCount >= params.totalTargetCount) {
    return null; // nothing to extrapolate from, or already done
  }

  const startTime = new Date(params.startedAt).getTime();
  const elapsedMs = Date.now() - startTime;

  // Elapsed time includes the 2-second pause after each finished chunk;
  // subtract it so the rate reflects pure sending speed and the chunk
  // delays below aren't double-counted
  const elapsedDelayMs = Math.max(params.currentChunk - 1, 0) * 2000;
  const pureElapsedMs = Math.max(elapsedMs - elapsedDelayMs, 1);

  // Pure sending rate (messages per millisecond)
  const currentRate = params.currentSentCount / pureElapsedMs;

  // Remaining work
  const remainingCount = params.totalTargetCount - params.currentSentCount;
  const remainingChunks = params.totalChunks - params.currentChunk;

  // Chunk delay overhead still ahead of us (2 seconds per chunk)
  const chunkDelayMs = remainingChunks > 0 ? (remainingChunks - 1) * 2000 : 0;

  // Pure sending time for the remaining messages
  const pureSendingTimeMs = remainingCount / currentRate;

  // Network overhead (5% buffer)
  const networkOverheadMs = pureSendingTimeMs * 0.05;

  // Total estimate
  const totalRemainingMs = pureSendingTimeMs + chunkDelayMs + networkOverheadMs;
  const estimatedEndAt = new Date(Date.now() + totalRemainingMs);

  return estimatedEndAt.toISOString();
}

Why this works for 80-minute jobs:

Example at 40% completion (chunk 400/1,000):

  • Elapsed: 32 minutes, of which ~13 minutes were chunk delays (≈400 × 2 sec)
  • Pure sending rate: 200,000 sent / ~19 min ≈ 10,700 msg/min
  • Remaining: 300,000 messages
  • Pure sending: 300,000 / 10,700 ≈ 28 minutes
  • Chunk delays: 600 chunks × 2 sec = 20 minutes
  • Network overhead: 28 min × 5% ≈ 1.4 minutes
  • Total remaining: ~49 minutes
  • ETA: 32 + 49 ≈ 81 minutes (right on the ~80-minute pace)

The estimate adjusts in real-time—if the sending rate slows down due to network issues, the ETA shifts accordingly.

Handling edge cases

Worker crashes mid-send

BullMQ's event handlers update Redis on failure:

// firebase.processor.ts
this.worker.on('failed', async (job, error) => {
  if (!job) return; // job can be undefined if it was removed from the queue

  const progress = await this.redisService.getProgressByJobId(job.data.jobId);

  await this.redisService.setProgressByJobId({
    ...progress,
    status: 'failed',
    completedAt: new Date().toISOString(),
    errorMessage: error.message,
  });
});

Frontend sees "Failed" status with partial progress, not phantom "in progress" forever.

Redis restart during send

Workers re-initialize progress on restart:

  1. Worker continues from last BullMQ checkpoint
  2. Re-creates Redis progress entry (sketched below)
  3. Frontend sees brief gap (5-10 seconds), then progress resumes
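
What step 2 might look like in the processor, assuming the worker checkpoints its chunk index via BullMQ's job.updateProgress() after each chunk (a sketch; the checkpoint field names are illustrative):

// Sketch: re-seeding the Redis entry when a job resumes after a restart
const checkpoint = (job.progress as { currentChunk?: number; sentSoFar?: number }) ?? {};
const resumeChunk = checkpoint.currentChunk ?? 0;

// The frontend's next poll sees real progress again instead of a 404
await this.redisService.setProgressByJobId({
  jobId: job.data.jobId,
  status: 'in_progress',
  currentSentCount: checkpoint.sentSoFar ?? 0,
  totalTargetCount: job.data.tokens.length,
  currentChunk: resumeChunk,
  totalChunks: Math.ceil(job.data.tokens.length / 500),
  startedAt: job.data.startedAt,
});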

Polling storm (100+ admins watching)

100 clients polling every 3 seconds = 33 req/sec to Redis.

Redis handles this easily (can do 100K+ ops/sec), but I can add API-level caching:

// Future optimization: 1-second in-memory cache
const cache = new Map();

async getProgress(jobId: string) {
  const cached = cache.get(jobId);
  if (cached && Date.now() < cached.expiresAt) return cached.data;

  const progress = await this.redisService.getProgressByJobId(jobId);
  cache.set(jobId, { data: progress, expiresAt: Date.now() + 1000 });

  return progress;
}

With a 1-second cache, Redis sees at most one read per second per job, no matter how many admins are polling: 100 watchers at 3-second intervals drop from ~33 Redis reads/sec to at most 1/sec.

Production metrics and observations

After running this system in production for 3 months:

Performance:

  • Redis latency: <2ms (p99)
  • API endpoint latency: 15ms (p50), 45ms (p99)
  • Memory per job: ~500KB (~50MB for 100 concurrent jobs)

Reliability:

  • Uptime: 99.97% (1 Redis restart in 3 months)
  • DB fallback usage: 0.03% of requests
  • Stale data incidents: 0 (TTL works perfectly)

Real campaign example (500K users):

  • Start: 2025-01-10 09:00:00
  • Chunks: 1,000 (500 users each)
  • Progress updates: 1,000 Redis writes over 80 minutes
  • Frontend polls: ~1,600 requests (80 min ÷ 3 sec)
  • Total Redis ops: ~2,600 (writes + reads)
  • Redis CPU usage: <2%

Cost:

  • ElastiCache t3.micro ($15/month) handles 50 concurrent large campaigns easily
  • No database load increase—polling doesn't touch MSSQL

Why not Server-Sent Events (SSE)?

I implemented SSE (it's in the codebase) but didn't deploy it because:

1. Proxy/firewall issues

  • Corporate networks block long-lived HTTP connections
  • Users complained: "Progress doesn't update on office WiFi"

2. Load balancer complexity

  • AWS ALB closes idle connections after 60 seconds by default
  • Requires sticky sessions or custom routing

3. Debugging is harder

  • Polling: Check Network tab, see every request/response
  • SSE: One opaque stream, harder to inspect

4. Reconnection logic

  • SSE connections drop on network blips
  • Frontend needs complex retry handling
  • Polling auto-recovers (next request succeeds)

The trade-off:

  • Polling overhead: ~3 requests/minute/user (negligible for Redis)
  • SSE benefit: ~1 connection/user (not worth the complexity)

For admin tools with <100 concurrent users watching 80-minute jobs, polling is perfect. If I had millions of users watching real-time sports scores, SSE would make sense.

Key takeaways

1. Redis TTL is a superpower

  • Set it and forget it for ephemeral data
  • No cron jobs, no manual cleanup
  • Perfect for progress tracking

2. Polling isn't always bad

  • For admin dashboards: simpler than WebSockets/SSE
  • Easy to debug (just curl the endpoint)
  • Browsers handle it fine (no memory leaks)

3. Database fallbacks provide peace of mind

  • Redis downtime is rare but real
  • Users prefer "slightly stale data" over "no data"

4. Estimate completion times carefully

  • Account for chunk delays (2 sec × 1,000 = 33 minutes)
  • Add network overhead buffers
  • Re-calculate on every update for accuracy

5. Scale considerations for 500K users

  • Total Redis memory: ~500KB per job (50MB for 100 jobs)
  • 1,000 updates over 80 minutes = low write pressure
  • Polling every 3 seconds = ~1,600 reads per job (easily handled)

If I had to redesign this for 1M+ users, the only change would be reducing update frequency (every 5-10 chunks instead of every chunk) to cut Redis writes by 80-90%.
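
That change is a small condition in the chunk loop, for example:

// Sketch: write progress every 10th chunk (and always on the final chunk)
if ((i + 1) % 10 === 0 || i === chunks.length - 1) {
  await this.redisService.setProgressByJobId({
    jobId,
    status: 'in_progress',
    currentSentCount: totalSent + totalFailed,
    totalTargetCount: tokens.length,
    currentChunk: i + 1,
    totalChunks: chunks.length,
  });
}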

Have you built similar real-time monitoring for long-running jobs? What approach did you take? I'd love to hear your experiences in the comments.
