"Can we see the progress bar while notifications are being sent?"
My product manager asked this during a sprint review. We were sending push notifications to 500,000+ users, and the process took over 80 minutes. During that time, the admin dashboard just showed "Sending..." with no visibility into:
- How many notifications were actually sent
- Current success/failure rates
- Estimated time remaining
- Whether the job was stuck or progressing normally
The challenge: My workers run in separate Docker containers on different EC2 instances. The frontend can only talk to API servers—it has no direct connection to workers. How do I track progress across this distributed system in real-time?
In this post, I'll show you how I built a polling-based progress tracking system using Redis as a real-time cache, with database fallbacks for reliability.
The problem: distributed workers, centralized monitoring
My push notification architecture:
- NestJS API servers (3 instances behind load balancer)
- BullMQ workers (2 instances for blue-green deployment)
- Redis for job queue and coordination
- MSSQL for notification logs (one row per recipient)
- MySQL for schedule metadata
A typical campaign flow:
1. Admin creates a campaign targeting 500,000 users
2. API server queues a BullMQ job with a unique jobId
3. Worker picks up the job and starts sending in chunks of 500
4. Worker adds 2-second delays between chunks (FCM rate limit: 600K/minute)
5. Total time: ~80 minutes for 500,000 recipients
The math:
- 500,000 users ÷ 500 per chunk = 1,000 chunks
- 1,000 chunks × 2 seconds delay = 2,000 seconds (~33 minutes of pure waiting)
- + Actual FCM API calls (~47 minutes)
- = ~80 minutes total
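The arithmetic above can be sanity-checked with a tiny helper (a sketch for illustration; `campaignTimings` is not part of the actual codebase):

```typescript
// Sketch of the campaign-duration math: chunk count and the pure
// delay overhead from rate limiting.
function campaignTimings(
  totalUsers: number,
  chunkSize: number,
  delaySecPerChunk: number,
): { chunks: number; delayMinutes: number } {
  const chunks = Math.ceil(totalUsers / chunkSize);
  const delayMinutes = (chunks * delaySecPerChunk) / 60;
  return { chunks, delayMinutes };
}

const t = campaignTimings(500_000, 500, 2);
// 1,000 chunks, ~33.3 minutes of pure waiting
```

The remaining ~47 minutes come from the FCM API calls themselves, which is why the total lands around 80 minutes.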
The problem? The frontend has no idea what's happening between step 3 and step 5.
Solution: Redis as a real-time progress cache
I needed a centralized progress store that:
- ✅ Writes fast - Workers update every 2 seconds (per chunk)
- ✅ Reads fast - Frontend polls every 2-5 seconds
- ✅ Auto-expires - Completed jobs clean up automatically
- ✅ Survives crashes - Progress persists if workers restart
- ✅ Low overhead - Minimal impact on worker performance
Why not just query the database?
I could count rows in the notification logs table, but this has serious problems:
- Write amplification - 500,000 INSERT operations during sending
- Query load - Frontend polling creates constant SELECT COUNT(*) queries
- Lock contention - COUNT(*) locks table during active writes
- No estimated time - Can't calculate completion time from static logs
Redis solves all of these:
// redis.service.ts
private readonly PROGRESS_KEY_PREFIX = 'job:progress:';
private readonly PROGRESS_TTL_SECONDS = 3600; // 1 hour
async setProgressByJobId(params: JobProgressUpdateParams): Promise<void> {
const key = `${this.PROGRESS_KEY_PREFIX}${params.jobId}`;
const progressPercentage = params.totalTargetCount > 0
? Math.round((params.currentSentCount / params.totalTargetCount) * 100)
: 0;
const progress: ScheduleProgress = {
jobId: params.jobId,
status: params.status, // 'preparing', 'in_progress', 'completed', 'failed'
currentSentCount: params.currentSentCount,
totalTargetCount: params.totalTargetCount,
progressPercentage,
estimatedEndAt: this.calculateEstimatedEndTime(params),
// ... more fields
};
// TTL: 1 hour for completed, 24 hours for in-progress
const ttl = params.status === 'completed' ? 3600 : 86400;
await this.client.setex(key, ttl, JSON.stringify(progress));
}
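The `ScheduleProgress` shape is elided above ("... more fields"); a plausible minimal interface, inferred from the fields this post actually uses, might look like the following (hypothetical; the real interface may carry additional metadata):

```typescript
// Hypothetical shape inferred from the fields used throughout this post.
interface ScheduleProgress {
  jobId: string;
  status: 'preparing' | 'in_progress' | 'completed' | 'failed';
  currentSentCount: number;
  totalTargetCount: number;
  progressPercentage: number;
  totalChunks?: number;
  currentChunk?: number;
  startedAt?: string; // ISO timestamp
  completedAt?: string; // ISO timestamp
  estimatedEndAt?: string | null;
  errorMessage?: string;
}
```

Keeping timestamps as ISO strings (rather than Date objects) avoids serialization surprises, since the whole object round-trips through JSON.stringify/JSON.parse in Redis.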
Key design decisions:
1. TTL for automatic cleanup
- Completed jobs: 1-hour TTL (users rarely check after completion)
- In-progress jobs: 24-hour TTL (safety net for very slow campaigns)
- No cron jobs needed—Redis handles cleanup
2. Separate keys per job
- Pattern: job:progress:{jobId}
- Individual queries: GET job:progress:conditional-abc-123
- Batch queries: KEYS job:progress:* (for dashboard)
The worker: updating progress during execution
Workers update Redis after completing each chunk:
// firebase.service.ts - Simplified core loop
async sendConditionalNotifications(jobData): Promise<boolean> {
const { jobId, tokens, title, content } = jobData;
let totalSent = 0;
let totalFailed = 0;
const chunks = chunkArray(tokens, 500); // 1,000 chunks for 500K users
// Initialize progress
await this.redisService.setProgressByJobId({
jobId,
status: 'in_progress',
totalTargetCount: tokens.length,
currentSentCount: 0,
totalChunks: chunks.length,
currentChunk: 0,
startedAt: new Date(),
});
// Process each chunk
for (let i = 0; i < chunks.length; i++) {
const response = await this.fcmService.sendBatch(chunks[i], title, content);
totalSent += response.successCount;
totalFailed += response.failureCount;
// Update Redis every chunk (every 2 seconds)
await this.redisService.setProgressByJobId({
jobId,
status: 'in_progress',
currentSentCount: totalSent + totalFailed,
totalChunks: chunks.length,
currentChunk: i + 1,
// ... other fields
});
await delay(2000); // FCM rate limiting
}
// Mark completed
await this.redisService.setProgressByJobId({
jobId,
status: 'completed',
completedAt: new Date(),
});
return true;
}
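The `chunkArray` and `delay` helpers used above aren't shown in the post; minimal implementations would be:

```typescript
// Split an array into fixed-size chunks (the last chunk may be smaller).
function chunkArray<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Promise-based sleep, used between chunks for FCM rate limiting.
const delay = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));
```

For 500,000 tokens and a chunk size of 500 this yields exactly the 1,000 chunks the post describes.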
Why update every chunk?
- Users expect near-instant feedback
- Redis writes are cheap (<1ms)
- For 500K users (1,000 chunks), total overhead is ~1 second over 80 minutes
Three polling APIs for different use cases
1. Single job progress (campaign detail page)
// firebase.controller.ts
@Get('conditional-sends/:jobId/progress')
async getConditionalSendProgress(@Param('jobId') jobId: string) {
const progress = await this.firebaseService.getConditionalSendProgress(jobId);
if (!progress) {
return {
statusCode: 404,
message: 'Progress not found. Job may have completed over 1 hour ago.',
data: null,
};
}
return {
statusCode: 200,
message: 'Progress retrieved successfully',
data: progress,
};
}
Frontend polling:
// React example
useEffect(() => {
const interval = setInterval(async () => {
const response = await fetch(`/api/conditional-sends/${jobId}/progress`);
const { data } = await response.json();
if (data) {
setProgress(data);
if (data.status === 'completed') clearInterval(interval);
}
}, 3000); // Poll every 3 seconds
return () => clearInterval(interval);
}, [jobId]);
2. All active jobs (dashboard overview)
@Get('conditional-sends/active-progress')
async getAllActiveConditionalSendProgress() {
const progressList = await this.firebaseService.getAllActiveConditionalSendProgress();
return {
statusCode: 200,
message: progressList.length > 0
? `${progressList.length} active campaigns found`
: 'No active campaigns',
data: progressList,
};
}
// redis.service.ts
async getAllActiveJobProgress(): Promise<ScheduleProgress[]> {
// NOTE: KEYS blocks Redis while it scans the whole keyspace. Fine for a
// handful of progress keys; prefer SCAN if the keyspace grows.
const keys = await this.client.keys('job:progress:*');
const results: ScheduleProgress[] = [];
for (const key of keys) {
const data = await this.client.get(key);
if (data) {
const progress = JSON.parse(data);
if (progress.status === 'in_progress') {
results.push(progress);
}
}
}
return results;
}
3. Database fallback (scheduled sends)
For scheduled campaigns, I added a fallback for when Redis is down:
// scheduler.service.ts
async getAllActiveScheduleProgress(): Promise<ScheduleProgress[]> {
// Try Redis first
const progressList = await this.redisService.getAllActiveScheduleProgress();
if (progressList.length > 0) return progressList;
// Fallback to database
const sendingSchedules = await this.scheduleRepository
.createQueryBuilder('schedule')
.where('schedule.sent_yn = 1') // Sending started
.andWhere('schedule.actual_send_start_date IS NOT NULL')
.andWhere('schedule.actual_send_end_date IS NULL') // Not finished
.getMany();
// Convert DB records to progress format (without real-time counts)
return sendingSchedules.map(schedule => ({
scheduleSeq: schedule.seq,
status: 'in_progress',
totalTargetCount: schedule.total_send_count || 0,
currentSentCount: 0, // Real-time unavailable
startedAt: schedule.actual_send_start_date?.toISOString(),
// ...
}));
}
Estimating completion time accurately
The most requested feature: "When will this finish?"
For a 500K-user campaign taking 80 minutes, accurate estimates are critical:
// redis.service.ts
private calculateEstimatedEndTime(params: JobProgressUpdateParams): string | null {
if (!params.startedAt || params.currentSentCount >= params.totalTargetCount) {
return null;
}
const startTime = new Date(params.startedAt).getTime();
const elapsedMs = Date.now() - startTime;
if (elapsedMs <= 0 || params.currentSentCount === 0) return null;
// Observed throughput. Elapsed time already includes the 2-second
// chunk delays, so this rate accounts for them; adding a separate
// delay term would double-count the waiting.
const currentRate = params.currentSentCount / elapsedMs; // messages per ms
// Remaining work at the observed rate
const remainingCount = params.totalTargetCount - params.currentSentCount;
const remainingMs = remainingCount / currentRate;
// Network overhead (5% buffer)
const totalRemainingMs = remainingMs * 1.05;
const estimatedEndAt = new Date(Date.now() + totalRemainingMs);
return estimatedEndAt.toISOString();
}
Why this works for 80-minute jobs:
Example at 40% completion (chunk 400/1,000):
- Elapsed: 32 minutes
- Observed rate: 200,000 sent / 32 min = 6,250 msg/min (chunk delays are already baked in, since they're part of elapsed time)
- Remaining: 300,000 messages
- Estimated remaining: 300,000 / 6,250 = 48 minutes, plus the 5% buffer ≈ 50 minutes
- ETA: 32 + 50 ≈ 82 minutes total, within a couple of minutes of the actual 80
The estimate adjusts in real-time—if the sending rate slows down due to network issues, the ETA shifts accordingly.
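Extracted as a standalone function (a sketch for illustration, not the service code verbatim), the rate-based core of the estimate is easy to unit-test:

```typescript
// Rate-based remaining-time estimate. The observed throughput already
// includes chunk delays, since they are part of elapsed time.
function estimateRemainingMs(
  sent: number,
  total: number,
  elapsedMs: number,
  buffer = 1.05, // network-variance cushion
): number | null {
  if (sent <= 0 || sent >= total || elapsedMs <= 0) return null;
  const ratePerMs = sent / elapsedMs;
  return ((total - sent) / ratePerMs) * buffer;
}

// 200,000 of 500,000 sent after 32 minutes → ~50 minutes remaining
```

Returning null for the edge cases (nothing sent yet, already finished, clock skew) lets the caller simply omit the ETA field instead of showing a bogus number.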
Handling edge cases
Worker crashes mid-send
BullMQ's event handlers update Redis on failure:
// firebase.processor.ts
this.worker.on('failed', async (job, error) => {
const progress = await this.redisService.getProgressByJobId(job.data.jobId);
if (!progress) return; // Entry may have expired (TTL) or never been created
await this.redisService.setProgressByJobId({
...progress,
status: 'failed',
completedAt: new Date().toISOString(),
errorMessage: error.message,
});
});
Frontend sees "Failed" status with partial progress, not phantom "in progress" forever.
Redis restart during send
Workers re-initialize progress on restart:
- Worker continues from last BullMQ checkpoint
- Re-creates Redis progress entry
- Frontend sees brief gap (5-10 seconds), then progress resumes
Polling storm (100+ admins watching)
100 clients polling every 3 seconds = 33 req/sec to Redis.
Redis handles this easily (can do 100K+ ops/sec), but I can add API-level caching:
// Future optimization: 1-second in-memory cache
const cache = new Map<string, { data: ScheduleProgress | null; expiresAt: number }>();
async getProgress(jobId: string) {
const cached = cache.get(jobId);
if (cached && Date.now() < cached.expiresAt) return cached.data;
const progress = await this.redisService.getProgressByJobId(jobId);
cache.set(jobId, { data: progress, expiresAt: Date.now() + 1000 });
return progress;
}
Reduces Redis load by 66% (3-sec poll / 1-sec cache = 3x reduction).
Production metrics and observations
After running this system in production for 3 months:
Performance:
- Redis latency: <2ms (p99)
- API endpoint latency: 15ms (p50), 45ms (p99)
- Memory per job: ~500KB (~50MB for 100 concurrent jobs)
Reliability:
- Uptime: 99.97% (1 Redis restart in 3 months)
- DB fallback usage: 0.03% of requests
- Stale data incidents: 0 (TTL works perfectly)
Real campaign example (500K users):
- Start: 2025-01-10 09:00:00
- Chunks: 1,000 (500 users each)
- Progress updates: 1,000 Redis writes over 80 minutes
- Frontend polls: ~1,600 requests (80 min ÷ 3 sec)
- Total Redis ops: ~2,600 (writes + reads)
- Redis CPU usage: <2%
Cost:
- ElastiCache t3.micro ($15/month) handles 50 concurrent large campaigns easily
- No database load increase—polling doesn't touch MSSQL
Why not Server-Sent Events (SSE)?
I implemented SSE (it's in the codebase) but didn't deploy it because:
1. Proxy/firewall issues
- Corporate networks block long-lived HTTP connections
- Users complained: "Progress doesn't update on office WiFi"
2. Load balancer complexity
- AWS ALB's default idle timeout (60 seconds) drops quiet connections
- Requires sticky sessions or custom routing
3. Debugging is harder
- Polling: Check Network tab, see every request/response
- SSE: One opaque stream, harder to inspect
4. Reconnection logic
- SSE connections drop on network blips
- Frontend needs complex retry handling
- Polling auto-recovers (next request succeeds)
The trade-off:
- Polling overhead: ~3 requests/minute/user (negligible for Redis)
- SSE benefit: ~1 connection/user (not worth the complexity)
For admin tools with <100 concurrent users watching 80-minute jobs, polling is perfect. If I had millions of users watching real-time sports scores, SSE would make sense.
Key takeaways
1. Redis TTL is a superpower
- Set it and forget it for ephemeral data
- No cron jobs, no manual cleanup
- Perfect for progress tracking
2. Polling isn't always bad
- For admin dashboards: simpler than WebSockets/SSE
- Easy to debug (just curl the endpoint)
- Browsers handle it fine (no memory leaks)
3. Database fallbacks provide peace of mind
- Redis downtime is rare but real
- Users prefer "slightly stale data" over "no data"
4. Estimate completion times carefully
- Account for chunk delays (2 sec × 1,000 = 33 minutes)
- Add network overhead buffers
- Re-calculate on every update for accuracy
5. Scale considerations for 500K users
- Total Redis memory: ~500KB per job (50MB for 100 jobs)
- 1,000 updates over 80 minutes = low write pressure
- Polling every 3 seconds = ~1,600 reads per job (easily handled)
If I had to redesign this for 1M+ users, the only change would be reducing update frequency (every 5-10 chunks instead of every chunk) to cut Redis writes by 80-90%.
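That reduced frequency would be a one-line guard in the chunk loop; a sketch (the `UPDATE_EVERY_CHUNKS` constant is hypothetical):

```typescript
// Update progress every N chunks, but always on the final chunk so the
// progress bar reaches 100%. Cuts Redis writes by (N-1)/N.
const UPDATE_EVERY_CHUNKS = 10;

function shouldUpdateProgress(chunkIndex: number, totalChunks: number): boolean {
  const isLast = chunkIndex === totalChunks - 1;
  return isLast || (chunkIndex + 1) % UPDATE_EVERY_CHUNKS === 0;
}
```

For 1,000 chunks this produces 100 Redis writes instead of 1,000, a 90% reduction, while the user-visible granularity only drops from 0.1% to 1% per update.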
Have you built similar real-time monitoring for long-running jobs? What approach did you take? I'd love to hear your experiences in the comments.