How we measure actual device reach rates by analyzing production send results, classify error types, and maintain 85%+ delivery rates across 50M monthly notifications
"Our dry-run predicted 79% delivery rate. Production hit 80.2%."
My engineering lead looked at me skeptically. "That's... suspiciously accurate. How do we know the tokens we validated are still valid when we actually send?"
He had a point. Token validation (dry-run) tells you if a token was valid at validation time. Actual delivery tells you if the token was valid at send time. In high-volume push systems with millions of users, those two moments can be minutes, hours, or days apart.
In this post, I'll show you how we measure real device delivery rates by analyzing actual FCM send results, classify errors to understand why messages fail, and use that data to maintain 85%+ delivery rates across 50 million monthly notifications.
The gap between validation and reality
Here's what most tutorials don't tell you: Firebase dry-run and actual sends can produce different results.
Dry-run validation (covered in previous post):
const response = await admin.messaging().send(message, true); // dryRun = true
// Resolves with a message ID: the token is valid RIGHT NOW
Actual send (this post):
const response = await admin.messaging().send(message); // dryRun defaults to false
// Resolves with a message ID: FCM ACCEPTED the message for delivery
Why the difference matters:
| Scenario | Dry-run Result | Actual Send Result |
|---|---|---|
| User uninstalled app 5 minutes ago | ✅ Success | ❌ invalid-registration-token |
| User's device is offline | ✅ Success | ✅ Success (queued for delivery) |
| FCM quota exceeded | ✅ Success | ❌ quota-exceeded |
| Network timeout during send | ✅ Success | ❌ unavailable |
The reality: You need BOTH validation and delivery measurement.
- Validation (dry-run): Pre-send health check
- Delivery measurement (actual send): Ground truth
This post focuses on the latter: measuring what actually happened.
Why measure delivery rates at all?
Business impact:
When we started measuring delivery rates in January 2025, we discovered:
- Reported: "Sent to 500,000 users"
- Reality: Delivered to 350,000 devices (70% delivery rate)
- Gap: 150,000 users never had a chance to see the notification
Stakeholder impact:
Product Manager: "We sent to 500K users, why did only 80K click?"
Dev Team: "That's a 16% click rate!"
Product Manager: "No, it's actually 23% click rate (80K / 350K actual reach)"
Without delivery rate tracking, we were misunderstanding our engagement metrics.
Cost impact:
- Processing 150K invalid tokens = 24 minutes of server time per campaign
- 50 campaigns/month × 24 minutes = 20 hours/month wasted
- Database writes for failed sends = 7.5M unnecessary operations/month
How Firebase FCM responses reveal delivery truth
Every FCM send returns detailed response data:
const response = await admin.messaging().sendEachForMulticast({
tokens: ['token1', 'token2', ...],
notification: { title: 'Hello', body: 'World' },
data: { campaignId: '12345' },
});
console.log(response);
Response structure:
{
successCount: 7243,
failureCount: 2757,
responses: [
// Success example
{
success: true,
messageId: 'projects/my-app/messages/0:1234567890'
},
// Failure examples
{
success: false,
error: {
code: 'messaging/invalid-registration-token',
message: 'The registration token is not valid anymore'
}
},
{
success: false,
error: {
code: 'messaging/server-unavailable',
message: 'The server is temporarily unavailable'
}
},
// ... 10,000 total responses (one per token)
]
}
The goldmine: The error.code field tells you exactly WHY a send failed.
Error classification: the key to actionable metrics
Not all failures are equal. Some are permanent (bad token), others are temporary (retry might work).
// fcm-error-classifier.ts
export type FcmErrorType = 'invalid_token' | 'temporary' | 'quota' | 'other';
export function classifyFcmError(errorCode?: string): FcmErrorType {
if (!errorCode) return 'other';
// ❌ PERMANENT FAILURES - Token is completely dead
const INVALID_TOKEN_ERRORS = [
'messaging/invalid-registration-token',
'messaging/registration-token-not-registered',
'messaging/invalid-argument',
];
if (INVALID_TOKEN_ERRORS.includes(errorCode)) {
return 'invalid_token';
}
// ⏳ TEMPORARY FAILURES - Retry might succeed
const TEMPORARY_ERRORS = [
'messaging/unavailable',
'messaging/internal-error',
'messaging/server-unavailable',
'messaging/timeout',
'messaging/unknown-error',
];
if (TEMPORARY_ERRORS.includes(errorCode)) {
return 'temporary';
}
// 🚫 QUOTA EXCEEDED - Rate limiting
if (errorCode === 'messaging/quota-exceeded') {
return 'quota';
}
// ❓ UNKNOWN ERRORS
return 'other';
}
Why this classification matters:
Invalid Token (❌ Permanent):
- Action: Remove from database immediately
- Retry: Pointless - will always fail
- Cause: User uninstalled app, token expired, device was factory reset
Temporary (⏳ Retry-able):
- Action: Retry after backoff delay
- Retry: 70-80% success rate on retry
- Cause: Network hiccup, FCM infrastructure maintenance, device temporarily offline
Quota (🚫 Rate Limit):
- Action: Wait and retry later
- Retry: 100% success after rate limit window
- Cause: Too many requests too fast
Other (❓ Unknown):
- Action: Log for investigation
- Retry: Case-by-case decision
- Cause: New error types, unexpected scenarios
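The four buckets above map one-to-one to a remediation action. Here is a minimal dispatcher sketch; the action labels are mine, not the post's code, so wire them to your own cleanup queue and retry scheduler:

```typescript
// Map each error class to the remediation described above.
// Reuses the FcmErrorType union from fcm-error-classifier.ts.
type FcmErrorType = 'invalid_token' | 'temporary' | 'quota' | 'other';

interface FailureAction {
  kind: 'remove_token' | 'retry_with_backoff' | 'wait_for_quota' | 'log_for_triage';
  retryable: boolean;
}

export function actionFor(errorType: FcmErrorType): FailureAction {
  switch (errorType) {
    case 'invalid_token': // permanent: the token will never work again
      return { kind: 'remove_token', retryable: false };
    case 'temporary': // transient: backoff + retry recovers most of these
      return { kind: 'retry_with_backoff', retryable: true };
    case 'quota': // rate-limited: succeeds after the limit window passes
      return { kind: 'wait_for_quota', retryable: true };
    default: // unknown: log and investigate before retrying blindly
      return { kind: 'log_for_triage', retryable: false };
  }
}
```

Keeping the mapping in one pure function makes the send pipeline easy to unit-test without touching FCM.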
Storing delivery results: the audit trail
We store every single send result in the database for analysis:
// push-notification-log.entity.ts
@Entity({ name: 'push_notification_log' })
export class PushNotificationLog {
@PrimaryGeneratedColumn({ type: 'bigint' })
id: number;
@Column({ type: 'varchar', length: 200 })
job_id: string; // e.g., "production-blackfriday-2025"
@Column({ type: 'int' })
member_seq: number; // User identifier
@Column({ type: 'varchar', length: 500 })
push_token: string;
// ⭐ Success/failure tracking
@Column({ type: 'bit', default: false })
is_success: boolean;
@Column({ type: 'datetime2' })
sent_at: Date;
// ⭐ Error details (null if success)
@Column({ type: 'varchar', length: 50, nullable: true })
error_code: string; // FCM error code
@Column({ type: 'nvarchar', length: 500, nullable: true })
error_message: string;
// ⭐ Error classification for analytics
@Column({ type: 'varchar', length: 30, nullable: true })
error_type: 'invalid_token' | 'temporary' | 'quota' | 'other';
// Campaign details
@Column({ type: 'nvarchar', length: 200 })
title: string;
@Column({ type: 'nvarchar', length: 1000 })
content: string;
// Metadata
@Column({ type: 'int', nullable: true })
chunk_index: number; // Which batch was this part of
@Column({ type: 'bit', nullable: true, default: false })
is_dry_run: boolean; // false = production send
// Retry tracking
@Column({ type: 'int', default: 0 })
retry_count: number; // How many times retried
@Column({ type: 'bit', default: false })
retry_success: boolean; // Did retry succeed?
}
Indexes for fast queries:
@Index(['job_id', 'is_success']) // Fast delivery rate queries
@Index(['job_id', 'error_type']) // Fast error breakdown
@Index(['sent_at']) // Time-series analysis
@Index(['member_seq', 'sent_at']) // User-level tracking
export class PushNotificationLog { /* ... */ }
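As a usage sketch, the (job_id, is_success) index lets a delivery-rate lookup run as two cheap COUNTs instead of loading every log row. The function names and the structural repository type below are mine (a stand-in for the relevant slice of a TypeORM repository), not the post's code:

```typescript
// Minimal structural type for the one repository method this sketch uses,
// so it stands alone without importing typeorm.
interface CountableRepo {
  count(options: { where: Record<string, unknown> }): Promise<number>;
}

// Percentage helper, matching the rounding used elsewhere in the post.
export function toRate(delivered: number, total: number): number {
  return total > 0 ? parseFloat(((delivered / total) * 100).toFixed(2)) : 0;
}

// Two indexed COUNTs against (job_id, is_success) — no table scan.
export async function deliveryRateFor(repo: CountableRepo, jobId: string): Promise<number> {
  const total = await repo.count({ where: { job_id: jobId } });
  const delivered = await repo.count({ where: { job_id: jobId, is_success: true } });
  return toRate(delivered, total);
}
```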
Implementation: capturing delivery results
Here's how we save every send result:
// firebase.service.ts
async sendConditionalNotifications(
jobData: ConditionalNotificationParams
): Promise<{
success: boolean;
totalTokens: number;
deliveredCount: number;
failedCount: number;
deliveryRate: number;
errorStats: Record<FcmErrorType, number>;
}> {
// ... Get target tokens from database ...
const tokens = await this.getTargetTokens(jobData);
const chunks = chunkArray(tokens, 500); // 500 tokens per chunk
let totalSuccess = 0;
let totalFailed = 0;
const errorStats = {
invalid_token: 0,
temporary: 0,
quota: 0,
other: 0,
};
for (let chunkIndex = 0; chunkIndex < chunks.length; chunkIndex++) {
const chunk = chunks[chunkIndex];
// Build FCM messages
const messages = chunk.map(token => ({
token,
notification: {
title: jobData.title,
body: jobData.content
},
data: {
job_id: jobData.jobId,
campaign_id: jobData.campaignId,
},
}));
try {
// ★ ACTUAL SEND (not dry-run)
const response = await this.firebaseApp
.messaging()
.sendEachForMulticast({
tokens: chunk,
notification: messages[0].notification,
data: messages[0].data,
});
console.log(`
Chunk ${chunkIndex + 1}/${chunks.length}:
✅ Success: ${response.successCount}
❌ Failed: ${response.failureCount}
`);
// ⭐ ANALYZE EACH RESPONSE
for (let i = 0; i < response.responses.length; i++) {
const resp = response.responses[i];
const message = messages[i];
// Create log entry
const log = new PushNotificationLog({
job_id: jobData.jobId,
member_seq: await this.getMemberSeq(message.token),
push_token: message.token,
title: jobData.title,
content: jobData.content,
sent_at: new Date(),
chunk_index: chunkIndex,
is_dry_run: false, // ✅ Production send
});
if (resp.success) {
// ✅ Success - message delivered
log.is_success = true;
totalSuccess++;
} else {
// ❌ Failure - classify error
log.is_success = false;
log.error_code = resp.error?.code;
log.error_message = resp.error?.message;
log.error_type = classifyFcmError(resp.error?.code);
totalFailed++;
errorStats[log.error_type]++;
// Log detailed error for debugging
console.error(`
Token: ${message.token.substring(0, 30)}...
Error: ${resp.error?.code}
Message: ${resp.error?.message}
`);
}
// ⭐ SAVE TO DATABASE
await this.pushNotificationLog.save(log);
}
// Rate limiting (prevent quota-exceeded)
if (chunkIndex < chunks.length - 1) {
await delay(2000); // 2 seconds between chunks
}
} catch (error) {
console.error(`Chunk ${chunkIndex + 1} failed completely:`, error);
// Save error logs for entire chunk
for (const message of messages) {
const log = new PushNotificationLog({
job_id: jobData.jobId,
member_seq: await this.getMemberSeq(message.token),
push_token: message.token,
title: jobData.title,
content: jobData.content,
sent_at: new Date(),
chunk_index: chunkIndex,
is_success: false,
error_code: 'CHUNK_FAILURE',
error_message: error.message,
error_type: 'other',
is_dry_run: false,
});
await this.pushNotificationLog.save(log);
}
totalFailed += chunk.length;
}
}
// Calculate final delivery rate
const deliveryRate = totalSuccess > 0
? parseFloat(((totalSuccess / (totalSuccess + totalFailed)) * 100).toFixed(2))
: 0;
console.log(`
========== SEND COMPLETE ==========
Total tokens: ${(totalSuccess + totalFailed).toLocaleString()}
✅ Delivered: ${totalSuccess.toLocaleString()} (${deliveryRate}%)
❌ Failed: ${totalFailed.toLocaleString()}
Error breakdown:
- Invalid tokens: ${errorStats.invalid_token.toLocaleString()}
- Temporary errors: ${errorStats.temporary.toLocaleString()}
- Quota errors: ${errorStats.quota.toLocaleString()}
- Other errors: ${errorStats.other.toLocaleString()}
===================================
`);
return {
success: true,
totalTokens: totalSuccess + totalFailed,
deliveredCount: totalSuccess,
failedCount: totalFailed,
deliveryRate,
errorStats,
};
}
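The service above leans on two helpers it never defines, chunkArray and delay. Minimal implementations look like this (my sketch; the post's own versions may differ):

```typescript
// Split a token list into FCM-sized batches (sendEachForMulticast caps at 500 tokens).
export function chunkArray<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Promise-based sleep used for the 2-second pause between chunks.
export function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```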
Calculating delivery metrics: beyond simple success rates
With detailed error classification, we calculate multiple metrics:
// fcm-error-classifier.ts
export interface FcmErrorStats {
total: number;
success: number;
invalidToken: number;
temporary: number;
quota: number;
other: number;
deliveryRate: number;
successRate: number;
retryableRate: number;
}
export function calculateErrorStats(
logs: PushNotificationLog[]
): FcmErrorStats {
const total = logs.length;
let success = 0;
let invalidToken = 0;
let temporary = 0;
let quota = 0;
let other = 0;
for (const log of logs) {
if (log.is_success) {
success++;
} else {
switch (log.error_type) {
case 'invalid_token': invalidToken++; break;
case 'temporary': temporary++; break;
case 'quota': quota++; break;
default: other++; break;
}
}
}
// ⭐ DELIVERY RATE
// = Successfully delivered / Total attempted
const deliveryRate = total > 0
? parseFloat(((success / total) * 100).toFixed(2))
: 0;
// ⭐ SUCCESS RATE (same as delivery rate for actual sends)
const successRate = deliveryRate;
// ⭐ RETRYABLE RATE
// = Tokens that could succeed on retry / Total attempted
const retryableRate = total > 0
? parseFloat((((temporary + quota) / total) * 100).toFixed(2))
: 0;
return {
total,
success,
invalidToken,
temporary,
quota,
other,
deliveryRate,
successRate,
retryableRate,
};
}
Example output:
const stats = calculateErrorStats(productionLogs);
console.log(`
📊 Delivery Analysis:
======================
Total Attempted: ${stats.total.toLocaleString()}
✅ Delivered: ${stats.success.toLocaleString()}
❌ Invalid Tokens: ${stats.invalidToken.toLocaleString()}
⏳ Temporary Errors: ${stats.temporary.toLocaleString()}
🚫 Quota Errors: ${stats.quota.toLocaleString()}
❓ Other Errors: ${stats.other.toLocaleString()}
📈 Key Metrics:
- Delivery Rate: ${stats.deliveryRate}%
- Retryable Rate: ${stats.retryableRate}%
- Permanent Failure Rate: ${((stats.invalidToken / stats.total) * 100).toFixed(2)}%
`);
Sample output:
📊 Delivery Analysis:
======================
Total Attempted: 500,000
✅ Delivered: 401,000
❌ Invalid Tokens: 87,000
⏳ Temporary Errors: 11,000
🚫 Quota Errors: 800
❓ Other Errors: 200
📈 Key Metrics:
- Delivery Rate: 80.2%
- Retryable Rate: 2.4%
- Permanent Failure Rate: 17.4%
Retry logic: recovering from temporary failures
Temporary errors (unavailable, timeout) often succeed on retry. Here's our retry strategy:
// fcm-retry.utils.ts
export async function sendEachWithRetry(
messaging: admin.messaging.Messaging,
messages: admin.messaging.Message[],
isDryRun: boolean,
retryConfig: {
maxRetries: number;
initialDelayMs: number;
maxDelayMs: number;
}
): Promise<admin.messaging.BatchResponse> {
const { maxRetries, initialDelayMs, maxDelayMs } = retryConfig;
let attempt = 0;
let lastError: Error | null = null;
while (attempt <= maxRetries) {
try {
// Attempt send
const response = await messaging.sendEach(messages, isDryRun);
// If no failures, return immediately
if (response.failureCount === 0) {
return response;
}
// Check if failures are retryable
const retryableIndices: number[] = [];
response.responses.forEach((resp, idx) => {
if (!resp.success) {
const errorType = classifyFcmError(resp.error?.code);
if (errorType === 'temporary' || errorType === 'quota') {
retryableIndices.push(idx);
}
}
});
// If no retryable failures, return current response
if (retryableIndices.length === 0) {
console.log(`No retryable errors. Returning after attempt ${attempt + 1}`);
return response;
}
// Last attempt - return as-is
if (attempt === maxRetries) {
console.log(`Max retries (${maxRetries}) reached. Returning with ${retryableIndices.length} remaining failures.`);
return response;
}
// Prepare retry batch
const retryMessages = retryableIndices.map(idx => messages[idx]);
// Exponential backoff
const delayMs = Math.min(
initialDelayMs * Math.pow(2, attempt),
maxDelayMs
);
console.log(`
Attempt ${attempt + 1}/${maxRetries + 1}:
- Retryable failures: ${retryableIndices.length}
- Waiting ${delayMs}ms before retry
`);
await delay(delayMs);
// Retry just the failed messages
const retryResponse = await messaging.sendEach(retryMessages, isDryRun);
// Merge retry results back into original response
retryResponse.responses.forEach((retryResp, idx) => {
const originalIdx = retryableIndices[idx];
response.responses[originalIdx] = retryResp;
});
// Recalculate success/failure counts
response.successCount = response.responses.filter(r => r.success).length;
response.failureCount = response.responses.filter(r => !r.success).length;
console.log(`
After retry:
- Success: ${response.successCount}
- Failed: ${response.failureCount}
`);
// If all succeeded after retry, return
if (response.failureCount === 0) {
return response;
}
// Otherwise, continue to next retry attempt
attempt++;
} catch (error) {
console.error(`Attempt ${attempt + 1} threw exception:`, error);
lastError = error;
attempt++;
if (attempt <= maxRetries) {
const delayMs = Math.min(
initialDelayMs * Math.pow(2, attempt - 1),
maxDelayMs
);
await delay(delayMs);
}
}
}
// All retries failed
throw lastError || new Error('All retry attempts failed');
}
Retry configuration:
// Production sends
const response = await sendEachWithRetry(
messaging,
messages,
false, // isDryRun = false (actual send)
{
maxRetries: 3, // Try up to 3 times
initialDelayMs: 1000, // Start with 1 second
maxDelayMs: 5000, // Cap at 5 seconds
}
);
// Results:
// - Attempt 1: Instant
// - Attempt 2: After 1 second (2^0 * 1000ms)
// - Attempt 3: After 2 seconds (2^1 * 1000ms)
// - Attempt 4: After 4 seconds (2^2 * 1000ms, but capped at 5s)
Retry success rates (our production data):
Error Type | 1st Retry | 2nd Retry | 3rd Retry | Overall
--------------------|-----------|-----------|-----------|--------
temporary | 78% | 15% | 4% | 97%
quota | 95% | 4% | 1% | 100%
invalid_token | 0% | 0% | 0% | 0%
other | 45% | 20% | 10% | 75%
Key insight: 97% of temporary errors eventually succeed with retry!
Tracking retry results in the database
We track both initial send and retry outcomes:
// After initial send
const initialLog = new PushNotificationLog({
// ... basic fields ...
is_success: false,
error_type: 'temporary',
retry_count: 0,
retry_success: false,
});
await repository.save(initialLog);
// After successful retry
if (retryResponse.success) {
await repository.update(
{ id: initialLog.id },
{
retry_count: attemptNumber,
retry_success: true,
// Note: is_success stays false - original send failed
}
);
console.log(`Token ${token} succeeded on retry ${attemptNumber}`);
}
Query retry effectiveness:
-- Retry success rate by error type
SELECT
error_type,
COUNT(*) as total_failures,
SUM(CASE WHEN retry_success = 1 THEN 1 ELSE 0 END) as retry_successes,
CAST(SUM(CASE WHEN retry_success = 1 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS DECIMAL(5,2)) as retry_success_rate
FROM push_notification_log
WHERE is_success = 0
AND retry_count > 0
AND sent_at >= DATEADD(month, -1, GETDATE())
GROUP BY error_type
ORDER BY total_failures DESC;
Example result:
error_type | total_failures | retry_successes | retry_success_rate
---------------|----------------|-----------------|-------------------
temporary | 45,230 | 43,875 | 97.0%
quota | 2,180 | 2,180 | 100.0%
other | 1,520 | 1,140 | 75.0%
invalid_token | 87,430 | 0 | 0.0%
API endpoint: real-time delivery statistics
We expose delivery metrics via REST API:
// GET /api/campaigns/:jobId/delivery-stats
async getDeliveryStats(jobId: string): Promise<FcmErrorStats & { retryAttempts: number; retrySuccesses: number; retrySuccessRate: number }> {
try {
const logs = await this.pushNotificationLog.find({
where: { job_id: jobId },
select: ['is_success', 'error_code', 'error_type', 'retry_count', 'retry_success'], // retry_count feeds the retry analysis below
});
if (logs.length === 0) {
throw new NotFoundException(`No logs found for campaign: ${jobId}`);
}
const stats = calculateErrorStats(logs);
// Additional retry analysis
const retriedLogs = logs.filter(log => log.retry_count > 0);
const retrySuccessCount = retriedLogs.filter(log => log.retry_success).length;
const retrySuccessRate = retriedLogs.length > 0
? parseFloat(((retrySuccessCount / retriedLogs.length) * 100).toFixed(2))
: 0;
console.log(`[getDeliveryStats] Campaign: ${jobId}`);
console.log(` Delivery Rate: ${stats.deliveryRate}%`);
console.log(` Retry Success Rate: ${retrySuccessRate}%`);
return {
...stats,
retryAttempts: retriedLogs.length,
retrySuccesses: retrySuccessCount,
retrySuccessRate,
};
} catch (error) {
console.error(`[getDeliveryStats] Error:`, error);
throw error;
}
}
Response example:
GET /api/campaigns/production-blackfriday-2025/delivery-stats
{
"total": 500000,
"success": 401000,
"invalidToken": 87000,
"temporary": 11000,
"quota": 800,
"other": 200,
"deliveryRate": 80.2,
"successRate": 80.2,
"retryableRate": 2.4,
"retryAttempts": 11800,
"retrySuccesses": 11450,
"retrySuccessRate": 97.0
}
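On the consuming side, a monitoring job can turn this payload into a go/no-go signal. A hedged sketch follows; the 85% target and the verdict labels are my assumptions, not part of the API above:

```typescript
// Subset of the /delivery-stats payload the check needs.
interface DeliveryStats {
  deliveryRate: number;  // e.g. 80.2
  retryableRate: number; // e.g. 2.4 (temporary + quota failures)
}

export type Verdict = 'healthy' | 'retry_recommended' | 'investigate';

// If the retryable failures could lift the campaign over the target,
// a retry pass is more useful than paging someone.
export function assessCampaign(stats: DeliveryStats, targetRate = 85): Verdict {
  if (stats.deliveryRate >= targetRate) return 'healthy';
  if (stats.deliveryRate + stats.retryableRate >= targetRate) return 'retry_recommended';
  return 'investigate';
}
```

For the sample payload above (80.2% delivered, 2.4% retryable), even a perfect retry pass tops out at 82.6%, so the verdict is 'investigate': the gap is dominated by invalid tokens, which retries cannot fix.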
Automated token cleanup: removing dead tokens
After each campaign, we automatically clean up invalid tokens:
// Runs after campaign completes
async function cleanupInvalidTokens(jobId: string) {
console.log(`[Cleanup] Starting for campaign: ${jobId}`);
// Get all permanently invalid tokens
const invalidLogs = await pushNotificationLog.find({
where: {
job_id: jobId,
error_type: 'invalid_token', // Permanent failures only
},
select: ['member_seq', 'push_token', 'error_code'],
});
console.log(`[Cleanup] Found ${invalidLogs.length} invalid tokens`);
if (invalidLogs.length === 0) {
console.log(`[Cleanup] Nothing to clean up`);
return;
}
// Batch update member table
const batchSize = 1000;
let updated = 0;
for (let i = 0; i < invalidLogs.length; i += batchSize) {
const batch = invalidLogs.slice(i, i + batchSize);
const memberSeqs = batch.map(log => log.member_seq);
await memberRepository
.createQueryBuilder()
.update(Member)
.set({
push_token_valid: false,
push_token_invalidated_at: () => 'GETDATE()',
push_token_invalid_reason: 'fcm_invalid_token',
})
.whereInIds(memberSeqs)
.execute();
updated += batch.length;
console.log(`[Cleanup] Updated ${updated}/${invalidLogs.length}`);
}
// Log cleanup event
await cleanupEventLog.save({
campaign_job_id: jobId,
tokens_invalidated: invalidLogs.length,
executed_at: new Date(),
});
console.log(`[Cleanup] ✅ Complete - ${invalidLogs.length} tokens marked invalid`);
}
Database schema for token validity:
ALTER TABLE member
ADD push_token_valid BIT DEFAULT 1,
push_token_invalidated_at DATETIME2 NULL,
push_token_invalid_reason VARCHAR(50) NULL;
-- Index for fast filtering
CREATE INDEX idx_member_valid_tokens
ON member(push_token_valid, push_token)
WHERE push_token IS NOT NULL;
Future queries automatically exclude invalid tokens:
// ✅ Only query valid tokens
const tokens = await memberRepository
.createQueryBuilder('m')
.where('m.push_token_valid = 1') // Valid tokens only
.andWhere('m.push_token IS NOT NULL')
.select(['m.push_token', 'm.seq'])
.getMany();
console.log(`Found ${tokens.length} valid tokens (invalid tokens excluded)`);
Measuring improvement over time
With consistent delivery tracking, we can measure token health trends:
-- Monthly delivery rate trends
SELECT
FORMAT(sent_at, 'yyyy-MM') as month,
COUNT(*) as total_sends,
SUM(CASE WHEN is_success = 1 THEN 1 ELSE 0 END) as successful_sends,
CAST(SUM(CASE WHEN is_success = 1 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS DECIMAL(5,2)) as delivery_rate
FROM push_notification_log
WHERE is_dry_run = 0 -- Production sends only
AND sent_at >= DATEADD(month, -6, GETDATE())
GROUP BY FORMAT(sent_at, 'yyyy-MM')
ORDER BY month;
Our 6-month trend (with automated cleanup):
Month | Total Sends | Successful | Delivery Rate
---------|-------------|------------|---------------
2025-01 | 25,000,000 | 17,500,000 | 70.0%
2025-02 | 28,000,000 | 21,280,000 | 76.0%
2025-03 | 30,000,000 | 24,000,000 | 80.0%
2025-04 | 32,000,000 | 26,560,000 | 83.0%
2025-05 | 35,000,000 | 29,750,000 | 85.0%
2025-06 | 38,000,000 | 32,680,000 | 86.0%
Improvement: 70% → 86% delivery rate (+16 percentage points)
Reason: Automated cleanup after each campaign
Production metrics: 50M monthly notifications
Current performance (June 2025):
Monthly Volume: 50,000,000 notifications
Delivery Rate: 86.0%
Retry Success Rate: 97.0%
Error Breakdown:
- Invalid tokens: 12.0% (down from 30% in Jan)
- Temporary errors: 1.8%
- Quota errors: 0.1%
- Other errors: 0.1%
Time Saved (vs. no cleanup):
- 18 percentage points fewer invalid tokens per campaign (30% → 12%)
- ~12 minutes saved per campaign
- 50 campaigns/month × 12 min = 600 min/month = 10 hours/month
Cost savings (compared to January baseline):
Before cleanup automation (Jan 2025):
- Invalid token rate: 30%
- Monthly invalid sends: 15,000,000
- Server time wasted: 40 hours/month
- DB operations wasted: 15M writes/month
After cleanup automation (Jun 2025):
- Invalid token rate: 12%
- Monthly invalid sends: 6,000,000
- Server time wasted: 16 hours/month
- DB operations wasted: 6M writes/month
Savings:
- Server time: 24 hours/month × $0.10/min × 60 = $144/month = $1,728/year
- DB operations: 9M writes/month × $0.0001 = $900/month = $10,800/year
- Total: $12,528/year
Key takeaways
1. Measure actual delivery, not just attempts
// ❌ Misleading metric
console.log(`Sent to ${totalAttempts} users!`);
// ✅ Accurate metric
console.log(`Delivered to ${actualSuccess} devices (${deliveryRate}%)`);
2. Classify errors for actionable insights
- Invalid tokens: Remove immediately
- Temporary errors: Retry with backoff
- Quota errors: Wait and retry
- Other errors: Investigate and classify
3. Retry temporary failures (97% eventually succeed)
const response = await sendEachWithRetry(messaging, messages, false, {
maxRetries: 3,
initialDelayMs: 1000,
maxDelayMs: 5000,
});
4. Automate token cleanup after every campaign
- Improves delivery rate over time (70% → 86% in 6 months)
- Reduces wasted processing (24 hours/month saved)
- Maintains database health automatically
5. Track trends to measure improvement
SELECT FORMAT(sent_at, 'yyyy-MM') AS month,
AVG(delivery_rate) AS avg_delivery_rate
FROM monthly_delivery_stats
GROUP BY FORMAT(sent_at, 'yyyy-MM')
ORDER BY month;
6. Store everything for forensic analysis
- Success/failure for every token
- Error codes and classifications
- Retry attempts and results
- Timestamp for time-series analysis
When to use actual delivery measurement vs dry-run validation
Use both in sequence:
// Phase 1: Dry-run validation (pre-send health check)
const validation = await sendConditionalNotifications({
...campaignData,
jobId: 'dryrun-campaign-123',
limit: 10000, // Sample
isDryRun: true,
});
console.log(`Predicted delivery rate: ${validation.deliveryRate}%`);
// Phase 2: Actual send (ground truth measurement)
const production = await sendConditionalNotifications({
...campaignData,
jobId: 'production-campaign-123',
limit: undefined, // Full send
isDryRun: false,
});
console.log(`Actual delivery rate: ${production.deliveryRate}%`);
console.log(`Prediction gap: ${Math.abs(validation.deliveryRate - production.deliveryRate).toFixed(1)} percentage points`);
Dry-run (validation):
- ✅ Fast (2 minutes for 10K sample)
- ✅ Zero user impact
- ✅ Predicts delivery rate
- ❌ Not 100% accurate (time gap between test and send)
Actual send (delivery measurement):
- ✅ 100% accurate (ground truth)
- ✅ Reveals real-world issues (network, devices, timing)
- ✅ Enables retry logic
- ❌ Slower (60+ minutes for 500K)
- ❌ User impact (notifications sent)
Best practice: Use both
- Dry-run validation before large campaigns (risk mitigation)
- Actual delivery measurement for all sends (ground truth)
- Compare validation vs delivery for continuous improvement