Jack Morris

Why Batch CDR Processing Breaks Your VoIP Billing at Scale

If your VoIP billing system processes CDRs in overnight batches, it has a ceiling. Where that ceiling sits depends on your traffic, your customer mix, and your risk tolerance, but it exists, and you'll hit it.

I've watched this break in production three times. Here's the pattern and what works instead.

The batch problem
Most off-the-shelf VoIP billing software still uses a batch model under the hood. CDRs accumulate during the day. Overnight, a job rates them, updates balances, generates invoices, and triggers fraud alerts.

That worked when call volumes were modest and pricing decisions could wait until morning. At anything above 1M CDRs per day with prepaid customers or wholesale partners, the model produces specific failure modes:

  • Prepaid customers complete calls past their balance because the system doesn't know they ran out until tomorrow
  • Fraud patterns run for 12–18 hours before detection
  • Margin alerts fire long after an unprofitable route has already cost real money
  • Reconciliation reports take longer to run than the daily reporting window allows

The first time I saw a prepaid balance leak cost an operator $40K over a single weekend, I stopped recommending batch architectures for serious VoIP billing.

What real-time looks like

The architecture that holds up is event-driven. Every CDR enters a stream the moment the switch writes it. Rating, balance updates, fraud checks, and margin calculations happen in the pipeline before the CDR is persisted.

Rough shape:

[Switch CDR] → [Event Queue] → [Rating Worker] → [Balance/Fraud/Margin] → [Persist + Notify]
                                      ↓
                              [Real-time Alert Stream]
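
The shape above can be sketched in-process to make the flow concrete. This is a toy, not the production design: `queue.Queue` stands in for the Kafka/Redpanda stream, threads stand in for the rating worker pool, and `CdrEvent`, the flat rate, and the `long_call` alert threshold are all hypothetical names and values invented for illustration.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class CdrEvent:
    """Hypothetical event shape; field names mirror the cdr_raw columns."""
    cdr_id: int
    customer_id: int
    duration_ms: int
    dst: str


def rating_worker(events, decisions, alerts, lock):
    """Pull events until a None sentinel arrives, rate each, emit results."""
    while True:
        event = events.get()
        if event is None:  # sentinel: this worker shuts down
            break
        # Placeholder rate logic (assumption): 1 micro-unit per full second.
        cost_micros = event.duration_ms // 1000
        with lock:
            decisions.append((event.cdr_id, cost_micros))
            if cost_micros > 3600:  # hypothetical alert threshold
                alerts.append(("long_call", event.customer_id))


def run_pipeline(cdrs, worker_count=3):
    """Fan CDR events out to a pool of workers; return decisions and alerts."""
    events = queue.Queue()
    decisions, alerts = [], []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=rating_worker,
                         args=(events, decisions, alerts, lock))
        for _ in range(worker_count)
    ]
    for worker in workers:
        worker.start()
    for cdr in cdrs:
        events.put(cdr)
    for _ in workers:  # one sentinel per worker
        events.put(None)
    for worker in workers:
        worker.join()
    return decisions, alerts
```

The workers hold no state between events, which is what lets the pool scale by adding workers. In a real deployment the shared lists would be replaced by the persistence layer and the alert stream.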

Components I've found to work at scale:

  • Kafka or Redpanda for the CDR event stream (NATS works for smaller volume)
  • A stateless rating worker pool (Go or Rust) that pulls events, applies rate logic, emits decisions
  • Redis for hot balance lookups (sub-millisecond reads, periodic snapshots to Postgres)
  • A rules engine for fraud patterns. We used a simple DSL evaluated at runtime; OPA also works
  • Postgres for the system of record with CDR tables partitioned by date
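
One detail the hot balance store has to get right is that the sufficiency check and the debit happen as a single atomic step; otherwise two concurrent workers can both pass the check and overdraw the balance. Below is a minimal in-memory sketch of that contract. `BalanceStore` and its methods are hypothetical, a lock stands in for whatever atomic primitive the real store provides (with Redis, that would typically be a server-side script or transaction, not shown here), and balances are held as integer micro-units as an assumption to avoid float rounding.

```python
import threading


class BalanceStore:
    """In-memory stand-in for the hot balance store (integer micro-units)."""

    def __init__(self):
        self._balances = {}
        self._lock = threading.Lock()

    def credit(self, customer_id, amount_micros):
        """Add funds to a customer's balance."""
        with self._lock:
            current = self._balances.get(customer_id, 0)
            self._balances[customer_id] = current + amount_micros

    def try_debit(self, customer_id, amount_micros):
        """Debit only if funds cover it; check and write share one lock."""
        with self._lock:
            current = self._balances.get(customer_id, 0)
            if current < amount_micros:
                return False  # insufficient funds, balance unchanged
            self._balances[customer_id] = current - amount_micros
            return True

    def balance(self, customer_id):
        """Return the current balance, or 0 for an unknown customer."""
        with self._lock:
            return self._balances.get(customer_id, 0)
```

A separate "read balance, then write balance" pair of calls would not give the same guarantee, which is why the sketch exposes `try_debit` as one operation.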

End-to-end latency target is 50ms or less for credit checks and balance updates. We hit roughly 18ms average on a deployment doing 8M CDRs per day across three rating workers.
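
As a rough sanity check on those numbers, the arithmetic below compares the average arrival rate with what three workers could absorb. It rests on two simplifying assumptions the figures above do not state: traffic is spread evenly over the day, and each worker handles one CDR at a time. Real traffic peaks well above the daily average and real workers usually process concurrently, so treat this as a back-of-envelope estimate only.

```python
CDRS_PER_DAY = 8_000_000
WORKERS = 3
AVG_LATENCY_MS = 18
SECONDS_PER_DAY = 86_400

# Average arrival rate if traffic were perfectly flat (assumption).
avg_arrival_per_sec = CDRS_PER_DAY / SECONDS_PER_DAY

# Throughput if each worker handled CDRs strictly one at a time (assumption).
per_worker_per_sec = 1000 / AVG_LATENCY_MS
pool_capacity_per_sec = WORKERS * per_worker_per_sec

utilisation = avg_arrival_per_sec / pool_capacity_per_sec

print(f"average arrivals: {avg_arrival_per_sec:.1f} CDRs/s")
print(f"pool capacity:    {pool_capacity_per_sec:.1f} CDRs/s")
print(f"utilisation:      {utilisation:.0%}")
```

Under those assumptions the pool sees roughly 93 CDRs per second against a capacity of about 167 per second, a little over half utilised on average. The remaining headroom is what absorbs busy-hour peaks.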

The schema decision that mattered
The architectural decision that defined how well the system scaled was separating the rating decision from the CDR record.

Old pattern (batch-friendly):

CREATE TABLE cdr (
  id BIGSERIAL,
  start_time TIMESTAMPTZ,
  duration_ms INT,
  src VARCHAR,
  dst VARCHAR,
  cost NUMERIC(10,6),  -- updated by batch job
  customer_id BIGINT,
  rated BOOLEAN DEFAULT FALSE
);

New pattern (event-driven):

CREATE TABLE cdr_raw (
  id BIGSERIAL,
  start_time TIMESTAMPTZ,
  duration_ms INT,
  src VARCHAR,
  dst VARCHAR,
  customer_id BIGINT,
  ingested_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE rating_decision (
  cdr_id BIGINT,
  rate_id BIGINT,
  cost NUMERIC(10,6),
  margin NUMERIC(10,6),
  rated_at TIMESTAMPTZ,
  decision_latency_ms INT
);

Separating the rating decision means you can re-rate calls without touching the raw CDR. Disputes get resolved by adding a new rating decision, not updating history. Audit trails are automatic. Re-runs after rate corrections are trivial.
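
To illustrate the append-only idea, here is a small sketch using Python's built-in `sqlite3` as a stand-in for Postgres. The tables are trimmed versions of the ones above, costs are stored as integer micro-units (`cost_micros`, an assumption made to sidestep SQLite's numeric handling), and the CDR, rate ids, costs, and timestamps are made-up sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cdr_raw (
  id INTEGER PRIMARY KEY,
  duration_ms INTEGER,
  dst TEXT,
  customer_id INTEGER
);
CREATE TABLE rating_decision (
  cdr_id INTEGER,
  rate_id INTEGER,
  cost_micros INTEGER,
  rated_at TEXT
);
""")

# One raw CDR and its original rating decision (sample values).
conn.execute("INSERT INTO cdr_raw VALUES (1, 120000, '442071234567', 42)")
conn.execute(
    "INSERT INTO rating_decision VALUES (1, 10, 24000, '2024-01-01T00:00:01Z')"
)

# A rate correction appends a new decision; nothing is updated in place.
conn.execute(
    "INSERT INTO rating_decision VALUES (1, 11, 20000, '2024-01-03T09:30:00Z')"
)


def current_decision(conn, cdr_id):
    """Return (rate_id, cost_micros) of the most recent decision for a CDR."""
    return conn.execute(
        "SELECT rate_id, cost_micros FROM rating_decision "
        "WHERE cdr_id = ? ORDER BY rated_at DESC LIMIT 1",
        (cdr_id,),
    ).fetchone()


def decision_history(conn, cdr_id):
    """Return every decision for a CDR, oldest first: the audit trail."""
    return conn.execute(
        "SELECT rate_id, cost_micros FROM rating_decision "
        "WHERE cdr_id = ? ORDER BY rated_at",
        (cdr_id,),
    ).fetchall()
```

Billing reads `current_decision`, auditors read `decision_history`, and `cdr_raw` is never written to after ingestion.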

The batch-friendly schema looked simpler at first. It made everything harder later.

What the rating worker actually does
Pseudo-flow for a single CDR:

def rate_cdr(cdr_event):
    customer = get_customer_cached(cdr_event.customer_id)

    # Real-time balance check (prepaid)
    if customer.is_prepaid:
        if not balance_sufficient(customer, estimated_cost(cdr_event)):
            emit_alert("balance_exhausted", customer.id)
            return reject(cdr_event)

    # Apply rating
    rate = lookup_rate(customer.plan_id, cdr_event.dst)
    cost = calculate_cost(cdr_event.duration_ms, rate)

    # Margin check (wholesale); margin stays None for other customers
    margin = None
    if customer.is_wholesale:
        margin = cost - termination_cost(cdr_event)
        if margin < customer.min_margin:
            emit_alert("low_margin", customer.id, route=cdr_event.dst)

    # Fraud check
    if matches_fraud_pattern(cdr_event):
        emit_alert("fraud_suspected", customer.id)

    # Persist decision, including margin so rating_decision.margin is filled
    persist_rating_decision(cdr_event.id, rate.id, cost, margin)
    update_balance(customer.id, -cost)

    return accept(cdr_event)

The whole flow typically runs in 15–25ms. The worst case is 80ms, when the rate lookup misses Redis and hits Postgres.
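
The `calculate_cost` step is left abstract in the flow above. A minimal sketch of one way it might look follows, assuming a per-minute rate, per-millisecond proration with no minimum charge or billing increment, and rounding to six decimal places to match the `NUMERIC(10,6)` cost column. Those billing rules are assumptions for illustration; real rate plans usually add increments and minimums. Unlike the flow above, this version takes the rate as a plain `Decimal` rather than a rate object.

```python
from decimal import Decimal, ROUND_HALF_UP

MS_PER_MINUTE = Decimal(60_000)
SIX_PLACES = Decimal("0.000001")  # matches the NUMERIC(10,6) cost column


def calculate_cost(duration_ms: int, rate_per_minute: Decimal) -> Decimal:
    """Prorate a per-minute rate over the call duration, rounded to 6 dp."""
    if duration_ms < 0:
        raise ValueError("duration_ms must be non-negative")
    minutes = Decimal(duration_ms) / MS_PER_MINUTE
    cost = minutes * rate_per_minute
    return cost.quantize(SIX_PLACES, rounding=ROUND_HALF_UP)
```

For example, a 90-second call at 0.012 per minute comes out to `Decimal("0.018000")`. Using `Decimal` instead of floats keeps the rounding behaviour explicit, which matters once costs are summed across millions of CDRs.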

What I'd skip building
A few things I was tempted to build turned out to be mistakes:

  • Custom rule DSLs for everything: Use a real expression language (CEL, JSONLogic, an embedded scripting language). Custom DSLs become technical debt within a year.
  • Per-tenant Redis clusters: Single Redis with proper key prefixing scales further than you think. Tenant isolation belongs at the application layer.
  • Real-time invoicing: Real-time rating decisions, yes. Invoice generation is fine as a daily job. Don't conflate them.
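
On the key-prefixing point, application-layer isolation can be as simple as routing every key through one builder function so no code path can construct an unscoped key. The sketch below is one hypothetical convention; the `tenant:<id>:<kind>:<entity>` layout and the `tenant_key` name are assumptions, not something the deployment described above is known to use.

```python
def tenant_key(tenant_id: str, kind: str, entity_id: int) -> str:
    """Build a tenant-scoped key such as 'tenant:acme:balance:42'."""
    for part in (tenant_id, kind):
        # Reject empty parts and the separator so one tenant's keys
        # can never collide with, or spill into, another tenant's namespace.
        if not part or ":" in part:
            raise ValueError(f"invalid key component: {part!r}")
    return f"tenant:{tenant_id}:{kind}:{entity_id}"
```

Calling `tenant_key("acme", "balance", 42)` yields `"tenant:acme:balance:42"`, and every balance or rate-cache read would go through the same function.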

The decision that comes up

At some point, every operator running batch-based VoIP billing has to decide whether to fix the current platform or build something that handles real-time decisioning from the ground up. Off-the-shelf platforms are getting better at it. Most still struggle past 2M CDRs per day with mixed retail and wholesale.

For teams that need real-time rating and don't have the in-house specialist depth, a build engagement with a telecom-billing specialist tends to land faster than trying to retrofit a generic platform. The custom VoIP billing software solutions work at Hire VoIP Developer covers this kind of build-and-handover for operators sitting at exactly that decision point.

Real-time CDR processing isn't exotic anymore. It's just the architecture VoIP billing needs at scale. If you're still running overnight batch, you're either small enough that it doesn't matter, or large enough that it already does.
