I spent the last few weeks building MeterFlow — a usage-based billing engine that handles event ingestion, deduplication, aggregation, fraud detection, tiered pricing, and Stripe invoice generation.
This post walks through the technical decisions behind each component, including how the architecture maps to a production AWS deployment.
Why Build This?
I work on subscription infrastructure at my day job — Stripe integrations, webhook handlers, entitlement APIs. But I wanted to understand how billing platforms like Lago, Metronome, and Stripe Billing work internally. Not just calling the API, but building the metering and pricing layer myself.
MeterFlow covers the full lifecycle:
Events → Dedup → Store → Aggregate → Price → Invoice → Stripe
Stack: TypeScript, Fastify, Redis, ClickHouse, MinIO (S3-compatible), Docker Compose.
Architecture
┌──────────────┐
│ Fastify │
│ API │
│ │
│ POST /events │──┬──────────────────────┐
│ GET /usage │ │ │
└───────────────┘ │ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Redis │ │ MinIO │
│ │ │ (S3 backup) │
│ • Dedup (NX) │ │ │
│ • Rate Limit │ │ Raw events │
│ • Auth keys │ │ (append-only) │
│ • Fraud bases │ └───────────────┘
└───────┬───────┘
│
▼
┌──────────────┐ ┌──────────────┐
│ ClickHouse │ │ Stripe │
│ │ │ │
│ • Events │ │ Draft invoice │
│ • Aggregation │ │ Line items │
│ • Analytics │ │ Finalize+send │
└───────────────┘ └───────────────┘
Events come in through the API, get deduplicated via Redis, stored in ClickHouse for analytics, and backed up to S3. When billing runs, the system aggregates usage, applies pricing rules, and builds Stripe invoice payloads.
Every component was chosen with a clear production equivalent in mind — MinIO maps to S3, Redis to ElastiCache, the Fastify server to API Gateway + Lambda. More on that in the production architecture section below.
1. Event Ingestion & Deduplication
Billing systems can't double-count. If a client retries a request, we need to reject the duplicate without rejecting new events.
The approach: use Redis SET NX (set-if-not-exists) with a 30-day TTL. The transaction ID is the key.
const key = `dedup:${transaction_id}`;
const result = await redis.set(key, '1', 'EX', 2592000, 'NX');
// 'OK' → new event (accepted)
// null → duplicate (rejected)
This is atomic: if two identical events hit Redis simultaneously, only one wins. No race condition, no distributed lock needed.
The 30-day TTL matches the validation window. Events older than 30 days are rejected by business logic anyway, so dedup keys auto-expire.
For batch ingestion (up to 1,000 events/request), I pipeline the Redis calls so the entire batch is one round-trip. The API validates each event's schema, checks for required fields (customer_id, event_type, transaction_id, timestamp), and rejects events with timestamps outside the 30-day window before they even reach Redis.
Accepted events are then written to both ClickHouse (for querying) and MinIO/S3 (as an append-only backup). The S3 backup is organized by date (events/YYYY-MM-DD/batch_timestamp.json), giving you a full audit trail that's independent of the analytics store.
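To make the dedup contract concrete, here is a minimal in-memory stand-in for the Redis SET NX + TTL behavior. This is an illustration only (a Map can't give you atomicity across processes, which is exactly why Redis is used), but it captures the semantics: first write wins, retries within the TTL are rejected, and keys expire after the 30-day window.

```typescript
// In-memory stand-in for Redis `SET key '1' EX <ttl> NX` -- illustrative only.
class DedupStore {
  private keys = new Map<string, number>(); // key -> expiry (ms since epoch)

  // Returns true if the event is new (accepted), false if it's a duplicate.
  accept(
    transactionId: string,
    nowMs: number,
    ttlMs = 30 * 24 * 3600 * 1000, // 30-day dedup window
  ): boolean {
    const key = `dedup:${transactionId}`;
    const expiry = this.keys.get(key);
    if (expiry !== undefined && expiry > nowMs) return false; // duplicate
    this.keys.set(key, nowMs + ttlMs); // equivalent of SET ... EX ... NX
    return true;
  }
}
```

The Redis version gets the same first-write-wins guarantee for free because SET NX is a single atomic command on the server.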
2. Rate Limiting with Sorted Sets
A fixed-window counter has a boundary problem: 200 requests at 0:59 and 200 more at 1:01 each pass their respective windows, but that's 400 requests in two seconds.
I used Redis sorted sets for a true sliding window:
const key = `ratelimit:${customer_id}`;
const now = Date.now();
const windowStart = now - 60000; // 60-second window
const results = await redis.pipeline()
.zremrangebyscore(key, 0, windowStart) // Remove old entries
.zadd(key, now, `${now}:${uuid}`) // Add current request
.zcard(key) // Count in window
.expire(key, 120) // Safety TTL
.exec();
const count = results![2][1] as number; // ZCARD result: requests in the window
Each request is a member with its timestamp as the score. To check the limit, remove anything older than 60 seconds, count what's left. The response includes X-RateLimit-Remaining so clients know where they stand.
In production, this pipeline approach could allow slight over-counting under high concurrency. A Lua script wrapping the same sorted set logic (ZREMRANGEBYSCORE → ZADD → ZCARD → EXPIRE) would execute atomically on the Redis server. You'd also want two layers: API Gateway throttling for coarse IP-based protection, and the application layer for fine-grained per-customer limits tied to billing tiers.
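The sorted-set accounting is easy to reason about once you see it as plain data manipulation. Here's the same remove-old / add-current / count sequence as a pure in-memory sketch (illustrative; in the real system this state lives in Redis, and the atomic version would run the sequence inside a Lua script on the server):

```typescript
// Pure in-memory sketch of the sliding-window logic the sorted set implements.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>(); // customer -> request timestamps (ms)

  // Records the request and returns remaining quota, or -1 if over the limit.
  check(customerId: string, nowMs: number, limit = 200, windowMs = 60_000): number {
    const windowStart = nowMs - windowMs;
    // ZREMRANGEBYSCORE: drop entries older than the window
    const kept = (this.hits.get(customerId) ?? []).filter((t) => t > windowStart);
    kept.push(nowMs); // ZADD: record this request
    this.hits.set(customerId, kept);
    const count = kept.length; // ZCARD: count requests in the window
    return count <= limit ? limit - count : -1;
  }
}
```

Note that, like the pipeline version above, the request is recorded before counting, so over-limit requests still consume window entries.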
3. Billable Metrics Catalog
Rather than hardcoding what's billable, I use a catalog that maps raw events to billable quantities:
| Metric | Event Type | Aggregation | Property |
|---|---|---|---|
| api_calls | api_request | COUNT | — |
| bandwidth | api_request | SUM | bytes |
| storage_peak | storage | MAX | gb_stored |
| compute_time | compute | SUM | cpu_ms |
The usage query engine reads this catalog and builds the appropriate ClickHouse query dynamically. Adding a new metric means adding one config entry — no query changes.
// COUNT → SELECT count() FROM events WHERE ...
// SUM → SELECT sum(JSONExtractFloat(properties, 'bytes')) WHERE ...
// MAX → SELECT max(JSONExtractFloat(properties, 'gb_stored')) WHERE ...
ClickHouse is a good fit here because it's columnar — SUM(bytes) FROM events only reads the bytes column, not the entire row. But it's append-optimized, so you don't want to update or delete individual rows. For billing, that's fine — events are immutable.
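A minimal sketch of the catalog-driven query builder (type names and the parameter placeholders are illustrative, not MeterFlow's actual schema):

```typescript
// Each catalog entry maps a billable metric to an event type, an aggregation,
// and (for SUM/MAX) the JSON property to aggregate over.
type Aggregation = "COUNT" | "SUM" | "MAX";
interface MetricDef { eventType: string; aggregation: Aggregation; property?: string }

const catalog: Record<string, MetricDef> = {
  api_calls:    { eventType: "api_request", aggregation: "COUNT" },
  bandwidth:    { eventType: "api_request", aggregation: "SUM", property: "bytes" },
  storage_peak: { eventType: "storage",     aggregation: "MAX", property: "gb_stored" },
  compute_time: { eventType: "compute",     aggregation: "SUM", property: "cpu_ms" },
};

function buildUsageQuery(metric: string): string {
  const def = catalog[metric];
  if (!def) throw new Error(`unknown metric: ${metric}`);
  const expr =
    def.aggregation === "COUNT"
      ? "count()"
      : `${def.aggregation.toLowerCase()}(JSONExtractFloat(properties, '${def.property}'))`;
  // {customer_id:String} etc. are ClickHouse query parameters, bound at execution.
  return (
    `SELECT ${expr} FROM events WHERE event_type = '${def.eventType}'` +
    ` AND customer_id = {customer_id:String}` +
    ` AND timestamp BETWEEN {start:DateTime} AND {end:DateTime}`
  );
}
```

Adding a metric is one new `catalog` entry; the query builder never changes.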
4. Tiered Pricing Calculation
MeterFlow supports flat and tiered pricing. Tiered is the interesting one — the system walks through tiers progressively:
API Calls Pricing:
Tier 1: 0–1,000 → $0.00/call (free tier)
Tier 2: 1,001–10,000 → $0.001/call
Tier 3: 10,001+ → $0.0005/call
For 15,000 API calls:
Tier 1: 1,000 × $0.00 = $0.00
Tier 2: 9,000 × $0.001 = $9.00
Tier 3: 5,000 × $0.0005 = $2.50
Total: $11.50
The pricing engine consumes quantity through each tier until it's exhausted. All amounts are converted to cents before hitting Stripe — billing systems should never do floating-point math on final amounts.
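The tier walk can be sketched as follows (rates taken from the example above; the `Tier` shape is illustrative). Per-tier rates can be fractional cents, but the final amount is rounded to integer cents before anything reaches Stripe:

```typescript
// upTo is the cumulative quantity cap for the tier; Infinity = unbounded top tier.
interface Tier { upTo: number; unitPrice: number } // unitPrice in dollars

const apiCallTiers: Tier[] = [
  { upTo: 1_000,    unitPrice: 0 },      // free tier
  { upTo: 10_000,   unitPrice: 0.001 },
  { upTo: Infinity, unitPrice: 0.0005 },
];

// Walks the tiers progressively, consuming quantity until exhausted.
// Returns the total in integer cents.
function priceTiered(quantity: number, tiers: Tier[]): number {
  let remaining = quantity;
  let prevCap = 0;
  let totalDollars = 0;
  for (const tier of tiers) {
    if (remaining <= 0) break;
    const span = Math.min(remaining, tier.upTo - prevCap); // units billed at this tier
    totalDollars += span * tier.unitPrice;
    remaining -= span;
    prevCap = tier.upTo;
  }
  return Math.round(totalDollars * 100);
}
```

For 15,000 calls this yields 1,150 cents, matching the $11.50 worked example.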
5. Fraud Detection
This is where MeterFlow goes beyond basic metering. It uses a two-layer approach to catch both volume anomalies and pattern anomalies.
Layer 1: Z-Score Volume Detection
Compare current usage against a 30-day baseline:
Z = (current_value - mean) / stddev
If |Z| >= 3 (three standard deviations), flag it. This catches obvious spikes — someone hammering your API 10x more than normal.
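The volume check is a few lines (a minimal sketch; the real system reads the 30-day baseline from Redis):

```typescript
// Z-score of the current value against a history of daily totals.
function zScore(current: number, history: number[]): number {
  const mean = history.reduce((a, b) => a + b, 0) / history.length;
  const variance = history.reduce((a, b) => a + (b - mean) ** 2, 0) / history.length;
  const stddev = Math.sqrt(variance);
  return stddev === 0 ? 0 : (current - mean) / stddev;
}

// Flag anything three standard deviations from the baseline mean.
const isVolumeAnomaly = (current: number, history: number[]) =>
  Math.abs(zScore(current, history)) >= 3;
```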
But it misses a critical attack vector: same volume, different pattern.
Layer 2: Cosine Similarity Pattern Detection
A stolen API key might generate the same number of calls per day, but at completely different hours. Z-score wouldn't catch this because the volume is normal.
The approach:
Build baselines — process 30 days of history into per-weekday, 24-dimensional hourly vectors (Mondays have different patterns than weekends)
Normalize — divide by the sum so we're comparing shape, not volume
Compare with cosine similarity — 1.0 means identical pattern, below 0.9 triggers a flag
Normal Tuesday: [0.01, 0.01, ..., 0.15, 0.15, ..., 0.02]
(quiet at night, peaks 9am-5pm)
Stolen key usage: [0.15, 0.15, ..., 0.01, 0.01, ..., 0.15]
(peaks at night — attacker in different timezone)
Cosine similarity: ~0.28 → FRAUD DETECTED
The volume is identical. The Z-score is normal. But the pattern is inverted. Cosine similarity catches it immediately.
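The normalize-then-compare step looks like this (a sketch; vectors are 24-dimensional hourly counts, and the 0.9 threshold comes from the description above):

```typescript
// Divide by the sum so we compare shape, not volume.
function normalize(v: number[]): number[] {
  const sum = v.reduce((a, b) => a + b, 0);
  return sum === 0 ? v : v.map((x) => x / sum);
}

// Cosine similarity: 1.0 = identical shape, near 0 = unrelated shapes.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] ** 2;
    magB += b[i] ** 2;
  }
  return magA && magB ? dot / Math.sqrt(magA * magB) : 0;
}

const FRAUD_THRESHOLD = 0.9;
const isPatternAnomaly = (baseline: number[], current: number[]) =>
  cosineSimilarity(normalize(baseline), normalize(current)) < FRAUD_THRESHOLD;
```

Because cosine similarity is scale-invariant, a customer who triples their volume with the same hourly shape passes, while a same-volume, inverted-hours pattern fails.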
Baselines are stored in Redis with a 90-day TTL. Detection runs per customer, per metric, with separate weekday profiles. A dashboard visualizes normal usage patterns against detected anomalies and, when fraud is detected, highlights the anomaly with its cosine similarity score. In the demo scenario, a customer's pattern dropped to 30.2% similarity against their baseline, a clear sign of compromised credentials.
6. Stripe Integration
The billing endpoint builds complete Stripe API payloads following the full invoice lifecycle: create draft invoice, add line items per metric (with tier breakdowns in metadata), finalize, and send.
Each operation uses an idempotency key derived from the invoice ID and billing period:
const idempotencyKey =
`meterflow_${invoiceId}_${customerId}_${periodStart}_${periodEnd}`;
This ensures that retries, duplicate triggers, or manual re-runs don't double-charge customers: Stripe returns the result of the original request for any duplicate carrying the same key, and keys are retained for at least 24 hours.
For the demo, this runs in dry-run mode — payloads are built but not sent to Stripe. Swapping to live is a one-line change from payload builders to actual SDK calls.
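A sketch of what the payload builders produce (field names follow Stripe's invoice item API shape: integer `amount` in cents, `currency`, `metadata`; the `MetricCharge` type and helper names are illustrative, not MeterFlow's actual code):

```typescript
interface MetricCharge { metric: string; quantity: number; amountCents: number }

// Builds one invoice line item per billable metric; zero-amount lines are skipped.
function buildInvoiceItems(customerId: string, charges: MetricCharge[]) {
  return charges
    .filter((c) => c.amountCents > 0)
    .map((c) => ({
      customer: customerId,
      amount: c.amountCents, // integer cents -- never floats
      currency: "usd",
      description: `${c.metric}: ${c.quantity} units`,
      metadata: { metric: c.metric, quantity: String(c.quantity) },
    }));
}

// Deterministic key: the same invoice and period always produce the same key,
// so a retried billing run hits Stripe's idempotency cache instead of re-charging.
const idempotencyKey = (invoiceId: string, customerId: string, start: string, end: string) =>
  `meterflow_${invoiceId}_${customerId}_${start}_${end}`;
```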
7. Production Architecture on AWS
Every local component in MeterFlow was designed with a clear AWS production mapping. Here's how the demo stack translates:
| Demo (Local) | Production (AWS) | Why |
|---|---|---|
| Fastify server | API Gateway + Lambda | Auto-scaling, managed TLS, WAF |
| Redis (single) | ElastiCache (Redis Cluster) | HA, automatic failover |
| ClickHouse (Docker) | ClickHouse Cloud | Managed, scalable, VPC peering |
| MinIO | S3 | Lifecycle policies, cross-region replication |
| In-process sync | Kinesis | Async buffer, back-pressure, replay |
| cron / manual | EventBridge | Managed scheduling, reliable triggers |
| Console logs | CloudWatch + SNS | Alerting, dashboards, PagerDuty |
Async Ingestion with Kinesis
The biggest architectural shift for production is decoupling ingestion from processing. In the demo, the pipeline is synchronous: validate → dedup → store → backup, all in one request. In production, you'd buffer through Kinesis:
Client → API Gateway → Lambda (validate + dedup) → Kinesis → Lambda (store to ClickHouse)
↘ S3 (backup)
Kinesis gives you ordered delivery within a shard (partition by customer_id), replayability if a downstream consumer fails, and natural back-pressure through shard limits. Clients receive 202 Accepted immediately instead of waiting for the full pipeline.
Failed batches route to an SQS dead letter queue for investigation and replay. The dedup layer (Redis SET NX) works the same way regardless of whether an event arrives via HTTP or Kinesis — duplicates are caught either way.
Scheduled Jobs with EventBridge
Billing, anomaly detection, and fraud baseline rebuilds all become scheduled jobs:
EventBridge (1st of month) → Lambda → Aggregate usage → Stripe API (invoicing)
EventBridge (hourly/daily) → Lambda → Z-score + cosine similarity → SNS (alerts)
EventBridge (weekly) → Lambda → Rebuild fraud baselines → Redis
The detection logic itself (checkAnomaly(), checkFraud()) would be reused as-is from the demo — it already takes parameters for baseline window and threshold. The change is just in how it's triggered and where alerts go (SNS → Slack/PagerDuty instead of console logs).
State and Alerting
DynamoDB handles billing state (invoice status, anomaly records with TTL for auto-cleanup). SNS topics route to email, Slack, or PagerDuty based on alert severity. CloudWatch dashboards provide real-time visibility into ingestion rates, error rates, and billing job status.
What I Learned
Deduplication is deceptively simple. SET NX solves it cleanly, but the hard part is deciding what the dedup window should be and how to handle events that arrive after the window closes.
Billing math needs to happen in cents. Floating-point rounding will bite you. Convert to integers as early as possible.
Pattern-based fraud detection is more useful than volume-based. Sophisticated attackers will stay under volume thresholds. They can't easily replicate a customer's hourly usage pattern.
Design for the production target early. Using MinIO instead of a local filesystem, Redis instead of in-memory maps, and S3-compatible APIs from the start meant every component has a clear AWS upgrade path. The business logic doesn't change — only the infrastructure layer.
Try It
The entire system runs locally with Docker Compose:
git clone https://github.com/ajithmanmu/meterflow
cd meterflow
docker compose up -d
pnpm install && pnpm dev
There's a validation script that runs 60 end-to-end checks across all components, and demo scripts that simulate 30 days of normal usage followed by fraud injection so you can see the detection in action.
GitHub: github.com/ajithmanmu/meterflow
Inspired by Lago's open-source billing platform.
I'm a backend engineer building subscription and payment infrastructure. If you're working on billing systems or usage-based pricing, I'd love to hear about your approach.

