How to Ingest 1 Billion Events Per Day Without Kafka: A Serverless Masterclass

Every so often, you stumble upon a system design that is so elegant and unconventional it forces you to rethink your entire approach to a problem. Recently, I saw a tweet from @jsneedles detailing a data pipeline he helped build that is now ingesting ~1 billion JSON events per day from a popular video game.

The most shocking part? There's no Kafka, no Kinesis, no massive, managed message queue in sight. Instead, it's a beautifully orchestrated dance of serverless primitives from Cloudflare, funneling data into the powerhouse OLAP database, ClickHouse.

This isn't just another architecture diagram; it's a lesson in leveraging the right tools for the job, embracing the edge, and achieving massive scale with startling simplicity and cost-effectiveness. Let's break it down.

Credit where credit is due: This entire analysis is based on the architecture shared by jsneedles on X. Go give him a follow!

The High-Level Flow

At its core, the pipeline is a multi-stage, asynchronous process designed to get data from millions of gamers worldwide into an analytics database with minimal latency and maximum efficiency.

Gamers → Receiver Worker → Tail Worker → CF Pipeline → R2 Storage → CF Queue → Insert Worker → ClickHouse

Now, let's dissect each component and understand the genius behind the choices made.


Step 1: Ingestion at the Edge (Receiver Worker)

The journey begins with a Cloudflare Worker. This isn't a typical server in a single region; it's a lightweight JavaScript V8 isolate deployed across Cloudflare's entire global network of 300+ data centers.

  • Why it's brilliant: When a game client anywhere in the world fires an event, it hits the geographically closest Worker. This minimizes network latency, providing a snappy response to the client. This "Receiver Worker" acts as the first line of defense—a globally distributed HTTP endpoint for validating schema, sanitizing data, and rejecting malformed events before they ever enter the pipeline.

But this is where the first piece of "out-of-the-box" thinking appears. The Receiver Worker's primary job isn't to forward the event via another HTTP request. Instead, it does something much simpler:

Note: Example/Assumed Receiver Worker Implementation

// Inside the Receiver Worker
export default {
    async fetch(request: Request) {
        try {
            const event = await request.json();
            // Validate and sanitize the event (isValid is a placeholder
            // for your own schema/sanity checks)
            if (isValid(event)) {
                // The magic trick: just log the data!
                console.log('PIPE_ME', JSON.stringify(event));
                return new Response('OK');
            }
            return new Response('Invalid event', { status: 400 });
        } catch (e) {
            return new Response('Error', { status: 500 });
        }
    },
};

Step 2: The Log-to-Stream "Hack" (Tail Worker)

This is the most innovative part of the entire architecture. They've effectively created a serverless, globally-distributed message queue by using Cloudflare Tail Workers.

A Tail Worker is a special type of Worker that can be configured to "tail" the console.log() output of another Worker. It receives the logs in near real-time, not as a single stream, but in batches.

  • Why it's brilliant: This completely bypasses the need for a traditional message broker. No Kafka clusters to manage, no SQS queues to configure. The console.log statement becomes the event producer. The 'PIPE_ME' prefix is a clever namespace, allowing the Tail Worker to filter for only the logs it cares about. Cloudflare's internal infrastructure handles the routing and initial batching from potentially thousands of concurrent Receiver Worker invocations into manageable chunks for the Tail Worker.

Note: Example/Assumed Tail Worker Implementation

// Inside the Tail Worker
export default {
    async tail(events: TraceItem[], env: Env) {
        // Each TraceItem holds the console.log() entries from one Receiver
        // Worker invocation; a log's `message` is the array of arguments
        // that were passed to console.log('PIPE_ME', json)
        const batch = events
            .flatMap((item) => item.logs)
            .filter((log) => log.message[0] === 'PIPE_ME')
            .map((log) => JSON.parse(log.message[1])); // Extract the JSON

        if (batch.length > 0) {
            // Forward the batch to the next stage (the Pipeline binding)
            await env.PIPELINE.send(batch);
        }
    },
};

Step 3: Intelligent Batching (CF Pipeline)

The diagram shows that the Tail Worker feeds into a CF Pipeline that "Batches every 90 seconds." This is a crucial step for managing throughput. Writing one billion individual files to object storage would be an operational and financial nightmare due to the high number of write operations.

  • Why it's brilliant: The 90-second window is a "Goldilocks" duration. It's a fantastic trade-off between latency and efficiency. For an analytics use case, a 2-3 minute delay from event creation to query availability is perfectly acceptable. This window allows the pipeline to accumulate tens of thousands of events into a single, large object, which is vastly more efficient for both object storage and the downstream database.
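To make that trade-off concrete, here's a rough back-of-envelope sketch in TypeScript; the per-batch figures are illustrative assumptions, not numbers from the original post:

// Rough illustration of why 90-second batching matters at ~1B events/day
const eventsPerDay = 1_000_000_000;
const eventsPerSecond = eventsPerDay / 86_400;               // ~11,600 events/s
const batchWindowSeconds = 90;

// Writing every event as its own object would mean ~1B PUTs per day.
// Batching collapses that into under a thousand windows per day
// (times however many parallel shards the Pipeline writes).
const eventsPerBatch = eventsPerSecond * batchWindowSeconds; // ~1,000,000 events
const batchesPerDay = 86_400 / batchWindowSeconds;           // 960 windows/day

console.log({ eventsPerSecond, eventsPerBatch, batchesPerDay });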

Step 4: The Durable Buffer Zone (Cloudflare R2)

The batched data lands in Cloudflare R2, a highly durable, S3-compatible object storage service.

  • Why it's brilliant: R2's killer feature is the $0 egress fees. In a traditional cloud architecture (e.g., AWS), moving data from S3 to a compute service like Lambda or EC2 for processing incurs data transfer costs. At a scale of 79TB, these costs would be astronomical. With R2, the Insert Worker can pull the data for free. R2 acts as a vital decoupling layer. If the downstream database (ClickHouse) is down for maintenance, events continue to pile up safely and cheaply in R2, ready for processing once the database is back online.

Step 5: The Trigger (CF Queue & Insert Worker)

Once the Pipeline service writes a new batch file to R2, how does the final stage know it's time to act? This is where Cloudflare Queues come in.

R2 can be configured to send an event notification to a CF Queue whenever a new object is created. This message contains metadata about the new file (e.g., its key/name). A final "Insert Worker" is configured as a consumer for this queue.

  • Why it's brilliant: This is a classic, robust event-driven pattern. The Insert Worker is invoked only when there is work to be done. It's not polling R2 constantly. Its job is simple and focused (a sketch follows this list):
    1. Receive a message from the queue.
    2. Use the message to fetch the large batch file from R2.
    3. Parse the massive JSON array.
    4. Perform a single, massive bulk insert into ClickHouse.
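
Here's a minimal sketch of what such a consumer could look like, assuming an R2 bucket binding named EVENTS_BUCKET, a ClickHouse HTTP endpoint in env.CLICKHOUSE_URL, and batch files written as newline-delimited JSON; none of these specifics are confirmed by the original post.

Note: Example/Assumed Insert Worker Implementation

// Inside the Insert Worker (queue consumer)
export default {
    async queue(batch: MessageBatch<any>, env: Env) {
        for (const msg of batch.messages) {
            // R2 event notifications carry the key of the newly created object
            const key = msg.body.object.key;
            const object = await env.EVENTS_BUCKET.get(key);
            if (!object) continue;

            // Assuming the Pipeline wrote newline-delimited JSON (JSONEachRow-friendly)
            const rows = await object.text();

            // One big bulk insert per batch file via ClickHouse's HTTP interface
            const res = await fetch(
                `${env.CLICKHOUSE_URL}/?query=${encodeURIComponent(
                    'INSERT INTO game_events FORMAT JSONEachRow'
                )}`,
                { method: 'POST', body: rows }
            );

            if (res.ok) msg.ack();
            else msg.retry(); // let the queue redeliver on failure
        }
    },
};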

Step 6: The Analytics Powerhouse (ClickHouse)

The final destination is ClickHouse. For this kind of high-volume analytics workload, it's an exceptional choice.

  • Why it's brilliant: ClickHouse is a columnar OLAP (Online Analytical Processing) database.
    • Columnar Storage: Unlike traditional row-based databases (like Postgres), it stores data in columns. This makes analytical queries that aggregate over a few columns (SELECT user_id, SUM(score) ...) blazingly fast, as the database only needs to read the data for those specific columns.
    • Incredible Compression: This columnar nature allows for phenomenal compression. The post mentions compressing 79TB of raw data down to just 12TB. That's a ~6.5x reduction, saving immense storage costs.
    • Built for Bulk Inserts: It thrives on ingesting large batches of data, which fits this architecture perfectly.

It's no surprise that data-intensive companies like Uber, TikTok, Ahrefs, and even Cloudflare itself rely on ClickHouse to power their real-time analytics.
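
As a taste of what querying looks like from TypeScript, here's a hedged sketch using the official @clickhouse/client-web package; the connection details, table, and column names (game_events, user_id, score) are assumptions for illustration:

Note: Example/Assumed ClickHouse Query

import { createClient } from '@clickhouse/client-web';

// Assumed connection details, purely for illustration
const client = createClient({
    url: 'https://your-clickhouse-host:8443',
    username: 'default',
    password: 'secret',
});

// Only the user_id and score columns are read from disk, which is why
// columnar storage makes aggregations like this so fast.
const result = await client.query({
    query: `
        SELECT user_id, SUM(score) AS total_score
        FROM game_events
        GROUP BY user_id
        ORDER BY total_score DESC
        LIMIT 10
    `,
    format: 'JSONEachRow',
});

console.log(await result.json());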


Built-In Resilience and Optimizations: Handling Failures and Edge Cases Gracefully

What elevates this architecture from clever to production-hardened is its thoughtful handling of real-world messiness. Drawing from additional details shared by the original author, here's how they've baked in resilience without adding complexity:

  • Failure Tolerance as a Feature: The ingest layer (Receiver Worker) is intentionally simple and "fallible"—it's easy to debug and doesn't cascade errors. They don't chase "100% durability" because for analytics, losing a tiny fraction of events isn't catastrophic. This pragmatic choice keeps costs down and speeds up development.

  • R2 as the Ultimate Safety Net: Once events are batched and written to R2, the system becomes incredibly forgiving. Files can be re-ingested or re-queued at any time—perfect for handling upstream failures or manual replays. R2's 11 nines of durability ensures data sticks around without you stressing.

  • Mirrored Error Pipeline: A separate pipeline (mirroring the main one) captures parsing errors, storing raw payloads plus client metadata. It's "dormant" most of the time, meaning low overhead, but it's there when needed—turning potential data loss into actionable insights.

  • Retries, Backoff, and DLQ for Inserts: ClickHouse inserts use exponential backoff retries to avoid hammering a flaky service. If retries fail, events land in a Dead Letter Queue (DLQ) for later replay through the original pipeline. Failures are even logged to a dedicated ClickHouse table, making error analysis as simple as a SQL query (a small backoff sketch follows the large-payload example below).

  • Speed and Cost Optimizations with Tail Workers: By routing through Tail Workers, responses stay snappy—the Receiver doesn't wait on potentially slow Pipeline calls. It saves money too: just the cost of reading the log and emitting (minimal CPU time at ~$0.50/million GB-seconds).

  • The Hidden Large Payload Handler: Not in the diagram, but a stealth Worker handles oversized payloads. Instead of direct bindings (which are synchronous), it's invoked via HTTP to leverage waitUntil for async processing. This ensures the main path doesn't block, even for chunky events. Here's a TypeScript sketch:

Note: Example/Assumed Large Payload Handling Implementation

// In Receiver Worker (for large payloads)
const MAX_SIZE = 1_000_000;  // e.g., 1MB threshold (assumed)

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext) {
    const payload = await request.json();
    const serialized = JSON.stringify(payload);
    if (serialized.length > MAX_SIZE) {
      // Fire async HTTP to the Large Payload Worker; ctx.waitUntil keeps the
      // work running after we've already responded to the client
      ctx.waitUntil(
        fetch('https://large-payload-worker.example.workers.dev', {
          method: 'POST',
          body: serialized,
          headers: { 'Content-Type': 'application/json' },
        })
      );
      return new Response('OK');  // Respond fast!
    }
    // Normal path: console.log for the Tail Worker
    console.log('PIPE_ME', serialized);
    return new Response('OK');
  },
};

// In Large Payload Worker
export default {
  async fetch(request: Request, env: Env) {
    const payload = await request.json();
    // Process and send to the Pipeline directly
    await env.PIPELINE.send([payload]);  // Or batch if needed
    return new Response('Processed');
  },
};
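
And to make the retry strategy above concrete, here's a minimal sketch of exponential backoff around the ClickHouse insert, with a fallback to a DLQ binding; the helper name, bindings, and limits are all assumptions:

Note: Example/Assumed Retry Logic in the Insert Worker

// Hypothetical retry wrapper around the bulk insert (sketch only)
async function insertWithBackoff(rows: string, env: Env, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(
      `${env.CLICKHOUSE_URL}/?query=${encodeURIComponent(
        'INSERT INTO game_events FORMAT JSONEachRow'
      )}`,
      { method: 'POST', body: rows }
    );
    if (res.ok) return;

    // Exponential backoff: 1s, 2s, 4s, 8s... so a flaky ClickHouse
    // isn't hammered by immediate retries
    await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
  }

  // Out of retries: park the payload in a Dead Letter Queue for later replay
  // (the failure could also be written to a dedicated ClickHouse table)
  await env.DLQ.send({ rows, failedAt: new Date().toISOString() });
}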

These tweaks show how the architecture isn't just scalable—it's resilient by design, turning potential weaknesses into strengths.

Why This Architecture Wins

  1. Extreme Cost-Effectiveness: Let's look at the costs (based on 2025 pricing):

    • Workers: Pay-per-request and CPU-ms. At ~$0.30 per million requests, 1B events/day (~30B requests/month) adds up to a meaningful bill, but it's still far cheaper than provisioning a global fleet of servers (see the back-of-envelope estimate after this list).
    • R2 Storage: ~$0.015/GB/month. But the real win is $0 egress fees.
    • Queues: ~$0.40/million operations.
    • This serverless model eliminates the cost of idle servers and the operational overhead of scaling.
  2. Massive, Effortless Scalability: The entire pipeline is built on services designed for Cloudflare's planet-scale traffic. There are no servers to provision, no auto-scaling groups to configure. If traffic spikes from 1B to 5B events a day, the system just handles it.

  3. Incredible Simplicity & Low Operational Overhead: Compare this to managing a self-hosted Kafka or Pulsar cluster. That requires dedicated teams for patching, scaling, monitoring, and disaster recovery. Here, the "message queue" is a console.log statement, and the "stream processor" is a managed service. This is a huge win for small, agile teams.
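
For a rough sense of scale, here's a back-of-envelope estimate using the rates above; Workers CPU time, Pipelines, R2 operations, and the ClickHouse cluster itself are deliberately left out, so treat it as a floor, not a forecast:

// Back-of-envelope monthly cost sketch (assumptions, not a quote)
const eventsPerDay = 1_000_000_000;
const daysPerMonth = 30;

const requestsPerMonth = eventsPerDay * daysPerMonth;            // ~30 billion
const workerRequestCost = (requestsPerMonth / 1_000_000) * 0.30; // ~$9,000/mo

// Assuming roughly one batch file per 90-second window (more in practice
// if the Pipeline shards its output), queue traffic is comparatively tiny.
const batchesPerMonth = (86_400 / 90) * daysPerMonth;            // ~28,800
const queueCost = (batchesPerMonth / 1_000_000) * 0.40;          // well under $1/mo

console.log({ workerRequestCost, queueCost });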

The Vendor Lock-In Question: A Pragmatic Choice

Is this architecture heavily tied to Cloudflare? Absolutely. But "vendor lock-in" is not an absolute evil; it's a trade-off.

For this specific use case—a high-volume, delay-tolerant analytics pipeline—the benefits gained are monumental:

  • Unified security and networking model.
  • Seamless integration between primitives (Workers, R2, Queues).
  • Massive reduction in operational complexity.

The cost and engineering effort required to build a vendor-agnostic equivalent would be orders of magnitude higher. Sometimes, leaning into a platform's strengths is the most pragmatic and powerful engineering decision you can make.

Final Thoughts

This architecture is a beautiful example of "serverless-native" thinking. It's not about lifting and shifting a traditional design to serverless components. It's about fundamentally rethinking the flow of data using the unique primitives offered by the edge.

It's a reminder that sometimes the most powerful solutions are not the most complex ones, but the ones that creatively leverage the platform they're built on.

What's the most innovative data pipeline you've seen recently? Share it in the comments below!
