Ujjawal Tyagi

The Real Engineering Cost of Scaling an Indian Cricket App From 1K to 1M Users on Kafka

Most engineering blogs that talk about Kafka either quote the Netflix case study or walk you through the "hello world" producer-consumer example. Both are useless if you're actually trying to decide whether to adopt Kafka for a product that might hit a million users — or stay at ten thousand forever.

This is a story from the trenches of building Cricket Winner, a real-time cricket scores, news, and opinion-trading platform for the Indian market. I'll walk through the decisions we made, the decisions we got wrong, and the numbers we wish someone had shared with us when we started.

The Product, in One Paragraph

Cricket Winner does three things in real time: it streams live ball-by-ball updates during matches, pushes breaking cricket news as it happens, and runs an opinion-trading engine where users take positions on outcomes ("Will Kohli score a fifty this innings?"). During an India vs. Pakistan match, a single over can produce a burst of 40,000+ user actions — upvotes, trades, opinion updates, comment floods. The rest of the time, the system sits idle.

That bursty profile is why we chose Kafka. It's also why we almost abandoned Kafka three months in.

The Decision That Nearly Killed Us

When we started, our backend was a monolithic Node.js service with a PostgreSQL database and WebSockets for live updates. It worked fine for 5,000 concurrent users during beta testing with a state-league match. Then the first IPL match hit production, we crossed 60,000 concurrent users in the first 20 minutes, and WebSocket connection storms took the entire app down for 90 minutes.

The post-mortem was simple: our "real-time" architecture was secretly synchronous. Every ball update had to fan out through the same Node process that was also handling auth, trading settlements, and news dispatch. One slow database query anywhere in the chain backed up every single live connection.

We had two options:

  1. Add more Node instances behind a load balancer (classic horizontal scale) — but our database was a single Postgres instance, so we'd just be moving the bottleneck.
  2. Decouple the real-time pipeline from the transactional pipeline using an event-driven architecture.

We picked option 2. That meant Kafka.

Why Kafka and Not RabbitMQ

This is the question we get most often from other Indian dev teams. We'd already used RabbitMQ on another project (Veda Milk, a D2C dairy subscription platform) and been happy with it. So why not reuse it?

Three reasons:

Throughput ceiling. RabbitMQ peaks at ~50K messages/sec per node in practical Indian cloud setups. Kafka comfortably does 500K+ on the same hardware. During a cricket match final, we've hit 180K messages/sec sustained. RabbitMQ would have fallen over.

Replay. During an outage (and there will be outages), Kafka lets you replay the event log. That matters a lot when your opinion-trading engine crashes mid-match and you need to reconstruct user positions from the last 30 minutes of events. RabbitMQ is a "fire and forget" queue — once a message is acked, it's gone.

Fan-out topology. In our system, a single "ball bowled" event needs to reach four consumers simultaneously: the live score broadcaster, the trading settlement engine, the notification dispatcher, and the analytics pipeline. Kafka's consumer group model handles this natively. With RabbitMQ you end up creating exchange bindings that get complex fast.
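Both properties — replay and fan-out — come from the same design: an append-only log where each consumer group tracks its own read position. A minimal in-memory sketch (class and event names are illustrative, not our actual code):

```javascript
// Toy model of a Kafka partition: producers append, each consumer group
// reads at its own offset, and "replay" is just rewinding that offset.
class MiniLog {
  constructor() {
    this.events = [];         // the append-only log
    this.offsets = new Map(); // committed read position per consumer group
  }

  append(event) {
    this.events.push(event); // producers only ever append
  }

  // Each group reads from its own offset — fan-out without copying messages.
  poll(group) {
    const from = this.offsets.get(group) ?? 0;
    const batch = this.events.slice(from);
    this.offsets.set(group, this.events.length); // commit new position
    return batch;
  }

  // Replay: rewind a group's offset and reprocess history.
  rewind(group, offset) {
    this.offsets.set(group, offset);
  }
}

const log = new MiniLog();
log.append({ type: 'ball', runs: 4 });
log.append({ type: 'ball', runs: 6 });

// Two independent consumer groups both see every event.
const scores = log.poll('live-scores');
const trades = log.poll('trading-engine');

// After a trading-engine crash, rewind and rebuild positions from history.
log.rewind('trading-engine', 0);
const replayed = log.poll('trading-engine');
```

In real Kafka the same rewind is an offset reset on the consumer group; a RabbitMQ queue has no equivalent, because delivery removes the message.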

For Veda Milk we kept RabbitMQ — it's a D2C subscription platform where order volume is high but bursts are small and replay doesn't matter. Right tool for the right job.

The Architecture We Actually Ship

Here's our production topology, simplified:

```
[Mobile/Web client]
     | (WebSocket)
[API Gateway — Node.js + Socket.io]
     | (Kafka producer)
[Kafka cluster — 3 brokers, AWS MSK]
     | (Kafka consumers)
  |-- Live score service (publishes to socket rooms)
  |-- Trading engine (writes positions to Postgres)
  |-- Notifications (pushes FCM/APNs)
  |-- Analytics (writes to ClickHouse)
  |-- News dispatcher (fan-out to feed service)
```

Each consumer is its own microservice, deployed independently on ECS. When the trading engine deploys a new version, the live scores keep streaming. When the news dispatcher has a bug, it can be restarted without affecting anything else. Before this decoupling, a single deployment would cause a 15-second read-availability blip for everyone.

The Cost That Nobody Tells You About

Every Kafka blog post glosses over this: operating Kafka is not free, and it's not easy.

Our AWS MSK bill for Cricket Winner alone runs ₹85,000–₹1,10,000/month (approximately $1,000–$1,300). That's for a 3-broker cluster (kafka.m5.large) with replication factor 3. On top of that, we run self-managed ZooKeeper for historical reasons (newer MSK versions use KRaft; we're migrating).

The human cost is bigger. You need at least one engineer on the team who understands:

  • Consumer-group rebalancing dynamics (what happens when a consumer crashes mid-partition-assignment).
  • Partition key selection (we partition by match_id, not user_id — took us a month to figure that out).
  • Exactly-once vs at-least-once semantics and how to write idempotent consumers.
  • How to debug a lagging consumer at 2am when India is beating Australia in the final.
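The partition-key point deserves a concrete illustration. Kafka routes a record to partition hash(key) % N and only guarantees ordering within a partition. A toy sketch (the hash function and partition count below are illustrative stand-ins, not Kafka's actual murmur2 partitioner):

```javascript
const NUM_PARTITIONS = 12; // hypothetical topic configuration

// Simple stand-in for a Kafka-style keyed partitioner.
function partitionFor(key) {
  let hash = 0;
  for (const ch of String(key)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash % NUM_PARTITIONS;
}

const balls = [
  { matchId: 'IND-PAK', userId: 'u42', over: 1, ball: 1 },
  { matchId: 'IND-PAK', userId: 'u7',  over: 1, ball: 2 },
  { matchId: 'IND-PAK', userId: 'u99', over: 1, ball: 3 },
];

// Keyed by matchId: every ball of the match lands on one partition,
// so consumers see deliveries in bowling order.
const byMatch = new Set(balls.map(b => partitionFor(b.matchId)));

// Keyed by userId: the same match is scattered across partitions,
// and cross-partition ordering is gone.
const byUser = new Set(balls.map(b => partitionFor(b.userId)));
```

This is the month-long lesson in one picture: keying by match_id also means one hot match concentrates load on one partition, which is a trade we accepted to keep per-match ordering.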

If you don't have that engineer, or can't hire one, Kafka will punish you. The "hello world" Kafka is lovely. Production Kafka has sharp edges.

What We'd Tell Our Past Selves

Five lessons, compressed:

1. Don't adopt Kafka until you actually need it. Before we had 50K+ concurrent users, RabbitMQ + good database indexes would have been plenty. We jumped to Kafka slightly too early. The complexity tax was real.

2. Partition keys are everything. Your throughput, your ordering guarantees, and your consumer parallelism are all downstream of this one decision. Spend a week thinking about it before you ship.

3. Write idempotent consumers from day one. You will reprocess events. You will redeploy during a match. You will have duplicates. If your consumer isn't idempotent, every one of those events will corrupt state.

4. Monitor consumer lag, not throughput. Throughput tells you what happened. Consumer lag tells you what's about to break. We added lag alerts on day 40 of production; we should have added them on day 1.

5. The "Kafka vs. RabbitMQ" debate is a false dichotomy in real systems. We run both. Kafka for high-throughput event streams that need fan-out and replay. RabbitMQ for work queues and background jobs. Use the right one for the right job.
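Lesson 3 in code form — a sketch of an idempotent consumer, where the in-memory Set stands in for a processed-events table (or unique constraint) in Postgres; event shape and names are illustrative:

```javascript
// Every event carries a unique id; applying the same event twice is a no-op.
const processed = new Set(); // stand-in for a processed_events table
let balance = 0;             // a user's trading balance, for illustration

function applyTrade(event) {
  if (processed.has(event.id)) return false; // duplicate delivery — skip
  processed.add(event.id);
  balance += event.delta;
  return true;
}

// A redeploy mid-match redelivers an already-processed event:
applyTrade({ id: 'evt-1', delta: +50 });
applyTrade({ id: 'evt-2', delta: -20 });
applyTrade({ id: 'evt-1', delta: +50 }); // reprocessed — ignored
```

In a real consumer the dedupe check and the state write must commit in the same database transaction; check-then-write across two systems reintroduces the duplicate window you were trying to close.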

What's Actually Possible in the Indian Market

If you're a founder reading this and wondering whether your app needs this kind of architecture — here's the honest answer:

You need Kafka if: you're building real-time (gaming, trading, cricket, live sports, live commerce), your traffic has 10x+ spikes, and you need to fan out the same event to 3+ downstream systems. Cricket Winner, FYZ (short-video), and Bolcall (dating app with live calls) all fit this profile.

You don't need Kafka if: you're building a marketplace, a standard e-commerce site, a services platform, or an internal B2B tool. RabbitMQ is enough. We built Veda Milk, Nursery Wallah, Cremaster, Housecare, Abomed, My Shaadi Store, and Prepe — all subscription or marketplace products — on RabbitMQ and none of them have needed Kafka.

The cost of over-engineering is real. The cost of under-engineering only hits you once, but it hits you hard. Get it right the first time by being honest about what you're actually building.

Taking the Leap

Building a real-time platform in India is different from building one in the US. Our users have flakier connectivity, lower tolerance for lag, and usage patterns that cluster around a few moments (the ball, the goal, the finale) rather than distributing evenly through the day. That shapes your architecture in ways Western tech blogs don't capture.

If you're evaluating whether to go monolith vs. microservices, or RabbitMQ vs. Kafka, for your next product — the answer is almost always "start simpler than you think, and decouple only when the pain is real." That's the path we took at Xenotix Labs, and it's the path we'd take again. You can browse more of our software development case studies — including the full Cricket Winner teardown — or read about the microservices and Kafka stack we use across our builds.

The boring truth about production systems is that the best architecture is always the simplest one that works. Kafka is a fantastic tool. It's also a sharp one. Use accordingly.


*Ujjawal Tyagi is the founder of Xenotix Labs, a product engineering studio that's shipped 30+ production apps across Flutter, Next.js, and Node.js for Indian and international startups. Case studies include Cricket Winner (real-time trading on Kafka), Veda Milk (D2C dairy on RabbitMQ), and Growara (AI WhatsApp automation).*
