From Blocking to Blazing: How We Solved the 1000 Concurrent Users Problem

A beginner-friendly deep dive into sync vs async, WebSockets, and message queues through real-world problem solving

The Problem That Started It All

Picture this: You've built a parcel tracking app. Everything works beautifully with 10 users. Then 100. But what happens when 1000 delivery riders try to update their location simultaneously?

Your database crashes. Your app becomes unusable. Your users are furious.

This isn't hypothetical - this is exactly what we discovered during a recent architecture review. Let me walk you through how we identified the problem, understood the underlying concepts, and built a solution that scales to thousands of concurrent users.

The "Aha!" Moment: Finding the Bottleneck

When we first suspected performance issues, we didn't guess - we looked at the actual code:

from django.db import transaction

@transaction.atomic  # 🚨 RED FLAG!
def update_rider_location(self, validated_data):
    # Lives on a DRF serializer; the HTTP response is held open until this write commits
    rider = self.context['request'].user.rider
    rider.current_coordinates = validated_data['coordinates']
    rider.save(update_fields=["current_coordinates", "current_location"])
    return rider

That innocent-looking @transaction.atomic decorator? It's a performance killer at scale.

Here's what happens with 1000 concurrent riders:

  • Each location update blocks the HTTP response until the database write completes
  • The database gets overwhelmed with simultaneous write operations
  • Response times climb from 50ms to 2+ seconds
  • Some requests timeout completely
  • Users experience the app as "broken"
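
One way to see this for yourself is a quick load test. Below is a minimal sketch using asyncio and httpx (the endpoint URL and payload are illustrative assumptions, and authentication is omitted); it fires many concurrent updates at the blocking endpoint and reports latency percentiles:

import asyncio
import time

import httpx

API = "http://localhost:8000/api/riders/location/"  # hypothetical endpoint

async def send_update(client, rider_id):
    start = time.perf_counter()
    await client.post(API, json={"rider_id": rider_id, "lat": 6.5244, "lng": 3.3792})
    return time.perf_counter() - start

async def main(concurrency=1000):
    limits = httpx.Limits(max_connections=concurrency)
    async with httpx.AsyncClient(limits=limits, timeout=30) as client:
        latencies = sorted(await asyncio.gather(
            *(send_update(client, i) for i in range(concurrency))
        ))
    p50 = latencies[len(latencies) // 2]
    p95 = latencies[int(len(latencies) * 0.95)]
    print(f"p50={p50 * 1000:.0f}ms  p95={p95 * 1000:.0f}ms")

asyncio.run(main())

With the blocking implementation, it's the p95 number that climbs first as you raise the concurrency.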

Understanding Sync vs Async: The Foundation

Before jumping to solutions, we need to understand when to use synchronous vs asynchronous operations. Forget the oversimplified "revenue operations should be sync" rule - here's the real decision framework:

The UFSPD Framework

Ask these five questions:

U - User Experience: Does the user need immediate feedback?
F - Failure Impact: What happens if this operation fails?
S - State Consistency: Does the system need consistent state immediately?
P - Performance: What are your actual SLA requirements?
D - Dependencies: Does something else immediately depend on this completing?

Let's apply this to our location updates:

  • User Experience: Rider doesn't need to wait for location to be saved
  • Failure Impact: Failed location update can be retried without user knowing
  • State Consistency: Eventual consistency is fine for tracking
  • Performance: Should be sub-100ms for good UX
  • Dependencies: Nothing immediately depends on each individual location update

Verdict: This should be ASYNC

But what about delivery confirmation? That's different:

  • User Experience: Customer needs to know delivery is confirmed for payment
  • Failure Impact: Failed confirmation blocks payment flow
  • State Consistency: Payment state must be immediately consistent
  • Performance: Users expect confirmation within seconds
  • Dependencies: Payment calculation depends on this

Verdict: This should be SYNC
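
Here's a rough sketch of how those two verdicts translate into Django views (the view names, model, and helpers are illustrative assumptions, not our production code): the confirmation endpoint blocks until the write commits because payment depends on it, while the location endpoint only acknowledges and hands the work off.

import json

from django.db import transaction
from django.http import JsonResponse

# SYNC: the response waits for the write because payment needs a consistent state.
@transaction.atomic
def confirm_delivery(request, delivery_id):
    delivery = Delivery.objects.select_for_update().get(id=delivery_id)  # assumed model
    delivery.mark_confirmed()                 # must commit before we answer
    trigger_payment(delivery)                 # hypothetical payment kickoff
    return JsonResponse({"status": "confirmed"})

# ASYNC: acknowledge immediately, persist in the background.
def update_location(request):
    process_location_update.delay(json.loads(request.body))  # hand off to the queue
    return JsonResponse({"status": "received"}, status=202)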

The Restaurant Kitchen Analogy: Understanding Message Queues

Think of your database like a restaurant kitchen. Currently, every order (location update) goes straight to the chef (database):

Without Message Queue (Current State):

  • 100 customers crowd around one chef
  • Chef gets overwhelmed
  • Orders get backed up
  • Some customers leave angry
  • Kitchen becomes chaotic

With Message Queue:

  • Customers place orders with the cashier (queue)
  • Orders go to a ticket system
  • Kitchen processes tickets at optimal pace
  • Customers get instant "order received" confirmation
  • Kitchen works smoothly even during rush hour
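
You can feel the difference with a toy version of this analogy using nothing but Python's standard library: every "customer" gets an instant acknowledgement, while a small fixed pool of "cooks" works through the backlog at its own pace.

import queue
import threading
import time

orders = queue.Queue()

def cook():
    while True:
        order = orders.get()
        time.sleep(0.05)  # the slow part (the database write, in our case)
        orders.task_done()

# A small, steady kitchen: 5 cooks no matter how many customers show up.
for _ in range(5):
    threading.Thread(target=cook, daemon=True).start()

start = time.time()
for order_id in range(1000):  # 1000 customers place orders
    orders.put(order_id)      # each enqueue returns immediately
print(f"All 1000 orders accepted in {time.time() - start:.3f}s")

orders.join()                 # the kitchen finishes at its own pace
print(f"All orders cooked after {time.time() - start:.1f}s")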

WebSockets: Keeping the Line Open

But there's another optimization we can make. Currently, each location update requires a new HTTP connection:

HTTP Approach (Current):

Rider Update 1: Connect → Send → Wait → Disconnect
Rider Update 2: Connect → Send → Wait → Disconnect
Rider Update 3: Connect → Send → Wait → Disconnect

WebSocket Approach:

Initial: Connect → Keep connection open
Update 1: Send (instant)
Update 2: Send (instant)
Update 3: Send (instant)

It's like the difference between hanging up and redialing for each sentence vs keeping the phone line open during a conversation.
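
From the rider's side, "keeping the line open" can look like this minimal client sketch using the websockets library (the URL is illustrative, authentication is omitted, and read_gps() stands in for the device's real GPS):

import asyncio
import json

import websockets

def read_gps():
    # Stand-in for reading the device's actual GPS fix
    return 6.5244, 3.3792

async def stream_location():
    # One connection for the whole shift, not one per update
    async with websockets.connect("ws://localhost:8000/ws/location/") as ws:
        while True:
            lat, lng = read_gps()
            await ws.send(json.dumps({"lat": lat, "lng": lng}))
            ack = await ws.recv()      # the server's instant "received" acknowledgement
            await asyncio.sleep(5)     # send every few seconds

asyncio.run(stream_location())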

The Complete Solution: WebSocket + Message Queue

Here's how we architected the solution:

import json

from channels.db import database_sync_to_async
from channels.generic.websocket import AsyncWebsocketConsumer
from django.utils import timezone

from .tasks import process_location_update  # illustrative import path


class LocationUpdateConsumer(AsyncWebsocketConsumer):
    async def connect(self):
        # Rider opens app → WebSocket connects instantly
        self.rider_id = await self.get_rider_id()
        await self.accept()

    @database_sync_to_async
    def get_rider_id(self):
        # ORM access has to be wrapped when called from async code
        return self.scope['user'].rider.id

    async def receive(self, text_data):
        location_data = json.loads(text_data)
        location_data['rider_id'] = self.rider_id  # tie the update to the authenticated rider

        # Put in queue instead of direct DB write
        process_location_update.delay(location_data)

        # Instant response to rider!
        await self.send(text_data=json.dumps({
            'status': 'received',
            'timestamp': timezone.now().isoformat()
        }))
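
To actually serve this consumer, it has to be wired into the project's ASGI routing. Here's a minimal sketch of that wiring (the settings module, app name, and URL are illustrative assumptions):

# asgi.py
import os

from django.core.asgi import get_asgi_application

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "config.settings")  # illustrative settings module
django_asgi_app = get_asgi_application()

from channels.auth import AuthMiddlewareStack
from channels.routing import ProtocolTypeRouter, URLRouter
from django.urls import path

from tracking.consumers import LocationUpdateConsumer  # illustrative app name

application = ProtocolTypeRouter({
    "http": django_asgi_app,
    "websocket": AuthMiddlewareStack(
        URLRouter([
            path("ws/location/", LocationUpdateConsumer.as_asgi()),
        ])
    ),
})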

And the background processing:

from django.contrib.gis.geos import Point

from config.celery import celery_app  # illustrative import paths
from riders.models import Rider


@celery_app.task(bind=True, max_retries=3)
def process_location_update(self, location_data):
    try:
        rider = Rider.objects.get(id=location_data['rider_id'])
        rider.current_coordinates = Point(location_data['lng'], location_data['lat'])
        rider.save(update_fields=['current_coordinates'])
    except Exception as exc:
        # Automatic retry on failure, waiting 60 seconds between attempts
        raise self.retry(exc=exc, countdown=60)
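
One thing the queue introduces is that updates can be processed out of order, for example when a retry lands after a newer position has already been saved. A simple guard (assuming the payload carries a client-side 'sent_at' timestamp and the Rider model has a 'location_updated_at' field, neither of which is in the original code) could look like this:

from django.utils.dateparse import parse_datetime

def is_stale(rider, location_data):
    """Return True if this update is older than the position already stored."""
    sent_at = parse_datetime(location_data.get('sent_at', ''))
    if sent_at is None or rider.location_updated_at is None:
        return False  # nothing to compare against, accept the update
    return sent_at <= rider.location_updated_at

Inside the task, you would check is_stale(rider, location_data) before saving and simply drop stale messages.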

The Results: From 500ms to 30ms

The performance improvement was dramatic:

Before (Blocking HTTP):

  • Response time: 500ms average
  • Failure rate: 15% with 1000 concurrent users
  • Database connections: 1000+ simultaneous
  • User experience: App feels "laggy"

After (WebSocket + Queue):

  • Response time: 30ms average (94% improvement!)
  • Failure rate: <0.1% (automatic retries handle failures)
  • Database connections: 10-20 steady workers
  • User experience: App feels "instant"

The Hybrid Confirmation Strategy

For critical operations like delivery confirmation, we implemented a hybrid approach:

  1. Primary: Automatic GPS-based confirmation when signal is strong
  2. Fallback: Manual button when GPS/network fails
  3. Verification: Photo capture + location audit trail

This ensures payments can always be processed, even when external APIs fail.
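
In code, that decision tree is roughly the following (the 50-meter radius, the model methods, and the start_payment helper are illustrative assumptions, not our production values):

from geopy.distance import geodesic  # assumes geopy is available for distance math

GPS_CONFIRM_RADIUS_M = 50  # illustrative threshold

class ConfirmationRequired(Exception):
    pass

def confirm_with_fallback(delivery, rider_coords=None, manual=False, photo=None):
    """Hybrid confirmation: automatic GPS check first, manual button plus photo as the fallback."""
    if rider_coords and geodesic(rider_coords, delivery.dropoff_coords).meters <= GPS_CONFIRM_RADIUS_M:
        delivery.confirm(method="gps", coords=rider_coords)                   # primary path
    elif manual and photo:
        delivery.confirm(method="manual", photo=photo, coords=rider_coords)   # fallback path
    else:
        raise ConfirmationRequired("Need a GPS fix near the drop-off or a manual confirmation with photo")
    start_payment(delivery)  # hypothetical payment kickoff; runs synchronously, per the UFSPD verdict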

Key Takeaways for Your Architecture

  1. Measure First: Don't optimize based on assumptions. Look at your actual code and identify real bottlenecks.

  2. Use the Right Tool:

    • Sync for critical user flows (authentication, payments)
    • Async for background operations (notifications, analytics)
    • WebSockets for frequent real-time updates
    • Message queues for decoupling and reliability

  3. Think in Systems: Consider the entire flow, not just individual operations. A slow database write affects user experience even if the business logic is fast.

  4. Plan for Failure: Build fallbacks for critical operations. GPS fails, APIs go down, networks are unreliable.

  5. Start Simple: You don't need this architecture on day one. But understand these patterns so you can evolve your system as you scale.

What's Next?

This is just the beginning. Modern systems also need to consider:

  • Different message queue technologies (Kafka vs RabbitMQ vs AWS SQS)
  • gRPC vs HTTP for microservice communication
  • Pub/Sub patterns for event-driven architecture
  • Edge computing for global scale

The fundamentals we covered here - understanding sync vs async, using message queues for decoupling, and WebSockets for real-time communication - form the foundation for all these more advanced concepts.

Remember: great architecture isn't about using the latest technology. It's about understanding your users' needs, measuring your system's performance, and choosing the right patterns to deliver a fast, reliable experience.


Want to dive deeper into any of these concepts? Let's continue the conversation about building systems that scale.
