From Blocking to Blazing: How We Solved the 1000 Concurrent Users Problem

A beginner-friendly deep dive into sync vs async, WebSockets, and message queues through real-world problem solving

The Problem That Started It All

Picture this: You've built a parcel tracking app. Everything works beautifully with 10 users. Then 100. But what happens when 1000 delivery riders try to update their location simultaneously?

Your database crashes. Your app becomes unusable. Your users are furious.

This isn't hypothetical - this is exactly what we discovered during a recent architecture review. Let me walk you through how we identified the problem, understood the underlying concepts, and built a solution that scales to thousands of concurrent users.

The "Aha!" Moment: Finding the Bottleneck

When we first suspected performance issues, we didn't guess - we looked at the actual code:

from django.db import transaction

@transaction.atomic  # 🚨 RED FLAG!
def update_rider_location(self, validated_data):
    # Lives on a DRF serializer; the HTTP response is held open until this write commits
    rider = self.context['request'].user.rider
    rider.current_coordinates = validated_data['coordinates']
    rider.save(update_fields=["current_coordinates", "current_location"])
    return rider

That innocent-looking @transaction.atomic decorator? It's a performance killer at scale.

Here's what happens with 1000 concurrent riders:

  • Each location update blocks the HTTP response until the database write completes
  • The database gets overwhelmed with simultaneous write operations
  • Response times climb from 50ms to 2+ seconds
  • Some requests timeout completely
  • Users experience the app as "broken"
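
One way to see this for yourself is a quick load test. Below is a minimal sketch using asyncio and httpx (the endpoint URL and payload are illustrative assumptions, and authentication is omitted); it fires many concurrent updates at the blocking endpoint and reports latency percentiles:

import asyncio
import time

import httpx

API = "http://localhost:8000/api/riders/location/"  # hypothetical endpoint

async def send_update(client, rider_id):
    start = time.perf_counter()
    await client.post(API, json={"rider_id": rider_id, "lat": 6.5244, "lng": 3.3792})
    return time.perf_counter() - start

async def main(concurrency=1000):
    limits = httpx.Limits(max_connections=concurrency)
    async with httpx.AsyncClient(limits=limits, timeout=30) as client:
        latencies = sorted(await asyncio.gather(
            *(send_update(client, i) for i in range(concurrency))
        ))
    p50 = latencies[len(latencies) // 2]
    p95 = latencies[int(len(latencies) * 0.95)]
    print(f"p50={p50 * 1000:.0f}ms  p95={p95 * 1000:.0f}ms")

asyncio.run(main())

With the blocking implementation, it's the p95 number that climbs first as you raise the concurrency.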

Understanding Sync vs Async: The Foundation

Before jumping to solutions, we need to understand when to use synchronous vs asynchronous operations. Forget the oversimplified "revenue operations should be sync" rule - here's the real decision framework:

The UFSPD Framework

Ask these five questions:

U - User Experience: Does the user need immediate feedback?
F - Failure Impact: What happens if this operation fails?
S - State Consistency: Does the system need consistent state immediately?
P - Performance: What are your actual SLA requirements?
D - Dependencies: Does something else immediately depend on this completing?

Let's apply this to our location updates:

  • User Experience: Rider doesn't need to wait for location to be saved
  • Failure Impact: Failed location update can be retried without user knowing
  • State Consistency: Eventual consistency is fine for tracking
  • Performance: Should be sub-100ms for good UX
  • Dependencies: Nothing immediately depends on each individual location update

Verdict: This should be ASYNC

But what about delivery confirmation? That's different:

  • User Experience: Customer needs to know delivery is confirmed for payment
  • Failure Impact: Failed confirmation blocks payment flow
  • State Consistency: Payment state must be immediately consistent
  • Performance: Users expect confirmation within seconds
  • Dependencies: Payment calculation depends on this

Verdict: This should be SYNC
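
Here's a rough sketch of how those two verdicts translate into Django views (the view names, model, and helpers are illustrative assumptions, not our production code): the confirmation endpoint blocks until the write commits because payment depends on it, while the location endpoint only acknowledges and hands the work off.

import json

from django.db import transaction
from django.http import JsonResponse

# SYNC: the response waits for the write because payment needs a consistent state.
@transaction.atomic
def confirm_delivery(request, delivery_id):
    delivery = Delivery.objects.select_for_update().get(id=delivery_id)  # assumed model
    delivery.mark_confirmed()                 # must commit before we answer
    trigger_payment(delivery)                 # hypothetical payment kickoff
    return JsonResponse({"status": "confirmed"})

# ASYNC: acknowledge immediately, persist in the background.
def update_location(request):
    process_location_update.delay(json.loads(request.body))  # hand off to the queue
    return JsonResponse({"status": "received"}, status=202)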

The Restaurant Kitchen Analogy: Understanding Message Queues

Think of your database like a restaurant kitchen. Currently, every order (location update) goes straight to the chef (database):

Without Message Queue (Current State):

  • 100 customers crowd around one chef
  • Chef gets overwhelmed
  • Orders get backed up
  • Some customers leave angry
  • Kitchen becomes chaotic

With Message Queue:

  • Customers place orders with the cashier (queue)
  • Orders go to a ticket system
  • Kitchen processes tickets at optimal pace
  • Customers get instant "order received" confirmation
  • Kitchen works smoothly even during rush hour
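
You can feel the difference with a toy version of this analogy using nothing but Python's standard library: every "customer" gets an instant acknowledgement, while a small fixed pool of "cooks" works through the backlog at its own pace.

import queue
import threading
import time

orders = queue.Queue()

def cook():
    while True:
        order = orders.get()
        time.sleep(0.05)  # the slow part (the database write, in our case)
        orders.task_done()

# A small, steady kitchen: 5 cooks no matter how many customers show up.
for _ in range(5):
    threading.Thread(target=cook, daemon=True).start()

start = time.time()
for order_id in range(1000):  # 1000 customers place orders
    orders.put(order_id)      # each enqueue returns immediately
print(f"All 1000 orders accepted in {time.time() - start:.3f}s")

orders.join()                 # the kitchen finishes at its own pace
print(f"All orders cooked after {time.time() - start:.1f}s")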

WebSockets: Keeping the Line Open

But there's another optimization we can make. Currently, each location update requires a new HTTP connection:

HTTP Approach (Current):

Rider Update 1: Connect → Send → Wait → Disconnect
Rider Update 2: Connect → Send → Wait → Disconnect
Rider Update 3: Connect → Send → Wait → Disconnect

WebSocket Approach:

Initial: Connect → Keep connection open
Update 1: Send (instant)
Update 2: Send (instant)
Update 3: Send (instant)

It's like the difference between hanging up and redialing for each sentence vs keeping the phone line open during a conversation.
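
From the rider's side, "keeping the line open" can look like this minimal client sketch using the websockets library (the URL is illustrative, authentication is omitted, and read_gps() stands in for the device's real GPS):

import asyncio
import json

import websockets

def read_gps():
    # Stand-in for reading the device's actual GPS fix
    return 6.5244, 3.3792

async def stream_location():
    # One connection for the whole shift, not one per update
    async with websockets.connect("ws://localhost:8000/ws/location/") as ws:
        while True:
            lat, lng = read_gps()
            await ws.send(json.dumps({"lat": lat, "lng": lng}))
            ack = await ws.recv()      # the server's instant "received" acknowledgement
            await asyncio.sleep(5)     # send every few seconds

asyncio.run(stream_location())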

The Complete Solution: WebSocket + Message Queue

Here's how we architected the solution:

import json

from channels.db import database_sync_to_async
from channels.generic.websocket import AsyncWebsocketConsumer
from django.utils import timezone

from .tasks import process_location_update  # illustrative import path


class LocationUpdateConsumer(AsyncWebsocketConsumer):
    async def connect(self):
        # Rider opens app → WebSocket connects instantly
        self.rider_id = await self.get_rider_id()
        await self.accept()

    @database_sync_to_async
    def get_rider_id(self):
        # ORM access has to be wrapped when called from async code
        return self.scope['user'].rider.id

    async def receive(self, text_data):
        location_data = json.loads(text_data)
        location_data['rider_id'] = self.rider_id  # tie the update to the authenticated rider

        # Put in queue instead of direct DB write
        process_location_update.delay(location_data)

        # Instant response to rider!
        await self.send(text_data=json.dumps({
            'status': 'received',
            'timestamp': timezone.now().isoformat()
        }))
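
To actually serve this consumer, it has to be wired into the project's ASGI routing. Here's a minimal sketch of that wiring (the settings module, app name, and URL are illustrative assumptions):

# asgi.py
import os

from django.core.asgi import get_asgi_application

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "config.settings")  # illustrative settings module
django_asgi_app = get_asgi_application()

from channels.auth import AuthMiddlewareStack
from channels.routing import ProtocolTypeRouter, URLRouter
from django.urls import path

from tracking.consumers import LocationUpdateConsumer  # illustrative app name

application = ProtocolTypeRouter({
    "http": django_asgi_app,
    "websocket": AuthMiddlewareStack(
        URLRouter([
            path("ws/location/", LocationUpdateConsumer.as_asgi()),
        ])
    ),
})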

And the background processing:

from django.contrib.gis.geos import Point

from config.celery import celery_app  # illustrative import paths
from riders.models import Rider


@celery_app.task(bind=True, max_retries=3)
def process_location_update(self, location_data):
    try:
        rider = Rider.objects.get(id=location_data['rider_id'])
        rider.current_coordinates = Point(location_data['lng'], location_data['lat'])
        rider.save(update_fields=['current_coordinates'])
    except Exception as exc:
        # Automatic retry on failure, waiting 60 seconds between attempts
        raise self.retry(exc=exc, countdown=60)
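
One thing the queue introduces is that updates can be processed out of order, for example when a retry lands after a newer position has already been saved. A simple guard (assuming the payload carries a client-side 'sent_at' timestamp and the Rider model has a 'location_updated_at' field, neither of which is in the original code) could look like this:

from django.utils.dateparse import parse_datetime

def is_stale(rider, location_data):
    """Return True if this update is older than the position already stored."""
    sent_at = parse_datetime(location_data.get('sent_at', ''))
    if sent_at is None or rider.location_updated_at is None:
        return False  # nothing to compare against, accept the update
    return sent_at <= rider.location_updated_at

Inside the task, you would check is_stale(rider, location_data) before saving and simply drop stale messages.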

The Results: From 500ms to 30ms

The performance improvement was dramatic:

Before (Blocking HTTP):

  • Response time: 500ms average
  • Failure rate: 15% with 1000 concurrent users
  • Database connections: 1000+ simultaneous
  • User experience: App feels "laggy"

After (WebSocket + Queue):

  • Response time: 30ms average (94% improvement!)
  • Failure rate: <0.1% (automatic retries handle failures)
  • Database connections: 10-20 steady workers
  • User experience: App feels "instant"

The Hybrid Confirmation Strategy

For critical operations like delivery confirmation, we implemented a hybrid approach:

  1. Primary: Automatic GPS-based confirmation when signal is strong
  2. Fallback: Manual button when GPS/network fails
  3. Verification: Photo capture + location audit trail

This ensures payments can always be processed, even when external APIs fail.
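
In code, that decision tree is roughly the following (the 50-meter radius, the model methods, and the start_payment helper are illustrative assumptions, not our production values):

from geopy.distance import geodesic  # assumes geopy is available for distance math

GPS_CONFIRM_RADIUS_M = 50  # illustrative threshold

class ConfirmationRequired(Exception):
    pass

def confirm_with_fallback(delivery, rider_coords=None, manual=False, photo=None):
    """Hybrid confirmation: automatic GPS check first, manual button plus photo as the fallback."""
    if rider_coords and geodesic(rider_coords, delivery.dropoff_coords).meters <= GPS_CONFIRM_RADIUS_M:
        delivery.confirm(method="gps", coords=rider_coords)                   # primary path
    elif manual and photo:
        delivery.confirm(method="manual", photo=photo, coords=rider_coords)   # fallback path
    else:
        raise ConfirmationRequired("Need a GPS fix near the drop-off or a manual confirmation with photo")
    start_payment(delivery)  # hypothetical payment kickoff; runs synchronously, per the UFSPD verdict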

Key Takeaways for Your Architecture

  1. Measure First: Don't optimize based on assumptions. Look at your actual code and identify real bottlenecks.

  2. Use the Right Tool:

    • Sync for critical user flows (authentication, payments)
    • Async for background operations (notifications, analytics)
    • WebSockets for frequent real-time updates
    • Message queues for decoupling and reliability

  3. Think in Systems: Consider the entire flow, not just individual operations. A slow database write affects user experience even if the business logic is fast.

  4. Plan for Failure: Build fallbacks for critical operations. GPS fails, APIs go down, networks are unreliable.

  5. Start Simple: You don't need this architecture on day one. But understand these patterns so you can evolve your system as you scale.

What's Next?

This is just the beginning. Modern systems also need to consider:

  • Different message queue technologies (Kafka vs RabbitMQ vs AWS SQS)
  • gRPC vs HTTP for microservice communication
  • Pub/Sub patterns for event-driven architecture
  • Edge computing for global scale

The fundamentals we covered here - understanding sync vs async, using message queues for decoupling, and WebSockets for real-time communication - form the foundation for all these more advanced concepts.

Remember: great architecture isn't about using the latest technology. It's about understanding your users' needs, measuring your system's performance, and choosing the right patterns to deliver a fast, reliable experience.


Want to dive deeper into any of these concepts? Let's continue the conversation about building systems that scale.
