Going from zero to your first thousand users is an exhilarating milestone. But what happens when that thousand turns into a million, and eventually, a billion? The systems that worked flawlessly for a small user base will inevitably buckle, bottleneck, and break under the immense pressure of global, concurrent traffic.
In the world of high-scale engineering, building for billions requires a fundamental shift in how we think about architecture, data, and performance. It demands abandoning a single monolithic structure in favor of resilient, distributed ecosystems.
Here is a deep dive into the core strategies tech giants use to scale their backends to handle astronomical traffic, focusing on the three pillars of scale: distributed systems, database sharding, and caching.
1. Embracing Distributed Systems: From Monoliths to Fleets
When a traditional web application starts to slow down, the instinct is often to practice vertical scaling (scaling up)—buying a more expensive server with more CPU and RAM. However, vertical scaling has a hard physical limit and becomes cost-prohibitive.
To reach a billion users, you must rely on horizontal scaling (scaling out)—distributing the load across hundreds or thousands of smaller, commodity servers. This is the foundation of a distributed system.
Key Concepts in Distributed Architecture:
- Stateless Services: For horizontal scaling to work seamlessly, backend API servers must be stateless. This means no single server holds unique session data. Any server in the fleet should be able to handle any incoming request. Session state is offloaded to a shared, high-speed datastore (like Redis).
- Load Balancing: A fleet of servers requires a traffic cop. Load balancers distribute incoming network traffic across multiple backend servers to ensure no single machine is overwhelmed, significantly improving responsiveness and availability.
- Microservices (with caution): Large teams often break a monolith down into microservices—independent services handling specific domains (e.g., Billing, User Auth, Notifications). While this allows isolated scaling and faster deployments, it introduces network latency and complex failure modes.
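The round-robin strategy that most load balancers default to can be sketched in a few lines. This is an illustrative toy over a hypothetical server list, not a production balancer (real ones also do health checks and connection draining):

```python
from itertools import cycle

# A hypothetical fleet of stateless backend servers
SERVERS = ["app-01.internal", "app-02.internal", "app-03.internal"]

class RoundRobinBalancer:
    """Hands out backend servers in rotation so load spreads evenly."""

    def __init__(self, servers):
        self._servers = cycle(servers)

    def next_server(self):
        return next(self._servers)

balancer = RoundRobinBalancer(SERVERS)
for _ in range(4):
    print(balancer.next_server())
# Cycles app-01, app-02, app-03, then wraps back to app-01
```

Because the servers are stateless, it does not matter which one the balancer picks; any of them can serve any request.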
2. Database Sharding: Dividing and Conquering Data
Compute is relatively easy to scale horizontally. State—the database—is where the real engineering challenges begin. A single primary database can only handle so many concurrent reads and writes before it becomes the ultimate bottleneck.
When read replicas (copies of the database that handle read-only queries) are no longer enough to handle the write load, the solution is sharding.
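Before reaching for sharding, many teams first squeeze extra read capacity out of a single primary by routing queries at the application level. A minimal sketch of read/write splitting, with placeholder connection strings standing in for real DSNs:

```python
import random

# Placeholder connection strings; a real setup would use actual credentials
PRIMARY = "postgres://primary.db.internal/users"
READ_REPLICAS = [
    "postgres://replica-01.db.internal/users",
    "postgres://replica-02.db.internal/users",
]

def pick_connection(sql):
    """Send writes to the primary; spread reads across the replicas."""
    is_read = sql.lstrip().upper().startswith("SELECT")
    if is_read:
        return random.choice(READ_REPLICAS)
    return PRIMARY

print(pick_connection("SELECT * FROM users WHERE id = 1"))  # one of the replicas
print(pick_connection("UPDATE users SET name = 'x'"))       # always the primary
```

This helps only with reads; every write still lands on the one primary, which is exactly the wall that sharding knocks down.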
How Sharding Works
Sharding is the process of horizontally partitioning your database. Instead of holding all 1 billion user records in one massive database, you split them across 100 separate databases (shards), each holding 10 million users.
- The Shard Key: This is the most critical decision in a distributed database. The shard key determines how data is distributed. For example, if you shard by user_id, user 123 might live on Shard A, while user 456 lives on Shard B.
- The Danger of Hotspots: If you choose a poor shard key—like country—and 80% of your users are in one country, that specific shard will be overwhelmed while the others sit idle. This is known as a database hotspot. A good shard key ensures an even distribution of data and traffic.
- The Trade-off: Sharding solves write bottlenecks, but it makes cross-shard operations painfully difficult. Performing a standard SQL JOIN across two different physical databases is slow and complex, often forcing engineers to handle data aggregation at the application level.
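To make that trade-off concrete, here is a toy scatter-gather lookup. Each shard is simulated by an in-memory dict; the application fans the query out to every shard and merges the hits itself, which is exactly the work a single-database query would have done for you:

```python
# Toy shards: in reality these would be separate physical databases
SHARDS = [
    {123: {"name": "Alice"}, 789: {"name": "Carol"}},
    {456: {"name": "Bob"}},
]

def get_users(user_ids):
    """Scatter the lookup to every shard, then gather the hits."""
    results = {}
    for shard in SHARDS:                     # scatter to each shard
        for uid in user_ids:
            if uid in shard:
                results[uid] = shard[uid]    # gather matching rows
    return results

print(get_users([123, 456]))
# {123: {'name': 'Alice'}, 456: {'name': 'Bob'}}
```

Note that the query cost grows with the number of shards touched, which is why well-chosen shard keys try to keep each request on a single shard.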
3. Caching Strategies: The Fastest Query is the One You Don't Make
Even with a perfectly sharded database, hitting the disk for every user request is too slow for global scale. Caching is the secret weapon of high-performance systems. By storing frequently accessed data in high-speed, volatile memory (RAM), you bypass the database entirely.
Layers of Caching
- Content Delivery Networks (CDNs): The first line of defense. CDNs cache static assets (images, videos, JavaScript) at edge servers physically located near the user. If a user in Tokyo requests a video, it is served from a server in Tokyo, not a data center in Virginia.
- In-Memory Datastores (Redis/Memcached): Used for caching dynamic data, session states, and API responses.
Common Caching Patterns
- Cache-Aside (Lazy Loading): The application first checks the cache. If the data is there (a cache hit), it returns it immediately. If not (a cache miss), it queries the database, saves the result to the cache for next time, and then returns it.
- Write-Through: Every time data is written to the database, it is simultaneously written to the cache. This ensures the cache is never stale, though it adds slight latency to write operations.
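The write-through pattern can be sketched with plain dicts standing in for Redis and the database; only the transport differs in a real system:

```python
# In-memory stand-ins for the cache and the primary database
cache = {}
database = {}

def write_user(user_id, profile):
    """Write-through: the database and the cache are updated together,
    so a subsequent read never sees a stale cache entry."""
    database[user_id] = profile   # 1. the durable write
    cache[user_id] = profile      # 2. the cache, updated in the same operation

def read_user(user_id):
    # Reads are served straight from the cache
    return cache.get(user_id)

write_user(42, {"name": "Ada"})
print(read_user(42))  # {'name': 'Ada'}, served from cache and never stale
```

The cost is visible in `write_user`: every write pays for two stores, which is the "slight latency" mentioned above.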
The hardest part of caching isn't setting it up—it is cache invalidation. Knowing exactly when to delete or update cached data so users don't see outdated information is notoriously one of the most difficult problems in computer science.
The Reality of Scale
Building for a billion users is not about finding a single silver-bullet technology. It is about identifying bottlenecks, introducing decoupling, gracefully degrading services during failures, and accepting trade-offs—like choosing eventual consistency over immediate consistency to gain massive availability.
Scale is a journey of continuous architectural evolution. The system you build for one million users will not be the system that serves one billion.
To make these architectural concepts concrete, let's look at how they are actually implemented in code. Moving from theory to practice requires specific design patterns, so here are practical snippets illustrating the systems discussed in the previous sections.
1. Implementing the Cache-Aside Pattern
As mentioned in the caching section, the cache-aside (or lazy loading) pattern is the industry standard for reducing database load. The application is responsible for reading from and writing to the cache.
Here is how this looks in Python using Redis:
```python
import redis
import json

# Connect to Redis cluster
cache = redis.Redis(host='redis.internal.network', port=6379, db=0)

def get_user_profile(user_id):
    cache_key = f"user_profile:{user_id}"

    # 1. Check the cache first (Cache Hit)
    cached_data = cache.get(cache_key)
    if cached_data:
        print("Served from Redis Cache!")
        return json.loads(cached_data)

    # 2. If not in cache (Cache Miss), query the primary database
    #    ('db' is assumed to be an already-configured database client)
    print("Cache miss. Hitting the database...")
    user_data = db.execute("SELECT * FROM users WHERE id = %s", (user_id,))

    # 3. Populate the cache for the next request
    if user_data:
        # Crucial: Always set a TTL (Time To Live) so stale data eventually expires
        cache.setex(cache_key, 3600, json.dumps(user_data))  # Caches for 1 hour

    return user_data
```
Why this is relevant: Notice the setex function. Setting a TTL (Time To Live) is mandatory at scale. Without it, your Redis cluster will eventually run out of memory, and updates to the user's profile in the database will never be reflected in the cache.
2. Application-Level Shard Routing
When you shard a database, your application needs to know which database to connect to before it can execute a query. This is called routing logic.
Here is a simplified example in Node.js using a Modulo Sharding strategy, where the shard is determined by dividing the user ID by the total number of shards and using the remainder:
```javascript
// A registry of our physical database shards
const DB_SHARDS = [
  'postgres://shard-00.db.internal/users',
  'postgres://shard-01.db.internal/users',
  'postgres://shard-02.db.internal/users',
  'postgres://shard-03.db.internal/users'
];

/**
 * Determines which physical database holds the user's data.
 * (DatabasePool is assumed to be your connection-pool class.)
 */
function getDbConnectionForUser(userId) {
  // Modulo arithmetic to find the correct shard index
  const shardIndex = userId % DB_SHARDS.length;
  const connectionString = DB_SHARDS[shardIndex];
  console.log(`Routing User ${userId} to Shard ${shardIndex}`);
  return new DatabasePool(connectionString);
}

// Execution
async function fetchUser(userId) {
  // 1. Get the correct shard connection
  const db = getDbConnectionForUser(userId);
  // 2. Query that specific shard
  const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
  return user;
}

// User 10452 % 4 = Shard 0
// User 10453 % 4 = Shard 1
```
The danger of this approach: While modulo sharding is fast and easy to write, it has a fatal flaw known as the Resharding Problem. If you need to add a 5th shard to handle more traffic, userId % 5 will yield completely different results than userId % 4. This means almost all your data will suddenly be on the "wrong" shard and must be physically migrated. Modern systems often use Consistent Hashing or directory-based routing (a lookup table) to mitigate this.
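The consistent-hashing fix can be sketched as a hash ring: every shard is hashed onto a circle, a key belongs to the first shard clockwise from its own hash, and adding a shard moves only the keys that fall into the new shard's slice. A minimal Python sketch, without the virtual nodes real systems add for better balance:

```python
import bisect
import hashlib

def ring_hash(value):
    """Stable 32-bit position on the ring for any string."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    def __init__(self, shards):
        self._ring = sorted((ring_hash(s), s) for s in shards)

    def shard_for(self, key):
        # First shard clockwise from the key's position (wrap at the end)
        hashes = [h for h, _ in self._ring]
        index = bisect.bisect(hashes, ring_hash(key)) % len(self._ring)
        return self._ring[index][1]

ring = ConsistentHashRing(["shard-00", "shard-01", "shard-02", "shard-03"])
before = {f"user:{i}": ring.shard_for(f"user:{i}") for i in range(1000)}

# Add a fifth shard: only the keys claimed by shard-04 move
bigger = ConsistentHashRing(
    ["shard-00", "shard-01", "shard-02", "shard-03", "shard-04"])
moved = sum(1 for k, v in before.items() if bigger.shard_for(k) != v)
print(f"{moved} of 1000 keys moved")
```

Contrast this with modulo sharding, where going from 4 to 5 shards remaps roughly 80% of all keys.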
3. Protecting the Backend: Rate Limiting
Scaling isn't just about handling legitimate traffic; it is about surviving malicious traffic, scrapers, and accidental DDoS attacks from your own frontend retrying failed requests.
At a massive scale, APIs must protect themselves using Rate Limiting. Redis is frequently used for this due to its atomic operations.
Here is an example of a fixed-window rate limiter in Python (a simpler cousin of the token bucket, built on a per-window counter):
```python
import time
import redis

redis_client = redis.Redis(host='localhost', port=6379, db=0)

def is_rate_limited(user_id, limit=100, window_in_seconds=60):
    """
    Allows a user to make 'limit' requests per 'window_in_seconds'.
    """
    current_time = int(time.time())
    window_key = f"rate_limit:{user_id}:{current_time // window_in_seconds}"

    # INCR is an atomic operation in Redis. It increments the value and returns
    # it safely even if thousands of requests hit this exact line simultaneously.
    request_count = redis_client.incr(window_key)

    if request_count == 1:
        # If it's the first request in this window, set the key to expire
        # to clean up memory
        redis_client.expire(window_key, window_in_seconds)

    if request_count > limit:
        return True   # User is rate limited
    return False      # Request allowed

# API Middleware logic
def handle_request():
    if is_rate_limited(user_id=8847):
        return "HTTP 429: Too Many Requests", 429
    return process_api_request()  # your actual request handler
```
4. The Shift to Asynchronous Processing
When systems scale, you can no longer process everything synchronously during the HTTP request cycle. If a user uploads a video, you do not keep their browser loading for 10 minutes while you compress the file.
Large systems rely heavily on Message Queues (like RabbitMQ, Apache Kafka, or AWS SQS).
- Producer: The API receives the request, saves the raw data, and pushes an event like `{"task": "compress_video", "video_id": 123}` to the queue. It immediately returns an HTTP 202 Accepted to the user.
- Message Broker: The broker (Kafka, for example) holds the message safely.
- Consumer (Worker): A separate fleet of backend servers constantly polls the queue. They pull the message, do the heavy CPU work in the background, and update the database when finished.
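The producer/broker/consumer flow can be demonstrated locally with Python's standard-library queue standing in for Kafka or RabbitMQ; the shape of the code stays the same, only the transport changes:

```python
import queue
import threading

# Stand-in for the message broker (Kafka, RabbitMQ, SQS, ...)
broker = queue.Queue()
results = {}

def api_handler(video_id):
    """Producer: enqueue the heavy work and return immediately."""
    broker.put({"task": "compress_video", "video_id": video_id})
    return 202  # HTTP 202 Accepted: the browser is not kept waiting

def worker():
    """Consumer: pull tasks off the queue and do the heavy lifting."""
    while True:
        message = broker.get()
        if message is None:  # sentinel value: shut down
            break
        results[message["video_id"]] = "compressed"  # the "heavy" work
        broker.task_done()

thread = threading.Thread(target=worker)
thread.start()

print(api_handler(123))  # 202, returned instantly
broker.join()            # wait for the worker to finish (for this demo only)
broker.put(None)
thread.join()
print(results)           # {123: 'compressed'}
```

Notice that `api_handler` never blocks on the compression itself; the worker fleet can be scaled up or down independently of the API servers.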