Originally published at getambassador.io

Mastering API Throttling: Techniques and Best Practices for Optimal Performance

APIs are the foundation of the modern web. With a single URL and a few kilobytes of payload, you can access extremely powerful services and knit them together into a billion-dollar product.

But because of their ease of use and power, APIs are also open to extreme abuse and overuse. High volumes of requests can overwhelm servers, degrade performance, and lead to service outages. Without systems to mitigate this, any reasonably popular API will quickly become overwhelmed or see its production costs go through the roof.

This is where API throttling comes into play. API throttling allows API producers to limit the requests to their service and manage resource consumption, ensuring optimal performance and availability for all users. Let’s go into the details.

What is API Throttling?

API throttling controls the rate at which client applications can access an API, usually within a specified time frame. This helps manage server resources, prevent abuse, and ensure fair usage among all API consumers. At its core, API throttling works by tracking API requests and enforcing predefined limits. When a client exceeds these limits, the API responds with an error, delays the request, or queues it for later processing.

There are several types of API throttling. Here, we’ll go through the main techniques, their characteristics, and use cases.

Understanding Rate Limiting: A Simple Yet Effective Throttling Strategy

Rate limiting is the most straightforward form of API throttling, allowing a fixed number of requests within a specified time window. It's widely used due to its simplicity and effectiveness in controlling API traffic. This method is particularly useful for APIs with consistent usage patterns or where an explicit upper bound on request frequency can be defined. However, it may not be optimal for APIs with highly variable traffic or complex resource requirements.

Here’s how you might implement rate limiting in Python:

```python
from time import time

class RateLimiter:
    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.tokens = {}

    def is_allowed(self, client_id):
        now = time()
        if client_id not in self.tokens:
            self.tokens[client_id] = []
        self.tokens[client_id] = [t for t in self.tokens[client_id] if t > now - self.window]
        if len(self.tokens[client_id]) < self.limit:
            self.tokens[client_id].append(now)
            return True
        return False
```


In this implementation, the RateLimiter class is initialized with two parameters:

  • limit: The maximum number of requests allowed within the time window.
  • window: The duration of the time window in seconds.

The tokens dictionary stores the timestamp of each request for each client, and the is_allowed method determines whether a new request from a given client_id should be allowed. It first gets the current timestamp (now). If this is the first request from the client, it initializes an empty list for that client in the tokens dictionary, then filters the list of timestamps for the client, keeping only those within the current time window. If the number of requests (timestamps) within the window is less than the limit, the request is allowed, and the current timestamp is appended to the client's list. If the number of requests has reached the limit, it returns False, denying the request.

Let’s say you are building a public weather API that allows 1000 requests per hour per API key. This ensures fair usage among free-tier users while preventing any single client from overwhelming the service. To use this rate limiter in a Flask API, for example, you might do something like this:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)
limiter = RateLimiter(limit=1000, window=3600)  # 1000 requests per hour

@app.route('/api/resource')
def get_resource():
    client_id = request.remote_addr  # Use IP address as client ID
    if limiter.is_allowed(client_id):
        # Process the request
        return jsonify({"status": "success", "data": "Your resource data"})
    else:
        return jsonify({"status": "error", "message": "Rate limit exceeded"}), 429

if __name__ == '__main__':
    app.run()
```

This creates a rate limiter, allowing 1000 requests per hour for each client IP address. When the limit is exceeded, it returns a 429 (Too Many Requests) status code.

The benefits of rate limiting are:

  • Simple to implement and understand
  • Predictable behavior for clients
  • Effective at preventing basic forms of API abuse

However, it can be too rigid for applications with varying traffic patterns, doesn't account for request complexity or server load, and may lead to inefficient resource use during low-traffic periods.

Concurrent Request Limiting

Concurrent request limiting restricts the number of simultaneous requests a client can make, regardless of the total requests over time. This method is particularly effective for managing resources with fixed concurrency limits, such as database connections or processing threads. However, it may not be suitable for all types of APIs, especially those with varying request processing times.

Here's an implementation of concurrent request limiting:

```python
import threading

class ConcurrentLimiter:
    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.current = {}
        self.lock = threading.Lock()

    def acquire(self, client_id):
        with self.lock:
            if client_id not in self.current:
                self.current[client_id] = 0
            if self.current[client_id] < self.max_concurrent:
                self.current[client_id] += 1
                return True
            return False

    def release(self, client_id):
        with self.lock:
            if client_id in self.current and self.current[client_id] > 0:
                self.current[client_id] -= 1
```

In this implementation, the ConcurrentLimiter class is initialized with one parameter: max_concurrent, which specifies the maximum number of concurrent requests allowed per client.

The current dictionary keeps track of the number of ongoing requests for each client. The acquire method attempts to gain a "slot" for a new request, while the release method frees up a slot when a request is completed. The threading.Lock ensures thread-safety in multi-threaded environments.

Let's consider an image processing API that limits each client to 5 concurrent requests. This ensures that the GPU resources are shared fairly among all users, preventing any single client from monopolizing the processing power. Here's how you might use this limiter in an API:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)
limiter = ConcurrentLimiter(max_concurrent=5)

@app.route('/api/process-image', methods=['POST'])
def process_image():
    client_id = request.remote_addr
    if limiter.acquire(client_id):
        try:
            # Process the image
            result = process_image_function(request.data)
            return jsonify({"status": "success", "data": result})
        finally:
            limiter.release(client_id)
    else:
        return jsonify({"status": "error", "message": "Too many concurrent requests"}), 429

if __name__ == '__main__':
    app.run()
```

This creates a concurrent limiter, allowing five simultaneous requests per client IP address. When the limit is exceeded, it returns a 429 code.

The benefits of concurrent request limiting are:

  • Prevents clients from overwhelming the server with parallel requests
  • Useful for managing resources that have a fixed concurrency limit
  • Can help maintain low latency for all clients

However, it may underutilize server resources if requests have varying processing times, can be complex to implement in distributed systems, and might not prevent abuse if clients quickly make sequential requests.
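
One way to mitigate that last gap is to pair concurrent limiting with the rate limiter shown earlier. Here is a minimal sketch that assumes both the RateLimiter and ConcurrentLimiter classes above are available; it is an illustration, not a production pattern:

```python
class CombinedLimiter:
    """Hypothetical wrapper enforcing both a request-rate cap and a concurrency cap."""
    def __init__(self, limit, window, max_concurrent):
        self.rate = RateLimiter(limit=limit, window=window)
        self.concurrency = ConcurrentLimiter(max_concurrent=max_concurrent)

    def acquire(self, client_id):
        # Check the rate budget first, then try to take a concurrency slot.
        # (If the slot is refused, the attempt has already been counted against
        # the rate budget; that simplification is fine for a sketch.)
        if not self.rate.is_allowed(client_id):
            return False
        return self.concurrency.acquire(client_id)

    def release(self, client_id):
        # Only the concurrency slot is returned; rate-window entries expire on their own
        self.concurrency.release(client_id)
```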

Token Bucket Algorithm

The token bucket algorithm uses a metaphorical "bucket" that continuously fills with tokens at a fixed rate. Each request consumes a token; requests are only allowed if tokens are available. This method allows for short bursts of traffic while maintaining a long-term rate limit, making it more flexible than simple rate limiting.

Here's an implementation of the token bucket algorithm:

```python
import time

class TokenBucket:
    def __init__(self, capacity, fill_rate):
        self.capacity = capacity
        self.fill_rate = fill_rate
        self.tokens = capacity
        self.last_fill = time.time()

    def consume(self, tokens=1):
        now = time.time()
        tokens_to_add = (now - self.last_fill) * self.fill_rate
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_fill = now

        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False
```

In this implementation, the TokenBucket class is initialized with capacity, the maximum number of tokens the bucket can hold, and fill_rate, the rate at which tokens are added to the bucket (tokens per second).

The consume method attempts to consume a specified number of tokens (default 1) for a request. It first calculates how many tokens should be added based on the time elapsed since the last fill, adds those tokens (up to the capacity), and then checks if there are enough tokens for the request.

Consider a stock market data API that uses a token bucket with a capacity of 100 tokens and a refill rate of 10 tokens per second. This allows clients to make quick bursts of requests during market-moving events while still maintaining an average rate limit. Here's how you might use this in such an API:
```python
from flask import Flask, request, jsonify

app = Flask(__name__)
bucket = TokenBucket(capacity=100, fill_rate=10)

@app.route('/api/stock-data')
def get_stock_data():
    if bucket.consume():
        # Fetch and return stock data
        return jsonify({"status": "success", "data": "Stock data here"})
    else:
        return jsonify({"status": "error", "message": "Rate limit exceeded"}), 429

if __name__ == '__main__':
    app.run()
```

This creates a token bucket that allows an average of 10 requests per second and can burst up to 100 requests. When the bucket is empty, it returns a 429 code.
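
Because consume accepts a token count, heavier endpoints can be charged more tokens than lighter ones. A small illustrative sketch building on the example above; the endpoint name and costs here are hypothetical:

```python
# Hypothetical per-endpoint costs: bulk history queries drain the bucket faster than single quotes
ENDPOINT_COSTS = {
    "/api/stock-data": 1,
    "/api/stock-history": 5,   # assumed to be a more expensive query
}

@app.route('/api/stock-history')
def get_stock_history():
    if bucket.consume(tokens=ENDPOINT_COSTS["/api/stock-history"]):
        return jsonify({"status": "success", "data": "Historical data here"})
    return jsonify({"status": "error", "message": "Rate limit exceeded"}), 429
```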

The benefits of the token bucket algorithm are:

  • Allows for short bursts of traffic, providing flexibility for clients
  • Can be easily adjusted by changing bucket size or token refill rate
  • Offers a good balance between strict rate limiting and allowing occasional spikes

However, it's more complex to implement and understand compared to simple rate limiting, can be challenging to tune for optimal performance, and may still allow sustained high rates if not configured correctly.

Leaky Bucket Algorithm

The leaky bucket algorithm processes requests at a fixed rate, using a queue to handle incoming requests that exceed this rate. Unlike the token bucket algorithm, which allows for bursts of traffic up to the bucket's capacity, the leaky bucket algorithm enforces a strictly consistent outflow rate. This makes the leaky bucket particularly well-suited for scenarios where a steady, predictable rate of requests is crucial, such as in traffic shaping or when interfacing with systems that have strict rate requirements.

Here's an implementation of the leaky bucket algorithm:

```python
import time
from collections import deque

class LeakyBucket:
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.bucket = deque()
        self.last_leak = time.time()

    def add(self):
        now = time.time()
        leak_amount = int((now - self.last_leak) * self.leak_rate)
        for _ in range(min(leak_amount, len(self.bucket))):
            self.bucket.popleft()
        self.last_leak = now

        if len(self.bucket) < self.capacity:
            self.bucket.append(now)
            return True
        return False
```

In this implementation, the LeakyBucket class is initialized with two parameters:

  • capacity: The maximum number of requests that can be queued.
  • leak_rate: The rate at which requests are processed (requests per second).

The add method attempts to add a new request to the bucket. It first "leaks" any requests that should have been processed since the last check, then adds the new request if there's space in the bucket.

Let's consider an email service that uses a leaky bucket algorithm to limit outgoing emails to 100 per minute. This ensures a steady flow of emails, preventing email providers from flagging the service as a spam source. Here's an API using LeakyBucket:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)
bucket = LeakyBucket(capacity=100, leak_rate=100/60)  # 100 emails per minute

@app.route('/api/send-email', methods=['POST'])
def send_email():
    if bucket.add():
        # Send the email
        return jsonify({"status": "success", "message": "Email queued for sending"})
    else:
        return jsonify({"status": "error", "message": "Rate limit exceeded, try again later"}), 429

if __name__ == '__main__':
    app.run()
```
This example creates a leaky bucket with a capacity of 100 and a leak rate of 100/60 (approximately 1.67) requests per second, effectively limiting it to 100 emails per minute. When the bucket is full, the API returns a 429 (Too Many Requests) status code.

The benefits of the leaky bucket algorithm are:

  • Smooths out traffic spikes, providing a consistent outflow of requests
  • Useful for rate-limiting outgoing traffic (e.g., in a web crawler or email sender)
  • Can help in scenarios where maintaining a steady request rate is crucial

However, it may introduce additional latency for bursty traffic patterns, can be memory-intensive if the bucket size is large, and is not ideal for scenarios requiring immediate response to traffic spikes.
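
Note that the LeakyBucket above rejects work when it is full rather than literally queuing it. If you want excess requests to actually drain at the leak rate, one option is a bounded queue plus a background worker. The sketch below uses only the standard library; send_email_now is a hypothetical delivery function:

```python
import queue
import threading
import time

email_queue = queue.Queue(maxsize=100)  # bounded, mirroring the bucket's capacity

def drain_worker(leak_rate):
    # Process one queued email every 1/leak_rate seconds for a steady outflow
    interval = 1.0 / leak_rate
    while True:
        message = email_queue.get()   # blocks until work is available
        send_email_now(message)       # hypothetical delivery function
        email_queue.task_done()
        time.sleep(interval)

threading.Thread(target=drain_worker, args=(100 / 60,), daemon=True).start()
```

The API handler would then call email_queue.put_nowait(message) and translate a queue.Full exception into a 429 response.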

Dynamic Throttling

Dynamic throttling adjusts limits based on current server load or other real-time factors. This method is the most flexible and can potentially make the most efficient use of server resources, but it's also the most complex to implement and tune effectively.

Here's an implementation of dynamic throttling:

```python
import time
import psutil

class DynamicThrottler:
    def __init__(self, base_limit, max_limit):
        self.base_limit = base_limit
        self.max_limit = max_limit
        self.current_limit = base_limit
        self.requests = {}

    def is_allowed(self, client_id):
        now = time.time()
        cpu_usage = psutil.cpu_percent()
        self.current_limit = self.base_limit + (self.max_limit - self.base_limit) * (1 - cpu_usage / 100)

        if client_id not in self.requests:
            self.requests[client_id] = []
        self.requests[client_id] = [t for t in self.requests[client_id] if t > now - 60]

        if len(self.requests[client_id]) < self.current_limit:
            self.requests[client_id].append(now)
            return True
        return False
```

In this implementation, the DynamicThrottler class is initialized with two parameters:

  • base_limit: The minimum number of requests allowed per minute.
  • max_limit: The maximum number of requests allowed per minute.

The is_allowed method checks the current CPU usage and adjusts the current limit accordingly. It then checks if the client has exceeded this limit in the last minute.

Consider a cloud-based machine learning API that dynamically adjusts its rate limits based on current GPU utilization. During periods of low usage, it allows more requests per client but tightens restrictions as the system load increases. Here's how you might use this in an API:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)
throttler = DynamicThrottler(base_limit=10, max_limit=100)

@app.route('/api/ml-predict', methods=['POST'])
def ml_predict():
    client_id = request.remote_addr
    if throttler.is_allowed(client_id):
        # Perform ML prediction
        return jsonify({"status": "success", "data": "Prediction result"})
    else:
        return jsonify({"status": "error", "message": "Rate limit exceeded"}), 429

if __name__ == '__main__':
    app.run()
```
This example creates a dynamic throttler that allows between 10 and 100 requests per minute, depending on the current system load. When the limit is exceeded, it returns a 429 status code.

The benefits of dynamic throttling are:

  • Adapts to changing server conditions in real-time
  • Can maximize resource utilization during low-load periods
  • Provides better user experience by allowing more requests when possible

However, it's complex to implement and tune effectively, can lead to unpredictable behavior for clients, and requires careful monitoring and adjustment to prevent oscillation.
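
One common way to reduce oscillation is to smooth the load signal before feeding it into the limit calculation, for example with an exponential moving average. A minimal sketch, assuming psutil as in the code above and a smoothing factor you would tune:

```python
import psutil

class SmoothedCpuReader:
    """Smooths psutil.cpu_percent() readings so the limit does not swing on every spike."""
    def __init__(self, alpha=0.2):
        self.alpha = alpha      # assumed smoothing factor; smaller values react more slowly
        self.smoothed = None

    def read(self):
        sample = psutil.cpu_percent()
        if self.smoothed is None:
            self.smoothed = sample
        else:
            self.smoothed = self.alpha * sample + (1 - self.alpha) * self.smoothed
        return self.smoothed
```

DynamicThrottler.is_allowed could then call reader.read() in place of psutil.cpu_percent() directly.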

Hard vs. Soft Throttling

API throttling can be implemented as either hard throttling or soft throttling. Hard throttling strictly enforces the request limit. Once the limit is reached, all subsequent requests are rejected until the next time window. This implementation typically uses a counter that resets at fixed intervals and returns an HTTP 429 (Too Many Requests) status code when the limit is exceeded. The benefits of hard throttling are:

  • Simple to implement and understand
  • Provides predictable and consistent behavior
  • Effectively prevents API abuse

However, if clients do not handle rejections gracefully, hard throttling can frustrate users during traffic spikes and lead to lost data or failed operations. You’re more likely to need hard throttling when enforcing request limits for free-tier users in a freemium model, or when protecting core infrastructure from overload in high-stakes environments like financial systems.

Soft throttling, by contrast, allows clients to exceed the limit to a certain degree, depending on current server capacity. It typically combines counters with server load metrics, queuing excess requests and processing them at a reduced rate.

The pros here are:

  • More flexible and user-friendly
  • Can handle traffic spikes more gracefully
  • Allows for better resource utilization

Soft throttling is more complex to implement and tune, and less predictable for clients, but it is a good fit for APIs that experience predictable load variations (e.g., higher during business hours, lower at night) or for services running on scalable cloud infrastructure that can handle some degree of overload.
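
As one illustration, here is a minimal sketch of a soft throttler that admits requests beyond a base limit, up to a hard cap, while measured CPU load stays below a threshold. It assumes psutil is available and reuses the sliding-window bookkeeping from the earlier RateLimiter; the thresholds are placeholders:

```python
import time
import psutil

class SoftThrottler:
    def __init__(self, base_limit, hard_limit, window=60, cpu_threshold=70):
        self.base_limit = base_limit        # normally enforced limit
        self.hard_limit = hard_limit        # absolute ceiling, even when the server is idle
        self.window = window
        self.cpu_threshold = cpu_threshold  # assumed load cutoff for allowing overage
        self.requests = {}

    def is_allowed(self, client_id):
        now = time.time()
        history = [t for t in self.requests.get(client_id, []) if t > now - self.window]
        self.requests[client_id] = history

        if len(history) < self.base_limit:
            allowed = True
        elif len(history) < self.hard_limit and psutil.cpu_percent() < self.cpu_threshold:
            allowed = True   # soft overage: the server has spare capacity
        else:
            allowed = False

        if allowed:
            history.append(now)
        return allowed
```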

In practice, many API providers combine hard and soft throttling techniques to balance system protection with user experience. The choice between them often depends on the specific use case, infrastructure capabilities, and business requirements.

Best Practices in API Throttling

If you decide to throttle your API, a few considerations can make it better for both producers and consumers.

Use granular rate limits: Implement multiple tiers of rate limits (e.g., per second, minute, hour, and day) to provide fine-grained control. This approach helps prevent short-term spikes while allowing for higher long-term usage. It can also be tailored to different API endpoints based on resource requirements.
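
A minimal sketch of stacking several of the earlier RateLimiter instances so that a request must fit within every tier; the specific limits are illustrative, not recommendations:

```python
class TieredRateLimiter:
    """Admits a request only if every tier still has room; reuses the RateLimiter class above."""
    def __init__(self):
        # Illustrative tiers: 10/second, 300/minute, 5000/hour
        self.tiers = [
            RateLimiter(limit=10, window=1),
            RateLimiter(limit=300, window=60),
            RateLimiter(limit=5000, window=3600),
        ]

    def is_allowed(self, client_id):
        # Note: a tier that says yes before a later tier says no still records the attempt;
        # acceptable for a sketch, but a production limiter would check before committing.
        return all(tier.is_allowed(client_id) for tier in self.tiers)
```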

Leverage distributed rate limiting: Use a centralized data store (e.g., Redis) to maintain rate limit counters across multiple API servers in distributed systems. Implement a Lua script for atomic increment-and-check operations to ensure race-condition-free rate limiting in high-concurrency environments.
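
A minimal sketch of a fixed-window counter in Redis, assuming the redis-py package and a Redis server reachable on localhost; the Lua script keeps the increment and expiry atomic, and fixed-window counting is itself a simplification:

```python
import redis

r = redis.Redis(host='localhost', port=6379)

# Atomically increment the per-client counter and set its expiry on first use
LUA_FIXED_WINDOW = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return current
"""
fixed_window = r.register_script(LUA_FIXED_WINDOW)

def is_allowed(client_id, limit=1000, window=3600):
    key = f"ratelimit:{client_id}:{window}"
    count = fixed_window(keys=[key], args=[window])
    return count <= limit
```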

Provide clear rate limit information in responses: Include rate limit details in API response headers (e.g., X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset). This allows clients to self-regulate their request rates and implement effective retry strategies, reducing unnecessary load on your servers.
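
A minimal sketch of attaching these headers, building on the earlier Flask rate-limiting example (app and limiter defined there). It reaches into the illustrative RateLimiter's internals to compute the remaining budget, which a production limiter would expose properly, and the reset value is only a coarse approximation:

```python
from time import time
from flask import request, jsonify

@app.route('/api/resource-with-headers')
def get_resource_with_headers():
    client_id = request.remote_addr
    allowed = limiter.is_allowed(client_id)
    remaining = max(0, limiter.limit - len(limiter.tokens.get(client_id, [])))

    if allowed:
        response = jsonify({"status": "success", "data": "Your resource data"})
    else:
        response = jsonify({"status": "error", "message": "Rate limit exceeded"})
        response.status_code = 429

    # Advertise the policy so clients can pace themselves
    response.headers['X-RateLimit-Limit'] = str(limiter.limit)
    response.headers['X-RateLimit-Remaining'] = str(remaining)
    # Coarse approximation: a full window from now
    response.headers['X-RateLimit-Reset'] = str(int(time() + limiter.window))
    return response
```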

Implement circuit breakers for downstream services: Use the circuit breaker pattern to prevent cascading failures when rate limits are exceeded for critical downstream services. Configure thresholds for error rates or response times, automatically stopping requests to overloaded services and gradually reopening the circuit as conditions improve.
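
A minimal, single-process circuit breaker sketch; the threshold and timeout values are placeholders you would tune:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold   # failures before the circuit opens
        self.recovery_timeout = recovery_timeout     # seconds before a trial call is allowed
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        # While open, fail fast until the recovery timeout has passed
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("Circuit open: downstream service unavailable")
            self.opened_at = None  # half-open: allow one trial call

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        else:
            self.failures = 0  # a success closes the circuit again
            return result
```
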
Employ request prioritization and queue management: Implement a priority queue system for incoming requests when approaching rate limits. Assign priority levels based on factors like client tier, request type, or business importance. Use algorithms like Weighted Fair Queuing (WFQ) to ensure that high-priority requests are processed first during high-load periods while preventing starvation of lower-priority requests.
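
Weighted Fair Queuing itself takes more machinery, but a simple priority queue conveys the idea. A minimal sketch using the standard library; the tier names and priority values are hypothetical:

```python
import heapq
import itertools

class PriorityRequestQueue:
    PRIORITY = {"enterprise": 0, "pro": 1, "free": 2}  # lower number is served first

    def __init__(self):
        self.heap = []
        self.counter = itertools.count()  # tie-breaker preserves FIFO order within a tier

    def enqueue(self, client_tier, request):
        priority = self.PRIORITY.get(client_tier, 2)
        heapq.heappush(self.heap, (priority, next(self.counter), request))

    def dequeue(self):
        if not self.heap:
            return None
        _, _, request = heapq.heappop(self.heap)
        return request
```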

Implement intelligent retry mechanisms: Design your API to provide guidance on retry behavior when rate limits are exceeded. Include a “Retry-After” header in rate limit error responses, indicating the time the client should wait before retrying. For more advanced implementations, consider using exponential backoff algorithms with jitter to prevent thundering herd problems when multiple clients attempt to retry simultaneously after a rate limit period expires.
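
On the client side, a minimal retry sketch with exponential backoff and jitter, assuming the requests package; the cap and base delay are placeholders, and the sketch assumes a numeric Retry-After value:

```python
import random
import time
import requests

def get_with_backoff(url, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response

        # Honor the server's Retry-After hint when present, otherwise back off exponentially
        retry_after = response.headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)   # assumes a numeric header, not an HTTP date
        else:
            delay = base_delay * (2 ** attempt)
        time.sleep(delay + random.uniform(0, 1))  # jitter avoids thundering herds

    return response
```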

Integration with API Gateways and Management Platforms

When implementing API throttling, integrating it with API gateways and management platforms is often more efficient than building it directly into your API. These platforms provide robust, configurable throttling capabilities that can be managed centrally. Let's explore how this integration works, focusing on the Edge Stack as an example.

A Kubernetes API gateway like Edge Stack offers built-in rate limiting functionality that can be easily configured and adjusted without modifying your core API code. This separation of concerns allows for more flexible and scalable throttling policies.

In Edge Stack, rate limiting is composed of two main parts:

  • RateLimitService Resource: This tells Edge Stack what external service to use for rate limiting. If the rate limit service is unavailable, Edge Stack can be configured to either allow the request (fail open) or block it (fail closed).
  • Labels: These are metadata attached to requests, used by the RateLimitService to determine which limits to apply. Labels can be set on individual Mappings or globally in the Edge Stack Module.

To configure throttling policies through Edge Stack, first define a RateLimitService:

```yaml
apiVersion: getambassador.io/v3alpha1
kind: RateLimitService
metadata:
  name: ratelimit
spec:
  service: "ratelimit-example:5000"
  protocol_version: v3
  domain: ambassador
  failure_mode_deny: true
```

This configuration specifies the rate limit service to use, the protocol version, the domain for labeling, and the failure mode. Then you configure mappings with rate limit labels:

```yaml
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: quote-backend
spec:
  prefix: /backend/
  service: quote
  labels:
    ambassador:
      - request_label_group:
          - x-ambassador-test-allow:
              header: x-ambassador-test-allow
              omit_if_not_present: true
```

This Mapping configuration adds a label group that will be used for rate limiting decisions.

The external rate limit service implements the actual rate limiting logic. Edge Stack sends gRPC shouldRateLimit requests to this service, which decides whether to allow or limit the request.

By integrating throttling with an API gateway like Edge Stack, you gain several advantages:

  • Centralized Management: Throttling policies can be managed in one place, separate from your API code.
  • Flexibility: Policies can be easily adjusted without redeploying your API.
  • Scalability: API gateways are designed to handle high throughput and can apply rate limiting efficiently.
  • Advanced Features: Many gateways offer additional features like gradual throttling, burst handling, and per-client limits.
