Manychat Engineering for Manychat

Posted on Jun 16 • Originally published at Medium on Mar 5

How to Survive LLM Traffic Spikes in Python

#softwareengineering #ai #python

What it takes to route, rate-limit, and failover hundreds of LLM calls per second without breaking production.

At Manychat, we serve AI-powered automation to thousands of Instagram and messaging accounts. Behind that experience sits our Python AI Service — a layer between our product and multiple LLM providers, handling hundreds of LLM calls per second in production.

It works well. Until it doesn’t.

LLM calls don’t behave like traditional API requests. They take seconds, not milliseconds. They’re expensive. They come with strict rate limits. And when a single LLM provider goes down, your feature can go down with it.

Horizontal scaling doesn’t solve this. Adding more servers won’t lift provider limits or fix upstream outages. What you actually need is a control layer — one that decides where traffic goes, when to back off, and how to fail without taking the entire system down.

I’m Sergi Porta, Python Team Lead in the Manychat AI unit. In this article, I’ll walk through the LLM traffic routing architecture we use in our Python AI Service: the core gateway patterns for multi-provider LLM traffic, with a practical focus on failover logic, rate limiting, and monitoring.

The goal is simple: survive spikes of hundreds of LLM calls per second and provider outages without drama.

Python AI Service: Technical Stack

Before diving into routing, a quick look at the service itself. The Python AI Service is built with Python 3.13 and FastAPI, relying heavily on asyncio to handle long-running LLM calls and other I/O-heavy workloads.

We use SQLAlchemy and Alembic to manage configuration, metadata, and lifecycle state. Even though the service is focused on LLM traffic, it still needs to behave like any other production system: consistent, observable, and predictable.

Python AI Service architecture.

For reliability, we work with multiple providers (Azure and OpenAI), and we plan to add more as product needs grow. That gives us flexibility, but also complexity.

Each provider behaves differently. Latency varies. Rate limits differ. Availability patterns are not the same. At our scale, ignoring those differences is not an option.

We need to monitor each deployment in real time, route traffic dynamically based on capacity, and recover automatically from partial outages.

The solution was to introduce a dedicated abstraction layer — a routing layer that hides provider-specific complexity from the rest of the application.

LLMRouting: Turning Providers into a Resilient Pool

LLMRouting is our routing layer, built on top of LiteLLM’s Router. Here’s how it works. We define multiple backend deployments, each of which may contain replicas of the same LLM model. Instead of calling a specific provider directly, the agent sends the request to the routing layer and simply specifies the model it wants to use. What happens behind the scenes is abstracted away from the application.

The first core mechanism is weighted routing. Not all deployments are equal. Some run on provisioned throughput tiers. Others are pay-as-you-go. Some are cheaper. Some are faster. We assign each deployment a numeric weight, which determines how much traffic it receives. The higher the weight, the larger the share of requests.

Rate limits are inevitable at scale. When a deployment starts returning 429 responses during a traffic spike, the router doesn’t stubbornly retry the same endpoint. It shifts traffic to other healthy deployments in the pool.

If a deployment becomes fully unavailable, it enters a cooldown period. During that time, it is temporarily removed from rotation, and the remaining backends absorb the traffic.

This logic applies not only to deployments, but to entire providers. If Azure experiences an outage, traffic can be routed directly to OpenAI. Because the same model alias exists across providers, failover happens within the retry window. The result: even a full provider outage doesn’t immediately cause errors for users.

Now let’s look at the mechanisms that make this work in practice: weighted routing, rate-limit handling, cooldowns, and fallbacks.

Weighted Routing

Each model alias — for example, gpt-4o-mini — maps to multiple deployments across different providers. Every deployment has a numeric weight that determines its share of traffic.

In our current production setup, the primary Azure deployment carries a weight of 8 (about 73% of requests). A secondary Azure deployment carries a weight of 2 (roughly 18%). A direct OpenAI fallback has a weight of 1 (around 9%).

Here’s how that distribution looks conceptually:

The router uses a weighted random selection strategy (simple shuffle). Most requests are directed to provisioned-throughput tiers, while pay-as-you-go tiers remain warm and ready.
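As a rough illustration of the arithmetic and the selection step (not the production code, and not LiteLLM's internals), the sketch below computes the expected share for each weight and picks a deployment with a weighted random choice. The deployment names are hypothetical; only the weights 8, 2, and 1 come from the text above.

```python
import random
from collections import Counter

# Hypothetical deployment names; the weights are the ones quoted above.
WEIGHTS = {
    "azure-primary": 8,
    "azure-secondary": 2,
    "openai-direct": 1,
}


def expected_shares(weights: dict[str, int]) -> dict[str, float]:
    """Return each deployment's expected share of traffic as a fraction."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def pick_deployment(weights: dict[str, int], rng: random.Random) -> str:
    """Pick one deployment at random, proportionally to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    # Expected shares: 8/11, 2/11 and 1/11 of all requests.
    for name, share in expected_shares(WEIGHTS).items():
        print(f"{name}: {share:.1%}")

    # Observed shares over many picks should land close to the expected ones.
    rng = random.Random(42)
    sample = Counter(pick_deployment(WEIGHTS, rng) for _ in range(10_000))
    print(sample.most_common())
```

Because the weights are plain numbers in a mapping, rebalancing traffic only means changing those numbers, which is what makes a config-driven setup attractive.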

Traffic distribution isn’t hardcoded. It’s defined in YAML. That means we can rebalance weights or shift traffic across providers within seconds without deploying new code.

Handling Rate Limits

When a deployment returns a 429 rate-limit response, the router does not retry the same endpoint. Instead, it immediately selects another deployment from the pool and retries the request — up to four attempts in total (1 original and 3 retries). Because each model alias maps to multiple backends, a rate limit in one Azure region is usually resolved by routing the retry to another Azure deployment or directly to OpenAI.
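The retry behavior can be sketched in plain asyncio as follows. This is a simplified model under stated assumptions, not the LiteLLM implementation: `RateLimitError`, `Backend`, and `call_with_failover` are hypothetical names, and a backend is represented as any async callable. The only figure taken from the text is the limit of four attempts.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional

MAX_ATTEMPTS = 4  # 1 original call + 3 retries, as described above


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


# A "backend" here is just an async callable that takes a prompt.
Backend = Callable[[str], Awaitable[str]]


async def call_with_failover(
    prompt: str,
    backends: dict[str, Backend],
    weights: dict[str, int],
    rng: random.Random,
) -> tuple[str, list[str]]:
    """Try up to MAX_ATTEMPTS backends, never repeating a rate-limited one.

    Returns the response and the ordered list of backends that were tried.
    """
    tried: list[str] = []
    remaining = dict(weights)
    last_error: Optional[Exception] = None
    for _ in range(MAX_ATTEMPTS):
        if not remaining:
            break  # every backend in the pool has been rate limited
        names = list(remaining)
        name = rng.choices(names, weights=[remaining[n] for n in names], k=1)[0]
        tried.append(name)
        try:
            return await backends[name](prompt), tried
        except RateLimitError as exc:
            last_error = exc
            del remaining[name]  # do not retry the same endpoint
    raise RuntimeError(f"all attempts rate limited: {tried}") from last_error
```

Excluding a rate-limited backend from the next pick is the key design choice here: a retry against the same endpoint during a spike would most likely hit the same limit again.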

Every rate-limit event is tracked per backend through a custom Prometheus callback. Grafana dashboards make it immediately visible when a deployment is approaching its capacity ceiling. That visibility allows us to adjust routing weights proactively instead of reacting to outages after they cascade.

Cooldowns: Isolating Failing Deployments

Cooldowns prevent failing deployments from absorbing traffic they can’t serve. When a deployment crosses a failure threshold, the router removes it from the routing pool for a defined time window. During that period, only healthy deployments receive traffic.

After the cooldown window expires, the deployment is reintroduced into rotation. This isolation is critical during partial outages. Instead of spreading failures across all incoming requests, the system converges on healthy endpoints within seconds.
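A cooldown mechanism of this kind can be approximated with a small tracker like the one below. It is a sketch under assumptions: the class name, the threshold of 3 failures, and the 30-second window are hypothetical, and it counts consecutive failures for simplicity, which may differ from how the real router measures its failure threshold.

```python
import time
from typing import Callable


class CooldownTracker:
    """Remove a deployment from rotation after too many failures in a row."""

    def __init__(
        self,
        allowed_fails: int = 3,  # hypothetical threshold
        cooldown_seconds: float = 30.0,  # hypothetical window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.allowed_fails = allowed_fails
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._fail_counts: dict[str, int] = {}
        self._cooling_until: dict[str, float] = {}

    def record_failure(self, name: str) -> None:
        """Count a failure; start a cooldown once the threshold is exceeded."""
        self._fail_counts[name] = self._fail_counts.get(name, 0) + 1
        if self._fail_counts[name] > self.allowed_fails:
            self._cooling_until[name] = self._clock() + self.cooldown_seconds
            self._fail_counts[name] = 0

    def record_success(self, name: str) -> None:
        """A success resets the failure count for that deployment."""
        self._fail_counts[name] = 0

    def is_available(self, name: str) -> bool:
        """True unless the deployment is inside its cooldown window."""
        until = self._cooling_until.get(name)
        if until is None:
            return True
        if self._clock() >= until:
            del self._cooling_until[name]  # window expired: back in rotation
            return True
        return False

    def healthy(self, names: list[str]) -> list[str]:
        """Filter a pool down to the deployments that may receive traffic."""
        return [n for n in names if self.is_available(n)]
```

The clock is injectable so the behavior can be tested without waiting for real time to pass.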

Fallbacks Across Deployments and Providers

Fallbacks operate at both the routing and application levels.

At the routing layer, if retries on a primary deployment are exhausted, traffic shifts to the remaining tiers, including cross-provider fallbacks. Because the same model alias exists across providers, even a full regional outage does not require manual intervention. The router reroutes traffic within the retry window.

At the application level, an additional safety net handles edge cases such as empty-content responses. Before surfacing an error to the user, the service retries the entire call. In practice, this means that even during provider-side instability, traffic can be rerouted in under a second without visible degradation for the end user.
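The application-level safety net for empty responses might look roughly like this. The function and exception names are hypothetical, and the default of two attempts is an assumption; the text only says that the whole call is retried before an error is surfaced.

```python
import asyncio
from typing import Awaitable, Callable


class EmptyResponseError(Exception):
    """Hypothetical error raised when every attempt returned empty content."""


async def complete_with_empty_retry(
    call: Callable[[], Awaitable[str]],
    max_attempts: int = 2,  # assumption: one original call plus one retry
) -> str:
    """Re-run the whole call if the model returns empty or blank content."""
    for _ in range(max_attempts):
        content = await call()
        if content and content.strip():
            return content
    raise EmptyResponseError(f"empty content after {max_attempts} attempts")
```

In a setup like the one described, `call` would wrap the full routed request, so each retry goes through weighted selection again and may land on a different deployment.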

Monitoring and Observability: Seeing Problems Before Users Do

Routing and failover logic are only as good as your visibility into them. Our observability stack relies on two core systems: Prometheus and Grafana for real-time metrics and alerting, and OpenTelemetry for distributed tracing across the full request lifecycle.

Prometheus Metrics and Grafana Dashboards

Every LLM call passing through the router is instrumented via a custom Prometheus callback.

We record high-granularity metrics at both the model and backend levels — enough detail to understand not just that something is wrong, but where.

Model-level metrics include total call counts, latency distributions, token usage (prompt and completion), and error rates. Metrics are labeled by model alias and agent name, allowing us to isolate the performance of specific features, for example, intent detection versus flow generation.

Backend-level metrics record which deployment handled each request and categorize the outcome into a controlled set: success, timeout, rate_limit, api_error, and other. Keeping this taxonomy small helps maintain manageable Prometheus cardinality while still providing enough signal to diagnose routing behavior.
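To make the outcome taxonomy concrete, here is a minimal sketch of the classification step. The exception classes are hypothetical stand-ins for provider errors, and an in-memory `Counter` stands in for a Prometheus counter labeled by backend and outcome; the five outcome names are the ones listed above.

```python
import asyncio
from collections import Counter
from typing import Optional

OUTCOMES = ("success", "timeout", "rate_limit", "api_error", "other")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


class ApiError(Exception):
    """Hypothetical stand-in for other provider-side API errors."""


def classify_outcome(error: Optional[BaseException]) -> str:
    """Map a call result onto the fixed outcome taxonomy."""
    if error is None:
        return "success"
    if isinstance(error, (TimeoutError, asyncio.TimeoutError)):
        return "timeout"
    if isinstance(error, RateLimitError):
        return "rate_limit"
    if isinstance(error, ApiError):
        return "api_error"
    return "other"  # anything unexpected stays in a single bucket


# In-memory stand-in for a counter labeled by (backend, outcome).
backend_calls: Counter = Counter()


def record_call(backend: str, error: Optional[BaseException] = None) -> str:
    """Classify one finished call and increment the matching counter."""
    outcome = classify_outcome(error)
    backend_calls[(backend, outcome)] += 1
    return outcome
```

With N backends, this scheme produces at most 5 × N label combinations, which is why a small fixed taxonomy keeps cardinality under control.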

LLM provider metrics dashboard: weights, errors, and latency.

These metrics feed into dedicated Grafana dashboards that help us answer four critical questions:

1. Is traffic distributed as expected?

We verify that routing weights are respected and detect unexpected shifts caused by cooldowns or failovers.

2. Is latency degrading anywhere?

P50, P95, and P99 histograms are broken down by backend to surface provider-specific slowdowns.

3. Are errors isolated or systemic?

Outcome breakdowns show whether failures are limited to a single deployment or spreading across the pool.

4. Is cost drifting?

Token counters per model and agent help detect prompt regressions and unexpected usage spikes.

Call counts and latency dashboard.

Errors dashboard.

Cost analysis dashboard.

Alerting System

Dashboards are useful for investigation. Alerts are what trigger action.

Slack notifications fire under two primary conditions:

P95 latency thresholds. Alerts trigger when latency exceeds defined limits (typically between 3.5 and 5 seconds, depending on the model). This helps catch provider slowdowns before users feel them.

Error rate breaches. An error rate above a certain threshold triggers an immediate notification. At our traffic level, that’s not a minor glitch — it’s a strong signal of an outage or misconfiguration that requires attention.
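The two alert conditions reduce to simple threshold checks over an evaluation window. In practice these would normally be written as alert rules in the monitoring stack; the Python sketch below only illustrates the logic. The function names are hypothetical, and the thresholds are parameters because the actual error-rate limit is not stated here.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_alerts(
    latencies_s: Sequence[float],
    error_count: int,
    p95_limit_s: float,
    error_rate_limit: float,
) -> list[str]:
    """Return the alert messages that should fire for one evaluation window."""
    alerts: list[str] = []
    p95 = percentile(latencies_s, 95)
    if p95 > p95_limit_s:
        alerts.append(f"P95 latency {p95:.2f}s exceeds {p95_limit_s:.2f}s")
    error_rate = error_count / len(latencies_s)
    if error_rate > error_rate_limit:
        alerts.append(
            f"error rate {error_rate:.1%} exceeds {error_rate_limit:.1%}"
        )
    return alerts
```

Alerting on P95 rather than the mean matters for LLM traffic: a handful of very slow completions can hide behind a healthy-looking average.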

Monitoring the Asyncio Event Loop

For high-throughput asynchronous services, the health of the Python event loop is monitored via two dedicated metrics:

Event loop delay. This measures the gap between expected and actual asyncio.sleep intervals. Spikes above 1 ms usually indicate CPU-bound work or blocking calls that are starving the loop and increasing LLM response latency.

Active task count. Tracking the number of running tasks helps detect backpressure caused by slow upstream responses or sudden spikes in concurrency.
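Both metrics can be sampled with a few lines of asyncio. The sketch below is illustrative rather than the production probe: the function names and the 10 ms probe interval are assumptions, and the `demo` coroutine deliberately blocks the loop with `time.sleep` to show how the delay shows up.

```python
import asyncio
import time


async def measure_loop_delay(interval_s: float = 0.01) -> float:
    """Sleep for interval_s and return how late the wake-up was, in seconds."""
    start = time.perf_counter()
    await asyncio.sleep(interval_s)
    return max(0.0, time.perf_counter() - start - interval_s)


def active_task_count() -> int:
    """Number of unfinished tasks on the running event loop."""
    return len(asyncio.all_tasks())


async def demo() -> tuple[float, float, int]:
    """Compare loop delay on an idle loop and on a loop with a blocking call."""
    idle_delay = await measure_loop_delay()

    async def blocking_work() -> None:
        time.sleep(0.05)  # deliberately blocks the loop (simulated bad call)

    probe = asyncio.create_task(measure_loop_delay())
    await asyncio.sleep(0)  # let the probe start its sleep first
    blocker = asyncio.create_task(blocking_work())
    tasks_seen = active_task_count()  # main task + probe + blocker
    await asyncio.gather(probe, blocker)
    return idle_delay, probe.result(), tasks_seen


if __name__ == "__main__":
    idle, blocked, tasks = asyncio.run(demo())
    print(f"idle delay: {idle * 1000:.2f} ms")
    print(f"delay with blocking call: {blocked * 1000:.2f} ms")
    print(f"active tasks: {tasks}")
```

In a real service the probe would run in a background task on a fixed schedule and export its readings as a gauge or histogram.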

Distributed Tracing with OpenTelemetry

While metrics provide high-level status, OpenTelemetry provides the context needed for deep investigation. The service automatically instruments three layers: FastAPI (HTTP requests), OpenAI API calls, and SQLAlchemy (database queries).

Each trace spans the full lifecycle of a request — from the initial HTTP call through intent detection, embedding generation, database lookups, and final LLM completion. We propagate custom attributes via OpenTelemetry baggage to preserve business context:

manychat.account_id: links spans to specific customer accounts.
manychat.session_id: associates spans with unique automation sessions.

These attributes let us pinpoint whether a bottleneck originated in an LLM backend, an embedding call, or a database query for a specific request. Traces are exported via OTLP gRPC and stored in S3 for long-term analysis.
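The propagation idea can be illustrated without any tracing dependency. The sketch below uses Python's standard `contextvars` module as a stand-in for OpenTelemetry baggage, so it is not the OpenTelemetry API: `set_baggage`, `get_baggage`, and `span_attributes` are hypothetical helpers that only mimic the behavior. The two attribute keys are the ones listed above.

```python
import asyncio
import contextvars
from typing import Optional

# Stand-in for OpenTelemetry baggage: key/value pairs bound to the context.
_baggage = contextvars.ContextVar("baggage", default={})


def set_baggage(key: str, value: str) -> None:
    """Store a new copy of the baggage with the given key set."""
    current = dict(_baggage.get())
    current[key] = value
    _baggage.set(current)


def get_baggage(key: str) -> Optional[str]:
    """Read one baggage entry from the current context, if present."""
    return _baggage.get().get(key)


def span_attributes(name: str) -> dict[str, str]:
    """Attributes a span would carry: its name plus the current baggage."""
    return {"span.name": name, **_baggage.get()}


async def llm_completion() -> dict[str, str]:
    return span_attributes("llm.completion")


async def db_lookup() -> dict[str, str]:
    return span_attributes("db.lookup")


async def handle_request(account_id: str, session_id: str) -> list[dict[str, str]]:
    """Set business context once; downstream calls inherit it automatically."""
    set_baggage("manychat.account_id", account_id)
    set_baggage("manychat.session_id", session_id)
    # Child tasks copy the current context, so they see the same baggage.
    return list(await asyncio.gather(llm_completion(), db_lookup()))
```

Each request runs in its own task with its own copy of the context, so attributes set for one account never leak into spans that belong to another.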

Observability Checklist for Production LLM Routing

The following points define the requirements for the production LLM routing layer:

Backend call counts and outcomes to verify routing weights and failover activation.
Latency histograms by deployment to isolate provider-side slowdowns.
Error classification to distinguish between expected rate limits and critical authentication or timeout issues.
Token usage tracking for cost management and identifying prompt regressions.
Event loop health monitoring to detect blocking calls or task backpressure.
Distributed traces with business context to correlate LLM performance with database and embedding latency.
Threshold-based alerts for P95 latency and error rate breaches.

What’s Next for the Python AI Service and LLM Routing?

The LLMRouting-based architecture is solid, and we’re evolving it further.

Next, we’re extending the routing layer with RPM/TPM-aware selection, latency-based routing, and cost optimization, so the system can automatically prefer the most efficient deployment available in real time.

Rate-limit handling will also evolve. We’re introducing exponential backoff and more granular retry policies per error type.

Cooldowns will become more refined. Instead of a single threshold, we’ll define explicit “allowed failure” counts and tailor cooldown durations to the type of error, distinguishing between expected rate-limit spikes and critical authentication failures.

Fallback logic will also be extended to span multiple model groups. For example, falling back from gpt-4o-mini to gpt-4o, or dynamically selecting models based on context window size and content policies.

We’re also considering integrating other providers in the future, such as Anthropic, Gemini, and potentially self-hosted models.

One hundred LLM calls per second once felt ambitious. Now we’re preparing for thousands. But that’s a story for another article.

DEV Community