Nilofer 🚀
Low-Latency Model Router: Automatic LLM Selection Across OpenRouter

When calling an LLM API directly, the model is typically fixed ahead of time. In practice, this creates several limitations across different workloads:

  • Latency varies depending on the model and request type.
  • Lower-cost models may not maintain quality under certain conditions.
  • External API failures require fallback handling.
  • Repeated identical requests increase cost if caching is not applied.

These constraints motivate a routing layer that can dynamically select models based on latency, cost, and quality, while also handling caching, fallback, and observability.

What This Project Does

This project implements a low-latency LLM router that dynamically selects the best model for each request based on latency, cost, and quality.

Instead of sending every request to a fixed model, the router evaluates multiple candidates at runtime and routes each request to the most suitable option.

How the Scoring Engine Works

The scoring engine is the component that evaluates the available models and selects the most suitable one for each request. Every model in the catalogue is scored on three dimensions: latency, cost, and quality. A routing decision is a single pass over the catalogue to find the highest-scoring candidate given your weight preferences:

Score = w_latency * (1 - norm_latency)
      + w_cost    * (1 - norm_cost)
      + w_quality * quality_score
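The formula above can be sketched in Python. This is a minimal illustration of the single-pass selection; the actual field and function names in core.py may differ:

```python
def composite_score(norm_latency: float, norm_cost: float,
                    quality_score: float, weights: dict) -> float:
    """Weighted score: lower normalized latency/cost and higher quality win."""
    return (weights["latency"] * (1 - norm_latency)
            + weights["cost"] * (1 - norm_cost)
            + weights["quality"] * quality_score)

def pick_model(candidates: list[dict], weights: dict) -> dict:
    # Single pass over the catalogue: the highest composite score wins.
    return max(candidates, key=lambda m: composite_score(
        m["norm_latency"], m["norm_cost"], m["quality"], weights))
```

With a latency-heavy weighting, a fast-but-average model can outscore a slow-but-smart one, which is exactly the trade-off the router is making explicit.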

You control the weights via the priority field in the request.
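A preset-to-weights mapping might look like the sketch below. The "speed", "quality", and "balanced" names appear in the examples later in this post, and the balanced weights mirror the config.yaml defaults; the "cost" preset and the non-default weight values are assumptions for illustration:

```python
# Hypothetical priority presets mapping to scoring weights.
# "balanced" mirrors config.yaml's default_weights; the rest are assumed.
PRIORITY_WEIGHTS = {
    "balanced": {"latency": 0.4, "cost": 0.3, "quality": 0.3},
    "speed":    {"latency": 0.7, "cost": 0.2, "quality": 0.1},
    "cost":     {"latency": 0.2, "cost": 0.7, "quality": 0.1},
    "quality":  {"latency": 0.1, "cost": 0.2, "quality": 0.7},
}
```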

Model Catalogue

Each model in the catalogue carries the latency, cost, and quality attributes that the scoring engine reads.

If the selected model fails, the router automatically retries with the next-best candidate. Identical requests are served from the cache (Redis, or an in-memory store if Redis is unavailable).

Project Structure

ml_project_0652/
├── src/
│   ├── models.py              # Pydantic schemas
│   ├── router/
│   │   ├── core.py            # Weighted scoring engine + model catalogue
│   │   ├── metrics.py         # Rolling-window metrics tracker
│   │   ├── openrouter.py      # Async OpenRouter API client
│   │   └── cache.py           # Redis cache + MockCache fallback
│   ├── api/
│   │   ├── main.py            # FastAPI app
│   │   └── routes.py          # Route definitions
│   └── cli/
│       └── commands.py        # Typer CLI
├── tests/                     # 29 unit + integration tests
├── start_router.py            # Server entry point
├── config.yaml                # Server, Redis, and routing config
├── .env.example               # Environment variable template
└── requirements.txt

REST API

curl -X POST http://localhost:8000/route \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "priority": "balanced"
  }'

Example response:

{
  "id": "gen-abc123",
  "model": "google/gemini-flash-1.5",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Paris."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 3,
    "total_tokens": 17
  },
  "routing_decision": {
    "selected_model": "google/gemini-flash-1.5",
    "reason": "Best composite score based on latency, cost, and quality"
  },
  "latency_ms": 312.4,
  "cached": false
}

Route with speed priority and latency cap:

curl -X POST http://localhost:8000/route \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Translate: hello"}],
    "priority": "speed",
    "max_latency_ms": 700
  }'

All Endpoints

The endpoints shown in this post are:

  • POST /route: routes a chat request to the best-scoring model
  • GET /metrics: returns routing and latency metrics

Caching

Identical requests are cached using a hash of the request.
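A request hash can be computed along these lines (a sketch; the exact fields hashed and the key prefix in cache.py are assumptions):

```python
import hashlib
import json

def cache_key(messages: list[dict], priority: str) -> str:
    """Deterministic key: SHA-256 of the canonicalized request payload."""
    payload = json.dumps({"messages": messages, "priority": priority},
                         sort_keys=True, separators=(",", ":"))
    return "route:" + hashlib.sha256(payload.encode()).hexdigest()
```

Canonicalizing with sorted keys ensures that two requests with the same content always produce the same key, regardless of field ordering.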
Redis configuration:

redis:
  host: "localhost"
  port: 6379
  ttl_seconds: 3600

If Redis is unavailable, the system automatically falls back to in-memory caching.

Fallback

Fallback models are defined in config.yaml:

routing:
  fallback_models:
    - "openai/gpt-4o-mini"
    - "anthropic/claude-3-haiku"
    - "google/gemini-flash-1.5"
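The retry-with-next-best behavior reduces to walking the selected model plus the configured fallback list in order. A minimal sketch, assuming `call_model` is whatever function performs the upstream API call:

```python
def call_with_fallback(request, primary: str, fallbacks: list[str], call_model):
    """Try the selected model first, then each configured fallback in order."""
    last_error = None
    for model in [primary] + fallbacks:
        try:
            return model, call_model(model, request)
        except Exception as exc:  # e.g. network error, rate limit, 5xx
            last_error = exc
    raise RuntimeError("all models failed") from last_error
```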

Metrics

The system tracks:

  • average latency
  • p95 latency
  • p99 latency
  • per-model usage
  • cache hit rate

Available via /metrics.
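A rolling-window tracker for the latency percentiles can be sketched as follows. The window size and the nearest-rank percentile method are assumptions; metrics.py may use a different scheme:

```python
from collections import deque

class RollingLatency:
    """Rolling-window latency tracker; old samples fall off automatically."""

    def __init__(self, window: int = 1000):
        self.samples = deque(maxlen=window)

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def percentile(self, p: float) -> float:
        # Nearest-rank percentile over the current window.
        ordered = sorted(self.samples)
        idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[idx]
```

A bounded `deque` keeps memory constant and makes the reported p95/p99 reflect recent traffic rather than the whole process lifetime.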

CLI

# List models
python -m src.cli.commands models

# Preview routing decision
python -m src.cli.commands route "What is 2+2?" --dry-run

# Route with quality priority
python -m src.cli.commands route "Summarize this article" --priority quality --dry-run

# Route with latency cap
python -m src.cli.commands route "Hello" --priority speed --max-latency 600 --dry-run

# Live call
python -m src.cli.commands route "What is 2+2?" --priority balanced

# Benchmark
python -m src.cli.commands benchmark --iterations 10

Configuration

server:
  host: "0.0.0.0"
  port: 8000

redis:
  host: "localhost"
  port: 6379
  ttl_seconds: 3600

routing:
  default_weights:
    latency: 0.4
    cost: 0.3
    quality: 0.3
  fallback_models:
    - "openai/gpt-4o-mini"
    - "anthropic/claude-3-haiku"
    - "google/gemini-flash-1.5"

How I Built This Using NEO

I used NEO AI Engineer to build this project by starting with a high-level description of the system requirements.

FYI: NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.

The goal was to create a routing layer that can dynamically select LLMs based on latency, cost, and quality, while also supporting caching, fallback handling, and metrics tracking.

I began by giving this task prompt to NEO:

Build a FastAPI LLM router that selects models from OpenRouter based on latency, cost, and quality. Include weighted scoring, fallback handling, caching, and metrics tracking.

From this prompt, NEO generated the initial project structure, including the API layer and routing logic.

It then produced the core components required for the system, such as:

  • Model selection based on weighted scoring
  • Request handling through the API layer
  • Caching support with Redis and in-memory fallback
  • Fallback handling for failed model calls
  • Metrics tracking for latency and usage

These pieces came together as a working router that could process requests, select models based on defined priorities, and return responses with routing decisions and metrics.

This made it possible to move from a high-level idea to a functioning system without manually implementing each part of the pipeline.

How to Extend This Further with NEO

Once the base system is in place, NEO can also be used to iterate on specific components.

You can extend this project with more functionality such as:

  • Adjusting scoring weights for different workloads
  • Refining model selection strategies
  • Modifying cache policies and TTL behavior
  • Adding constraints such as latency limits or budget caps
  • Integrating additional model providers

Running the Project

git clone https://github.com/dakshjain-1616/low-Latency-Model-Router
cd low-Latency-Model-Router
pip install -r requirements.txt
cp .env.example .env
python start_router.py

Ensure that you set your OpenRouter API key in .env before running the router.

Final Notes

This router implements a routing layer that dynamically selects models based on latency, cost, and quality while handling caching, fallback, and observability.

It separates routing logic from model usage, allowing systems to adapt across different workloads without changing application logic.

The code is at https://github.com/dakshjain-1616/low-Latency-Model-Router
You can also build with NEO in your IDE using the VS Code extension or Cursor.
