<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nilofer 🚀</title>
    <description>The latest articles on DEV Community by Nilofer 🚀 (@nilofer_tweets).</description>
    <link>https://dev.to/nilofer_tweets</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1137273%2Fac10d3a1-21d6-46e3-90d6-889213a616bd.jpg</url>
      <title>DEV Community: Nilofer 🚀</title>
      <link>https://dev.to/nilofer_tweets</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nilofer_tweets"/>
    <language>en</language>
    <item>
      <title>Low-Latency Model Router: Automatic LLM Selection Across OpenRouter</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Wed, 22 Apr 2026 14:41:20 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/low-latency-model-router-automatic-llm-selection-across-openrouter-2mjo</link>
      <guid>https://dev.to/nilofer_tweets/low-latency-model-router-automatic-llm-selection-across-openrouter-2mjo</guid>
      <description>&lt;p&gt;When calling an LLM API directly, the model selection is typically fixed ahead of time. In practice, this creates several limitations across different workloads.&lt;br&gt;
Latency varies depending on the model and request type. Lower-cost models may not maintain quality under certain conditions. External API failures require fallback handling. Repeated identical requests increase cost if caching is not applied.&lt;br&gt;
These constraints require a routing layer that can dynamically select models based on latency, cost, and quality, while also handling caching, fallback, and observability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What This Project Does&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This project implements a low-latency LLM router that dynamically selects the best model for each request based on latency, cost, and quality.&lt;/p&gt;

&lt;p&gt;Instead of sending every request to a fixed model, the router evaluates multiple candidates at runtime and routes each request to the most suitable option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0izxjy0yfb3lmq1cna4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0izxjy0yfb3lmq1cna4.png" alt=" " width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How the Scoring Engine Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The scoring engine is the component responsible for evaluating available models and selecting the most suitable one for each request.&lt;/p&gt;

&lt;p&gt;It assigns a score to each model based on latency, cost, and quality, and then selects the model with the highest score according to the defined priority.&lt;/p&gt;

&lt;p&gt;Every model in the catalogue is scored on three dimensions. A routing decision is a single pass over the catalogue to find the highest-scoring candidate given your weight preferences:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Score = w_latency * (1 - norm_latency)
      + w_cost    * (1 - norm_cost)
      + w_quality * quality_score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
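&lt;p&gt;As a sketch in Python (the names here are illustrative; the actual implementation lives in src/router/core.py), a min-max-normalised scoring pass over the catalogue might look like:&lt;/p&gt;

```python
# Illustrative sketch of the weighted scoring pass described above.
# Field names ("latency_ms", "cost_per_1k", "quality") are assumptions,
# not taken verbatim from the repository.

def normalize(value, values):
    """Scale value into [0, 1] relative to the candidate pool."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)

def score_models(catalogue, w_latency, w_cost, w_quality):
    """Return (score, model_name) pairs, best first."""
    latencies = [m["latency_ms"] for m in catalogue]
    costs = [m["cost_per_1k"] for m in catalogue]
    scored = []
    for m in catalogue:
        s = (w_latency * (1.0 - normalize(m["latency_ms"], latencies))
             + w_cost * (1.0 - normalize(m["cost_per_1k"], costs))
             + w_quality * m["quality"])  # quality assumed already in [0, 1]
        scored.append((s, m["name"]))
    return sorted(scored, reverse=True)
```

&lt;p&gt;A routing decision is then just the head of this ranked list; the tail doubles as the fallback order.&lt;/p&gt;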



&lt;p&gt;You control the weights via the priority field:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcc4fylh6r0huaxprrree.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcc4fylh6r0huaxprrree.png" alt=" " width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;
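&lt;p&gt;One plausible shape for that mapping, with illustrative preset values rather than the exact ones shipped in config.yaml:&lt;/p&gt;

```python
# Hypothetical priority-to-weights presets; the real values live in
# config.yaml and src/router/core.py and may differ.
PRIORITY_WEIGHTS = {
    "speed":    {"latency": 0.7, "cost": 0.1, "quality": 0.2},
    "cost":     {"latency": 0.1, "cost": 0.7, "quality": 0.2},
    "quality":  {"latency": 0.1, "cost": 0.2, "quality": 0.7},
    "balanced": {"latency": 0.4, "cost": 0.3, "quality": 0.3},
}

def weights_for(priority):
    """Resolve a request's priority field to scoring weights."""
    return PRIORITY_WEIGHTS.get(priority, PRIORITY_WEIGHTS["balanced"])
```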

&lt;p&gt;&lt;strong&gt;Model Catalogue&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps4x2pvou28k0qlclzl3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps4x2pvou28k0qlclzl3.png" alt=" " width="800" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the selected model fails, the router automatically retries with the next-best candidate. Identical requests are served from the cache (Redis, or an in-memory store if Redis is unavailable).&lt;/p&gt;
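&lt;p&gt;A minimal sketch of that cache-then-fallback loop (hypothetical names; the real logic lives in src/router/openrouter.py and src/router/cache.py):&lt;/p&gt;

```python
# Illustrative cache-then-fallback routing; names are assumptions.

def route_request(request_key, ranked_models, call_model, cache):
    """Serve from cache if possible, otherwise try models best-first."""
    cached = cache.get(request_key)
    if cached is not None:
        return cached, True  # (response, served_from_cache)
    last_error = None
    for model in ranked_models:
        try:
            response = call_model(model)
        except Exception as exc:  # e.g. provider outage or timeout
            last_error = exc
            continue  # fall back to the next-best candidate
        cache[request_key] = response
        return response, False
    raise RuntimeError("all candidate models failed") from last_error
```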

&lt;p&gt;&lt;strong&gt;Project Structure&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ml_project_0652/
├── src/
│   ├── models.py              # Pydantic schemas
│   ├── router/
│   │   ├── core.py            # Weighted scoring engine + model catalogue
│   │   ├── metrics.py         # Rolling-window metrics tracker
│   │   ├── openrouter.py      # Async OpenRouter API client
│   │   └── cache.py           # Redis cache + MockCache fallback
│   ├── api/
│   │   ├── main.py            # FastAPI app
│   │   └── routes.py          # Route definitions
│   └── cli/
│       └── commands.py        # Typer CLI
├── tests/                     # 29 unit + integration tests
├── start_router.py            # Server entry point
├── config.yaml                # Server, Redis, and routing config
├── .env.example               # Environment variable template
└── requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;REST API&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/route &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "priority": "balanced"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example response:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gen-abc123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"google/gemini-flash-1.5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Paris."&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"usage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"completion_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"routing_decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"selected_model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"google/gemini-flash-1.5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Best composite score based on latency, cost, and quality"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"latency_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;312.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cached"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Route with speed priority and latency cap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/route &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "messages": [{"role": "user", "content": "Translate: hello"}],
    "priority": "speed",
    "max_latency_ms": 700
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;All Endpoints&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd3t3clbvxa11l02c0f1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd3t3clbvxa11l02c0f1.png" alt=" " width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caching&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Identical requests are cached using a hash of the request.&lt;br&gt;
Redis configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost"&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6379&lt;/span&gt;
  &lt;span class="na"&gt;ttl_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Redis is unavailable, the system automatically falls back to in-memory caching.&lt;/p&gt;
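&lt;p&gt;One way such a request hash could be derived (a sketch; the actual key scheme in src/router/cache.py may include additional fields such as max_latency_ms):&lt;/p&gt;

```python
import hashlib
import json

def cache_key(messages, priority):
    """Derive a deterministic cache key from the request payload.

    Illustrative sketch: canonical JSON keeps key ordering stable so
    identical requests always map to the same key.
    """
    payload = json.dumps(
        {"messages": messages, "priority": priority},
        sort_keys=True,          # stable key ordering
        separators=(",", ":"),   # compact canonical form
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```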

&lt;p&gt;&lt;strong&gt;Fallback&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fallback models are defined in config.yaml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;routing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;fallback_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-3-haiku"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemini-flash-1.5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The system tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;average latency&lt;/li&gt;
&lt;li&gt;p95 latency&lt;/li&gt;
&lt;li&gt;p99 latency&lt;/li&gt;
&lt;li&gt;per-model usage&lt;/li&gt;
&lt;li&gt;cache hit rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Available via /metrics.&lt;/p&gt;
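&lt;p&gt;The rolling-window percentiles can be sketched as follows (illustrative; the tracker in src/router/metrics.py may differ):&lt;/p&gt;

```python
from collections import deque

class RollingLatency:
    """Track recent latencies and report average and percentiles.

    Illustrative sketch of a rolling-window tracker; the window size
    and percentile method are assumptions.
    """

    def __init__(self, window=1000):
        self.samples = deque(maxlen=window)  # oldest samples fall off

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def percentile(self, p):
        """Nearest-rank percentile (integer p) over the current window."""
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        idx = min(len(ordered) - 1, (p * len(ordered)) // 100)
        return ordered[idx]

    def summary(self):
        avg = sum(self.samples) / len(self.samples) if self.samples else None
        return {"avg": avg, "p95": self.percentile(95), "p99": self.percentile(99)}
```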

&lt;p&gt;&lt;strong&gt;CLI&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List models&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; src.cli.commands models

&lt;span class="c"&gt;# Preview routing decision&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; src.cli.commands route &lt;span class="s2"&gt;"What is 2+2?"&lt;/span&gt; &lt;span class="nt"&gt;--dry-run&lt;/span&gt;

&lt;span class="c"&gt;# Route with quality priority&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; src.cli.commands route &lt;span class="s2"&gt;"Summarize this article"&lt;/span&gt; &lt;span class="nt"&gt;--priority&lt;/span&gt; quality &lt;span class="nt"&gt;--dry-run&lt;/span&gt;

&lt;span class="c"&gt;# Route with latency cap&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; src.cli.commands route &lt;span class="s2"&gt;"Hello"&lt;/span&gt; &lt;span class="nt"&gt;--priority&lt;/span&gt; speed &lt;span class="nt"&gt;--max-latency&lt;/span&gt; 600 &lt;span class="nt"&gt;--dry-run&lt;/span&gt;

&lt;span class="c"&gt;# Live call&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; src.cli.commands route &lt;span class="s2"&gt;"What is 2+2?"&lt;/span&gt; &lt;span class="nt"&gt;--priority&lt;/span&gt; balanced

&lt;span class="c"&gt;# Benchmark&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; src.cli.commands benchmark &lt;span class="nt"&gt;--iterations&lt;/span&gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0"&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;

&lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost"&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6379&lt;/span&gt;
  &lt;span class="na"&gt;ttl_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;

&lt;span class="na"&gt;routing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;default_weights&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;latency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.4&lt;/span&gt;
    &lt;span class="na"&gt;cost&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
    &lt;span class="na"&gt;quality&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
  &lt;span class="na"&gt;fallback_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-3-haiku"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemini-flash-1.5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How I Built This Using NEO&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I used NEO AI Engineer to build this project by starting with a high-level description of the system requirements.&lt;/p&gt;

&lt;p&gt;FYI: NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The goal was to create a routing layer that can dynamically select LLMs based on latency, cost, and quality, while also supporting caching, fallback handling, and metrics tracking.&lt;/p&gt;

&lt;p&gt;I began by giving this task prompt to NEO:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Build a FastAPI LLM router that selects models from OpenRouter based on latency, cost, and quality. Include weighted scoring, fallback handling, caching, and metrics tracking.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From this prompt, NEO generated the initial project structure, including the API layer and routing logic.&lt;/p&gt;

&lt;p&gt;It then produced the core components required for the system, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model selection based on weighted scoring&lt;/li&gt;
&lt;li&gt;Request handling through the API layer&lt;/li&gt;
&lt;li&gt;Caching support with Redis and in-memory fallback&lt;/li&gt;
&lt;li&gt;Fallback handling for failed model calls&lt;/li&gt;
&lt;li&gt;Metrics tracking for latency and usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These pieces came together as a working router that could process requests, select models based on defined priorities, and return responses with routing decisions and metrics.&lt;/p&gt;

&lt;p&gt;This made it possible to move from a high-level idea to a functioning system without manually implementing each part of the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to Extend This Further with NEO&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the base system is in place, NEO can also be used to iterate on specific components.&lt;/p&gt;

&lt;p&gt;You can extend this project with more functionality such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adjusting scoring weights for different workloads&lt;/li&gt;
&lt;li&gt;Refining model selection strategies&lt;/li&gt;
&lt;li&gt;Modifying cache policies and TTL behavior&lt;/li&gt;
&lt;li&gt;Adding constraints such as latency limits or budget caps&lt;/li&gt;
&lt;li&gt;Integrating additional model providers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Running the Project&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/low-Latency-Model-Router
&lt;span class="nb"&gt;cd &lt;/span&gt;low-Latency-Model-Router
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
python start_router.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Ensure that you set your OpenRouter API key in .env before running the router.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Notes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This router implements a routing layer that dynamically selects models based on latency, cost, and quality while handling caching, fallback, and observability.&lt;/p&gt;

&lt;p&gt;It separates routing logic from model usage, allowing systems to adapt across different workloads without changing application logic.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/low-Latency-Model-Router" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/low-Latency-Model-Router&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Tue, 14 Apr 2026 15:36:01 +0000</pubDate>
      <link>https://dev.to/nilofer_tweets/-fgp</link>
      <guid>https://dev.to/nilofer_tweets/-fgp</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/gaurav_vij137/a-cli-tool-to-score-fine-tuning-dataset-quality-before-training-starts-23ng" class="crayons-story__hidden-navigation-link"&gt;A CLI tool to score fine-tuning dataset quality before training starts&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/gaurav_vij137" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F390876%2F1001f18f-15c5-4cb3-b792-3c4e81a1cc61.jpg" alt="gaurav_vij137 profile" class="crayons-avatar__image" width="400" height="400"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/gaurav_vij137" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Gaurav Vij
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Gaurav Vij
                
              
              &lt;div id="story-author-preview-content-3500478" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/gaurav_vij137" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F390876%2F1001f18f-15c5-4cb3-b792-3c4e81a1cc61.jpg" class="crayons-avatar__image" alt="" width="400" height="400"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Gaurav Vij&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/gaurav_vij137/a-cli-tool-to-score-fine-tuning-dataset-quality-before-training-starts-23ng" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 14&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/gaurav_vij137/a-cli-tool-to-score-fine-tuning-dataset-quality-before-training-starts-23ng" id="article-link-3500478"&gt;
          A CLI tool to score fine-tuning dataset quality before training starts
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/finetuning"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;finetuning&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/llm"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;llm&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/datascience"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;datascience&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/gaurav_vij137/a-cli-tool-to-score-fine-tuning-dataset-quality-before-training-starts-23ng" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/fire-f60e7a582391810302117f987b22a8ef04a2fe0df7e3258a5f49332df1cec71e.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;2&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/gaurav_vij137/a-cli-tool-to-score-fine-tuning-dataset-quality-before-training-starts-23ng#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            3 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
  </channel>
</rss>
