Sitanshu Kumar

Posted on May 27

SynaptoRoute: A Study in Local Semantic Routing

#llm #architecture #ai #python

1. Introduction: The "Why"

Why this project exists

In modern agentic architectures, systems often rely on Large Language Models (LLMs) to make basic routing decisions (e.g., determining if a user is asking for a password reset, a refund, or general support). While effective, this approach introduces three significant bottlenecks:

High Latency: Calling an external API takes hundreds of milliseconds.
Token Costs: Paying per-token for simple classification is economically inefficient at scale.
Non-Determinism: LLMs can occasionally hallucinate or return improperly formatted JSON.

Semantic routing addresses this by converting the user's query into a vector embedding locally and comparing it, via cosine similarity, against a predefined set of intent embeddings. The resulting routing decision is fast, incurs no per-token cost, and is deterministic for a fixed model and threshold.
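To make the idea concrete, here is a minimal sketch of cosine-similarity routing. It is not SynaptoRoute's actual code: the three-dimensional vectors, the route function, and the 0.8 threshold are invented for illustration, and a real router would obtain its vectors from an embedding model.

```python
import numpy as np

def normalize(m: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

# Hypothetical intent vectors; real ones would come from an embedding model.
INTENTS = ["password_reset", "refund", "general_support"]
INTENT_VECS = normalize(np.array([
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.0, 1.0],
]))

def route(query_vec: np.ndarray, threshold: float = 0.8):
    """Return the best-matching intent, or None if nothing clears the threshold."""
    sims = INTENT_VECS @ normalize(query_vec)  # one cosine score per intent
    best = int(np.argmax(sims))
    return INTENTS[best] if sims[best] >= threshold else None
```

A query vector close to the first intent routes to password_reset, while a vector equally distant from all three falls below the threshold and returns None.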

Why we built SynaptoRoute

While exploring existing open-source solutions like Aurelio's semantic-router, we identified specific architectural bottlenecks. Existing routers often deep-copy their entire embedding array whenever a new route is added dynamically. Each addition therefore costs O(N) in the number of stored vectors, which makes live "hot-reloading" in production increasingly inefficient as the dataset grows. Furthermore, many existing solutions evaluate queries sequentially, failing to utilize the parallel processing power of GPUs.

Our goal was to learn if we could engineer a fundamentally better architecture: a router optimized explicitly for high-throughput concurrency and efficient dynamic memory management.

2. Architecture: The "How"

How we encode the text

We used the BAAI/bge-small-en-v1.5 model. To reduce inference cost, we explicitly opted for an INT8-quantized version of the model, run through the fastembed ONNX runtime. Reducing the numeric precision from 32-bit floats to 8-bit integers lowers the memory bandwidth requirements, allowing the CPU and GPU to process the tensors significantly faster with negligible accuracy loss.
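For intuition only, the sketch below applies a generic symmetric INT8 quantization to a random float32 matrix. This is an assumption-laden illustration, not the scheme the ONNX model actually uses: the quantize_int8 and dequantize helpers and the matrix shape are hypothetical.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization onto the int8 range [-127, 127]."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 values back to approximate float32 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.standard_normal((384, 384)).astype(np.float32)  # illustrative shape only
q, scale = quantize_int8(weights)

# Worst-case reconstruction error is about half a quantization step.
max_err = float(np.abs(dequantize(q, scale) - weights).max())
```

The int8 copy occupies a quarter of the bytes of the float32 original, which is where the reduced memory traffic comes from.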

How we manage memory (The Hot-Reload Problem)

Instead of deep-copying the entire vector array every time a user adds a new utterance, we implemented a lazy-compilation strategy.
New embeddings are appended to a lightweight Python list (amortized O(1) time complexity). We defer the expensive O(N) numpy.vstack reallocation until the very next incoming query. While this slightly delays that one request, it prevents the web server from blocking during live updates.
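The lazy-compilation idea can be sketched as follows. The LazyIndex class and its method names are hypothetical and only illustrate the pattern described above; the rebuilds counter exists purely to make the deferred cost visible.

```python
import numpy as np

class LazyIndex:
    """Hypothetical index that defers matrix rebuilds until the next query."""

    def __init__(self, dim: int):
        self._matrix = np.empty((0, dim), dtype=np.float32)
        self._pending = []   # newly added vectors, not yet compiled
        self.rebuilds = 0    # counts np.vstack calls, for illustration

    def add(self, vec) -> None:
        # Amortized O(1): a list append only, the big array is untouched.
        self._pending.append(np.asarray(vec, dtype=np.float32))

    def _compile(self) -> None:
        if self._pending:
            # The O(N) copy is paid once here, however many vectors were added.
            self._matrix = np.vstack([self._matrix, *self._pending])
            self._pending.clear()
            self.rebuilds += 1

    def scores(self, query: np.ndarray) -> np.ndarray:
        self._compile()
        return self._matrix @ query
```

Any number of add calls between two queries triggers a single np.vstack, and queries that arrive with nothing pending skip the rebuild entirely.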

How we achieve throughput (Dynamic Batching)

To fully utilize hardware acceleration, we realized that sending queries one-by-one is highly inefficient.
We introduced an asyncio.Queue and a background worker task. When a query arrives, it is dropped into the queue. The worker waits up to 5 milliseconds to collect up to 32 queries. It then passes the whole batch to the encoder and computes the cosine similarity for all queries in a single matrix multiplication.
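A minimal sketch of this batching loop is shown below. The 5 ms window and batch size of 32 come from the text; everything else (batch_worker, submit, and the encode_and_score stand-in) is assumed for illustration, and shutdown handling is deliberately left out.

```python
import asyncio

MAX_BATCH = 32       # from the text: up to 32 queries per batch
MAX_WAIT_S = 0.005   # from the text: wait up to 5 ms

def encode_and_score(batch):
    """Stand-in for the real encoder plus matrix multiply; one result per query."""
    return [f"route_for:{q}" for q in batch]

async def batch_worker(queue: asyncio.Queue) -> None:
    loop = asyncio.get_running_loop()
    while True:
        items = [await queue.get()]            # block until the first query arrives
        deadline = loop.time() + MAX_WAIT_S
        while len(items) < MAX_BATCH:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                items.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        results = encode_and_score([q for q, _ in items])  # one call per batch
        for (_, fut), res in zip(items, results):
            if not fut.done():
                fut.set_result(res)

async def submit(queue: asyncio.Queue, query: str) -> str:
    fut = asyncio.get_running_loop().create_future()
    await queue.put((query, fut))
    return await fut                           # resolved by the worker

async def demo():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue))
    results = await asyncio.gather(*(submit(queue, f"q{i}") for i in range(5)))
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return results
```

Each caller awaits its own future, so from the client's point of view the call looks like a normal single-query request even though the work is done in batches.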

API & Deployment (FastAPI)

To transition the engine from a Python library into a scalable microservice, we wrapped the AdaptiveRouter in a fully asynchronous FastAPI application. The FastAPI lifecycle hooks are tightly coupled to the router's asyncio batching worker, ensuring graceful startup and shutdown. The system is containerized via Docker, allowing developers to deploy a ready-to-use semantic routing REST API (/route, /routes) with a single command.
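The coupling between application lifecycle and worker can be sketched with a plain async context manager, which is the shape FastAPI's lifespan argument accepts. The Router class below is a hypothetical stand-in, not the AdaptiveRouter API, and yielding the router is done only so the demo can inspect it.

```python
import asyncio
from contextlib import asynccontextmanager

class Router:
    """Hypothetical stand-in for the router; only start and stop are sketched."""

    def __init__(self):
        self.worker = None

    async def _run(self):
        while True:
            await asyncio.sleep(0.001)   # placeholder for the batching loop

    async def start(self):
        self.worker = asyncio.create_task(self._run())

    async def stop(self):
        self.worker.cancel()
        await asyncio.gather(self.worker, return_exceptions=True)

@asynccontextmanager
async def lifespan(app):
    router = Router()
    await router.start()     # startup: the worker exists before requests arrive
    try:
        yield router
    finally:
        await router.stop()  # shutdown: the worker is cancelled and awaited

async def demo():
    async with lifespan(app=None) as router:
        running = not router.worker.done()
    return running, router.worker.cancelled()
```

Awaiting the cancelled task in stop is what makes the shutdown graceful: the application does not exit while the worker is still unwinding.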

How we optimize boundaries

Routing relies on a "similarity threshold" to decide if a query matches an intent. Hardcoding this threshold is brittle. We implemented an optimizer (fit_thresholds) that sweeps candidate thresholds against a labeled dataset and, for each individual route, keeps the cutoff that yields the highest F1-score.
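A threshold sweep of this kind can be sketched for a single route as follows. The helper names f1_at and fit_threshold, the candidate grid, and the toy scores are assumptions for illustration and do not reflect the real fit_thresholds signature.

```python
def f1_at(threshold, scores, labels):
    """F1 when every score >= threshold is predicted as a match."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def fit_threshold(scores, labels, candidates=None):
    """Return the (threshold, F1) pair with the best F1 on the labeled data."""
    if candidates is None:
        candidates = [i / 100 for i in range(0, 101)]  # 0.00 to 1.00 in steps of 0.01
    return max(((t, f1_at(t, scores, labels)) for t in candidates),
               key=lambda pair: pair[1])

# Toy data: similarity scores for one route and whether each query truly belongs to it.
scores = [0.91, 0.85, 0.40, 0.30]
labels = [True, True, False, False]
best_t, best_f1 = fit_threshold(scores, labels)
```

Running the same sweep once per route gives each route its own cutoff, which matters when some intents have tighter clusters of utterances than others.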

[Figure: System Diagram]

3. Architecture Iterations & Lessons Learned

This project was a continuous learning experience. Our initial implementations revealed severe structural flaws that we had to systematically engineer our way out of.

Iteration 1: Concurrency and Zombie Futures
When we first built the dynamic batching worker, we discovered that if the background task crashed or was cancelled during server shutdown, the queries waiting in the queue were abandoned. The asyncio.Future objects were never resolved, causing the client API requests to hang indefinitely.
The Solution: We learned to wrap asynchronous background workers in strict try/finally blocks to aggressively drain the queue and explicitly throw asyncio.CancelledError to all pending clients during a crash.
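The pattern looks roughly like the sketch below. It is an assumed reconstruction, not the project's code: the worker tracks the item it is currently processing, and on any exit path it cancels every unresolved future so that awaiting clients receive asyncio.CancelledError instead of hanging.

```python
import asyncio

async def worker(queue: asyncio.Queue) -> None:
    in_flight = None
    try:
        while True:
            in_flight = await queue.get()
            query, fut = in_flight
            await asyncio.sleep(3600)   # stand-in for slow work; never finishes here
            fut.set_result(query)
            in_flight = None
    finally:
        # Runs on crash and on cancellation: nothing may be left waiting forever.
        pending = [in_flight] if in_flight else []
        while not queue.empty():
            pending.append(queue.get_nowait())
        for _, fut in pending:
            if not fut.done():
                fut.cancel()            # awaiting clients see CancelledError

async def demo():
    queue = asyncio.Queue()
    loop = asyncio.get_running_loop()
    futures = [loop.create_future() for _ in range(3)]
    for i, fut in enumerate(futures):
        queue.put_nowait((f"q{i}", fut))
    task = asyncio.create_task(worker(queue))
    await asyncio.sleep(0.01)           # let the worker pick up the first item
    task.cancel()                       # simulate server shutdown
    await asyncio.gather(task, return_exceptions=True)
    return [fut.cancelled() for fut in futures]
```

Note that the item already taken off the queue needs the same treatment as the items still in it; draining the queue alone would leave one client hanging.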

Iteration 2: DDoS Vulnerability and Backpressure
Our initial asyncio.Queue was unbounded. We quickly realized that if the router was hit by a massive traffic spike, the queue would grow without bound until the server crashed from Out-of-Memory (OOM) errors.
The Solution: We applied a strict maxsize=10000 limit to the queue. By utilizing put_nowait(), the router instantly rejects overflow requests with a custom exception, providing vital backpressure so the web framework can gracefully return HTTP 429 Too Many Requests.
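A minimal sketch of this backpressure mechanism follows. The exception name RouterOverloadedError matches the one mentioned in the stress-test results; the enqueue helper and the tiny maxsize used in the demo are assumptions for illustration.

```python
import asyncio

class RouterOverloadedError(Exception):
    """Raised when the bounded queue is full."""

def enqueue(queue: asyncio.Queue, item) -> None:
    try:
        queue.put_nowait(item)   # never blocks; raises QueueFull at maxsize
    except asyncio.QueueFull:
        raise RouterOverloadedError("queue full, retry later") from None

async def demo():
    queue = asyncio.Queue(maxsize=2)   # the text uses maxsize=10000
    enqueue(queue, "a")
    enqueue(queue, "b")
    try:
        enqueue(queue, "c")            # third item exceeds the bound
    except RouterOverloadedError:
        return "rejected", queue.qsize()
    return "accepted", queue.qsize()
```

In the web layer, an exception handler for RouterOverloadedError is the natural place to produce the HTTP 429 response.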

Iteration 3: Stale Memory Leaks
When designing the hot-reload feature, we initially allowed users to overwrite existing routes. However, we forgot to garbage-collect the old vectors from the NumPy array. This caused memory bloat and allowed the router to incorrectly match against deleted data.
The Solution: We implemented a rigid memory-rebuild mechanism. If a route is overwritten, the router completely drops the in-memory array and safely rebuilds it from the SQLite database truth-source.
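The rebuild-from-database approach can be sketched with the standard sqlite3 module. The table schema, the save_route and rebuild helpers, and the three-dimensional vectors are all hypothetical; the point is only that overwriting deletes old rows first and the in-memory matrix is always derived from what the database holds.

```python
import sqlite3
import numpy as np

DIM = 3  # illustrative embedding size

def save_route(db, route, vectors):
    """Overwrite a route: delete its old rows first so no stale vectors survive."""
    with db:  # commits on success, rolls back on error
        db.execute("DELETE FROM utterances WHERE route = ?", (route,))
        db.executemany(
            "INSERT INTO utterances (route, vec) VALUES (?, ?)",
            [(route, np.asarray(v, dtype=np.float32).tobytes()) for v in vectors],
        )

def rebuild(db):
    """Discard any in-memory state and rebuild labels and matrix from the database."""
    rows = db.execute("SELECT route, vec FROM utterances ORDER BY id").fetchall()
    labels = [r for r, _ in rows]
    if not rows:
        return labels, np.empty((0, DIM), dtype=np.float32)
    matrix = np.stack([np.frombuffer(b, dtype=np.float32) for _, b in rows])
    return labels, matrix

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE utterances (id INTEGER PRIMARY KEY, route TEXT, vec BLOB)")
save_route(db, "refund", [[1, 0, 0], [0, 1, 0]])
save_route(db, "refund", [[0, 0, 1]])   # overwrite: the two old vectors are dropped
labels, matrix = rebuild(db)
```

After the overwrite, the rebuilt matrix contains only the single new vector, so deleted data can no longer be matched.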

4. Evaluation & Results

Hardware & Methodology

Standard Cloud CPU: GitHub Actions ubuntu-latest Runner (Standard 2-core VM)
Local GPU: NVIDIA GeForce RTX 3050 Laptop GPU (ONNX CUDAExecutionProvider)
Dataset: bitext/customer-support-intent-dataset (80% Train / 20% Val), plus synthetic Out-of-Domain (OOD) and typographical error injections.

Latency & Scalability

Through dynamic batching and quantization, the system achieves exceptional throughput on both standard cloud infrastructure and dedicated GPUs.

Metric	Cloud CPU (2-Core)	Local GPU (RTX 3050)	Context
Inference P99 (Batch=1)	3.94 ms	~14.11 ms	On standard cloud hardware, the quantized model delivers single-digit-millisecond P99 latency for sequential queries. At batch size 1 the GPU is slower than the CPU.
Amortized P50 (Batching)	2.69 ms	0.157 ms	Under heavy concurrent load (1,000 queries), dynamic batching brings the amortized per-query latency under 3 ms on a cloud CPU and to 157 microseconds on a GPU.
Hot-Reload Penalty	5.04 ms	~30.19 ms	We measured the tradeoff: deferring the O(N) np.vstack penalty to the next query costs about 5 ms on the cloud CPU, and route additions themselves do not block the server.

Classification Accuracy

Test Type	Score	Note
In-Domain Accuracy	100.0%	Flawless mapping of known user intents in our test set.
Out-of-Domain FPR	40.0%	A baseline limitation; requires significant negative-sample tuning in production.
Adversarial Accuracy	98.0%	Highly resilient to spelling errors and character injections compared with regex-based matching.

System Stability and Stress Testing

To validate production-readiness, the system was subjected to three stress testing scenarios:

Concurrency Limits (20,000 Concurrent Requests): The bounded internal queue (maxsize=10000) successfully managed an overload scenario. The system processed the first 10,000 queries and rejected the remaining 10,000 via RouterOverloadedError, preventing Out-of-Memory (OOM) failures with zero unhandled exceptions.
Memory Allocation Durability: The router processed 2,000 consecutive route additions and overwrites. Memory usage remained stable at a 0.32 MB peak allocation. This confirms that the O(1) NumPy mask replacement strategy resolved the memory degradation previously caused by np.vstack reallocation.
Edge-Case Input Handling: The pipeline was tested against empty strings, pure whitespace, 1-megabyte text payloads, unstructured noise, and extended Unicode characters. The ONNX runtime processed all inputs sequentially without raising critical exceptions or blocking the background worker task.

5. Unresolved Limitations

While we successfully hardened the router for local deployment, there are inherent limitations to this architecture that we chose not to solve, as they conflict with our goal of keeping the package lightweight and dependency-free.

Kubernetes Split-Brain (Cache Incoherency)
SynaptoRoute is fiercely stateful. If deployed across multiple Kubernetes pods behind a load balancer, an add_utterance request hitting Pod A will update Pod A's local NumPy matrix. Pod B will remain entirely unaware, resulting in split-brain routing logic across the cluster. Solving this would require integrating a Redis Pub/Sub event bus to broadcast memory invalidations. We explicitly opted against this to avoid heavy external dependencies.

6. Conclusion

By asking "why" semantic routers degrade in memory and "how" we could utilize GPU concurrency, we built a hardened, asynchronous routing engine. The journey required us to confront the realities of asynchronous Python, threading locks, and hardware transfer overheads. SynaptoRoute stands as an educational study in optimizing local AI infrastructure.
