In the era of massive LLMs, a common inefficiency persists: we send everything to the most powerful model, regardless of difficulty. Asking GPT-4o to "summarize this email" is like taking a Ferrari to pick up groceries. It works, but it’s expensive and overkill.
I built EdgeRouteAI to solve this. It’s an orchestration layer that dynamically routes prompts to a local small language model (SLM) or a cloud frontier model based on complexity.
The Architecture
The system is built on Agno, a lightweight agent orchestration framework, and follows a pipeline architecture: analyze the prompt, decide on a route, execute on the chosen endpoint, then validate the result (falling back to the cloud if needed).
Latency-Free Analysis
The user query first hits a TaskAnalyzerAgent. This agent doesn't use an LLM at all; it runs a deterministic Python heuristics engine (complexity.py) that scores the prompt from 1 to 10.
We deliberately avoided using an LLM to decide where to send LLM traffic, so the routing step adds near-zero latency. Instead, we rely on regex and keyword scoring:
# heuristics/complexity.py
import re
from typing import Any, Dict

# Method of the heuristics engine class; estimate_tokens() and
# count_keywords() are helpers defined elsewhere in the same class.
def calculate_complexity(self, prompt: str) -> Dict[str, Any]:
    tokens = self.estimate_tokens(prompt)
    instruction_count = len(re.findall(r'[.!?\n]', prompt))
    keyword_count = self.count_keywords(prompt)

    score = 1
    # Heuristics: higher complexity for long context or reasoning keywords
    if tokens > 1000:
        score += 3
    elif tokens > 500:
        score += 2
    score += min(keyword_count * 2, 6)  # "design", "derive", "proof" add weight

    # Cap the score at 10
    score = min(score, 10)

    return {
        "complexity_score": score,
        "estimated_tokens": tokens,  # consumed later by the router's token check
        "reasoning_depth": "high" if score >= 8 else "low",
    }
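To make the scoring concrete, here is an illustrative call; the analyzer instance, the prompt, and the resulting numbers are made up for the example rather than taken from the repo:

# Illustrative usage (hypothetical instance and values)
analysis = analyzer.calculate_complexity(
    "Design a sharded rate limiter and derive its worst-case latency."
)
# Two reasoning keywords ("design", "derive") add +4, and the prompt is far
# below the token thresholds, so the score lands around 5 and
# reasoning_depth stays "low": well inside local territory.
print(analysis)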
Intelligent Routing
The RoutingDecisionAgent takes this score and checks the local environment. It uses psutil and ollama list to check available RAM and running models.
If the score is at or below the threshold (default 6) and a suitable local model (like gemma3:12b) is installed, the request runs locally.
Otherwise, it goes to the cloud.
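The thresholds themselves live in a small routing policy. As a sketch, it might look like the dictionary below; only the complexity threshold of 6 comes from the actual default, and the token cap is an assumed value:

# Hypothetical routing policy; MAX_LOCAL_TOKENS is an assumed value
policy = {
    "MAX_LOCAL_COMPLEXITY": 6,   # default threshold mentioned above
    "MAX_LOCAL_TOKENS": 2000,    # assumption: keep very long prompts off Ollama
}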
# agents/router.py
from typing import Any, Dict

# RoutingDecision is a small result object (route, model, reason)
# defined elsewhere in the project.
def decide(self, analysis: Dict[str, Any]) -> RoutingDecision:
    score = analysis.get("complexity_score", 0)
    device_info = self.device.get_profile()

    # Default to the cloud if the task is too complex
    if score > self.policy["MAX_LOCAL_COMPLEXITY"]:
        return RoutingDecision(route="cloud", reason="Complexity too high")

    # Default to the cloud if the token limit is exceeded (Ollama might be slow)
    if analysis["estimated_tokens"] > self.policy["MAX_LOCAL_TOKENS"]:
        return RoutingDecision(route="cloud", reason="Token limit exceeded")

    # Only route locally if the model is actually available
    if "gemma3:12b" in device_info["local_models"]:
        return RoutingDecision(route="local", model="gemma3:12b")

    return RoutingDecision(route="cloud", reason="No local model found")
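The device profile the router consults comes from psutil and the Ollama CLI, as described above. Here is a minimal sketch of how a get_profile() helper could gather that information; the DeviceProfiler wrapper and its field names are illustrative assumptions, not the project's actual code:

# Hypothetical device profiler; class and field names are illustrative
import shutil
import subprocess
import psutil

class DeviceProfiler:
    def get_profile(self) -> dict:
        profile = {
            "available_ram_gb": psutil.virtual_memory().available / 1e9,
            "local_models": [],
        }
        # Ask the Ollama CLI which models are pulled locally
        if shutil.which("ollama"):
            result = subprocess.run(["ollama", "list"],
                                    capture_output=True, text=True)
            # Skip the header row; the first column is the model name
            for line in result.stdout.splitlines()[1:]:
                if line.strip():
                    profile["local_models"].append(line.split()[0])
        return profile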
Unified Execution Layer
Using Agno, we treat both Local and Cloud endpoints as standard "Agents". This decouples the business logic from the inference provider.
Local Agent (Ollama):
# agents/local.py
from agno.agent import Agent
from agno.models.ollama import Ollama

agent = Agent(
    model=Ollama(id="gemma3:12b"),
    description="Local Edge Agent",
    markdown=True,
)

response = agent.run(prompt)
Cloud Agent (AIsa.one):
# agents/cloud.py
import os

from agno.agent import Agent
from agno.models.openai import OpenAIChat

agent = Agent(
    model=OpenAIChat(
        id="gpt-4o",
        api_key=os.getenv("AISA_API_KEY"),
        base_url="https://api.aisa.one/v1",
    ),
    description="Cloud AI Agent",
)

response = agent.run(prompt)
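Because both endpoints expose the same Agent interface, the orchestrator can hide the choice behind a thin wrapper. The Executor class below is a sketch of that idea rather than the project's actual executor code; it assumes the result of Agno's run() exposes the generated text on .content:

# Hypothetical executor wrapper; names are illustrative
from agno.agent import Agent

class Executor:
    """Runs a prompt on whichever Agent it was constructed with."""

    def __init__(self, agent: Agent):
        self.agent = agent

    def execute(self, prompt: str) -> str:
        # Agno returns a run response object; the generated text is on .content
        return self.agent.run(prompt).content

local_exec = Executor(local_agent)    # the Ollama-backed Agent from agents/local.py
cloud_exec = Executor(cloud_agent)    # the AIsa.one-backed Agent from agents/cloud.py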
Validation & Recovery
Orchestration isn't just about sending requests; it's about making sure the response is actually usable. The ResponseValidatorAgent checks the local output, and if the response looks "lazy" (too short) or hedged with uncertainty, it triggers an automatic fallback to the cloud (see the validator sketch after the orchestrator snippet below).
# main.py (Orchestrator)

# 1. Try the local model first
if route == "local":
    response = local_exec.execute(query)

    # 2. Validate the local response
    validation = validator.validate(query, response)

    # 3. Fall back to the cloud if validation fails
    if not validation.is_valid:
        logger.info(f"Local failed: {validation.reason}. Retrying on Cloud...")
        response = cloud_exec.execute(query)
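What counts as "lazy" or uncertain is itself just a set of cheap checks. The following is a minimal sketch of the kind of heuristics the ResponseValidatorAgent could apply; the length threshold and the phrase list are assumptions, not the repo's actual rules:

# Hypothetical validation heuristics; thresholds and phrases are illustrative
from dataclasses import dataclass

UNCERTAIN_PHRASES = ("i'm not sure", "i cannot", "as an ai", "i don't know")

@dataclass
class ValidationResult:
    is_valid: bool
    reason: str = ""

def validate(query: str, response: str) -> ValidationResult:
    text = response.strip().lower()
    # "Lazy" response: suspiciously short for any real answer
    if len(text) < 40:
        return ValidationResult(False, "Response too short")
    # Uncertain response: the model is hedging instead of answering
    if any(phrase in text for phrase in UNCERTAIN_PHRASES):
        return ValidationResult(False, "Model expressed uncertainty")
    return ValidationResult(True)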
EdgeRouteAI demonstrates that "Hybrid AI" isn't future-tech—it's buildable today with open-source tools. By intelligently offloading 80% of tasks to the edge, we can build systems that are faster, cheaper, and more private, reserving the cloud for when it really counts.
Check out the source code here: https://github.com/harishkotra/edgerouteai