In the era of massive LLMs, a common inefficiency persists: we send everything to the most powerful model, regardless of difficulty. Asking GPT-4o to "summarize this email" is like taking a Ferrari to pick up groceries. It works, but it’s expensive and overkill.
I built EdgeRouteAI to solve this. It’s an orchestration layer that dynamically routes prompts to a local small language model (SLM) or a cloud frontier model based on complexity.
The Architecture
The system is built on Agno, a lightweight agent orchestration framework, and follows a pipeline architecture: analyze the prompt, decide on a route, execute on the chosen endpoint, then validate the result (falling back to the cloud if needed).
Latency-Free Analysis
The user query first hits a TaskAnalyzerAgent. This agent doesn't use an LLM at all; it runs a deterministic Python heuristics engine (complexity.py) that scores the prompt from 1 to 10.
We deliberately avoided using an LLM to decide where to send LLM traffic, so the routing step adds near-zero latency. Instead, we rely on regex and keyword scoring:
# heuristics/complexity.py
import re
from typing import Any, Dict

# Method of the heuristics engine class; estimate_tokens() and
# count_keywords() are helpers defined elsewhere in the same class.
def calculate_complexity(self, prompt: str) -> Dict[str, Any]:
    tokens = self.estimate_tokens(prompt)
    instruction_count = len(re.findall(r'[.!?\n]', prompt))
    keyword_count = self.count_keywords(prompt)

    score = 1
    # Heuristics: higher complexity for long context or reasoning keywords
    if tokens > 1000:
        score += 3
    elif tokens > 500:
        score += 2
    score += min(keyword_count * 2, 6)  # "design", "derive", "proof" add weight

    # Cap the score at 10
    score = min(score, 10)

    return {
        "complexity_score": score,
        "estimated_tokens": tokens,  # consumed later by the router's token check
        "reasoning_depth": "high" if score >= 8 else "low",
    }
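To make the scoring concrete, here is an illustrative call; the analyzer instance, the prompt, and the resulting numbers are made up for the example rather than taken from the repo:

# Illustrative usage (hypothetical instance and values)
analysis = analyzer.calculate_complexity(
    "Design a sharded rate limiter and derive its worst-case latency."
)
# Two reasoning keywords ("design", "derive") add +4, and the prompt is far
# below the token thresholds, so the score lands around 5 and
# reasoning_depth stays "low": well inside local territory.
print(analysis)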
Intelligent Routing
The RoutingDecisionAgent takes this score and checks the local environment. It uses psutil and ollama list to check available RAM and running models.
If the score is at or below the threshold (default 6) and a suitable local model (like gemma3:12b) is installed, the request runs locally.
Otherwise, it goes to the cloud.
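The thresholds themselves live in a small routing policy. As a sketch, it might look like the dictionary below; only the complexity threshold of 6 comes from the actual default, and the token cap is an assumed value:

# Hypothetical routing policy; MAX_LOCAL_TOKENS is an assumed value
policy = {
    "MAX_LOCAL_COMPLEXITY": 6,   # default threshold mentioned above
    "MAX_LOCAL_TOKENS": 2000,    # assumption: keep very long prompts off Ollama
}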
# agents/router.py
from typing import Any, Dict

# RoutingDecision is a small result object (route, model, reason)
# defined elsewhere in the project.
def decide(self, analysis: Dict[str, Any]) -> RoutingDecision:
    score = analysis.get("complexity_score", 0)
    device_info = self.device.get_profile()

    # Default to the cloud if the task is too complex
    if score > self.policy["MAX_LOCAL_COMPLEXITY"]:
        return RoutingDecision(route="cloud", reason="Complexity too high")

    # Default to the cloud if the token limit is exceeded (Ollama might be slow)
    if analysis["estimated_tokens"] > self.policy["MAX_LOCAL_TOKENS"]:
        return RoutingDecision(route="cloud", reason="Token limit exceeded")

    # Only route locally if the model is actually available
    if "gemma3:12b" in device_info["local_models"]:
        return RoutingDecision(route="local", model="gemma3:12b")

    return RoutingDecision(route="cloud", reason="No local model found")
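The device profile the router consults comes from psutil and the Ollama CLI, as described above. Here is a minimal sketch of how a get_profile() helper could gather that information; the DeviceProfiler wrapper and its field names are illustrative assumptions, not the project's actual code:

# Hypothetical device profiler; class and field names are illustrative
import shutil
import subprocess
import psutil

class DeviceProfiler:
    def get_profile(self) -> dict:
        profile = {
            "available_ram_gb": psutil.virtual_memory().available / 1e9,
            "local_models": [],
        }
        # Ask the Ollama CLI which models are pulled locally
        if shutil.which("ollama"):
            result = subprocess.run(["ollama", "list"],
                                    capture_output=True, text=True)
            # Skip the header row; the first column is the model name
            for line in result.stdout.splitlines()[1:]:
                if line.strip():
                    profile["local_models"].append(line.split()[0])
        return profile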
Unified Execution Layer
Using Agno, we treat both Local and Cloud endpoints as standard "Agents". This decouples the business logic from the inference provider.
Local Agent (Ollama):
# agents/local.py
from agno.agent import Agent
from agno.models.ollama import Ollama

agent = Agent(
    model=Ollama(id="gemma3:12b"),
    description="Local Edge Agent",
    markdown=True,
)

response = agent.run(prompt)
Cloud Agent (AIsa.one):
# agents/cloud.py
import os

from agno.agent import Agent
from agno.models.openai import OpenAIChat

agent = Agent(
    model=OpenAIChat(
        id="gpt-4o",
        api_key=os.getenv("AISA_API_KEY"),
        base_url="https://api.aisa.one/v1",
    ),
    description="Cloud AI Agent",
)

response = agent.run(prompt)
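Because both endpoints expose the same Agent interface, the orchestrator can hide the choice behind a thin wrapper. The Executor class below is a sketch of that idea rather than the project's actual executor code; it assumes the result of Agno's run() exposes the generated text on .content:

# Hypothetical executor wrapper; names are illustrative
from agno.agent import Agent

class Executor:
    """Runs a prompt on whichever Agent it was constructed with."""

    def __init__(self, agent: Agent):
        self.agent = agent

    def execute(self, prompt: str) -> str:
        # Agno returns a run response object; the generated text is on .content
        return self.agent.run(prompt).content

local_exec = Executor(local_agent)    # the Ollama-backed Agent from agents/local.py
cloud_exec = Executor(cloud_agent)    # the AIsa.one-backed Agent from agents/cloud.py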
Validation & Recovery
Orchestration isn't just about sending requests; it's about making sure the response is actually usable. The ResponseValidatorAgent checks the local output, and if the response looks "lazy" (too short) or hedged with uncertainty, it triggers an automatic fallback to the cloud (see the validator sketch after the orchestrator snippet below).
# main.py (Orchestrator)

# 1. Try the local model first
if route == "local":
    response = local_exec.execute(query)

    # 2. Validate the local response
    validation = validator.validate(query, response)

    # 3. Fall back to the cloud if validation fails
    if not validation.is_valid:
        logger.info(f"Local failed: {validation.reason}. Retrying on Cloud...")
        response = cloud_exec.execute(query)
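What counts as "lazy" or uncertain is itself just a set of cheap checks. The following is a minimal sketch of the kind of heuristics the ResponseValidatorAgent could apply; the length threshold and the phrase list are assumptions, not the repo's actual rules:

# Hypothetical validation heuristics; thresholds and phrases are illustrative
from dataclasses import dataclass

UNCERTAIN_PHRASES = ("i'm not sure", "i cannot", "as an ai", "i don't know")

@dataclass
class ValidationResult:
    is_valid: bool
    reason: str = ""

def validate(query: str, response: str) -> ValidationResult:
    text = response.strip().lower()
    # "Lazy" response: suspiciously short for any real answer
    if len(text) < 40:
        return ValidationResult(False, "Response too short")
    # Uncertain response: the model is hedging instead of answering
    if any(phrase in text for phrase in UNCERTAIN_PHRASES):
        return ValidationResult(False, "Model expressed uncertainty")
    return ValidationResult(True)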
EdgeRouteAI demonstrates that "Hybrid AI" isn't future-tech—it's buildable today with open-source tools. By intelligently offloading 80% of tasks to the edge, we can build systems that are faster, cheaper, and more private, reserving the cloud for when it really counts.
Check out the source code here: https://github.com/harishkotra/edgerouteai