Harish Kotra (he/him)

Building a Hybrid Edge-Cloud AI Orchestrator with Agno and Ollama

In the era of massive LLMs, a common inefficiency persists: we send everything to the most powerful model, regardless of difficulty. Asking GPT-4o to "summarize this email" is like taking a Ferrari to pick up groceries. It works, but it’s expensive and overkill.

I built EdgeRouteAI to solve this. It’s an orchestration layer that dynamically routes prompts to a local small language model (SLM) or a cloud frontier model based on complexity.

The Architecture

The system is built on Agno, a lightweight agent orchestration framework. It follows a pipeline architecture:
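
1. TaskAnalyzerAgent scores the prompt's complexity in pure Python (no LLM call).
2. RoutingDecisionAgent picks local or cloud from that score plus device state.
3. A unified execution layer runs the prompt on Ollama or a cloud API through Agno agents.
4. ResponseValidatorAgent checks the output and falls back to the cloud when the local answer falls short.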

Latency-Free Analysis

The user query first hits a TaskAnalyzerAgent. This agent doesn't even use an LLM—it uses a deterministic Python-based heuristics engine (complexity.py) to score the prompt (1-10).

We avoided using an LLM to route an LLM, keeping latency near zero. Instead, we use regex and keyword scoring:

# heuristics/complexity.py
import re
from typing import Any, Dict

# Method of the analyzer class; estimate_tokens() and count_keywords()
# are small helpers on the same class, elided here.
def calculate_complexity(self, prompt: str) -> Dict[str, Any]:
    tokens = self.estimate_tokens(prompt)
    instruction_count = len(re.findall(r'[.!?\n]', prompt))
    keyword_count = self.count_keywords(prompt)

    score = 1

    # Heuristics: longer context and reasoning keywords raise complexity
    if tokens > 1000:
        score += 3
    elif tokens > 500:
        score += 2

    score += min(keyword_count * 2, 6)  # "design", "derive", "proof" add weight

    # Cap score at 10
    score = min(score, 10)

    return {
        "complexity_score": score,
        "estimated_tokens": tokens,  # consumed by the router below
        "instruction_count": instruction_count,
        "reasoning_depth": "high" if score >= 8 else "low",
    }
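
To make the scoring concrete, a trivial prompt stays at the bottom of the scale (the ComplexityAnalyzer class name is illustrative; the post only shows the method):

analyzer = ComplexityAnalyzer()  # illustrative name for the class holding the method above
result = analyzer.calculate_complexity("Summarize this email in one sentence.")
# roughly: {'complexity_score': 1, 'estimated_tokens': 8,
#           'instruction_count': 1, 'reasoning_depth': 'low'}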

Intelligent Routing

The RoutingDecisionAgent takes this score and checks the local environment, using psutil for available RAM and ollama list for locally pulled models.
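
The post doesn't show the device check itself; a minimal sketch, shelling out to the same ollama list command (the device/profile.py path is an assumption, but get_profile and the returned keys match the router code below):

# device/profile.py (sketch; assumes psutil is installed and ollama is on PATH)
import subprocess

import psutil

def get_profile() -> dict:
    """Return available RAM and the names of locally pulled Ollama models."""
    available_gb = psutil.virtual_memory().available / (1024 ** 3)
    # `ollama list` prints a header row, then one model per line (name first)
    out = subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout
    models = [line.split()[0] for line in out.splitlines()[1:] if line.strip()]
    return {"available_ram_gb": round(available_gb, 1), "local_models": models}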

If score <= threshold (default 6) and a suitable local model (like gemma3:12b) is available, it routes locally.
Otherwise, it routes to the cloud.

# agents/router.py
# Method of the RoutingDecisionAgent; self.policy holds the thresholds and
# self.device wraps the psutil / `ollama list` checks described above.
def decide(self, analysis: Dict[str, Any]) -> RoutingDecision:
    score = analysis.get("complexity_score", 0)
    device_info = self.device.get_profile()

    # Route to cloud if the task is too complex for a local SLM
    if score > self.policy["MAX_LOCAL_COMPLEXITY"]:
        return RoutingDecision(route="cloud", reason="Complexity too high")
    # Route to cloud if the context is large (local inference would be slow)
    if analysis["estimated_tokens"] > self.policy["MAX_LOCAL_TOKENS"]:
        return RoutingDecision(route="cloud", reason="Token limit exceeded")

    # Only route locally if a suitable model is actually pulled
    if "gemma3:12b" in device_info["local_models"]:
        return RoutingDecision(route="local", model="gemma3:12b")

    return RoutingDecision(route="cloud", reason="No local model found")
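
RoutingDecision itself isn't shown in the post; a minimal definition consistent with how decide() constructs it could look like this:

# agents/types.py (sketch, inferred from the calls above)
from dataclasses import dataclass
from typing import Optional

@dataclass
class RoutingDecision:
    route: str                    # "local" or "cloud"
    model: Optional[str] = None   # e.g. "gemma3:12b" when routing locally
    reason: Optional[str] = None  # why this route was chosen, for logging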

Unified Execution Layer

Using Agno, we treat both Local and Cloud endpoints as standard "Agents". This decouples the business logic from the inference provider.

Local Agent (Ollama):

# agents/local.py
from agno.agent import Agent
from agno.models.ollama import Ollama

agent = Agent(
    model=Ollama(id="gemma3:12b"),
    description="Local Edge Agent",
    markdown=True,
)
response = agent.run(prompt)

Cloud Agent (AIsa.one):

# agents/cloud.py
import os

from agno.agent import Agent
from agno.models.openai import OpenAIChat

agent = Agent(
    model=OpenAIChat(
        id="gpt-4o",
        api_key=os.getenv("AISA_API_KEY"),
        base_url="https://api.aisa.one/v1",
    ),
    description="Cloud AI Agent",
)
response = agent.run(prompt)
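
The orchestrator below calls these agents through local_exec and cloud_exec wrappers. The wrapper isn't shown in the post; a sketch of one with the execute() interface that main.py expects:

# agents/executor.py (sketch; the interface is inferred from main.py below)
from agno.agent import Agent

class AgentExecutor:
    def __init__(self, agent: Agent):
        self.agent = agent

    def execute(self, prompt: str) -> str:
        # Agno's run() returns a response object; .content is the model's text
        return self.agent.run(prompt).content

Wrapping both agents this way (local_exec = AgentExecutor(local_agent), cloud_exec = AgentExecutor(cloud_agent)) gives the orchestrator a single, provider-agnostic call.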

Validation & Recovery

Orchestration isn't just about sending requests; it's about ensuring success. The ResponseValidatorAgent checks the local output. If the response looks "lazy" (too short) or hedges with uncertainty, it triggers an automatic fallback to the cloud.
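
The post doesn't show the validator's internals; a minimal sketch of the kind of heuristics it could apply (the length threshold and uncertainty patterns here are assumptions):

# agents/validator.py (sketch; thresholds and patterns are illustrative)
import re
from dataclasses import dataclass

@dataclass
class ValidationResult:
    is_valid: bool
    reason: str = ""

UNCERTAIN = re.compile(r"i (?:don't|do not|can't|cannot) (?:know|answer|help)", re.I)

def validate(query: str, response: str) -> ValidationResult:
    # query is kept to match the orchestrator's call signature below
    if len(response.split()) < 10:    # "lazy": suspiciously short answer
        return ValidationResult(False, "Response too short")
    if UNCERTAIN.search(response):    # model hedged instead of answering
        return ValidationResult(False, "Model expressed uncertainty")
    return ValidationResult(True)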

# main.py (Orchestrator)
if route == "local":
    # 1. Try the local model first
    response = local_exec.execute(query)
    # 2. Validate the local output
    validation = validator.validate(query, response)
    # 3. Fall back to the cloud if the local answer is unusable
    if not validation.is_valid:
        logger.info(f"Local failed: {validation.reason}. Retrying on Cloud...")
        response = cloud_exec.execute(query)
else:
    response = cloud_exec.execute(query)

EdgeRouteAI demonstrates that "Hybrid AI" isn't future-tech—it's buildable today with open-source tools. By intelligently offloading 80% of tasks to the edge, we can build systems that are faster, cheaper, and more private, reserving the cloud for when it really counts.

Check out the source code here: https://github.com/harishkotra/edgerouteai
