AYUSH SHARMA

Posted on May 19

How We Solved the Hidden Problem of Cheap LLMs

#agents #ai #architecture #llm

I Used CascadeFlow After My Cheap Model Got Confident

The first version of my comment-reply agent had a familiar failure mode: the cheapest model often sounded sure of itself even when it had not understood the relationship context. That was worse than a slow reply, because it produced responses that looked plausible until a human noticed they were generic, misprioritized, or quietly missing the point.

I did not want to solve that by sending every comment to the largest model. Most comments do not need it. A creator replying to “Nice post” should not pay the same inference cost as a creator answering a founder asking how to deploy AI agents for customer onboarding. So I built EchoEngage around a simple constraint: route comments cheaply by default, but make the routing decision visible, testable, and reversible.

That is where CascadeFlow became useful. I did not use it as decoration around the LLM call. I used it as part of the control plane for deciding which model should answer, when the answer should be checked, and when the system should escalate.

What EchoEngage Does

EchoEngage is a creator relationship memory system. It watches incoming social comments, remembers who each follower is, prioritizes the interaction, generates a suggested reply, and stores the new interaction back into memory.

The backend is a FastAPI service. The frontend is a React dashboard with a comment inbox, follower memory card, generated reply panel, and routing audit. The agent itself is a LangGraph state machine with six nodes:

recall_memory
  -> classify_comment
  -> route_model
  -> generate_reply
  -> quality_gate
  -> retain_memory

The memory layer uses Hindsight agent memory so each follower can have a separate long-lived memory bank. The routing layer uses CascadeFlow model routing to keep model selection observable instead of burying it in prompt folklore.

The important part is that the system is not just “generate a reply.” It is closer to a tiny workflow engine:

1. Fetch relationship history for the follower.
2. Classify the comment’s intent and complexity.
3. Choose a model based on that classification.
4. Generate a short, context-aware reply.
5. Run a quality gate.
6. Escalate once if the reply fails.
7. Store the new interaction back into memory.

That loop is where the interesting engineering tradeoff lives.
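To make the loop concrete, here is a minimal sketch in plain Python, without LangGraph. Everything in it is illustrative: `ReplyState`, `run_pipeline`, the callable parameters, and the placeholder model names are assumptions for this example, not EchoEngage's actual code.

```python
from dataclasses import dataclass, field

# Placeholder names for illustration, not real model identifiers.
CHEAP_MODEL = "cheap-model"
STRONG_MODEL = "strong-model"


@dataclass
class ReplyState:
    follower_id: str
    comment: str
    memories: str = ""
    model: str = CHEAP_MODEL
    reply: str = ""
    escalated: bool = False
    trace: list = field(default_factory=list)  # models tried, in order


def run_pipeline(state, recall, classify, route, generate, gate, retain):
    """Run the steps once, allowing at most one escalation."""
    state.memories = recall(state.follower_id, state.comment)
    intent, complexity = classify(state.comment, state.memories)
    state.model = route(complexity, intent)
    state.reply = generate(state.model, state.comment, state.memories)
    state.trace.append(state.model)

    # Quality gate: on failure, retry exactly once with the strong model.
    if not gate(state.reply, state.memories) and not state.escalated:
        state.model = STRONG_MODEL
        state.escalated = True
        state.reply = generate(state.model, state.comment, state.memories)
        state.trace.append(state.model)

    retain(state.follower_id, state.comment, state.reply)
    return state


# Stub callables stand in for the real nodes.
stored = []
state = run_pipeline(
    ReplyState("riya", "Any tool recommendations?"),
    recall=lambda follower_id, query: "- liked Notion AI",
    classify=lambda comment, memories: ("question", "medium"),
    route=lambda complexity, intent: CHEAP_MODEL,
    generate=lambda model, comment, memories: f"[{model}] reply",
    gate=lambda reply, memories: reply.startswith(f"[{STRONG_MODEL}]"),
    retain=lambda follower_id, comment, reply: stored.append((follower_id, reply)),
)
print(state.trace)  # ['cheap-model', 'strong-model']
```

The stub gate rejects anything the cheap model produced, so the trace shows one cheap attempt followed by one escalated retry.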

The Problem Was Not Cost. It Was Confidence.

Cheap models are not bad. In fact, for short social comments they are often the right default. The issue is that a cheap model can produce a fluent reply even when it missed the only thing that mattered.

For example, suppose Riya asks:

Any tool recommendations for my content creation workflow? I’ve been struggling with automating my social posts.

A generic answer might recommend scheduling tools and call it done. A better answer remembers that Riya previously asked about content research, liked Notion AI, responds well to tool-specific suggestions, and is building a solo creator workflow.

That memory changes the reply. It also changes the routing decision. A comment that looks simple by length can be important by relationship context.

The same issue appears with buying signals. When Sara writes that she is happy to pay for a personalized session, the system should not route that as “another question.” It should recognize the commercial intent and spend more care on the answer.

So the routing logic had to consider more than token count. It needed a small but explicit policy.

Routing Became a First-Class Object

The routing service encodes the policy in ordinary Python. I like that. It is not hidden in a prompt, not spread across UI state, and not dependent on reading logs after the fact.

def route_model(self, complexity: str, intent: str, has_memory: bool) -> dict:
    if complexity == "simple" and intent == "appreciation":
        model = CHEAP_MODEL
        reason = "Simple appreciation comment — using efficient model"
        complexity_score = 0.2
    elif complexity == "medium":
        model = CHEAP_MODEL
        reason = "Medium complexity — trying efficient model first, will escalate if quality fails"
        complexity_score = 0.5
    elif complexity == "complex":
        model = STRONG_MODEL
        reason = "Complex technical/business question — using strong model for quality"
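        complexity_score = 0.8
    else:
        # Fallback for combinations not covered above (for example, a simple
        # question). Assumed default: start cheap and let the quality gate
        # escalate. The 0.3 score is an illustrative value.
        model = CHEAP_MODEL
        reason = "Unmatched combination: defaulting to efficient model"
        complexity_score = 0.3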

    if intent == "buying_signal":
        model = STRONG_MODEL
        reason = "Buying signal detected — using strong model for quality response"
        complexity_score = 0.9

This is deliberately boring code. That is the point. Routing is product behavior, operational policy, and cost control all at once. It deserves to be reviewable in a pull request.
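One payoff of keeping the policy in plain code is that it can be pinned down with a table of cases. The sketch below is an assumption-laden illustration: `choose_model` is a hypothetical standalone function that mirrors the rules in the excerpt (with cheap as the assumed default), and the model names are placeholders.

```python
# Placeholder names for illustration, not real model identifiers.
CHEAP_MODEL = "cheap-model"
STRONG_MODEL = "strong-model"


def choose_model(complexity: str, intent: str) -> str:
    """Hypothetical pure-function version of the routing rules."""
    if intent == "buying_signal":
        return STRONG_MODEL  # commercial intent overrides complexity
    if complexity == "complex":
        return STRONG_MODEL
    return CHEAP_MODEL  # assumed default: cheap first, escalate on failure


# Each row: (complexity, intent, expected model).
ROUTING_CASES = [
    ("simple", "appreciation", CHEAP_MODEL),
    ("medium", "question", CHEAP_MODEL),
    ("complex", "question", STRONG_MODEL),
    ("simple", "buying_signal", STRONG_MODEL),
    ("medium", "buying_signal", STRONG_MODEL),
]


def check_routing_cases():
    """Return the rows where the policy disagrees with the expectation."""
    failures = []
    for complexity, intent, expected in ROUTING_CASES:
        actual = choose_model(complexity, intent)
        if actual != expected:
            failures.append((complexity, intent, expected, actual))
    return failures


print(check_routing_cases())  # [] when the policy matches the table
```

A policy change then shows up in a pull request as a diff to the table, which is easier to review than a diff to a prompt.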

CascadeFlow fits around this because the decision is recorded as data:

decision = {
    "model": model,
    "reason": reason,
    "complexity_score": complexity_score,
    "quality_gate_passed": True,
    "escalated": False,
    "estimated_cost": round(estimated_cost, 6),
    "baseline_cost": round(baseline_cost, 6),
    "latency_ms": round(latency, 2),
    "savings_percentage": round(
        max(0, (1 - estimated_cost / baseline_cost) * 100)
        if baseline_cost > 0 else 0,
        1
    )
}

This made the frontend routing audit possible. When a creator reviews a reply, they can also see why the system chose that model. For engineers, the more important benefit is that routing becomes debuggable. I can inspect whether too many comments are escalating, whether buying signals are being caught, and whether the cheap path is being overused.
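Because each decision is a plain dict, those questions reduce to a small aggregation. The following is a sketch under assumptions: `summarize_decisions` is a hypothetical helper, the cost figures are made up for illustration, and only the dict keys come from the decision record above.

```python
def summarize_decisions(decisions, cheap_model):
    """Aggregate routing receipts into a few health metrics."""
    total = len(decisions)
    if total == 0:
        return {"count": 0, "escalation_rate": 0.0,
                "cheap_share": 0.0, "savings_percentage": 0.0}

    escalated = sum(1 for d in decisions if d["escalated"])
    cheap = sum(1 for d in decisions if d["model"] == cheap_model)
    spent = sum(d["estimated_cost"] for d in decisions)
    baseline = sum(d["baseline_cost"] for d in decisions)
    savings = (1 - spent / baseline) * 100 if baseline > 0 else 0.0

    return {
        "count": total,
        "escalation_rate": escalated / total,
        "cheap_share": cheap / total,  # share of replies that stayed cheap
        "savings_percentage": round(max(0.0, savings), 1),
    }


# Illustrative records; the costs are invented numbers, not measurements.
sample = [
    {"model": "cheap-model", "escalated": False,
     "estimated_cost": 0.0002, "baseline_cost": 0.002},
    {"model": "strong-model", "escalated": True,
     "estimated_cost": 0.002, "baseline_cost": 0.002},
    {"model": "strong-model", "escalated": False,
     "estimated_cost": 0.002, "baseline_cost": 0.002},
    {"model": "cheap-model", "escalated": False,
     "estimated_cost": 0.0002, "baseline_cost": 0.002},
]
print(summarize_decisions(sample, cheap_model="cheap-model"))
```

A rising escalation rate suggests the classifier is under-rating comments; a cheap share near 100% with few escalations may mean the gate is too lenient.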

The CascadeFlow documentation frames this as routing and evaluation infrastructure. In this project, I found the most useful mental model was simpler: every model choice should leave a receipt.

Hindsight Made “Simple” Comments Less Simple

The memory layer complicated routing in a good way. EchoEngage gives each follower a memory bank, which means the same comment can mean different things from different people.

The memory service abstracts Hindsight behind recall and retain:

async def recall(self, follower_id: str, query: str, max_results: int = 10) -> str:
    bank_id = self._get_bank_id(follower_id)

    results = await self.hindsight_client.arecall(
        bank_id=bank_id,
        query=query,
        budget="mid",
        max_tokens=2048
    )

    if results and results.results:
        memories = [r.text for r in results.results[:max_results]]
        return "\n".join(f"- {m}" for m in memories)

    return "No previous memories found for this follower."

The implementation also supports local fallback storage, but the shape of the interface stays the same. The agent asks for memories relevant to the current comment. It does not need to know whether they came from a remote memory service or another backing store.
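A rough sketch of what that boundary could look like, assuming a `typing.Protocol` interface and an in-process fallback. `MemoryStore`, `LocalMemoryStore`, and the `retain` signature are hypothetical; the article does not show them, and the keyword matching here is a crude stand-in for real retrieval.

```python
import asyncio
from collections import defaultdict
from typing import Protocol


class MemoryStore(Protocol):
    """Interface the agent depends on, whatever the backing store."""

    async def recall(self, follower_id: str, query: str,
                     max_results: int = 10) -> str: ...

    async def retain(self, follower_id: str, content: str) -> None: ...


class LocalMemoryStore:
    """In-process fallback: one list of memories per follower."""

    def __init__(self):
        self._banks = defaultdict(list)

    async def retain(self, follower_id, content):
        self._banks[follower_id].append(content)

    async def recall(self, follower_id, query, max_results=10):
        # Naive word overlap instead of semantic search.
        terms = set(query.lower().split())
        matches = [m for m in self._banks[follower_id]
                   if terms & set(m.lower().split())]
        if not matches:
            return "No previous memories found for this follower."
        return "\n".join(f"- {m}" for m in matches[:max_results])


async def demo():
    store: MemoryStore = LocalMemoryStore()
    await store.retain("riya", "Liked Notion AI for content research")
    return await store.recall("riya", "notion alternatives")


print(asyncio.run(demo()))  # - Liked Notion AI for content research
```

Both stores return the same bulleted string shape, so the graph nodes never branch on which backend is active.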

I found Vectorize’s explanation of agent memory useful because it separates memory from chat history. EchoEngage does not need a raw transcript dump. It needs durable facts: interests, previous questions, sentiment, buying intent, and relationship context.

That distinction affects routing. A comment from a first-time follower can be answered cheaply. A similar comment from a high-value follower with a long history may deserve a more careful generation path, not because the text is hard, but because the relationship is.

The Quality Gate Is Where Routing Earns Its Keep

The most important design decision was not the initial route. It was allowing the system to be wrong once.

The LangGraph pipeline has a quality gate after generation. If the generated reply fails, the graph can loop back to generation with a stronger model. That made me more comfortable using the cheap model for medium-complexity comments.

builder.add_edge("route_model", "generate_reply")
builder.add_edge("generate_reply", "quality_gate")

builder.add_conditional_edges(
    "quality_gate",
    should_regenerate,
    {
        "regenerate": "generate_reply",
        "retain": "retain_memory"
    }
)
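The article does not show `should_regenerate`, so the following is only a guess at its shape. The state keys `quality_gate_passed` and `attempts` and the `MAX_ATTEMPTS` constant are assumptions; the return values match the keys of the edge mapping above.

```python
MAX_ATTEMPTS = 2  # assumed: one cheap attempt plus one escalated retry


def should_regenerate(state: dict) -> str:
    """Pick the next edge after the quality gate has run."""
    if state.get("quality_gate_passed", False):
        return "retain"
    if state.get("attempts", 0) >= MAX_ATTEMPTS:
        # Already retried once: keep the reply rather than loop forever.
        return "retain"
    return "regenerate"


print(should_regenerate({"quality_gate_passed": False, "attempts": 1}))
# regenerate
```

The attempt cap is what makes the loop safe: without it, a reply that can never pass the gate would cycle between generation and checking indefinitely.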

The gate itself asks a narrower question than the generator. It checks whether the reply is relevant, personalized when memory exists, short enough for social media, safe, and human-sounding.
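Some of those checks can be approximated with cheap deterministic rules before any model-based judgment runs. The sketch below is not the gate EchoEngage uses; the character limit, phrase list, and keyword heuristic are all assumptions, and it deliberately leaves out the safety check, which needs more than string matching.

```python
import re

MAX_REPLY_CHARS = 280  # assumed limit for a social reply
NO_MEMORY = "No previous memories found for this follower."
ROBOTIC_PHRASES = ("as an ai", "i hope this helps", "thank you for your comment")


def _keywords(text):
    """Lowercased words longer than three characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def quality_gate(reply, comment, memories):
    """Return a list of failure codes; an empty list means the reply passes."""
    failures = []
    if len(reply) > MAX_REPLY_CHARS:
        failures.append("too_long")
    if not (_keywords(reply) & _keywords(comment)):
        failures.append("off_topic")
    # Personalization is only required when memory actually exists.
    if memories != NO_MEMORY and not (_keywords(reply) & _keywords(memories)):
        failures.append("not_personalized")
    lowered = reply.lower()
    if any(phrase in lowered for phrase in ROBOTIC_PHRASES):
        failures.append("robotic")
    return failures


comment = "Any tool recommendations for my content creation workflow?"
memories = "- Liked Notion AI for content research"
print(quality_gate("Try a scheduling tool! I hope this helps.", comment, memories))
# ['not_personalized', 'robotic']
```

Returning failure codes rather than a boolean means the reason for an escalation can be stored alongside the routing decision.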

If the reply fails and the system has not already escalated, the routing service updates the decision:

def escalate(self, decision: dict) -> dict:
    decision["model"] = STRONG_MODEL
    decision["escalated"] = True
    decision["reason"] += " [ESCALATED: quality gate failed on cheaper model]"

    old_cost = decision["estimated_cost"]
    new_cost = self._estimate_cost(STRONG_MODEL, decision["complexity_score"])
    decision["estimated_cost"] = round(new_cost, 6)

    self.total_cost = self.total_cost - old_cost + new_cost
    return decision

This is not a guarantee of correctness. It is a pressure valve. It lets the system attempt the efficient path without pretending that the first answer is always good enough.

I prefer this to a static rule like “all medium comments use the strong model.” Static rules are easy to reason about but expensive in the wrong places. A quality loop gives the cheap model a chance while preserving an escalation path.
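The tradeoff can be put in numbers. When a cheap attempt fails the gate, the cheap call has still been paid for, so the expected cost of cheap-first is the cheap cost plus the failure rate times the strong cost. The sketch below uses invented prices and ignores the cost of running the gate itself; the function names are hypothetical.

```python
def expected_cheap_first_cost(cheap_cost, strong_cost, failure_rate):
    """Expected cost per comment when every comment starts on the cheap model."""
    return cheap_cost + failure_rate * strong_cost


def break_even_failure_rate(cheap_cost, strong_cost):
    """Gate failure rate above which always-strong becomes cheaper."""
    return 1 - cheap_cost / strong_cost


# Illustrative per-reply prices, not measurements.
cheap, strong = 0.0002, 0.002

print(round(expected_cheap_first_cost(cheap, strong, failure_rate=0.25), 6))
print(round(break_even_failure_rate(cheap, strong), 3))
```

With a tenfold price gap, cheap-first stays cheaper on average until roughly nine in ten cheap replies fail the gate, which is why the static rule tends to be expensive in the wrong places.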

A Concrete Interaction

Take three comments from the system.

The first is a low-stakes appreciation comment:

Just discovered your channel. Great content on AI tools!

The classifier marks it as simple appreciation. The router picks the cheap model. The reply can be short and warm. There is no reason to involve the stronger model unless the quality gate catches something odd.

The second is a repeat question:

Hey, I’m still confused about Zapier vs Make.com. Which one should I pick for my e-commerce store? I asked before but I’m still not sure.

Here, Hindsight matters. The system recalls that Priya has asked about this before and is building an e-commerce workflow. The reply should acknowledge the repeated confusion and give a concrete recommendation. Even if the initial route uses the efficient model, the quality gate should reject a vague answer.

The third is a buying signal:

We’re a 5-person startup and AI tools are becoming essential for us. Do you offer any consulting or personalized tool recommendations? Happy to pay for a session.

This routes directly to the stronger model because the intent is different. The business risk of a generic or careless answer is higher. The system should be helpful, specific, and not over-promise.

None of these examples require exotic agent behavior. They require memory, classification, routing, and a feedback loop that is explicit enough to debug.

What I Learned

1. Model routing should be boring code

I do not want routing policy hidden inside a long prompt. Prompts are useful for classification and generation, but model selection affects cost, latency, and user experience. It should be represented as data and ordinary control flow.

2. Cheap-first only works with a quality gate

Using a cheaper model first is not a strategy by itself. It becomes a strategy when the system has a way to inspect the output and escalate. Without that, cheap-first just means “hope the first model was good enough.”

3. Memory changes priority, not just wording

Hindsight improved personalization, but the larger effect was on prioritization. A short comment from a long-time follower is not equivalent to a short comment from a stranger. Memory belongs upstream of routing, not only inside the final reply prompt.

4. Every route needs an explanation

The routing audit was not just a UI feature. It forced me to store the reason, estimated cost, baseline cost, escalation flag, and latency. That made the system easier to reason about and easier to challenge.

5. Fallbacks are part of the architecture

The memory service keeps the same interface whether it uses Hindsight or local storage. That matters because the agent graph should not care about infrastructure details. A stable boundary around memory made the rest of the system simpler.

Closing Thought

I started with a model-cost problem and ended up with a confidence problem. The cheap model was not failing loudly. It was producing replies that looked acceptable until the missing memory or missed intent became obvious.

CascadeFlow helped because it made routing decisions visible. Hindsight helped because it made follower context durable. LangGraph helped because the workflow could loop when quality failed.

The lesson I took from building EchoEngage is that production agents need fewer magic tricks and more receipts. Recall the context. Make a route. Explain the route. Check the output. Store what changed. That loop is not glamorous, but it is the difference between a reply generator and a system I can actually operate.

Top comments (2)

Harjot Singh • May 31

The hidden problem with cheap LLMs is exactly the flip side of the "one expensive model" mistake - cheap models are great until they silently fail on the step that actually mattered, and a wrong-but-confident cheap answer can cost you more than the expensive call you skipped. The trap isn't using cheap models, it's using them without knowing where their competence cliff is.

The thing that makes cheap models safe in practice is a verification layer: let the cheap model do the work, but check the output (schema validation, a cheap critic pass, or escalate-on-low-confidence) so a bad cheap answer gets caught instead of shipped. Cheap + verified beats expensive + blind. That's the pattern in Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - cheap models do the bulk, but every consequential step is gated/verified before it counts, which is how you get ~$3-flat builds without the quality falling off a cliff. Really like that you're addressing the failure mode and not just the savings. Curious how you "solved" it - a verifier/critic pass, confidence-based escalation, or constrained output? The detection mechanism is the interesting part.

AYUSH SHARMA • May 19

The Problem Wasn't Cost — It Was Confidence: Building EchoEngage with CascadeFlow and Hindsight